
Smart Stylist: A Fashion Recommender System powered by Computer Vision


The live demo of this project is available below:


Introduction

The objective of this project is to build a content-based fashion recommender system that uses vector embeddings of images for ranking and retrieval. When shopping for clothing online, the appearance of a garment is often an important factor in the decision-making process.


With the abundance of visual data online, visual search has become an increasingly important way to find information that is difficult to communicate through text.


How It Works


I developed this project through a three-phase approach:

  • First, the data collection stage prepares the data used to train the machine learning models.

  • Next, the machine learning (ML) modeling stage trains models to detect fashion items within an image and to create embeddings of the detected items. These embeddings are learned so that visually similar images lie closer together than dissimilar images.

  • The last phase, the serving stage, uses the trained models to crop fashion objects from the query image and uses the visual embeddings to retrieve the most closely related product candidates (sketched below).
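
As a rough illustration of this serving flow, the sketch below detects fashion objects in a query image, embeds each crop, and ranks index entries by cosine similarity. The `detector` and `embed` callables and the index arrays are illustrative stand-ins for the components described in the sections that follow, not the exact implementation.

```python
import numpy as np

def recommend(query_image, detector, embed, index_vectors, index_paths, top_k=5):
    """Serving flow sketch: detect fashion objects, embed each crop, and rank
    index entries by cosine similarity. All arguments are illustrative
    stand-ins for the project's actual components."""
    results = []
    for crop in detector(query_image):              # cropped fashion objects
        q = embed(crop).astype(np.float32)          # e.g. a 512-d embedding
        q /= np.linalg.norm(q)
        sims = index_vectors @ q                    # index rows are L2-normalized
        top = np.argsort(-sims)[:top_k]
        results.append([(index_paths[i], float(sims[i])) for i in top])
    return results
```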


Dataset

The data used to train the models comes from the publicly available dataset provided here. It comprises style board images, each accompanied by complete bounding boxes serving as ground-truth labels for individual fashion objects.


For training the embedding model, I used a set of up to 50,000 cropped images, obtained by isolating all detected fashion objects within the style board images. More than 12,000 of these images were used to build the vector index.


For training the object detection model, I used over 12,000 style board images, each with complete bounding boxes for individual fashion objects as ground-truth labels.
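
As a sketch of how the embedding-model crops could be produced from these labels, the snippet below cuts each annotated fashion object out of a style board image with Pillow. The box format (pixel x_min, y_min, x_max, y_max) and file layout are assumptions, not the dataset's exact schema.

```python
from pathlib import Path
from PIL import Image

def extract_crops(image_path, boxes, out_dir):
    """Save one cropped image per ground-truth bounding box.
    `boxes` is assumed to hold (x_min, y_min, x_max, y_max) pixel coordinates."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    image = Image.open(image_path).convert("RGB")
    stem = Path(image_path).stem
    for i, (x1, y1, x2, y2) in enumerate(boxes):
        image.crop((x1, y1, x2, y2)).save(out_dir / f"{stem}_{i}.jpg")
```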


For more details on dataset preparation, please take a look at the following:


ML Modeling

In this stage, I trained a YOLOv5 detection model that detects fashion objects in images and crops them into individual images. I also trained an embedding model to represent the detected fashion objects, mapping visually similar items close together so they can be retrieved from the vector index.
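
For the detection-and-cropping step, a minimal sketch using the YOLOv5 hub interface could look like the following; the weights path and confidence threshold are placeholders rather than the project's exact settings.

```python
import torch
from PIL import Image

# Load a custom-trained YOLOv5 model through the ultralytics/yolov5 hub entry point.
# "weights/fashion_yolov5.pt" is a placeholder path for this project's weights.
model = torch.hub.load("ultralytics/yolov5", "custom", path="weights/fashion_yolov5.pt")

def detect_and_crop(image_path, conf_threshold=0.4):
    """Return one cropped PIL image per detected fashion object."""
    image = Image.open(image_path).convert("RGB")
    detections = model(image).xyxy[0]   # rows: x1, y1, x2, y2, confidence, class
    crops = []
    for x1, y1, x2, y2, conf, _cls in detections.tolist():
        if conf >= conf_threshold:
            crops.append(image.crop((x1, y1, x2, y2)))
    return crops
```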


I outlined my process for training the models, along with the analysis, in the following blog posts:


Object Detection Model

I achieved the following results by fine-tuning a pre-trained YOLOv5 model:

mAP     Precision   Recall
60.9    54.8        64.8


Embedding Model


I used an autoencoder architecture for the embedding model, as shown above, with convolutional layers in both the encoder and the decoder. From the bottleneck layer, I obtained a 512-dimensional embedding, yielding the following results:

Model               Index Size   mAP
Baseline (4096-d)   163.8 MB     0.44
+Layer (512-d)      25 MB        0.53

The model used for this project is the +Layer model, which keeps the baseline structure but adds a layer that reduces the embedding from 4096 to 512 dimensions.
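
For reference, a convolutional autoencoder with a 512-dimensional bottleneck along these lines could be written in PyTorch as below; the number of layers, channel sizes, and input resolution are illustrative and not necessarily those of the trained model.

```python
import torch.nn as nn

class ConvAutoEncoder(nn.Module):
    """Convolutional autoencoder sketch with a 512-d bottleneck (input: 3x128x128).
    Layer counts and channel sizes are illustrative."""
    def __init__(self, embedding_dim=512):
        super().__init__()
        self.encoder = nn.Sequential(                       # 3x128x128 -> 128x16x16
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.to_embedding = nn.Linear(128 * 16 * 16, embedding_dim)    # bottleneck
        self.from_embedding = nn.Linear(embedding_dim, 128 * 16 * 16)
        self.decoder = nn.Sequential(                       # 128x16x16 -> 3x128x128
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, x):
        z = self.to_embedding(self.encoder(x).flatten(1))   # 512-d embedding
        recon = self.decoder(self.from_embedding(z).view(-1, 128, 16, 16))
        return recon, z
```

After training with a reconstruction loss, only the encoder path would be needed at serving time: the 512-d `z` is what would populate the vector index.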


Demo

Below are two short video demos of the system:





Results

Sample results taken from the app gallery:









Recommendations and Use Case

The analysis and recommendations for the individual models are covered in their respective blog posts; please refer to the posts linked in the ML Modeling section above to check them out.


To improve the recommender system as a whole, I would recommend the following:

  • The current models are trained on images with exclusively white backgrounds; consider incorporating images with varied backgrounds to add variability to the dataset.

  • The current models are mostly trained on female fashion; consider adding male and/or gender-neutral fashion as well.

  • Consider integrating a semantic search capability into the system to refine recommendations, allowing users to upload images and specify preferences in text for a more personalized and detailed search experience.

  • Due to memory constraints, the current index comprises only image embeddings and image paths. To enhance recommendations, however, additional attributes could be incorporated, such as color, pattern, category, product description (which also aligns with adding semantic search), price, and more, as sketched after this list.
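
As a sketch of that direction, a small in-memory index could keep a metadata record alongside each embedding and filter candidates by attributes at query time; the field names and filter logic here are illustrative, not the project's implementation.

```python
import numpy as np

class FashionIndex:
    """Toy index pairing each embedding with item metadata; field names are illustrative."""
    def __init__(self):
        self.vectors, self.metadata = [], []

    def add(self, embedding, meta):
        v = np.asarray(embedding, dtype=np.float32)
        self.vectors.append(v / np.linalg.norm(v))
        self.metadata.append(meta)          # e.g. {"path": ..., "color": ..., "price": ...}

    def search(self, query_embedding, top_k=5, **filters):
        q = np.asarray(query_embedding, dtype=np.float32)
        q /= np.linalg.norm(q)
        sims = np.stack(self.vectors) @ q   # cosine similarity (rows are normalized)
        hits = []
        for i in np.argsort(-sims):
            meta = self.metadata[i]
            if all(meta.get(k) == v for k, v in filters.items()):
                hits.append((meta, float(sims[i])))
                if len(hits) == top_k:
                    break
        return hits
```

A call such as `index.search(query_vec, top_k=5, color="black")` would then narrow visually similar results to a requested attribute, and the same idea extends to text-derived filters for semantic search.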

The visual search system can be customized to meet other requirements. In this project, users upload an image for the system to generate recommendations, but the same approach supports other scenarios. In an e-commerce setting, for example, where fashion trends shift with the seasons and new items are introduced regularly, the system could recommend these new items to existing customers based on their purchase history.


References

The workflow for this project was inspired by the following paper:


