
Fashion Recommender System with Visual Similarity Search



Introduction

The objective of this project is to create a content-based fashion recommender system that uses vector embeddings of images for ranking and retrieval. When shopping for clothing online, the appearance of a garment is often an important factor in the decision-making process.


With the availability of visual data online, visual search has become an increasingly important way to find information that is difficult to communicate through text.


The project is separated into three notebooks; each step below links to its respective notebook.


To go to the full folder containing the notebooks and sample results, click here.

This post contains the analysis, recommendations, and a high-level overview of each step.


The following is the flowchart for this project:




Data Preparation and Exploration

For this project, I utilized the public dataset released with the paper Bootstrapping Complete The Look at Pinterest, available here.

The paper itself is available here.


The public dataset consists of over 100,000 Polyvore-style images (fashion boards) with white backgrounds, annotated with bounding boxes as ground-truth labels for fashion categories; over 400,000 fashion objects are identified in these images.


I used Python to crop each fashion object out of these fashion-board images, ending up with a total of 50,000 individual object images.
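A minimal sketch of that cropping step with PIL; the file paths and the (category, bounding box) annotation format below are hypothetical stand-ins, since the dataset ships its own annotation schema:

```python
from PIL import Image

def crop_objects(board_path, bboxes, out_dir):
    """Crop each annotated fashion object out of a fashion-board image.

    bboxes: list of (category, (x_min, y_min, x_max, y_max)) tuples,
    assumed to be in pixel coordinates.
    """
    board = Image.open(board_path).convert("RGB")
    for i, (category, box) in enumerate(bboxes):
        crop = board.crop(box)                      # (left, upper, right, lower)
        crop.save(f"{out_dir}/{category}_{i}.jpg")

# Hypothetical usage: one shirt and one pair of shoes on a single board
crop_objects(
    "boards/board_0001.jpg",
    [("shirt", (10, 20, 210, 340)), ("shoes", (250, 400, 420, 520))],
    "crops",
)
```

Let's take a look at some of the resulting cropped images: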



Class Distribution of the dataset:

The shoes and handbags categories make up around 40% of the dataset, while some categories, such as stockings and neckties, have very little data. This may affect how well the model learns the features of those product categories later.


Scatter Plot of Fashion Categories in 2D

The plot below was obtained by reducing the image embeddings to two dimensions with a nearest-neighbor-based dimensionality-reduction technique.

Looking at the clusters, the handbags category is more spread out than the others, which may indicate that handbag features are more distinctive, while the categories clustered in the middle, such as shirts, may have more generic features.


The shoes and pants categories are also distinct, sitting further away from the middle clusters. There are a few outliers in some categories as well; for instance, the two yellow points (dresses) that sit far from the rest of the yellow cluster, among the pants. This may mean these two items share features with most pants items, or they may be mislabelled data points.
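For reference, a minimal sketch of how such a 2D scatter can be produced with UMAP, a dimensionality-reduction method built on a k-nearest-neighbor graph; the `features` matrix and `labels` array are assumptions standing in for the image embeddings and their categories:

```python
import numpy as np
import umap                      # umap-learn package
import matplotlib.pyplot as plt

# Assumed inputs: `features` is the (n_images, n_dims) embedding matrix
# and `labels` holds one category name per image.
reducer = umap.UMAP(n_neighbors=15, n_components=2, random_state=42)
coords = reducer.fit_transform(features)        # (n_images, 2)

labels = np.asarray(labels)
for category in np.unique(labels):
    mask = labels == category
    plt.scatter(coords[mask, 0], coords[mask, 1], s=4, label=category)
plt.legend(markerscale=3, fontsize=7)
plt.title("Fashion categories in 2D")
plt.show()
```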

Model Training

To get visually similar product recommendations for a query item, I first have to extract the visual embedding of the object before fetching its nearest neighbors. I explored and experimented with multiple neural network architectures for extracting visual features from fashion image datasets, particularly scene images and product shots.

The notebooks for these experiments are available in my GitHub repo here.


Product shots are fashion product images with a white background, such as the cropped object images in the dataset mentioned above. Scene images show clothing articles in real-world scenes. Let's look at an example of each for comparison:


For this project, I trained a ResNet18 CNN using the fastai framework on a T4 GPU for 10 epochs.
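A minimal sketch of that training setup with fastai; the folder layout, batch size, and image size are assumptions, since the post only specifies ResNet18, a T4 GPU, and 10 epochs:

```python
from fastai.vision.all import *

# Assumes the cropped object images are organised as crops/<category>/<image>.jpg
dls = ImageDataLoaders.from_folder(
    "crops", valid_pct=0.2, item_tfms=Resize(224), bs=64
)

# top_k_accuracy defaults to k=5, matching the top-5 plot below
learn = vision_learner(dls, resnet18, metrics=[accuracy, top_k_accuracy])
learn.fine_tune(10)              # 10 epochs, as used in this project
```

Let's take a look at a couple of training plots: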

On the left are the training and validation losses; both decrease during training but have not yet converged by epoch 10 (or epoch 9 if counting from 0, as in the plot). On the right are the top-1 and top-5 accuracies: top-5 consistently reaches 99%, while top-1 climbs from 88% to 94% over epochs 0-9.


Validation loss is consistently lower than training loss in the plot, which may be due to the dropout layers used in the model architecture. Since the model hasn't reached convergence by the last epoch, it is difficult to pinpoint the issue or decide on exact changes; training for more epochs would allow for better analysis.


Top-1 is the conventional accuracy score as it measures the proportion of examples for which the predicted label matches the ground-truth label, while Top-5 considers a classification to be correct as long as any one of the 5 predictions matches the ground-truth label.


For future training, I would recommend training for more epochs so that the model can reach convergence.


Now, let's also take a look at some of its actual predictions:


Let's also take a look at the confusion matrix for the model's classifications. There are some wrong predictions, mostly between categories with similar features, such as shirts vs. coats and dresses vs. skirts. Some items with less data are not shown in the matrix below, such as gloves, jumpsuits, neckties, and stockings.


Feature Extraction

To extract the features of the images in the dataset, I took the second-to-last linear layer of the model, which stores the embeddings of these images, and saved the numerical representations into a data frame:
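A minimal sketch of that extraction step using a PyTorch forward hook; the layer index into `learn.model` is a hypothetical placeholder, since the exact position of the penultimate linear layer depends on the fastai head:

```python
import torch
import pandas as pd

embeddings = []

def hook_fn(module, inputs, output):
    # Capture the activations (the 512-d embedding) for each batch
    embeddings.append(output.detach().cpu())

# Hypothetical layer index: inspect print(learn.model) to find the
# second-to-last linear layer of the classification head.
penultimate = learn.model[1][4]
handle = penultimate.register_forward_hook(hook_fn)

learn.model.eval()
with torch.no_grad():
    for batch, _ in dls.valid:          # run the images through the model
        learn.model(batch)

handle.remove()
features = torch.cat(embeddings).numpy()      # shape: (n_images, 512)
df = pd.DataFrame(features)                   # one row per image embedding
```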



Similarity Search

For this project, I explored four different nearest-neighbor approaches:

  • Flat Index (FAISS)

  • Inverted File Index with Product Quantization (FAISS)

  • Tree-based index (ANNOY)

  • HNSW (NMSLib)

Let's take a look at the time to build the index and the average search time for the four indexes for this project:

The left plot shows the time to build the index for each method, and the right plot shows the search time (k=5) for each method. Let's also look at the table for the timings in milliseconds:

The flat index takes the least time to build, but it is the second slowest to search at inference. In contrast, HNSW has the fastest search, with ANNOY a close second, though HNSW takes almost 2.5x longer than ANNOY to build its index.


With this information alone, ANNOY seems like a good choice given the tradeoff between index build time and search time, with HNSW close behind. Now, let's take a look at their memory consumption:

Looking at the plot above, the flat index uses the most memory to build its index, with ANNOY second at 84.25 MB, while HNSW uses the least at only 4.95 MB; ANNOY uses significantly more memory than HNSW in this case.


The above uses different parameters for each method, and each can be fine-tuned further. The general recommendation for the dataset in this project (50,000 vectors with 512-dimensional embeddings) is to use HNSW or an Inverted File Index with Product Quantization.


Limitations

When using ANN for similarity search, the parameters of each method affect the end results: search time, index build time, memory usage, and accuracy. I used significantly more trees for the ANNOY method (ntrees=50) compared to significantly fewer edges/connections (post=2) for the HNSW method, which affects the end results, so this may not be an exact comparison between the methods.


Take a look at the following resources for more rigorous ANN comparisons/benchmarks:


Results

Going back to the purpose of this project, recommending the top-k similar items for a query item, let's take a look at some results for each similarity index. The query item is subtitled "item selected" and the recommended items are subtitled "Recommended item":


Flat Index (FAISS)

The flat index is the exhaustive-search (or brute-force) method; I used the FAISS library for the implementation in this project. In general, a flat index produces the most accurate results but does not scale to large datasets. I used L2 distance, with k=5 and k=6 respectively.
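A minimal sketch of this setup, assuming `features` is the (50,000 x 512) embedding matrix extracted earlier (the variable name is a stand-in):

```python
import faiss
import numpy as np

d = 512                                    # embedding dimensionality
xb = np.ascontiguousarray(features, dtype="float32")

index = faiss.IndexFlatL2(d)               # exhaustive search with L2 distance
index.add(xb)                              # index all 50,000 embeddings

query = xb[0:1]                            # use the first item as the query
distances, ids = index.search(query, 6)    # top-6 nearest neighbours
print(ids[0])                              # indices of the recommended items
```

Note that a query drawn from the indexed set returns itself as its own nearest neighbor, so requesting k=6 is one way to get 5 usable recommendations, which may be why both k=5 and k=6 appear above.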




Inverted File Index with PQ (FAISS)

Composite indexes such as IVF-PQ are suitable for memory-constrained applications with huge datasets (~100M vectors). A few parameters control how the index is built. For this project, I used the following parameters and values (a minimal sketch follows the list):

  • Sub-vector size: 8

  • Number of Partitions: 8

  • Search in x partitions: 2
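A minimal sketch of that configuration with FAISS, reusing `xb` and `query` from above; reading "sub-vector size: 8" as 8 sub-quantizers (m=8) is an assumption, since FAISS's parameter is the number of sub-vectors and the dimensionality (512) must be divisible by it:

```python
import faiss

d, nlist, m, nbits = 512, 8, 8, 8          # dims, partitions, sub-vectors, bits per code

quantizer = faiss.IndexFlatL2(d)           # coarse quantizer for the partitions
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, nbits)

index.train(xb)                            # learn the partitions and PQ codebooks
index.add(xb)
index.nprobe = 2                           # search 2 of the 8 partitions

distances, ids = index.search(query, 5)
```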

Let's take a look at some of its results; I used k=5 and k=10 to see if there's a difference in accuracy:



Tree-based algorithm (ANNOY)

I used the ANNOY library for this implementation, with ntrees=50.
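A minimal sketch of the ANNOY setup; the euclidean metric is an assumption, chosen to mirror the L2 distance used with FAISS above:

```python
from annoy import AnnoyIndex

d = 512
index = AnnoyIndex(d, "euclidean")         # metric is an assumption
for i, vec in enumerate(features):
    index.add_item(i, vec)

index.build(50)                            # ntrees=50, as used in this project
index.save("fashion.ann")                  # ANNOY indexes are memory-mapped files

# Top-5 neighbours of item 0 (the first result is the item itself)
ids = index.get_nns_by_item(0, 5)
```

I also tried k=5, k=6, and k=10; let's look at some of the results: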



A less accurate result was also found (the query item was a shoe, but 70% of the recommendations were non-shoe items):



HNSW (NMSLib)

For my implementation in this project, I used the NMSLib library, but there are other well-known HNSW implementations, such as the one in FAISS. I used a maximum of 2 connections/edges per node, with cosine similarity as the distance metric.
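A minimal sketch of the NMSLib setup; mapping "2 connections/edges" to M=2 and the Limitations section's setting to post=2 are assumptions about how the post's parameters translate to NMSLib's index-time options:

```python
import nmslib

# cosinesimil = cosine similarity, as used in this project
index = nmslib.init(method="hnsw", space="cosinesimil")
index.addDataPointBatch(features)

# M bounds the edges per node; post controls graph post-processing
index.createIndex({"M": 2, "post": 2}, print_progress=True)

ids, distances = index.knnQuery(features[0], k=5)
```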


I also ran the same query item through two different methods to see whether they recommend different items; both recommend the same product categories at k=5, but a couple of items differ between the two lists:

(left is recommendations with HNSW and right is recommendations with Flat Index)


(left is recommendations with ANNOY k=6, right is recommendations with FAISS IVF-PQ k=8)


Evaluation

For this project, I used offline evaluation with Mean Average Precision (mAP) as the metric. Let's first take a look at the AP formula:
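A common formulation of AP at cutoff k, consistent with how it is used below:

$$AP@k = \frac{1}{r} \sum_{i=1}^{k} P(i) \cdot \mathrm{rel}(i), \qquad mAP@k = \frac{1}{N} \sum_{q=1}^{N} AP@k_{q}$$

where P(i) is the precision of the top-i recommendations, rel(i) is 1 if the i-th recommendation is relevant and 0 otherwise, r is the number of relevant items among the top k, and N is the number of query output lists.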



A recommendation is considered relevant if it belongs to the same category as the query item.

To calculate the mAP, average the AP over the total number of output lists produced by the model. Let's take a look at a sample figure from ByteByteGo:


The result below shows the mAP of each method from the experiments, with the flat index achieving the highest score.

Looking at the mAP, I wanted to identify the categories that are bringing down the AP. Let's take a look at some of them (I'm showing outputs from the flat index, since it generally produces the most accurate results):




Based on the sample predictions obtained, there doesn't seem to be any correlation between class category and false positives. Further analysis needs to be done to improve mAP across categories (see recommendations).


Recommendations

The objective of the project is to recommend similar items by looking at the visual representations of the products, so the recommendations focus on the following:

  • Improving the accuracy of each product classification during model training and selection

  • Improving the accuracy of recommendations while maintaining a certain level of search speed, index build time, and memory consumption for the chosen ANN index

  • Looking for a satisficing threshold for certain metrics, such as search speed, index build time, and maximum RAM usage

To improve the accuracy of product classification during model training and selection, I would recommend the following:

  • Compile a classification report with precision, recall, and F1 score for each class during model training, and set a threshold for each category to further improve classification accuracy.

  • For low-accuracy categories, conduct an error analysis to see whether more data, more regularization, a different model, or more or fewer epochs are needed, etc.

  • Based on the current results shown above in Model Training, the ResNet18 model seems promising but needs additional training epochs until it reaches convergence to allow further analysis.

To improve accuracy of recommendations while maintaining other measures as mentioned above, I would recommend the following:

  • As noted above in Similarity Search, the index methods weren't configured with comparable parameters in these experiments; for example, the ANNOY index used 50 trees while the HNSW index only used 2 edges, which disproportionately widens the gap in memory consumption between the two. The recommended next step is to compare the indexes with more comparable parameters, chosen based on the dataset size and the dimensionality of the vectors.

  • Based on the recommendations here and here, a standard flat index with IVF may work best for the current dataset size and vector dimensions, so the next approach would be to try IVF with a flat index in place of IVF with PQ, which may reduce index build time (see the sketch below).
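A minimal sketch of that IVF-Flat variant, reusing `xb` and `query` from the Results section; nlist=8 simply mirrors the IVF-PQ run above and would need tuning:

```python
import faiss

d, nlist = 512, 8

quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFFlat(quantizer, d, nlist)  # IVF partitions, no PQ compression

index.train(xb)                            # only learns the partitions (no codebooks),
index.add(xb)                              # which is why build time may drop
index.nprobe = 2

distances, ids = index.search(query, 5)
```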


Other things to consider

When building a visual search or recommender system, apart from the precision and accuracy of the model and its recommendations, it is also important to consider real-life use cases, such as the business purpose of recommending similar items, or the different scenarios in which the recommendations would be deployed.


So far, I have only discussed offline metrics, but if the model were deployed to production, it would be useful to conduct online evaluation and employ other metrics, such as click-through rate and conversion rate, as noted here in their online evaluation method.


If the model and index are built to recommend items that increase units per transaction in an e-commerce store, it may be useful to also include data on customers who purchased the clicked item. It may also be useful to add semantic data, such as price, product descriptions, and customer reviews, which could help not only in recommending more relevant products, but also in improving business metrics such as Average Unit Retail, average ticket, up-selling, and cross-selling.


