Measuring Image Similarity with Neural Nets

Automatically finding similar and duplicate images is a quick way to show similar products or items from a collection of images. For example, I was shopping for a phone case and the online store had many interesting designs, but they were hard to navigate. Once I found a case that I kind of liked, I wanted to see other similar cases to find one that I really liked. Unfortunately, the store only showed other popular cases that were not at all similar to the one I was considering.

You could try to do this by hand or by analyzing meta-data but it would be difficult with a large number of images. This article shows a way to use a neural net to examine and group similar images automatically. These art images below were grouped together with no external / meta information or human input at all.

The general approach

If you have meta-data for the images, such as tags or a text description, you could use that information to group similar images together. For example, you might group all the "dogs" together or the "women laughing alone with salad" images together. However, that information is not always available, so someone would have to go through and manually add tags or descriptions to the images.

You may also be able to use "collaborative filtering" if you have likes or ratings for the images. This approach groups together images that people tend to like or rate similarly and gives you a "people who liked this image also liked this other image" type of experience. Again, you would need a way to gather those likes and ratings.

For fully automated approaches you'd want to find a way to create a smaller mathematical representation of each image that you could use for comparisons. These representations are often called fingerprints, embeddings or features. Popular representations include statistical approaches such as principal component analysis (PCA), color histograms, hand-written, custom or standard image processing features such as Histogram of Oriented Gradients (HOG), and more modern deep learning approaches such as auto-encoders, deep metric learning, custom neural nets and pre-trained neural nets.
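To make the idea of a fingerprint concrete, here is a minimal sketch of one of the simpler options from that list, a color histogram. It is not the approach used in the rest of this article. The function name `color_histogram` and its `bins_per_channel` parameter are made up for illustration, and I'm assuming the image has already been decoded into a list of (r, g, b) tuples with values from 0 to 255.

```python
def color_histogram(pixels, bins_per_channel=4):
    """Return a normalized RGB histogram as a flat list.

    `pixels` is an iterable of (r, g, b) tuples with values in 0..255.
    The result has bins_per_channel ** 3 entries that sum to 1.0.
    """
    bin_width = 256 // bins_per_channel
    counts = [0] * (bins_per_channel ** 3)
    total = 0
    for r, g, b in pixels:
        # Clamp so that 255 always lands in the last bin, even when
        # 256 is not evenly divisible by bins_per_channel.
        r_bin = min(r // bin_width, bins_per_channel - 1)
        g_bin = min(g // bin_width, bins_per_channel - 1)
        b_bin = min(b // bin_width, bins_per_channel - 1)
        index = (r_bin * bins_per_channel + g_bin) * bins_per_channel + b_bin
        counts[index] += 1
        total += 1
    if total == 0:
        raise ValueError("no pixels given")
    # Normalize so images of different sizes are comparable.
    return [count / total for count in counts]


# A tiny four-pixel "image": two reds, one blue, one black.
tiny_image = [(255, 0, 0), (250, 10, 5), (0, 0, 255), (0, 0, 0)]
fingerprint = color_histogram(tiny_image, bins_per_channel=2)
```

With two bins per channel the fingerprint has only eight numbers no matter how large the image is, which is what makes it cheap to compare. The trade-off is that it only captures color, not shapes or content.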

In this example we use a pre-trained neural net designed to do image classification for a once popular competition called ImageNet. The network we'll use is VGG16 (I still find it funny that the networks get their own name) but you could experiment with other networks and see what kind of different results you get.

ImageNet networks are designed to receive a (small) image and then output one of a thousand different classes. The classes are all general purpose, everyday things like dogs, cats, birds, cars, school buses, etc. You could use the final prediction to auto tag your images, but we'll use the output from the next to last layer instead. The output from this layer does not have a semantic meaning like dog or plane, but it feeds features into the final layer that then makes the classification. We'll use those features in this example. You can also experiment with layers earlier in the network to see what kind of results you get there.
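Here is a rough sketch of how you might pull out those next to last layer features with Keras. It is an illustration under assumptions, not necessarily the code behind this project: I'm assuming TensorFlow with its bundled Keras and NumPy are installed, that `load_img` and `img_to_array` are available under `tensorflow.keras.utils` (older releases expose them under `tensorflow.keras.preprocessing.image`), and the helper names `build_feature_extractor` and `extract_features` are made up. In Keras's VGG16 the layer just before the final "predictions" layer is named "fc2" and outputs 4096 values.

```python
def build_feature_extractor():
    """Return a Keras model that maps 224x224 RGB images to fc2 activations."""
    # Imported lazily so this file can be loaded without TensorFlow installed.
    from tensorflow.keras.applications.vgg16 import VGG16
    from tensorflow.keras.models import Model

    # include_top=True keeps the fully connected layers, including "fc2".
    base = VGG16(weights="imagenet", include_top=True)
    # Cut the network off one layer before the 1000-way "predictions" layer.
    return Model(inputs=base.input, outputs=base.get_layer("fc2").output)


def extract_features(extractor, image_path):
    """Return the fc2 features for one image file as a flat NumPy vector."""
    import numpy as np
    from tensorflow.keras.applications.vgg16 import preprocess_input
    from tensorflow.keras.utils import img_to_array, load_img

    # VGG16 expects 224x224 inputs, so the image is resized on load.
    img = load_img(image_path, target_size=(224, 224))
    # The model works on batches, so add a leading batch dimension of 1.
    batch = np.expand_dims(img_to_array(img), axis=0)
    features = extractor.predict(preprocess_input(batch), verbose=0)
    return features.flatten()


# Example usage (the first call downloads the ImageNet weights;
# "some_image.jpg" is a placeholder path):
#   extractor = build_feature_extractor()
#   vector = extract_features(extractor, "some_image.jpg")
```

To try an earlier layer, you would pass a different layer name to `get_layer`, keeping in mind that convolutional layers produce much larger, multi-dimensional outputs.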

The output of this second to last layer can be a single large vector or a multi-dimensional array depending on the network, so you may have to flatten the output to treat it as one long vector. This vector can then be interpreted as a point in a high-dimensional space. Points that are close together are similar in some network-specific way, so their corresponding images should be similar too. Rather than Euclidean distance it is common to use a cosine distance metric, but many different distance metrics are possible. Again, you can experiment to see what kind of results you get on your data set.
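As a small illustration of flattening and of the two distance metrics mentioned above, here is a sketch in plain Python. The function names are my own, and real code would more likely use NumPy or SciPy. Cosine distance here means one minus the cosine of the angle between the two vectors, so it ignores their lengths and only looks at direction.

```python
import math


def flatten(values):
    """Flatten arbitrarily nested lists or tuples of numbers into one list."""
    flat = []
    for value in values:
        if isinstance(value, (list, tuple)):
            flat.extend(flatten(value))
        else:
            flat.append(float(value))
    return flat


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """1 - cosine similarity: 0 for the same direction, 2 for opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


# A 2x2 "layer output" and a copy scaled by 2: same direction, different length.
a = flatten([[1, 2], [3, 4]])
b = flatten([[2, 4], [6, 8]])
print(euclidean_distance(a, b))  # clearly non-zero
print(cosine_distance(a, b))     # effectively zero
```

The two example vectors point in the same direction but have different lengths, so the Euclidean distance treats them as far apart while the cosine distance treats them as essentially identical. That insensitivity to overall magnitude is one reason cosine distance is a common choice for network features.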

Given these points in space you typically want to find other points that are close by (similar images) and sometimes points that are far away (dissimilar images). This general problem is often called nearest neighbor search, or 'n-nearest neighbors'. In an offline / batch application you can do this by brute force: compare each point to every other point, sort the distances and take the n smallest or largest as your answer. This is likely to be too slow for an interactive app, so in that case you may want to use an approximate nearest neighbor algorithm such as Annoy. There are a variety of algorithms and libraries available, and you can review benchmarks to pick one that would work for you. There are even hardware accelerators that can very quickly find exact nearest neighbors.
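The brute force version is short enough to sketch in plain Python. This is only an illustration with made-up names (`nearest_neighbors`, `farthest`), using cosine distance and tiny two-dimensional vectors in place of real image features.

```python
import heapq
import math


def cosine_distance(a, b):
    """1 - cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def nearest_neighbors(query_index, vectors, n=3, farthest=False):
    """Compare one vector against all others and return n (distance, index) pairs.

    With farthest=False the closest vectors come first; with farthest=True
    the most distant ones come first.
    """
    distances = [
        (cosine_distance(vectors[query_index], vector), index)
        for index, vector in enumerate(vectors)
        if index != query_index  # skip comparing the query with itself
    ]
    pick = heapq.nlargest if farthest else heapq.nsmallest
    return pick(n, distances)


# Four toy "images"; index 1 points almost the same way as index 0.
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
similar = [index for _, index in nearest_neighbors(0, vectors, n=2)]
dissimilar = [index for _, index in nearest_neighbors(0, vectors, n=1, farthest=True)]
```

Each query touches every other vector, so doing this for all images grows with the square of the collection size. That is fine for a one-off batch job but is the reason approximate indexes exist for interactive use.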

The images

For this example, I wanted to use images that would be interesting to look at, so I downloaded 29,232 images from the Cleveland Museum of Art Open Access. I want to thank the museum for making the images available and to be clear that the museum is not responsible for this project and does not directly endorse it.

Overall I was impressed by the results of this simple approach. Sometimes it is hard to tell why the network grouped some images together but many times it absolutely did a great job finding groups of images that were very similar. You can explore the results yourself at https://image-similarity.e-string.com

I also have a GitHub repo at https://github.com/JulioBarros/image-similarity that I used for this project and that you could use as a starting point for your own image similarity experiments.

What do you think?
