Finding Similar Text with Machine Learning and Natural Language Processing
I've been working with a client on analyzing some text documents and wanted to share a bit of what has been working for us. I can't share the data or the exact project details, but it entails finding text documents in a large collection that are similar to a specific query example. Imagine searching a database of company statements, product descriptions, articles, contracts, emails, support/trouble tickets, etc. not by keyword but by 'meaning' and 'similarity'.
For this article I thought about using the Enron email or a financial filings dataset but wanted something with more general interest and relatability, so instead I found something more fun and accessible on Kaggle. The Wikipedia Movie Plots dataset has ~34k records with movie titles, release year, genre, etc. and, most importantly for us, 'Plot' descriptions. It seems like the perfect dataset to experiment with and to use when exploring similarity/semantic search engines, content based recommenders, movie plot classifiers and many other interesting projects. In this example we're going to use the data in the plot field to find other 'similar' plots.
Keep in mind that we're going to look for similar plots (write-ups) and not similar movie styles, similar genres, etc. That is, a (simple) plot, "boy meets girl, boy loses girl", could be rendered in many different styles and genres and with many different character archetypes. That said, the plots were written by people after having seen the movie, and people will tend to include specific details, such as actors, in the 'plot', so they are not completely independent of other characteristics of the movie ... which can make for some other interesting future experiments, such as whether we can predict the genre or year from the plot.
Business Goal and Metric
Analyzing the plot field highlights the first issue in any machine learning project: identifying the business goal and finding an appropriate metric. This is the heart of every project and an important and often tricky first step. It is particularly challenging in similarity or clustering based (i.e. unsupervised) experiments, as there is no canonical correct answer and the utility and validity are very context dependent.
The point of this article is to give a high level overview of some techniques and approaches, not to propose a specific solution to a specific problem ... so, I'm just going to punt on it here and say that my business goal is to 'explore similarity in various ways' and my metric is 'do I get answers that feel right'. If this were a recommender system we might want to measure whether the similar movies were on users' 'favorites' lists or how often people added recommended movies to their queue and so on. If you have a more specific business need I'd be happy to discuss it.
Feature Generation
Once you have a goal and metric in mind the next step is to develop ways of generating features that you feel may be useful to achieve those goals. Simply put, the algorithms need numbers, we have text. What numbers are we going to generate from our text? For example, we could use the length of each plot write up, the number of vowels, the number of large words, etc. or more advanced techniques like TF-IDF or neural net transformers. Some of these features will work better than others in different applications so feature selection is part art and part science.
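To make the 'numbers from text' idea concrete, here is a minimal sketch of the simple hand-crafted features mentioned above. The function name `simple_features` and the 7-letter cutoff for a 'large' word are my own illustrative choices, not something taken from the project.

```python
def simple_features(text, long_word_len=7):
    """Turn a plot write-up into a few crude numeric features.

    The feature choices and the long_word_len cutoff are illustrative
    assumptions, not tuned values.
    """
    words = text.split()
    n_vowels = sum(1 for ch in text.lower() if ch in "aeiou")
    # strip surrounding punctuation before measuring word length
    n_long = sum(1 for w in words if len(w.strip(".,;:!?\"'()")) >= long_word_len)
    return [len(text), n_vowels, n_long]


features = simple_features("Boy meets girl, boy loses girl.")
print(features)  # [character count, vowel count, long-word count]
```

Features like these are cheap to compute but carry very little meaning, which is why the rest of the article leans on learned embeddings instead.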
For this example I could have used a classic technique, TF-IDF, and perhaps we'll revisit that in a future article, or a hot new BERT based model. I chose instead to use the standby Universal Sentence Encoder (USE) because it is easy to set up and use and works really well. TF-IDF is probably easier if you don't have any TensorFlow experience and a fine-tuned BERT based model would probably work better if you wanted to invest the time but USE is a great place to start and in a few lines of code you can get a meaningful numeric 512 dimensional representation (embedding) of short text passages.
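import tensorflow_hub as hub
# load the pre-trained Universal Sentence Encoder (large, version 5) from TensorFlow Hub
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder-large/5")
# my_text is a list of strings (here, the plot descriptions); each one becomes a 512 dimensional vector
embeddings = embed(my_text)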
Measuring Similarity with Cosine
Ok, great, so I have a 34886 x 512 matrix. Now what? Well, the idea behind turning the text into an embedding vector is that the embedding now represents a point in a semantic space, and other points 'near' it represent other similar embedded items (plot descriptions).
The next decision is to formalize what we mean by 'near'. In two or three dimensional spaces we have intimate ingrained knowledge and a firm concept of 'nearness', and we call that the Euclidean distance. Unfortunately Euclidean distance tends not to work well in high dimensional spaces because of the curse of dimensionality. The bottom line is that we'll be better off using angular (cosine) distance in high dimensions.
Luckily that is easy to do with Python and scikit-learn. Given a point/embedding represented by v and the matrix of embeddings, we can calculate all the angular distances (similarities actually) and then find the top n similarities, roughly as follows:
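from sklearn.metrics.pairwise import cosine_similarity

def top_n_similar(v, embeddings, n=10):
    # v must be 2-D with shape (1, 512); use v.reshape(1, -1) for a 1-D vector
    d = cosine_similarity(v, embeddings).ravel()  # find similarity and flatten results
    a = d.argsort()[::-1]  # find the indexes of the sorted similarities and reverse them
    return a[:n]  # return the indexes of the top n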
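To see what the cosine measure captures compared with Euclidean distance, the sketch below computes both by hand for toy vectors. The vectors and helper names are made up for the example; it only shows that cosine similarity depends on direction and ignores length.

```python
import math


def cosine_sim(a, b):
    """Cosine of the angle between two vectors given as lists of numbers."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def euclidean_dist(a, b):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


u = [1.0, 2.0, 3.0]
v = [10.0, 20.0, 30.0]   # same direction as u, ten times longer
w = [3.0, 2.0, 1.0]      # same length as u, different direction

print(cosine_sim(u, v), euclidean_dist(u, v))  # direction matches, but far apart
print(cosine_sim(u, w), euclidean_dist(u, w))  # close by, yet pointing elsewhere
```

Here u and v are as similar as possible by cosine even though they are far apart by Euclidean distance, while u and w are the other way around.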
Results
Using these embeddings, let's take a look at a few examples from movies that I have watched and enjoyed: 'There Will Be Blood', 'The Big Lebowski', and 'Monty Python and the Holy Grail'. Take a moment to think about how you'd describe their plots.
The tables below give the 10 most similar plots to each of the movies based on the cosine similarity of the USE embeddings. Note that the plots here have been truncated to fit in the table. The index isn't too useful if you don't have the exact shuffling of the dataset that I used, but it helps me keep track of the movies. Similarity is the cosine of the angle between the two embeddings/vectors. An angle of 0 has a cosine of 1, so that is most similar. IOU is intersection over union, which I added as a measure of the commonality of the words in the plot descriptions between two movies. Genre is specified in the dataset and is a very messy feature. Though the genre could be useful for some experiments, there are ~2265 unique values and it would need to be cleaned up. Title is self explanatory, and the Plot column has the first few words of the plot description so we can get a feel for the text.
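The IOU column can be computed in a few lines. The sketch below is one plausible way to do it, treating each plot as a set of lowercase words; the exact tokenization behind the numbers in the tables may differ, so treat `word_iou` as an illustration rather than the code that produced them.

```python
import re


def word_iou(text_a, text_b):
    """Intersection over union (Jaccard index) of the word sets of two texts.

    Assumption: words are lowercased runs of letters, digits and apostrophes.
    """
    words_a = set(re.findall(r"[a-z0-9']+", text_a.lower()))
    words_b = set(re.findall(r"[a-z0-9']+", text_b.lower()))
    union = words_a | words_b
    if not union:
        return 0.0  # two empty texts share nothing measurable
    return len(words_a & words_b) / len(union)


print(word_iou("boy meets girl", "boy loses girl"))  # 2 shared words out of 4 distinct
```

A low IOU next to a high cosine similarity suggests the embedding is picking up on meaning rather than just shared vocabulary.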
Index | Similarity | IOU | Genre | Title | Plot |
---|---|---|---|---|---|
2028 | 1.000 | 1.000 | drama | There Will Be Blood | In 1898, Daniel Plainview, a prospector in New Mexico, mines a potenti |
34283 | 0.655 | 0.085 | drama | Kidnapped | Young David Balfour arrives at a bleak Scottish house, the House of Sh |
24308 | 0.631 | 0.089 | drama | Silver Dollar | Kansas farmer Yates Martin (Edward G. Robinson) uproots his uncomplain |
6424 | 0.620 | 0.094 | adventure | Doc Savage: The Man of Br | In 1936, Doc Savage (Ron Ely) returns to New York City following a vis |
28162 | 0.611 | 0.088 | western | Campbell's Kingdom | Recently diagnosed with a terminal disease, Bruce Campbell (Dirk Bogar |
32776 | 0.603 | 0.108 | comedy wes | Waterhole No. 3 | In Arizona, a shipment of gold bullion is stolen in an inside job by a |
8907 | 0.602 | 0.101 | western co | The Dude Goes West | A gunsmith and a marksman, Daniel Bone closes up his Brooklyn, New Yor |
30885 | 0.600 | 0.088 | family, fa | Jumanji | In 1869, near Brantford, New Hampshire, two brothers bury a chest and |
2721 | 0.597 | 0.102 | drama | The Kidnappers | In the early 1900s, two young orphaned brothers, eight year old Harry |
33689 | 0.595 | 0.076 | drama | The Hanging Tree | Joseph Frail (Gary Cooper)—doctor, gambler, gunslinger—rides into the |
I don't know the other movies but note that most have to do with family, land and greed and there are no romance or space related movies. Do you know the movies? Do you feel they are similar in some ways?
Index | Similarity | IOU | Genre | Title | Plot |
---|---|---|---|---|---|
1375 | 1.000 | 1.000 | comedy | The Big Lebowski | In 1991 Los Angeles, slacker Jeff "the Dude" Lebowski is assaulted in |
12795 | 0.592 | 0.084 | comedy | That's My Boy | In 1984, middle school student Donny Berger is in detention and begins |
31057 | 0.570 | 0.079 | comedy | Jeff, Who Lives at Home | Jeff (Segel) is a 30-year-old unemployed stoner living in his mother S |
4955 | 0.562 | 0.088 | unknown | The Nine Lives of Fritz t | It is the 1970s; Fritz the Cat is now married, on welfare, and has a c |
31531 | 0.553 | 0.078 | comedy | Little Monsters | Brian's family has moved to a new town, and he feels isolated in his n |
16555 | 0.553 | 0.087 | drama | T.R. Baskin | When Jack Mitchell (Peter Boyle), a married middle-aged salesman with |
25180 | 0.548 | 0.106 | musical dr | Inside Llewyn Davis | In February 1961, Llewyn Davis is a struggling folk singer in New York |
18782 | 0.548 | 0.106 | unknown | Inside Llewyn Davis | In February 1961, Llewyn Davis is a struggling folk singer in New York |
31802 | 0.548 | 0.068 | drama, rom | The Cooler | Unlucky Bernie Lootz (William H. Macy) has little positive going for h |
803 | 0.548 | 0.051 | unknown | Harry and Tonto | Harry Coombes (Art Carney) is an elderly widower and retired teacher w |
I've seen 'Inside Llewyn Davis' but not the others. The plots seem to be related to 'lazy' or 'unsuccessful' people. Also note that 'Inside Llewyn Davis' shows up twice even though this is supposed to be a de-duped list. I found several duplicates (not just duplicate titles) while exploring the dataset with the embeddings.
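One way to hunt for such duplicates is to look for pairs of plots whose embeddings are nearly identical. The following is a minimal sketch on toy data; the function name and the 0.99 threshold are assumptions for illustration, and building the full pairwise matrix as done here would likely need to be chunked for the full ~34k plots.

```python
import numpy as np


def near_duplicate_pairs(embeddings, threshold=0.99):
    """Return index pairs (i, j), i < j, whose cosine similarity >= threshold."""
    emb = np.asarray(embeddings, dtype=float)
    # normalise rows so that a dot product equals the cosine similarity
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sims = unit @ unit.T
    # keep only the upper triangle (k=1 skips the diagonal of self-matches)
    rows, cols = np.where(np.triu(sims, k=1) >= threshold)
    return [(int(i), int(j)) for i, j in zip(rows, cols)]


toy = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 0.0],   # exact copy of row 0
    [0.7, 0.7, 0.1],
])
print(near_duplicate_pairs(toy))
```

On the toy data only rows 0 and 2 are flagged; on real plots the flagged pairs would still need a manual look before dropping anything.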
Index | Similarity | IOU | Genre | Title | Plot |
---|---|---|---|---|---|
22511 | 1.000 | 1.000 | comedy | Monty Python and the Holy | In 932 AD, King Arthur and his squire, Patsy, travel throughout Britai |
5905 | 0.754 | 0.146 | romance | Lancelot and Guinevere | Lancelot is King Arthur's most valued Knight of the Round Table and a |
16210 | 0.749 | 0.106 | adventure | Siege of the Saxons | King Arthur learns one of his knights is plotting to take over and mar |
26822 | 0.742 | 0.114 | serial | Adventures of Sir Galahad | The Arthurian film cycle started with the Adventures of Sir Galahad se |
12475 | 0.721 | 0.120 | animated | Knighty Knight Bugs | King Arthur is sitting with his Knights of the Round Table, complainin |
11730 | 0.719 | 0.093 | musical | Camelot | King Arthur is preparing for a great battle against his friend, Sir La |
23032 | 0.711 | 0.098 | fantasy | First Knight | The film's opening text establishes that King Arthur (Sean Connery) of |
11201 | 0.704 | 0.089 | unknown | King Arthur: Legend of th | Mordred, an iron-fisted warlock, and his armies lay siege to Camelot, |
19146 | 0.704 | 0.089 | action, ad | King Arthur: Legend of th | Mordred, an iron-fisted warlock, and his armies lay siege to Camelot, |
27971 | 0.690 | 0.100 | animated | The Sword in the Stone | After the King of England, Uther Pendragon, dies, leaving no heir to t |
These all seem to relate to King Arthur and Camelot. None seem to be (absurdist) comedies, and there are no westerns or space adventures.
Dimensionality Reduction
It can often be useful to visualize our collection to get a feeling for how the items 'cluster' or relate to each other. The embedding vectors we used have 512 dimensions so to visualize them we're going to need to reduce them to 2 or 3 dimensions. The challenge is doing that while keeping as much useful information as possible.
Principal Component Analysis (PCA)
A classic approach to dimensionality reduction is Principal Component Analysis (PCA), which we'll use to find two axes through the data that show the most information (variance). Imagine trying to draw a loaf of French bread. You can rotate it to find an angle that shows the most information, and you are likely to choose an angle that shows the full length and width, which sacrifices the information in the height. You are less likely to draw it directly from the end, where you highlight the width and height but not the length. Python has PCA functionality that is fast and easy and can be done in a few lines of code.
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
pca.fit(embeddings)
pX = pca.transform(embeddings)
In the chart of the transformed PCA components we can see 3 or 4 main groups but not a lot of structure. This is not too surprising given what we now know about high dimensional spaces. It turns out we're not trying to lay out a loaf of bread but rather a high dimensional basketball. If we inspect the PCA model we see that the first two components explain only about 10.8% of the variance.
Still, let's look at the most similar plots for our 3 examples. And since we've now reduced the dimensions we can use the Euclidean distance.
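The 'loaf of bread versus basketball' intuition can be checked numerically. The sketch below uses synthetic data and plain NumPy rather than the movie embeddings, so the numbers are illustrative only; with a fitted scikit-learn PCA object the same quantity is available from its explained_variance_ratio_ attribute.

```python
import numpy as np


def top2_variance_ratio(X):
    """Fraction of total variance captured by the first two principal components."""
    centered = X - X.mean(axis=0)
    # squared singular values are proportional to the variance along each component
    s = np.linalg.svd(centered, compute_uv=False)
    var = s ** 2
    return float(var[:2].sum() / var.sum())


rng = np.random.default_rng(0)
ball = rng.normal(size=(500, 64))     # equal spread in all 64 directions
loaf = ball.copy()
loaf[:, 0] *= 10.0                    # long in one direction
loaf[:, 1] *= 5.0                     # fairly wide in a second

print(f"ball: {top2_variance_ratio(ball):.3f}")  # small: two axes show little
print(f"loaf: {top2_variance_ratio(loaf):.3f}")  # large: two axes show most of it
```

When the spread is roughly equal in every direction, no pair of axes can show much of it, which matches the low explained variance seen for the embeddings.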
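import numpy as np
from sklearn.metrics.pairwise import euclidean_distances
# x, y: 2-D coordinates of the query point; emb: the reduced (n, 2) matrix, e.g. pX from the PCA step
v = np.array([x, y])
v = v.reshape(1,-1)  # euclidean_distances expects 2-D input
d = euclidean_distances(v, emb).ravel()  # distance from the query point to every movie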
Index | Distance | (X,Y) | Genre | Title | Plot |
---|---|---|---|---|---|
2028 | 0.000 | (-0.137,-0.163) | drama | There Will Be Blood | In 1898, Daniel Plainview, a prospector in New Mexico, mines a potenti |
7432 | 0.001 | (-0.137,-0.164) | drama | Secret Nation | The film opens with the death of the elderly and wealthy Leo Cryptus ( |
18122 | 0.001 | (-0.138,-0.163) | drama | Snow Cake | When the eccentric drifter Vivienne Freeman gets a ride from a relucta |
32180 | 0.001 | (-0.137,-0.164) | war | Men in War | On 6 September 1950, an isolated and exhausted platoon of the 24th Inf |
34366 | 0.001 | (-0.137,-0.165) | horror | The Wolf Man | Sometime in the early twentieth century, after learning of the death o |
28397 | 0.001 | (-0.136,-0.164) | fantasy | Osmosis Jones | Frank Detorre (Bill Murray) is an unkempt, slovenly zookeeper at the S |
33059 | 0.001 | (-0.136,-0.163) | drama | Better Times | As described in a film magazine,[2] the plot of the film is as follows |
20301 | 0.001 | (-0.136,-0.165) | adventure | Flowing Gold | Oilfield worker John Alexander (John Garfield) is on the run from a mu |
33034 | 0.002 | (-0.135,-0.163) | crime | The Mystery Man | A newspaper man, Larry Doyle and a young woman, Anne Olgivie, meet by |
22271 | 0.002 | (-0.139,-0.163) | action | Striking Distance | Thomas Hardy, a Pittsburgh Police homicide detective, has broken the r |
A few of the plots seem to deal with death and oil but they don't really seem as relevant now. We may have lost too much information in this transformation. To actually judge that we'd need a more concrete metric and an actual business goal.
Index | Distance | (X,Y) | Genre | Title | Plot |
---|---|---|---|---|---|
1375 | 0.000 | (0.028,-0.142) | comedy | The Big Lebowski | In 1991 Los Angeles, slacker Jeff "the Dude" Lebowski is assaulted in |
27885 | 0.002 | (0.028,-0.144) | revenge, t | 22 Female Kottayam | Tessa (Rima Kallingal) is a nursing student in Bangalore with plans of |
1643 | 0.002 | (0.026,-0.143) | action / c | Let's Go! | Siu Sheung (Juno Mak) is a solitary and frustrated young man. He works |
33881 | 0.003 | (0.028,-0.139) | tokusatsu | Cho Kamen Rider Den-O & D | Taking place after the events of Kamen Rider Decade episode 15, under |
7507 | 0.003 | (0.031,-0.142) | horror | Dark Tales of Japan | Introduction: Would You Like to Hear a Scary Tale? (Intorodakushon: Ko |
28542 | 0.004 | (0.031,-0.139) | drama | End of Summer !The End of | Manbei Kohayagawa (Ganjirō Nakamura) is the head of a small sake brewe |
12552 | 0.005 | (0.033,-0.141) | drama | Confession of Pain | Police inspectors Lau Ching-hei and Yau Kin-bong arrest a rapist in 20 |
16316 | 0.005 | (0.033,-0.142) | thriller | Ice Cream 2 | A crew of eight amateur film makers approach a noted film producer (Ra |
24757 | 0.006 | (0.022,-0.145) | unknown | Battles Without Honor and | In Kure, Hiroshima 1946, when Shinichi Yamagata gets into a scuffle wi |
28203 | 0.007 | (0.022,-0.146) | horror | Dark Water | Yoshimi Matsubara, in the midst of a divorce mediation, rents a run-do |
The Lebowski movies seem even less related.
Index | Distance | (X,Y) | Genre | Title | Plot |
---|---|---|---|---|---|
22511 | 0.000 | (-0.119,-0.198) | comedy | Monty Python and the Holy | In 932 AD, King Arthur and his squire, Patsy, travel throughout Britai |
6112 | 0.001 | (-0.118,-0.197) | thriller | The Strangler | Leo Kroll (Buono) is a mother-fixated lab technician who collects doll |
29395 | 0.001 | (-0.120,-0.200) | western | Outlaw's Son | Twelve-year-old Jeff Blaine lives in the small western town of Plainsv |
4072 | 0.002 | (-0.121,-0.199) | comedy-dra | Lady Bird | Christine "Lady Bird" McPherson is a senior student at a Catholic high |
8614 | 0.002 | (-0.119,-0.197) | unknown | The Snowman | At a remote cabin amidst heavy snowfall, a man named Jonas (Peter Dall |
10040 | 0.002 | (-0.121,-0.197) | western | Death of a Gunfighter | In the town of Cottonwood Springs, Texas at the turn of the century, M |
28972 | 0.002 | (-0.119,-0.196) | comedy | Bridesmaids | Annie Walker (Kristen Wiig) is a single woman in her late 30s. Followi |
686 | 0.002 | (-0.121,-0.199) | drama | Menace II Society | Caine Lawson and his best friend Kevin "O-Dog" Anderson enter a local |
24142 | 0.002 | (-0.121,-0.197) | comedy | It's a Boy | On the eve of his society wedding, Dudley Leake and his best man James |
24744 | 0.002 | (-0.120,-0.201) | film noir | Ministry of Fear | In wartime England during the Blitz, Stephen Neale (Ray Milland) is re |
Same for the Monty Python movies.
Overall this does not seem like a successful strategy as we have lost too much useful information.
T-Distributed Stochastic Neighbor Embedding (T-SNE)
So, going back to flattening a basketball, or perhaps a globe, you can try to choose a rotation that highlights the features (continents) you are most interested in ... or you can abandon the idea of a linear transformation and try to peel the surface off and lay it out on a flat surface. We all know that causes some distortions (ie. Greenland looks way bigger than it actually is on many maps) but is still a useful technique.
And what if you were free to cut and stretch that surface so that you could keep local information by sacrificing some global information? That is something like what T-SNE and UMAP try to do. And luckily they are both easy to use from Python.
from sklearn.manifold import TSNE
tsne = TSNE(n_components=2, n_jobs=-1)
tX = tsne.fit_transform(embeddings)
T-SNE attempts to model the items in such a way that locally close items stay close in 2 or 3 dimensions, at the expense of global (far away) distances. And we can see in the plot that clusters (groups of highly similar items) start to appear.
Index | Distance | (X,Y) | Genre | Title | Plot |
---|---|---|---|---|---|
2028 | 0.000 | (1.423,30.830) | drama | There Will Be Blood | In 1898, Daniel Plainview, a prospector in New Mexico, mines a potenti |
19221 | 0.107 | (1.438,30.724) | drama | The Wonder Kid | Sebastian Giro is a ten-year-old French boy and child musical prodigy |
8722 | 0.154 | (1.300,30.738) | unknown | Grimsby | "Nobby" Butcher has been separated from his little brother Sebastian f |
25748 | 0.271 | (1.164,30.751) | drama | The Power and the Prize | Although he is scheduled to wed his boss George Salt's niece that week |
13408 | 0.319 | (1.742,30.844) | fantasy | The Devil and Daniel Webs | In 1840 New Hampshire, Jabez Stone (James Craig), a poor kindhearted f |
2721 | 0.366 | (1.057,30.849) | drama | The Kidnappers | In the early 1900s, two young orphaned brothers, eight year old Harry |
34283 | 0.444 | (0.987,30.746) | drama | Kidnapped | Young David Balfour arrives at a bleak Scottish house, the House of Sh |
27595 | 0.669 | (0.756,30.773) | drama | Kidnapped | Scotland, 1751: At a stately manor near Edinburgh, the young David Bal |
17046 | 0.913 | (1.575,31.730) | adventure | Manfish | Inspector Warren of Scotland Yard flies into Jamaica and is taken to t |
6402 | 0.952 | (0.477,30.725) | drama | Tol'able David | David Kinemon, youngest son of West Virginia tenant farmers, longs to |
The similar movies for There Will Be Blood do seem to be related to family and land again, and we see 'The Kidnappers' and 'Kidnapped' appear on the list again.
Index | Distance | (X,Y) | Genre | Title | Plot |
---|---|---|---|---|---|
1375 | 0.000 | (32.053,22.814) | comedy | The Big Lebowski | In 1991 Los Angeles, slacker Jeff "the Dude" Lebowski is assaulted in |
12795 | 0.149 | (32.200,22.842) | comedy | That's My Boy | In 1984, middle school student Donny Berger is in detention and begins |
23018 | 0.555 | (32.590,22.955) | horror | April Fool's Day | On the weekend leading up to April Fools' Day, a group of college frie |
25133 | 0.618 | (32.660,22.929) | horror | Terror Train | At a college pre-med student fraternity New Year's Eve party, a reluct |
455 | 0.648 | (31.544,23.215) | comedy | Adventureland | In 1987, James Brennan plans to have a summer vacation in Europe after |
5560 | 0.731 | (32.353,23.481) | comedy | The House | During their visit to Bucknell University, husband and wife Scott (Fer |
24944 | 0.789 | (31.915,23.591) | comedy | This Is the End | Jay Baruchel arrives in Los Angeles to visit old friend and fellow Can |
15475 | 0.795 | (31.958,23.603) | comedy | Superbad | Seth (Jonah Hill) and Evan (Michael Cera) are two high school seniors |
12949 | 0.797 | (32.346,22.072) | comedy | Class Act | Genius high school student Duncan Pinderhughes is getting ready for g |
1297 | 0.821 | (31.817,23.601) | adventure, | 30 Minutes or Less | Marijuana-smoking, Grand Rapids slacker pizza delivery driver Nick (Je |
For Lebowski we see 'That's My Boy' appear again along with a movie about a pot smoking slacker.
Index | Distance | (X,Y) | Genre | Title | Plot |
---|---|---|---|---|---|
22511 | 0.000 | (-10.799,1.842) | comedy | Monty Python and the Holy | In 932 AD, King Arthur and his squire, Patsy, travel throughout Britai |
26822 | 0.015 | (-10.784,1.847) | serial | Adventures of Sir Galahad | The Arthurian film cycle started with the Adventures of Sir Galahad se |
5905 | 0.031 | (-10.771,1.830) | romance | Lancelot and Guinevere | Lancelot is King Arthur's most valued Knight of the Round Table and a |
31622 | 0.068 | (-10.793,1.910) | musical co | A Connecticut Yankee in K | Hank Martin (Bing Crosby), an American mechanic, is knocked out and wa |
27786 | 0.095 | (-10.757,1.927) | animated | Quest for Camelot | Sir Lionel is one of the knights of the Round Table, and his daughter |
23032 | 0.103 | (-10.712,1.787) | fantasy | First Knight | The film's opening text establishes that King Arthur (Sean Connery) of |
12475 | 0.106 | (-10.793,1.949) | animated | Knighty Knight Bugs | King Arthur is sitting with his Knights of the Round Table, complainin |
11730 | 0.121 | (-10.717,1.931) | musical | Camelot | King Arthur is preparing for a great battle against his friend, Sir La |
24511 | 0.136 | (-10.912,1.918) | adventure | Knights of the Round Tabl | With the land in anarchy, warring overlords, Arthur Pendragon (Mel Fer |
11201 | 0.136 | (-10.864,1.962) | unknown | King Arthur: Legend of th | Mordred, an iron-fisted warlock, and his armies lay siege to Camelot, |
The Monty Python movies make a lot more sense and we see movies about King Arthur, Camelot and Knights.
Index | Distance | (X,Y) | Genre | Title | Plot |
---|---|---|---|---|---|
29543 | 0.021 | (45.612,37.416) | comedy sho | Three Little Sew and Sews | The Stooges are sailors employed in the tailor shop of a naval base. A |
9368 | 0.042 | (45.559,37.389) | comedy sho | Saved by the Belle | The Stooges are traveling salesmen stranded in Valeska, a fictional So |
13167 | 0.052 | (45.551,37.382) | comedy sho | Oily to Bed, Oily to Rise | The Stooges are three hapless tramps. After nearly destroying a farmer |
33192 | 0.059 | (45.626,37.453) | comedy sho | Booby Dupes | The Stooges are fish peddlers (similar to their roles in Cookoo Cavali |
11028 | 0.067 | (45.618,37.336) | short subj | Rhythm and Weep | The Stooges play the roles of unsuccessful actors who have decided to |
29448 | 0.082 | (45.527,37.437) | comedy sho | Rockin' Thru the Rockies | The Stooges are guides (circa late 1800s), who are helping a trio chri |
7845 | 0.090 | (45.530,37.344) | comedy sho | No Dough Boys | The Stooges are dressed as Japanese soldiers for a photo shoot; their |
34365 | 0.094 | (45.562,37.314) | comedy | Self-Made Maids | The Stooges are artists who fall in love with three models, Larraine, |
10009 | 0.097 | (45.655,37.480) | comedy | The Three Stooges in Orbi | The Stooges are TV actors who are trying to sell ideas for their anima |
29375 | 0.104 | (45.685,37.340) | comedy sho | Calling All Curs | The Stooges are skilled veterinarians at a pet hospital who are the pr |
And just because I'm curious, I took a look at a small cluster near (45.6, 37.4). These all turned out to be short movies starring The Stooges. This is the strongest indication so far that T-SNE is doing something interesting.
Uniform Manifold Approximation and Projection for Dimension Reduction (UMAP)
UMAP is a newer technique that, like T-SNE, tries to preserve local structure but also aims to keep most of the global structure in the data. Both depend a lot on the data and the particular parameters you choose, so if you are interested in using them in your application you'll have to explore to find settings that work well for you. These algorithms may not be as straightforward as PCA but they seem to yield more interesting results.
We can run the UMAP algorithm by using the umap-learn package which uses the scikit-learn API of fit and transform. By plotting the transformed points you can see there is a lot more structure and clusters with similar movies.
import umap
um = umap.UMAP()
mapper = um.fit(embeddings)  # fit returns the fitted UMAP object itself
uX = mapper.transform(embeddings)  # project the embeddings down to 2 dimensions
Index | Distance | (X,Y) | Genre | Title | Plot |
---|---|---|---|---|---|
2028 | 0.000 | (4.217,7.186) | drama | There Will Be Blood | In 1898, Daniel Plainview, a prospector in New Mexico, mines a potenti |
29294 | 0.003 | (4.220,7.188) | comedy | Man of the Year | Tom Dobbs is host of a satirical news program, where he taps into peop |
1638 | 0.031 | (4.196,7.164) | western | Rocky Mountain Mystery | Mining engineer Larry Sutton (Randolph Scott) arrives at the Ballard r |
6539 | 0.034 | (4.199,7.214) | western | The Baron of Arizona | The notorious attempt by swindler James Reavis to claim the entire ter |
28287 | 0.037 | (4.239,7.215) | unknown | High Rolling | Tex (Bottoms) is an American working at a carnival in Queensland. At t |
30306 | 0.037 | (4.192,7.158) | western | Forty Guns | In the 1880s, Griff Bonnell, and his brothers, Wes and Chico, arrive i |
23225 | 0.038 | (4.228,7.150) | western | Something Big | In the frontier of New Mexico Territory, Joe Baker is an aging, restle |
22058 | 0.044 | (4.209,7.228) | western | Shoot Out | Clay Lomax is released from prison after serving nearly eight years fo |
14374 | 0.044 | (4.229,7.143) | drama | Hallelujah! | Sharecroppers Zeke and Spunk Johnson sell their family's portion of th |
8983 | 0.048 | (4.223,7.233) | drama | Black Legion | When passed over for promotion at work in favor of a foreign-born frie |
For There Will Be Blood we're back among the western, family and land themes.
Index | Distance | (X,Y) | Genre | Title | Plot |
---|---|---|---|---|---|
1375 | 0.000 | (4.300,4.571) | comedy | The Big Lebowski | In 1991 Los Angeles, slacker Jeff "the Dude" Lebowski is assaulted in |
5758 | 0.007 | (4.307,4.567) | comedy | Lost Honeymoon | Soon after the end of World War II a young English woman, Amy Atkins ( |
5544 | 0.019 | (4.319,4.572) | romance | Breaking and Entering | Will Francis (Jude Law), a young Englishman, is a landscape architect |
10408 | 0.028 | (4.303,4.543) | drama | The Divorcee | Ted (Chester Morris), Jerry (Norma Shearer), Paul (Conrad Nagel), and |
23006 | 0.034 | (4.331,4.585) | comedy | Rumor Has It... | In 1997, Sarah Huttinger, an obituary and wedding announcement writer |
11587 | 0.040 | (4.337,4.554) | drama | Sarah Prefers to Run (Sar | After performing well on her school's track team, Sarah (Sophie Desmar |
16408 | 0.042 | (4.286,4.610) | romantic c | The Back-up Plan | Zoe (Jennifer Lopez) gives up on finding the man of her dreams, decide |
18164 | 0.043 | (4.343,4.568) | comedy | Lazybones | Sir Reginald Ford (Ian Hunter), known as "Lazybones", is an idle baron |
27509 | 0.046 | (4.256,4.582) | comedy-dra | Dan in Real Life | Dan Burns is a newspaper advice columnist, a widower, and single-paren |
23018 | 0.048 | (4.257,4.592) | horror | April Fool's Day | On the weekend leading up to April Fools' Day, a group of college frie |
Lebowski's nearest films seem to be about lost and lazy underachievers.
Index | Distance | (X,Y) | Genre | Title | Plot |
---|---|---|---|---|---|
22511 | 0.000 | (4.587,8.317) | comedy | Monty Python and the Holy | In 932 AD, King Arthur and his squire, Patsy, travel throughout Britai |
19324 | 0.004 | (4.584,8.314) | adventure | The Son of Monte Cristo | In 1865 the proletarian General Gurko Lanen (George Sanders) becomes t |
9939 | 0.013 | (4.577,8.308) | animated | Pound Puppies and the Leg | Whopper is taking his niece and nephew to the museum. Along the way, h |
16210 | 0.013 | (4.578,8.308) | adventure | Siege of the Saxons | King Arthur learns one of his knights is plotting to take over and mar |
11730 | 0.015 | (4.571,8.316) | musical | Camelot | King Arthur is preparing for a great battle against his friend, Sir La |
15982 | 0.021 | (4.598,8.300) | adventure | King Arthur | Arthur (Clive Owen) is portrayed as a Roman cavalry officer, also know |
16691 | 0.031 | (4.565,8.340) | romance/dr | Mayerling | In the 1880s, Crown Prince Rudolf of Austria (Sharif) clashes with his |
23032 | 0.032 | (4.594,8.349) | fantasy | First Knight | The film's opening text establishes that King Arthur (Sean Connery) of |
20179 | 0.038 | (4.559,8.343) | fantasy | Jack the Giant Killer | In the Duchy of Cornwall of fairy tale days, an evil sorcerer named Pe |
26822 | 0.038 | (4.592,8.355) | serial | Adventures of Sir Galahad | The Arthurian film cycle started with the Adventures of Sir Galahad se |
And Monty Python is still among the knight and King Arthur themes. Perhaps this is an 'outlier' movie that is easier to classify.
Index | Distance | (X,Y) | Genre | Title | Plot |
---|---|---|---|---|---|
26451 | 0.115 | (5.651,2.703) | animated s | Fit to Be Tied | Spike is happily prancing along the backyard. He steps on a splinter a |
8706 | 0.130 | (5.654,2.718) | animated | Cat Fishin' | Spike is shown guarding a lake fence while asleep. Tom shows up with h |
24737 | 0.134 | (5.649,2.725) | animation | Barbecue Brawl | Spike and Tyke walk into the backyard to have a barbecue. The first at |
11349 | 0.140 | (5.636,2.736) | animated s | The Dog House | Spike is busy building the doghouse of his dreams when Jerry suddenly |
8374 | 0.141 | (5.640,2.735) | animated s | Hic-cup Pup | Spike is putting his son, Tyke, to bed. When a bird flies by to chirp, |
15666 | 0.143 | (5.639,2.737) | romantic c | A Girl in Every Port | Spike (McLaglen) travels the world as the mate of a schooner. He has a |
12045 | 0.143 | (5.640,2.737) | animated | Love That Pup | Spike is sleeping beside his son Tyke when Tyke suddenly wakes up afte |
13246 | 0.144 | (5.637,2.739) | animation | Slicked-up Pup | Spike has bathed Tyke to make sure he is nice and clean, but is horrif |
24762 | 0.146 | (5.640,2.741) | animated | Quiet Please! | Tom's nemesis, Spike, is trying to take a nap, but is awoken by Tom Ca |
8924 | 0.151 | (5.632,2.748) | animated | Tops with Pops | Spike is sleeping beside his son Tyke when he suddenly wakes up from a |
And just because I'm curious, I looked into another small cluster and found a group of Spike and Tyke animations. Another pleasant surprise.
Conclusions
Wow, this has gotten a lot longer than I ever expected so let's end here for now. I hope you are forgiving of the super lax 'metric' I used and understanding of the reasons why, and that this sparked some ideas for your own analysis.
There are still lots of different and interesting experiments and analyses we can do on this dataset, which I leave for future articles.
If there is a specific use case you'd like to explore, please get in touch or reach out on Twitter.
Thanks for your time.
Julio