Finding Similar Text with Machine Learning and Natural Language Processing

I've been working with a client on analyzing some text documents and wanted to share a bit of what has been working for us. I can't share the data or the exact project details, but it entails finding text documents in a large collection that are similar to a specific query example. Imagine searching a database of company statements, product descriptions, articles, contracts, emails, support/trouble tickets, etc. not by keyword but by 'meaning' and 'similarity'.

For this article I thought about using the Enron email or a financial filings dataset, but I wanted something with more general interest and relatability, so I instead found something more fun and accessible on Kaggle. The Wikipedia Movie Plots dataset has ~34k records with movie titles, release year, genre, etc. and, most importantly for us, 'Plot' descriptions. It seems like the perfect dataset to experiment with when exploring similarity/semantic search engines, content-based recommenders, movie plot classifiers and many other interesting projects. In this example we're going to use the data in the plot field to find other 'similar' plots.

Keep in mind that we're going to look for similar plots (write-ups) and not similar movie styles, similar genres, etc. That is, a (simple) plot, "boy meets girl, boy loses girl", could be rendered in many different styles and genres and with many different character archetypes. That said, the plots were written by people after having seen the movie, and people tend to include specific details, such as actors, in the 'plot', so they are not completely independent of other characteristics of the movie ... which makes for some interesting future experiments, such as whether we can predict the genre or year from the plot.

Business Goal and Metric

Analyzing the plot field highlights the first issue in any machine learning project: identifying the business goal and finding an appropriate metric. This is the heart of every project and an important and often tricky first step. It is particularly challenging in similarity or clustering based (i.e. unsupervised) experiments, as there is no canonical correct answer and the utility and validity of the results are very context dependent.

The point of this article is to give a high level overview of some techniques and approaches, not to propose a specific solution to a specific problem ... so I'm just going to punt on it here and say that my business goal is to 'explore similarity in various ways' and my metric is 'do I get answers that feel right'. If this were a recommender system we might want to measure whether the similar movies were on users' 'favorites' lists, or how often people added recommended movies to their queue, and so on. If you have a more specific business need I'd be happy to discuss it.
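
To make the recommender idea a little more concrete, one common way to score it is precision at k: the fraction of the top k recommendations that show up in a user's favorites. The sketch below is only an illustration; the helper name precision_at_k and the sample titles are hypothetical and not part of the project described here.

```python
def precision_at_k(recommended, favorites, k=10):
    """Fraction of the first k recommendations that are in the user's favorites."""
    top_k = recommended[:k]
    if not top_k:
        return 0.0
    hits = sum(1 for title in top_k if title in favorites)
    return hits / len(top_k)


# Hypothetical data for illustration only.
recommended = ["Camelot", "First Knight", "Jumanji", "Superbad"]
favorites = {"Camelot", "Superbad", "The Big Lebowski"}

score = precision_at_k(recommended, favorites, k=4)
print(score)  # two of the four recommendations are favorites
```

A metric like this only makes sense once there is real user feedback to compare against, which is exactly what this exploratory article does not have.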

Feature Generation

Once you have a goal and metric in mind the next step is to develop ways of generating features that you feel may be useful to achieve those goals. Simply put, the algorithms need numbers, we have text. What numbers are we going to generate from our text? For example, we could use the length of each plot write up, the number of vowels, the number of large words, etc. or more advanced techniques like TF-IDF or neural net transformers. Some of these features will work better than others in different applications so feature selection is part art and part science.
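
As a rough illustration of the simplest kind of features mentioned above, here is a sketch that turns one piece of text into three numbers: its length, its vowel count and its count of 'large' words. The function name simple_features and the cutoff of 8 letters for a large word are my own assumptions for the example.

```python
import re

VOWELS = set("aeiou")


def simple_features(text, long_word_len=8):
    """Return [character count, vowel count, long-word count] for one text."""
    words = re.findall(r"[A-Za-z']+", text)
    n_vowels = sum(1 for ch in text.lower() if ch in VOWELS)
    n_long = sum(1 for w in words if len(w) >= long_word_len)
    return [len(text), n_vowels, n_long]


features = simple_features("Boy meets girl, boy loses girl.")
print(features)
```

Features like these are cheap to compute but say very little about meaning, which is why the rest of the article uses embeddings instead.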

An aside: People want to believe that this is objective and say 'let the data speak for itself' but I don't fully agree. Feature selection is combinatorially explosive and we'd be foolish not to use subject matter expertise and previous machine learning experience along with experimentation and optimization strategies to guide that process.

For this example I could have used a classic technique, TF-IDF, and perhaps we'll revisit that in a future article, or a hot new BERT based model. I chose instead to use the standby Universal Sentence Encoder (USE) because it is easy to set up and use and works really well. TF-IDF is probably easier if you don't have any TensorFlow experience and a fine-tuned BERT based model would probably work better if you wanted to invest the time but USE is a great place to start and in a few lines of code you can get a meaningful numeric 512 dimensional representation (embedding) of short text passages.

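import tensorflow_hub as hub

# my_text is a list of strings, one plot description per entry
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder-large/5")
embeddings = embed(my_text)  # one 512-dimensional vector per plot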

Measuring Similarity with Cosine

Ok, great, so I have a 34886 x 512 matrix now what? Well the idea with turning the text into an embedding vector is that the embedding now represents a point in a semantic space and other points 'near' it represent other similar embedded items (plot descriptions).

The next decision is to formalize what we mean by 'near'. In two or three dimensional spaces we have intimate ingrained knowledge and a firm concept of 'nearness' and we call that the Euclidean distance. Unfortunately Euclidean distance tends to be less informative in high dimensional spaces because of the curse of dimensionality: the distances between pairs of points become more and more alike, so 'nearest' loses much of its meaning. The bottom line is that we'll be better off using angular (cosine) distance in high dimensions.
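
To see what 'angular' means in practice, here is a small worked example in plain Python. Cosine similarity is the dot product of two vectors divided by the product of their lengths, so it only looks at direction. The toy vectors and function names are made up for illustration.

```python
import math


def cosine_similarity_1d(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]  # same direction as a, twice the length
c = [3.0, 2.0, 1.0]  # different direction from a

print(cosine_similarity_1d(a, b), euclidean_distance(a, b))
print(cosine_similarity_1d(a, c), euclidean_distance(a, c))
```

Here b points in exactly the same direction as a, so its cosine similarity is (up to rounding) 1, yet it is farther from a in Euclidean terms than c is. The two measures can rank neighbors differently whenever vector lengths vary.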

Luckily that is easy to do with Python and scikit-learn. Given a point/embedding represented by v and the matrix of embeddings, we can calculate all the angular similarities (similarities rather than distances, so larger means closer) and then find the top n with a few lines of code:

from sklearn.metrics.pairwise import cosine_similarity

def most_similar(v, embeddings, n=10):
    # v must be 2D with shape (1, 512), i.e. a single row of the embedding matrix
    d = cosine_similarity(v, embeddings).ravel()  # find similarity and flatten results
    a = d.argsort()[::-1]  # indexes of the sorted similarities, largest first
    return a[:n]  # top n (the query itself comes first if it is in the matrix)

Results

Using these embeddings, let's take a look at a couple of examples from movies that I have watched and enjoyed: 'There Will Be Blood', 'The Big Lebowski', and 'Monty Python and the Holy Grail'. Take a moment to think about how you'd describe their plots.

The tables below give the 10 most similar plots to each of the movies based on the cosine similarity of the USE embeddings. Note that the plots have been truncated to fit in the table. The index isn't too useful if you don't have the exact shuffling of the dataset that I used, but it helps me keep track of the movies. Similarity is the cosine of the angle between the two embeddings/vectors. An angle of 0 has a cosine of 1, so that is most similar. IOU is intersection over union, which I added as a measure of the commonality of the words in the plot descriptions between two movies. Genre is specified in the dataset and is a very messy feature. Though the genre could be useful for some experiments, there are ~2265 unique values and it would need to be cleaned up. Title is self-explanatory and the Plot column has the first few words of the plot description so we can get a feel for the text.
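
For readers unfamiliar with IOU, here is one way it could be computed on the word sets of two texts (this is also known as the Jaccard index). The tokenization below, lowercasing and splitting on anything that is not a letter, digit or apostrophe, is an assumption on my part and may differ from what produced the IOU column in the tables.

```python
import re


def word_iou(text_a, text_b):
    """Intersection over union (Jaccard index) of the two texts' word sets."""
    words_a = set(re.findall(r"[a-z0-9']+", text_a.lower()))
    words_b = set(re.findall(r"[a-z0-9']+", text_b.lower()))
    union = words_a | words_b
    if not union:
        return 0.0
    return len(words_a & words_b) / len(union)


iou = word_iou("King Arthur and his knights", "King Arthur and the Round Table")
print(iou)  # 3 shared words out of 8 distinct words
```

A low IOU next to a high cosine similarity suggests the embeddings are picking up on something beyond shared vocabulary.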

There Will Be Blood - Most similar based on cosine similarity of Universal Sentence Encoder embeddings.
Index | Similarity | IOU | Genre | Title | Plot
2028 | 1.000 | 1.000 | drama | There Will Be Blood | In 1898, Daniel Plainview, a prospector in New Mexico, mines a potenti
34283 | 0.655 | 0.085 | drama | Kidnapped | Young David Balfour arrives at a bleak Scottish house, the House of Sh
24308 | 0.631 | 0.089 | drama | Silver Dollar | Kansas farmer Yates Martin (Edward G. Robinson) uproots his uncomplain
6424 | 0.620 | 0.094 | adventure | Doc Savage: The Man of Br | In 1936, Doc Savage (Ron Ely) returns to New York City following a vis
28162 | 0.611 | 0.088 | western | Campbell's Kingdom | Recently diagnosed with a terminal disease, Bruce Campbell (Dirk Bogar
32776 | 0.603 | 0.108 | comedy wes | Waterhole No. 3 | In Arizona, a shipment of gold bullion is stolen in an inside job by a
8907 | 0.602 | 0.101 | western co | The Dude Goes West | A gunsmith and a marksman, Daniel Bone closes up his Brooklyn, New Yor
30885 | 0.600 | 0.088 | family, fa | Jumanji | In 1869, near Brantford, New Hampshire, two brothers bury a chest and
2721 | 0.597 | 0.102 | drama | The Kidnappers | In the early 1900s, two young orphaned brothers, eight year old Harry
33689 | 0.595 | 0.076 | drama | The Hanging Tree | Joseph Frail (Gary Cooper)—doctor, gambler, gunslinger—rides into the

I don't know the other movies but note that most have to do with family, land and greed and there are no romance or space related movies. Do you know the movies? Do you feel they are similar in some ways?

The Big Lebowski - Most similar based on cosine similarity of Universal Sentence Encoder embeddings.
Index | Similarity | IOU | Genre | Title | Plot
1375 | 1.000 | 1.000 | comedy | The Big Lebowski | In 1991 Los Angeles, slacker Jeff "the Dude" Lebowski is assaulted in
12795 | 0.592 | 0.084 | comedy | That's My Boy | In 1984, middle school student Donny Berger is in detention and begins
31057 | 0.570 | 0.079 | comedy | Jeff, Who Lives at Home | Jeff (Segel) is a 30-year-old unemployed stoner living in his mother S
4955 | 0.562 | 0.088 | unknown | The Nine Lives of Fritz t | It is the 1970s; Fritz the Cat is now married, on welfare, and has a c
31531 | 0.553 | 0.078 | comedy | Little Monsters | Brian's family has moved to a new town, and he feels isolated in his n
16555 | 0.553 | 0.087 | drama | T.R. Baskin | When Jack Mitchell (Peter Boyle), a married middle-aged salesman with
25180 | 0.548 | 0.106 | musical dr | Inside Llewyn Davis | In February 1961, Llewyn Davis is a struggling folk singer in New York
18782 | 0.548 | 0.106 | unknown | Inside Llewyn Davis | In February 1961, Llewyn Davis is a struggling folk singer in New York
31802 | 0.548 | 0.068 | drama, rom | The Cooler | Unlucky Bernie Lootz (William H. Macy) has little positive going for h
803 | 0.548 | 0.051 | unknown | Harry and Tonto | Harry Coombes (Art Carney) is an elderly widower and retired teacher w

I've seen 'Inside Llewyn Davis' but not the others. The plots seem to be related to 'lazy' or 'unsuccessful' people. Also note that 'Inside Llewyn Davis' shows up twice even though this is supposed to be a de-duped list. I found several duplicates (not just duplicate titles) while exploring the dataset with the embeddings.
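
Duplicated plots get (nearly) identical embeddings, so one way to hunt for them is to look for pairs whose cosine similarity is very close to 1. The sketch below is a hypothetical helper, not the code I used; the function name and the 0.99 threshold are assumptions, it assumes no all-zero rows, and it builds the full pairwise matrix, which is fine for a toy array but would need to be done in chunks for ~34k plots.

```python
import numpy as np


def near_duplicate_pairs(embeddings, threshold=0.99):
    """Return (i, j, similarity) for every pair i < j at or above the threshold."""
    X = np.asarray(embeddings, dtype=float)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)  # scale rows to unit length
    sims = X @ X.T                                    # all pairwise cosine similarities
    i_idx, j_idx = np.triu_indices(len(X), k=1)       # each pair once, no self-pairs
    mask = sims[i_idx, j_idx] >= threshold
    return [(int(i), int(j), float(sims[i, j]))
            for i, j in zip(i_idx[mask], j_idx[mask])]


# Toy stand-in for the embedding matrix: rows 0 and 2 are identical.
toy = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.7, 0.7]])
pairs = near_duplicate_pairs(toy)
print(pairs)
```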

Monty Python and The Holy Grail - Most similar based on cosine similarity of Universal Sentence Encoder embeddings.
Index | Similarity | IOU | Genre | Title | Plot
22511 | 1.000 | 1.000 | comedy | Monty Python and the Holy | In 932 AD, King Arthur and his squire, Patsy, travel throughout Britai
5905 | 0.754 | 0.146 | romance | Lancelot and Guinevere | Lancelot is King Arthur's most valued Knight of the Round Table and a
16210 | 0.749 | 0.106 | adventure | Siege of the Saxons | King Arthur learns one of his knights is plotting to take over and mar
26822 | 0.742 | 0.114 | serial | Adventures of Sir Galahad | The Arthurian film cycle started with the Adventures of Sir Galahad se
12475 | 0.721 | 0.120 | animated | Knighty Knight Bugs | King Arthur is sitting with his Knights of the Round Table, complainin
11730 | 0.719 | 0.093 | musical | Camelot | King Arthur is preparing for a great battle against his friend, Sir La
23032 | 0.711 | 0.098 | fantasy | First Knight | The film's opening text establishes that King Arthur (Sean Connery) of
11201 | 0.704 | 0.089 | unknown | King Arthur: Legend of th | Mordred, an iron-fisted warlock, and his armies lay siege to Camelot,
19146 | 0.704 | 0.089 | action, ad | King Arthur: Legend of th | Mordred, an iron-fisted warlock, and his armies lay siege to Camelot,
27971 | 0.690 | 0.100 | animated | The Sword in the Stone | After the King of England, Uther Pendragon, dies, leaving no heir to t

These all seem to relate to King Arthur and Camelot and none seem to be (absurdist) comedies and there are no westerns or space adventures.

Dimensionality Reduction

It can often be useful to visualize our collection to get a feeling for how the items 'cluster' or relate to each other. The embedding vectors we used have 512 dimensions so to visualize them we're going to need to reduce them to 2 or 3 dimensions. The challenge is doing that while keeping as much useful information as possible.

Principal Component Analysis (PCA)

A classic approach to dimensionality reduction is Principal Component Analysis (PCA), which we'll use to find two axes through the data that show the most information (variance). Imagine trying to draw a loaf of French bread. You can rotate it to find an angle that shows the most information, and you are likely to choose an angle that shows the full length and width, which sacrifices the information in the height. You are less likely to draw it directly from the end, where you highlight the width and height but not the length. Python has PCA functionality that is fast and easy and can be done in a few lines of code.

from sklearn.decomposition import PCA

pca = PCA(n_components=2)
pca.fit(embeddings)
pX = pca.transform(embeddings)

Figure: 2D PCA

In the chart of the transformed PCA components we can see 3 or 4 main groups but not a lot of structure. This is not too surprising given what we now know about high dimensional spaces. It turns out we're not trying to lay out a loaf of bread but rather a high dimensional basketball. If we inspect the PCA model we see that the first two components explain only about 10.8% of the variance.
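
For anyone wondering where a number like that comes from: a fitted scikit-learn PCA exposes it as pca.explained_variance_ratio_, and the same quantity can be computed by hand from the singular values of the centered data. The sketch below does the latter with NumPy on random toy data (an assumption standing in for the real embeddings), so the printed value will not match the figure quoted above.

```python
import numpy as np


def explained_variance_ratio(X):
    """Share of the total variance captured by each principal component."""
    Xc = X - X.mean(axis=0)                  # PCA works on centered data
    s = np.linalg.svd(Xc, compute_uv=False)  # singular values, largest first
    var = s ** 2                             # proportional to per-component variance
    return var / var.sum()


rng = np.random.default_rng(0)
# Toy data: 200 points in 50 dimensions with no preferred direction
X = rng.normal(size=(200, 50))

ratio = explained_variance_ratio(X)
print(ratio[:2].sum())  # share of variance kept by a 2D projection
```

When the variance is spread fairly evenly over many directions, as in this toy 'basketball', two components can only ever keep a small slice of it.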

Still, let's look at the most similar plots for our 3 examples. And since we've now reduced the dimensions we can use the Euclidean distance.

import numpy as np
from sklearn.metrics.pairwise import euclidean_distances

# x, y are the 2D coordinates of the query movie; pX is the PCA output from above
v = np.array([x, y])
v = v.reshape(1, -1)
d = euclidean_distances(v, pX).ravel()

There Will Be Blood - Most similar based on euclidean distance of the 2D PCA of the embeddings.
Index | Distance | (X,Y) | Genre | Title | Plot
2028 | 0.000 | (-0.137,-0.163) | drama | There Will Be Blood | In 1898, Daniel Plainview, a prospector in New Mexico, mines a potenti
7432 | 0.001 | (-0.137,-0.164) | drama | Secret Nation | The film opens with the death of the elderly and wealthy Leo Cryptus (
18122 | 0.001 | (-0.138,-0.163) | drama | Snow Cake | When the eccentric drifter Vivienne Freeman gets a ride from a relucta
32180 | 0.001 | (-0.137,-0.164) | war | Men in War | On 6 September 1950, an isolated and exhausted platoon of the 24th Inf
34366 | 0.001 | (-0.137,-0.165) | horror | The Wolf Man | Sometime in the early twentieth century, after learning of the death o
28397 | 0.001 | (-0.136,-0.164) | fantasy | Osmosis Jones | Frank Detorre (Bill Murray) is an unkempt, slovenly zookeeper at the S
33059 | 0.001 | (-0.136,-0.163) | drama | Better Times | As described in a film magazine,[2] the plot of the film is as follows
20301 | 0.001 | (-0.136,-0.165) | adventure | Flowing Gold | Oilfield worker John Alexander (John Garfield) is on the run from a mu
33034 | 0.002 | (-0.135,-0.163) | crime | The Mystery Man | A newspaper man, Larry Doyle and a young woman, Anne Olgivie, meet by
22271 | 0.002 | (-0.139,-0.163) | action | Striking Distance | Thomas Hardy, a Pittsburgh Police homicide detective, has broken the r

A few of the plots seem to deal with death and oil but they don't really seem as relevant now. We may have lost too much information in this transformation. To actually judge that we'd need a more concrete metric and an actual business goal.

The Big Lebowski - Most similar based on euclidean distance of the 2D PCA of the embeddings.
Index | Distance | (X,Y) | Genre | Title | Plot
1375 | 0.000 | (0.028,-0.142) | comedy | The Big Lebowski | In 1991 Los Angeles, slacker Jeff "the Dude" Lebowski is assaulted in
27885 | 0.002 | (0.028,-0.144) | revenge, t | 22 Female Kottayam | Tessa (Rima Kallingal) is a nursing student in Bangalore with plans of
1643 | 0.002 | (0.026,-0.143) | action / c | Let's Go! | Siu Sheung (Juno Mak) is a solitary and frustrated young man. He works
33881 | 0.003 | (0.028,-0.139) | tokusatsu | Cho Kamen Rider Den-O & D | Taking place after the events of Kamen Rider Decade episode 15, under
7507 | 0.003 | (0.031,-0.142) | horror | Dark Tales of Japan | Introduction: Would You Like to Hear a Scary Tale? (Intorodakushon: Ko
28542 | 0.004 | (0.031,-0.139) | drama | End of Summer !The End of | Manbei Kohayagawa (Ganjirō Nakamura) is the head of a small sake brewe
12552 | 0.005 | (0.033,-0.141) | drama | Confession of Pain | Police inspectors Lau Ching-hei and Yau Kin-bong arrest a rapist in 20
16316 | 0.005 | (0.033,-0.142) | thriller | Ice Cream 2 | A crew of eight amateur film makers approach a noted film producer (Ra
24757 | 0.006 | (0.022,-0.145) | unknown | Battles Without Honor and | In Kure, Hiroshima 1946, when Shinichi Yamagata gets into a scuffle wi
28203 | 0.007 | (0.022,-0.146) | horror | Dark Water | Yoshimi Matsubara, in the midst of a divorce mediation, rents a run-do

The Lebowski movies seem even less related.

Monty Python and The Holy Grail - Most similar based on euclidean distance of the 2D PCA of the embeddings.
Index | Distance | (X,Y) | Genre | Title | Plot
22511 | 0.000 | (-0.119,-0.198) | comedy | Monty Python and the Holy | In 932 AD, King Arthur and his squire, Patsy, travel throughout Britai
6112 | 0.001 | (-0.118,-0.197) | thriller | The Strangler | Leo Kroll (Buono) is a mother-fixated lab technician who collects doll
29395 | 0.001 | (-0.120,-0.200) | western | Outlaw's Son | Twelve-year-old Jeff Blaine lives in the small western town of Plainsv
4072 | 0.002 | (-0.121,-0.199) | comedy-dra | Lady Bird | Christine "Lady Bird" McPherson is a senior student at a Catholic high
8614 | 0.002 | (-0.119,-0.197) | unknown | The Snowman | At a remote cabin amidst heavy snowfall, a man named Jonas (Peter Dall
10040 | 0.002 | (-0.121,-0.197) | western | Death of a Gunfighter | In the town of Cottonwood Springs, Texas at the turn of the century, M
28972 | 0.002 | (-0.119,-0.196) | comedy | Bridesmaids | Annie Walker (Kristen Wiig) is a single woman in her late 30s. Followi
686 | 0.002 | (-0.121,-0.199) | drama | Menace II Society | Caine Lawson and his best friend Kevin "O-Dog" Anderson enter a local
24142 | 0.002 | (-0.121,-0.197) | comedy | It's a Boy | On the eve of his society wedding, Dudley Leake and his best man James
24744 | 0.002 | (-0.120,-0.201) | film noir | Ministry of Fear | In wartime England during the Blitz, Stephen Neale (Ray Milland) is re

Same for the Monty Python movies.

Overall this does not seem like a successful strategy as we have lost too much useful information.

T-Distributed Stochastic Neighbor Embedding (T-SNE)

So, going back to flattening a basketball, or perhaps a globe, you can try to choose a rotation that highlights the features (continents) you are most interested in ... or you can abandon the idea of a linear transformation and try to peel the surface off and lay it out on a flat surface. We all know that causes some distortions (i.e. Greenland looks way bigger than it actually is on many maps) but it is still a useful technique.

And what if you were free to cut and stretch that surface so that you could keep local information by sacrificing some global information? That is something like what T-SNE and UMAP try to do. And luckily they are both easy to use from Python.

from sklearn.manifold import TSNE

tsne = TSNE(n_components=2, n_jobs=-1)
tX = tsne.fit_transform(embeddings)

T-SNE attempts to model the items in such a way that items that are close together in the original space stay close together in 2 or 3 dimensions, at the expense of global (far away) distances. And we can see in the plot that clusters (groups of highly similar items) start to appear.

Figure: 2D T-SNE

There Will Be Blood - Most similar based on euclidean distance of the 2D T-SNE of the embeddings.
Index | Distance | (X,Y) | Genre | Title | Plot
2028 | 0.000 | (1.423,30.830) | drama | There Will Be Blood | In 1898, Daniel Plainview, a prospector in New Mexico, mines a potenti
19221 | 0.107 | (1.438,30.724) | drama | The Wonder Kid | Sebastian Giro is a ten-year-old French boy and child musical prodigy
8722 | 0.154 | (1.300,30.738) | unknown | Grimsby | "Nobby" Butcher has been separated from his little brother Sebastian f
25748 | 0.271 | (1.164,30.751) | drama | The Power and the Prize | Although he is scheduled to wed his boss George Salt's niece that week
13408 | 0.319 | (1.742,30.844) | fantasy | The Devil and Daniel Webs | In 1840 New Hampshire, Jabez Stone (James Craig), a poor kindhearted f
2721 | 0.366 | (1.057,30.849) | drama | The Kidnappers | In the early 1900s, two young orphaned brothers, eight year old Harry
34283 | 0.444 | (0.987,30.746) | drama | Kidnapped | Young David Balfour arrives at a bleak Scottish house, the House of Sh
27595 | 0.669 | (0.756,30.773) | drama | Kidnapped | Scotland, 1751: At a stately manor near Edinburgh, the young David Bal
17046 | 0.913 | (1.575,31.730) | adventure | Manfish | Inspector Warren of Scotland Yard flies into Jamaica and is taken to t
6402 | 0.952 | (0.477,30.725) | drama | Tol'able David | David Kinemon, youngest son of West Virginia tenant farmers, longs to

The similar movies for There Will Be Blood do seem to be related to family and land again and we see 'The Kidnappers' and 'Kidnapped' appear on the list again.

The Big Lebowski - Most similar based on euclidean distance of the 2D T-SNE of the embeddings.
Index | Distance | (X,Y) | Genre | Title | Plot
1375 | 0.000 | (32.053,22.814) | comedy | The Big Lebowski | In 1991 Los Angeles, slacker Jeff "the Dude" Lebowski is assaulted in
12795 | 0.149 | (32.200,22.842) | comedy | That's My Boy | In 1984, middle school student Donny Berger is in detention and begins
23018 | 0.555 | (32.590,22.955) | horror | April Fool's Day | On the weekend leading up to April Fools' Day, a group of college frie
25133 | 0.618 | (32.660,22.929) | horror | Terror Train | At a college pre-med student fraternity New Year's Eve party, a reluct
455 | 0.648 | (31.544,23.215) | comedy | Adventureland | In 1987, James Brennan plans to have a summer vacation in Europe after
5560 | 0.731 | (32.353,23.481) | comedy | The House | During their visit to Bucknell University, husband and wife Scott (Fer
24944 | 0.789 | (31.915,23.591) | comedy | This Is the End | Jay Baruchel arrives in Los Angeles to visit old friend and fellow Can
15475 | 0.795 | (31.958,23.603) | comedy | Superbad | Seth (Jonah Hill) and Evan (Michael Cera) are two high school seniors
12949 | 0.797 | (32.346,22.072) | comedy | Class Act | Genius high school student Duncan Pinderhughes is getting ready for g
1297 | 0.821 | (31.817,23.601) | adventure, | 30 Minutes or Less | Marijuana-smoking, Grand Rapids slacker pizza delivery driver Nick (Je

For Lebowski we see 'That's My Boy' appear again along with a movie about a pot smoking slacker.

Monty Python and The Holy Grail - Most similar based on euclidean distance of the 2D T-SNE of the embeddings.
Index | Distance | (X,Y) | Genre | Title | Plot
22511 | 0.000 | (-10.799,1.842) | comedy | Monty Python and the Holy | In 932 AD, King Arthur and his squire, Patsy, travel throughout Britai
26822 | 0.015 | (-10.784,1.847) | serial | Adventures of Sir Galahad | The Arthurian film cycle started with the Adventures of Sir Galahad se
5905 | 0.031 | (-10.771,1.830) | romance | Lancelot and Guinevere | Lancelot is King Arthur's most valued Knight of the Round Table and a
31622 | 0.068 | (-10.793,1.910) | musical co | A Connecticut Yankee in K | Hank Martin (Bing Crosby), an American mechanic, is knocked out and wa
27786 | 0.095 | (-10.757,1.927) | animated | Quest for Camelot | Sir Lionel is one of the knights of the Round Table, and his daughter
23032 | 0.103 | (-10.712,1.787) | fantasy | First Knight | The film's opening text establishes that King Arthur (Sean Connery) of
12475 | 0.106 | (-10.793,1.949) | animated | Knighty Knight Bugs | King Arthur is sitting with his Knights of the Round Table, complainin
11730 | 0.121 | (-10.717,1.931) | musical | Camelot | King Arthur is preparing for a great battle against his friend, Sir La
24511 | 0.136 | (-10.912,1.918) | adventure | Knights of the Round Tabl | With the land in anarchy, warring overlords, Arthur Pendragon (Mel Fer
11201 | 0.136 | (-10.864,1.962) | unknown | King Arthur: Legend of th | Mordred, an iron-fisted warlock, and his armies lay siege to Camelot,

The Monty Python movies make a lot more sense and we see movies about King Arthur, Camelot and Knights.

Small cluster near (45.6, 37.4)
Index | Distance | (X,Y) | Genre | Title | Plot
29543 | 0.021 | (45.612,37.416) | comedy sho | Three Little Sew and Sews | The Stooges are sailors employed in the tailor shop of a naval base. A
9368 | 0.042 | (45.559,37.389) | comedy sho | Saved by the Belle | The Stooges are traveling salesmen stranded in Valeska, a fictional So
13167 | 0.052 | (45.551,37.382) | comedy sho | Oily to Bed, Oily to Rise | The Stooges are three hapless tramps. After nearly destroying a farmer
33192 | 0.059 | (45.626,37.453) | comedy sho | Booby Dupes | The Stooges are fish peddlers (similar to their roles in Cookoo Cavali
11028 | 0.067 | (45.618,37.336) | short subj | Rhythm and Weep | The Stooges play the roles of unsuccessful actors who have decided to
29448 | 0.082 | (45.527,37.437) | comedy sho | Rockin' Thru the Rockies | The Stooges are guides (circa late 1800s), who are helping a trio chri
7845 | 0.090 | (45.530,37.344) | comedy sho | No Dough Boys | The Stooges are dressed as Japanese soldiers for a photo shoot; their
34365 | 0.094 | (45.562,37.314) | comedy | Self-Made Maids | The Stooges are artists who fall in love with three models, Larraine,
10009 | 0.097 | (45.655,37.480) | comedy | The Three Stooges in Orbi | The Stooges are TV actors who are trying to sell ideas for their anima
29375 | 0.104 | (45.685,37.340) | comedy sho | Calling All Curs | The Stooges are skilled veterinarians at a pet hospital who are the pr

And just cause I'm curious, I took a look at a small cluster near (45.6, 37.4). These turned out to all be short movies starring The Stooges. This is the strongest indication so far that T-SNE is doing something interesting.
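
Poking at a cluster like this just means asking which points lie closest to a chosen (x, y) location in the 2D layout. A minimal sketch with NumPy follows; the function name nearest_to_point and the toy coordinates are assumptions for illustration, and in practice the 2D array would be the output of the T-SNE step.

```python
import numpy as np


def nearest_to_point(points_2d, center, n=10):
    """Indexes and distances of the n rows of points_2d closest to center."""
    pts = np.asarray(points_2d, dtype=float)
    dist = np.linalg.norm(pts - np.asarray(center, dtype=float), axis=1)
    order = np.argsort(dist)[:n]  # smallest distances first
    return order, dist[order]


# Toy 2D coordinates standing in for the reduced embeddings
toy_points = np.array([[45.6, 37.4], [45.5, 37.3], [1.4, 30.8], [32.0, 22.8]])

idx, dist = nearest_to_point(toy_points, (45.6, 37.4), n=2)
print(idx, dist)
```

The returned indexes can then be used to look up the titles and plots of the movies in that neighborhood.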

Uniform Manifold Approximation and Projection for Dimension Reduction (UMAP)

UMAP is a newer technique and, like T-SNE, tries to preserve local structure, but it also aims to keep more of the global structure in the data. They both depend a lot on the data and the particular parameters you choose, so if you are interested in using them in your application you'll have to explore to find settings that work well for you. These algorithms may not be as straightforward as PCA but they seem to yield more interesting results.

We can run the UMAP algorithm by using the umap-learn package which uses the scikit-learn API of fit and transform. By plotting the transformed points you can see there is a lot more structure and clusters with similar movies.

import umap

um = umap.UMAP()
um.fit(embeddings)
uX = um.transform(embeddings)

Figure: 2D UMAP

There Will Be Blood - Most similar based on euclidean distance of the 2D UMAP of the embeddings.
Index | Distance | (X,Y) | Genre | Title | Plot
2028 | 0.000 | (4.217,7.186) | drama | There Will Be Blood | In 1898, Daniel Plainview, a prospector in New Mexico, mines a potenti
29294 | 0.003 | (4.220,7.188) | comedy | Man of the Year | Tom Dobbs is host of a satirical news program, where he taps into peop
1638 | 0.031 | (4.196,7.164) | western | Rocky Mountain Mystery | Mining engineer Larry Sutton (Randolph Scott) arrives at the Ballard r
6539 | 0.034 | (4.199,7.214) | western | The Baron of Arizona | The notorious attempt by swindler James Reavis to claim the entire ter
28287 | 0.037 | (4.239,7.215) | unknown | High Rolling | Tex (Bottoms) is an American working at a carnival in Queensland. At t
30306 | 0.037 | (4.192,7.158) | western | Forty Guns | In the 1880s, Griff Bonnell, and his brothers, Wes and Chico, arrive i
23225 | 0.038 | (4.228,7.150) | western | Something Big | In the frontier of New Mexico Territory, Joe Baker is an aging, restle
22058 | 0.044 | (4.209,7.228) | western | Shoot Out | Clay Lomax is released from prison after serving nearly eight years fo
14374 | 0.044 | (4.229,7.143) | drama | Hallelujah! | Sharecroppers Zeke and Spunk Johnson sell their family's portion of th
8983 | 0.048 | (4.223,7.233) | drama | Black Legion | When passed over for promotion at work in favor of a foreign-born frie

For There Will Be Blood we're back among the western, family and land themes.

The Big Lebowski - Most similar based on euclidean distance of the 2D UMAP of the embeddings.
Index | Distance | (X,Y) | Genre | Title | Plot
1375 | 0.000 | (4.300,4.571) | comedy | The Big Lebowski | In 1991 Los Angeles, slacker Jeff "the Dude" Lebowski is assaulted in
5758 | 0.007 | (4.307,4.567) | comedy | Lost Honeymoon | Soon after the end of World War II a young English woman, Amy Atkins (
5544 | 0.019 | (4.319,4.572) | romance | Breaking and Entering | Will Francis (Jude Law), a young Englishman, is a landscape architect
10408 | 0.028 | (4.303,4.543) | drama | The Divorcee | Ted (Chester Morris), Jerry (Norma Shearer), Paul (Conrad Nagel), and
23006 | 0.034 | (4.331,4.585) | comedy | Rumor Has It... | In 1997, Sarah Huttinger, an obituary and wedding announcement writer
11587 | 0.040 | (4.337,4.554) | drama | Sarah Prefers to Run (Sar | After performing well on her school's track team, Sarah (Sophie Desmar
16408 | 0.042 | (4.286,4.610) | romantic c | The Back-up Plan | Zoe (Jennifer Lopez) gives up on finding the man of her dreams, decide
18164 | 0.043 | (4.343,4.568) | comedy | Lazybones | Sir Reginald Ford (Ian Hunter), known as "Lazybones", is an idle baron
27509 | 0.046 | (4.256,4.582) | comedy-dra | Dan in Real Life | Dan Burns is a newspaper advice columnist, a widower, and single-paren
23018 | 0.048 | (4.257,4.592) | horror | April Fool's Day | On the weekend leading up to April Fools' Day, a group of college frie

Lebowski's nearest films seem to be about lost and lazy underachievers.

Monty Python and The Holy Grail - Most similar based on euclidean distance of the 2D UMAP of the embeddings.
Index | Distance | (X,Y) | Genre | Title | Plot
22511 | 0.000 | (4.587,8.317) | comedy | Monty Python and the Holy | In 932 AD, King Arthur and his squire, Patsy, travel throughout Britai
19324 | 0.004 | (4.584,8.314) | adventure | The Son of Monte Cristo | In 1865 the proletarian General Gurko Lanen (George Sanders) becomes t
9939 | 0.013 | (4.577,8.308) | animated | Pound Puppies and the Leg | Whopper is taking his niece and nephew to the museum. Along the way, h
16210 | 0.013 | (4.578,8.308) | adventure | Siege of the Saxons | King Arthur learns one of his knights is plotting to take over and mar
11730 | 0.015 | (4.571,8.316) | musical | Camelot | King Arthur is preparing for a great battle against his friend, Sir La
15982 | 0.021 | (4.598,8.300) | adventure | King Arthur | Arthur (Clive Owen) is portrayed as a Roman cavalry officer, also know
16691 | 0.031 | (4.565,8.340) | romance/dr | Mayerling | In the 1880s, Crown Prince Rudolf of Austria (Sharif) clashes with his
23032 | 0.032 | (4.594,8.349) | fantasy | First Knight | The film's opening text establishes that King Arthur (Sean Connery) of
20179 | 0.038 | (4.559,8.343) | fantasy | Jack the Giant Killer | In the Duchy of Cornwall of fairy tale days, an evil sorcerer named Pe
26822 | 0.038 | (4.592,8.355) | serial | Adventures of Sir Galahad | The Arthurian film cycle started with the Adventures of Sir Galahad se

And Monty Python is still among the knight and King Arthur themes. Perhaps this is an 'outlier' movie that is easier to classify.

Small cluster near (5.6,2.6)
Index | Distance | (X,Y) | Genre | Title | Plot
26451 | 0.115 | (5.651,2.703) | animated s | Fit to Be Tied | Spike is happily prancing along the backyard. He steps on a splinter a
8706 | 0.130 | (5.654,2.718) | animated | Cat Fishin' | Spike is shown guarding a lake fence while asleep. Tom shows up with h
24737 | 0.134 | (5.649,2.725) | animation | Barbecue Brawl | Spike and Tyke walk into the backyard to have a barbecue. The first at
11349 | 0.140 | (5.636,2.736) | animated s | The Dog House | Spike is busy building the doghouse of his dreams when Jerry suddenly
8374 | 0.141 | (5.640,2.735) | animated s | Hic-cup Pup | Spike is putting his son, Tyke, to bed. When a bird flies by to chirp,
15666 | 0.143 | (5.639,2.737) | romantic c | A Girl in Every Port | Spike (McLaglen) travels the world as the mate of a schooner. He has a
12045 | 0.143 | (5.640,2.737) | animated | Love That Pup | Spike is sleeping beside his son Tyke when Tyke suddenly wakes up afte
13246 | 0.144 | (5.637,2.739) | animation | Slicked-up Pup | Spike has bathed Tyke to make sure he is nice and clean, but is horrif
24762 | 0.146 | (5.640,2.741) | animated | Quiet Please! | Tom's nemesis, Spike, is trying to take a nap, but is awoken by Tom Ca
8924 | 0.151 | (5.632,2.748) | animated | Tops with Pops | Spike is sleeping beside his son Tyke when he suddenly wakes up from a

And just cause I'm curious I looked into another small cluster and found a group of Spike and Tyke animations. Another pleasant surprise.

Conclusions

Wow, this has gotten a lot longer than I ever expected so let's end here for now. I hope you are forgiving of the super lax 'metric' I used and understanding of the reasons why, and that it sparked some ideas for your own analysis.

There are still lots of different and interesting experiments and analyses we can do on this dataset, which I leave for future articles.

If there is a specific use case you'd like to explore please get in touch or reach out on Twitter.

Thanks for your time.

Julio

© 2020 E-String Technologies, Inc. | Privacy