Finding Similar Text with Machine Learning and Natural Language Processing

I've been working with a client on analyzing some text documents and wanted to share a bit of what has been working for us. I can't share the data or the exact project details, but the work entails finding, in a large collection of documents, the ones most similar to a specific query example. Imagine searching a database of company statements, product descriptions, articles, contracts, emails, support/trouble tickets, etc. not by keyword but by 'meaning' and 'similarity'.

For this article I thought about using the Enron email or a financial filings dataset, but I wanted something with more general interest and relatability, so I instead found something more fun and accessible on Kaggle. The Wikipedia Movie Plots dataset has ~34k records with movie titles, release year, genre, etc. and, most importantly for us, 'Plot' descriptions. It seems like the perfect dataset to experiment with when exploring similarity/semantic search engines, content-based recommenders, movie plot classifiers and many other interesting projects. In this example we're going to use the data in the plot field to find other 'similar' plots.

Keep in mind that we're going to look for similar plots (write-ups) and not similar movie styles, similar genres, etc. That is, a (simple) plot, "boy meets girl, boy loses girl", could be rendered in many different styles and genres and with many different character archetypes. That said, the plots were written by people after having seen the movie, and people tend to include specific details, such as actors, in the 'plot', so the write-ups are not completely independent of other characteristics of the movie ... which makes for some interesting future experiments, such as whether we can predict the genre or year from the plot.

Business Goal and Metric

Analyzing the plot field highlights the first issue in any machine learning project: identifying the business goal and finding an appropriate metric. This is the heart of every project and an important and often tricky first step. It is particularly challenging in similarity or clustering based (i.e. unsupervised) experiments, as there is no canonical correct answer and the utility and validity of the results are very context dependent.

The point of this article is to give a high level overview of some techniques and approaches, not to propose a specific solution to a specific problem ... so I'm just going to punt on it here and say that my business goal is to 'explore similarity in various ways' and my metric is 'do I get answers that feel right'. If this were a recommender system we might want to measure whether the similar movies were on users' 'favorites' lists, or how often people added recommended movies to their queue, and so on. If you have a more specific business need I'd be happy to discuss it.
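
To make the recommender idea a little more concrete, here is a minimal sketch of one such metric, a hit rate: the share of recommended movies that also appear on a user's favorites list. The helper name hit_rate and the titles are made up for illustration; a real system would need to decide what counts as a 'hit'.

```python
def hit_rate(recommended, favorites):
    """Fraction of recommended titles that are in the user's favorites."""
    if not recommended:
        return 0.0
    favorites = set(favorites)
    hits = sum(1 for title in recommended if title in favorites)
    return hits / len(recommended)


# Hypothetical data, for illustration only
recommended = ["Camelot", "First Knight", "Jumanji", "Superbad"]
favorites = ["Camelot", "Superbad", "The Big Lebowski"]
score = hit_rate(recommended, favorites)  # 2 of the 4 recommendations are favorites
```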

Feature Generation

Once you have a goal and metric in mind the next step is to develop ways of generating features that you feel may be useful to achieve those goals. Simply put, the algorithms need numbers, we have text. What numbers are we going to generate from our text? For example, we could use the length of each plot write up, the number of vowels, the number of large words, etc. or more advanced techniques like TF-IDF or neural net transformers. Some of these features will work better than others in different applications so feature selection is part art and part science.
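
As a small illustration of the simplest kind of feature generation, the sketch below turns a text into the three hand-made numbers mentioned above. The function name simple_features and the cutoff of 8 letters for a 'large' word are assumptions made for this example, not anything used later in the article.

```python
def simple_features(text, large_word_len=8):
    """Turn a text into a few hand-made numeric features."""
    words = text.split()
    return {
        # total number of characters, including spaces and punctuation
        "n_chars": len(text),
        # count of a, e, i, o, u (case-insensitive)
        "n_vowels": sum(ch in "aeiou" for ch in text.lower()),
        # words with at least `large_word_len` letters, ignoring edge punctuation
        "n_large_words": sum(
            len(w.strip(".,;:!?\"'()")) >= large_word_len for w in words
        ),
    }


features = simple_features("Boy meets girl, boy loses girl.")
```

Features like these are cheap to compute but say little about meaning, which is why the rest of the article relies on embeddings instead.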

For this example I could have used a classic technique such as TF-IDF (perhaps we'll revisit that in a future article) or a hot new BERT-based model. I chose instead to use the standby Universal Sentence Encoder (USE) because it is easy to set up and use and works really well. TF-IDF is probably easier if you don't have any TensorFlow experience, and a fine-tuned BERT-based model would probably work better if you wanted to invest the time, but USE is a great place to start: in a few lines of code you can get a meaningful 512-dimensional numeric representation (embedding) of short text passages.

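import tensorflow_hub as hub

# Load the Universal Sentence Encoder (large, version 5) from TensorFlow Hub
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder-large/5")
# my_text is a list of plot strings; the result has one 512-dimensional row per plot
embeddings = embed(my_text)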
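
For comparison, the TF-IDF route mentioned above might look roughly like this with scikit-learn. This is only a sketch on three made-up sentences (not rows from the dataset), and it is not the approach used for the results in this article.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy stand-ins for plot descriptions (not taken from the dataset)
docs = [
    "King Arthur and his knights search for the Holy Grail.",
    "A knight of the Round Table falls in love with the queen of King Arthur.",
    "A slacker in Los Angeles goes bowling with his friends.",
]

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(docs)  # sparse matrix, one row per document
sims = cosine_similarity(tfidf)         # 3 x 3 matrix of pairwise similarities
```

Because TF-IDF only sees shared words, the two King Arthur sentences score as similar while the slacker sentence does not; an embedding model can also pick up on related meaning when the exact words differ.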

Measuring Similarity with Cosine

OK, great, so I have a 34886 x 512 matrix. Now what? Well, the idea behind turning the text into an embedding vector is that the embedding now represents a point in a semantic space, and other points 'near' it represent other similar embedded items (plot descriptions).

The next decision is to formalize what we mean by 'near'. In two- or three-dimensional spaces we have an ingrained, intuitive concept of 'nearness', and we call it the Euclidean distance. Unfortunately, Euclidean distance tends to work poorly in high-dimensional spaces because of the curse of dimensionality: as dimensions are added, the distances between points become more and more alike. The bottom line is that we'll be better off using angular (cosine) distance in high dimensions.
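
A quick way to get a feel for the curse of dimensionality is to compare how different the nearest and farthest neighbors of a point are in 2 versus 512 dimensions. The sketch below uses uniformly random points, which is an assumption made purely for illustration (real embeddings are not uniform), and relative_contrast is a made-up helper name.

```python
import numpy as np

rng = np.random.default_rng(0)


def relative_contrast(dim, n_points=1000):
    """(farthest - nearest) / nearest Euclidean distance from one random query."""
    points = rng.random((n_points, dim))
    query = rng.random(dim)
    dists = np.linalg.norm(points - query, axis=1)
    return (dists.max() - dists.min()) / dists.min()


contrast_2d = relative_contrast(2)      # nearest point is far closer than the farthest
contrast_512d = relative_contrast(512)  # nearest and farthest are almost equally far
```

In two dimensions the farthest point is many times farther away than the nearest one; in 512 dimensions the two distances are nearly the same, so 'nearest' carries much less information.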

Luckily that is easy to do with Python and scikit-learn. Given a point/embedding represented by v and the matrix of embeddings, we can calculate all the angular distances (similarities, actually) and then find the top n similarities with something like this:

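from sklearn.metrics.pairwise import cosine_similarity

def most_similar(v, embeddings, n=10):
    # v must be 2D with shape (1, 512), e.g. one row sliced as embeddings[i:i+1]
    d = cosine_similarity(v, embeddings).ravel()  # find similarities and flatten the result
    a = d.argsort()[::-1]  # indexes of the sorted similarities, reversed (most similar first)
    return a[:n]  # return the top n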
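
To see what the cosine similarity is doing under the hood, here is a NumPy-only sketch of the same top-n search on a tiny made-up example: normalize every vector to unit length, take dot products, and sort. The name top_n_cosine and the toy vectors are assumptions for illustration.

```python
import numpy as np


def top_n_cosine(v, embeddings, n):
    """Indexes of the n rows of `embeddings` most similar to vector `v`."""
    v = v / np.linalg.norm(v)
    rows = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = rows @ v                    # cosine similarity of v to every row
    return np.argsort(sims)[::-1][:n]  # largest similarity first


# Toy 2D "embeddings", for illustration only
toy = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
query = np.array([1.0, 0.1])
top2 = top_n_cosine(query, toy, n=2)   # rows 0 and 2 point most nearly the same way
```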

Results

Using these embeddings, let's take a look at a few examples from movies that I have watched and enjoyed: 'There Will Be Blood', 'The Big Lebowski', and 'Monty Python and the Holy Grail'. Take a moment to think about how you'd describe their plots.

The tables below give the 10 most similar plots to each of the movies based on the cosine similarity of the USE embeddings. Note that the plots here have been truncated to fit in the table. The index isn't too useful if you don't have the exact shuffling of the dataset that I used, but it helps me keep track of the movies. Similarity is the cosine of the angle between the two embeddings/vectors; an angle of 0 has a cosine of 1, so 1 is most similar. IOU is intersection over union, which I added as a measure of how many words the plot descriptions of two movies have in common. Genre is specified in the dataset and is a very messy feature: though it could be useful for some experiments, there are ~2265 unique values and it would need to be cleaned up. Title is self-explanatory, and the Plot column has the first few words of the plot description so we can get a feel for the text.
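
One plausible way to compute an IOU like this is the Jaccard index of the two word sets. The exact tokenization used for the tables isn't shown here, so treat the lowercase, whitespace-split version below (and the name word_iou) as an assumption.

```python
def word_iou(text_a, text_b):
    """Intersection over union (Jaccard index) of the two texts' word sets."""
    words_a = set(text_a.lower().split())
    words_b = set(text_b.lower().split())
    union = words_a | words_b
    if not union:
        return 0.0
    return len(words_a & words_b) / len(union)


# 2 shared words (king, arthur) out of 7 distinct words overall
iou = word_iou("king arthur seeks the grail", "king arthur defends camelot")
```

A low IOU next to a high cosine similarity is a hint that the embeddings are matching on meaning rather than on shared vocabulary.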

Index | Similarity | IOU | Genre | Title | Plot
2028 | 1.000 | 1.000 | drama | There Will Be Blood | In 1898, Daniel Plainview, a prospector in New Mexico, mines a potenti
34283 | 0.655 | 0.085 | drama | Kidnapped | Young David Balfour arrives at a bleak Scottish house, the House of Sh
24308 | 0.631 | 0.089 | drama | Silver Dollar | Kansas farmer Yates Martin (Edward G. Robinson) uproots his uncomplain
6424 | 0.620 | 0.094 | adventure | Doc Savage: The Man of Br | In 1936, Doc Savage (Ron Ely) returns to New York City following a vis
28162 | 0.611 | 0.088 | western | Campbell's Kingdom | Recently diagnosed with a terminal disease, Bruce Campbell (Dirk Bogar
32776 | 0.603 | 0.108 | comedy wes | Waterhole No. 3 | In Arizona, a shipment of gold bullion is stolen in an inside job by a
8907 | 0.602 | 0.101 | western co | The Dude Goes West | A gunsmith and a marksman, Daniel Bone closes up his Brooklyn, New Yor
30885 | 0.600 | 0.088 | family, fa | Jumanji | In 1869, near Brantford, New Hampshire, two brothers bury a chest and
2721 | 0.597 | 0.102 | drama | The Kidnappers | In the early 1900s, two young orphaned brothers, eight year old Harry
33689 | 0.595 | 0.076 | drama | The Hanging Tree | Joseph Frail (Gary Cooper)—doctor, gambler, gunslinger—rides into the

I don't know the other movies but note that most have to do with family, land and greed and there are no romance or space related movies. Do you know the movies? Do you feel they are similar in some ways?

Index | Similarity | IOU | Genre | Title | Plot
1375 | 1.000 | 1.000 | comedy | The Big Lebowski | In 1991 Los Angeles, slacker Jeff "the Dude" Lebowski is assaulted in
12795 | 0.592 | 0.084 | comedy | That's My Boy | In 1984, middle school student Donny Berger is in detention and begins
31057 | 0.570 | 0.079 | comedy | Jeff, Who Lives at Home | Jeff (Segel) is a 30-year-old unemployed stoner living in his mother S
4955 | 0.562 | 0.088 | unknown | The Nine Lives of Fritz t | It is the 1970s; Fritz the Cat is now married, on welfare, and has a c
31531 | 0.553 | 0.078 | comedy | Little Monsters | Brian's family has moved to a new town, and he feels isolated in his n
16555 | 0.553 | 0.087 | drama | T.R. Baskin | When Jack Mitchell (Peter Boyle), a married middle-aged salesman with
25180 | 0.548 | 0.106 | musical dr | Inside Llewyn Davis | In February 1961, Llewyn Davis is a struggling folk singer in New York
18782 | 0.548 | 0.106 | unknown | Inside Llewyn Davis | In February 1961, Llewyn Davis is a struggling folk singer in New York
31802 | 0.548 | 0.068 | drama, rom | The Cooler | Unlucky Bernie Lootz (William H. Macy) has little positive going for h
803 | 0.548 | 0.051 | unknown | Harry and Tonto | Harry Coombes (Art Carney) is an elderly widower and retired teacher w

I've seen 'Inside Llewyn Davis' but not the others. The plots seem to be related to 'lazy' or 'unsuccessful' people. Also note that 'Inside Llewyn Davis' shows up twice even though this is supposed to be a de-duped list. I found several duplicates (not just duplicate titles) while exploring the dataset with the embeddings.
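
Embeddings make this kind of duplicate hunting fairly mechanical: two rows whose cosine similarity is essentially 1 are almost certainly the same text. Below is a minimal NumPy sketch on toy vectors; the name near_duplicates and the 0.999 threshold are assumptions for illustration. Note that it builds the full n x n similarity matrix, so on the ~34k plots you would probably want to work in chunks or on a sample.

```python
import numpy as np


def near_duplicates(embeddings, threshold=0.999):
    """Pairs (i, j), i < j, whose cosine similarity is at least `threshold`."""
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = unit @ unit.T
    # keep only the upper triangle so each pair is reported once, without self-pairs
    i_idx, j_idx = np.where(np.triu(sims, k=1) >= threshold)
    return list(zip(i_idx.tolist(), j_idx.tolist()))


# Toy vectors: rows 0 and 3 point in exactly the same direction
toy = np.array([[1.0, 2.0, 3.0], [3.0, -1.0, 0.5], [0.0, 1.0, 0.0], [2.0, 4.0, 6.0]])
pairs = near_duplicates(toy)
```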

Index | Similarity | IOU | Genre | Title | Plot
22511 | 1.000 | 1.000 | comedy | Monty Python and the Holy | In 932 AD, King Arthur and his squire, Patsy, travel throughout Britai
5905 | 0.754 | 0.146 | romance | Lancelot and Guinevere | Lancelot is King Arthur's most valued Knight of the Round Table and a
16210 | 0.749 | 0.106 | adventure | Siege of the Saxons | King Arthur learns one of his knights is plotting to take over and mar
26822 | 0.742 | 0.114 | serial | Adventures of Sir Galahad | The Arthurian film cycle started with the Adventures of Sir Galahad se
12475 | 0.721 | 0.120 | animated | Knighty Knight Bugs | King Arthur is sitting with his Knights of the Round Table, complainin
11730 | 0.719 | 0.093 | musical | Camelot | King Arthur is preparing for a great battle against his friend, Sir La
23032 | 0.711 | 0.098 | fantasy | First Knight | The film's opening text establishes that King Arthur (Sean Connery) of
11201 | 0.704 | 0.089 | unknown | King Arthur: Legend of th | Mordred, an iron-fisted warlock, and his armies lay siege to Camelot,
19146 | 0.704 | 0.089 | action, ad | King Arthur: Legend of th | Mordred, an iron-fisted warlock, and his armies lay siege to Camelot,
27971 | 0.690 | 0.100 | animated | The Sword in the Stone | After the King of England, Uther Pendragon, dies, leaving no heir to t

These all seem to relate to King Arthur and Camelot. None seem to be (absurdist) comedies, and there are no westerns or space adventures.

Dimensionality Reduction

It can often be useful to visualize our collection to get a feeling for how the items 'cluster' or relate to each other. The embedding vectors we used have 512 dimensions so to visualize them we're going to need to reduce them to 2 or 3 dimensions. The challenge is doing that while keeping as much useful information as possible.

Principal Component Analysis (PCA)

A classic approach to dimensionality reduction is Principal Component Analysis (PCA), which we'll use to find two axes through the data that show the most information (variance). Imagine trying to draw a loaf of French bread. You can rotate it to find an angle that shows the most information, and you are likely to choose an angle that shows the full length and width, which sacrifices the information in the height. You are less likely to draw it end-on, from the point where you highlight the width and height but not the length. Python has PCA functionality that is fast and easy and can be done in a few lines of code.

from sklearn.decomposition import PCA

pca = PCA(n_components=2)
pca.fit(embeddings)
pX = pca.transform(embeddings)

2D PCA

In the chart of the transformed PCA components we can see 3 or 4 main groups but not a lot of structure. This is not too surprising given what we now know about high dimensional spaces. It turns out we're not trying to lay out a loaf of bread but rather a high dimensional basketball. If we inspect the PCA model we see that the first two components explain only about 10.8% of the variance.
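
With scikit-learn, the explained variance can be read off the fitted model's explained_variance_ratio_ attribute. To show where that number comes from, and why a 'basketball' is hard to flatten, here is a NumPy-only sketch on synthetic, evenly spread data (an assumption for illustration; the real embeddings are less uniform, which is why they manage about 10.8%).

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic, roughly "ball-shaped" data: 2000 points in 512 dimensions
X = rng.standard_normal((2000, 512))

Xc = X - X.mean(axis=0)  # PCA works on mean-centered data
singular_values = np.linalg.svd(Xc, compute_uv=False)
# Each component's share of the total variance, largest first
explained = singular_values**2 / np.sum(singular_values**2)
top2_share = explained[:2].sum()  # share kept by a 2D PCA projection
```

When the variance is spread evenly over all 512 directions, no pair of axes can capture more than a sliver of it, so any 2D picture throws most of the information away.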

Still, let's look at the most similar plots for our 3 examples. And since we've now reduced the dimensions, we can use the Euclidean distance.

import numpy as np
from sklearn.metrics.pairwise import euclidean_distances

# emb is the 2D matrix of reduced embeddings (e.g. pX from the PCA step); x, y are the query movie's coordinates
v = np.array([x, y])
v = v.reshape(1, -1)  # euclidean_distances expects a 2D array
d = euclidean_distances(v, emb).ravel()

Index | Distance | (X,Y) | Genre | Title | Plot
2028 | 0.000 | (-0.137,-0.163) | drama | There Will Be Blood | In 1898, Daniel Plainview, a prospector in New Mexico, mines a potenti
7432 | 0.001 | (-0.137,-0.164) | drama | Secret Nation | The film opens with the death of the elderly and wealthy Leo Cryptus (
18122 | 0.001 | (-0.138,-0.163) | drama | Snow Cake | When the eccentric drifter Vivienne Freeman gets a ride from a relucta
32180 | 0.001 | (-0.137,-0.164) | war | Men in War | On 6 September 1950, an isolated and exhausted platoon of the 24th Inf
34366 | 0.001 | (-0.137,-0.165) | horror | The Wolf Man | Sometime in the early twentieth century, after learning of the death o
28397 | 0.001 | (-0.136,-0.164) | fantasy | Osmosis Jones | Frank Detorre (Bill Murray) is an unkempt, slovenly zookeeper at the S
33059 | 0.001 | (-0.136,-0.163) | drama | Better Times | As described in a film magazine,[2] the plot of the film is as follows
20301 | 0.001 | (-0.136,-0.165) | adventure | Flowing Gold | Oilfield worker John Alexander (John Garfield) is on the run from a mu
33034 | 0.002 | (-0.135,-0.163) | crime | The Mystery Man | A newspaper man, Larry Doyle and a young woman, Anne Olgivie, meet by
22271 | 0.002 | (-0.139,-0.163) | action | Striking Distance | Thomas Hardy, a Pittsburgh Police homicide detective, has broken the r

A few of the plots seem to deal with death and oil but they don't really seem as relevant now. We may have lost too much information in this transformation. To actually judge that we'd need a more concrete metric and an actual business goal.

Index | Distance | (X,Y) | Genre | Title | Plot
1375 | 0.000 | (0.028,-0.142) | comedy | The Big Lebowski | In 1991 Los Angeles, slacker Jeff "the Dude" Lebowski is assaulted in
27885 | 0.002 | (0.028,-0.144) | revenge, t | 22 Female Kottayam | Tessa (Rima Kallingal) is a nursing student in Bangalore with plans of
1643 | 0.002 | (0.026,-0.143) | action / c | Let's Go! | Siu Sheung (Juno Mak) is a solitary and frustrated young man. He works
33881 | 0.003 | (0.028,-0.139) | tokusatsu | Cho Kamen Rider Den-O & D | Taking place after the events of Kamen Rider Decade episode 15, under
7507 | 0.003 | (0.031,-0.142) | horror | Dark Tales of Japan | Introduction: Would You Like to Hear a Scary Tale? (Intorodakushon: Ko
28542 | 0.004 | (0.031,-0.139) | drama | End of Summer !The End of | Manbei Kohayagawa (Ganjirō Nakamura) is the head of a small sake brewe
12552 | 0.005 | (0.033,-0.141) | drama | Confession of Pain | Police inspectors Lau Ching-hei and Yau Kin-bong arrest a rapist in 20
16316 | 0.005 | (0.033,-0.142) | thriller | Ice Cream 2 | A crew of eight amateur film makers approach a noted film producer (Ra
24757 | 0.006 | (0.022,-0.145) | unknown | Battles Without Honor and | In Kure, Hiroshima 1946, when Shinichi Yamagata gets into a scuffle wi
28203 | 0.007 | (0.022,-0.146) | horror | Dark Water | Yoshimi Matsubara, in the midst of a divorce mediation, rents a run-do

The movies nearest to Lebowski seem even less related.

Index | Distance | (X,Y) | Genre | Title | Plot
22511 | 0.000 | (-0.119,-0.198) | comedy | Monty Python and the Holy | In 932 AD, King Arthur and his squire, Patsy, travel throughout Britai
6112 | 0.001 | (-0.118,-0.197) | thriller | The Strangler | Leo Kroll (Buono) is a mother-fixated lab technician who collects doll
29395 | 0.001 | (-0.120,-0.200) | western | Outlaw's Son | Twelve-year-old Jeff Blaine lives in the small western town of Plainsv
4072 | 0.002 | (-0.121,-0.199) | comedy-dra | Lady Bird | Christine "Lady Bird" McPherson is a senior student at a Catholic high
8614 | 0.002 | (-0.119,-0.197) | unknown | The Snowman | At a remote cabin amidst heavy snowfall, a man named Jonas (Peter Dall
10040 | 0.002 | (-0.121,-0.197) | western | Death of a Gunfighter | In the town of Cottonwood Springs, Texas at the turn of the century, M
28972 | 0.002 | (-0.119,-0.196) | comedy | Bridesmaids | Annie Walker (Kristen Wiig) is a single woman in her late 30s. Followi
686 | 0.002 | (-0.121,-0.199) | drama | Menace II Society | Caine Lawson and his best friend Kevin "O-Dog" Anderson enter a local
24142 | 0.002 | (-0.121,-0.197) | comedy | It's a Boy | On the eve of his society wedding, Dudley Leake and his best man James
24744 | 0.002 | (-0.120,-0.201) | film noir | Ministry of Fear | In wartime England during the Blitz, Stephen Neale (Ray Milland) is re

Same for the Monty Python movies.

Overall this does not seem like a successful strategy as we have lost too much useful information.
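
If you wanted something firmer than 'feels right' here, one option is to measure how many of each item's nearest neighbors in the original space are still its nearest neighbors after the reduction. The sketch below is one way to do that, using cosine distance before and Euclidean distance after; the names knn_indices and neighbor_overlap, the synthetic data, and the crude 'keep two axes' reduction are all assumptions for illustration. It builds full n x n distance matrices, so run it on a sample rather than on all ~34k plots.

```python
import numpy as np


def knn_indices(dist, k):
    """Indexes of the k nearest neighbors of each row, excluding the row itself."""
    dist = dist.copy()
    np.fill_diagonal(dist, np.inf)
    return np.argsort(dist, axis=1)[:, :k]


def neighbor_overlap(high, low, k=10):
    """Mean fraction of each item's k nearest neighbors kept after reduction."""
    unit = high / np.linalg.norm(high, axis=1, keepdims=True)
    high_dist = 1.0 - unit @ unit.T             # cosine distance, original space
    diff = low[:, None, :] - low[None, :, :]
    low_dist = np.sqrt((diff**2).sum(axis=-1))  # Euclidean distance, reduced space
    nn_high = knn_indices(high_dist, k)
    nn_low = knn_indices(low_dist, k)
    kept = [len(set(a) & set(b)) / k for a, b in zip(nn_high, nn_low)]
    return float(np.mean(kept))


rng = np.random.default_rng(0)
high = rng.standard_normal((200, 50))  # synthetic stand-in for embeddings
low = high[:, :2]                      # a crude "reduction": keep two axes
score = neighbor_overlap(high, low, k=10)
```

A score near 1 means the 2D picture keeps the neighborhoods intact; a score near 0 means the neighbors you see in the plot are mostly not the neighbors in the original space.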

T-Distributed Stochastic Neighbor Embedding (T-SNE)

So, going back to flattening a basketball, or perhaps a globe, you can try to choose a rotation that highlights the features (continents) you are most interested in ... or you can abandon the idea of a linear transformation and try to peel the surface off and lay it out flat. We all know that this causes some distortions (i.e. Greenland looks way bigger than it actually is on many maps), but it is still a useful technique.

And what if you were free to cut and stretch that surface so that you could keep local information by sacrificing some global information? That is something like what T-SNE and UMAP try to do. And luckily they are both easy to use from Python.

from sklearn.manifold import TSNE

tsne = TSNE(n_components=2, n_jobs=-1)
tX = tsne.fit_transform(embeddings)

T-SNE attempts to model the items in such a way that items that are close in the original space stay close in 2 or 3 dimensions, at the expense of global (far away) distances. And we can see in the plot that clusters (groups of highly similar items) start to appear.

2D T-SNE

Index | Distance | (X,Y) | Genre | Title | Plot
2028 | 0.000 | (1.423,30.830) | drama | There Will Be Blood | In 1898, Daniel Plainview, a prospector in New Mexico, mines a potenti
19221 | 0.107 | (1.438,30.724) | drama | The Wonder Kid | Sebastian Giro is a ten-year-old French boy and child musical prodigy
8722 | 0.154 | (1.300,30.738) | unknown | Grimsby | "Nobby" Butcher has been separated from his little brother Sebastian f
25748 | 0.271 | (1.164,30.751) | drama | The Power and the Prize | Although he is scheduled to wed his boss George Salt's niece that week
13408 | 0.319 | (1.742,30.844) | fantasy | The Devil and Daniel Webs | In 1840 New Hampshire, Jabez Stone (James Craig), a poor kindhearted f
2721 | 0.366 | (1.057,30.849) | drama | The Kidnappers | In the early 1900s, two young orphaned brothers, eight year old Harry
34283 | 0.444 | (0.987,30.746) | drama | Kidnapped | Young David Balfour arrives at a bleak Scottish house, the House of Sh
27595 | 0.669 | (0.756,30.773) | drama | Kidnapped | Scotland, 1751: At a stately manor near Edinburgh, the young David Bal
17046 | 0.913 | (1.575,31.730) | adventure | Manfish | Inspector Warren of Scotland Yard flies into Jamaica and is taken to t
6402 | 0.952 | (0.477,30.725) | drama | Tol'able David | David Kinemon, youngest son of West Virginia tenant farmers, longs to

The similar movies for There Will Be Blood do seem to be related to family and land again, and we see 'The Kidnappers' and 'Kidnapped' appear on the list again.

Index | Distance | (X,Y) | Genre | Title | Plot
1375 | 0.000 | (32.053,22.814) | comedy | The Big Lebowski | In 1991 Los Angeles, slacker Jeff "the Dude" Lebowski is assaulted in
12795 | 0.149 | (32.200,22.842) | comedy | That's My Boy | In 1984, middle school student Donny Berger is in detention and begins
23018 | 0.555 | (32.590,22.955) | horror | April Fool's Day | On the weekend leading up to April Fools' Day, a group of college frie
25133 | 0.618 | (32.660,22.929) | horror | Terror Train | At a college pre-med student fraternity New Year's Eve party, a reluct
455 | 0.648 | (31.544,23.215) | comedy | Adventureland | In 1987, James Brennan plans to have a summer vacation in Europe after
5560 | 0.731 | (32.353,23.481) | comedy | The House | During their visit to Bucknell University, husband and wife Scott (Fer
24944 | 0.789 | (31.915,23.591) | comedy | This Is the End | Jay Baruchel arrives in Los Angeles to visit old friend and fellow Can
15475 | 0.795 | (31.958,23.603) | comedy | Superbad | Seth (Jonah Hill) and Evan (Michael Cera) are two high school seniors
12949 | 0.797 | (32.346,22.072) | comedy | Class Act | Genius high school student Duncan Pinderhughes is getting ready for g
1297 | 0.821 | (31.817,23.601) | adventure, | 30 Minutes or Less | Marijuana-smoking, Grand Rapids slacker pizza delivery driver Nick (Je

For Lebowski we see 'That's My Boy' appear again along with a movie about a pot smoking slacker.

Index | Distance | (X,Y) | Genre | Title | Plot
22511 | 0.000 | (-10.799,1.842) | comedy | Monty Python and the Holy | In 932 AD, King Arthur and his squire, Patsy, travel throughout Britai
26822 | 0.015 | (-10.784,1.847) | serial | Adventures of Sir Galahad | The Arthurian film cycle started with the Adventures of Sir Galahad se
5905 | 0.031 | (-10.771,1.830) | romance | Lancelot and Guinevere | Lancelot is King Arthur's most valued Knight of the Round Table and a
31622 | 0.068 | (-10.793,1.910) | musical co | A Connecticut Yankee in K | Hank Martin (Bing Crosby), an American mechanic, is knocked out and wa
27786 | 0.095 | (-10.757,1.927) | animated | Quest for Camelot | Sir Lionel is one of the knights of the Round Table, and his daughter
23032 | 0.103 | (-10.712,1.787) | fantasy | First Knight | The film's opening text establishes that King Arthur (Sean Connery) of
12475 | 0.106 | (-10.793,1.949) | animated | Knighty Knight Bugs | King Arthur is sitting with his Knights of the Round Table, complainin
11730 | 0.121 | (-10.717,1.931) | musical | Camelot | King Arthur is preparing for a great battle against his friend, Sir La
24511 | 0.136 | (-10.912,1.918) | adventure | Knights of the Round Tabl | With the land in anarchy, warring overlords, Arthur Pendragon (Mel Fer
11201 | 0.136 | (-10.864,1.962) | unknown | King Arthur: Legend of th | Mordred, an iron-fisted warlock, and his armies lay siege to Camelot,

The Monty Python movies make a lot more sense and we see movies about King Arthur, Camelot and Knights.

Index | Distance | (X,Y) | Genre | Title | Plot
29543 | 0.021 | (45.612,37.416) | comedy sho | Three Little Sew and Sews | The Stooges are sailors employed in the tailor shop of a naval base. A
9368 | 0.042 | (45.559,37.389) | comedy sho | Saved by the Belle | The Stooges are traveling salesmen stranded in Valeska, a fictional So
13167 | 0.052 | (45.551,37.382) | comedy sho | Oily to Bed, Oily to Rise | The Stooges are three hapless tramps. After nearly destroying a farmer
33192 | 0.059 | (45.626,37.453) | comedy sho | Booby Dupes | The Stooges are fish peddlers (similar to their roles in Cookoo Cavali
11028 | 0.067 | (45.618,37.336) | short subj | Rhythm and Weep | The Stooges play the roles of unsuccessful actors who have decided to
29448 | 0.082 | (45.527,37.437) | comedy sho | Rockin' Thru the Rockies | The Stooges are guides (circa late 1800s), who are helping a trio chri
7845 | 0.090 | (45.530,37.344) | comedy sho | No Dough Boys | The Stooges are dressed as Japanese soldiers for a photo shoot; their
34365 | 0.094 | (45.562,37.314) | comedy | Self-Made Maids | The Stooges are artists who fall in love with three models, Larraine,
10009 | 0.097 | (45.655,37.480) | comedy | The Three Stooges in Orbi | The Stooges are TV actors who are trying to sell ideas for their anima
29375 | 0.104 | (45.685,37.340) | comedy sho | Calling All Curs | The Stooges are skilled veterinarians at a pet hospital who are the pr

And just because I'm curious, I took a look at a small cluster near (45.6, 37.4). These all turned out to be short movies starring The Stooges. Of everything we've looked at so far, this is the strongest indication that T-SNE is doing something interesting.
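
Poking at a cluster like this only takes a few lines: pick a coordinate on the 2D plot and list the points within some radius of it. Here is a small NumPy sketch; the name points_near, the toy coordinates and the radius of 0.2 are assumptions for illustration.

```python
import numpy as np


def points_near(coords, center, radius):
    """Indexes of 2D points within `radius` of `center`, nearest first."""
    dists = np.linalg.norm(coords - np.asarray(center), axis=1)
    inside = np.where(dists <= radius)[0]
    return inside[np.argsort(dists[inside])]


# Toy coordinates, for illustration only
coords = np.array([[45.61, 37.42], [1.42, 30.83], [45.56, 37.39], [32.05, 22.81]])
nearby = points_near(coords, center=(45.6, 37.4), radius=0.2)
```

With the real data, coords would be the reduced matrix, and the returned indexes can be used to look up titles and plots.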

Uniform Manifold Approximation and Projection for Dimension Reduction (UMAP)

UMAP is a newer technique that, like T-SNE, tries to preserve local structure, but also tries to keep most of the global structure in the data. Both depend a lot on the data and the particular parameters you choose, so if you are interested in using them in your application you'll have to explore to find settings that work well for you. These algorithms may not be as straightforward as PCA, but they seem to yield more interesting results.

We can run the UMAP algorithm by using the umap-learn package which uses the scikit-learn API of fit and transform. By plotting the transformed points you can see there is a lot more structure and clusters with similar movies.

import umap

um = umap.UMAP()
mapper = um.fit(embeddings)  # fit returns the fitted model itself
uX = um.transform(embeddings)  # 2D coordinates, one row per plot

2D UMAP

Index | Distance | (X,Y) | Genre | Title | Plot
2028 | 0.000 | (4.217,7.186) | drama | There Will Be Blood | In 1898, Daniel Plainview, a prospector in New Mexico, mines a potenti
29294 | 0.003 | (4.220,7.188) | comedy | Man of the Year | Tom Dobbs is host of a satirical news program, where he taps into peop
1638 | 0.031 | (4.196,7.164) | western | Rocky Mountain Mystery | Mining engineer Larry Sutton (Randolph Scott) arrives at the Ballard r
6539 | 0.034 | (4.199,7.214) | western | The Baron of Arizona | The notorious attempt by swindler James Reavis to claim the entire ter
28287 | 0.037 | (4.239,7.215) | unknown | High Rolling | Tex (Bottoms) is an American working at a carnival in Queensland. At t
30306 | 0.037 | (4.192,7.158) | western | Forty Guns | In the 1880s, Griff Bonnell, and his brothers, Wes and Chico, arrive i
23225 | 0.038 | (4.228,7.150) | western | Something Big | In the frontier of New Mexico Territory, Joe Baker is an aging, restle
22058 | 0.044 | (4.209,7.228) | western | Shoot Out | Clay Lomax is released from prison after serving nearly eight years fo
14374 | 0.044 | (4.229,7.143) | drama | Hallelujah! | Sharecroppers Zeke and Spunk Johnson sell their family's portion of th
8983 | 0.048 | (4.223,7.233) | drama | Black Legion | When passed over for promotion at work in favor of a foreign-born frie

For There Will Be Blood we're back among the western, family and land themes.

Index | Distance | (X,Y) | Genre | Title | Plot
1375 | 0.000 | (4.300,4.571) | comedy | The Big Lebowski | In 1991 Los Angeles, slacker Jeff "the Dude" Lebowski is assaulted in
5758 | 0.007 | (4.307,4.567) | comedy | Lost Honeymoon | Soon after the end of World War II a young English woman, Amy Atkins (
5544 | 0.019 | (4.319,4.572) | romance | Breaking and Entering | Will Francis (Jude Law), a young Englishman, is a landscape architect
10408 | 0.028 | (4.303,4.543) | drama | The Divorcee | Ted (Chester Morris), Jerry (Norma Shearer), Paul (Conrad Nagel), and
23006 | 0.034 | (4.331,4.585) | comedy | Rumor Has It... | In 1997, Sarah Huttinger, an obituary and wedding announcement writer
11587 | 0.040 | (4.337,4.554) | drama | Sarah Prefers to Run (Sar | After performing well on her school's track team, Sarah (Sophie Desmar
16408 | 0.042 | (4.286,4.610) | romantic c | The Back-up Plan | Zoe (Jennifer Lopez) gives up on finding the man of her dreams, decide
18164 | 0.043 | (4.343,4.568) | comedy | Lazybones | Sir Reginald Ford (Ian Hunter), known as "Lazybones", is an idle baron
27509 | 0.046 | (4.256,4.582) | comedy-dra | Dan in Real Life | Dan Burns is a newspaper advice columnist, a widower, and single-paren
23018 | 0.048 | (4.257,4.592) | horror | April Fool's Day | On the weekend leading up to April Fools' Day, a group of college frie

Lebowski's nearest films seem to be about lost and lazy underachievers.

Index | Distance | (X,Y) | Genre | Title | Plot
22511 | 0.000 | (4.587,8.317) | comedy | Monty Python and the Holy | In 932 AD, King Arthur and his squire, Patsy, travel throughout Britai
19324 | 0.004 | (4.584,8.314) | adventure | The Son of Monte Cristo | In 1865 the proletarian General Gurko Lanen (George Sanders) becomes t
9939 | 0.013 | (4.577,8.308) | animated | Pound Puppies and the Leg | Whopper is taking his niece and nephew to the museum. Along the way, h
16210 | 0.013 | (4.578,8.308) | adventure | Siege of the Saxons | King Arthur learns one of his knights is plotting to take over and mar
11730 | 0.015 | (4.571,8.316) | musical | Camelot | King Arthur is preparing for a great battle against his friend, Sir La
15982 | 0.021 | (4.598,8.300) | adventure | King Arthur | Arthur (Clive Owen) is portrayed as a Roman cavalry officer, also know
16691 | 0.031 | (4.565,8.340) | romance/dr | Mayerling | In the 1880s, Crown Prince Rudolf of Austria (Sharif) clashes with his
23032 | 0.032 | (4.594,8.349) | fantasy | First Knight | The film's opening text establishes that King Arthur (Sean Connery) of
20179 | 0.038 | (4.559,8.343) | fantasy | Jack the Giant Killer | In the Duchy of Cornwall of fairy tale days, an evil sorcerer named Pe
26822 | 0.038 | (4.592,8.355) | serial | Adventures of Sir Galahad | The Arthurian film cycle started with the Adventures of Sir Galahad se

And Monty Python is still among the knight and King Arthur themes. Perhaps this is an 'outlier' movie that is easier to classify.

Index | Distance | (X,Y) | Genre | Title | Plot
26451 | 0.115 | (5.651,2.703) | animated s | Fit to Be Tied | Spike is happily prancing along the backyard. He steps on a splinter a
8706 | 0.130 | (5.654,2.718) | animated | Cat Fishin' | Spike is shown guarding a lake fence while asleep. Tom shows up with h
24737 | 0.134 | (5.649,2.725) | animation | Barbecue Brawl | Spike and Tyke walk into the backyard to have a barbecue. The first at
11349 | 0.140 | (5.636,2.736) | animated s | The Dog House | Spike is busy building the doghouse of his dreams when Jerry suddenly
8374 | 0.141 | (5.640,2.735) | animated s | Hic-cup Pup | Spike is putting his son, Tyke, to bed. When a bird flies by to chirp,
15666 | 0.143 | (5.639,2.737) | romantic c | A Girl in Every Port | Spike (McLaglen) travels the world as the mate of a schooner. He has a
12045 | 0.143 | (5.640,2.737) | animated | Love That Pup | Spike is sleeping beside his son Tyke when Tyke suddenly wakes up afte
13246 | 0.144 | (5.637,2.739) | animation | Slicked-up Pup | Spike has bathed Tyke to make sure he is nice and clean, but is horrif
24762 | 0.146 | (5.640,2.741) | animated | Quiet Please! | Tom's nemesis, Spike, is trying to take a nap, but is awoken by Tom Ca
8924 | 0.151 | (5.632,2.748) | animated | Tops with Pops | Spike is sleeping beside his son Tyke when he suddenly wakes up from a

And just because I'm curious, I looked into another small cluster and found a group of Spike and Tyke animations. Another pleasant surprise.

Conclusions

Wow, this has gotten a lot longer than I ever expected, so let's end here for now. I hope you are forgiving of the super lax 'metric' I used and understanding of the reasons why, and that this sparked some ideas for your own analysis.

There are still lots of different and interesting experiments and analyses we can do on this dataset, which I leave for future articles.

If there is a specific use case you'd like to explore, please get in touch or reach out on Twitter.

Thanks for your time.

Julio
