How machines can help us discover overlooked films

Feeling like I’d burned through my standard sources for movie recommendations, I recently decided to turn to box office failures. I was seeking out an automated way to explore the world of such movies and find “overlooked” films that are actually very good, but were ignored in theaters and dismissed by critics.

Using Nathan Rabin’s popular “My Year of Flops” series on The AV Club and follow-up book as a starting point, I designed an algorithm to predict whether a box office failure is actually a film worth seeing. The algorithm examines multiple aspects of a movie’s cultural response to make its prediction – such as applying sentiment analysis to capture the tone of reviews, and understanding whether critics and audiences responded differently to a movie. The output is a list of 100+ movies released over the past decade with high likelihood of being quality, “overlooked” films.

Here’s how it works…


In 1994, Forrest Gump made over $300M at the domestic box office, won six Oscars, and spawned a murderer’s row of pop culture references.

The Shawshank Redemption also came out that year. It had a confusing name, won exactly zero Oscars, and made only $16M in its initial run – an amount outdistanced by House Party 3, Kid ‘n Play’s capstone installment in their “living-situation-oriented festival” trilogy.

Yet flip on TNT on a random Saturday night, and you’re more likely to be greeted by Andy and Red than by Forrest and Jenny.

Because it flopped in theaters, people had to discover Shawshank organically on video. And not only did its reputation grow, but fans felt a sense of personal ownership and evangelism. Nearly everyone I know who’s seen the movie first watched it because of a recommendation, and fiercely loyal IMDb users have even rated it the best movie of all time.

One of the earliest customer reviews for The Shawshank Redemption.

There’s a special power to great films that people have to discover – to the quality movies that bombed (or even simply disappointed) at the box office while still in theaters.

People want to talk about these types of movies, often even more than movies that are equally good but better-known. Unlike hit movies that already have high awareness, “overlooked” movies make you want to tell your friends with a sense of urgency – “You have to check this out, trust me.” When you told your friends to see Office Space or Zoolander, you were helping someone find something great they never would have known about. You were growing the pop culture pie.

So why is it so hard to actually discover these overlooked films? Well, the reality is that most flops are flops for a reason – they aren’t very good. Who wants to spend an afternoon gutting through scenes like this to find out if Catwoman is actually a worthwhile film?

(Spoiler alert: No, it isn’t. It’s a terrible, terrible film.)


Fortunately, one man has given us a head start. In his “My Year of Flops” AV Club columns (and follow-up book), Nathan Rabin gave a second chance to forgotten, ignored, and/or reviled films throughout cinematic history. Our Waterworlds, our Giglis, our Battlefield Earths.

But rather than simply snark his way through each movie, his goal was to honestly re-evaluate each film on its own merits, and even identify flashes of greatness, if any exist.

At the conclusion of each review, a film is rated into one of three categories:

  • “Secret Success” – a genuinely good movie.
  • “Failure” – a genuinely bad movie.
  • “Fiasco” – a movie that can’t really be called “good”, but at least reaches, epically and messily, for goodness. In Rabin’s estimation, a Fiasco is actually preferable to and more enjoyable than a pure Failure.

This categorization ends up producing something more complex and subtle than simply looking for good movies by, say, Rotten Tomatoes score alone.

Median Rotten Tomatoes Critics Score among films in each category. Note that simply guessing a flop’s quality by this metric alone would miss many Secret Successes.

Note that while Secret Successes are, naturally, reviewed the most favorably, the average Secret Success would still receive a “Rotten” rating on Rotten Tomatoes. Thus, using the Rotten Tomatoes freshness rating alone would miss a great number of movies that Rabin’s re-evaluation judged to be quite good.

(The way to interpret this is that cases like Shawshank are actually relatively rare – a movie that highly regarded [90% on Rotten Tomatoes] usually gets quite a bit of attention, and soon ceases to be “overlooked”. Much more common is the case when a genuinely enjoyable movie either hits a cultural tidal wave, or comes with critical baggage that weighs down its score.)

(Zoolander is a case of the former; its 64% Rotten Tomatoes score and disappointing box office performance are understandable given its release in late September 2001. The Hudsucker Proxy is an example of the latter, suffering in comparison to the rest of the Coen brothers’ magnificent body of work, as the reviews accompanying its 56% rating make clear.)

Also, note that Fiascoes are rated more highly than Failures — an interesting validation of Rabin’s rationale for splitting the two.

The problem, of course, is that Nathan Rabin is just one man. Even after 200+ films and a spinoff “My World of Flops” series, he isn’t capable of seeing and judging all the box office disappointments out there.

In startup parlance, he “doesn’t scale.”

To evaluate the films that he wasn’t able to, I used the films in the “My Year of Flops” series as an initial dataset to train a machine learning classifier that could “teach” itself patterns that separate a Secret Success from a Failure from a Fiasco. With such a system, the thankless task of evaluating movies and fetching high-potential candidates can now be outsourced to a machine. 


One of the most useful aspects of supervised machine learning was that I could feed in dozens of pieces of information about 180+ movies from the “My Year of Flops” series, and let the system figure out for itself which information was actually predictive of a film’s rating.

And as it turns out, if you want to know if a movie is a Secret Success, a Fiasco, or a Failure, then the intrinsic qualities of a film aren’t all that helpful. You don’t need to know a movie’s stars, or the director, or the genre or runtime or budget or amount of pre-release press. Instead, if you’re seeking patterns among Secret Successes, Fiascoes, and Failures, you’re better off looking at how critics and audiences described a movie.

In retrospect, this makes sense. After all, the intrinsic aspects of a movie are incredibly complex, interrelated, and unpredictable. Movies can’t be reduced to a basic set of building blocks that ensure a quality output — in the words of legendary screenwriter William Goldman, in the movie industry “nobody knows anything.” External responses, however, present a richer, more coherent set of data to mine through – hundreds of reviews vs a handful of actors, the aggregate opinion of thousands of casual movie viewers vs a single director – and are a more logical place to search for useful patterns.

Another classic William Goldman-penned line. Not really germane, but pretty great!

Three aspects in particular emerged as useful:

  • Critical Response – Was the movie regarded well or poorly by critics?
    • This is captured by the normalized measure of a film’s Rotten Tomatoes Critics Score. All else equal, a high score indicates a Secret Success and a low score indicates a Failure.
  • Audience Differential – Was the movie liked more by critics than audiences, liked more by audiences than critics, or liked by both about equally?
    • This is captured by the normalized difference between a movie’s Flixster Audience Score and its Rotten Tomatoes Critics Score. All else equal, a large differential skewed towards preference by critics indicates a Secret Success.
  • Critical Polarity – When critics didn’t like a film, was it more common to see descriptions like “boring” and “terrible”, or descriptions like “didn’t meet its potential”? The distinction between very negative phrases and more measured phrases ends up being quite useful for distinguishing lame Failures from messy-but-aspirational Fiascoes.
    • For this, natural language processing (teaching computers to understand language) was used to programmatically “read” key sentences from Rotten reviews and derive their sentiment – positive, negative, or neutral. The movie is then assigned a score based on the frequency of strongly negative statements, from 0 (mostly negative statements) to 1 (few negative statements). All else equal, a higher frequency of negative sentences indicates a Failure.
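
As a concrete illustration of how these three signals could be computed, here is a minimal sketch in Python. The function name, the 0–100 score scale, and the sentiment-label format are my own assumptions for illustration; the article doesn’t publish its actual code.

```python
def build_features(critics_score, audience_score, review_sentiments):
    """Return the three features described above.

    critics_score     -- Rotten Tomatoes Critics Score, 0-100
    audience_score    -- Flixster Audience Score, 0-100
    review_sentiments -- list of 'positive'/'neutral'/'negative' labels,
                         one per key sentence pulled from Rotten reviews
    """
    # Critical Response: the critics score, normalized to [0, 1]
    critical_response = critics_score / 100.0

    # Audience Differential: audience score minus critics score, so a
    # negative value means critics liked the film more than audiences did
    audience_differential = (audience_score - critics_score) / 100.0

    # Critical Polarity: 1.0 means no strongly negative sentences at all,
    # 0.0 means every key sentence was strongly negative
    n_negative = sum(1 for s in review_sentiments if s == "negative")
    critical_polarity = 1.0 - n_negative / max(len(review_sentiments), 1)

    return [critical_response, audience_differential, critical_polarity]
```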

These three pieces of information “separate” each rating well when making predictions, which is something we can confirm visually.

For instance, in each chart below, the x-axis is the Critical Polarity score. The vertical bars represent the number of films with that rating around each Critical Polarity score. And the line represents the estimated probability of that rating for different values of Critical Polarity, which is what the classifier ultimately uses to help make its decisions. (The three probabilities sum to 1.0 for each value of Critical Polarity.)

This feature “separates” Failures from Fiascoes well; films with lower Critical Polarity scores (mostly harsh reviews) are more likely to be Failures, while films with higher values (more moderate reviews) are more likely to be Fiascoes or Secret Successes.

At a Polarity score of 0.2 (mostly harsh sentences), for instance, there is a significant probability of a Failure, but quite a low probability of a Fiasco or Secret Success.

Note that even though we are restricting ourselves to Rotten reviews when measuring Critical Polarity, these reviews are still much kinder to Fiascoes than Failures. Once again, the split between Failures and Fiascoes, entirely subjective labels that Nathan Rabin came up with out of intuition, is actually borne out in the data.

Having confirmed that these three characteristics of a film separate the different ratings well, we can train a classifier by feeding in the movies from the “My Year of Flops” series, along with these three pieces of information and the movie’s actual rating. When complete, the classifier can guess whether an unknown movie is most likely a Secret Success, a Fiasco, or a Failure – and recommend only those movies that are most likely Secret Successes.
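
With scikit-learn (my choice of library; the article doesn’t name one) the train-then-predict step can be sketched roughly as follows. The feature values below are toy numbers standing in for the real MYOF data:

```python
from sklearn.naive_bayes import GaussianNB

# Toy training rows: [critical_response, audience_differential, critical_polarity]
X_train = [
    [0.55, -0.15, 0.85], [0.50, -0.20, 0.80], [0.45, -0.10, 0.90],  # Secret Successes
    [0.15,  0.05, 0.20], [0.10,  0.10, 0.15], [0.20,  0.00, 0.25],  # Failures
    [0.30, -0.05, 0.60], [0.25,  0.00, 0.55], [0.35, -0.10, 0.65],  # Fiascoes
]
y_train = ["Secret Success"] * 3 + ["Failure"] * 3 + ["Fiasco"] * 3

clf = GaussianNB()
clf.fit(X_train, y_train)

# Guess the rating of a new, unseen flop from its three features
print(clf.predict([[0.52, -0.18, 0.82]])[0])
```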

How accurate is it? Overall, it works surprisingly well considering what a complex task it’s being asked to mimic: a human reviewer’s thought process.

As a first test of accuracy, the classifier was trained on a part of the MYOF series, and then asked to predict which of the remaining movies were Secret Successes. (Specifically, a Naïve-Bayes classifier was tested via 10-fold cross-validation.) The baseline I measured against was a system where any film with a “Fresh” rating on Rotten Tomatoes was assumed to be a Secret Success. Accuracy was measured by F1 score, which rewards a system for both precision (recommended films that truly are Secret Successes) and recall (true Secret Successes that get recommended).

The F1 score of the baseline was 0.33; the F1 score of our classifier was 0.65, meaning our classifier represents a significant boost above the baseline of using a movie’s “Fresh” or “Rotten” rating to make recommendations.
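
The evaluation loop can be sketched like this, again using scikit-learn, with randomly generated stand-in features and labels (the real ones come from the MYOF dataset):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
n = 180
X = rng.random((n, 3))                     # stand-in for the three features
y = (X[:, 0] + X[:, 2] > 1.1).astype(int)  # 1 = "Secret Success" (toy labeling rule)

# 10-fold cross-validation, scoring the binary "recommend it?" decision by F1
scores = cross_val_score(GaussianNB(), X, y, cv=10, scoring="f1")
print(scores.mean())
```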

The second test of accuracy was making sure the system wasn’t “overfitting” to the training set and could generalize well to new movies. After all, an election forecast can be tested against past elections during development, but the ultimate test is its performance on the real, unseen Election Day; similarly, the ultimate test for our system was to make successful predictions on new films that weren’t part of the training set. To that end, I asked the system to predict ratings for ten movies from Rabin’s follow-on “My World of Flops” series, and only recommend the movies it predicted to be Secret Successes.

Ten movies from the spinoff “My World of Flops” series were “held back” from model development, and used as a final test of accuracy. The classifier made accurate recommendations on 9 of 10 movies, including identifying the lone Secret Success.

The classifier made accurate recommendations for 9 of the 10 movies, including recommending the only Secret Success in the group (despite its 42% Critics Score on Rotten Tomatoes). This gives us confidence that the patterns found in the MYOF dataset can generalize well to a new list of movies.


With the classifier working, I began searching for new “candidates” to feed in and evaluate. As a start, I used data from Box Office Mojo to find movies released between 2003 and 2011 that earned back less than 50% of their budget at the domestic box office. Since I wanted to exclude small films that never saw wide release, I only included films that cost at least $7.5M to produce.
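
In pandas terms, that filter looks something like the sketch below; the DataFrame and its column names are hypothetical stand-ins for the scraped Box Office Mojo data:

```python
import pandas as pd

films = pd.DataFrame([
    {"title": "Flop A", "year": 2005, "budget": 60e6, "domestic_gross": 20e6},
    {"title": "Hit B",  "year": 2008, "budget": 50e6, "domestic_gross": 120e6},
    {"title": "Tiny C", "year": 2010, "budget": 2e6,  "domestic_gross": 0.5e6},
])

candidates = films[
    films["year"].between(2003, 2011)
    & (films["domestic_gross"] < 0.5 * films["budget"])  # earned back < 50% of budget
    & (films["budget"] >= 7.5e6)                         # exclude small releases
]
print(candidates["title"].tolist())
```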

This produced about 250 candidates released between 2003 and 2011. Of that set, about 100 were identified and recommended as likely overlooked films – enough for two years of weekly viewing.

A Better Life
A Very Long Engagement
American Dreamz
Another Year
Attack the Block
Bad Lieutenant: Port of Call New Orleans
Barney’s Version
Bee Season
Brighton Rock
Buffalo Soldiers
Charlie Bartlett
Children of Men
City of Ember
City of Life and Death
Coco Before Chanel
Fair Game
Fantastic Mr. Fox
Find Me Guilty
Flags of Our Fathers
Flushed Away
Home on the Range
How Do You Know
I Capture the Castle
I Love You Phillip Morris
Lars and the Real Girl
Leaves of Grass
Little Children
Looney Tunes: Back in Action
Lord of War
Lords of Dogtown
Lust, Caution
Marie Antoinette
Married Life
Me and Orson Welles
Moonlight Mile
Never Let Me Go
Owning Mahowny
Peter Pan
Pirate Radio (The Boat That Rocked)
Red Riding Trilogy
Shoot ‘Em Up
Solitary Man
Straw Dogs
Synecdoche, New York
Taking Woodstock
The Assassination of Jesse James by the Coward Robert Ford
The Big Year
The Company Men
The Conspirator
The Door in the Floor
The Express
The Ghost Writer
The Good Thief
The Good, the Bad, and the Weird
The Hoax
The Human Stain
The Imaginarium of Doctor Parnassus
The Joneses
The Killer Inside Me
The Last Station
The Life Aquatic with Steve Zissou
The Lookout
The New World
The Pianist
The Prize Winner of Defiance, Ohio
The Quiet American
The Road
The Rum Diary
The Runaways
The Secret of Kells
The Three Burials of Melquiades Estrada
The Tree of Life
The World’s Fastest Indian
The Yellow Handkerchief
The Young Victoria
Treasure Planet
William Shakespeare’s The Merchant of Venice
You Will Meet a Tall Dark Stranger
5 Days Of War
A Good Year
A Sound of Thunder
Aeon Flux
After the Sunset
Alex & Emma
An Unfinished Life
Anything Else
Around the World in 80 Days
Arthur and the Invisibles
Astro Boy
Babylon A.D.
Bangkok Dangerous
Beyond Borders
Boat Trip
Bobby Jones: Stroke of Genius
Brooklyn Rules
Bulletproof Monk
Casino Jack
Cirque du Freak: The Vampire’s Assistant
Code Name: The Cleaner
Conan the Barbarian
Connie And Carla
Dragon Wars
Dream House
Duplex (Our House)
Flight of the Phoenix
Fly Me to the Moon
From Paris with Love
Furry Vengeance
Gentlemen Broncos
Gods and Generals
Green Zone
Gulliver’s Travels
Hollywood Homicide
Hoodwinked Too! Hood vs Evil
How to Lose Friends & Alienate People
I Don’t Know How She Does It
I Spy
Imagine That
Jonah Hex
Journeys In Black: Luther Vandross
Killer Elite
King Arthur
Kingdom of Heaven
Lions for Lambs
Love Ranch
Machine Gun Preacher
Marci X
Mars Needs Moms
Meet Dave
Miami Vice
Miracle at St. Anna
Mozart and the Whale
New York Minute
Prince of Persia: The Sands of Time
Princess Ka’iulani
Repo Men
Running Scared
Sex Drive
Sinbad – Legend of the Seven Seas
Son of the Mask
Suspect Zero
The Alamo
The Black Dahlia
The Brothers Grimm
The City of Your Final Destination
The Devil’s Double
The Game of Their Lives
The Golden Compass
The Good Guy
The Great Raid
The Invasion
The Island
The Life Before Her Eyes
The Lost City
The Producers
The Rocker
The Sorcerer’s Apprentice
The Statement
The Tempest
The Thing
The Three Musketeers
The Warrior’s Way
The Whole Ten Yards
The Wolfman
The Work and the Glory
There Be Dragons
Veronica Guerin
Welcome To Mooseport
Wild Target

There are likely more candidates that can be evaluated; the biggest limitation was finding movies with a known production budget. As a future effort, I may scan IMDb and Wikipedia to find more movies.

And while pleased with this start, I also wonder what additional refinements to the algorithm can be considered.

In particular, a movie like Dirty Work stymied the classifier, continuously coming up as a Failure despite its true Secret Success rating.

The model was never really able to grasp the concept that, yes, Dirty Work can simultaneously be a terrible film and still so damn funny. But while self-driving cars and election predictions are making impressive strides, I don’t know if Bayesian inference will ever be able to square that circle.

(Questions, comments, or suggestions? @ajaymkalia on Twitter or e-mail me at skynetandebert at gmail dot com)
