Is Fandango Still Inflating Ratings?

In this project, we revisit the Fandango ratings project and ask whether the ratings have changed since then.

Let's briefly recap the problem. In October 2015, a data journalist named Walt Hickey analyzed movie ratings data and found strong evidence to suggest that Fandango's rating system was biased and dishonest (Fandango is an online movie ratings aggregator). He published his analysis in a FiveThirtyEight article, a great piece of data journalism that is well worth reading.

In this project, we'll analyze more recent movie ratings data to determine whether there has been any change in Fandango's rating system after Hickey's analysis.

One of the best ways to figure out whether there has been any change in Fandango's rating system after Hickey's analysis is to compare the system's characteristics before and after the analysis. Fortunately, we have ready-made data for both periods:

  • Walt Hickey made the data he analyzed publicly available on GitHub. We'll use the data he collected to analyze the characteristics of Fandango's rating system before his analysis.

  • A GitHub user also collected movie ratings data for movies released in 2016 and 2017. The data is publicly available on GitHub, and we'll use it to analyze the rating system's characteristics after Hickey's analysis.

We will use these datasets to answer our question of interest: is Fandango still inflating ratings?

In [1]:
import pandas as pd
fandango_before = pd.read_csv('fandango_score_comparison.csv')
fandango_16_17 = pd.read_csv('movie_ratings_16_17.csv')


print(fandango_before.head())
                             FILM  RottenTomatoes  RottenTomatoes_User  \
0  Avengers: Age of Ultron (2015)              74                   86   
1               Cinderella (2015)              85                   80   
2                  Ant-Man (2015)              80                   90   
3          Do You Believe? (2015)              18                   84   
4   Hot Tub Time Machine 2 (2015)              14                   28   

   Metacritic  Metacritic_User  IMDB  Fandango_Stars  Fandango_Ratingvalue  \
0          66              7.1   7.8             5.0                   4.5   
1          67              7.5   7.1             5.0                   4.5   
2          64              8.1   7.8             5.0                   4.5   
3          22              4.7   5.4             5.0                   4.5   
4          29              3.4   5.1             3.5                   3.0   

   RT_norm  RT_user_norm         ...           IMDB_norm  RT_norm_round  \
0     3.70           4.3         ...                3.90            3.5   
1     4.25           4.0         ...                3.55            4.5   
2     4.00           4.5         ...                3.90            4.0   
3     0.90           4.2         ...                2.70            1.0   
4     0.70           1.4         ...                2.55            0.5   

   RT_user_norm_round  Metacritic_norm_round  Metacritic_user_norm_round  \
0                 4.5                    3.5                         3.5   
1                 4.0                    3.5                         4.0   
2                 4.5                    3.0                         4.0   
3                 4.0                    1.0                         2.5   
4                 1.5                    1.5                         1.5   

   IMDB_norm_round  Metacritic_user_vote_count  IMDB_user_vote_count  \
0              4.0                        1330                271107   
1              3.5                         249                 65709   
2              4.0                         627                103660   
3              2.5                          31                  3136   
4              2.5                          88                 19560   

   Fandango_votes  Fandango_Difference  
0           14846                  0.5  
1           12640                  0.5  
2           12055                  0.5  
3            1793                  0.5  
4            1021                  0.5  

[5 rows x 22 columns]
In [2]:
print(fandango_16_17.head())
                     movie  year  metascore  imdb  tmeter  audience  fandango  \
0      10 Cloverfield Lane  2016         76   7.2      90        79       3.5   
1                 13 Hours  2016         48   7.3      50        83       4.5   
2      A Cure for Wellness  2016         47   6.6      40        47       3.0   
3          A Dog's Purpose  2017         43   5.2      33        76       4.5   
4  A Hologram for the King  2016         58   6.1      70        57       3.0   

   n_metascore  n_imdb  n_tmeter  n_audience  nr_metascore  nr_imdb  \
0         3.80    3.60      4.50        3.95           4.0      3.5   
1         2.40    3.65      2.50        4.15           2.5      3.5   
2         2.35    3.30      2.00        2.35           2.5      3.5   
3         2.15    2.60      1.65        3.80           2.0      2.5   
4         2.90    3.05      3.50        2.85           3.0      3.0   

   nr_tmeter  nr_audience  
0        4.5          4.0  
1        2.5          4.0  
2        2.0          2.5  
3        1.5          4.0  
4        3.5          3.0  

We will first isolate the columns that offer information about Fandango's ratings for both of the datasets and then proceed with our analysis.

In [3]:
fd_rev = fandango_before[['FILM','Fandango_Stars','Fandango_Ratingvalue','Fandango_votes','Fandango_Difference']]
fd_rev.head()
Out[3]:
FILM Fandango_Stars Fandango_Ratingvalue Fandango_votes Fandango_Difference
0 Avengers: Age of Ultron (2015) 5.0 4.5 14846 0.5
1 Cinderella (2015) 5.0 4.5 12640 0.5
2 Ant-Man (2015) 5.0 4.5 12055 0.5
3 Do You Believe? (2015) 5.0 4.5 1793 0.5
4 Hot Tub Time Machine 2 (2015) 3.5 3.0 1021 0.5
In [4]:
fd_16_17_rev = fandango_16_17[['movie','year','fandango']]
fd_16_17_rev.head()
Out[4]:
movie year fandango
0 10 Cloverfield Lane 2016 3.5
1 13 Hours 2016 4.5
2 A Cure for Wellness 2016 3.0
3 A Dog's Purpose 2017 4.5
4 A Hologram for the King 2016 3.0

Now that we have isolated the columns that offer information about Fandango's ratings in separate variables, we can define the initial goal: to determine whether Fandango's ratings changed after Hickey's analysis. The population of interest is then all movies rated on Fandango, together with their ratings, regardless of release year.

Now let's look at the data we currently have. The two datasets are samples collected at different points in time, one before Hickey's analysis and the other after it. Neither sample, however, is representative of the population.

Firstly, looking at the README.md of the first dataset, we observe that the sample is not random because:

a) The sample contains only films that have at least 30 fan reviews on Fandango.

b) The sample was collected on August 24, 2015.

Similarly, looking at the README.md of the second dataset, we observe that the sample is not random because:

a) The sample contains only popular movies with a significant number of votes, released in 2016 and 2017. It is unclear what counts as a significant number.

b) The sample contains data up to March 22, 2017.

At this point, we have at least two alternatives: either we collect new data, or we change the goal of our analysis by placing some limitations on it.

Tweaking our goal is much faster than collecting new data. Also, it is practically impossible to collect a new sample from before Hickey's analysis at this point in time.

Changing the Goal of our Analysis

Thus, we will slightly modify our goal: is there any difference between Fandango's ratings for movies released in 2015 and those released in 2016?

The two samples will then be: 1) all popular movies released in 2015, and 2) all popular movies released in 2016.

The term "popular" is vague and we need to define it with precision before continuing. We'll use Hickey's benchmark of 30 fan ratings and consider a movie as "popular" only if it has 30 fan ratings or more on Fandango's website.

Now, we will check whether both samples contain popular movies, that is, whether all (or at least most) sample points are movies with 30 or more fan ratings on Fandango's website. One of the datasets doesn't provide information about the number of fan ratings, which raises representativity issues once again.

To check this, we will draw a random sample of 10 movies from the second dataset and see whether they have 30 or more fan ratings.

In [5]:
fd_16_17_rev.sample(10, random_state = 1)
Out[5]:
movie year fandango
108 Mechanic: Resurrection 2016 4.0
206 Warcraft 2016 4.0
106 Max Steel 2016 3.5
107 Me Before You 2016 4.5
51 Fantastic Beasts and Where to Find Them 2016 4.5
33 Cell 2016 3.0
59 Genius 2016 3.5
152 Sully 2016 4.5
4 A Hologram for the King 2016 3.0
31 Captain America: Civil War 2016 4.5

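Since the dataset itself has no column for the number of fan ratings, the counts for these ten movies have to be checked on Fandango's website. 8 out of 10 of the sampled movies have 30 or more fan ratings, so we can be reasonably confident that the second dataset consists mostly of popular movies.

Now, we will check whether the first dataset contains any movies with fewer than 30 fan ratings, i.e. non-popular movies.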

In [6]:
fd_rev[fd_rev['Fandango_votes']<30].shape[0]
Out[6]:
0

Thus, we can confirm that the first dataset also contains only popular movies. Both datasets, however, contain movies released in various years. We will now isolate the samples so that they contain only movies released in 2015 and 2016, respectively.

Isolating the Samples We Need

We will now:

  • Isolate the movies released in 2015 in a separate dataset.
  • Isolate the movies released in 2016 in a separate dataset.

We start with the first dataset and filter it on 2015. It has no dedicated column for the release year, but the FILM column includes the year in which each movie was released, so we can use it to select the 2015 releases.

In [7]:
fd_2015 = fd_rev[fd_rev['FILM'].str.contains('2015')]
fd_2015
Out[7]:
FILM Fandango_Stars Fandango_Ratingvalue Fandango_votes Fandango_Difference
0 Avengers: Age of Ultron (2015) 5.0 4.5 14846 0.5
1 Cinderella (2015) 5.0 4.5 12640 0.5
2 Ant-Man (2015) 5.0 4.5 12055 0.5
3 Do You Believe? (2015) 5.0 4.5 1793 0.5
4 Hot Tub Time Machine 2 (2015) 3.5 3.0 1021 0.5
5 The Water Diviner (2015) 4.5 4.0 397 0.5
6 Irrational Man (2015) 4.0 3.5 252 0.5
8 Shaun the Sheep Movie (2015) 4.5 4.0 896 0.5
9 Love & Mercy (2015) 4.5 4.0 864 0.5
10 Far From The Madding Crowd (2015) 4.5 4.0 804 0.5
11 Black Sea (2015) 4.0 3.5 218 0.5
15 Taken 3 (2015) 4.5 4.1 6757 0.4
16 Ted 2 (2015) 4.5 4.1 6437 0.4
17 Southpaw (2015) 5.0 4.6 5597 0.4
19 Pixels (2015) 4.5 4.1 3886 0.4
20 McFarland, USA (2015) 5.0 4.6 3364 0.4
21 Insidious: Chapter 3 (2015) 4.5 4.1 3276 0.4
22 The Man From U.N.C.L.E. (2015) 4.5 4.1 2686 0.4
23 Run All Night (2015) 4.5 4.1 2066 0.4
24 Trainwreck (2015) 4.5 4.1 8381 0.4
26 Ex Machina (2015) 4.5 4.1 3458 0.4
27 Still Alice (2015) 4.5 4.1 1258 0.4
29 The End of the Tour (2015) 4.5 4.1 121 0.4
30 Red Army (2015) 4.5 4.1 54 0.4
31 When Marnie Was There (2015) 4.5 4.1 46 0.4
32 The Hunting Ground (2015) 4.5 4.1 42 0.4
33 The Boy Next Door (2015) 4.0 3.6 2800 0.4
34 Aloha (2015) 3.5 3.1 2284 0.4
35 The Loft (2015) 4.0 3.6 811 0.4
36 5 Flights Up (2015) 4.0 3.6 79 0.4
... ... ... ... ... ...
115 While We're Young (2015) 3.0 2.9 449 0.1
116 Clouds of Sils Maria (2015) 3.5 3.4 162 0.1
117 Testament of Youth (2015) 4.0 3.9 127 0.1
118 Infinitely Polar Bear (2015) 4.0 3.9 124 0.1
119 Phoenix (2015) 3.5 3.4 70 0.1
120 The Wolfpack (2015) 3.5 3.4 66 0.1
121 The Stanford Prison Experiment (2015) 4.0 3.9 51 0.1
122 Tangerine (2015) 4.0 3.9 36 0.1
123 Magic Mike XXL (2015) 4.5 4.4 9363 0.1
124 Home (2015) 4.5 4.4 7705 0.1
125 The Wedding Ringer (2015) 4.5 4.4 6506 0.1
126 Woman in Gold (2015) 4.5 4.4 2435 0.1
127 The Last Five Years (2015) 4.5 4.4 99 0.1
128 Mission: Impossible – Rogue Nation (2015) 4.5 4.4 8357 0.1
129 Amy (2015) 4.5 4.4 729 0.1
130 Jurassic World (2015) 4.5 4.5 34390 0.0
131 Minions (2015) 4.0 4.0 14998 0.0
132 Max (2015) 4.5 4.5 3412 0.0
133 Paul Blart: Mall Cop 2 (2015) 3.5 3.5 3054 0.0
134 The Longest Ride (2015) 4.5 4.5 2603 0.0
135 The Lazarus Effect (2015) 3.0 3.0 1651 0.0
136 The Woman In Black 2 Angel of Death (2015) 3.0 3.0 1333 0.0
137 Danny Collins (2015) 4.0 4.0 531 0.0
138 Spare Parts (2015) 4.5 4.5 450 0.0
139 Serena (2015) 3.0 3.0 50 0.0
140 Inside Out (2015) 4.5 4.5 15749 0.0
141 Mr. Holmes (2015) 4.0 4.0 1348 0.0
142 '71 (2015) 3.5 3.5 192 0.0
144 Gett: The Trial of Viviane Amsalem (2015) 3.5 3.5 59 0.0
145 Kumiko, The Treasure Hunter (2015) 3.5 3.5 41 0.0

129 rows × 5 columns
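
The substring filter above keeps any row whose FILM value contains "2015" anywhere, including inside a title. A slightly stricter variant is sketched below, assuming every FILM value ends with the release year in parentheses, as the rows shown above do. It parses the year into its own column first. The `Year` column name and the third title are made up for illustration and are not part of the dataset.

```python
import pandas as pd

# Two titles from the dataset plus one hypothetical title that
# contains "2015" but carries a different release year.
films = pd.DataFrame({
    'FILM': [
        'Avengers: Age of Ultron (2015)',
        'Cinderella (2015)',
        'Hypothetical 2015 Diaries (2014)',  # invented for illustration
    ]
})

# Capture the four digits inside the trailing parentheses.
films['Year'] = (
    films['FILM']
    .str.extract(r'\((\d{4})\)\s*$', expand=False)
    .astype(int)
)

# Filter on the parsed year instead of on the raw title text.
only_2015 = films[films['Year'] == 2015]
print(only_2015['FILM'].tolist())
```

With the parsed column, the invented 2014 title is excluded even though its name contains "2015".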

Now, we will isolate the movies released in 2016 from the other dataset.

In [8]:
fd_2016 = fd_16_17_rev[fd_16_17_rev['year'] == 2016]
fd_2016['year'].value_counts()
Out[8]:
2016    191
Name: year, dtype: int64
In [9]:
fd_2016
Out[9]:
movie year fandango
0 10 Cloverfield Lane 2016 3.5
1 13 Hours 2016 4.5
2 A Cure for Wellness 2016 3.0
4 A Hologram for the King 2016 3.0
5 A Monster Calls 2016 4.0
6 A Street Cat Named Bob 2016 4.5
7 Alice Through the Looking Glass 2016 4.0
8 Allied 2016 4.0
9 Amateur Night 2016 3.5
10 Anthropoid 2016 4.0
11 Approaching the Unknown 2016 3.5
12 Arrival 2016 4.0
14 Assassin's Creed 2016 4.0
15 Bad Moms 2016 4.5
16 Bad Santa 2 2016 3.5
17 Barbershop: The Next Cut 2016 4.5
18 Batman V Superman: Dawn of Justice 2016 4.0
21 Before the Flood 2016 3.5
22 Ben-Hur 2016 4.0
24 Blair Witch 2016 3.0
25 Bleed for This 2016 4.0
26 Blood Father 2016 4.0
27 Bridget Jones's Baby 2016 4.0
28 Busanhaeng 2016 4.5
29 Cabin Fever 2016 4.0
30 Cafe Society 2016 3.5
31 Captain America: Civil War 2016 4.5
32 Captain Fantastic 2016 4.0
33 Cell 2016 3.0
34 Central Intelligence 2016 4.5
... ... ... ...
178 The Girl with All the Gifts 2016 4.0
179 The Great Wall 2016 4.0
180 The Huntsman: Winter's War 2016 4.0
181 The Infiltrator 2016 4.0
182 The Jungle Book 2016 4.5
184 The Legend of Tarzan 2016 4.5
186 The Light Between Oceans 2016 4.0
187 The Magnificent Seven 2016 4.5
188 The Neon Demon 2016 3.5
189 The Nice Guys 2016 3.5
190 The Other Side of the Door 2016 3.5
191 The Perfect Match 2016 4.0
192 The Purge: Election Year 2016 4.0
193 The Secret Life of Pets 2016 4.0
195 The Shallows 2016 4.0
197 The Take (Bastille Day) 2016 4.0
198 The Whole Truth 2016 3.0
199 The Wild Life 2016 3.0
200 Triple 9 2016 3.5
201 Trolls 2016 4.5
202 Under the Shadow 2016 4.0
203 Underworld: Blood Wars 2016 4.0
204 War Dogs 2016 4.0
205 War on Everyone 2016 4.0
206 Warcraft 2016 4.0
207 Whiskey Tango Foxtrot 2016 3.5
208 Why Him? 2016 4.0
209 X-Men: Apocalypse 2016 4.0
212 Zoolander 2 2016 2.5
213 Zootopia 2016 4.5

191 rows × 3 columns

We can now start analyzing the two samples we isolated before. Once again, our goal is to determine whether there's any difference between Fandango's ratings for popular movies in 2015 and Fandango's ratings for popular movies in 2016.

Comparing Distribution Shapes for 2015 and 2016

In [10]:
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
%matplotlib inline
plt.style.use('fivethirtyeight')

plt.figure(figsize = (8,5.5))
fd_2015['Fandango_Stars'].plot.kde(label = '2015_Rating', legend = True)
fd_2016['fandango'].plot.kde(label = '2016_Rating', legend = True)
plt.xticks(np.arange(0, 5.1, 0.5))  # stop at 5.1 so the tick at 5.0 is included
plt.xlim((0,5.0))
plt.xlabel('Stars')
plt.title('Comparing Distribution Shapes for 2015 and 2016')
Out[10]:
<matplotlib.text.Text at 0x7f7f9bba1780>

Based on the above kernel density plots, we can analyze:

  • Shape of the distribution

Both distributions are left skewed, since the tail extends in the negative direction (toward lower ratings).

  • How do their shapes compare?

The 2016 distribution is shifted slightly to the left relative to the 2015 distribution.

  • Evidence that there is indeed a change between Fandango's ratings for popular movies in 2015 and 2016

The 2016 distribution peaks at 4.0, whereas the 2015 distribution peaks at 4.5. Thus, there is indeed a difference between Fandango's ratings for popular movies in 2015 and Fandango's ratings for popular movies in 2016.

We can also see the direction of the difference: movies in 2016 were rated slightly lower than movies in 2015. While comparing the distributions with the help of the kernel density plots was a great start, we now need to analyze more granular information.
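
The skew read off the plots can also be checked numerically. The sketch below is not part of the original notebook: it rebuilds both samples from their per-rating counts (the ones listed in the frequency tables of the next section) and calls pandas' `Series.skew()`, where a negative value indicates a left-skewed distribution. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Rebuild each sample from its (rating, count) frequency table.
stars_2015 = pd.Series(
    np.repeat([3.0, 3.5, 4.0, 4.5, 5.0], [11, 23, 37, 49, 9])
)
stars_2016 = pd.Series(
    np.repeat([2.5, 3.0, 3.5, 4.0, 4.5, 5.0], [6, 14, 46, 77, 47, 1])
)

# Sample skewness: below zero means the longer tail is on the left.
skew_2015 = stars_2015.skew()
skew_2016 = stars_2016.skew()
print(round(skew_2015, 2), round(skew_2016, 2))
```

Both values come out negative, which agrees with the left skew visible in the density plots.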

Comparing Relative Frequencies

Let's explore the frequency distribution tables of the two distributions.

In [11]:
fd_2015['Fandango_Stars'].value_counts().sort_index()
Out[11]:
3.0    11
3.5    23
4.0    37
4.5    49
5.0     9
Name: Fandango_Stars, dtype: int64
In [12]:
fd_2016['fandango'].value_counts().sort_index()
Out[12]:
2.5     6
3.0    14
3.5    46
4.0    77
4.5    47
5.0     1
Name: fandango, dtype: int64

We cannot compare the two frequency tables directly, since the datasets contain different numbers of movies. It makes more sense to compute the relative frequency distribution for each dataset and compare those instead.

Let's calculate the relative frequencies for each dataset.

In [13]:
(fd_2015['Fandango_Stars'].value_counts(normalize= True)*100).sort_index()
Out[13]:
3.0     8.527132
3.5    17.829457
4.0    28.682171
4.5    37.984496
5.0     6.976744
Name: Fandango_Stars, dtype: float64
In [14]:
(fd_2016['fandango'].value_counts(normalize = True)*100).sort_index()
Out[14]:
2.5     3.141361
3.0     7.329843
3.5    24.083770
4.0    40.314136
4.5    24.607330
5.0     0.523560
Name: fandango, dtype: float64
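
To read the two tables against each other, it can help to put them side by side. The sketch below is an optional addition that starts from the counts printed in Out[11] and Out[12] rather than from the CSV files. The `change` column and the `high_share` name are illustrative choices, not part of the original analysis.

```python
import pandas as pd

# Frequency tables copied from Out[11] and Out[12].
counts_2015 = pd.Series({3.0: 11, 3.5: 23, 4.0: 37, 4.5: 49, 5.0: 9})
counts_2016 = pd.Series({2.5: 6, 3.0: 14, 3.5: 46, 4.0: 77, 4.5: 47, 5.0: 1})

# Percentages per rating; a rating missing in one year becomes 0.
pct = pd.DataFrame({
    '2015': counts_2015 / counts_2015.sum() * 100,
    '2016': counts_2016 / counts_2016.sum() * 100,
}).fillna(0.0).sort_index()

# Difference in percentage points, 2016 minus 2015.
pct['change'] = pct['2016'] - pct['2015']

# Combined share of very high ratings (4.5 and 5.0 stars) per year.
high_share = pct[pct.index >= 4.5][['2015', '2016']].sum()

print(pct.round(2))
print(high_share.round(2))
```

Read this way, the share of movies rated 4.5 stars or higher falls from about 45% in 2015 to about 25% in 2016.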

Looking at the 2016 relative frequency distribution, the Fandango ratings span a slightly larger range, from 2.5 to 5.0, whereas for 2015 they span a shorter range starting at 3.0. The share of 5-star ratings also fell from ~7% in 2015 to under 1% in 2016. As in the kernel density plots, the 2015 distribution peaks at 4.5 whereas the 2016 distribution peaks at 4.0.

However, the direction of the difference is not as clear as it was on the kernel density plots.

Confirming the direction of difference with summary stats

We'll take a couple of summary statistics to get a more precise picture about the direction of the difference. We'll take each distribution of movie ratings and compute its mean, median, and mode, and then compare these statistics to determine what they tell about the direction of the difference.

In [15]:
mean_fd_2015 = fd_2015['Fandango_Stars'].mean()
med_fd_2015 = fd_2015['Fandango_Stars'].median()
mode_fd_2015 = fd_2015['Fandango_Stars'].mode().iloc[0]

mean_fd_2016 = fd_2016['fandango'].mean()
med_fd_2016 = fd_2016['fandango'].median()
mode_fd_2016 = fd_2016['fandango'].mode().iloc[0]

We will analyze the above summary stats using a grouped bar plot and see how the summary stats differ for 2015 and 2016. For this, let's first create a dataframe to store the summary stats calculated.

In [16]:
summary_df = pd.DataFrame( index = ['mean','median','mode'])

summary_df['2015'] = [mean_fd_2015,med_fd_2015,mode_fd_2015]
summary_df['2016'] = [mean_fd_2016,med_fd_2016,mode_fd_2016]

summary_df.head()
Out[16]:
2015 2016
mean 4.085271 3.887435
median 4.000000 4.000000
mode 4.500000 4.000000

Now, we will create a grouped bar plot.

In [17]:
plt.style.use('fivethirtyeight')
plt.figure(figsize=(8,5))


summary_df['2015'].plot.bar(width = 0.3, align = 'center', label = '2015', color = 'blue')
summary_df['2016'].plot.bar(width = 0.3,align = 'edge', label = '2016' ,color = '#cc0000')
plt.ylim((0.0,5.0))
plt.yticks(np.arange(0.0,5.5,0.5))
plt.title('Comparing summary statistics: 2015 vs 2016',y = 1.08, fontdict= {'size':'18','fontweight':'normal'})
plt.ylabel('Stars')
plt.xticks(rotation=0)
plt.legend(loc = 'upper center')
Out[17]:
<matplotlib.legend.Legend at 0x7f7f9ba89f60>
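
As an alternative to overlaying two bar plots with different alignments, pandas can draw grouped bars directly from the DataFrame. The sketch below is one possible variant, not the notebook's own method. It rebuilds `summary_df` from the values shown in Out[16] so that it runs on its own.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Summary statistics copied from Out[16].
summary_df = pd.DataFrame(
    {'2015': [4.085271, 4.0, 4.5], '2016': [3.887435, 4.0, 4.0]},
    index=['mean', 'median', 'mode'],
)

# One call draws a group of bars per row and one bar per column.
ax = summary_df.plot.bar(rot=0, color=['blue', '#cc0000'], figsize=(8, 5))
ax.set_ylim(0, 5)
ax.set_ylabel('Stars')
ax.set_title('Comparing summary statistics: 2015 vs 2016')
ax.legend(loc='upper center')
```

This keeps the bar positions and the legend in sync automatically, at the cost of less control over the width and offset of each series.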

Looking at the grouped bar plot, we can see that the mean dropped by about 0.2 stars in 2016, a decrease of almost 5% relative to the 2015 mean. The median is the same for both years, while the mode dropped from 4.5 to 4.0. Taken together, these statistics confirm that the 2016 ratings are shifted slightly to the left of the 2015 ratings, that is, toward lower values.
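
The size of the change in the mean can be checked directly from the frequency tables in Out[11] and Out[12]. The sketch below rebuilds the two samples from those counts and computes the absolute and relative difference. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Rebuild each sample from its (rating, count) frequency table.
stars_2015 = pd.Series(
    np.repeat([3.0, 3.5, 4.0, 4.5, 5.0], [11, 23, 37, 49, 9])
)
stars_2016 = pd.Series(
    np.repeat([2.5, 3.0, 3.5, 4.0, 4.5, 5.0], [6, 14, 46, 77, 47, 1])
)

mean_2015 = stars_2015.mean()
mean_2016 = stars_2016.mean()

# Absolute drop in stars and the same drop relative to the 2015 mean.
abs_drop = mean_2015 - mean_2016
pct_drop = abs_drop / mean_2015 * 100

print(f'{abs_drop:.3f} stars, {pct_drop:.1f}%')  # prints: 0.198 stars, 4.8%
```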

Conclusion

In conclusion, our analysis showed that there's indeed a slight difference between Fandango's ratings for popular movies in 2015 and Fandango's ratings for popular movies in 2016. We also determined that, on average, popular movies released in 2016 were rated lower on Fandango than popular movies released in 2015.