-
Notifications
You must be signed in to change notification settings - Fork 0
/
Movie rating.Rmd
717 lines (555 loc) · 28.2 KB
/
Movie rating.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
---
output:
html_document:
toc: true
toc_float: false
toc_depth: 4
number_sections: false
code_folding: hide
code_download: true
fig_width: 9
fig_height: 4
fig_align: "center"
highlight: pygments
theme: cerulean
keep_md: true
title: "IMDB rating prediction"
subtitle: "Variable importance and prediction"
author: "by Peter Hontaru"
---
## Background
**What are the most important factors in determining the IMDB rating of a movie and can we use a multiple linear regression model to predict this rating?**
This project is based on a fictitious scenario where I’ve been hired as a data scientist at Paramount Pictures. The data presents numerous variables on movies such as audience/critic ratings, number of votes, runtime, genre, etc. Paramount Pictures is looking to gather insights into determining the acclaim of a film and other novel patterns or ideas. The data set is comprised of 651 randomly sampled movies produced and released before 2016.
```{r load-packages, message=FALSE, include=FALSE}
library(ggplot2)
library(dplyr)
library(ggmosaic)
library(ggcorrplot)
library(ggthemes)
library(broom)
library(plotly)
library(statsr)
library(tidyverse)
library(kableExtra)
library(ggpubr)
library(lubridate)
library(car)
library(broom)
library(quantreg)
knitr::opts_chunk$set(
echo = TRUE, # show all code
tidy = FALSE, # cleaner code printing
size = "small", # smaller code
fig.path = "figures/", #graphics location
out.width = "100%",
message = FALSE,
warning = FALSE
)
load("movies.Rdata")
#remove some columns that are not needed
movies <- movies%>%
select(-imdb_url, -rt_url, -director:-actor5)
```
* * *
## Part 1: Data
### generabizability
The dataset is comprised of 651 randomly sampled movies produced and released before 2016. Therefore,due to the random sampling, it can be assumed that the data is representative of all movies produced.
However, as seen below, the earliest date included is 1970 and some years do not have a significant amount of data. Thus, the data is not representative of each year within our sample and this should be considered when interpreting the results. We should also ensure to not extrapolate outside of this year range when calculating predictions.
```{r}
movies%>%
group_by(thtr_rel_year)%>%
summarise(count = n())%>%
ggplot(aes(thtr_rel_year, count))+
geom_col(col="black", fill = "dark red", width = 1)+
geom_text(aes(label = count), nudge_y = 3, size = 3)+
theme_few()+
theme(axis.text.y = element_blank(),
axis.ticks.y = element_blank())+
labs(x = NULL,
y = NULL,
title = "Number of movies included within each year")
```
### causality
Due to the observational nature of the study, no random assignment was used (test/control group), and hence causality cannot be inferred.
* * *
## Part 2: Research question
**What are the most important factors in determining the IMDB rating of a movie and can we use a multiple linear regression model to predict this rating?**
The ability to predict ratings based on film metrics could help:
- understand the popularity of a movie before the full ratings are generated
- inform us if a movie is performing better or worse than expected
- provide insight into which areas to focus for the most desirable outcome
* * *
## Part 3: Exploratory data analysis (EDA)
### Collinearity and Parsimony
We can observe high correlations between our various ratings (IMDB rating, audience score and critics score). We should consider only keeping one of our variables for our final model, but we will explore all of them throughout our EDA.
There's also a high correlation between theatre release year and DVD release year so we should only choose to include one. With the current trend of online streaming (ie. Netflix, Hulu, Amazon), physical DVDs are not as relevant and we can choose to exclude the DVD variables in favour of the theatre ones.
```{r fig.height=5, fig.width=10}
#select all to start
raw_data_corr <- select_if(movies, is.numeric)
# Compute a correlation matrix
corr <- round(cor(raw_data_corr, use="complete.obs"),2)
# Compute a matrix of correlation p-values
p.mat <- cor_pmat(raw_data_corr)
# Visualize the correlation matrix
ggcorrplot(corr, method = "square",
ggtheme = ggthemes::theme_few,
#title = "We can observe some clear patterns",
outline.col = "black",
colors = c("blue","white", "red"),
lab = TRUE,
lab_size = 2.5,
digits = 2,
type = "lower",
legend = "",
tl.cex = 8,
#show insignificant ones as blank
p.mat = p.mat,
hc.order = TRUE,
insig = "blank")
movies <- movies%>%
select(-dvd_rel_year, -dvd_rel_month, -dvd_rel_day)
```
### Which types of film are included in our analysis?
We can see that the majority of the movies were within the Feature Film category. Given that Paramount operates in this area (rather than documentaries or TV shows), we will also focus our analysis/model on this category.
```{r}
movies%>%
group_by(title_type)%>%
summarise(count = n())%>%
mutate(prop = round(count/sum(count)*100,0))%>%
ggplot(aes(title_type, prop))+
geom_col(col="black", fill = "dark red")+
geom_text(aes(label = paste(count, " - (", prop, "%)", sep = "")), nudge_y = 5, size = 5)+
theme_few()+
theme(axis.text.y = element_blank(),
axis.ticks.y = element_blank())+
labs(x = NULL,
y = NULL)
movies <- movies%>%
filter(title_type == "Feature Film")%>%
select(-title_type)
```
### Which genres are included in our analysis?
The majority of our data contains Drama movies(51%), followed by Comedy(14%) and Action and Adventure(11%). We can see that we still have a few documentaries left, even though we removed them based on type. We will also remove this subset for the same reason as before.
```{r}
movie_genre <- movies%>%
group_by(genre)%>%
summarise(count = n())%>%
mutate(proportion = round(count/sum(count)*100,1))%>%
arrange(desc(count))
movie_genre%>%
kbl(caption = "Summary of movie genres")%>%
kable_paper("hover", full_width = F)
movies <- movies%>%
filter(genre != "Documentary")
```
### Are the IMDB ratings impacted by the genre?
We can observe significant differences between the various genres and metrics such as variance, medians, range and outliers.
```{r}
ggplot(movies, aes(genre, imdb_rating, col = genre))+
geom_boxplot(show.legend = FALSE)+
geom_jitter(alpha = 0.1, show.legend = FALSE)+
coord_flip()+
labs(y = "IMDB rating")+
theme_few()+
labs(x=NULL)
```
### Is there an association between the IMDB rating and the total amount of IMDB votes?
The high-performing movies (as measured by IMDB rating) also have a higher number of votes. This trend seems exponential towards the higher IMDB ratings/votes, where the increase in total number of votes is much higher than in the low/medium ranges.
```{r}
movies%>%
ggplot(aes(imdb_rating, imdb_num_votes))+
geom_jitter(width=0.1, height = 0.1, alpha = 0.5)+
geom_smooth(se = FALSE)+
theme_few()+
labs(x = "IMDB rating",
y = "Total IMDB votes")+
theme(legend.position = "none")
```
### Do movie critics and audiences share the same taste in movies?
Judging by the variances in our scatter plots, we can observe fairly large discrepancies between how audiences and critics perceive movies. This is even more apparent in genres such as Action & Adventure and Comedy, where audiences tend to give much higher ratings than the critics.
```{r}
movies%>%
ggplot(aes(audience_score, critics_score, col = genre))+
geom_jitter(width=0.1, height = 0.1, show.legend = FALSE)+
facet_wrap(.~genre)+
theme_few()+
annotate("segment", x=-Inf, xend=Inf, y=-Inf, yend=Inf, alpha = 0.5, lty = 2)+
theme(legend.position = "none")+
labs(x = "Audience score",
y = "Critics score")
```
We can focus more on these differences by looking in average scores for each genre. Below, we can observe very large differences between the various genres.
```{r}
movies %>%
select(genre, audience_score, critics_score)%>%
group_by(genre)%>%
summarise(audience_score = mean(audience_score),
critics_score = mean(critics_score))%>%
mutate(difference_score = round((audience_score - critics_score), 1),
status = ifelse(abs(difference_score) <=3, "good (lower than 3)",
ifelse(abs(difference_score) <=5, "ok (between 3 and 5)", "bad (higher than 5)")))%>%
#create plot
ggplot(aes(reorder(genre, difference_score), difference_score, fill = status))+
geom_col(col = "black")+
scale_fill_manual(values = c("dark red", "#009E73", "gold3"))+
scale_colour_manual(values = c("dark red", "#009E73", "gold3"))+
geom_label(aes(genre, difference_score + ifelse(difference_score >= 0, +0.2, -0.2), label = difference_score), show.legend = FALSE, size = 3, fill = "white")+
labs(x=NULL,
y="Points disagreement",
title = "Rating disagreement between critics and audiences",
subtitle = "A positive score means the audience rated higher than critics",
fill = "Status")+
theme(legend.position = "top",
axis.ticks.x = element_blank(),
axis.text.x = element_blank())+
coord_flip()+
theme_few()
```
### Is there an association between the year a movie was released in and the IMDB rating?
Given that we have the number of votes for each IMDB rating, we will use a weighted mean for this comparison.
While the total number of votes increased, as shown by the size of the points, we can see that the overall rating was fairly consistent. Again, to note the relatively low sample size of movies released before ~1990.
```{r}
movies%>%
group_by(thtr_rel_year)%>%
summarise(imdb_wmean = weighted.mean(imdb_rating, imdb_num_votes),
count = n(),
votes = sum(imdb_num_votes))%>%
ggplot(aes(thtr_rel_year, imdb_wmean))+
geom_point(aes(size = votes), show.legend = FALSE)+
geom_line()+
coord_cartesian(ylim = c(0,10))+
labs(x = NULL,
y = "IMDB rating",
subtitle = "Size of point shows the relative sample of number of votes within each year")+
theme_few()
```
### Are there any associations between the IMDB rating and the month of release?
It seems that the release dates are fairly equal distributed across the year (width of the bar) as well as equal in variance. Thus, we can say that it looks like these factors are independent of each other.
```{r}
movies%>%
mutate(thtr_rel_month = as.factor(thtr_rel_month))%>%
ggplot(aes(thtr_rel_month, imdb_rating, col = thtr_rel_month))+
geom_boxplot(aes(group = thtr_rel_month), varwidth = TRUE, show.legend = FALSE)+
geom_jitter(alpha = 0.1, show.legend = FALSE)+
labs(x = "Month",
y = "IMDB rating")+
theme_few()
```
### Is there an association between the day of the week a movie was released and the IMDB rating?
Both the boxplot and table below show that almost 75% of movie releases happen on a Friday and 11% on a Wednesday. An squal split of around 2-3% each happen on the other days. While the sample sizes are different across the various weekdays, there doesn't seem to be a clear trend regarding the IMDB rating. Both audiences and critics tend to agree that movies released on Monday, Tuesday and Friday perform worse than the rest.
```{r}
day_of_week <- movies%>%
mutate(date = make_date(thtr_rel_year, thtr_rel_month, thtr_rel_day),
wday = weekdays.Date(date),
wday_num = wday(date, week_start = 1))%>%
group_by(wday, wday_num)%>%
summarise(count = n(),
votes = round(mean(imdb_num_votes)),
imdb_wmean = weighted.mean(imdb_rating, imdb_num_votes),
audience_score = round(mean(audience_score),1),
critics_score = round(mean(critics_score),1),
imdb_wmean = round(imdb_wmean, 1))%>%
ungroup()%>%
mutate(prop = round(count/sum(count)*100,1))%>%
arrange(wday_num)
#add the weekday column to our model
movies <- movies%>%
mutate(date = make_date(thtr_rel_year, thtr_rel_month, thtr_rel_day),
wday = as.factor(weekdays.Date(date)),
wday_num = wday(date, week_start = 1))%>%
select(-date)
ggplot(movies, aes(reorder(wday,wday_num), imdb_rating, col = wday))+
geom_boxplot(show.legend = FALSE, varwidth = TRUE)+
geom_jitter(alpha = 0.1, show.legend = FALSE)+
labs(x = NULL,
y = "IMDB rating")+
theme_few()
day_of_week%>%
kbl(caption = "Proportion of releases by day of week")%>%
kable_paper("hover", full_width = F)
```
### Does the runtime of the movie affect the IMDB rating?
We can see a slight trend where movies of a longer length correlate with a higher IMDB rating. However, this is not causal and it could be that this is an indirect effect of the genre of the movie (and thus, longer movies are of a favourable genre).
```{r}
ggplot(movies, aes(runtime, imdb_rating))+
geom_jitter(width = 0.1, height = 0.1, alpha = 0.75)+
geom_smooth(se=FALSE)+
theme_few()+
labs(x = "Runtime (min)",
y = "IMDB rating")
```
### Where does Paramount Pictures rank amogst the various other studios?
Based on movie averages, Paramount Pictures is ranked somewhere at the top, based on the IMDB rating/number of total votes and just above average regarding audience ratings.
```{r}
studios <- movies%>%
group_by(studio)%>%
summarise(imdb_wmean = weighted.mean(imdb_rating, imdb_num_votes),
audience_score = mean(audience_score, na.rm = TRUE),
count = n(),
avg_votes = sum(imdb_num_votes)/count)
paramount_studio <- studios%>%
filter(studio == "Paramount Pictures")
ggplot()+
geom_point(data = studios, aes(audience_score, imdb_wmean, size = avg_votes), show.legend = FALSE, alpha = 0.6)+
geom_text(data = paramount_studio, aes(audience_score, imdb_wmean, label = "Paramount"), col = "red", size = 5, nudge_y = 0.3)+
geom_point(data = paramount_studio, aes(audience_score, imdb_wmean), col = "red", size = 3.7)+
labs(x = "Audience score",
y = "IMDB rating",
subtitle = "Size of point shows the relative sample of number of votes within each year (average per studio)")+
theme_few()
```
### Is there an association between the type of an award a movie receives and the rating?
We can notice significant differences for the bottom three awards, but not nothing significant between those that received the top 3 awards in the below boxplot.
```{r}
awards <- movies%>%
select(audience_score, critics_score, imdb_rating, imdb_num_votes, best_pic_nom:top200_box)%>%
gather(Award, Status, -audience_score, -critics_score, -imdb_rating, -imdb_num_votes)
ggplot(awards, aes(Status, imdb_rating, col = Status))+
geom_boxplot()+
theme_few()+
labs(col = "Award received",
x = NULL,
y = "IMDB rating")+
facet_wrap(.~Award)+
theme(legend.position = "top")
```
The same trend can be seen below where movies that have received the bottom 3 awards tend to perform better than those who have not on either movie rating score.
```{r}
award_yes <- awards%>%
filter(Status == "yes")
ggplot(awards, aes(critics_score, imdb_rating, col = Status))+
geom_point(alpha = 0.50)+
geom_point(data = award_yes, aes(critics_score, imdb_rating), col = "dark red")+
theme_few()+
scale_colour_manual(values = c("gold3", "dark red"))+
labs(col = "Award received",
x = "Critics score",
y = "IMDB rating")+
facet_wrap(.~Award)+
theme(legend.position = "top")
```
We can also spot a trend where, generally speaking, the rarer the award, the higher the chance it would impact any of the ratings in a favourable way.
```{r}
awards_table <- awards%>%
group_by(Award, Status)%>%
summarise(count = n(),
audience_score = round(mean(audience_score),1),
critics_score = round(mean(critics_score),1),
imdb_rating = round(weighted.mean(imdb_rating, imdb_num_votes),1),
imdb_num_votes = round(mean(imdb_num_votes),0))%>%
mutate(proportion = round(count/sum(count)*100,1))%>%
arrange(proportion)
awards_table%>%
kbl(caption = "Summary of awards and movie ratings")%>%
kable_paper("hover", full_width = F)
```
* * *
## Part 4: Modeling
### Feature selection
We will be excluding the following variables from our **full model**:
- title, studio as they might cause our model to overfit
- critics/audience ratings and scores as they are highly correlated with our predicted variables (IMDB rating)
- number of IMDB ratings as it would act as a proxy for the IMDB rating (they're also correlated)
- DVD release year, month and day for reasons mention before (high correlation with theatre release dates and lack of importance given current streaming trends)
**Feature engineering**: we also added day of the week to see if there are any trends
### Selection method
We chose a backward p value adjustment to our model, given the context of the study where we could not use an automated way to reduce variables. An R adjustment would be incredibly time-consuming given the manual methodology and prone to human error. Furthermore, a p value adjustment would ensure that the factors included are significant (which is not necessarily true in an R adjustment).
```{r}
model_initial <- lm(imdb_rating ~ genre + runtime + mpaa_rating + thtr_rel_year + thtr_rel_month + thtr_rel_day + best_pic_nom + best_pic_win + best_actor_win + best_actress_win + best_dir_win + top200_box + wday, data=movies)
tidy(model_initial)%>%
arrange(term, p.value)%>%
kbl(caption = "Initial model - summary")%>%
kable_paper("hover", full_width = F)
# removed best_actor_win
model_final <- lm(imdb_rating ~ genre + runtime + mpaa_rating + thtr_rel_year + thtr_rel_month + thtr_rel_day + best_pic_nom + best_pic_win + best_actress_win + best_dir_win + top200_box + wday, data=movies)
# removed best_actress_win
model_final <- lm(imdb_rating ~ genre + runtime + mpaa_rating + thtr_rel_year + thtr_rel_month + thtr_rel_day + best_pic_nom + best_pic_win + best_dir_win + top200_box + wday, data=movies)
# removed thtr_rel_day
model_final <- lm(imdb_rating ~ genre + runtime + mpaa_rating + thtr_rel_year + thtr_rel_month + best_pic_nom + best_pic_win + best_dir_win + top200_box + wday, data=movies)
# removed thtr_rel_month
model_final <- lm(imdb_rating ~ genre + runtime + mpaa_rating + thtr_rel_year + best_pic_nom + best_pic_win + best_dir_win + top200_box + wday, data=movies)
# removed thtr_rel_year
model_final <- lm(imdb_rating ~ genre + runtime + mpaa_rating + best_pic_nom + best_pic_win + best_dir_win + top200_box + wday, data=movies)
# removed best_pic_win
model_final <- lm(imdb_rating ~ genre + runtime + mpaa_rating + best_pic_nom + best_dir_win + top200_box + wday, data=movies)
```
### Final model
**Reason for excluding certain variables**: following a backward model adjustment, the following variables were not proven to be significant and were excluded from the final model (in order of exclusion):
- best_actor_win
- best_actress_win
- thtr_rel_day
- thtr_rel_month
- thtr_rel_year
- best_pic_win
That leaves us with the following variables included in our **final model**:
- genre
- runtime
- mpaa_rating
- best_pic_nom
- best_dir_win
- top200_box
- weekday
### Interpretation of our prediction model
Overall, we can see that the model was able to capture 26% of the variability in our data.
```{r}
glance(model_final)%>%
kbl(caption = "Final model - overall summary")%>%
kable_paper("hover", full_width = F)
```
At the coefficient level, we can see that all our included predictors were significant. The most significant predictor was runtime. This was followed by whether a movie was nominated for Best Picture as well as whether it belonged to the "Drama" genre.
We can conclude that with everything else held constant:
* for each minute of additional **runtime**, it is expected that the IMDB rating increases by 0.012 points
* a movie nominated for the **Best Picture** award is estimated to have a higher rating, on average, by 0.84 points
* a movie that won the **Best Director** award is predicted to have a higher rating, on average, by 0.30 points
* movies included in the **top 200 box** are estimated to be rated higher by 0.53 points
* a movie within the Drama **genre** is expected to score 0.57 points higher than an Action and Adventure movie (the intercept). Three other genres can also be estimated to have a significantly higher rating than the intercept, based on their coefficient and p-value:
* Musical & Performing Arts
* Mystery & Suspsense
* Other
* movies with an **mpaa rating** of PG and PG-13 are estimated to score lower than G rated movies by 0.59 and 0.80, respectively
* **day of week** was also significant. A movie released on a Saturday was predicted to score higher than the intercept (Friday) by 0.42 points
```{r}
tidy(model_final)%>%
arrange(term, p.value)%>%
kbl(caption = "Final model - summary")%>%
kable_paper("hover", full_width = F)
```
To note, however, that while our model performs well in the middle regions, it significantly over-predicts for low IMDB ratings and under-predicts for high IMDB ratings.
```{r}
ggplot(model_final, aes(imdb_rating, .resid))+
geom_hline(yintercept = 0, alpha = 0.5, size = 3, color = "grey52")+
geom_jitter(alpha = 0.5, color = "blue", height = 0.1, width = 0.1)+
geom_smooth(method = "lm", lwd = 0.5, col = "red", se = FALSE)+
theme_few()+
labs(x = "IMDB rating",
y="Residuals")
```
### Model diagnostics
#### 1. linear relationship between each (numerical) explanatory variable and response
We only have one numerical variable (runtime) which is shown to have a linear relationship with our predictor variable (IMDB rating) by the random scatter in the below **residual plot**. However, to note that there weren't many movies with a runtime of over 150 minutes.
```{r}
ggplot(model_final, aes(runtime, .resid))+
geom_hline(yintercept = 0, alpha = 0.5, size = 3, color = "grey52")+
geom_point(alpha = 0.5, color = "blue")+
geom_smooth(method = "lm", lwd = 0.5, col = "red", se = FALSE)+
theme_few()+
labs(x = "Runtime",
y="Residuals")
```
#### 2. nearly normal residuals with mean 0
Despite a slight skew to the left, the majority of our residuals within the below **residual histogram** are centered around the mean 0.
```{r}
ggplot(model_final, aes(.resid))+
#geom_histogram(binwidth = 0.1, col = "black", fill = "blue")+
geom_histogram(aes(y=..density..), color="black", fill="white", lwd = 0.75) +
geom_density(alpha=0.2, fill="#FF6666") +
geom_vline(aes(xintercept=mean(.resid)), col = 'red', lwd = 1, lty = 2) +
labs(x="Residuals",
y= "Density")+
theme_few()
```
Similarly, the **Normal probability plot of residuals (QQ plot)** below shows a similar trend. Outside of the tail areas, we do not see any significant deviations from the mean.
```{r}
ggplot(model_final, aes(sample=.resid))+
stat_qq()+
stat_qq_line()+
theme_few()+
theme(legend.position = "none")
```
#### 3. constant variability of residuals
We can see that our residuals are equally variable for low and high values of the predicted response variable (IMDB rating).
```{r}
ggplot(model_final, aes(.fitted, .resid))+
geom_hline(yintercept = 0, alpha = 0.5, size = 3, color = "grey52")+
geom_point(alpha = 0.5, color = "blue")+
geom_smooth(method = "lm", lwd = 0.5, col = "red", se = FALSE)+
theme_few()+
labs(x = "IMDB rating prediction",
y="Residuals")
```
We can also plot the absolute values of the residuals as seen below. This can be thought of the above plot folded in half. Thus, a fan shape in the above plot would look as a triangle in the below plot. This is not the case here, and thus, this condition is also met.
```{r}
ggplot(model_final, aes(.fitted, abs(.resid)))+
geom_hline(yintercept = 0, alpha = 0.5, size = 3, color = "grey52")+
geom_point(alpha = 0.5, color = "blue")+
#geom_smooth(method = "lm", lwd = 0.5, col = "red", se = FALSE)+
theme_few()+
labs(x = "IMDB rating prediction",
y="Residuals")
```
#### 4. independence of residuals (and hence observations)
No apparent trend which suggests independence from the order of data collection.
```{r}
df <- augment(model_final)
ggplot(data=df, aes(x = 1:nrow(df), y = .resid)) +
labs(x = "Index",
y = "Residuals")+
geom_hline(yintercept = 0, alpha = 0.5, size = 3, color = "grey52", lty = 2)+
geom_hline(yintercept=0, col="red", linetype="dashed")+
geom_point(alpha = 0.5, color = "blue")+
geom_smooth(method = "lm", lwd = 0.5, col = "red", se = FALSE)+
theme_few()
```
* * *
## Part 5: Prediction
### Prediction and interpretation
Here are the details of the **La La Land** movie, a movie released in 2016.
*Reference - IMDB, Oscars and Box Office Mojo official websites*.
```{r}
movie_lalaland <- data.frame("genre" = c("Musical & Performing Arts"),
"runtime" = c(128),
"best_dir_win" = "yes",
"best_pic_nom" = "yes",
"mpaa_rating" = "PG-13",
"wday" = "Friday",
"top200_box" = "yes")
movie_lalaland%>%
kbl(caption = "Movie data")%>%
kable_paper("hover", full_width = F)
```
Given this data, the model predicted an IMDB rating of **8.58**. The lower and upper interval values tell us that the IMDB rating for La La Land is in the interval (7.65 and 9.50) with 95% probability.
```{r}
pred_lalaland <- predict(model_final, newdata = movie_lalaland, interval = "confidence")
pred_lalaland%>%
kbl(caption = "Predictions")%>%
kable_paper("hover", full_width = F)
```
Lastly, we can plot this data to be able to visually interpret it. We can see that it was relatively close to the prediction and within the predicted interval.
```{r}
ggplot(movies, aes(imdb_rating, imdb_num_votes))+
geom_jitter(alpha = 0.5, size = 0.75, width = 0.1, height = 0.1)+
geom_smooth(se=FALSE)+
#plot prediction
geom_point(aes(x=8.576486 , y = 513225), size = 5, col = "blue")+
geom_text(aes(x=8.576486 , y = 513225), label = "Prediction", nudge_x = 0.25, nudge_y = -75000)+
#correct value
geom_point(aes(x=8.0, y = 513225), size = 5, col = "red")+
geom_text(aes(x=8.0, y = 513225), label = "Actual", nudge_y = -55000)+
#confidence interval
geom_errorbar(aes(xmax = 9.502354, xmin = 7.650619, x =8.576486, y=513225), width = 0.5)+
geom_text(aes(x=9.2, y = 513225), label = "(CI)", col = "black", nudge_y = 48000)+
theme_few()+
labs(x = "IMDB rating",
y = "IMDB number of votes")
```
* * *
## Part 6: Conclusion
### Summary
- our linear model was only able to capture 29% of the variability in our data
- the model significantly over-predicts for low IMDB ratings and over-predicts for high IMDB ratings
- while we were able to identify significant predictors, the design of the study does not allow for causation
- critics and audiences tend to differ in their movie taste, particularly in categories such as Comedy and Action & Adventure
- some accolades are better predictors of IMDB ratings than others
- Friday might be the best day to release a movie in terms of audience access, but the movies are not the most popular (as judged by total IMDB votes or IMDB rating)
- there is an exponential relationship between the number of votes a movie receives and the IMDB rating, where highly rated movies are much more popular than normally expected
- further optimisation can be made if we have a specific goal to make a movie for either critics or audience
- similarly, popularity could also be determined by total number of votes (quantity) rather than rating (quality)
### Shortcomings/challenges
- relatively low sample sizes before 1990; our data is not representative of each year included in the dataset(but it is representative overall)
- we could use more data on factors we would normally have access before the release of the film: budget, actor/director social media influence (which might affect ratings or popularity), etc
- other types of models could be explored, particularly exponential ones
- some techniques were specifically used in the context of the course - more robust methods are available such as:
- R adjustment, as it is known to be more reliable than an arbitrary p value selection
- train/test split of our dataset
- automated backward/forward model selection (ie. through Caret), since a manual approach is susceptible to human error
- no data pre-processing