Modeling and prediction for movies

Setup

Load packages

#install.packages("psych")
library(psych)
library(ggplot2)
library(dplyr)
library(statsr)

Load data

load("movies.Rdata")

Part 1: Data

A study's conclusions generalize to a population only if the study uses random sampling. The data set comprises 651 randomly sampled movies produced and released before 2016, with data collected from IMDB and Rotten Tomatoes, two popular movie review websites. Because the movies were randomly sampled, the conclusions generalize to all movies produced and released before 2016.

The data collection did not use random assignment, so the results cannot be used to draw causal conclusions.

Part 2: Research question

IMDB and Rotten Tomatoes are popular movie review websites where both critics and audiences rate movies. A movie often receives high ratings even though its director, actors, and actresses have never won an Oscar. We are interested in the following research question, which addresses the relationship between winning an Oscar and ratings on movie review websites.

Research question: If a movie wins an Oscar, or is made by an Oscar-winning director, actor, or actress, does it tend to have higher ratings on movie review websites?

Part 3: Exploratory data analysis

We will create a new variable, score, from the following variables to indicate a movie's popularity on movie review websites.

imdb_rating: Rating on IMDB

critics_score: Critics score on Rotten Tomatoes

audience_score: Audience score on Rotten Tomatoes

imdb_rating is on a 10-point scale, while the other two variables are on a 100-point scale. We multiply imdb_rating by 10 to put all three on a common scale and then average them to calculate score.

\[ score= (imdb\_rating*10 + critics\_score+audience\_score)/3\]

We will use score as our only response variable.

score<-(movies$imdb_rating*10+movies$audience_score+movies$critics_score)/3
movies['score']<-score

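Let us take a look at the distribution of the score variable.

par(xpd=TRUE)
# Histogram of score; keep the returned object so the bin width can be read from it
h<-hist(movies$score, breaks=20, col="green", xlab="Movie Rating", 
    main="Histogram with Normal Curve",sub="Fig 1: Histogram of movie score frequencies")
# Normal density with the sample mean and sd, evaluated across the range of score
xfit<-seq(min(movies$score),max(movies$score))
yfit<-dnorm(xfit,mean=mean(movies$score),sd=sd(movies$score))
# Rescale the density to counts: density * bin width * number of observations
yfit <- yfit*diff(h$mids[1:2])*length(movies$score) 
lines(xfit, yfit, col="blue", lwd=2)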

The distribution of score looks slightly left skewed. The summary statistics give more information.

describe(movies$score)

##    vars   n  mean    sd median trimmed   mad   min   max range  skew
## X1    1 651 61.66 18.24  62.67   62.55 21.74 12.67 94.67    82 -0.34
##    kurtosis   se
## X1    -0.81 0.71

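Keeping in mind that the distribution is slightly left skewed (skew = -0.34), the mean is 61.66 and the standard deviation is 18.24, so by the 68-95-99.7 rule we can expect roughly 68% of movies to have a score between 43.42 and 79.90. Because the rule assumes a normal distribution, this is only an approximation.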
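The range quoted above is simply the mean plus and minus one standard deviation. A minimal sketch of that arithmetic follows, using the rounded values from the describe() output; the variable names are illustrative, and the commented last line is only a suggestion for checking the observed share against the data, which has not been run here.

```r
# Values copied from the describe() output above (rounded)
score_mean <- 61.66
score_sd   <- 18.24

# One standard deviation either side of the mean
one_sd_range <- c(lower = score_mean - score_sd, upper = score_mean + score_sd)
print(one_sd_range)

# Share of a normal distribution that lies within one sd of its mean
normal_share <- pnorm(1) - pnorm(-1)
print(round(normal_share, 4))

# With the data loaded, the observed share could be compared with:
# mean(movies$score >= one_sd_range["lower"] & movies$score <= one_sd_range["upper"])
```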

Part 4: Modeling

Let us look at the column names of the movies data frame to get an idea of the data set.

colnames(movies)

##  [1] "title"            "title_type"       "genre"           
##  [4] "runtime"          "mpaa_rating"      "studio"          
##  [7] "thtr_rel_year"    "thtr_rel_month"   "thtr_rel_day"    
## [10] "dvd_rel_year"     "dvd_rel_month"    "dvd_rel_day"     
## [13] "imdb_rating"      "imdb_num_votes"   "critics_rating"  
## [16] "critics_score"    "audience_rating"  "audience_score"  
## [19] "best_pic_nom"     "best_pic_win"     "best_actor_win"  
## [22] "best_actress_win" "best_dir_win"     "top200_box"      
## [25] "director"         "actor1"           "actor2"          
## [28] "actor3"           "actor4"           "actor5"          
## [31] "imdb_url"         "rt_url"           "score"

Since we want to predict a movie's score from Oscar wins, we are interested in the association between score and winning an Oscar.

Explanatory variables:

Our full model will use the following explanatory variables from the data set.

best_pic_win: Whether or not the movie won a best picture Oscar (no, yes)

best_actor_win: Whether or not one of the main actors in the movie ever won an Oscar (no, yes) - note that this is not necessarily whether the actor won an Oscar for their role in the given movie

best_actress_win: Whether or not one of the main actresses in the movie ever won an Oscar (no, yes) - note that this is not necessarily whether the actress won an Oscar for her role in the given movie

best_dir_win: Whether or not the director of the movie ever won an Oscar (no, yes) - note that this is not necessarily whether the director won an Oscar for the given movie

Model selection:

When the sole goal is to improve prediction accuracy, model selection based on adjusted \(R^2\) is preferred. When we care about which variables are statistically significant predictors of the response, or want a simpler model at the potential cost of a little prediction accuracy, the p-value approach is preferred. We want to find out which variables are significant predictors of the response, so we will use the p-value approach with backward elimination to answer our research question.

p-value approach with backward elimination:

In the p-value approach with backward elimination, we start with the full model and, at each step, drop the variable with the highest p-value above the significance level, then refit the model. We stop when no remaining variable has a p-value above the significance level. We will use a significance level of 0.05 (corresponding to a 95% confidence level).
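The elimination procedure described above can be sketched as a small loop. This is an illustration only, not the code used in this report: backward_eliminate() is a hypothetical helper, and the data are simulated (x1, x2, x3 and y are made-up names, with only x1 built to affect y). It uses drop1() with an F-test to get one p-value per term; for a two-level (no/yes) predictor this p-value equals the t-test p-value shown by summary().

```r
# Simulated data (NOT the movies data): only x1 truly affects y
set.seed(1)
sim <- data.frame(
  x1 = factor(rep(c("no", "yes"), each = 100)),
  x2 = factor(rep(c("no", "yes"), times = 100)),
  x3 = factor(rep(c("no", "no", "yes", "yes"), times = 50))
)
sim$y <- 60 + 10 * (sim$x1 == "yes") + rnorm(200, sd = 5)

# Hypothetical helper: p-value based backward elimination
backward_eliminate <- function(data, response, predictors, alpha = 0.05) {
  keep <- predictors
  while (length(keep) > 0) {
    fit <- lm(reformulate(keep, response = response), data = data)
    # One row per term; the first row ("<none>") has no p-value
    tests <- drop1(fit, test = "F")
    p_values <- setNames(tests[["Pr(>F)"]][-1], rownames(tests)[-1])
    # Stop once no remaining term exceeds the significance level
    if (max(p_values) <= alpha) return(fit)
    # Otherwise drop the term with the highest p-value and refit
    keep <- setdiff(keep, names(p_values)[which.max(p_values)])
  }
  # Every predictor was dropped: fall back to the intercept-only model
  lm(reformulate("1", response = response), data = data)
}

final <- backward_eliminate(sim, "y", c("x1", "x2", "x3"))
print(formula(final))
```

Automating the loop is mainly useful with many candidate variables; with four predictors, refitting by hand as done in this report makes each step easier to inspect.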

full_model<-lm(score~best_pic_win + best_actor_win +best_actress_win+best_dir_win ,data = movies)
summary(full_model)

## 
## Call:
## lm(formula = score ~ best_pic_win + best_actor_win + best_actress_win + 
##     best_dir_win, data = movies)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -49.299 -13.162   0.691  15.363  31.691 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          60.6427     0.8056  75.276   <2e-16 ***
## best_pic_winyes      17.0782     7.3317   2.329   0.0201 *  
## best_actor_winyes     1.3226     2.0479   0.646   0.5186    
## best_actress_winyes   1.7045     2.3011   0.741   0.4591    
## best_dir_winyes       6.9151     3.0316   2.281   0.0229 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18.04 on 646 degrees of freedom
## Multiple R-squared:  0.02802,    Adjusted R-squared:  0.022 
## F-statistic: 4.656 on 4 and 646 DF,  p-value: 0.001036

Both best_actor_win and best_actress_win have p-values greater than 0.05. best_actor_win has the higher p-value (0.5186), so we drop it from the model first.

model<-lm(score~best_pic_win+best_actress_win+best_dir_win ,data = movies)
summary(model)

## 
## Call:
## lm(formula = score ~ best_pic_win + best_actress_win + best_dir_win, 
##     data = movies)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -48.132 -12.966   0.534  15.368  31.534 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          60.7992     0.7679  79.171   <2e-16 ***
## best_pic_winyes      17.0377     7.3281   2.325   0.0204 *  
## best_actress_winyes   1.8954     2.2810   0.831   0.4063    
## best_dir_winyes       7.0935     3.0176   2.351   0.0190 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18.03 on 647 degrees of freedom
## Multiple R-squared:  0.02739,    Adjusted R-squared:  0.02288 
## F-statistic: 6.074 on 3 and 647 DF,  p-value: 0.0004431

As best_actress_win still has a p-value greater than 0.05 (0.4063), we drop it from the model.

model<-lm(score~best_pic_win+best_dir_win ,data = movies)
summary(model)

## 
## Call:
## lm(formula = score ~ best_pic_win + best_dir_win, data = movies)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -48.328 -12.994   0.491  15.339  31.339 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       60.994      0.731  83.438   <2e-16 ***
## best_pic_winyes   17.850      7.261   2.458   0.0142 *  
## best_dir_winyes    7.182      3.015   2.382   0.0175 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18.02 on 648 degrees of freedom
## Multiple R-squared:  0.02636,    Adjusted R-squared:  0.02335 
## F-statistic:  8.77 on 2 and 648 DF,  p-value: 0.0001745

As no remaining variable has a p-value greater than the 0.05 significance level, we cannot drop any more variables. Our final model therefore consists of the variables best_pic_win and best_dir_win.

Model analysis:

We can predict score from best_pic_win and best_dir_win in the following way:

\[score= \beta_0+\beta_1 * best\_pic\_win + \beta_2 * best\_dir\_win \]

\(\beta_0,\beta_1,\beta_2\) are, respectively, the intercept and the coefficients of best_pic_win and best_dir_win. Let us analyse the summary output and explain what the coefficients \(\beta_0,\beta_1,\beta_2\) mean.

In the summary output, yes is appended to the names best_pic_win and best_dir_win. The yes marks the non-reference level, so no is the reference level. best_pic_win takes the value 1 if the movie won the Best Picture Oscar and 0 otherwise; best_dir_win takes the value 1 if the director of the movie has ever won an Oscar and 0 otherwise. \(\beta_0\) is the intercept, estimated at 60.994: a movie that has not won Best Picture and whose director has never won an Oscar has an estimated score of 60.994. \(\beta_1\) is the expected difference in score between a movie that won Best Picture and one that did not, holding the other variable constant; its point estimate is 17.850. \(\beta_2\) is the expected difference in score between a movie whose director has ever won an Oscar and one whose director has not, holding the other variable constant; its point estimate is 7.182.
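To make the interpretation concrete, the fitted score for each of the four combinations of the two indicators can be worked out directly from the estimates. The sketch below hard-codes the rounded coefficients from the final summary output instead of reading them from the fitted model, so the results are approximate; the object names are illustrative.

```r
# Rounded estimates copied from the final model summary above
b <- c(intercept = 60.994, best_pic_winyes = 17.850, best_dir_winyes = 7.182)

# All four combinations of the two yes/no indicators
grid <- expand.grid(best_pic_win = c("no", "yes"),
                    best_dir_win = c("no", "yes"))

# Fitted score = intercept + each coefficient times its 0/1 indicator
grid$fitted_score <- b[["intercept"]] +
  b[["best_pic_winyes"]] * (grid$best_pic_win == "yes") +
  b[["best_dir_winyes"]] * (grid$best_dir_win == "yes")

print(grid)
```

The rows give 60.994 (neither), 78.844 (Best Picture only), 68.176 (Oscar-winning director only) and 86.026 (both).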

Part 5: Prediction

We pick the movie “The Girl on the Train”, released in 2016, and predict its score using our model. The director, Tate Taylor, has never won an Oscar, although his 2011 movie “The Help” was nominated for one (source: Wikipedia). As of this writing we do not know whether “The Girl on the Train” will win the Best Picture Oscar, so we consider both cases. We assume that Tate Taylor will not win an Oscar for this movie.

If the movie wins Best Picture, we can predict the score in the following way.

The_Girl_on_the_Train <- data.frame(best_pic_win = "yes", best_dir_win = "no")
predict(model,The_Girl_on_the_Train)

##        1 
## 78.84421

predict(model,The_Girl_on_the_Train,interval = "prediction", level = 0.95)

##        fit      lwr      upr
## 1 78.84421 40.67194 117.0165

Our model predicts a score of 78.84. With 95% confidence, the model predicts that “The Girl on the Train” will have a score between 40.67 and 117.02 if it wins the Best Picture Oscar, given that its director Tate Taylor has neither won an Oscar before nor wins one for this movie. Since score cannot exceed 100, the upper bound is effectively 100.

If the movie does not win Best Picture, we can predict the score in the following way.

The_Girl_on_the_Train <- data.frame(best_pic_win = "no", best_dir_win = "no")
predict(model,The_Girl_on_the_Train)

##        1 
## 60.99422

predict(model,The_Girl_on_the_Train,interval = "prediction", level = 0.95)

##        fit      lwr      upr
## 1 60.99422 25.57516 96.41327

Our model predicts a score of 60.99. With 95% confidence, the model predicts that “The Girl on the Train” will have a score between 25.58 and 96.41 if it does not win the Best Picture Oscar, given that its director Tate Taylor has neither won an Oscar before nor wins one for this movie.
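The two prediction intervals above differ in width (about 76 points for the Best Picture case against about 71 for the other), consistent with the larger standard error of the best_pic_win coefficient. The sketch below shows how such an interval is built, using simulated data and not the movies data: toy, won and the group sizes are assumptions made for illustration. It reproduces the interval from predict() by hand as fit ± t × sqrt(se.fit² + residual variance), and shows that the group with fewer observations gets the wider interval.

```r
# Simulated data (NOT the movies data): 45 "no" movies and only 5 "yes" movies
set.seed(42)
toy <- data.frame(won = factor(rep(c("no", "yes"), times = c(45, 5))))
toy$score <- 60 + 15 * (toy$won == "yes") + rnorm(50, sd = 10)
fit <- lm(score ~ won, data = toy)

# Built-in 95% prediction intervals for both groups
new_yes <- data.frame(won = "yes")
new_no  <- data.frame(won = "no")
builtin_yes <- predict(fit, new_yes, interval = "prediction", level = 0.95)
builtin_no  <- predict(fit, new_no,  interval = "prediction", level = 0.95)

# The same interval by hand for the "yes" group
p <- predict(fit, new_yes, se.fit = TRUE)
t_crit <- qt(0.975, df = p$df)
half_width <- t_crit * sqrt(p$se.fit^2 + p$residual.scale^2)
manual_yes <- c(fit = unname(p$fit),
                lwr = unname(p$fit - half_width),
                upr = unname(p$fit + half_width))

print(builtin_yes)
print(manual_yes)

# Interval widths: the sparsely observed group is less certain
width_yes <- builtin_yes[1, "upr"] - builtin_yes[1, "lwr"]
width_no  <- builtin_no[1, "upr"]  - builtin_no[1, "lwr"]
print(c(yes = unname(width_yes), no = unname(width_no)))
```

Most of the width comes from the residual variance term, which is why both intervals in this report are wide regardless of the Oscar outcome.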

Part 6: Conclusion

The adjusted R-squared of our final model is only 0.02335, which indicates that the model explains very little of the variability in score and is a weak predictor. Even so, the low p-value of the F-test (0.0001745) indicates that the model as a whole is statistically significant. Other variables such as genre, runtime, and MPAA rating might have helped predict score more accurately, but they are outside the scope of the research question, so we did not include them in the full model.
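As a cross-check on the reported figures, adjusted R-squared can be recomputed from the multiple R-squared, the sample size n = 651, and the number of predictors p with the standard formula 1 - (1 - R²)(n - 1)/(n - p - 1). The sketch below uses the rounded R-squared values from the three summary outputs; adjusted_r2() and steps are illustrative names.

```r
# Standard adjusted R-squared formula
adjusted_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

n <- 651
# Multiple R-squared values copied from the three summary outputs above
steps <- data.frame(
  model = c("4 predictors", "3 predictors", "2 predictors"),
  r2    = c(0.02802, 0.02739, 0.02636),
  p     = c(4, 3, 2)
)
steps$adj_r2 <- adjusted_r2(steps$r2, n, steps$p)
print(steps)
```

The recomputed values agree with the reported 0.022, 0.02288 and 0.02335. Adjusted R-squared rose slightly each time a variable was dropped, so for these two steps the adjusted R-squared criterion and the p-value criterion point the same way.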