---
title: "Statistics 2B Coursework"
author: "Group 42"
date: "`r format(Sys.time(), '%d %B %Y')`"
output:
  html_document:
    css: style.css
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, message=FALSE, warning=FALSE, fig.align="center")
rm(list=ls()) # Clear all variables
```

---

Initialising libraries and setting defaults for ggplot:

```{r Loading Packages, message = FALSE}
if (!require("pacman")) install.packages("pacman")
pacman::p_load('dplyr','ggrepel','ggplot2','grid','gridExtra','ggthemes','lattice','tidyverse','broom',
               'ggfortify','corrplot','lindia','ggpubr','car','effects','ggcorrplot','ggExtra','knitr',
               'sjlabelled','sjPlot','sjmisc','openxlsx','bezier','data.table')
theme_set(theme_bw() + theme(legend.key.size = unit(0.7,"lines"),
                             legend.box.spacing = unit(-0.01,"lines")))
ggplot <- function(...) ggplot2::ggplot(...) +
  guides(colour = guide_legend(override.aes = list(shape = 15))) +
  scale_color_brewer(palette="Paired") + scale_fill_brewer(palette="Paired")
```

Custom-made diagnostics function:

```{r Diagnose Function}
diagnose <- function(model, x_variable="EpisodeNumber", sqrt.level=1.6,
                     sqrt.nudge.x=0, sqrt.nudge.y=0, sqrt.box=0,
                     std.nudge.x=0, std.nudge.y=0, std.box=0.5,
                     cooks.level=0.1, cooks.nudge.x=0, cooks.nudge.y=0, cooks.box=0.5,
                     alpha=1, box.size=2, CI.fill="grey82", cooks.x.lab="Episode Number",
                     colour=NULL, title, resid.x.lab="Fitted rating") {
  df = eval(model$call$data)
  almod = augment(model)
  almod[c("ID.1","Season","EpisodeNumber","Episode")] = df[c("ID.1","Season","EpisodeNumber","Episode")]
  AbsSqrtStd = ggplot(data = almod, aes_string(".fitted","sqrt(abs(.std.resid))",label="ID.1",colour=colour,fill=colour)) +
    geom_point(alpha=alpha) + scale_colour_brewer(palette="Paired") + scale_fill_brewer(palette="Paired") +
    theme(legend.key.size = unit(0.5,"lines")) +
    geom_smooth(method="loess",formula=y~x,level=0.95,colour="salmon",fill=CI.fill,size=0.4,linetype="dashed") +
    geom_label_repel(data = subset(almod, sqrt(abs(.std.resid)) > sqrt.level), alpha=0.7, size=box.size,
                     box.padding=sqrt.box, aes(fill=NULL), nudge_x=sqrt.nudge.x, nudge_y=sqrt.nudge.y) +
    labs(x=resid.x.lab, y=expression(sqrt(abs("Standardised residuals"))))
  QQplot = ggplot(almod, aes(sample=.std.resid,label=ID.1)) + geom_qq(alpha=alpha) +
    geom_qq_line(linetype="dashed",colour="salmon") +
    labs(x = "Theoretical residuals", y = "Standardised residuals")
  StdResid = ggplot(data = almod, aes_string(".fitted",".std.resid",label="ID.1",colour=colour,fill=colour)) +
    geom_point(alpha=alpha) + scale_colour_brewer(palette="Paired") + scale_fill_brewer(palette="Paired") +
    geom_smooth(method="loess",formula=y~x,level=0.95,colour="salmon",fill=CI.fill,size=0.4,linetype="dashed") +
    geom_label_repel(data = subset(almod, sqrt(abs(.std.resid)) > sqrt.level), alpha=0.7, size=box.size,
                     box.padding=std.box, aes(fill=NULL), nudge_x=std.nudge.x, nudge_y=std.nudge.y) +
    labs(x=resid.x.lab, y="Standardised residuals") + theme(legend.key.size = unit(0.5,"lines"))
  Cooks = ggplot(almod, aes_string(x=x_variable, y=".cooksd", label="ID.1", colour=colour, fill=colour)) +
    geom_bar(stat="identity") + labs(x=cooks.x.lab, y="Cook's distance") +
    geom_hline(yintercept=4/length(df$ID), linetype="dashed", colour="red") +
    geom_label_repel(data = subset(almod, .cooksd > cooks.level), size=box.size, nudge_x=cooks.nudge.x,
                     alpha=0.7, box.padding=cooks.box, nudge_y=cooks.nudge.y, aes(fill=NULL)) +
    scale_fill_brewer(palette="Paired") + scale_colour_brewer(palette="Paired") +
    theme(legend.key.size = unit(0.7,"lines"), legend.box.spacing = unit(0.1,"lines"))
  ggarrange(StdResid, QQplot, AbsSqrtStd, Cooks, ncol=2, nrow=2, common.legend = TRUE, legend="right")
}
#diagnose(lmod2, alpha=0.6, sqrt.box=0, box.size=2, CI.fill="grey90")
```

```{r Summary Print Function, include=FALSE}
# This chunk is hidden from the knitted output.
# Taken from: https://stackoverflow.com/questions/32342018/summary-lm-output-customization
# Makes the summary print more concise.
print.summary.lm <- function (x, digits = max(3L, getOption("digits") - 3L),
                              symbolic.cor = x$symbolic.cor,
                              signif.stars = getOption("show.signif.stars"),
                              concise = TRUE, ...) {
  cat("\nCall:", if (!concise) "\n" else " ",
      paste(deparse(x$call), sep = "\n", collapse = "\n"),
      if (!concise) "\n\n", sep = "")
  resid <- x$residuals
  df <- x$df
  rdf <- df[2L]
  if (!concise) {
    cat(if (!is.null(x$weights) && diff(range(x$weights))) "Weighted ", "Residuals:\n", sep = "")
  }
  if (rdf > 5L) {
    nam <- c("Min", "1Q", "Median", "3Q", "Max")
    rq <- if (length(dim(resid)) == 2L)
      structure(apply(t(resid), 1L, quantile), dimnames = list(nam, dimnames(resid)[[2L]]))
    else {
      zz <- zapsmall(quantile(resid), digits + 1L)
      structure(zz, names = nam)
    }
    if (!concise) print(rq, digits = digits, ...)
  } else if (rdf > 0L) {
    print(resid, digits = digits, ...)
  } else {
    cat("ALL", df[1L], "residuals are 0: no residual degrees of freedom!")
    cat("\n")
  }
  if (length(x$aliased) == 0L) {
    cat("\nNo Coefficients\n")
  } else {
    if (nsingular <- df[3L] - df[1L])
      cat("\nCoefficients: (", nsingular, " not defined because of singularities)\n", sep = "")
    else {
      cat("\n")
      if (!concise) cat("Coefficients:\n")
    }
    coefs <- x$coefficients
    if (!is.null(aliased <- x$aliased) && any(aliased)) {
      cn <- names(aliased)
      coefs <- matrix(NA, length(aliased), 4, dimnames = list(cn, colnames(coefs)))
      coefs[!aliased, ] <- x$coefficients
    }
    printCoefmat(coefs, digits = digits, signif.stars = signif.stars,
                 signif.legend = (!concise), na.print = "NA",
                 eps.Pvalue = if (!concise) .Machine$double.eps else 1e-4, ...)
  }
  cat("\nResidual standard error:", format(signif(x$sigma, digits)), "on", rdf, "degrees of freedom")
  cat("\n")
  if (nzchar(mess <- naprint(x$na.action))) cat(" (", mess, ")\n", sep = "")
  if (!is.null(x$fstatistic)) {
    cat("Multiple R-squared: ", formatC(x$r.squared, digits = digits))
    cat(",\tAdjusted R-squared: ", formatC(x$adj.r.squared, digits = digits),
        "\nF-statistic:", formatC(x$fstatistic[1L], digits = digits),
        "on", x$fstatistic[2L], "and", x$fstatistic[3L], "DF, p-value:",
        format.pval(pf(x$fstatistic[1L], x$fstatistic[2L], x$fstatistic[3L], lower.tail = FALSE),
                    digits = digits, if (!concise) .Machine$double.eps else 1e-4))
    cat("\n")
  }
  correl <- x$correlation
  if (!is.null(correl)) {
    p <- NCOL(correl)
    if (p > 1L) {
      cat("\nCorrelation of Coefficients:\n")
      if (is.logical(symbolic.cor) && symbolic.cor) {
        print(symnum(correl, abbr.colnames = NULL))
      } else {
        correl <- format(round(correl, 2), nsmall = 2, digits = digits)
        correl[!lower.tri(correl)] <- ""
        print(correl[-1, -p, drop = FALSE], quote = FALSE)
      }
    }
  }
  cat("\n")
  invisible(x)
}
options("scipen"=10, "digits"=4)
```

---

# Loading and formatting the data

```{r Loading Data}
BigBang <- read.csv("BigBangTheory.csv")
```

Adding formatting for neater visualisation:

```{r Formatting Data}
BigBang["EpisodeNumber"] = 1:length(BigBang$ID) # All episodes numbered from 1 to 231
BigBang["AirDate.1"] = as.Date(BigBang$AirDate, format = "%m/%d/%Y") # Air date in proper date form
BBT_Characters = BigBang[,c(7:21)] # Just the characters data
for (i in 1:length(BigBang$ID)) {
  SiEj = strsplit(as.character(BigBang$ID[i]), "_")[[1]]
  BigBang$ID.1[i] = sprintf("S%s E%s", SiEj[1], SiEj[2]) # stores episode IDs in 'S3 E17' form
  BigBang$Season[i] = strtoi(SiEj[1]) # stores season, as an integer
  BigBang$Episode[i] = strtoi(SiEj[2]) # stores episode number within the season
}
BigBang$Season = factor(BigBang$Season) # turns Season into a factor
BigBang["TotalLines"] = rowSums(BigBang[,c(7:21)]) # total lines, from all characters
#tab_df(BigBang) # <-- shows entire data frame as a nice large HTML table, good to view in browser
```

We went through every row and column of the data frame, and at first glance we don't think the data contains any mistakes. Some writers' credits are off, like in episode 169 (`...and David Saltzberg & Ph.D`), but nothing major.

---

# 1. Initial data summary

BigBang consists of data from each episode, including the $\textit{Title}$, $\textit{Director}$, $\textit{Writer}$, $\textit{Air Date}$ and $\textit{Rating}$, along with the number of lines said by numerous (but not all) characters from the show. Below is a summary of the data:

```{r Data Summary}
options(width = 140)
summary(BigBang)
```

From the summary we see that $\textit{Sheldon}$, $\textit{Leonard}$ and $\textit{Howard}$ all appear in every episode throughout all 10 seasons. Other characters such as $\textit{Penny}$ and $\textit{Raj}$ appear to be main characters also, as they have a higher mean number of lines than the others. Characters such as $\textit{Bernadette}$ and $\textit{Amy}$ have a lower mean number of lines, but may be main characters added in later seasons. We can also see that some characters, like $\textit{Zack}$, have a significant role in a few episodes (e.g. one where he had 47 lines) but do not appear in many others.
We can see that the show had many writers, who frequently collaborated on episodes:

```{r Writers Plot}
BigBang["nWriters"] = 1 + str_count(BigBang$Writer, " and ") + str_count(BigBang$Writer, " & ")
ggplot(BigBang, aes(x=EpisodeNumber, y=nWriters, colour=Season, fill=Season, label=ID.1)) +
  geom_point(shape="I", size=6.7, show.legend=FALSE) +
  labs(y="Number of writers", x="Episode Number", title="Writers per Episode") +
  geom_label_repel(data=subset(BigBang, EpisodeNumber==100 | nWriters==4), aes(fill=NULL), size=2, nudge_y=-0.2)
```

We see that S1 and most of S2 had one to two writers per episode; this increased to three in S3 and stayed around that level until the end of the show, never dropping back to two, though on occasion four writers did collaborate. It is interesting that episode 100 had just one writer, Chuck Lorre. Perhaps it was a special episode.

In contrast, considering the directors, $\textit{Mark Cendrowski}$ directed 203 episodes, which is $88\%$ of all episodes:

```{r Directors Bar, message=FALSE}
x = count(BigBang, Director, Season)
x = arrange(x, -n)
x = data.frame(x)
ggplot(x, aes(x=reorder(Director, -n, sum), y=n, colour=Season, fill=Season)) + geom_col() +
  ggtitle("Episodes per Director") + theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust=1.16)) +
  labs(y = "Number of episodes directed") +
  stat_summary(aes(x=Director, y=n, label=..y.., group=Director), fun.y='sum', geom='text',
               vjust=-0.2, colour="grey50", inherit.aes=FALSE) +
  theme(legend.key.size = unit(0.7,"lines"), legend.box.spacing = unit(-0.01,"lines"),
        panel.grid.major.x = element_blank(), axis.title.x = element_blank())
```

Season 1 had a few one-off directors.
The equivalent chart for writers is:

```{r Writers Bar}
q = str_split(BigBang$Writer, "\\ & |\\ and ")
w = rbindlist(lapply(q, as.data.frame.list), fill=TRUE) # splits writers and turns them into a df
w$EpisodeNumber = 1:231
w$Season = BigBang$Season
w1 = gather(w, value="Writer", key="key", -EpisodeNumber, -Season) # turns into long form
top15names = subset(arrange(count(w1, Writer), -n), Writer != "NA")[1:15,] # counts and gets the top 15 names
x = arrange(count(w1, Writer, Season), -n)
x = subset(x, Writer != "NA" & (Writer %in% top15names$Writer)) # sorts and removes NA values
ggplot(x, aes(x=reorder(Writer, -n, sum), y=n, colour=Season, fill=Season)) + geom_col() +
  theme_minimal() + labs(y="Number of episodes co-written") +
  theme(axis.text.x = element_text(angle=45, hjust=1, vjust=1.2), panel.grid.major.x = element_blank(),
        axis.title.x = element_blank(), legend.key.size = unit(0.7,"lines"),
        legend.box.spacing = unit(-0.01,"lines")) +
  stat_summary(aes(x=Writer, y=n, label=..y.., group=Writer), fun.y='sum', geom='text',
               vjust=-0.2, colour="grey50", inherit.aes=FALSE) +
  ggtitle("Episodes per writing credit")
```

Chuck Lorre is by far the most prolific writer for the show, with over 112 writer credits. Even though the summary showed a lot of writer collaborations, only about 8-10 writers have consistently written episodes throughout the seasons.
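The writer-credit counts above hinge on splitting the `Writer` strings on `" & "` and `" and "`. A minimal self-contained check of that splitting logic, using made-up credit strings rather than the real data (base R's `strsplit` mirrors the `str_split`/`str_count` calls used above):

```{r Writer Split Sketch}
# Hypothetical credit strings mirroring the formats seen in the Writer column
credits <- c("Chuck Lorre", "Bill Prady & Steven Molaro", "A and B & C")
# One writer plus one per separator, as in the nWriters computation above
n_writers <- lengths(strsplit(credits, " & | and "))
n_writers # 1 2 3
```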
The breakdown of the show's total lines across the characters is illustrated in the bar chart below:

```{r Characters Bar, message=FALSE}
BBT_Characters["Season"] = BigBang["Season"]
x0 = gather(BBT_Characters, value="lines", key="Character", -Season) # turns into long form
x0$lines = 100*x0$lines/sum(x0$lines) # turns lines in each episode into % of the show's total lines
ggplot(x0, aes(x=reorder(Character, -lines, sum), y=lines, fill=Season, colour=Season)) +
  geom_col(position="stack") + theme_minimal() +
  labs(y = "% of show's total lines", title="Total Lines per Character") +
  stat_summary(aes(x=Character, y=lines, label=sprintf("%g%%", round(..y.., 0)), group=Character),
               fun.y='sum', geom='text', vjust=-0.25, colour="grey50", inherit.aes=FALSE) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust=1.37), legend.key.size = unit(0.7,"lines"),
        legend.box.spacing = unit(-0.01,"lines"), panel.grid.major.x = element_blank(),
        axis.title.x = element_blank())
```

From this bar chart we see that approximately a quarter of the show's lines are Sheldon's, a fifth are Leonard's and a sixth are Penny's. The top 3 characters make up $62\%$ of the show's lines, and the top 4 make up $74\%$, suggesting that these are some of the main characters.
Plotting the number of lines from each episode by character shows how the characters have changed over time, and where new characters were added:

```{r Character Plots old, include=FALSE}
#p = list()
#for (i in 7:21) {
#  BBTCharacter = colnames(BigBang)[i]
#  p[[i-6]] = ggplot(BigBang, aes_string(x="AirDate.1", y=BBTCharacter, colour="Season")) + geom_point(alpha=0.45) +
#    labs(x="", y="") + ylim(c(0,130)) + scale_color_brewer(palette="Paired") + ggtitle(BBTCharacter) +
#    theme(legend.position="none", axis.title.x=element_blank(), axis.title.y=element_blank()) }
#grid.arrange(grobs=p[c(1,2,3,4)], bottom="Episode Air Date", left="Lines spoken")
#grid.arrange(grobs=p[c(5,6,7,8)], bottom="Episode Air Date", left="Lines spoken")
#grid.arrange(grobs=p[c(9,10,11,12)], bottom="Episode Air Date", left="Lines spoken")
#grid.arrange(grobs=p[c(13,14,15)], ncol=2, bottom="Episode Air Date", left="Lines spoken")
```

```{r Characters Plots}
BBT_Characters["EpisodeNumber"] = BigBang["EpisodeNumber"]
x = gather(BBT_Characters, value="Lines", key="Character", -Season, -EpisodeNumber)
x$Character = factor(x$Character, levels=colnames(BBT_Characters)[-c(16,17)]) # the 15 character names in custom order (drops the Season and EpisodeNumber columns)
p = ggplot(x, aes(x=EpisodeNumber, y=Lines, colour=Season)) +
  geom_point(alpha=0.6, shape=16, size=1.2) + facet_wrap(~Character) + theme_linedraw() +
  theme(panel.spacing = unit(0, "lines"),
        strip.text.x = element_text(colour="black", margin = margin(0.1,0,0.1,0, "lines")),
        strip.background = element_rect(fill="grey82", colour="black")) +
  labs(x="Episode Number", y="Lines spoken") +
  guides(colour = guide_legend(override.aes = list(shape = 15, size=4, alpha=1)))
g <- ggplotGrob(p)
rg <- g$layout$name %in% c("panel-1-4","panel-2-4","panel-3-4","strip-t-4-1","strip-t-4-2","strip-t-4-3","axis-b-4-3")
g$layout[rg, ][c("t","b")] = g$layout[rg, ][c("t","b")] + 5
grid.draw(g) # this moves the lone empty cell from the bottom to the top right corner :)
```

Examining $\textit{Sheldon}$'s and $\textit{Leonard}$'s appearances here, we can see that their number of lines decreases over time. This could be because their characters had become established as the series progressed, and new main characters such as Bernadette and Amy were added, which effectively meant there were fewer lines to go around (as no characters were removed).

We also see that all characters have a lot of variation in the number of lines within each season. This could be connected to the fact that the show had a lot of different writers; it could later be investigated whether certain writers favour certain characters. Furthermore, some characters such as $\textit{Amy}$ and $\textit{Bernadette}$ have no lines in the first 2 to 3 seasons and then have many after that, suggesting they were added to the show and became main characters.

We can make a matrix representing the correlation between the numbers of lines spoken over time by pairs of main characters:

```{r Correlation Matrix}
BigBangCor <- BigBang[,c(7:14)]
corMatrix = cor(BigBangCor)
ggcorrplot(corMatrix, method="square", lab=TRUE, color=c("#226DB4", "white", "salmon"), legend.title="Correlation")
```

From this we see that $\textit{Leonard}$ and $\textit{Sheldon}$ have a positive correlation because their numbers of lines both decrease over time, whereas $\textit{Amy}$ and $\textit{Bernadette}$ have a positive correlation because they were introduced during the same season and their numbers of lines follow a similar trend. Both $\textit{Leonard}$ and $\textit{Sheldon}$ have negative correlations with $\textit{Amy}$ and $\textit{Bernadette}$: as the show progressed and Leonard and Sheldon began to speak fewer lines, Amy and Bernadette began to speak more.

---

# 2. Ratings over time

We first plot the show's ratings over time to figure out what the data looks like:

```{r Show Rating Plot 1}
BBSub = subset(BigBang, Rating > 9 | Rating < 7 | ID.1 == "S10 E24" | ID.1 == "S7 E9" | ID.1 == "S5 E11")
p = ggplot(data = BigBang, aes(EpisodeNumber, Rating, label=ID.1, fill=Season, color=Season)) +
  geom_point() + ggtitle("Show Rating Over Time") +
  geom_label_repel(data = BBSub, size=2, box.padding=0.2, aes(fill = NULL)) +
  labs(x="Episode Number", y="Rating")
p + geom_smooth(method=loess, formula=y~x, size=0.6, fill="grey92", colour="#f9ccac", linetype="dashed")
```

We see that some seasons had particularly high- and low-rated episodes, and overall the smooth local mean line seems to follow a quadratic without the need for any transformations. We start by comparing the basic linear model to polynomial models of degree up to 3.

NB: `anova(lmod1, lmod2, lmod3)` compares `lmod2` to `lmod1` and `lmod3` to `lmod2`.

```{r Show Rating ANOVA 1}
lmod1 = lm(Rating ~ EpisodeNumber, BigBang)
lmod2 = lm(Rating ~ EpisodeNumber + I(EpisodeNumber^2), BigBang)
lmod3 = lm(Rating ~ EpisodeNumber + I(EpisodeNumber^2) + I(EpisodeNumber^3), BigBang)
#anova(lmod1, lmod2, lmod3)
```

The ANOVA table comparing the 3 models is as follows:

| Model parameters | F-stat | p-value |
|------------------|--------|---------|
| _EpisodeNumber_ | -- | -- |
| **_EpisodeNumber_ + _EpisodeNumber_$^{\ 2}$** | **11.528** | **0.0008** |
| _EpisodeNumber_ + _EpisodeNumber_$^{\ 2}$ + _EpisodeNumber_$^{\ 3}$ | 1.8647 | 0.1734 |

We see that the quadratic model is significantly better than the simple linear model, and the cubic is not significantly better than the quadratic.
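The F-statistics in that table come from nested-model F-tests. A self-contained sketch of the same `anova()` comparison on synthetic data (not the real ratings), to show the mechanics:

```{r Nested ANOVA Sketch}
# Synthetic ratings with a genuine quadratic trend plus noise
set.seed(42)
xs <- 1:231
ys <- 8 - 0.00005 * (xs - 100)^2 + rnorm(231, sd = 0.25)
m1 <- lm(ys ~ xs)
m2 <- lm(ys ~ xs + I(xs^2))
m3 <- lm(ys ~ xs + I(xs^2) + I(xs^3))
anova(m1, m2, m3) # row 2 tests m2 against m1; row 3 tests m3 against m2
```

With a true quadratic trend, the second row's p-value is tiny while the third row's is not, which is exactly the pattern in the table above.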
So we stick with the **quadratic** model:

$$ \mathtt{Rating} = \alpha + \beta_1\mathtt{EpisodeNumber} + \beta_2\mathtt{EpisodeNumber}^2 $$

```{r}
tab_model(lmod2, auto.label=TRUE, show.ci=FALSE, show.fstat = TRUE, digits=6, digits.p=4,
          show.se = TRUE, show.stat = TRUE, title="lmod2: Show's rating over time")
```

From the regression summary table we see that the linear coefficient is insignificant, while the intercept and the quadratic term are significant. Thus, under this model, the significant negative quadratic coefficient implies that the ratings declined as the show progressed. Since the linear coefficient is insignificant, we expect the ratings to decrease slowly at the start, but faster over time, due to the negative quadratic term.

Diagnostic plots:

```{r Show Diagnostics 1, message=FALSE}
diagnose(lmod2, colour="Season", alpha=0.5, sqrt.level=1.66, box.size=2, std.nudge.x=-0.1,
         std.nudge.y=-0.7, std.box=0.2, sqrt.box=0.2, sqrt.nudge.x=0, sqrt.nudge.y=0.1,
         cooks.level=0.05, cooks.box=0, cooks.nudge.y=-0.005, cooks.nudge.x=-22)
```

The first plot shows that the residuals are scattered randomly around 0, meaning they are likely to be independent, which is good for our linear model. The $\sqrt{|\text{std. resid.}|}$ plot shows that the variance is roughly constant, which is also good. The Q-Q plot shows some non-normality for very high and very low rated episodes, but normality of the residuals is not a crucial assumption, particularly given the large sample size, as the Central Limit Theorem applies. We also see that **Seasons 7** and **8** have particularly high average residual variance; this is because most of their episodes are rated significantly lower than predicted. We investigate exceptional episodes later.

From the Cook's distance plot we can see that there are numerous data points which exceed the Cook's distance threshold for removal, which we set at $\frac{4}{N-k-1}$.
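For the full data set this rule-of-thumb cutoff works out numerically as follows (with $N = 231$ episodes and $k = 2$ predictors, as in the quadratic model):

```{r Cooks Threshold}
# Cook's distance cutoff 4 / (N - k - 1) for N = 231, k = 2
4 / (231 - 2 - 1) # ~0.0175
```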
However, there are multiple factors to consider before removing outliers, so we consider the residuals-fitted plot and studentised residuals next. The residuals-fitted plot shows that there are at least 3 episodes with high residuals. The studentised residuals for each of them are:

```{r}
tail(sort(abs(rstudent(lmod2))), 3)
```

These correspond to **S9 E1**, **S10 E24** and **S9 E11** respectively. We compute the critical value for declaring outliers using the Bonferroni correction:

```{r Show Bonferroni}
abs(qt(.05 / (231*2), 231-2-1))
```

So according to this, **S10 E24** and **S9 E11** can be called outliers. Were we to omit them from the data, the model would become:

```{r Show Rating wo Outliers}
BigBang2 = subset(BigBang, ID.1 != "S10 E24" & ID.1 != "S9 E11")
lmod2.1 = lm(Rating ~ EpisodeNumber + I(EpisodeNumber^2), BigBang2)
tab_model(lmod2.1, lmod2, auto.label=TRUE, show.ci=FALSE, show.fstat = TRUE, digits=6, digits.p=4,
          show.se = TRUE, show.stat = TRUE, dv.labels = c("Model with outliers removed","Original model"),
          title="Comparing the two models")
```

From this we can see that the significance of the predictors doesn't change in comparison to the old model, before outlier removal. Furthermore, the coefficient estimates changed only slightly, and the fit of the model ($\textit{R}^2$) remains very similar. For this reason we keep the outliers in the data at this stage of the analysis, as we do not see a significant reason to remove them.

In conclusion, the ratings did decrease over the course of the series, in a quadratic fashion. A plot of episode ratings against episode number, overlaid with the regression curve from our model, looks like so:

```{r Show Rating Plot Final}
p + geom_smooth(method=lm, formula=y~poly(x,2), size=0.4, fill="grey92", colour="#A9DFBF", linetype="dashed")
```

This plot clearly illustrates the decrease in ratings, and also shows that the quadratic model fits the data well.

---

# 3. Ratings within season

To investigate the ratings within each individual season, we first plot the rating over time for each season:

```{r Seasons Rating Plot}
ggplot(BigBang, aes(Episode, Rating, colour=Season, fill=Season)) + geom_jitter(alpha=0.3) +
  labs(x="Episode number", y="Rating", title="Ratings Within Each Season") +
  geom_smooth(method=loess, se=FALSE, formula=y~x, size=0.45) +
  guides(colour = guide_legend(override.aes = list(shape = 15, size=4)))
```

```{r Seasons Rating Facet}
BBSub = subset(BigBang, ID.1=="S10 E24" | ID.1=="S9 E11")
ggplot(BigBang, aes(Episode, Rating, label=ID.1, fill=Season, colour=Season)) + geom_point(alpha=0.7) +
  geom_label_repel(data=BBSub, size=1.5, box.padding=0, nudge_x=-3.7, aes(fill=NULL)) +
  facet_wrap(~Season, scales="free_x", labeller=label_both) +
  labs(x="Episode Number", y="Rating", title="Ratings Within Each Season") +
  geom_smooth(method=loess, formula=y~x, size=0.4, colour="#f99cac", fill="grey75") +
  theme(strip.background = element_rect(fill="white"), legend.position="none")
```

From the confidence bands around the smooth average lines, it appears that there is no change in ratings for most seasons, but we check to make sure we are not missing anything. Checking $\textbf{Season 1}$, we fit 3 models:

```{r Season 1 Model}
Season1 = subset(BigBang, Season == 1)
lmod1 = lm(Rating ~ Episode, Season1)
lmod2 = lm(Rating ~ Episode + I(Episode^2), Season1)
lmod3 = lm(Rating ~ Episode + I(Episode^2) + I(Episode^3), Season1)
tab_model(lmod1, lmod2, lmod3, auto.label=T, show.ci=F, show.fstat = T, digits=6, digits.p=4,
          show.se = T, show.stat = F, show.aic = T, dv.labels=c("lmod1","lmod2","lmod3"),
          title="Season 1: models for Rating")
#anova(lmod1, lmod2, lmod3)
```

None of the fitted models have any significant predictors (large $\textit{p}$-values), so we conclude that episode number has **no effect** on the rating. Running ANOVA, unsurprisingly, tells us that the other polynomial models aren't better either.
So we can conclude that the ratings didn't change significantly within season 1. Repeating this analysis on the other seasons, we find that the same holds true for all but seasons **6**, **9** and **10**.

---

First we investigate $\textbf{Season 6}$:

```{r Season 6 Model 1}
Season6 = subset(BigBang, Season == 6)
lmod1 = lm(Rating ~ Episode, Season6)
lmod2 = lm(Rating ~ Episode + I(Episode^2), Season6)
lmod3 = lm(Rating ~ Episode + I(Episode^2) + I(Episode^3), Season6)
tab_model(lmod1, lmod2, lmod3, auto.label=TRUE, show.ci=FALSE, show.fstat = TRUE, digits=6,
          digits.p=4, show.se = TRUE, dv.labels=c("lmod1","lmod2","lmod3"),
          title="Season 6: models for Rating")
```

We also used ANOVA *(omitted for brevity)* to compare all polynomial linear models up to degree $3$, and saw that the quadratic model is significantly better than the simple model, while the cubic isn't significantly better than the quadratic. So we will use the **quadratic**, `lmod2`. We now look at the diagnostic plots:

```{r Season 6 Diagnostics 1}
diagnose(lmod2, alpha=0.7, x_variable="Episode", cooks.x.lab="Episode", std.box=0,
         std.nudge.x=-0.09, std.nudge.y=-0.3, sqrt.box=0, sqrt.nudge.x=-0.09, sqrt.nudge.y=-0.1,
         cooks.level=0.25, cooks.box=0.6, cooks.nudge.y=0.1)
```

We don't see anything wrong here. There is potentially heteroskedasticity toward the end, but it is not severe enough to be of concern. From the summary of `lmod2` we see that both $\mathtt{Episode}$ and $\mathtt{Episode}^2$ are significant predictors. Their coefficients are of opposite signs, so solving the quadratic shows that the season's rating peaks at episode 12. This means the ratings were *(on average)* increasing from episode 1 to 12 and decreasing after episode 12, which is exactly what the rating plot for this season at the beginning of this section shows.

---

Next we investigate seasons 9 and 10 further.
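(As an aside, the "peaks at episode 12" figure for season 6 comes from setting the fitted quadratic's derivative to zero, at Episode $= -\beta_1/(2\beta_2)$. A self-contained sketch of that calculation, on synthetic data constructed to peak at episode 12 rather than the real season 6 fit:)

```{r Quadratic Vertex Sketch}
# Synthetic ratings whose quadratic trend peaks at episode 12 (noise-free, so the fit is exact)
ep <- 1:24
rt <- 8 + 0.2 * ep - (0.2 / 24) * ep^2
fit <- lm(rt ~ ep + I(ep^2))
b <- coef(fit)
unname(-b["ep"] / (2 * b["I(ep^2)"])) # turning point; here exactly 12
```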
The same episodes stick out as before, **S9 E11** and **S10 E24**, and we suspect that those two episodes skew their entire seasons' ratings, and may now be more significant as outliers due to the smaller sample size.

We start with $\textbf{Season 9}$. From the plots above, we see that its ratings follow a shape that may be cubic. We seek to confirm our suspicions:

```{r Season 9 Model 1}
Season9 = subset(BigBang, Season == 9)
lmod1 = lm(Rating ~ Episode, Season9)
lmod2 = lm(Rating ~ Episode + I(Episode^2), Season9)
lmod3 = lm(Rating ~ Episode + I(Episode^2) + I(Episode^3), Season9)
lmod4 = lm(Rating ~ Episode + I(Episode^2) + I(Episode^3) + I(Episode^4), Season9)
tab_model(lmod1, lmod2, lmod3, lmod4, auto.label=TRUE, show.ci=FALSE, show.fstat = TRUE, digits=6,
          digits.p=4, show.se = FALSE, show.aic = TRUE, dv.labels=c("lmod1","lmod2","lmod3","lmod4"),
          title="Season 9: initial models for Rating")
```

All predictors in **model 3** (the cubic) are significant, its AIC is the lowest, and its *adjusted* $\textit{R}^2$ is the highest, so we will use it. Again, we include diagnostic plots:

```{r Season 9 Diagnostics 1}
diagnose(lmod3, CI.fill="grey80", alpha=0.8, std.nudge.x=-0.1, sqrt.nudge.x=-0.11, sqrt.nudge.y=-0.14,
         cooks.level=0.2, cooks.nudge.x=-2.3, cooks.nudge.y=0.05, x_variable="Episode", cooks.x.lab="Episode")
```

We see again that **S9 E11**, `The Opening Night Excitation`, sticks out, and the first plot shows **heteroskedasticity** (non-constant variance of the residuals).
We try omitting these episodes (E1 and E11) from the season:

```{r Season 9 Remove Outliers}
Season9 = subset(BigBang, Season == 9 & ID.1 != "S9 E1" & ID.1 != "S9 E11")
lmod1 = lm(Rating ~ Episode, Season9)
lmod2 = lm(Rating ~ Episode + I(Episode^2), Season9)
lmod3 = lm(Rating ~ Episode + I(Episode^2) + I(Episode^3), Season9)
lmod4 = lm(Rating ~ Episode + I(Episode^2) + I(Episode^3) + I(Episode^4), Season9)
tab_model(lmod1, lmod2, lmod3, lmod4, auto.label=TRUE, show.ci=FALSE, show.fstat = TRUE, digits=6,
          digits.p=4, show.se = FALSE, show.aic = TRUE, dv.labels=c("lmod1","lmod2","lmod3","lmod4"),
          title="Season 9: models for Rating without E1 & E11")
```

Using AIC to compare the 4 models, the **cubic** linear model again has the lowest value, so we investigate it further. We ran the diagnostics *(omitted)* and found that the diagnostic plots became worse than before the outlier removal: heteroskedasticity is more pronounced, and there is evidence of dependence in the residuals. There is not much we can do to fight this: we could apply a log transformation, but that would make the result of the regression much more difficult to interpret, especially given how small the data set for this season is. The best we can do is **stick with our model**:

$$ \mathtt{Rating} = 6.55 + 0.41\mathtt{Episode} - 0.034\mathtt{Episode^2} + 0.0008\mathtt{Episode^3} $$

Solving this cubic suggests that the ratings in Season 9 increased on average and peaked at episode 9, decreased to their lowest at around episode 20, and increased slightly (on average) after that. This agrees with the ratings-over-time plot for this season.

---

Next we investigate $\textbf{Season 10}$.
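(As a quick check of the season 9 turning points just quoted, the derivative of the displayed cubic can be solved directly, with the coefficients taken from the equation above:)

```{r Season 9 Turning Points}
# d/dE (6.55 + 0.41 E - 0.034 E^2 + 0.0008 E^3) = 0.41 - 0.068 E + 0.0024 E^2
turning_points <- sort(Re(polyroot(c(0.41, -0.068, 0.0024))))
turning_points # peak near episode 9, trough near episode 20
```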
From the plot a polynomial seems like a reasonable model form, so we again begin by fitting multiple linear models:

```{r Season 10 Model 1}
Season10 = subset(BigBang, Season == 10)
lmod1 = lm(Rating ~ Episode, Season10)
lmod2 = lm(Rating ~ Episode + I(Episode^2), Season10)
lmod3 = lm(Rating ~ Episode + I(Episode^2) + I(Episode^3), Season10)
lmod4 = lm(Rating ~ Episode + I(Episode^2) + I(Episode^3) + I(Episode^4), Season10)
tab_model(lmod1, lmod2, lmod3, lmod4, auto.label=TRUE, show.ci=FALSE, show.fstat = TRUE, digits=6, digits.p=4, show.se = FALSE, show.aic = TRUE, dv.labels=c("lmod1","lmod2","lmod3","lmod4"),title="Season 10: initial models for Rating")
```

We see that the **cubic** model is the preferred model among these, so we will choose it for now. Diagnostic plots:

```{r Season 10 Diagnostics 1, message=FALSE}
diagnose(lmod3, alpha=0.7, x_variable="Episode", cooks.x.lab="Episode", std.box=0, std.nudge.x=-0.09, std.nudge.y=-0.3, sqrt.box=0, sqrt.nudge.x=-0.09, sqrt.nudge.y=-0.1, cooks.level=0.25, cooks.box=0.6, cooks.nudge.y=0.1)
```

We see evidence of **heteroskedasticity** in the $\sqrt{|\text{std. resid.}|}$ plot. We check what would happen upon removing **S10 E24**, an expected outlier from the Cook's distance plot:

```{r Season 10 Remove Outliers}
Season10 = subset(BigBang, Season == 10 & ID.1 != "S10 E24")
lmod1 = lm(Rating ~ Episode, Season10)
lmod2 = lm(Rating ~ Episode + I(Episode^2), Season10)
lmod3 = lm(Rating ~ Episode + I(Episode^2) + I(Episode^3), Season10)
lmod4 = lm(Rating ~ Episode + I(Episode^2) + I(Episode^3) + I(Episode^4), Season10)
tab_model(lmod1, lmod2, lmod3, lmod4, auto.label=TRUE, show.ci=FALSE, show.fstat = TRUE, digits=6, digits.p=4, show.se = FALSE, show.aic = TRUE, dv.labels=c("lmod1","lmod2","lmod3","lmod4"),title="Season 10: models for Rating w/o E24")
```

Now we see that the polynomial models are not significantly better than the simple linear model, so we should switch to `lmod1`.
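This comparison can be made explicit with a partial F-test between the nested models – a quick sketch we add for illustration; it was not part of the original chunks:

```{r}
# Hedged sketch: partial F-test of the simple linear model against the cubic
# on the reduced Season 10 data; a large p-value supports keeping lmod1
anova(lmod1, lmod3)
```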
We run the diagnostics one last time:

```{r Season 10 Diagnostics 2, message=FALSE}
diagnose(lmod1, alpha=0.7, x_variable="Episode", cooks.x.lab="Episode", std.box=0, std.nudge.x=-0.09, std.nudge.y=-0.3, sqrt.box=0, sqrt.nudge.x=-0.09, sqrt.nudge.y=-0.1, cooks.level=0.25, cooks.box=0.6, cooks.nudge.y=0.1)
```

We see no real evidence of heteroskedasticity in the $\sqrt{|\text{std. resid.}|}$ plot, as a horizontal line can clearly be drawn through the confidence interval. The first plot also shows that the residuals are roughly independent. There is some non-normality in the Q-Q plot; non-normality is usually not a problem for large samples thanks to the Central Limit Theorem, and although our sample size of $23$ is small, the errors appear close to normal anyway, with only the tails deviating. Applying a transformation here might yield a nicer Q-Q plot, but would make interpretation harder, so it is better to stick with the model we have. So our model for $\textbf{Season 10}$ is:

$$
\mathtt{Rating} = 7.856 - 0.0340\mathtt{Episode}
$$

This suggests that, not accounting for **S10 E24**, the season's ratings fell linearly by $0.034$ points per episode. The plot of season 10 shows that this is exactly what happened.

---

# 4. Impact of Amy and Bernadette on ratings

First, looking at both $\textit{Amy}$ and $\textit{Bernadette}$ individually:

```{r Amy & Bern Plots}
th=list(geom_jitter(width=3,alpha=0.3),theme(plot.margin=margin(-10,0,-10,0),plot.title=element_text(vjust=-5.5),plot.subtitle=element_text(hjust=1,vjust=-2, size=8)), geom_smooth(method=loess, formula=y~x, colour="salmon", size=0.8, fill="grey80"), labs(x="Lines in an episode"))
BBAmy = subset(BigBang, Amy != 0); BBBern = subset(BigBang, Bernadette != 0)
p1 = ggplot(BigBang, aes(Amy,Rating)) + th + labs(x="", title="Amy",subtitle="Entire show")
p2 = ggplot(BigBang, aes(Bernadette,Rating)) + th + labs(x="",y="", title="Bernadette",subtitle="Entire show")
p3 = ggplot(BBAmy,aes(Amy,Rating)) + th + labs(x="Lines in an episode",title="Amy",subtitle="After introduction")
p4 = ggplot(BBBern, aes(Bernadette,Rating)) + th + labs(y="",title="Bernadette",subtitle="After introduction")
ggarrange(p1,p2,p3,p4,ncol=2,nrow=2)
```

From the top two plots it appears that the show's ratings are lower when Amy and Bernadette have lines. However, the bottom two plots show that, given Amy and Bernadette are present in an episode, the rating is not affected by their number of lines. Together, this suggests that the ratings were higher before Amy and Bernadette were introduced (where they have 0 lines), but that after their addition, the number of lines Amy and Bernadette spoke had no effect on the rating. We can investigate this further by forming two linear models:

```{r Amy & Bern Models}
amod = lm(Rating ~ Amy, BigBang)
bmod = lm(Rating ~ Bernadette, BigBang)
tab_model(amod, bmod, auto.label=T, show.ci=F, show.fstat = T, digits=6, digits.p=4, show.se = F, show.aic = F, dv.labels=c("amod","bmod"),title="Amy & Bern: simple models for Rating")
```

We can see that the coefficient for $\textit{Amy}$ is significant and negative, which suggests that the more she speaks, the lower the rating is expected to be.
The coefficient of $\textit{Amy}$ is $-0.012$, which implies that for every line Amy says, the episode's rating declines by $0.012$ on average. The same can be said about the effect of $\textit{Bernadette}$. However, from Question 1 we know that the show's ratings generally decreased over time – we can account for this overall decline by adding the time trend from Question 2 as a predictor:

```{r Amy Accounted For Time}
amod2 = lm(Rating ~ Amy + EpisodeNumber + I(EpisodeNumber^2), BigBang)
tab_model(amod2, digits=6, digits.p=4, auto.label=F, show.ci=F, show.fstat=T, show.se=T, show.aic=F, dv.labels="Rating", title="amod2: accounting for time")
```

We can see that Amy's impact on ratings is not significant once the show's decline is accounted for. We can check whether Amy's lines are correlated with time:

```{r Amy-Time Correlation}
cor(BigBang$Amy,BigBang$EpisodeNumber)
```

A correlation of $0.64$ is fairly strong, but $0.7$ is the usual rule-of-thumb threshold at which correlation between two predictors ($\textit{Amy}$ and $\textit{EpisodeNumber}$) is considered problematic multicollinearity. To make sure multicollinearity is not an issue, we compute the variance inflation factors (VIFs), which assess whether the predictors are correlated with each other strongly enough to affect the reliability of the model:

```{r}
vif(amod2)
```

Usually a value greater than 2.5 (a conservative bound) is considered a problem, but each of these is below that value. This means the two predictors are not correlated enough to affect the model. So we conclude that Amy did not have any impact on ratings, as she is an insignificant predictor once the show's decline over time is accounted for.
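For reference, the VIF reported by `vif()` for a predictor $x_j$ is computed from the $R^2$ of regressing $x_j$ on the remaining predictors:

$$
\mathrm{VIF}_j = \frac{1}{1 - R_j^2}
$$

so the conservative bound of $2.5$ corresponds to $R_j^2 = 0.6$, i.e. $60\%$ of a predictor's variance being explained by the other predictors.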
We can do a similar analysis on the introduction of Bernadette:

```{r Bern Accounted For Time}
bmod2 = lm(Rating ~ Bernadette + EpisodeNumber + I(EpisodeNumber^2), BigBang)
tab_model(bmod2, digits=6, digits.p=4, auto.label=F, show.ci=F, show.fstat=T, show.se=T, show.aic=F,dv.labels=c("bmod2"),title="Bern: accounting for time")
cor(BigBang$Bernadette,BigBang$EpisodeNumber)
```

Again, Bernadette's lines are correlated with time, so we compute the VIFs to see whether this correlation affects the model:

```{r}
vif(bmod2)
```

Again both of these are below 2.5, so there isn't enough multicollinearity here to cause issues. So we conclude that Bernadette did not have any impact on ratings, as she is an insignificant predictor once the show's decline is accounted for. The same holds true for the combined effect of Amy and Bernadette:

```{r Amy+Bern Accounted For Time}
abmod2 = lm(Rating ~ I(Amy + Bernadette) + EpisodeNumber + I(EpisodeNumber^2), BigBang)
tab_model(abmod2, digits=6, digits.p=4, auto.label=F, show.ci=F, show.fstat=T, show.se=T, show.aic=F, dv.labels="abmod2",title="Amy & Bern: combined effect")
cor(BigBang$Bernadette + BigBang$Amy, BigBang$EpisodeNumber)
```

The correlation here is stronger than before and is around the $0.7$ threshold usually associated with significant multicollinearity, so we compute the VIFs to check whether this is a problem:

```{r}
vif(abmod2)
```

These values are higher than before, but still below 2.5, and therefore there is not enough evidence to suggest that the ratings declined with the introduction of Amy and Bernadette together.
If we go back to analysing the introduction of Amy and look only at the episodes after her introduction:

```{r}
amod1 = lm(Rating ~ Amy, subset(BigBang, Amy != 0))
amod2 = lm(Rating ~ Amy + EpisodeNumber + I(EpisodeNumber^2), subset(BigBang, Amy != 0))
tab_model(amod1,amod2, digits=6, digits.p=4, auto.label=F, show.ci=F, show.fstat=T, show.se=F, show.aic=F, dv.labels=c("amod1","amod2"),title="Amy: post-introduction only")
```

We don't consider $\textit{Amy}$ to be a significant predictor in either of these models, albeit she is closer to significance in the second. Doing the same for the effect of $\textit{Bernadette}$ doesn't produce anything interesting either. Therefore, while there was a significant decline in the rating around the time Amy and Bernadette were introduced, we can't conclude that Amy and Bernadette caused the fall in the ratings. We **cannot tell** what caused what: perhaps the production companies *Warner Bros. Television* and *Chuck Lorre Productions* anticipated a decline in ratings and introduced the new characters in an attempt to prevent it; alternatively, the new characters could have been added and disliked by viewers (although the ratings were not lower the more lines they spoke), causing the decline. We have also *quickly* investigated whether **any other characters** affect ratings, and we found that no individual character has a significant effect. We used a step function to find the linear models containing at least some of the characters up to Amy (top 8 characters only) with the lowest AIC values.
```{r}
p_load("olsrr")
terms <- colnames(BigBang)[7:14] #names of the predictors: all characters up to Amy
exprs <- sprintf("%s ~ EpisodeNumber + I(EpisodeNumber^2) + %s", "Rating", paste(terms, collapse = "+")) #expression: "Rating~EpisodeNumber+...+Amy"
model <- lm(as.formula(exprs), data = BigBang) #full model with every possible predictor
k <- ols_step_all_possible(model,aic=TRUE); k = arrange(k,aic) #step function
k <- k[grepl("EpisodeNumber ", k$predictors) & grepl("EpisodeNumber^2", k$predictors, fixed=TRUE),] #keep only the models that contain both EpisodeNumber terms (grepl takes one pattern at a time, so we combine two calls)
tab_df(k[1:5,c(1,2,3,4,5,8)],title="Top 5 linear models for characters (by lowest AIC)") #select top 5 by lowest AIC values
```

We have produced regression summaries for each of these 5 models (*omitted for brevity*), and **none of them** contain character predictors that are significant at the $5\%$ level. It appears that no single character has a significant effect on ratings once the show's overall rating trend is accounted for.

---

# 5. Impact of Chuck Lorre on ratings

Below is a graph showing Lorre's participation in the show over time.
```{r Lorre Time Chart}
BigBang["Lorre"] = 0; BigBang[which(str_detect(BigBang$Writer,"Chuck Lorre")),]$Lorre = 1 #puts 1 for Lorre, 0 for not-Lorre
BigBang$LorreLoess <- predict(loess(Lorre~EpisodeNumber, data = BigBang)) #gets the smooth average line
ggplot(BigBang, aes(x=EpisodeNumber,y=Lorre,colour=Season)) + geom_point(shape="I",size=6.7) + theme(axis.title.y = element_text(vjust=-10)) + labs(y="Written by",x="Episode Number",title="Episodes Co-written By Chuck Lorre") + geom_line(aes(y=LorreLoess,x=EpisodeNumber,colour=Season),size=1) + annotate(geom="text",label="Smooth average",size=3.5,colour="grey60",x=126,y=0.55,angle=-65) + scale_y_continuous(breaks=c(0,1),label=c("Others","Chuck\nLorre"))
```

We see that Lorre's participation in episode writing follows a roughly cyclical pattern: he wrote a lot in seasons 4–6, wrote little in seasons 7 and 8, and then started writing more again in season 10. Since $\textit{Lorre}$ is a binary variable, we next fit a simple linear model for rating with $\textit{Lorre}$ as the predictor (for a binary predictor this is equivalent to a two-sample $t$-test comparing the mean ratings of the two groups):

```{r Lorre Model 1}
lmod1 = lm(Rating ~ Lorre, BigBang)
tab_model(lmod1, digits=6, digits.p=4, auto.label=F, show.ci=F, show.fstat=T, show.se=F, show.aic=F, dv.labels="Rating",title="lmod1: Effect of Lorre on Rating")
```

We see that $\textit{Lorre}$ is not a significant predictor of rating. In fact, adding $\textit{EpisodeNumber}$ as another predictor – thus accounting for the show's decline – makes Lorre an even less significant predictor. Below is a graph of rating against episode number, with episodes co-written by Chuck Lorre highlighted. Smoothed trends for the episodes co-written by Lorre and for the episodes without him are added to the graph.
```{r Lorre Ratings Chart}
BBLorre1 = subset(BigBang, Lorre == 1); BBLorre0 = subset(BigBang, Lorre == 0)
BBSub = subset(BBLorre1, Rating > 9 | Rating < 7.15 | ID.1 == "S5 E11")
ggplot(data = BigBang, aes(EpisodeNumber, Rating, label=ID.1, colour=factor(Lorre),fill=factor(Lorre))) + geom_label_repel(data = BBSub, size=2, box.padding=0.2, nudge_y=-0.08,nudge_x=-10,aes(fill=NULL)) + geom_point(alpha=0.4) + labs(x="Episode Number",y="Rating") + ggtitle("Impact of Chuck Lorre on Ratings") + scale_colour_manual(values=c("#BABABA","#699BFF"),name="Written by",labels=c("Others","Chuck Lorre")) + scale_fill_manual(values=c("#BABABA","#699BFF"),name="Written by",labels=c("Others","Chuck Lorre")) + geom_smooth(data=BBLorre0,method=loess,formula=y~x,size=1,colour="#AAAAAA",fill="grey100",se=FALSE) + geom_smooth(data=BBLorre1,method=loess,formula=y~x,size=0.9,colour="#699BFF",fill="grey100",se=FALSE)
```

This graph shows that, on the whole, the episodes Chuck Lorre co-wrote did not have ratings significantly different from average: the two trend lines are very similar and not significantly different. It is worth noting, however, that among the exceptional episodes (explored in further detail later) he wrote several of the worst-rated episodes (including the lowest-rated) and none of the highest-rated ones. We conclude that, holding everything else equal, episodes co-written by Chuck Lorre **did not** have ratings significantly different from average. Note that this analysis counts every episode that had Chuck Lorre as a co-writer, so Lorre would not have had full creative control over all of them. All things considered, though, this is the best approach: only $6$ episodes were written by Chuck Lorre alone, and a dataset of 6 would not be sufficient to draw significant conclusions.

---

# 6. Exceptional episodes

We return to the model explaining the show's ratings over time:

```{r Load Show Rating Model}
lmod2 = lm(Rating ~ EpisodeNumber + I(EpisodeNumber^2), BigBang)
```

The residuals-vs-time plot from this model tells us how the rating of each episode is distributed around the show's running average:

```{r Residuals-EpisodeNumber Plot}
almod = augment(lmod2); almod["ID.1"] = BigBang$ID.1; t.level.high = 0.6; t.level.low = -0.6 #preparing for the plot
almod["EpisodeNumber"] = BigBang$EpisodeNumber; almod["Season"] = BigBang$Season #preparing for the plot
almodsub = subset(almod, .resid > t.level.high | .resid < t.level.low) #preparing for the plot
ggplot(data = almod,aes(x=EpisodeNumber,y=.resid,label=ID.1,colour=Season)) + geom_smooth(method=lm, formula=y~x, level=0.95,colour="#f9ccac", fill="grey92", size=0.5, linetype="dashed") + geom_label_repel(data = almodsub, size=2, box.padding=0.4, nudge_y=ifelse(almodsub$.resid > 0, 0.1, -0.1)) + geom_point(alpha=0.66) + labs(x="Episode Number", y="Deviation From Predicted Rating",title="Exceptional episodes") + theme(legend.position = "none") + scale_color_brewer(palette="Paired")
```

Here the line $y=0$ represents the show's predicted rating under the model. We define an *exceptional episode* as any episode whose rating lies outside $(R_{episode} - 0.6, R_{episode} + 0.6)$, where $R_{episode}$ is the predicted rating of the episode. We chose $0.6$ as an arbitrary – but reasonable – threshold because it admits only a handful of episodes ($17$ out of $231$, $\approx 7\%$); any narrower interval would flag far too many episodes, making the selected episodes not 'exceptional' at all. All of the exceptional episodes are labelled in the plot above.
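As a quick sanity check – a sketch added for illustration, not part of the original chunks – the number of flagged episodes can be counted directly from the model residuals:

```{r}
# Hedged sketch: count the episodes whose absolute residual exceeds the 0.6
# threshold; this is expected to agree with the 17 exceptional episodes above
sum(abs(resid(lmod2)) > 0.6)
```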
Limiting our attention to just those episodes:

```{r Good & Bad Episodes Bar}
df = almodsub[order(-almodsub$.resid),]; df = df[,c("ID.1","Rating",".fitted",".resid")]
ggplot(df, aes(x=ID.1, y=.resid, fill=.resid-0.2)) + geom_bar(stat = "identity") + scale_x_discrete(limits=df$ID.1) + scale_fill_gradient2(low="red", high="#00DD00") + labs(y="Deviation from predicted rating",title="Particularly Good vs Particularly Bad Episodes") + theme(axis.text.x = element_text(angle=45, hjust=1), legend.position="none", axis.title.x=element_blank())
```

We see that there are more exceptionally popular episodes than exceptionally unpopular ones, and that the further the show progressed, the more episodes beat their predicted rating. This could be related to the fact that average ratings went down in later seasons, making particularly good episodes stand out. We also see that the S8 and S9 premieres are among the most negatively deviant episodes; moreover, most of season 8 was rated lower than predicted – only 5 of its 24 episodes were rated higher than expected. Note also that the graph above does not include the lowest-rated episode, **S10 E22** (joint lowest with **S9 E1**). Despite having the lowest rating, it follows the overall declining trend of the data, making it less exceptional than **S9 E1**, even though both episodes have the same rating. The season 10 finale, by contrast, was very well received.
The top 10 best and worst episodes overall, including those already shown in the chart above, are:

```{r}
a = arrange(BigBang,-Rating) #sorts by rating
b = rbind(a[1:10,],a[(nrow(a)-9):nrow(a),]) #selects the top 10 and bottom 10 episodes
ggplot(b, aes(x=ID.1, y=Rating, fill=(Rating-8))) + geom_bar(stat = "identity") + theme_bw() + scale_x_discrete(limits=b$ID.1) + scale_fill_gradient2(low="red", high="#00DD00") + labs(y="Rating",title="Top 10 and Bottom 10 Rated Episodes") + scale_y_continuous(breaks=c(0:20)/2) + coord_cartesian(ylim=c(6.5,9.5)) + theme(axis.text.x = element_text(angle=45, hjust=1), legend.position="none", axis.title.x=element_blank())
```

All of the best-rated episodes here are from Seasons 1–4, and almost all of the worst are from Season 10. Unlike the graph before, this one **doesn't take into account the show's decline** over time, and therefore contains a number of unsurprisingly high/low rated episodes compared with the previous graph, which contained only the exceptional episodes. That explains why the best-rated episodes here are from early seasons and the worst-rated ones from later seasons.

---

# 7. Impact of characters on Raj

```{r Raj Model 1}
lmod = lm(Raj ~ Leonard + Sheldon + Penny + Howard + Leslie + Bernadette + Amy + Stuart + Emily + Mary + Zack + Bert + Janine + Wil, BigBang)
tab_model(lmod, digits=6, digits.p=4, auto.label=F, show.ci=F, show.fstat=T, show.se=T, show.aic=F, dv.labels=c("Raj"),title="lmod: characters' effect on Raj's lines")
```

From this we can see that, aside from the four main characters ($\textit{Leonard}$, $\textit{Sheldon}$, $\textit{Penny}$ and $\textit{Howard}$), the only other significant characters are $\textit{Bernadette}$ and $\textit{Emily}$, as their *p*-values are small. We can also see that, of these significant characters, $\textit{Howard}$ and $\textit{Emily}$ are the only two with a positive coefficient, suggesting a positive impact on how much Raj speaks.
```{r, message=FALSE} coefs = coef(lmod); coefs[-1][which(summary(lmod)$coeff[-1,4] > 0.05)] = 0 #selects significant coefficients, and sets others to 0 df = data.frame(coef=sort(coefs[-1])); df[,"Character"] = rownames(df) #sorts and adds names ggplot(df, aes(x=Character, y=coef, fill=sign(coef))) + geom_bar(stat = "identity") + scale_x_discrete(limits=df$Character) +scale_fill_gradient2(low="red", high="#00DD00") + labs(y="Effect on Raj",title="Significant Effect on Raj's Lines") + scale_y_continuous(breaks=c(-10:10)/5) + theme(axis.text.x = element_text(angle=45, hjust=1), legend.position="none", axis.title.x=element_blank()) ``` The plot above shows the effect of each of the characters on how much Raj speaks, with each of the bars representing a coefficient from the linear model above, with $0$ for insignificant ones. We can see that $\textit{Leonard}$ and $\textit{Sheldon}$ both have a negative effect on $\textit{Raj}$ – suggesting that the more lines these characters had, the fewer lines Raj would speak. We know from having watched the show that Raj is incapable of speaking to girls until the end of season 6, and Sheldon and Leonard often appear in scenes with Penny so this may explain why he speaks less with them on average. It also shows that other minor characters have almost no impact on how much Raj spoke. We also see that $\textit{Bernadette}$ has the most negative effect on $\textit{Raj}$, which is reinforced by the notion that Raj cannot speak to girls up to season 6. Of all the main characters in the show, $\textit{Howard}$ is the only one to have a positive effect on how much Raj speaks - even if it is weak - this is probably due to the fact that they were best friends in the show and therefore would interact together more frequently. Furthermore, Emily was Raj's girlfriend which explains why she has such a strong effect on how much Raj speaks. 
From the characters' plots we know that Emily is not a main character in the show, so this effect could simply be down to Emily not appearing in the episodes which don't feature much of Raj.

---

# 8. Impact of wordiness on rating

We can sum all characters' lines to find the total lines for each episode:

```{r}
BigBang["TotalLines"] = rowSums(BigBang[,c(7:21)])
```

And plot the episode ratings against total lines in the episodes:

```{r}
p = ggplot(BigBang, aes(x=TotalLines,y=Rating)) + geom_point(alpha=0.3,colour="black") + geom_density2d(alpha=0.3,colour="#226DB4") + labs(x="Total Lines", y="Episode Rating",title="Rating-Lines Marginal Density Plot") + geom_smooth(method=lm,formula=y~x,colour="dodgerblue",linetype="dashed",size=0.2,fill="grey80",alpha=0.2)
ggMarginal(p, type="density", fill="#226DB4", alpha=0.3, colour="black", xparams=list(size=0.2), yparams=list(size=0.2))
```

The marginal density for total lines ($x$-axis) suggests that the episodes are spread fairly evenly around the mean, with almost all episodes lying within $(150,275)$ lines. The marginal density for ratings is approximately normal with a slight negative skew. The fitted line suggests a negative relationship, but the $95\%$ confidence interval is very wide and admits a horizontal line, so there may not be a significant relationship. To verify this intuition we fit a linear model:

```{r}
lmod = lm(Rating ~ TotalLines, BigBang)
tab_model(lmod, digits=6, digits.p=4, auto.label=F, show.ci=F, show.fstat=T, show.se=T, show.aic=F, dv.labels=c("Rating"),title="lmod: Effect of Total Lines on Rating")
```

This linear model shows that the total number of lines is not a significant predictor of the rating, so the 'wordiness' of an episode **does not** have an effect on its popularity. This is further supported by the fact that the total number of lines and the rating are essentially uncorrelated:

```{r}
cor(BigBang$TotalLines,BigBang$Rating)
```

---

# 9. Effect of rating on viewership

The rating of a show is important, but ultimately the viewership numbers are what determine whether the show stays on air, and rating seems like it should be a good predictor of viewership. We would like to investigate whether rating has an effect on viewership. Extra data gathered for this question was found [here](https://data.world/priyankad0993/big-band-theory-information?fbclid=IwAR1fUTNpTRfYTZEXLAqoc593rQwweh0LTh-qu2DsYYxgG_So_4f3eIaQzwg). The file is called `Big Bang Theory_Wiki.xlsx`, and we rename it to `BigBangExtra2.xlsx`. We only analyse the first 10 seasons, so we checked that all episodes 1 to 231 match up with our data, `BigBang`, by comparing episode titles. From it, we extract only the US viewership for each episode, and plot viewership over time, with point size representing the rating:

```{r}
b2 <- read.xlsx("BigBangExtra2.xlsx") #loads the data
BigBang["Viewership"] = as.numeric(b2$`US.viewers.(millions)`[1:231]) #imports the viewership column into our data. All episodes line up. All OK.
ggplot(BigBang,aes(EpisodeNumber,Viewership,colour=Season)) + geom_point(size = (BigBang$Rating/mean(BigBang$Rating))^10, alpha=0.6) + labs(title="Viewership over time",subtitle="Size is rating", y="US Viewership (millions)", x="Episode Number") + geom_smooth(method=loess, formula=y~x, size=0.6, fill="grey92",colour="#f9ccac",linetype="dashed") + theme(plot.subtitle=element_text(hjust=1,vjust=-0.5),plot.title=element_text(vjust=-5.5))
```

We see that the show's viewership peaked in season 8 and decreased after that. It is also worth noting that the rating *(point size)* is visibly lower during this decline, suggesting that ratings decreased together with viewership. We would like to investigate whether the decrease in ratings drives the viewership numbers.
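One simple probe of this lead–lag question – a sketch we add for illustration; it was not part of the original analysis – is the sample cross-correlation function between the two series:

```{r}
# Hedged sketch: cross-correlation between Rating and Viewership.
# In R's convention ccf(x, y) estimates cor(x[t+k], y[t]), so peaks at
# negative lags would suggest Rating leading Viewership. Note that both
# series share a strong trend, which inflates correlations at all lags.
ccf(BigBang$Rating, BigBang$Viewership, lag.max = 24)
```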
We can plot Viewership and Rating over time on a single phase plane:

```{r Phase plot}
plot.new(); ps <- data.frame(xspline(BigBang[,c("Rating","Viewership")], shape=1, lwd=2, draw=F)); ps$n = 1:length(ps[,1]) #using a spline to smooth a path
ggplot(ps,aes(x,y,colour=n)) + geom_path(size=0.5,alpha=0.7) + scale_colour_distiller(palette="Spectral") + labs(y="US Viewership (millions)",x="Rating",title="Viewership and Rating over time") + annotate("text",label="bold('Season 1')",x=8.75,y=8,colour="grey60",parse=TRUE) + annotate("text",label="bold('Season 10')",x=7.12,y=11.5,colour="grey60",parse=TRUE) + theme(legend.position="none")
```

We see that initially the ratings are fairly high while viewership is relatively low. At first, as the show continues, the ratings stay high while the viewership increases – it looks like the **high rating drives** the increase in **viewership**. But then the ratings start to decrease (green-yellow areas), and the viewership decreases afterwards (orange-red). This suggests that initially the high ratings caused high viewership, and that as the ratings fell, viewership later fell too, with a lag between bad ratings and the decline in views. The lag makes sense: many people would not stop watching the show after only one episode that is worse than before, and viewers must first watch an episode before rating it, so a low-rated episode may not necessarily have low viewership. We fit a simple linear model for viewership with rating as a predictor:

```{r}
lmod1 = lm(Viewership ~ Rating,BigBang)
tab_model(lmod1, digits=6, digits.p=4, auto.label=T, show.ci=F, show.fstat=T, show.se=T, show.aic=F, dv.labels=c("Viewership"),title="Effect of Rating on Viewership")
```

We see that rating is a significant predictor of viewership. However, its coefficient is negative, which doesn't sound right, so the estimate may be **biased**.
We can try to reduce the bias by adding $\textit{EpisodeNumber}$ as an extra predictor, along with some higher-degree terms:

```{r}
lmod2 = lm(Viewership ~ Rating + I(Rating^2) + EpisodeNumber + I(EpisodeNumber^2) + I(EpisodeNumber^3) + I(EpisodeNumber^4) + I(EpisodeNumber^5) + I(EpisodeNumber^6) + I(EpisodeNumber^7), BigBang)
tab_model(lmod2,digits=14, digits.p=4, auto.label=T, show.ci=F, show.fstat=T, show.se=T, show.aic=T,title="Effect of Rating on Viewership accounted for time")
```

To arrive at this model we started with degree-1 polynomials in $\textit{Rating}$ and $\textit{EpisodeNumber}$ and kept increasing the polynomial degree, checking that each increase made the model significantly better than the previous one by running an ANOVA on each pair of consecutive-degree polynomials, and stopping once the problems in the diagnostic plots were eliminated. We could have kept increasing the degree, but we think this is good enough. The AIC for this model was lower than that of every simpler model, which is a good sign. We also tried a log-transformation of $\textit{Viewership}$ together with polynomials, but saw little difference in the fit in the diagnostics. The diagnostics for our model are:

```{r}
diagnose(lmod2, resid.x.lab="Fitted viewership", alpha=0.4, colour="Season", sqrt.level=1.7)
```

These show that the model fits well enough – **Seasons 3** and **4** have noticeably higher and lower residuals respectively, but since they are very close together this shouldn't make our model too wrong. The first plot shows that the model's structure fits the data well and the residuals are roughly independent; the Q-Q plot shows that the errors are approximately normally distributed; and the $\sqrt{|\text{std. resid.}|}$ plot shows that the variance of the errors appears roughly constant. The $\textit{R}^2$ and *adjusted* $\textit{R}^2$ are also very high, supporting this model.
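The sequential comparisons described above can be sketched as follows – an illustration of the procedure, not the exact sequence of fits we ran:

```{r}
# Hedged illustration: compare two consecutive polynomial degrees with a
# partial F-test; a small p-value favours keeping the higher-degree model
m.lo = lm(Viewership ~ Rating + EpisodeNumber + I(EpisodeNumber^2), BigBang)
m.hi = lm(Viewership ~ Rating + EpisodeNumber + I(EpisodeNumber^2) + I(EpisodeNumber^3), BigBang)
anova(m.lo, m.hi)
```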
We make the effect plots for $\textit{Rating}$ and $\textit{EpisodeNumber}$:

```{r}
closest <- function(x, x0) apply(outer(x, x0, FUN=function(x, x0) abs(x - x0)), 1, which.min) #using ggplot instead of plot(effects)
eff = effect("EpisodeNumber", lmod2, partial.residuals=T); x.fit <- unlist(eff$x.all); trans <- I #collect all effect terms
x1 <- data.frame(lower = eff$lower, upper = eff$upper, fit = eff$fit, EpisodeNumber = eff$x$EpisodeNumber) #store effects in df. (lower,upper) is the CI
xy1 <- data.frame(x = x.fit, y = x1$fit[closest(trans(x.fit), x1$EpisodeNumber)] + eff$residuals); xy1$Season = BigBang$Season #fitted values
eff = effect("Rating", lmod2, partial.residuals=T); x.fit <- unlist(eff$x.all); trans <- I #same for the 2nd predictor (Rating)
x2 <- data.frame(lower = eff$lower, upper = eff$upper, fit = eff$fit, Rating = eff$x$Rating)
xy2 <- data.frame(x = x.fit, y = x2$fit[closest(trans(x.fit), x2$Rating)] + eff$residuals); xy2$Season = BigBang$Season
p1 = ggplot(x1, aes(x=EpisodeNumber, y=fit)) + labs(x="Episode Number",y="Fitted viewership") + geom_ribbon(aes(ymin = lower, ymax = upper), alpha = 0.5,fill="grey82") + geom_line(colour="dodgerblue",size = 1) + geom_point(data = xy1, aes(x = x, y = y,colour=Season), shape = 1, size = 2,alpha=0.6) + scale_y_continuous(expand=c(0.05,0.05)) + theme(plot.margin=unit(c(1,0,0,0),"lines")) + guides(colour = guide_legend(override.aes = list(size=2,shape = 15)))
p2 = ggplot(x2, aes(x=Rating, y=fit)) + labs(x="Rating") + geom_point(data = xy2, aes(x = x, y = y,colour=Season), shape = 1, size = 2,alpha=0.4) + geom_ribbon(aes(ymin = lower, ymax = upper), alpha = 0.3,fill="grey82") + geom_line(colour="dodgerblue",size = 1) + scale_y_continuous(expand=c(0.05,0.05)) + theme(plot.margin=unit(c(1,0,0,0.25),"lines"),axis.title.y=element_blank()) + guides(colour = guide_legend(override.aes = list(size=2,shape = 15)))
p = ggarrange(p1,p2,ncol=2,common.legend = TRUE,legend="right")
annotate_figure(p,fig.lab="Effect plots for EpisodeNumber and Rating",fig.lab.pos="top.left")
```

We see that both predictors are significant and can be regarded as important. The $\textit{EpisodeNumber}$ plot tells us how our model describes viewership over time. The $\textit{Rating}$ plot suggests that episodes rated above $\approx 8$ have significantly higher viewership, while episodes rated below $8$ also have somewhat higher viewership, but not significantly so. It makes sense that highly rated episodes have higher viewership. We also see that the confidence interval in the rating plot is wider than that in the episode-number plot, suggesting that episode number is the stronger predictor. This is also supported by the *p*-values in the table above, which tell us that both predictors are significant.

---

# 10. Summary of report

Throughout this report we have investigated various aspects of the show – e.g. episode *wordiness* or the presence of *certain characters* – that might have an effect on the rating of an episode. Looking at the show as a whole, we found that the **ratings fell** over time as the show progressed. We identified episodes which did not follow this pattern as **exceptional episodes**. These episodes were more likely to have an unusually high rating than an unusually low one, likely because, as the average rating decreased over time, particularly highly rated episodes stood out more. After looking at the show as a whole, we focused our attention on **how the ratings varied within each season**. There was little variation in rating for seasons 1–5, 7 and 8, but seasons 9 and 10 had rating trends within them, along with a few episodes which couldn't be explained by the trend. Season 6, instead of having a roughly constant rating like the other seasons, had ratings that increased, peaked at episode 12, and then declined after that.
We also examined how the number of words in an episode affects its rating, and we found **no link between wordiness and rating**: ratings appeared to be spread out regardless of the number of words per episode. More broadly, we found **no evidence** that episodes written by **Chuck Lorre** have **higher or lower ratings than average**. We also found **no link** between the presence of **Amy** or **Bernadette** and **ratings** once we accounted for the show's natural decline over time.

We looked into how different characters encourage or discourage Raj from speaking and found that certain characters, such as **Howard** (Raj's best friend in the show) and **Emily** (Raj's girlfriend in the show), have a **positive impact**, meaning that the more either of the two speaks in an episode, the more lines Raj has on average. By contrast, **Penny**, **Leonard** and **Sheldon** have a **negative impact** on how much Raj speaks.

Furthermore, we have shown how **viewership** changed over time and that there is a link between **rating and viewership**: higher-rated episodes are more likely to have higher-than-average viewership. Across all of our analysis we did not find any significant factor that explains the show's decline in rating over time.

---

A short description of how we worked: we used a website called HackMD to share and edit code in real time, which we could then each run individually in our own RStudio. We used group chats and calls to communicate while we worked, but we also worked on the project individually. Communicating with our Chinese colleague proved especially challenging after he returned to China and entered quarantine. We chose to write this report on a single long page because of the size of some code chunks; we would have used A4 if we had some leeway over which code to include or exclude.
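As a brief aside, the `closest()` helper used in the effect-plot code can be understood in isolation: for each observed predictor value it returns the index of the nearest point on the grid that `effect()` evaluated the fit on, so that each partial residual can be attached to a fitted value. The grid and observed values below are toy numbers for illustration only:

```r
# closest(): for each value in x, return the index of the nearest value in x0
closest <- function(x, x0) {
  apply(outer(x, x0, FUN = function(x, x0) abs(x - x0)), 1, which.min)
}

grid <- c(1, 5, 10, 20)   # e.g. the grid of predictor values used by effect()
obs  <- c(2, 6, 19)       # observed predictor values from the data
idx  <- closest(obs, grid)
idx                       # 1 2 4: 2 is nearest 1, 6 nearest 5, 19 nearest 20
```

The `outer()` call builds the full matrix of absolute differences between observations and grid points, and `apply(..., 1, which.min)` picks the column index of the minimum in each row.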
