# Presentation Script
Hi, I'm Kahvi, here with Nolan, Jayden and Ahnaf. We worked on an IMDb rating-prediction system.
## Intro
Imagine you're a studio executive who'd like to know how much people will love or hate your Dark Knight sequel. With our predictor, you just submit the movie's attributes (like budget, duration and actors) and it gives you a predicted audience rating before the movie is ever released.
<!-- Basically, our goal was to predict the audience rating of a movie. -->
This idea was based on the Internet Movie Database (IMDb), which contains over 85,000 movies released between 1906 and 2020. Our label vector (the audience ratings) contained values from 0 to 10 in steps of 0.1.
To measure our results, we used Mean Absolute Error, or MAE.
For example, a model with an MAE of 1.5 means its predictions are off from the actual audience rating by 1.5 points on average.
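As a quick illustration of the metric, here is MAE computed by hand on a few made-up ratings (the numbers below are hypothetical, not from our dataset):

```python
import numpy as np

def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between predicted and actual ratings."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))

# Hypothetical ratings: errors of 0.5, 1.0 and 1.5 average out to an MAE of 1.0.
actual = [7.0, 6.5, 8.0]
predicted = [7.5, 5.5, 6.5]
print(mean_absolute_error(actual, predicted))  # 1.0
```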
## Models
Right off the bat, Nolan constructed a simple three-layer regression neural network that achieved an *average* MAE of 0.74.
It used Trenn's formula to pick the optimal number of neurons in the hidden layer. The *best* MAE it achieved was 0.72 after validation.
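The three-layer shape can be sketched with scikit-learn's `MLPRegressor` on synthetic data. This is not Nolan's actual code, and the features below are stand-ins, not the real IMDb columns:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)
# Stand-in features (think budget, duration, year) and ratings near 5-8.
X = rng.random((200, 3))
y = 5.0 + 3.0 * X[:, 0] + rng.normal(0, 0.3, 200)

# A single hidden layer between input and output gives the three-layer shape;
# 8 neurons is an arbitrary choice here, not the value Trenn's formula gives.
model = MLPRegressor(hidden_layer_sizes=(8,), max_iter=2000, random_state=0)
model.fit(X, y)
mae = mean_absolute_error(y, model.predict(X))
print(round(mae, 2))
```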
Nolan also converted this model into a classifier by spreading the rating range over 100 classes. This gave low accuracy, overfitting, and fluctuating results (as you can see in this plot).
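One plausible way to spread 0.1-step ratings over 100 classes is to multiply by 10 and clip the top value into the last bucket. This mapping is an assumption for illustration, not necessarily the one Nolan used:

```python
def rating_to_class(rating):
    """Map a 0.1-step rating in [0, 10] onto one of 100 integer classes;
    10.0 is clipped into the top bucket (class 99)."""
    return min(int(round(rating * 10)), 99)

def class_to_rating(cls):
    """Recover the representative rating for a class."""
    return cls / 10.0

print(rating_to_class(7.4))  # 74
print(class_to_rating(74))   # 7.4
```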
Ahnaf worked on a deep learning model.
It consisted of 5 activation layers, chosen through extensive cross-validation. But likewise, it couldn't break the 0.7 barrier we'd started with.
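Selecting a depth by cross-validation might look like the sketch below: score each candidate layer configuration by cross-validated MAE and keep the best. The data and the candidate shapes are made up for illustration:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
X = rng.random((200, 4))                      # stand-in features
y = 5.0 + 2.0 * X[:, 0] + rng.normal(0, 0.4, 200)

# Compare candidate depths by cross-validated MAE and keep the best.
candidates = [(16,), (16, 8), (16, 8, 4)]
scores = {}
for layers in candidates:
    model = MLPRegressor(hidden_layer_sizes=layers, max_iter=2000,
                         random_state=0)
    mae = -cross_val_score(model, X, y, cv=3,
                           scoring="neg_mean_absolute_error").mean()
    scores[layers] = mae

best = min(scores, key=scores.get)
print(best)
```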
Jayden refactored his Assignment 1 code to write a decision tree that supported multiple classes instead of just true/false values. He achieved an MAE of 0.95 using only three columns
<!-- ; `divorces`, `height` and `actor_gender`. -->
Jayden also used sklearn to construct a random forest model. In total, he tested over 500 combinations of features with 50 to 100 trees. This gave a similar final MAE of 0.96, still above our starting point.
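A minimal sketch of that setup with sklearn's `RandomForestRegressor` follows; the data is synthetic and the tree count of 75 is just an arbitrary point in the 50-100 range Jayden swept:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(1)
X = rng.random((300, 3))          # stand-in for the three chosen columns
y = 5.0 + 2.0 * X[:, 1] + rng.normal(0, 0.5, 300)

# n_estimators sets the number of trees in the forest.
forest = RandomForestRegressor(n_estimators=75, random_state=0)
forest.fit(X, y)
mae = mean_absolute_error(y, forest.predict(X))
print(round(mae, 2))
```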
## Conclusion
At the end of all this, we've arrived at a few conclusions:
First, training a neural network or decision tree on so many categorical columns is difficult. In the dataset, columns like `writers` contained thousands of unique names that made constructing a decision tree infeasible.
And given that *most* of the columns in the dataset were categorical, it made training a neural network difficult. Even with encoding, the categorical columns didn't make a significant difference in the resulting MAE.
Second, this prediction is fundamentally about human behaviour. The audience rating of a movie depends as much on people's emotions and psychology as it does on the movie's actual attributes. People are difficult to predict.
For future development, we would like to focus on different ways to encode categorical columns, such as binary encoding or hash encoding.
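The two techniques can be sketched in a few lines. Both compress a high-cardinality column (like the thousands of names in `writers`) into a handful of numeric features; the bucket and bit counts below are illustrative choices:

```python
import hashlib

def hash_encode(value, n_buckets=16):
    """Hash trick: map any string (e.g. a writer's name) to one of
    n_buckets columns, so thousands of names need only 16 features."""
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets

def binary_encode(index, n_bits=12):
    """Binary encoding: a category's integer index becomes a list of bits,
    so 2**12 = 4096 categories fit in 12 columns instead of 4096 one-hot."""
    return [(index >> b) & 1 for b in range(n_bits)]

print(binary_encode(5, n_bits=4))  # [1, 0, 1, 0]
```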
Thanks for listening.