The goal of this project is to build a classification algorithm using logistic regression with a One vs Rest strategy, as well as to build some intuition about the basics of data science by introducing a problem from the Harry Potter series - the sorting hat.
We are given a training dataset of students' scores in multiple subjects along with their assigned houses. We will use this dataset to train a logistic regression model that predicts the houses of other students from the scores they provide.
A basic understanding of linear regression is assumed at this point.
Learning materials
Intro to LR
Intro to LR
Logistic regression with gradient descent
Derivation of loss function
$$p(x \mid \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

Above is the formula for a normal distribution / Gaussian curve for some data $x$, with mean $\mu$ and standard deviation $\sigma$.
If we adopt a probabilistic view of linear regression, the maximum likelihood estimator under zero-mean Gaussian noise is the same as linear regression with squared error.
The goal of maximum likelihood estimation is to fit the parameters (mean and standard deviation) of the Gaussian curve so that the observed data has the highest probability of occurring.
We go back to the formula that gives us the probability of a single $x$:

$$p(x \mid \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

To calculate the joint probability of all observed data points $x_1, x_2, \dots, x_n$ (assuming they are independent), we take the product of the individual probabilities:

$$L(\mu, \sigma \mid x_1, \dots, x_n) = \prod_{i=1}^{n} p(x_i \mid \mu, \sigma)$$

And if we want to maximize a function like that, we have to use differentiation. The original product form is hard to differentiate, hence we first transform the function to its natural logarithmic counterpart,

and it transforms to:

$$\ln L(\mu, \sigma) = \sum_{i=1}^{n} \ln p(x_i \mid \mu, \sigma)$$

Say we have the $n$ observed data points $x_1, \dots, x_n$; substituting the Gaussian formula gives

$$\ln L(\mu, \sigma) = \sum_{i=1}^{n} \ln\left(\frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x_i-\mu)^2}{2\sigma^2}}\right)$$

We can simplify this equation using the logarithm laws to

$$\ln L(\mu, \sigma) = -n\ln(\sigma) - \frac{n}{2}\ln(2\pi) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2$$

We can calculate the partial derivative of the above function with respect to $\mu$, set it to zero, and solve:

$$\frac{\partial \ln L}{\partial \mu} = \frac{1}{\sigma^2}\sum_{i=1}^{n}(x_i - \mu) = 0 \quad\Rightarrow\quad \mu = \frac{1}{n}\sum_{i=1}^{n} x_i$$

The standard deviation can be calculated in a similar manner by differentiating with respect to $\sigma$:

$$\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)^2}$$
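As a quick sanity check, here is a minimal sketch (assuming NumPy is available; the data values are made up) showing that the maximum likelihood estimates above are simply the sample mean and the population standard deviation:

```python
import numpy as np

# Hypothetical observed data points x_1 ... x_n
x = np.array([4.2, 5.1, 3.8, 6.0, 5.5])

# Maximum likelihood estimates derived above
mu_hat = x.sum() / len(x)                                  # mean
sigma_hat = np.sqrt(((x - mu_hat) ** 2).sum() / len(x))    # std (divide by n, not n - 1)

print(mu_hat, sigma_hat)
print(np.mean(x), np.std(x))  # matches NumPy's built-ins
```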
Unlike linear regression, logistic regression predicts whether something is true or false (1 or 0), and the curve we fit is an 'S' shape like so
If we want to predict new data, the curve maps it to the probability of the data being classified as true. This can be used for classification.
Since logistic regression does not have least squares or R-squared, we will use the maximum likelihood of all observed data to fit the regression line.
Coefficients in linear regression are the intercept and slope of the fitted line, i.e. $y = \beta_0 + \beta_1 x$.
For logistic regression, we also have coefficients, but they act on the log-odds of the prediction: $\ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x$.
These coefficients only apply to continuous variables (the values of $x$ form a range), not to discrete variables (which only contain a set number of values).
To get the coefficients of logistic regression represented in a linear manner, we can't use least squares, because on the log-odds scale the residuals are infinite. Instead, we use maximum likelihood by doing the following.
So far we have only been able to classify values of 0 and 1 (binary). To classify more categories like A, B, C, D, we can group them as 'A vs not A', 'B vs not B', and so on. For each grouping we compute the probability of the positive class, then assign each sample to the class with the highest probability. This is called One vs Rest classification.
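To make the grouping concrete, here is a minimal sketch (the labels are hypothetical) of how the house labels could be turned into one binary target per house:

```python
houses = ["Ravenclaw", "Slytherin", "Gryffindor", "Hufflepuff"]
labels = ["Ravenclaw", "Slytherin", "Ravenclaw", "Gryffindor"]  # hypothetical training labels

# One binary target vector per house: 1 = "this house", 0 = "not this house"
binary_targets = {
    house: [1 if label == house else 0 for label in labels]
    for house in houses
}
print(binary_targets["Ravenclaw"])  # [1, 0, 1, 0]
```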
We will start with the sigmoid function (a function that maps $+\infty$ to 1 and $-\infty$ to 0), a.k.a. the function that calculates the probability:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

This sigmoid function also works as our hypothesis function, the function that is run to make predictions:

$$h_\theta(x) = \sigma(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}$$

$\theta$ is the vector of coefficients $[\theta_0, \theta_1, \dots, \theta_n]$ where $n$ is the number of features, hence

$$\theta^T x = \theta_0 + \theta_1 x_1 + \dots + \theta_n x_n$$

and $h_\theta(x)$ is the probability, $\theta$ are the coefficients, and $x$ are the variables (features).
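A minimal sketch of the hypothesis function in NumPy (the function and variable names are my own, not prescribed by the project):

```python
import numpy as np

def sigmoid(z):
    """Map any real value to the (0, 1) range."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(theta, X):
    """Hypothesis h_theta(x): probability that each row of X belongs to the positive class.

    X is (m, n+1) with a leading column of ones for the intercept theta_0,
    theta is (n+1,).
    """
    return sigmoid(X @ theta)
```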
Since $h_\theta(x)$ is the probability that $y = 1$, we have two cases:

$$P(y = 1 \mid x; \theta) = h_\theta(x), \qquad P(y = 0 \mid x; \theta) = 1 - h_\theta(x)$$

The functions above can be simplified to a single expression:

$$P(y \mid x; \theta) = h_\theta(x)^{y}\,\left(1 - h_\theta(x)\right)^{1 - y}$$

The powers $y$ and $1 - y$ cancel each other out (one of them becomes 0) depending on the actual value of $y$.

With this, we can also define the cost function for the 2 possible values of $y$:

$$\mathrm{Cost}(h_\theta(x), y) = \begin{cases} -\ln(h_\theta(x)) & \text{if } y = 1 \\ -\ln(1 - h_\theta(x)) & \text{if } y = 0 \end{cases}$$

We can simplify the two cost functions into one function:

$$\mathrm{Cost}(h_\theta(x), y) = -y\,\ln(h_\theta(x)) - (1 - y)\,\ln(1 - h_\theta(x))$$
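A minimal sketch of that cost, averaged over all $m$ training examples, reusing the hypothetical `predict_proba` helper from above (so `numpy` is already imported as `np`):

```python
def cost(theta, X, y):
    """Cross-entropy cost J(theta), averaged over the m training examples."""
    h = predict_proba(theta, X)
    m = len(y)
    return -(y * np.log(h) + (1 - y) * np.log(1 - h)).sum() / m
```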
For $m$ training examples, the full cost function is the average over all examples:

$$J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \mathrm{Cost}\left(h_\theta(x^{(i)}), y^{(i)}\right) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \ln h_\theta(x^{(i)}) + \left(1 - y^{(i)}\right) \ln\left(1 - h_\theta(x^{(i)})\right) \right]$$

Since our coefficients will be updated with gradient descent,

$$\theta_j := \theta_j - \alpha \frac{\partial J}{\partial \theta_j}$$

we need the partial derivative of $J$ with respect to each coefficient $\theta_j$. We first declare some variables (for a single training example, dropping the superscripts):

$$z = \theta^T x, \qquad h = \sigma(z) = \frac{1}{1 + e^{-z}}, \qquad J = -y\ln(h) - (1 - y)\ln(1 - h)$$

We can redefine our partial derivatives using the chain rule:

$$\frac{\partial J}{\partial \theta_j} = \frac{\partial J}{\partial h} \cdot \frac{\partial h}{\partial z} \cdot \frac{\partial z}{\partial \theta_j}$$

We start by solving for each factor individually.

Solve for $\frac{\partial J}{\partial h}$. From previous definitions, we know that $J = -y\ln(h) - (1 - y)\ln(1 - h)$, which we can deduce to

$$\frac{\partial J}{\partial h} = -\frac{y}{h} + \frac{1 - y}{1 - h} = \frac{h - y}{h(1 - h)}$$

Solve for $\frac{\partial h}{\partial z}$. Going back to $h = \sigma(z) = (1 + e^{-z})^{-1}$,

$$\frac{\partial h}{\partial z} = \frac{e^{-z}}{(1 + e^{-z})^2} = h(1 - h)$$

Solve for $\frac{\partial z}{\partial \theta_j}$. Since $z = \theta_0 x_0 + \theta_1 x_1 + \dots + \theta_n x_n$ (with $x_0 = 1$),

$$\frac{\partial z}{\partial \theta_j} = x_j$$

We can use these three results in the chain rule; the $h(1 - h)$ terms cancel:

$$\frac{\partial J}{\partial \theta_j} = \frac{h - y}{h(1 - h)} \cdot h(1 - h) \cdot x_j = (h - y)\,x_j$$

Averaged over the $m$ training examples, the gradient used in the update rule is

$$\frac{\partial J}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$$
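Putting the pieces together, a minimal gradient descent loop might look like the sketch below (again reusing the hypothetical `predict_proba` helper; `alpha` and `n_iterations` are arbitrary values to tune):

```python
def fit(X, y, alpha=0.1, n_iterations=1000):
    """Fit one binary logistic regression with batch gradient descent.

    X is (m, n+1) with a leading column of ones, y is (m,) of 0/1 labels.
    """
    m, n_features = X.shape
    theta = np.zeros(n_features)
    for _ in range(n_iterations):
        h = predict_proba(theta, X)      # h_theta(x) for every example
        gradient = X.T @ (h - y) / m     # (1/m) * sum((h - y) * x_j) for every j
        theta -= alpha * gradient        # theta_j := theta_j - alpha * dJ/dtheta_j
    return theta
```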
One of the requirements of this project is to provide a script that describes the data (gives a high-level view of the data itself), which can be done by evaluating different metrics that give an idea of what the data is like. The description of each metric is as follows
Since the training dataset looks something like this, determining those metrics just requires some basic parsing (see the sketch after the sample below)
Index,Hogwarts House,First Name,Last Name,Birthday,Best Hand,Arithmancy,Astronomy,Herbology,Defense Against the Dark Arts,Divination,Muggle Studies,Ancient Runes,History of Magic,Transfiguration,Potions,Care of Magical Creatures,Charms,Flying
0,Ravenclaw,Tamara,Hsu,2000-03-30,Left,58384.0,-487.88608595139016,5.727180298550763,4.8788608595139005,4.7219999999999995,272.0358314131986,532.4842261151226,5.231058287281048,1039.7882807428462,3.7903690663529614,0.7159391270136213,-232.79405,-26.89
1,Slytherin,Erich,Paredes,1999-10-14,Right,67239.0,-552.0605073421984,-5.987445780050746,5.520605073421985,-5.612,-487.3405572673422,367.7603030171392,4.107170286816076,1058.9445920642218,7.248741976146588,0.091674183916857,-252.18425,-113.45
2,Ravenclaw,Stephany,Braun,1999-11-03,Left,23702.0,-366.0761168823237,7.7250166064392305,3.6607611688232367,6.14,664.8935212343011,602.5852838484592,3.5555789956034967,1088.0883479121803,8.728530920939827,-0.5153268462809037,-227.34265,30.42
3,Gryffindor,Vesta,Mcmichael,2000-08-19,Left,32667.0,697.742808842469,-6.4972144445985505,-6.9774280884246895,4.026,-537.0011283872882,523.9821331934736,-4.8096366069645935,920.3914493107919,0.8219105005879808,-0.014040417239052931,-256.84675,200.64
4,Gryffindor,Gaston,Gibbs,1998-09-27,Left,60158.0,436.7752035539525,-7.820623052454388,,2.2359999999999998,-444.2625366004496,599.3245143172293,-3.4443765754165385,937.4347240534976,4.311065821291761,-0.2640700765443832,-256.3873,157.98
5,Slytherin,Corrine,Hammond,1999-04-04,Right,21209.0,-613.6871603822729,-4.289196726941419,6.136871603822727,-6.5920000000000005,-440.99770426820817,396.20180391410247,5.3802859494804585,1052.8451637299704,11.751212035101073,1.049894068203692,-247.94548999999998,-34.69
6,Gryffindor,Tom,Guido,2000-09-30,Left,49167.0,628.0460512248516,-4.861976240490781,-6.280460512248515,,-926.8925116349667,583.7424423327342,-7.322486416427907,923.5395732944658,1.6466661386700716,0.1530218296077356,-257.83447,261.55
...
...
...
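As a rough illustration (not the project's required output format), the core metrics for one numeric column could be computed by hand like this; the file path and column name are placeholders taken from the sample above:

```python
import csv
import math

def column_stats(path, column):
    """Compute count/mean/std/min/max for one numeric column, ignoring empty cells."""
    values = []
    with open(path) as f:
        for row in csv.DictReader(f):
            cell = row[column]
            if cell:                      # some scores are missing in the dataset
                values.append(float(cell))
    count = len(values)
    mean = sum(values) / count
    # sample standard deviation (divide by n - 1), matching pandas' describe()
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / (count - 1))
    return {"count": count, "mean": mean, "std": std,
            "min": min(values), "max": max(values)}

print(column_stats("dataset_train.csv", "Charms"))
```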
The dataset provided has multiple features, and we want to pick the correct features to run logistic regression on. To gain more insight into the data, we can start by visualizing it in different formats.
We were given questions to ponder and ways to visualize the data when answering those questions. One of them is: which feature is homogeneous? A homogeneous feature is a feature that has similar distributions across all categories (something statistically insignificant that can't help us classify).
To determine if a feature is homogeneous across categories, we can use a histogram to map the count of score ranges across all categories. The horizontal axis represents ranges of scores and the vertical axis represents the number of students who scored within that range.
*Example of a homogeneous feature: notice the distributions are all similar across categories.*

*Example of a non-homogeneous feature: notice the distributions are unique across categories.*
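A minimal sketch of how such per-house histograms could be drawn for one course (pandas and matplotlib assumed available; the course name is just a placeholder for whichever feature is being inspected):

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("dataset_train.csv")
course = "Care of Magical Creatures"   # placeholder: the feature being inspected

# One overlaid histogram per house to compare score distributions
for house, group in df.groupby("Hogwarts House"):
    plt.hist(group[course].dropna(), bins=25, alpha=0.5, label=house)
plt.xlabel(course)
plt.ylabel("Number of students")
plt.legend()
plt.show()
```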
Our next question is to determine which features are similar. Similar features should not both be used in logistic regression because they contribute to the same weights anyway and only take up compute resources. A good way to determine this is to use a scatter plot of one feature against another, for all categories, to find out how the values of the features relate to each other. The horizontal axis represents the values of one feature and the vertical axis represents the values of another feature.
*Example of similar features: notice the features have a linear relationship, meaning both features have an equivalent effect when classifying the categories.*

*Example of unique features: notice the features form clusters, meaning the features have different effects when classifying the categories.*
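A minimal sketch of one such scatter plot, colouring the points by house (the two feature names are placeholders):

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("dataset_train.csv")
feature_x, feature_y = "Astronomy", "Herbology"   # placeholders: the pair being compared

# One colour per house to see how the two features relate across categories
for house, group in df.groupby("Hogwarts House"):
    plt.scatter(group[feature_x], group[feature_y], s=5, label=house)
plt.xlabel(feature_x)
plt.ylabel(feature_y)
plt.legend()
plt.show()
```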
Now that we have established some baseline rules (don't select homogeneous features and don't select both features of a similar pair), we can start to select the features we want our model to work with.
Here is an example of bad features to select (Arithmancy and Potions):
As you can see (from the horizontal axis, Arithmancy), there is no clear separation of the feature's values. If we use this feature, then all values will have the same probability of being assigned to any category. If we look at the vertical axis (Potions), the same applies as well.
Here is an example of a good feature being displayed:
The horizontal axis (Arithmancy) still remains a bad feature for the reasons above, but we can now see some clustering along the vertical axis (Charms). Right off the bat, we notice a high probability of being classified as Ravenclaw when the Charms score is greater than -240. Hence this feature will be used in the logistic regression model.
To find more features like this, we have to repeat this mapping process for every pair of features, essentially generating a scatter plot matrix.
Pick your poison
Once you have picked your features, it is just a matter of applying the math formulas above and executing One vs Rest logistic regression with gradient descent (see the end-to-end sketch at the end of this section). A few things to note when implementing:
For the good features that you have chosen, the regression line will look something like this:
For the bad features, you will get a flatter line:
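As a final sketch of what One vs Rest with gradient descent might look like end to end, reusing the hypothetical `fit` and `predict_proba` helpers defined earlier (the house names come from the dataset; everything else is illustrative):

```python
import numpy as np

def train_one_vs_rest(X, labels, houses):
    """Train one binary classifier per house; X already has a leading column of ones."""
    return {
        house: fit(X, np.array([1 if label == house else 0 for label in labels]))
        for house in houses
    }

def predict(X, models):
    """Pick, for each row of X, the house whose classifier outputs the highest probability."""
    houses = list(models)
    probas = np.column_stack([predict_proba(models[h], X) for h in houses])
    return [houses[i] for i in probas.argmax(axis=1)]
```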