# Machine Learning Midterm
## Decision trees
We use decision trees when the data is not linearly separable, i.e. when we can't separate the classes just by drawing straight lines.
In decision trees we have the concept of information. We can split the data in multiple ways, but how do we decide which split to select? The answer is by looking at the information gain: the entropy of the parent node minus the weighted average entropy of its child nodes.
* When entropy is 0 it is a pure leaf node.
* Root node -> no incoming edges.
* Leaf node -> no outgoing edges.
* Impurity == high entropy and low information gain.
Because it picks the current best split at each step, building the tree is a greedy algorithm.
### Overfitting
When we have a lot of variables, the tree grows complex and starts fitting noise, introducing errors on new data. To counter this we use a method called pruning, which cuts back branches that add little predictive value.
For regression trees:
* Prediction = average of the numerical target values in the leaf.
* Impurity = sum of squared deviations.
* Performance = root mean squared error.
* Entropy formula: H(S) = -Σ p_i · log2(p_i), where p_i is the fraction of samples belonging to class i.
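As a minimal sketch of the formulas above (plain Python, with made-up labels), entropy and the information gain of a candidate split can be computed like this:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """H(S) = -sum(p_i * log2(p_i)) over the classes present in `labels`."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child splits."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

# A perfectly pure split: each child has entropy 0, so the gain
# equals the parent's entropy.
parent = ["yes", "yes", "no", "no"]
print(entropy(parent))                                            # 1.0
print(information_gain(parent, [["yes", "yes"], ["no", "no"]]))   # 1.0
```

A greedy tree builder would evaluate `information_gain` for every candidate split and keep the one with the highest value.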

## Bias and Variance
The inability of a machine learning method to capture the true relationship is called bias. For example, on data with a curved relationship a straight line has higher bias than a curve.
Variance is how much the fitted model changes when we change the training data set.
Overfit -> Perfectly fits the training set but doesn't fit the testing set.
Fit is how well you approximate the target function.
Overfitting leads to poor generalization, and underfitting comes from making too many simplifying assumptions. The best fit is when we have low training error *and* low test error.
* Overfitting is less generalized and has a complex model.
* Underfitting is less generalized and has a simple model.
We have two types of errors induced by ML algorithms:
* Bias error.
* Variance error.
and a type of error that is not influenced by the ML algorithm:
* Irreducible error -> due to noisy data; cannot be reduced by building better models.
Bias comes from the simplifying assumptions the model makes, and variance is the change in the estimate when we change the training data.
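A toy illustration of the two failure modes (pure Python, invented data drawn from y = 2x plus noise): an "underfit" model that ignores x entirely, and an "overfit" model that simply memorises the training set. The overfit model gets zero training error but still errs on fresh test data.

```python
import random
random.seed(0)

# True relationship: y = 2x, observed with Gaussian noise.
train = [(x, 2 * x + random.gauss(0, 1)) for x in range(10)]
test  = [(x, 2 * x + random.gauss(0, 1)) for x in range(10)]

def rmse(model, data):
    return (sum((model(x) - y) ** 2 for x, y in data) / len(data)) ** 0.5

# Underfit: always predicts the training mean (high bias, ignores x).
mean_y = sum(y for _, y in train) / len(train)
underfit = lambda x: mean_y

# Overfit: memorises the training points exactly (high variance).
lookup = dict(train)
overfit = lambda x: lookup.get(x, mean_y)

print("underfit train/test RMSE:", rmse(underfit, train), rmse(underfit, test))
print("overfit  train/test RMSE:", rmse(overfit, train), rmse(overfit, test))
```

Note the overfit model's training RMSE is exactly 0 while its test RMSE is not: it has fit the noise, not the relationship.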
## Linear regression
1. Use least squares to fit a line to the data.
2. Calculate R².
3. Calculate a p-value for that R².
A residual is the distance of a data point from the line. Least squares chooses the line that minimises the sum of squared residuals; once we have that line, its equation gives us the parameters we need.
We use R² to measure how much of the variation in one parameter is explained by the other: R² = (SS(mean) - SS(fit)) / SS(mean), where SS(mean) is the sum of squared residuals around the mean and SS(fit) is the sum of squared residuals around the fitted line.
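A minimal sketch of the R² formula above, with made-up predictions:

```python
def r_squared(y, y_pred):
    """R^2 = (SS(mean) - SS(fit)) / SS(mean): the fraction of the variation
    around the mean that the fitted line explains."""
    mean_y = sum(y) / len(y)
    ss_mean = sum((yi - mean_y) ** 2 for yi in y)
    ss_fit = sum((yi - yp) ** 2 for yi, yp in zip(y, y_pred))
    return (ss_mean - ss_fit) / ss_mean

y = [1.0, 2.0, 3.0, 4.0]
print(r_squared(y, [1.1, 1.9, 3.2, 3.8]))  # 0.98 -> the fit explains almost everything
print(r_squared(y, [2.5, 2.5, 2.5, 2.5]))  # 0.0  -> no better than predicting the mean
```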
### Cost function

The main use of a cost function is to quantify the error between the actual values and the predicted values; training searches for the parameters that minimise it.
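For linear regression the usual cost function is mean squared error; a minimal sketch with invented numbers:

```python
def mse_cost(y_true, y_pred):
    """Mean squared error: the average squared gap between actual
    and predicted values."""
    return sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred)) / len(y_true)

print(mse_cost([3, 5, 7], [3, 5, 7]))  # 0.0 -> perfect predictions
print(mse_cost([3, 5, 7], [2, 5, 9]))  # (1 + 0 + 4) / 3
```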
### Gradient Descent
We use Gradient Descent to minimize the cost function value.
https://www.youtube.com/watch?v=EfsjEOb596Q
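A minimal sketch of gradient descent minimising the MSE cost for a line y = m·x + b (the data below is invented, generated from y = 2x + 1, so the minimum should land near m = 2, b = 1):

```python
def gradient_descent(x, y, lr=0.01, steps=5000):
    """Minimise MSE for y ≈ m*x + b by repeatedly stepping both
    parameters against their partial derivatives."""
    m = b = 0.0
    n = len(x)
    for _ in range(steps):
        grad_m = (-2 / n) * sum(xi * (yi - (m * xi + b)) for xi, yi in zip(x, y))
        grad_b = (-2 / n) * sum(yi - (m * xi + b) for xi, yi in zip(x, y))
        m -= lr * grad_m
        b -= lr * grad_b
    return m, b

x = [0, 1, 2, 3, 4]
y = [1, 3, 5, 7, 9]   # exactly y = 2x + 1
m, b = gradient_descent(x, y)
print(m, b)  # ≈ 2.0, ≈ 1.0
```

The learning rate `lr` and step count are arbitrary choices that happen to converge for this toy data; too large a rate would overshoot and diverge.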
## KNN
It is a lazy algorithm: nothing is learned until a query arrives. Data points are represented as vectors.
* Classifies new points based on a similarity measure (e.g. Euclidean distance).
* The K value should be odd, and not a multiple of the number of classes.
* Majority voting for query point classification.
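The bullets above can be sketched in a few lines of plain Python (the 2-D points and labels are invented: class "a" clusters near the origin, class "b" near (5, 5)):

```python
from collections import Counter
from math import dist  # Euclidean distance between two points (Python 3.8+)

def knn_classify(train, query, k=3):
    """Majority vote among the k training points closest to `query`."""
    nearest = sorted(train, key=lambda p: dist(p[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

train = [((0, 0), "a"), ((1, 0), "a"), ((0, 1), "a"),
         ((5, 5), "b"), ((6, 5), "b"), ((5, 6), "b")]
print(knn_classify(train, (1, 1)))  # "a"
print(knn_classify(train, (5, 4)))  # "b"
```

Note k = 3 here is odd and not a multiple of the class count (2), matching the rule of thumb above.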

Regression and Classification algorithms are Supervised Learning algorithms. The main difference between them is that Regression algorithms are used to predict continuous values such as price, salary, age, etc., while Classification algorithms are used to predict/classify discrete values such as Male or Female, True or False, Spam or Not Spam, etc.
Confusion matrix: a matrix in which we compare the actual values with the predicted values.
From the confusion matrix we calculate other metrics.
#### Sensitivity -> Measure of True positives.
True positives / (True positives + False negatives)
#### Specificity -> Measure of True negatives.
True negatives / (True negatives + False positives)
These metrics help us choose which ML algorithm to use when it is not apparent just by looking at the confusion matrix.
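A minimal sketch of both formulas, using a hypothetical confusion matrix (the counts are invented):

```python
def sensitivity(tp, fn):
    """True positive rate: TP / (TP + FN)."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """True negative rate: TN / (TN + FP)."""
    return tn / (tn + fp)

# Hypothetical confusion matrix for a binary classifier:
#              predicted +   predicted -
# actual +     TP = 40       FN = 10
# actual -     FP = 5        TN = 45
print(sensitivity(40, 10))  # 0.8
print(specificity(45, 5))   # 0.9
```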
### Regularization
This is a form of regression, that constrains/ regularizes or shrinks the coefficient estimates towards zero. In other words, this technique discourages learning a more complex or flexible model, so as to avoid the risk of overfitting.
A standard least squares model tends to have some variance, i.e. it won't generalize well to a data set different from its training data. Regularization significantly reduces the variance of the model without a substantial increase in its bias.
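A minimal sketch of the shrinkage idea, using the closed-form solution for one-dimensional ridge regression without an intercept (a simplifying assumption; the data is invented with true slope 2): the penalty term lam · m² pulls the fitted slope toward zero.

```python
def ridge_slope(x, y, lam):
    """Closed-form slope for 1-D ridge regression with no intercept:
    minimises sum((y - m*x)^2) + lam * m^2, giving m = Σxy / (Σx² + lam)."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return sxy / (sxx + lam)

x = [1, 2, 3, 4]
y = [2, 4, 6, 8]                # exactly y = 2x
print(ridge_slope(x, y, 0.0))   # 2.0 -> lam=0 recovers plain least squares
print(ridge_slope(x, y, 10.0))  # 1.5 -> the coefficient is shrunk toward zero
```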
## Logloss
Log Loss is the negative average of the log of corrected predicted probabilities for each instance.
In linear regression we use mean squared error as the loss, but in logistic regression squared error would give a non-convex loss, so we use log loss instead.
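A minimal sketch of the definition above, with invented labels and probabilities: the "corrected" probability is p when the true label is 1 and (1 - p) when it is 0.

```python
from math import log

def log_loss(y_true, p_pred):
    """Negative average log of the corrected predicted probability:
    p when the label is 1, (1 - p) when the label is 0."""
    corrected = [p if y == 1 else 1 - p for y, p in zip(y_true, p_pred)]
    return -sum(log(c) for c in corrected) / len(corrected)

# Confident correct predictions give a small loss...
print(log_loss([1, 0, 1], [0.9, 0.1, 0.8]))
# ...while a confident wrong prediction (0.05 for a true 1) is punished heavily.
print(log_loss([1, 0, 1], [0.9, 0.1, 0.05]))
```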
## Linear vs Logistic
* Linear Regression is used to handle regression problems whereas Logistic regression is used to handle the classification problems.
* Linear regression provides a continuous output but Logistic regression provides a discrete output.
* The purpose of Linear Regression is to find the best-fitted line, while Logistic regression goes one step further and passes the line's values through the sigmoid curve.
* The method for calculating loss function in linear regression is the mean squared error whereas for logistic regression it is maximum likelihood estimation.
## Odds (Logistic regression)
The odds ratio represents the constant effect of a predictor X on the odds that one outcome will occur.
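A worked sketch of the relationship (the coefficient value is hypothetical): odds = p / (1 - p), and because logistic regression makes the log-odds linear in X, exp(coefficient) is the constant factor by which the odds are multiplied per unit increase in X.

```python
from math import exp

def odds(p):
    """Convert a probability into odds: p / (1 - p)."""
    return p / (1 - p)

print(odds(0.5))  # 1.0 -> even odds
print(odds(0.8))  # ≈ 4 -> "4 to 1 in favour"

beta = 0.7        # hypothetical fitted logistic-regression coefficient
print(exp(beta))  # ≈ 2.01 -> each unit of X roughly doubles the odds
```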
## Maximum likelihood estimation
Maximum likelihood estimation is a method that determines values for the parameters of a model. The parameter values are found such that they maximise the likelihood that the process described by the model produced the data that were actually observed.
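A minimal worked example of the idea, using invented coin-flip data and a brute-force grid search rather than calculus: the likelihood of 7 heads and 3 tails is maximised at p = 0.7, the observed frequency.

```python
def likelihood(p, heads, tails):
    """Probability of observing this exact sequence of flips if P(heads) = p."""
    return (p ** heads) * ((1 - p) ** tails)

# Observed data: 7 heads, 3 tails. Scan candidate parameter values.
candidates = [i / 100 for i in range(1, 100)]
best = max(candidates, key=lambda p: likelihood(p, 7, 3))
print(best)  # 0.7 -> the MLE matches the observed frequency of heads
```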
## Precision
In the simplest terms, Precision is the ratio between the True Positives and all predicted positives (True Positives + False Positives).
## Recall
Recall is the measure of our model correctly identifying True Positives: the ratio between the True Positives and all actual positives (True Positives + False Negatives).
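Both definitions in a minimal sketch, using hypothetical counts (30 true positives, 10 false positives, 20 false negatives):

```python
def precision(tp, fp):
    """Of everything predicted positive, the fraction that really was positive."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Of everything actually positive, the fraction the model found."""
    return tp / (tp + fn)

print(precision(30, 10))  # 0.75
print(recall(30, 20))     # 0.6
```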
## Gini impurity
The Gini impurity measure is one of the methods used in decision tree algorithms to decide the optimal split from a root node, and subsequent splits.
Gini Impurity tells us what is the probability of misclassifying an observation.
Gini index -> Gini(S) = 1 - Σ p_i², where p_i is the fraction of samples in class i. It is 0 for a pure node and at most 0.5 for a two-class node.
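A minimal sketch of the formula, with made-up labels:

```python
from collections import Counter

def gini(labels):
    """Gini(S) = 1 - sum(p_i^2): the probability of misclassifying a sample
    from `labels` if we labelled it at random by the class frequencies."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

print(gini(["a", "a", "a", "a"]))  # 0.0 -> pure node
print(gini(["a", "a", "b", "b"]))  # 0.5 -> maximally impure two-class node
```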

## AUC - ROC
ROC is a probability curve (true positive rate plotted against false positive rate across classification thresholds) and AUC represents the degree or measure of separability. It tells how much the model is capable of distinguishing between classes. The higher the AUC, the better the model is at predicting 0 classes as 0 and 1 classes as 1.
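One way to see AUC as "separability" is its rank interpretation: it equals the probability that a randomly chosen positive example is scored above a randomly chosen negative one. A minimal sketch with hypothetical model scores:

```python
def auc(pos_scores, neg_scores):
    """AUC as the probability that a random positive example is scored
    above a random negative one (ties count half)."""
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

# Hypothetical scores for actual positives and actual negatives:
print(auc([0.9, 0.8, 0.7], [0.3, 0.2, 0.1]))  # 1.0 -> perfect separation
print(auc([0.9, 0.2], [0.8, 0.3]))            # 0.5 -> no better than chance
```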