# Machine learning: Week 6
### Debugging a learning algorithm
- Unacceptably large errors. What do you try next?
- Get more training examples -> **fixes high variance**
- Try smaller set of features -> **fixes high variance (may cause high bias)**
- Try getting additional features -> **usually fixes high bias**
- Try adding polynomial features -> **fixes high bias**
- Try decreasing $\lambda$ -> **fixes high bias**
- Try increasing $\lambda$ -> **fixes high variance**
- Machine learning diagnostic
	- A test you can run to gain insight into what is/isn't working with the algorithm
	- Also gives guidance on how best to improve performance
	- Can take time to implement
### Evaluating a hypothesis
- plot the hypothesis $h_{\theta}(x)$ (but this is difficult with a large number of features)
- split the data into two groups: a training set (70%) and a test set (30%)
> Training/testing procedure for linear regression
> - Learn parameter $\theta$ from the training set
> - Compute the test set error $J_{test}(\theta)$
>
> Training/testing procedure for logistic regression
> - Learn parameter $\theta$ from the training set
> - Compute the test set error $J_{test}(\theta)$
> - Or compute the misclassification error (0/1 misclassification error): the fraction of test examples classified incorrectly (sketch below)
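A minimal sketch of the test-set evaluation above in NumPy; the 70/30 split and the function names are illustrative assumptions, not prescribed by the notes:

```python
import numpy as np

def linreg_test_error(theta, X_test, y_test):
    """J_test(theta) = 1/(2*m_test) * sum of squared errors for linear regression."""
    m_test = len(y_test)
    errors = X_test @ theta - y_test
    return (errors @ errors) / (2 * m_test)

def logreg_misclassification_error(theta, X_test, y_test):
    """0/1 misclassification error: fraction of test examples predicted wrongly."""
    h = 1.0 / (1.0 + np.exp(-(X_test @ theta)))   # sigmoid hypothesis
    predictions = (h >= 0.5).astype(int)          # threshold at 0.5
    return np.mean(predictions != y_test)

# Illustrative 70%/30% split (theta would be learned on the training part only):
# split = int(0.7 * len(y)); X_train, X_test = X[:split], X[split:]
```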
### Model selection and Train/Validation/Test sets
#### Model selection
- list all candidate models (e.g. polynomial degrees $d = 1, \dots, 10$). For each of them:
	- Compute $\theta$ and $J_{test}(\theta)$
- Select the model with the lowest test set error $J_{test}(\theta)$
- Problem: this is likely to be an optimistic estimate of the generalization error, i.e. the extra parameter $d$ has been fit to the test set.
- Solution: split the examples into 3 groups: a training set ($m$ = 60%), a cross validation set ($m_{cv}$ = 20%) and a test set ($m_{test}$ = 20%)
- Then we have:
- Training error
	- Cross validation error
- Test error
- Instead of selecting the model with the lowest test set error, select the one with the lowest cross validation error (see the sketch below)
- Use the test set error to measure/estimate the generalization error
- Because $d$ has been fit to the cross validation set, $J_{test}(\theta)$ can be expected to be larger than $J_{cv}(\theta)$
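A minimal sketch of this model-selection procedure, assuming a 1-D input and polynomial models fit by least squares; the function names and degree range are illustrative:

```python
import numpy as np

def poly_features(x, d):
    """Design matrix [1, x, x^2, ..., x^d] for a 1-D input vector x."""
    return np.vander(x, d + 1, increasing=True)

def squared_error(theta, X, y):
    e = X @ theta - y
    return (e @ e) / (2 * len(y))

def select_degree(x_train, y_train, x_cv, y_cv, max_degree=10):
    """Fit each candidate degree on the training set, pick the lowest J_cv."""
    best = None
    for d in range(1, max_degree + 1):
        theta, *_ = np.linalg.lstsq(poly_features(x_train, d), y_train, rcond=None)
        j_cv = squared_error(theta, poly_features(x_cv, d), y_cv)
        if best is None or j_cv < best[2]:
            best = (d, theta, j_cv)
    return best  # finally, report J_test(theta) for the chosen degree on the test set
```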
### Diagnosing Bias vs Variance
> Problem: we have either high bias or high variance; in other words, either an underfitting problem or an overfitting problem
Training error:
> $J_{train}(\theta) = \frac{1}{2m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})^{2}$

Cross validation error:
> $J_{cv}(\theta) = \frac{1}{2m_{cv}}\sum_{i=1}^{m_{cv}}(h_{\theta}(x^{(i)}_{cv})-y^{(i)}_{cv})^{2}$
- Suppose your learning algorithm is performing worse than you hoped ($J_{cv}(\theta)$ or $J_{test}(\theta)$ is high). Is it a bias problem or a variance problem? (a rough decision rule is sketched below)
	- Bias (underfit): $J_{train}(\theta)$ and $J_{cv}(\theta)$ will both be high
	- Variance (overfit): $J_{train}(\theta)$ will be low and $J_{cv}(\theta) \gg J_{train}(\theta)$
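A small sketch of that decision rule; the target error and the "much larger" gap factor are illustrative assumptions:

```python
def diagnose(j_train, j_cv, target_error, gap_factor=2.0):
    """Rough bias/variance diagnosis from training and cross-validation error."""
    if j_train > target_error and j_cv > target_error:
        return "high bias (underfit): J_train and J_cv are both high"
    if j_cv > gap_factor * j_train:
        return "high variance (overfit): J_train is low but J_cv is much larger"
    return "no clear bias or variance problem"
```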
### Regularization
- create a list of $\lambda$s
- create a set of models with different degrees or other variants
- iterate through the $\lambda$s, and for each $\lambda$ go through all the models to learn some $\theta$
- compute the cross validation error $J_{cv}(\theta)$ using the learned $\theta$ (which was fit with regularization), but evaluate the error without regularization, i.e. with $\lambda = 0$
- select the combo that produces the lowest error on the cross validation set
- using the best combo of $\theta$ and $\lambda$, apply it on $J_{test}(\theta)$ to see whether it generalizes well (see the sketch below)
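A minimal sketch of this $\lambda$-selection loop for regularized linear regression; the doubling $\lambda$ grid and the closed-form fit are assumptions for illustration:

```python
import numpy as np

def fit_regularized(X, y, lam):
    """theta = (X'X + lam*L)^(-1) X'y, where L is the identity with L[0, 0] = 0."""
    n = X.shape[1]
    L = np.eye(n)
    L[0, 0] = 0.0                       # do not regularize the bias term
    return np.linalg.solve(X.T @ X + lam * L, X.T @ y)

def cv_error(theta, X_cv, y_cv):
    """J_cv(theta) evaluated without regularization (lambda = 0)."""
    e = X_cv @ theta - y_cv
    return (e @ e) / (2 * len(y_cv))

def select_lambda(X_train, y_train, X_cv, y_cv):
    """Try each lambda, learn theta with it, keep the combo with the lowest J_cv."""
    lambdas = [0, 0.01, 0.02, 0.04, 0.08, 0.16, 0.32, 0.64, 1.28, 2.56, 5.12, 10.24]
    best = None
    for lam in lambdas:
        theta = fit_regularized(X_train, y_train, lam)
        j_cv = cv_error(theta, X_cv, y_cv)
        if best is None or j_cv < best[0]:
            best = (j_cv, lam, theta)
    return best  # then evaluate J_test(theta) with the chosen theta and lambda
```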
### Learning curves
- If the learning algorithm is suffering from high variance, getting more training data is likely to help
- If the learning algorithm is suffering from high bias, getting more training data will not (by itself) help much (see the learning-curve sketch below)
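A minimal sketch of how the learning curves behind these two observations can be computed: train on the first $i$ examples and track the training error (on those $i$ examples) against the cross-validation error (always on the full CV set). The least-squares fit is an assumption for illustration:

```python
import numpy as np

def squared_error(theta, X, y):
    e = X @ theta - y
    return (e @ e) / (2 * len(y))

def learning_curve(X_train, y_train, X_cv, y_cv):
    """J_train and J_cv as functions of the number of training examples used."""
    j_train, j_cv = [], []
    for i in range(1, len(y_train) + 1):
        theta, *_ = np.linalg.lstsq(X_train[:i], y_train[:i], rcond=None)
        j_train.append(squared_error(theta, X_train[:i], y_train[:i]))
        j_cv.append(squared_error(theta, X_cv, y_cv))
    return j_train, j_cv

# High bias: both curves flatten out close together at a high error value.
# High variance: a persistent gap between a low J_train and a much higher J_cv.
```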
### Neural network and overfitting
- a small network is computationally cheaper, but more prone to underfitting
- a large network (more hidden units or more hidden layers) is more prone to overfitting and computationally more expensive
### Machine learning system design
#### Spam classification
##### Prioritizing What to Work On
- $x$ = features of email.
> For example: choose 100 words indicative of spam/not spam.
- $y$ = spam (1) or not spam (0)
- Options:
- Collect lots of data via "honeypot" project
- Develop sophisticated features based on email routing information
- Develop sophisticated features for message body
	- Develop a sophisticated algorithm to detect misspellings
##### Error analysis
- Start with a simple algorithm, implement it quickly, and test it on cross-validation data
- Plot learning curves to decide if more data, more features, etc. are likely to help
- Error analysis: manually examine the examples (from the cross validation set) that your algorithm made errors on
##### Handling skewed classes
- skewed classes -> the number of examples of one class is much smaller than the number in the other class(es)
- $precision = \frac{\text{true positives}}{\text{# predicted positives}} = \frac{\text{true pos}}{\text{true pos + false pos}}$
- $recall = \frac{\text{true positives}}{\text{# actual positives}} = \frac{\text{true pos}}{\text{true pos + false neg}}$
- a good classifier has both high precision and high recall (see the sketch below)
- $F_{1} = 2\frac{precision \cdot recall}{precision + recall}$
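A minimal sketch computing these metrics from predicted and true labels (1 = the rare positive class); the function name is illustrative:

```python
import numpy as np

def precision_recall_f1(y_pred, y_true):
    """Precision, recall and F1 score; assumes at least one predicted and one actual positive."""
    true_pos = np.sum((y_pred == 1) & (y_true == 1))
    false_pos = np.sum((y_pred == 1) & (y_true == 0))
    false_neg = np.sum((y_pred == 0) & (y_true == 1))
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Example: y_pred = [1, 0, 1, 1], y_true = [1, 0, 0, 1]
# -> precision = 2/3, recall = 1.0, F1 = 0.8
```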
##### Large data rationale
- Can a human expert look at the features $x$ and confidently predict the value of $y$?
- Can we actually get a large training set and train a learning algorithm with many parameters?