# Machine learning: Week 6

### Debugging a learning algorithm
- When your model makes unacceptably large errors, what should you try next?
  - Get more training examples -> **fixes high variance**
  - Try a smaller set of features -> **fixes high variance (may cause high bias)**
  - Try getting additional features -> **usually fixes high bias**
  - Try adding polynomial features -> **fixes high bias**
  - Try decreasing $\lambda$ -> **fixes high bias**
  - Try increasing $\lambda$ -> **fixes high variance**
- Machine learning diagnostic
  - A test you can run to gain insight into what is/isn't working with the algorithm
  - Also gives guidance on how best to improve performance
  - Can take time to implement

### Evaluating a hypothesis
- Plot the hypothesis $h_{\theta}(x)$ (difficult with a large number of features)
- Split the data into two groups: a training set (70%) and a test set (30%)

> Training/testing procedure for linear regression
> - Learn parameters $\theta$ from the training set
> - Compute the test set error $J_{test}(\theta)$

> Training/testing procedure for logistic regression
> - Learn parameters $\theta$ from the training set
> - Compute the test set error $J_{test}(\theta)$
> - Misclassification error (0/1 misclassification error)

### Model selection and Train/Validation/Test sets
#### Model selection
- List all candidate models. For each of them:
  - Compute $\theta$ and $J_{test}(\theta)$
  - Select the model with the lowest test set error $J_{test}(\theta)$
- Problem: this is likely to be an optimistic estimate of the generalization error, i.e. our extra parameter (the model degree $d$) is fit to the test set.
- Solution: split the examples into 3 groups: training set ($m$ = 60%), cross validation set ($m_{cv}$ = 20%) and test set ($m_{test}$ = 20%)
  - Then we have:
    - Training error
    - Cross validation error
    - Test error
  - Instead of selecting the model with the lowest test set error, we select the one with the lowest cross validation error
  - Use the test set error to measure/estimate the generalization error
  - Because $d$ has been fit to the cross validation set, $J_{test}(\theta)$ might be expected to be larger than $J_{cv}(\theta)$

### Diagnosing Bias vs Variance
> Problem: either high bias or high variance. In other words, either underfitting or overfitting.

Training error:
> $J_{train}(\theta) = \frac{1}{2m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})^{2}$

Cross validation error:
> $J_{cv}(\theta) = \frac{1}{2m_{cv}}\sum_{i=1}^{m_{cv}}(h_{\theta}(x^{(i)}_{cv})-y^{(i)}_{cv})^{2}$

- Suppose your learning algorithm is performing less well than you were hoping ($J_{cv}(\theta)$ or $J_{test}(\theta)$ is high). Is it a bias problem or a variance problem?
  - Bias (underfit): $J_{train}(\theta)$ and $J_{cv}(\theta)$ will both be high
  - Variance (overfit): $J_{train}(\theta)$ will be low and $J_{cv}(\theta) \gg J_{train}(\theta)$
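A minimal sketch of these diagnostics, assuming synthetic 1-D data, polynomial models fit with plain NumPy, and a simple 60/20 train/cross-validation split (all illustrative choices, not part of the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1-D data, split into training and cross validation sets.
x = rng.uniform(-3, 3, 100)
y = np.sin(x) + 0.3 * rng.standard_normal(x.size)
x_train, y_train = x[:60], y[:60]
x_cv, y_cv = x[60:80], y[60:80]

def poly_features(x, degree):
    # Map x to polynomial features [1, x, x^2, ..., x^degree].
    return np.vander(x, degree + 1, increasing=True)

def cost(theta, X, y):
    # Unregularized squared-error cost: 1/(2m) * sum((h(x) - y)^2).
    return np.sum((X @ theta - y) ** 2) / (2 * y.size)

for degree in (1, 2, 4, 8):
    X_train = poly_features(x_train, degree)
    # Fit theta on the training set only (least-squares linear regression).
    theta, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
    j_train = cost(theta, X_train, y_train)
    j_cv = cost(theta, poly_features(x_cv, degree), y_cv)
    print(f"degree {degree}: J_train = {j_train:.3f}, J_cv = {j_cv:.3f}")

# Reading the output: both errors high -> bias (underfit);
# J_train low but J_cv >> J_train -> variance (overfit).
# Model selection: keep the degree with the lowest J_cv.
```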
### Regularization
- Create a list of $\lambda$s
- Create a set of models with different degrees or other variants
- Iterate through the $\lambda$s, and for each $\lambda$ go through all the models to learn some $\theta$
- Compute the cross validation error $J_{cv}(\theta)$ using the learned $\theta$ (learned with $\lambda$), but evaluate the error without regularization, i.e. with $\lambda=0$
- Select the combination that produces the lowest error on the cross validation set
- Using the best combination of $\theta$ and $\lambda$, evaluate $J_{test}(\theta)$ to see whether it generalizes well to the problem

### Learning curves
- If a learning algorithm is suffering from high variance, getting more training data is likely to help
- If a learning algorithm is suffering from high bias, getting more training data will not (by itself) help much

### Neural networks and overfitting
- A small network is computationally cheaper, but is more prone to underfitting
- A large network (either more hidden units or more hidden layers) is more prone to overfitting and computationally more expensive

### Machine learning system design
#### Spam classification
##### Prioritizing what to work on
- $x$ = features of the email.
> For example: choose 100 words indicative of spam/not spam.
- $y$ = spam (1) or not spam (0)
- Options:
  - Collect lots of data, e.g. via a "honeypot" project
  - Develop sophisticated features based on email routing information
  - Develop sophisticated features for the message body
  - Develop a sophisticated algorithm to detect misspellings

##### Error analysis
- Implement the algorithm and test it on cross-validation data
- Plot learning curves to decide if more data, more features, etc. are likely to help
- Error analysis: manually examine the examples (in the cross validation set) that your algorithm made errors on

##### Handling skewed classes
- Skewed classes -> the number of examples in one class is much smaller than in the others
- $precision = \frac{\text{true positives}}{\text{# predicted positives}} = \frac{\text{true pos}}{\text{true pos + false pos}}$
- $recall = \frac{\text{true positives}}{\text{# actual positives}} = \frac{\text{true pos}}{\text{true pos + false neg}}$
- High precision and high recall make a good classifier
- $F_{1} = 2\,\frac{precision \cdot recall}{precision + recall}$ (a short worked sketch appears at the end of these notes)

##### Large data rationale
- Can a human expert look at the features $x$ and confidently predict the value $y$?
- Can we actually get a large training set and train a learning algorithm with a lot of parameters?
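As a small illustration of the skewed-class metrics above, a sketch using made-up labels and predictions (the arrays below are hypothetical, not data from the course):

```python
import numpy as np

y_true = np.array([1, 0, 0, 1, 0, 0, 0, 1, 0, 0])  # skewed: few positive examples
y_pred = np.array([1, 0, 0, 0, 0, 1, 0, 1, 0, 0])  # classifier predictions

# Count the confusion-matrix entries needed for precision and recall.
true_pos = np.sum((y_pred == 1) & (y_true == 1))
false_pos = np.sum((y_pred == 1) & (y_true == 0))
false_neg = np.sum((y_pred == 0) & (y_true == 1))

precision = true_pos / (true_pos + false_pos)  # of predicted positives, how many are real
recall = true_pos / (true_pos + false_neg)     # of actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)
print(f"precision = {precision:.2f}, recall = {recall:.2f}, F1 = {f1:.2f}")
```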