---
title: 'DD2421'
disqus: hackmd
---

DD2421: Machine Learning
===

This HackMD note is a collective note-taking effort by the Singaporean students of this course.

[Lecture Slides](https://kth.instructure.com/courses/12504/pages/lectures)

###### tags: `Machine Learning` `DD2421` `Lecture Notes`

## Table of Contents

[TOC]

## Sample Template

<b>Date: </b> 1/1/2019

<b>Contributors:</b> xxx, yyy, zzz

<b>Miscellaneous Information: </b>
Exam dates on yada yada yada. Students are expected to blah blah blah

<b>Lecture Notes: </b>
Machine learning yada supervised unsupervised lorem ipsum yay.

*Sample Code formatting*

```gherkin=
Feature: Guess the word

  # The first example has two steps
  Scenario: Maker starts a game
    When the Maker starts a game
    Then the Maker waits for a Breaker to join

  # The second example has three steps
  Scenario: Breaker joins a game
    Given the Maker has started a game with the word "silky"
    When the Breaker joins the Maker's game
    Then the Breaker must guess a word with 5 characters
```

*Sample diagram*

```sequence
Alice->Bob: Hello Bob, how are you?
Note right of Bob: Bob thinks
Bob-->Alice: I am good thanks!
Note left of Alice: Alice responds
Alice->Bob: Where have you been?
```

---

## Lecture 1

<b>Date: </b> 4/9/2019

<b>Contributors:</b> Wei Ying, Si Jie, Louiz

_N.B. I'll write my understanding / analogies of ideas in italics --Si Jie_

### Miscellaneous Information

- 3 labs (with deadlines), each with a lab presentation:
    - Decision Trees
    - Support Vector Machines
    - Bayes Classifier & Boosting
- Use Canvas to book time slots for the lab examinations. (Basically you need to present and convince the TA that we did what we did.)
- 2 people in one team for the lab presentation, unless you cannot find a partner. (The TA can ask anyone any question.)
- 10 mins. You can bring a computer - not to demonstrate the program, but to present what you have understood.
- Bring ID.
- Pass or fail for this part.
- 3 mini lectures and exam Q&A in the last lecture.
- Finals on 25/10/2019, 8 am to 12 pm (register for the exam between 18/9/2019 and 2/10/2019)
    - Part A: MCQ - 7/8 points to pass
    - Part B: 10/27 points to pass
    - Possibility of a re-exam in December
    - If you fail Part A, you can make up the difference by doing well in Part B

### Lecture Notes

- Types of learning
    - Supervised Learning (a.k.a. predictive)
        - Function approximation (where (input, label) pairs are available)
        - Well-defined problem. Battle-tested in industrial applications.
        - Learning mappings from A to B (input to output), with B provided by human supervision. Not scalable and biologically implausible.
    - Unsupervised Learning
        - Clustering, dimensionality reduction, density estimation
        - Mainly used for preprocessing and exploratory data analysis
        - There is no B, only A.
    - Reinforcement Learning
        - Behaviour selection (through reward and punishment signals)
    - Evolutionary Learning
        - A way to do optimisation
        - Generation-to-generation selection; not covered in this course
- Classification problems and nearest neighbour rules
    - Classification
        - Enables a machine to learn from data to answer a question: "What is it?"
    - Framework for classification:
        - Training phase - to give the concept of classes to a machine using labelled data.
        - Testing phase - to determine the class of new, unseen data.
    - E.g. for hand-written digits, we do *feature extraction* on the training samples
        - Representation of the data through the use of a *feature vector*.
        - Feature extraction converts input into a form that is easily recognisable for machine learning.
        - _e.g. image --> grayscale/RGB pixel representation_
- Nearest Neighbour methods
    - Binary classification: given N1 and N2 samples of two different classes clustered in a feature space, compute the distances from a new point x to all the N1 and N2 samples, then assign x the class of its nearest neighbour.
    - k-nearest neighbour rule
        - Compute the distances from the new data point x to all the samples.
        - Pick the k (an arbitrary number decided by the user) neighbours nearest to x.
        - Take a majority vote among them to classify x (see the sketch after this list).
        - What to do if there is a tie? Odd values of k are better.
    - In the example given, k = 15 is better than k = 1 as it provides a more generalised view of the trend (_smoother and not influenced by anomalies_).
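A minimal sketch of the k-NN rule described above, in plain NumPy with made-up toy data (`knn_classify` and all variable names are our own, not from the course labs):

```python
import numpy as np

def knn_classify(X_train, y_train, x, k=15):
    """Classify point x by majority vote among its k nearest training samples."""
    dists = np.linalg.norm(X_train - x, axis=1)   # Euclidean distance to every sample
    nearest = np.argsort(dists)[:k]               # indices of the k closest samples
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]              # majority vote

# Toy example: two Gaussian blobs (class 0 around the origin, class 1 around (3, 3))
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
print(knn_classify(X, y, np.array([2.5, 2.5]), k=15))  # expected: 1
```

With binary classes, an odd k such as 15 avoids ties in the vote, matching the tie-breaking note above.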
---

## Lecture 2

### Decision boundaries with different k

Q: What is a good value of k, then?

A: There is no exact answer: higher k gives a more generalised trend, while lower k is more stochastic and reduces the absolute training error/loss. Strategies that minimise training loss are NOT necessarily the best solution. _We want to generalise and discover the trend, while avoiding overfitting._

Keywords:
1. Classification:
    - Feature extraction
        - If you have a vector $x = \{x_1, x_2, ..., x_n\}$, then the individual $x_i$ are the features (_note: they are sometimes also called attributes_). $x$ is the **representation**.
    - Training/testing
    - Generalization
2. Classification Methods
    - Nearest Neighbour (memory-based approach)

### Decision Tree

Binary Decision Tree (rule-based approach, logical inference)

![](https://i.imgur.com/NVJg86o.png)

Basic idea:
- Ask questions about the target/status sequentially
- Useful when nominal data are involved.
- We also want to be able to generalize - know when to stop
- Depth = no. of questions (kinda)
- A tree can encode arbitrary boolean operations, e.g. $({Sunny} \wedge Normal\ Humidity) \vee (Cloudy) \vee (Rainy \wedge Weak\ Wind)$

Process of growing/constructing a tree from a set of labelled training data:
- Choose the best question (according to information gain) and split the input data into subsets.
    - For each attribute we can ask a question; there is an infinite number of possibilities, but we pick the one that maximizes information gain, measured in terms of entropy.
- Terminate: branches with **unique** class labels become leaves (no further questions needed at that point).
- Grow: recursively extend the other branches (those with subsets bearing mixtures of labels).

### Unpredictability

Entropy: how we measure information gain
- Shannon's information content of an outcome:
$$\log_2 \frac{1}{\Pr(c_1 \mid \text{question})}$$

How do we ask questions so that we gain the most amount of information? Slide on the game of sixty-three:
- First question: $X \geq 32$; second: $X \bmod 32 \geq 16$; etc. Eventually you get a bit string, e.g. $X=101010$, where yes = 1, no = 0.
- With 64 possible values, each question gives you $\log_2 \frac{1}{0.5}=1$ bit of information.

Entropy is a measure of uncertainty (also a sensible measure of expected information content).

Entropy equation:
$$
Entropy=\sum_i{-p_{i}\log_2 {p_{i}}}
$$
where $p_i$ is the probability of event $i$.

> In the lecture example, we saw a case where a yes/no answer happens with 50% chance each, since the possible input range is 0-63 inclusive. What if it is instead 0-47? (47 is midway between 31 and 63.) In this case the first question, $X \geq 32$, gives us less information than if we had asked $X \geq 24$. Indeed, substituting the values ($\Pr(X \geq 32) = 16/48 = 1/3$) into the entropy equation, we get:
> $$-\frac{1}{3}\log_2\frac{1}{3} - \frac{2}{3}\log_2\frac{2}{3} = 0.918 < 1.$$
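A quick numerical check of the entropy figures above (a sketch; the helper name `entropy` is our own):

```python
import numpy as np

def entropy(probs):
    """Shannon entropy in bits: sum of -p * log2(p) over outcomes with p > 0."""
    probs = np.asarray(probs, dtype=float)
    probs = probs[probs > 0]            # convention: 0 * log2(0) = 0
    return float(-np.sum(probs * np.log2(probs)))

print(entropy([0.5, 0.5]))              # range 0-63, ask X >= 32  -> 1.0 bit
print(entropy([1/3, 2/3]))              # range 0-47, ask X >= 32  -> 0.918 bits
print(entropy([24/48, 24/48]))          # range 0-47, ask X >= 24  -> 1.0 bit
```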
If we manage to ask "better" questions instead of yes/no questions, we can get more information per question. When we throw a 6-sided die, we get $\log_2{6} = 2.58$ bits of information; likewise, in the game of sixty-three, simply asking "what is the number?" gives you all 6 bits at once.

100 examples, 42 positive:
$$Entropy = -\frac{58}{100}\log_2\frac{58}{100} - \frac{42}{100}\log_2\frac{42}{100} = 0.981$$

100 examples, 3 positive (less entropy, preferred due to low uncertainty):
$$Entropy = -\frac{97}{100}\log_2\frac{97}{100} - \frac{3}{100}\log_2\frac{3}{100} = 0.194$$

Each new question you ask leads to a reduction in entropy, so ask the question that gives the highest gain (the largest decrease in entropy).

Information gain:
$$
Gain = \underbrace{Ent(S)}_\text{before} - \underbrace{\sum_{v\in Values(A)} \frac{|S_v|}{|S|}}_\text{weighted sum (probability)}\underbrace{Ent(S_v)}_\text{after}
$$

Gini index: another measure of unpredictability (impurity). If we plot the Gini index on a graph against $\Pr(x)$, we get a curve slightly more peaked than the entropy curve.
$$
G=\sum^K_{k=1}{\hat{p}_{mk}(1-\hat{p}_{mk})}
$$

### Overfitting

When the learned models are overly specialised for the training samples: good results on training data, but poor generalisation.

Occam's principle -> **"Entities should not be multiplied beyond necessity"**

Remedies:
- Stop growing when a data split is not statistically significant
- Post-prune the tree
- Use a decision forest or bootstrap aggregating (bagging)

## Lecture 3: Challenges in Machine Learning

<b>Date: </b> 11/9/2019

<b>Contributor(s): </b> Wei Ying, Louiz, Zhou Zhi

### Lecture Notes

#### Overfitting

- Overfitting is when the learned models are overly specialised for the training samples.
- Polynomial curve fitting (not classification - this is regression)
    - A 9th-order polynomial can have smaller training error, but it might not generalise as well, depending on sample size.
    - The 9th-order polynomial ended up with 0 error in this example because the number of data points was too low. If we increase the data set size, the model fits somewhat better.
    - ![](https://i.imgur.com/LUdBUVy.png)
- Training data is contaminated with noise, so we shouldn't try to get 100% performance on training data.

### K-fold cross validation

- Cross-validation is a resampling procedure used to evaluate machine learning models on a limited data sample (see the sketch after this list).
- The procedure has a single parameter, k, that refers to the number of groups a given data sample is to be split into; hence the name k-fold cross-validation. When a specific value of k is chosen, it may be used in place of k, e.g. k = 10 becomes 10-fold cross-validation.
- Cross-validation is primarily used in applied machine learning to estimate the skill of a model on unseen data - that is, to use a limited sample to estimate how the model is expected to perform in general when making predictions on data not used during training.
- Training set T: to fit the models.
- Validation set V: to estimate prediction error for model selection (i.e. to determine hyperparameters).
- E.g. for k = 5: ![](https://i.imgur.com/cSjLInY.png)
- If we have a lot of data, we can partition the data into training, validation and testing sets to assess the generalisation error of the final chosen model.
- Validation is still for the sake of tuning the model parameters. Test data, however, is there to tell us whether the model is good; the output on the test data cannot be used to build the model at all.
- Training can be 50%-60%, while the rest is validation and testing.
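A minimal sketch of the k-fold splitting procedure itself (our own helper `kfold_indices`; the commented `fit_and_score` usage is hypothetical):

```python
import numpy as np

def kfold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, val_idx) pairs; each fold serves as the validation set once."""
    idx = np.random.default_rng(seed).permutation(n_samples)  # shuffle before splitting
    folds = np.array_split(idx, k)
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx

# Hypothetical usage: score a model on each of the k splits and average
# scores = [fit_and_score(X[tr], y[tr], X[va], y[va]) for tr, va in kfold_indices(len(X))]
```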
#### Curse of Dimensionality

- If there are a lot of input features but some of them are not very relevant to the target function, we may or may not use all of them, as they increase the complexity of the model through increased dimensionality.
- Easy problems in low dimensions become harder in high dimensions (we are trying to use complex models, but we have limited sample data).
- In high dimensions, everything is far from everything else (which hurts the Nearest Neighbours classifier in particular).
- Data are not distributed in a symmetric manner; if the data is locally skewed, the problems only get worse as we go to higher dimensions.
- Basically, higher dimensions = an exponential increase in the amount of data needed (e.g. in a 1D space, a segment covering 10% of the volume spans 10% of the range, but in a 2D space a square covering 10% of the volume must span about 32% of each axis, as per the graph below).
- ![](https://i.imgur.com/WhT2FHT.png)
- The lecturer gave the example of a pizza: if you choose to eat a certain percentage of the pizza's *radius*, you still won't have eaten much of the pizza.

### Bias-Variance Trade Off

- Training data is never enough. One specific set of training data is just representative of that one set.
- Models resulting from different training data sets will give a range of predictions due to randomness in the underlying data set.
- Error due to bias: the difference between the expected (average) prediction of the model and the correct value.
- Error due to variance: the variability of a model prediction for a given data point between different realisations of the model.

![](https://i.imgur.com/2uyKxpU.png)

- High-variance models tend to have low bias, and high-bias models tend to have low variance: it is a trade-off. (Read more here: https://medium.com/@mp32445/understanding-bias-variance-tradeoff-ca59a22e2a83)
- Using the mean squared error for estimating the true function with the prediction model, we can show that the mean squared error decomposes into bias, variance, and irreducible noise: $\mathbb{E}[(y-\hat{f}(x))^2] = \text{Bias}[\hat{f}(x)]^2 + \text{Var}[\hat{f}(x)] + \sigma^2$.
- Bias of a classifier is the discrepancy between its average estimate and the true function.
    - Low-complexity model (fewer degrees of freedom) = high bias
    - High-complexity model (more degrees of freedom) = low bias
- Variance of a classifier is the expected divergence of the estimated prediction function from its average value.
    - Low model complexity (low degree of freedom) = low variance
    - High model complexity (high degree of freedom) = high variance
- k-nearest neighbours has decreasing variance with increasing k.
- Decision trees gain complexity, and hence variance, as the trees grow deeper.

## Lecture 4: Linear Regression

<b>Date: </b> 12/9/2019

<b>Contributor(s): </b> Wei Ying, Zhou Zhi

### Lecture Notes

- A parametric algorithm has a fixed number of parameters.
- In contrast, a non-parametric algorithm uses a flexible number of parameters, and the number of parameters often grows as it learns from more data.

### Function Approximation (Parametric)

- Minimise the in-sample mean squared error; use the pseudoinverse to determine the weights of the function (see the sketch below).
- An analogy from the prof: each error term acts like a spring attached to the line. When new data are added, the forces fluctuate, resulting in a new best-fit line being formed.
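A minimal sketch of least-squares line fitting via the pseudoinverse, as described above (NumPy; the synthetic data and names are our own):

```python
import numpy as np

# Fit y ≈ w0 + w1 * x by minimising in-sample mean squared error.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 50)
y = 3.0 * x + 1.0 + rng.normal(0, 1, 50)     # noisy line with known slope/intercept

X = np.column_stack([np.ones_like(x), x])    # design matrix with a bias column
w = np.linalg.pinv(X) @ y                    # pseudoinverse solution: w = X⁺ y
print(w)                                     # should be close to [1.0, 3.0]
```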
### RANSAC line fitting

- Pick 2 points at random, draw the line through them, set a threshold, and count the total number of points within the threshold of the line.
- Points within the threshold are called supporters/inliers; points outside are outliers.

![](https://i.imgur.com/zygCS4u.png)

- We have to be careful with high thresholds: a threshold that is too high can make a poor line look well-supported, giving seemingly good but inaccurate results.
- Randomly picking 2 points that both lie among the inliers is more probable than picking 3 such points, which is why the minimal sample of 2 points is used for a line.
- Least Median of Squares (LMedS) is similar to RANSAC, but the evaluation of a candidate line does not involve a threshold:
    - Calculate the error of all data points.
    - Choose the line that minimises the median of the errors.
- RANSAC can still be used when more than 50% of the points are outliers, whereas LMedS cannot.

### k-NN Regression (non-parametric)

- If the underlying function is a curve, k-NN regression will work better than linear regression.
- Basically, if the chosen parametric form is close to the true form, the parametric approach will outperform the non-parametric one.
- If there is a small number of observations per predictor, parametric methods will outperform non-parametric methods.

### Ridge Regression

## Lecture 5: Probabilistic Reasoning

<b>Date: </b> 17/9/2019

<b>Contributor(s): </b> Wei Ying, Zhou Zhi

### Probability Theory

#### Axiomatic definition of probabilities

Given a sample space $\Omega$ of all possible outcomes, and an event $E$, i.e. a set of outcomes from $\Omega$ (possibly empty), then:
- $P(E) \geq 0$ for all $E \subseteq \Omega$
- If $E = \Omega$ then $P(\Omega) = 1$ (sure event) $\Rightarrow 0 \leq P(E) \leq 1$
- If $E_1, E_2, \ldots$ is a countable sequence of pairwise disjoint events, then $P(E_1 \cup E_2 \cup \cdots) = \sum_i P(E_i)$

## Lecture 6:

<b>Date: </b> 19/9/2019

<b>Contributor(s): </b> Wei Ying

### Introduction

#### Probabilistic Classification and Regression

Posterior computation (Bayes' rule):
$$\Pr(Y=y \mid X=x) = \frac{\Pr(X=x \mid Y=y)\,\Pr(Y=y)}{\Pr(X=x)}$$

#### Discriminative vs Generative Model

#### Parametric vs Non-parametric Inferences

| Parametric Inference | Non-parametric Inference |
| -------- | -------- |
| Fixed number of parameters | Harder; number of parameters can grow with data |

### Maximum Likelihood Estimation

#### Regression

#### Classification
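In both cases the idea is to choose the parameters that maximise the likelihood of the observed data. A minimal sketch, assuming a 1-D Gaussian model (our own toy example, not from the lecture): the MLE of the mean and variance are the sample mean and the biased sample variance.

```python
import numpy as np

# Maximum likelihood estimation for a 1-D Gaussian: given samples x_1..x_n,
# the log-likelihood is maximised by the sample mean and the biased sample variance.
rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=1000)  # synthetic data with known parameters

mu_mle = x.mean()                               # argmax over mu of sum_i log N(x_i | mu, sigma^2)
var_mle = ((x - mu_mle) ** 2).mean()            # note: divides by n, not n-1 (MLE is biased)
print(mu_mle, var_mle)                          # should be close to 2.0 and 1.5**2 = 2.25
```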