# Machine Learning for Intelligent Systems
Course offered by Prof. Kilian Weinberger in Fall 2018 at Cornell University
[Lesson Videos](https://www.youtube.com/playlist?list=PLl8OlHZGYOQ7bkVbuRthEsaLr7bONzbXS)
[Course Page](http://www.cs.cornell.edu/courses/cs4780/2018fa/)
### Machine Learning for Decision Making
* The application of machine learning
- email spam filtering
- web-search ranking (predict which web-page the user will click on based on his/her search query)
- placing of online advertisements (predict the expected revenue of an ad, when placed on a homepage, which is seen by a specific user)
- visual object recognition (predict which object is in an image - e.g. a camera mounted on a self-driving car)
- face-detection (predict if an image patch contains a human face or not).
* The human brain is one big learning machine.
* Very few people can design new machine learning algorithms, but many can use them.
* What types of ML are there?
- Supervised learning: given labeled examples, find the right prediction for an unlabeled example (e.g. given annotated images, learn to detect faces)
- Unsupervised learning: given data, try to discover similar patterns, structure, sub-spaces (e.g. automatically cluster news articles by topic)
- Reinforcement learning: try to learn from delayed feedback (e.g. a robot learns to walk, fly, or play chess)
### Setup
- Our training data comes in pairs of inputs $(\mathbf{x},y)$, where $\mathbf{x}\in\mathbb{R}^d$ is the input instance and $y$ its label. The entire training data is denoted as
$D = \{(\mathbf{x}_1,y_1),\dots,(\mathbf{x}_n,y_n)\} \subseteq \mathbb{R}^d \times \mathcal{C}$
- where:
$\mathbb{R}^d$ is the d-dimensional feature space
$\mathbf{x}_i$ is the input vector of the $i$th sample
$y_i$ is the label of the $i$th sample
$\mathcal{C}$ is the label space
* Before you do machine learning, you have to decide
- label space
- data space
- function h (hypothesis class)
- decision tree
- linear classifiers, e.g. the perceptron (an algorithm for supervised learning of binary classifiers), artificial neural networks (collections of connected units or nodes, "artificial neurons", that loosely model the neurons in a biological brain), and support vector machines (supervised learning models with associated learning algorithms that analyze data for classification and regression).
**There is no single best algorithm. It always depends on the problem.**
* Loss Functions
- Zero-one loss: it literally counts how many mistakes a hypothesis function h makes on the training set. For every single example it suffers a loss of 1 if it is mispredicted, and 0 otherwise.
- Squared loss: the squared loss function is typically used in regression settings. It iterates over all training samples and suffers the loss $(h(\mathbf{x}_i)-y_i)^2$. The squaring has two effects: (1) the loss suffered is always non-negative; (2) the loss grows quadratically with the absolute mispredicted amount.
- Absolute loss: similar to the squared loss, the absolute loss function is also typically used in regression settings. It suffers the penalty $|h(\mathbf{x}_i)-y_i|$. Because the suffered loss grows only linearly with the mispredictions, it is more suitable for noisy data (when some mispredictions are unavoidable and shouldn't dominate the loss).
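A minimal numpy sketch of the three loss functions above; the array names `preds` (holding $h(\mathbf{x}_i)$) and `labels` (holding $y_i$) are made up for illustration.

```python
import numpy as np

def zero_one_loss(preds, labels):
    # fraction of mispredictions: loss 1 for every mistake, 0 otherwise
    return np.mean(preds != labels)

def squared_loss(preds, labels):
    # average of (h(x_i) - y_i)^2; large residuals are penalized quadratically
    return np.mean((preds - labels) ** 2)

def absolute_loss(preds, labels):
    # average of |h(x_i) - y_i|; grows only linearly, so outliers dominate less
    return np.mean(np.abs(preds - labels))
```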
**The data that $h$ is applied to should always come from the same distribution as the training data.**
-> For example, when training a model to distinguish military vehicles from civilian vehicles: if the backgrounds follow a particular pattern, e.g. military vehicles always appear against dark backgrounds and civilian vehicles against bright ones, the model will pick up this spurious cue and the measured loss will be misleading.
* Generalization
- If the samples are drawn from the same distribution $P$, then the testing loss is an unbiased estimator of the true generalization loss: Generalization: $\epsilon = \mathbb{E}_{(\mathbf{x},y)\sim P}\left[\ell(\mathbf{x},y|h^*(\cdot))\right]$
**The most basic objective is the training loss $\ell(h;D)$: find a function $h$ using the dataset and minimize the loss. But that is not enough; what we really want is a function that minimizes the expected loss on new data, $\mathbb{E}_{(\mathbf{x},y)\sim P}\left[\ell(h;(\mathbf{x},y))\right]$. The problem is that we cannot possibly compute this expectation, because we do not know the distribution $P$.**
* How to split the data?
Once a model has been retrained, you cannot reuse the same test data, because the error estimate becomes biased; that is why you need validation data. You have to be very careful when you split the data into train, validation, and test. The test set must simulate a real test scenario, i.e. you want to simulate the setting that you will encounter in real life. For example, if you want to train an email spam filter, you train a system on past data to predict whether future email is spam. Here it is important to split train / test temporally, so that you strictly predict the future from the past. If there is no temporal component, it is often best to split uniformly at random. Definitely never split alphabetically, or by feature values.
* By time, if the data is temporally collected.
In general, if the data has a temporal component, we must split it by time.
* Uniformly at random, if (and, in general, only if) the data is i.i.d.(Independent and identically distributed).
The test error (or testing loss) approximates the true generalization error/loss.
-> For example, if you split the data for a spam-checking function into train and test sets uniformly at random, there might be very similar (near-duplicate) emails in both sets. The measured accuracy could then be unrealistically high if the whole dataset is not diverse enough, so a better way is to split the data by time.
* Processing steps
Train models on the training dataset, evaluate them on the validation dataset, and pick the best-performing model. Then retrain it on the combined data (train + validation) and evaluate it once on the testing dataset.
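A sketch of this procedure, assuming an i.i.d. setting so that a random split is appropriate; the synthetic data and the two candidate models are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

# Synthetic i.i.d. data just for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# 80% train, 10% validation, 10% test
X_trval, X_test, y_trval, y_test = train_test_split(X, y, test_size=0.1, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X_trval, y_trval, test_size=1/9, random_state=0)

candidates = [KNeighborsClassifier(n_neighbors=5), LogisticRegression()]
best = max(candidates, key=lambda m: m.fit(X_tr, y_tr).score(X_val, y_val))  # pick by validation accuracy

best.fit(X_trval, y_trval)                            # retrain the winner on train + validation
print("test accuracy:", best.score(X_test, y_test))   # touch the test set exactly once
```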
**The reason there is no perfect algorithm is that you have to make some assumptions; otherwise you cannot make predictions.**
### The K-Nearest Neighbors (KNN) algorithm (1967)
- Assumption: similar inputs have similar outputs.
- Classification rule: for a test input x, assign the most common label amongst its k most similar training inputs.
* What distance function should we use?
- The k-nearest-neighbor classifier fundamentally relies on a distance metric. The better that metric reflects label similarity, the better the classifier will be. The most common choice is the Minkowski distance $\text{dist}(\mathbf{x},\mathbf{z})=\left(\sum_{r=1}^{d}|x_r-z_r|^p\right)^{1/p}$:

p = 1 -> Manhattan distance (the $L_1$ distance)
p = 2 -> Euclidean distance, the ordinary straight-line distance between two points
p = $\infty$ -> maximum coordinate-wise distance
**kNN doesn't work well when you have high-dimensional data -> dimensionality reduction is a good solution (e.g. PCA)**
* KNN Summary
- k-NN is a simple and effective classifier if distances reliably reflect a semantically meaningful notion of the dissimilarity. (It becomes truly competitive through metric learning)
- As n→∞, k-NN becomes provably very accurate, but also very slow.
- As d≫0, points drawn from a probability distribution stop being similar to each other, and the kNN assumption breaks down.
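A minimal sketch of the classification rule above, using the Minkowski distance with a finite p; the function and argument names are made up for illustration.

```python
import numpy as np

def knn_predict(X_train, y_train, x_test, k=3, p=2):
    # Minkowski distance from the test point to every training point
    dists = np.sum(np.abs(X_train - x_test) ** p, axis=1) ** (1.0 / p)
    nearest = np.argsort(dists)[:k]                      # indices of the k most similar inputs
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]                     # most common label among the k neighbors
```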
### Perceptron Algorithm (1957)
- Assumptions:
- Binary classification (i.e. yi∈{−1,+1})
- Data is linearly separable
**Find a hyperplane that splits the space (in 2D a line, in 3D a plane) and check on which side of it a new point falls.**
**The drawback of the Perceptron is that there is no guarantee it finds the best hyperplane.**
**The number of updates needed to find a separating hyperplane grows as the margin shrinks (the bound is $1/\gamma^2$), so you want the data to lie far from the hyperplane. The margin of a hyperplane is its distance to the closest data point.**

**It is possible that no hyperplane exists that separates the dataset into two classes.**
* Perceptron convergence
- The Perceptron was arguably the first algorithm with a strong formal guarantee. If a data set is linearly separable, the Perceptron will find a separating hyperplane($w^*$) in a finite number of updates. (If the data is not linearly separable, it will loop forever.)
- The argument goes as follows: Suppose $\exists \mathbf{w}^*$ such that $y_i(\mathbf{x}_i^\top \mathbf{w}^*)>0$ $\forall (\mathbf{x}_i,y_i)\in D$
- All inputs $\mathbf{x}_i$ live within the unit sphere
- There exists a separating hyperplane defined by $\mathbf{w}^*$, with $\|\mathbf{w}^*\|=1$ (i.e. $\mathbf{w}^*$ lies exactly on the unit sphere).
- γ is the distance from this hyperplane to the closest data point.
- Margin: a larger margin is desirable to separate the data points
- It must converge after a finite number of updates

**In short: first rescale all data points so they lie within the unit sphere, find the separating hyperplane in that space, and then scale the distances back. The optimal separating hyperplane is equidistant from the closest points of the two classes (circles and crosses). But note: the Perceptron does not guarantee finding that hyperplane; the SVM algorithm later achieves this.**
* Cauchy–Schwarz inequality ${\displaystyle |\langle \mathbf {u} ,\mathbf {v} \rangle |\leq \|\mathbf {u} \|\|\mathbf {v} \|.}$
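A minimal sketch of the Perceptron update loop described above, assuming labels in {−1, +1} and a bias folded into the features; `max_epochs` is an arbitrary safety cap, since the loop would never terminate on non-separable data.

```python
import numpy as np

def perceptron(X, y, max_epochs=100):
    # X: n x d inputs, y: labels in {-1, +1}; assumes the data is linearly separable
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for x_i, y_i in zip(X, y):
            if y_i * np.dot(w, x_i) <= 0:        # misclassified (or exactly on the hyperplane)
                w += y_i * x_i                   # add positive points, subtract negative points
                mistakes += 1
        if mistakes == 0:                        # converged: every point classified correctly
            return w
    return w                                     # no separating hyperplane found within the budget
```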
### Estimating Probabilities from data
* Maximum likelihood estimation
In statistics, maximum likelihood estimation is a method of estimating the parameters of a probability distribution by maximizing a likelihood function, so that under the assumed statistical model the observed data is most probable.
* The Bayesian Way
Model θ as a random variable, drawn from a distribution P(θ). Note that θ is not a random variable associated with an event in a sample space. In frequentist statistics, this is forbidden. In Bayesian statistics, this is allowed and you can specify a prior belief P(θ) defining what values you believe θ is likely to take on.
Now, we can look at $P(\theta|D)=\frac{P(D|\theta)\,P(\theta)}{P(D)}$ (recall Bayes' Rule!), where
- $P(θ)$ is the prior distribution over the parameter(s) θ, before we see any data.
- $P(D∣θ)$ is the likelihood of the data given the parameter(s) θ.
- $P(θ∣D)$ is the posterior distribution over the parameter(s) θ after we have observed the data.
[Reference1: MLE VS MAP](https://www.ycc.idv.tw/deep-dl_3.html)
**MAP is dominated more strongly by the prior, so it needs more data before the estimate reflects the actual probabilities; MLE, in contrast, effectively starts from a uniform assumption. Both eventually converge, but a wrong prior hurts MAP more and takes longer to wash out.**
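A tiny coin-toss illustration of MLE vs. MAP; the observed counts and the Beta prior are made-up numbers, and the MAP formula assumes a Beta prior on the heads probability.

```python
# Coin-toss example: estimate the heads probability theta from nH heads and nT tails.
nH, nT = 7, 3                 # hypothetical observed counts

theta_mle = nH / (nH + nT)    # MLE: argmax_theta P(D; theta) = nH / n

# MAP with a Beta(alpha, beta) prior on theta (the prior encodes a belief, e.g. "roughly fair"):
alpha, beta = 5, 5
theta_map = (nH + alpha - 1) / (nH + nT + alpha + beta - 2)

print(theta_mle, theta_map)   # 0.7 vs. ~0.611: the prior pulls the estimate toward 0.5
```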
### Bayes Classifier and Naive Bayes
We can approach this dilemma with a simple trick, and an additional assumption. The trick part is to estimate P(y) and P(x|y) instead, since, by Bayes rule,
$P(y|x)=\frac{P(x|y)\,P(y)}{P(x)}$
Recall from Estimating Probabilities from Data that estimating P(y) and P(x|y) is called generative learning.
Estimating P(y) is easy. For example, if Y takes on discrete binary values estimating P(Y) reduces to coin tossing. We simply need to count how many times we observe each outcome (in this case each class):
$P(y=c)=\frac{\sum_{i=1}^{n} I(y_i=c)}{n}=\hat\pi_c$
Estimating P(x|y), however, is not easy! The additional assumption that we make is the Naive Bayes assumption.
Naive Bayes Assumption:
$P(x|y)=\prod_{α=1}^d P(x_α|y)$ ,where $x_α=[x]_α$ is the value for feature $α$
i.e., feature values are independent given the label! This is a very bold assumption.
For example, a setting where the Naive Bayes classifier is often used is spam filtering. Here, the data is emails and the label is spam or not-spam. The Naive Bayes assumption implies that the words in an email are conditionally independent, given that you know that an email is spam or not. Clearly this is not true. Neither the words of spam or not-spam emails are drawn independently at random. However, the resulting classifiers can work well in practice even if this assumption is violated.

* Step 1: Take the data, look at each dimension (per class), and model its distribution.
* Step 2: Predict with the Bayes formula (falling back to the optimal classifier).
**It does not find a hyperplane that best separates the positive from the negative points; it fits one distribution per class to the data and finds the boundary that best separates one distribution from the other.**
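A sketch of these two steps for a Gaussian Naive Bayes model (one 1-D Gaussian per feature and class); the function names and the small smoothing constant are illustrative choices.

```python
import numpy as np

def fit_gaussian_nb(X, y):
    # Step 1: for each class, estimate P(y) and fit a 1-D Gaussian to every feature dimension.
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = (len(Xc) / len(X),           # prior pi_c
                     Xc.mean(axis=0),            # per-feature means
                     Xc.var(axis=0) + 1e-9)      # per-feature variances (small constant for stability)
    return params

def predict_gaussian_nb(params, x):
    # Step 2: prediction via Bayes rule, using log probabilities for numerical stability.
    scores = {}
    for c, (prior, mu, var) in params.items():
        log_lik = -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var)
        scores[c] = np.log(prior) + log_lik
    return max(scores, key=scores.get)
```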
### Logistic Regression
Logistic Regression is the discriminative counterpart to Naive Bayes. In Naive Bayes, we first model $P(x|y)$ for each label y, and then obtain the decision boundary that best discriminates between these two distributions. In Logistic Regression we do not attempt to model the data distribution $P(x|y)$, instead, we model $P(y|x)$ directly. We assume the same probabilistic form $P(y|x_i)=\frac {1}{1+e^{−y(w^Tx_i+b)}}$ , but we do not restrict ourselves in any way by making assumptions about $P(x|y)$ (in fact it can be any member of the Exponential Family). This allows logistic regression to be more flexible, but such flexibility also requires more data to avoid overfitting. Typically, in scenarios with little data and if the modeling assumption is appropriate, Naive Bayes tends to outperform Logistic Regression. However, as data sets become large logistic regression often outperforms Naive Bayes, which suffers from the fact that the assumptions made on $P(x|y)$ are probably not exactly correct. If the assumptions hold exactly, i.e. the data is truly drawn from the distribution that we assumed in Naive Bayes, then Logistic Regression and Naive Bayes converge to the exact same result in the limit (but NB will be faster).
### Gradient Descent
We want to minimize a convex, continuous and differentiable loss function $ℓ(w)$. In this section we discuss two of the most popular "hill-climbing" algorithms, gradient descent and Newton's method.
Algorithm:
Initialize $\mathbf{w}_0$
Repeat until convergence: $\mathbf{w}_{t+1} = \mathbf{w}_t + \mathbf{s}$, where $\mathbf{s}$ is the update step (e.g. $\mathbf{s}=-\alpha\,\nabla \ell(\mathbf{w}_t)$ for gradient descent)
If $\|\mathbf{w}_{t+1} - \mathbf{w}_t\|_2 < \epsilon$, converged!
* Comparison
- The Hessian $H(\mathbf{w})$ is a $d\times d$ matrix and is expensive to compute. A good approximation is to compute only its diagonal entries and multiply the update with a small step-size. Essentially you are then doing a hybrid between Newton's method and gradient descent, where you weigh the step-size for each dimension by the inverse Hessian.
- To avoid divergence of Newton's method, a good approach is to start with gradient descent (or even stochastic gradient descent) and then finish the optimization with Newton's method. Typically, the second-order approximation used by Newton's method is more appropriate near the optimum.

A gradient descent step (left) and a Newton step (right) on the same function. The loss function is depicted in black, the approximation as a dotted red line. The gradient step moves the point downwards along the linear approximation of the function. The Newton step moves the point to the minimum of the parabola, which is used to approximate the function.
**First use gradient descent and switch to Newton steps only just before you finish, because a Newton step only works well when you are already very close to the minimum.**
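A minimal sketch of both update rules; `grad` and `hess` are assumed to be user-supplied callables returning the gradient and Hessian of the loss.

```python
import numpy as np

def gradient_descent(grad, w0, step_size=0.1, tol=1e-6, max_iter=1000):
    # grad(w) returns the gradient of the loss at w
    w = w0
    for _ in range(max_iter):
        w_new = w - step_size * grad(w)
        if np.linalg.norm(w_new - w) < tol:      # convergence test from the algorithm above
            return w_new
        w = w_new
    return w

def newton_step(grad, hess, w):
    # One Newton update: jump to the minimum of the local quadratic approximation.
    return w - np.linalg.solve(hess(w), grad(w))
```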
### Linear Regression
Ordinary Least Squares (OLS)
* Data Assumption: $y_i∈R$
* Model Assumption: $y_i=w^⊤x_i+ϵ_i$ where $ϵ_i∼N(0,σ^2)$
⇒ $y_i|\mathbf{x}_i\sim N(\mathbf{w}^\top\mathbf{x}_i,\sigma^2)$ ⇒ $P(y_i|\mathbf{x}_i,\mathbf{w})=\frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(\mathbf{x}_i^\top\mathbf{w}-y_i)^2}{2\sigma^2}}$
In words, we assume that the data is drawn from a "line" $\mathbf{w}^\top\mathbf{x}$ through the origin (one can always add a bias / offset through an additional dimension, similar to the Perceptron). For each data point with features $\mathbf{x}_i$, the label $y_i$ is drawn from a Gaussian with mean $\mathbf{w}^\top\mathbf{x}_i$ and variance $\sigma^2$. Our task is to estimate the slope $\mathbf{w}$ from the data.
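Under this model, maximizing the likelihood amounts to minimizing the squared error, so a least-squares solver recovers w. A minimal numpy sketch, assuming the bias is folded into X as an extra constant column.

```python
import numpy as np

def ols_fit(X, y):
    # Solves min_w ||Xw - y||^2, which is the MLE under the Gaussian noise model above.
    # (lstsq is preferred over forming (X^T X)^{-1} explicitly for numerical stability.)
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

def ols_predict(X, w):
    return X @ w
```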

### (Linear) Support Vector Machines (1994)
The intuition behind the SVM is that among all separating hyperplanes, the best one is the one with the maximum margin, and the maximum-margin hyperplane always has exactly the same margin to both sides.
The Support Vector Machine (SVM) is a linear classifier that can be viewed as an extension of the Perceptron developed by Rosenblatt in 1958. The Perceptron guaranteed that you find a hyperplane if it exists. The SVM finds the maximum margin separating hyperplane.
Setting: We define a linear classifier: $h(x)=sign(w^Tx+b)$ and we assume a binary classification setting with labels {+1,−1}.

* constrained optimization problem
* When no hyperplane satisfies all the constraints, you can allow some violation of them. By adding a little bit of slack, you can still obtain a solution.
**In short, this is a deliberate relaxation: the points that fall within a margin of 1 on either side of the otherwise best separating hyperplane are sacrificed, giving the optimization a slack budget. But the trade-off has to be carefully chosen (finding the right C is important). The idea is that you want a solution, but you allow it to mess things up a bit where the constraints are too strict: you want a balance between getting the points right on the training data and getting a simple solution.**
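A rough sketch of the soft-margin idea via subgradient descent on the hinge-loss objective; the learning rate, epoch count, and this specific unconstrained formulation are illustrative assumptions, not the course's exact solver.

```python
import numpy as np

def train_linear_svm(X, y, C=1.0, lr=0.01, epochs=200):
    # Minimizes  ||w||^2 + C * sum_i max(0, 1 - y_i (w^T x_i + b))  by subgradient descent.
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        viol = margins < 1                                  # points inside the margin or misclassified (slack > 0)
        grad_w = 2 * w - C * (y[viol][:, None] * X[viol]).sum(axis=0)
        grad_b = -C * y[viol].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b
```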
### Empirical Risk Minimization
Remember the unconstrained SVM formulation:
$\min_{\mathbf{w},b}\; \underbrace{\mathbf{w}^\top\mathbf{w}}_{l_2\text{-regularizer}} + C\,\underbrace{\sum_{i=1}^{n}\max\left[1-y_i(\mathbf{w}^\top\mathbf{x}_i+b),\,0\right]}_{\text{hinge loss}}$
The hinge loss is the SVM's error function of choice, whereas the $l_2$-regularizer reflects the complexity of the solution and penalizes complex solutions. This is an example of empirical risk minimization with a loss function $\ell$ and a regularizer $r$,
$\min_{\mathbf{w}}\; \frac{1}{n}\sum_{i=1}^{n} \ell(h_{\mathbf{w}}(\mathbf{x}_i),y_i) + \lambda\, r(\mathbf{w}),$
where the loss function is a continuous function which penalizes training error, and the regularizer is a continuous function which penalizes classifier complexity. Here, we define $\lambda$ as $\frac{1}{C}$ from the previous lecture.

**Zero-one loss: as long as the margin is positive the loss is zero; the moment it becomes negative the loss is one. It has a point of discontinuity.**
**The hinge loss: the loss is zero when the point is correct with margin greater than or equal to one; the moment the margin drops below one the loss goes up, and it keeps going up linearly. It focuses on the points close to the hyperplane.**
**Exponential loss: it is very aggressive. If you can improve the most negative point just a little bit, moving it a little towards the hyperplane, the loss drops tremendously, so the exponential loss always focuses on the most misclassified point.**
**The log loss: it is not an upper bound of the zero-one loss. Both the log loss and the hinge loss are well behaved when it comes to outliers.**
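A small numpy helper that evaluates the four losses above as functions of the signed margin z = y·h(x), handy for plotting and comparing their shapes (the function name is made up).

```python
import numpy as np

def surrogate_losses(z):
    # z = y * h(x) is the signed margin of a prediction
    return {
        "zero-one":    (z <= 0).astype(float),        # 1 for a mistake, 0 otherwise (discontinuous at 0)
        "hinge":       np.maximum(0.0, 1.0 - z),      # linear penalty once the margin drops below 1
        "exponential": np.exp(-z),                    # explodes for badly misclassified points
        "log":         np.log(1.0 + np.exp(-z)),      # smooth, well behaved for outliers
    }
```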


**For residuals very close to zero the squared loss is the natural choice, but when you are far from zero and would suffer very large losses (outliers), you pick the absolute loss.**
### Regularization
* L1 regularizer
In high-dimensional spaces, the $L_1$ ball consists of many edges and corners, and at almost all of these corners the majority of the weights are exactly zero, so you get a very sparse solution. When λ is very large, the feasible region becomes tiny and almost certainly only a single feature will be nonzero: the single feature that is most important for predicting the label. If you can only spend very little weight, you have to put all of it on that feature. One downside of the L1 regularizer is that it is not strictly convex. So what people do in practice is use the elastic net, i.e. a little bit of both: λ L1 + μ L2 (with only the L1 regularizer, the method is called the lasso). Most of the time you only need a handful of features, say five, to do really well on the test set; beyond that, the additional features help on the training set but no longer on the test set. So to speak, the other features capture something specific to the particular data set. When you find that cutoff point, you can explain the result more persuasively.
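A small illustration with scikit-learn's Lasso and ElasticNet; the synthetic data and the alpha / l1_ratio values are arbitrary. With only 5 truly relevant features, the L1 penalty typically zeroes out most of the remaining weights.

```python
import numpy as np
from sklearn.linear_model import Lasso, ElasticNet

# Synthetic data: only the first 5 of 50 features actually matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))
y = X[:, :5] @ np.array([3.0, -2.0, 1.5, 4.0, -1.0]) + 0.1 * rng.normal(size=200)

lasso = Lasso(alpha=0.1).fit(X, y)                     # pure L1: most weights driven exactly to zero
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)   # elastic net: a little bit of both L1 and L2

print("features kept by the lasso:", int((lasso.coef_ != 0).sum()))
```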
### Review
* General setup: We split the data set into three parts: training, validation and testing. The problem is that you don't have access to the distribution P, but you have more data than you need, so you take your dataset D, reduce it, train only on part of it, and treat the remaining points as if they were freshly drawn from P, so you can use them to evaluate how well your algorithm does. Only once you are satisfied with the algorithm you have repeatedly tuned do you use the test data, exactly once, to see the performance. (The test data is for the sanity check.)
* A discriminative algorithm, given all measurements about a human, tells you whether that human is male or female. A generative algorithm goes the other way: given the gender, it can generate (the measurements of) a human.
* Parametric algorithm: you have a finite number of parameters, and that number stays the same no matter how much data you have, e.g. the perceptron.
* Nonparametric algorithm: there is no fixed number of parameters you are learning; the model size, i.e. how much you store, typically grows with the amount of training data, e.g. kNN.
|               | discriminative                         | generative  |
|---------------|----------------------------------------|-------------|
| parametric    | SVM / Perceptron / Logistic Regression | Naive Bayes |
| nonparametric | kNN                                    | -           |
* KNN algorithm: The key is that it relies heavily on the quality of the metric. The assumption of KNN is that similar points have similar labels, and the metric has to reflect this assumption. In very high dimensional spaces the notion of similarity can become poorly defined.
* Perceptron algorithm: the assumption is that the data is linearly separable; if it's not, it loops forever. It looks at one data point at a time: if the point is correctly classified, it does nothing; if it's misclassified, it adds the point when it's positive and subtracts it when it's negative. The key is that it converges in a finite number of steps. You can rescale the data set and it doesn't change anything about the convergence.
* Maximum likelihood estimation: maximize the probability of the data given the parameters of the distribution, $P(D;\theta)$. Maximum a posteriori goes the other way round: it maximizes the probability of θ given the data, $P(\theta|D)$.
* Naive Bayes: the first assumption is that the features are independent given the label, so each dimension is estimated individually. The second assumption is which distribution each dimension follows (you have to decide). The crucial part is that once the Gaussians are fitted, the algorithm never looks at the data again; the data is only used to estimate the parameters of the distributions. You end up with a hyperplane that separates the two distributions, but an outlier is unlikely to be correctly classified. (In comparison, logistic regression separates the data points themselves. Usually logistic regression works better, but when you have only a very small amount of data, Naive Bayes works better.)
* Logistic regression: you have a functional form of the loss and want to find its minimum, so you use a Taylor expansion. Gradient descent takes a small step along the linear approximation, ends up somewhere, builds a new linear approximation, takes another step, and keeps following that direction. Newton's method is a bit more aggressive: it fits a parabola, i.e. it takes a second-order Taylor expansion, finds its minimum, jumps to it, and repeats.
* Adagrad: it sets a different step size for every single dimension, and it sets it automatically. It's very neat.
* Linear regression: you try to find a line that best explains the data.
* SVM: you want to find the maximum-margin hyperplane, so you minimize the norm of w. It involves no probabilities at all: points are simply on the positive or the negative side. The downside of the SVM is that it is hard to read off certainties.
### Bias Variance Decomposition
The expected error of your algorithm $A$ is computed in the following way:
$\mathbb{E}_{(\mathbf{x},y)\sim P,\; D\sim P^n}\left[\left(h_D(\mathbf{x})-y\right)^2\right]$
* To be clear, D is our training points and the (x,y) pairs are the test points. We are interested in exactly this expression, because it evaluates the quality of a machine learning algorithm A with respect to a data distribution P(X,Y). In the following we will show that this expression decomposes into three meaningful terms.
First you draw a training dataset $D$, train your algorithm to get $h_D$, then take a test point $(\mathbf{x},y)$ and compute the error. How large is that error in expectation? Imagine doing this 10 million times: get a new training dataset, train a new classifier, draw a test point, evaluate on it, and record the error.

* Variance: Captures how much your classifier changes if you train on a different training set. How "over-specialized" is your classifier to a particular training set (overfitting)? If we have the best possible model for our training data, how far off are we from the average classifier? -> add more data to reduce this error
* Bias: What is the inherent error that you obtain from your classifier even with infinite training data? This is due to your classifier being "biased" to a particular kind of solution (e.g. linear classifier). In other words, bias is inherent to your model. -> might have wrong assumptions
* Noise: How big is the data-intrinsic noise? This error measures ambiguity due to your data distribution and feature representation. You can never beat this, it is an aspect of the data. -> add features to reduce the error
**The error is a sum of three very precise things: the variance and the bias of your algorithm and the noise of your data. You want to know how to reduce each one of these. As a data scientist, if you have a classifier that is not accurate enough, it's your job to determine which of these terms is too high and reduce that one. In practice, people have an algorithm that doesn't work and then try to reduce the wrong part, the one that didn't cause the problem. So sharpen your eyes and find what is really going wrong.**

* low bias + low variance: you always hit the center exactly.
* low bias + high variance: there is no systematic error but shaking around the center, sometimes up, down, left or right.
* high bias + low variance: always hitting the same points with amazing accuracy, but unfortunately it's slightly far from the center.
* high bias + high variance: in expectation you are off, and on top of that you vary a lot.
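A small simulation of this decomposition under squared loss, with a made-up "true" function and a deliberately simple (high-bias) linear model; the three estimated terms should roughly add up to the expected test error.

```python
import numpy as np

def true_f(x):
    return np.sin(x)                              # hypothetical "true" labeling function

rng = np.random.default_rng(0)
sigma = 0.3                                       # label noise standard deviation
xs_test = np.linspace(0, 3, 50)

preds = []
for _ in range(500):                              # draw many training sets D ~ P^n
    x = rng.uniform(0, 3, 20)
    y = true_f(x) + rng.normal(0, sigma, 20)
    w = np.polyfit(x, y, deg=1)                   # a deliberately simple (biased) linear model
    preds.append(np.polyval(w, xs_test))
preds = np.array(preds)

h_bar = preds.mean(axis=0)                        # the "average classifier"
variance = preds.var(axis=0).mean()               # how much h_D varies across training sets
bias_sq = ((h_bar - true_f(xs_test)) ** 2).mean() # error of the average classifier vs. the truth
noise = sigma ** 2                                # irreducible, data-intrinsic noise
# expected squared test error ≈ variance + bias_sq + noise
print(variance, bias_sq, noise)
```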
### Model Selection / Regularization / Overfitting

**Traditionally you search for λ telescopically: you plug in a coarse grid of values such as 0, 1000, 2000, 3000, 4000, 5000 and find the region with the lowest validation error; say it is around 2000, then in a second round you try 1600, 1700, 1800, 1900, 2000, 2100, 2200, ... and keep refining until you have found the best λ.**
* k-fold cross-validation: you divide your dataset into k different buckets; k−1 of them are used for training and 1 is used for validation. You can do this k times, each time leaving out another bucket, so you get k validation errors in the end (see the sketch below).
* How do you choose k? Larger is better, but the larger k is, the slower it gets, so finding the trade-off is important: how accurate do you want the estimate to be, and how much time do you have? It also depends on how big your dataset is.
* One way to control overfitting is regularization. Another very common way is early stopping. Both control how far you minimize the loss function, but early stopping is faster because you don't retrain your classifier: you run the optimization only once, save the weights (the parameters) at each step, and pick the best ones at the end.
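A minimal k-fold cross-validation sketch with scikit-learn's KFold; `model_factory` is a hypothetical callable that returns a fresh, untrained model.

```python
import numpy as np
from sklearn.model_selection import KFold

def kfold_validation_error(model_factory, X, y, k=5):
    # Split the data into k buckets; each bucket plays the validation set exactly once.
    errors = []
    for train_idx, val_idx in KFold(n_splits=k, shuffle=True, random_state=0).split(X):
        model = model_factory()                   # fresh, untrained model for every fold
        model.fit(X[train_idx], y[train_idx])
        errors.append(1.0 - model.score(X[val_idx], y[val_idx]))
    return np.mean(errors)                        # average of the k validation errors
```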
* Detecting High Bias and High Variance
If a classifier is under-performing (e.g. if the test or training error is too high), there are several ways to improve performance. To find out which of these many techniques is the right one for the situation, the first step is to determine the root of the problem.

The graph above plots the training error and the test error and can be divided into two overarching regimes. In the first regime (on the left side of the graph), training error is below the desired error threshold (denoted by ϵ), but test error is significantly higher. In the second regime (on the right side of the graph), test error is remarkably close to training error, but both are above the desired tolerance of ϵ.
* Regime 1: you have high variance. The training error is below epsilon (ε), which means you are doing great on the training dataset, but the testing error is too high. You have to shrink the gap and push the test error below ε. You can add more data, use bagging, reduce the model complexity, or increase regularization.
* Regime 2: you have the wrong model for this dataset (a bias problem); more training data will not help. How do you detect it? The training error is too high and the testing error is only a little bit higher than the training error. You need to make your model more complex: add features, use kernelization or boosting, or decrease regularization.
* More data will never bring the training error down, and the testing error is, in general, greater than or equal to the training error.
### Model Selection / Kernels
When you have high bias, you know that you can add more features to reduce it, but you can also take the existing features and construct new features out of them with a transformation. -> For example, if the data looks like circles on the outside with crosses in the center, a transformation can make the two classes separable, e.g. by adding a new feature $x^2+y^2$ to the original coordinates $(x, y)$.
* Gradient Descent with Squared Loss
The kernel trick is a way to get around this dilemma by learning a function in the much higher dimensional space, without ever computing a single vector ϕ(x) or ever computing the full vector w. It is a little magical.
* General Kernels
* Linear: $K(\mathbf{x},\mathbf{z})=\mathbf{x}^\top\mathbf{z}$ (The linear kernel is equivalent to just using a good old linear classifier, but it can be faster to use a kernel matrix if the dimensionality $d$ of the data is high.) -> can suffer from high bias
* Polynomial: $K(\mathbf{x},\mathbf{z})=(1+\mathbf{x}^\top\mathbf{z})^d$
* Radial Basis Function (RBF) (aka Gaussian kernel): $K(\mathbf{x},\mathbf{z})=e^{-\frac{\|\mathbf{x}-\mathbf{z}\|^2}{\sigma^2}}$
The RBF kernel is the most popular kernel! It is a universal approximator!! Its corresponding feature vector is infinite dimensional and cannot be computed. However, very effective low-dimensional approximations exist (see this paper).
* Exponential kernel: $K(\mathbf{x},\mathbf{z})=e^{-\frac{\|\mathbf{x}-\mathbf{z}\|}{2\sigma^2}}$
* Laplacian kernel: $K(\mathbf{x},\mathbf{z})=e^{-\frac{\|\mathbf{x}-\mathbf{z}\|}{\sigma}}$
* Sigmoid kernel: $K(\mathbf{x},\mathbf{z})=\tanh(a\,\mathbf{x}^\top\mathbf{z}+c)$
* Kernel functions
Can any function $K(\cdot,\cdot)\to\mathbb{R}$ be used as a kernel?
No, the matrix $K(\mathbf{x}_i,\mathbf{x}_j)$ has to correspond to real inner products after some transformation $\mathbf{x}\to\phi(\mathbf{x})$. This is the case if and only if K is positive semi-definite. That's the constraint: the kernel matrix, viewed as a quadratic form, can never be negative.
Remember $K_{ij}=\phi(\mathbf{x}_i)^\top\phi(\mathbf{x}_j)$. So $K=\Phi^\top\Phi$, where $\Phi=[\phi(\mathbf{x}_1),\dots,\phi(\mathbf{x}_n)]$. It follows that K is p.s.d., because $\mathbf{q}^\top K\mathbf{q}=(\Phi\mathbf{q})^\top(\Phi\mathbf{q})\ge 0$. Inversely, if any matrix A is p.s.d., it can be decomposed as $A=\Phi^\top\Phi$ for some realization of Φ.
You can even define kernels over sets, strings, graphs and molecules.
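A small numpy check of the positive semi-definiteness property discussed above, using the RBF kernel on random data (the bandwidth and the data are arbitrary).

```python
import numpy as np

def rbf_kernel_matrix(X, sigma=1.0):
    # K_ij = exp(-||x_i - x_j||^2 / sigma^2)
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / sigma**2)

X = np.random.randn(10, 3)                        # hypothetical data
K = rbf_kernel_matrix(X)
eigvals = np.linalg.eigvalsh(K)                   # a valid kernel matrix must be p.s.d.
assert eigvals.min() >= -1e-10                    # all eigenvalues (numerically) non-negative
```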

Figure 1: The demo shows how a kernel function solves a problem that linear classifiers cannot solve. The RBF kernel produces a good decision boundary in this case.
Step 1:
Take the algorithm and prove that the data is only accessed through inner products.
Step 2:
Define a kernel function and substitute it for those inner products.
**Convert the primal problem to a dual problem; you have to define a kernel matrix. Take the original dataset, which is linearly separable, and solve the dual problem with a linear kernel. This should give exactly the same linear classifier as before: $K_{ij}$ is then just the inner product between two vectors, so you get the same result.**
**So the SVM, especially with the RBF kernel, is a smart nearest-neighbor algorithm. But there are two differences. First, it takes your training dataset and thins it out: it removes the points that are obviously right, so you don't have to keep the entire dataset around; you don't need the data points with a zero alpha. Secondly, for a testing point there is basically a soft assignment: in a dense region you effectively take more neighbors into account, and in a sparse region fewer. The RBF kernel is still affected by the curse of dimensionality, simply because in high dimensions everything is far away from everything.**
### Gaussian Processes
* Gaussian Processes are based on Gaussian distributions.
* Things become Gaussian after you average them.
* The Gaussian is the black hole of distributions. Once it is Gaussian, it stays Gaussian.
* Recap: MLE -> which w makes our data most likely, explains our data the best. MAP -> given that we have the data, what is the most likely set of parameters.
* The Gaussian Process is the most elegant algorithm: you compute the kernel matrix and you get both the prediction and its variance.
* Examples are speech recognition systems and Bayesian optimization (minimization of unknown, expensive functions).
* Bayesian optimization starts randomly, then fits a surface to the observations, and then looks for the minimum while taking the uncertainty into account.
* Gaussian Process Regression has the following properties:
GPs are an elegant and powerful ML method
We get a measure of (un)certainty for the predictions for free.
GPs work very well for regression problems with small training data set sizes.
Running time $O(n^3)$ ← matrix inversion (gets slow when $n\gg 0$) ⇒ use sparse GPs for large n.
GPs are a little bit more involved for classification (non-Gaussian likelihood).
We can model non-Gaussian likelihoods in regression and do approximate inference for e.g., count data (Poisson distribution)
GP implementations: GPyTorch, GPML (MATLAB), GPy, pyGPs, and scikit-learn (Python)

The figure shows a Gaussian process trained on four training points (black crosses) and evaluated on a dense grid within the [-5,5] interval. The red line shows the predicted mean value at each test point. The shaded gray region depicts the uncertainty of the prediction (two standard deviations from the mean). Testing points that are closer to training points (i.e. have more similar features) obtain more certain predictions.
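A bare-bones sketch of GP regression with an RBF kernel, assuming a small noise term on the diagonal; it returns exactly the two quantities shown in the figure, the predictive mean and variance (the $O(n^3)$ cost sits in the linear solves).

```python
import numpy as np

def gp_regression(X_train, y_train, X_test, sigma=1.0, noise=1e-2):
    # Gaussian process regression with an RBF kernel: returns predictive mean and variance.
    def rbf(A, B):
        d2 = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
        return np.exp(-d2 / sigma**2)

    K = rbf(X_train, X_train) + noise * np.eye(len(X_train))   # n x n kernel matrix (+ noise)
    K_star = rbf(X_test, X_train)                               # test-vs-train kernel values
    mean = K_star @ np.linalg.solve(K, y_train)                 # predictive mean
    var = np.diag(rbf(X_test, X_test)) - np.sum(
        K_star * np.linalg.solve(K, K_star.T).T, axis=1)        # predictive variance
    return mean, var
```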
### KD-Trees / Ball-Trees
k-Dimensional Trees:
Step 1: take one dimension of the data and find a splitting point that puts half the data points on one side and half on the other (split along the median). You get two subsets; run the algorithm recursively on each and keep splitting.
Step 2: at test time, you descend down the tree with your test point; it always falls into one leaf, and you do the nearest-neighbor search in that area first (how far beyond it you have to search depends on the distance to the splitting threshold).
* KD-Trees don't work at all in high dimensional space.
* Ball-tree: you are no longer restricted to splitting along an axis. You don't build boxes, you build big balls. How? You pick a point at random, find the point that is farthest away from it, and then find the point that is farthest away from that second point. What you tend to get is two points that are really far away from each other. Now you have a direction; you project everything onto that direction and split at the median. You compute the mean point of each half and the radius (the distance to its farthest point), so you know the sphere that contains all of those points, and you do the same on the other side of the median. Now you can build a tree.
* k-NN is slow during testing because it does a lot of unnecessary work.
* KD-trees partition the feature space so we can rule out whole partitions that are further away than our closest k neighbors. However, the splits are axis aligned which does not extend well to higher dimensions.
* Ball-trees partition the manifold the points are on, as opposed to the whole space. This allows it to perform much better in higher dimensions.
* Is it always possible to keep splitting the data until every single leaf is pure? The answer is no, because you could have two points that have exactly the same features but different labels. Ruling out that case, the answer is yes. You can prevent it from happening by adding a little Gaussian noise, which makes the two points not exactly identical.
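A quick illustration with scikit-learn's KDTree and BallTree on made-up low-dimensional data.

```python
import numpy as np
from sklearn.neighbors import KDTree, BallTree

X = np.random.randn(1000, 3)                       # hypothetical low-dimensional training data
kd = KDTree(X)                                     # axis-aligned splits: fast in low dimensions
ball = BallTree(X)                                 # splits along data directions: scales better with d

dist, idx = kd.query(np.random.randn(1, 3), k=5)   # 5 nearest neighbors of a random test point
```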
### Decision Tree
Finding the optimal decision tree is an NP-hard problem; it takes a long time, and as your data set gets larger, the amount of extra computation needed to accommodate the extra data grows exponentially.
A decision tree on its own is a poor algorithm because it has clear bias and variance problems, but they are easy to address: if you address the variance with bagging and the bias with boosting, it becomes an amazing algorithm! E.g. search-engine ranking uses boosted decision trees.
You need impurity function:
* Gini impurity -> for the set of data points in a certain leaf, you want to know how pure that set is. First compute the probability of each label, which you can estimate with maximum likelihood: the probability of label k is the number of points with that label divided by the total number of points in the set. For the impurity, you go over every single class and multiply the probability of picking a point of that class by the probability of not picking a point of that class.

On the impurity curve you start in the middle (an even class mix) and want to reach a lower point, where the leaf is dominated by a single class.
* Entropy: What do you not want? Leaves where every single class is equally common. The worst possible distribution within a leaf is $q_1$ = $q_2$ = ... = $q_k$ = $\frac{1}{k}$, so each one of the classes is equally likely. This is terrible because you really don't know which label to give your test point.
The KL divergence between the distributions P and Q is the sum over all the different classes: $KL(p\|q)=\sum_k p_k \log \frac{p_k}{q_k}$
**First assess how likely the procedure is to produce a good tree; only if that probability is high do you go on and optimize it.**
Machine learning is actually very similar to compression. You want to minimize the entropy of a set by repeatedly splitting the data. For any given set, you can compute the entropy. But how do you find the optimal split? The answer is you try out every single one.
You look at every dimension and consider every single split you can make: given d dimensions and n data points, that is d times (n−1) candidate splits. Then you have to evaluate how good each split is; for that you need an impurity function (the Gini index or the entropy), as in the sketch below.
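A sketch of the impurity functions and of the brute-force search over all $d\times(n-1)$ candidate splits described above; the function names and the way ties are handled are illustrative choices.

```python
import numpy as np

def gini_impurity(labels):
    # sum_k q_k * (1 - q_k), where q_k is the fraction of points in the set with label k
    _, counts = np.unique(labels, return_counts=True)
    q = counts / counts.sum()
    return np.sum(q * (1 - q))

def entropy(labels):
    # -sum_k q_k log2 q_k; maximal when all classes are equally likely
    _, counts = np.unique(labels, return_counts=True)
    q = counts / counts.sum()
    return -np.sum(q * np.log2(q))

def best_split(X, y, impurity=gini_impurity):
    # Try every feature and every threshold; keep the split with the lowest
    # weighted impurity of the two children.
    best = (None, None, np.inf)
    n, d = X.shape
    for feat in range(d):
        for thr in np.unique(X[:, feat])[:-1]:
            left, right = y[X[:, feat] <= thr], y[X[:, feat] > thr]
            score = (len(left) * impurity(left) + len(right) * impurity(right)) / n
            if score < best[2]:
                best = (feat, thr, score)
    return best   # (feature index, threshold, weighted impurity)
```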
Decision trees have several nice advantages over nearest neighbor algorithms: 1. once the tree is constructed, the training data does not need to be stored. Instead, we can simply store how many points of each label ended up in each leaf - typically these are pure so we just have to store the label of all points; 2. decision trees are very fast during test time, as test inputs simply need to traverse down the tree to a leaf - the prediction is the majority label of the leaf; 3. decision trees require no metric because the splits are based on feature thresholds and not distances.

New goal: Build a tree that is:
1. Maximally compact
2. Only has pure leaves
Quiz: Is it always possible to find a consistent tree?
Yes, if and only if no two input vectors have identical features but different labels
Bad News! Finding a minimum size tree is NP-Hard!!
Good News: We can approximate it very effectively with a greedy strategy. We keep splitting the data to minimize an impurity function that measures label purity amongst the children.
**Checking every candidate split exactly is expensive (I am not sure whether I already noted this). The best way, in 2D for example, is to sweep incrementally along the x or y axis, check whether each data point falls on the left or the right of the split, and update the fraction counts (numerator and denominator) as you go. The professor said oblique (diagonal) splits do not work noticeably better, but I forgot the reason.**
* Regression Trees
How do you do this for regression?
CART: Classification and Regression Trees
In practice, you have data points sampled from a function; you split somewhere and estimate everything on the right by its mean, and do the same on the left, so you get two straight (constant) segments, one on the right and one on the left, that approximate the function. You split again and again, and in the end you get a function approximated by many straight segments: you split the space repeatedly, and for any given x value all the points in the same interval get the same prediction. To decide when to stop, you can grow only to a certain depth, stop when there are n points in a leaf, or, better, limit the number of nodes in the tree.

CART summary:
CARTs are very lightweight classifiers
Very fast during testing
Usually not competitive in accuracy but can become very strong through bagging (Random Forests) and boosting (Gradient Boosted Trees)
Here is the problem with decision trees.
It's very hard to find the sweet spot, and the reason is that in empirical risk minimization λ can take any real number, whereas for a decision tree the complexity knob (e.g. the depth) has to be an integer: it can be 3 or 4 but not 3.5. You can only tune in interval steps, so decision trees have both a bias and a variance problem.

### Bagging
Bagging is something you can use with any classifier that has a variance problem. **An amazing algorithm.**
Create new data sets out of the original data set: you sample n points with replacement (so the same point can be picked more than once), and you repeat this to get m different data sets, each consisting of points from the original data set D (they are not independent). You then train a classifier on each data set and average them, so your final classifier is the average.
* Random Forest
The Random Forest is one of the best, most popular and easiest to use out-of-the-box classifier. There are two reasons for this:
* 1. The RF only has two hyper-parameters, m and k. It is extremely insensitive to both of these. A good choice for k is k=$√d$ (where d denotes the number of features). You can set m as large as you can afford.
* 2. Decision trees do not require a lot of preprocessing. For example, the features can be of different scale, magnitude, or slope. This can be highly advantageous in scenarios with heterogeneous data, for example the medical settings where features could be things like blood pressure, age, gender, ..., each of which is recorded in completely different units.
* For each tree you draw a bootstrap sample of the data (sampling with replacement, so it has the same size as D but not the same content), and at each split you randomly subsample k of the d dimensions and restrict the search to those k dimensions (the subsampled dimensions differ from split to split). Because the trees see very different features, they make very different mistakes at test time, and if you average them, the mistakes average out. But remember, you only split on k < d features.
* no need of pre-processing on the scale
* Another advantage is the out-of-bag error: you do not need a separate validation data set. You can estimate the test error directly on the training data: for every training data point, you evaluate it only with the trees that did not see it during training and estimate the error (the loss). This is extremely powerful because it means you train your whole classifier on all your data and, for free, you get an unbiased estimate of how well your classifier will do on the test set.
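A short illustration with scikit-learn's RandomForestClassifier on synthetic data, using the out-of-bag score as the free test-error estimate described above; the hyper-parameter values are arbitrary.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 16))                     # hypothetical data, d = 16 features
y = (X[:, 0] * X[:, 1] > 0).astype(int)

rf = RandomForestClassifier(n_estimators=500,      # m: as large as you can afford
                            max_features="sqrt",   # k = sqrt(d) features considered at each split
                            oob_score=True,        # estimate test error from the out-of-bag points
                            random_state=0)
rf.fit(X, y)
print("out-of-bag accuracy:", rf.oob_score_)       # no separate validation set needed
```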
### Boosting
What if we have really high bias?
* Scenario: Hypothesis class H, whose set of classifiers has large bias and the training error is high (e.g. CART trees with very limited depth.)
* Famous question: In his machine learning class project in 1988 Michael Kearns famously asked the question: Can weak learners (H) be combined to generate a strong learner with low bias?
* Famous answer: Yes! (Robert Schapire in 1990)
1. Gradient Boosted Regression Tree(GBRT)

2. AdaBoost

You want to find the weak learner h that minimizes the weighted error: you sum the weights of all the data points it gets wrong and try to minimize that sum.
For AdaBoost, you need an algorithm that takes the Xs, the Ys and the weights and tries to find the hypothesis with the lowest weighted error. The misclassified points get heavily re-weighted, so the next time you run the weak learner it will focus on getting those points right. Every iteration you reweigh your data points, giving the points you are still getting wrong more weight, and the weights increase exponentially.

AdaBoost also overfits, but slowly.
Boosting is a great way to turn a weak classifier into a strong classifier. It defines a whole family of algorithms, including Gradient Boosting, AdaBoost, LogitBoost, and many others. Gradient Boosted Regression Trees is one of the most popular algorithms for Learning to Rank, the branch of machine learning focused on learning ranking functions, for example for web search engines. A few additional things to know:
* The step size α is often referred to as shrinkage.
* Some people do not consider gradient boosting algorithms to be part of the boosting family, because they have no guarantee that the training error decreases exponentially. Often these algorithms are referred to as stage-wise regression instead.
* Inspired by Breiman's Bagging, stochastic gradient boosting subsamples the training data for each weak learner. This combines the benefits of bagging and boosting. One variant is to subsample only $n/2$ data points without replacement, which speeds up the training process.
* One advantage of boosted classifiers is that during test time the computation $H(x)=\sum_{t=1}^{T}\alpha_t h_t(x)$ can be stopped prematurely if it becomes clear which way the prediction goes. This is particularly interesting in search engines, where the exact ranking of results is typically only interesting for the top 10 search results. Stopping the evaluation of lower ranked search results can lead to tremendous speed ups. A similar approach is also used by the Viola-Jones algorithm to speed up face detection in images. Here, the algorithm scans regions of an image to detect possible faces. As almost all regions in natural images do not contain faces, there are huge savings if the evaluation can be stopped after just a few weak learners are evaluated. These classifiers are referred to as cascades, which spend very little time on the common case (no face) but more time on the rare interesting case (face). With this approach Viola and Jones were the first to solve face detection in real time on low-performance hardware (e.g. cameras).
* AdaBoost is an extremely powerful algorithm, that turns any weak learner that can classify any weighted version of the training set with below 0.5 error into a strong learner whose training error decreases exponentially and that requires only $O$(log(n)) steps until it is consistent.

AdaBoost never stops on its own because the exponential loss is always nonzero. It is very good about not increasing the variance, but it does not cope well with very noisy data.
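A minimal sketch of the AdaBoost reweighting loop described above, using depth-1 trees (stumps) as weak learners; the labels are assumed to be in {−1, +1} and the early-stopping rule is a simplification.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, T=50):
    # y in {-1, +1}; weak learner = depth-1 tree ("stump") trained on the weighted data.
    n = len(y)
    weights = np.full(n, 1.0 / n)
    learners, alphas = [], []
    for _ in range(T):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=weights)
        pred = stump.predict(X)
        err = weights[pred != y].sum()              # weighted training error of this weak learner
        if err >= 0.5:                              # no longer better than chance; stop
            break
        alpha = 0.5 * np.log((1 - err) / (err + 1e-12))
        weights *= np.exp(-alpha * y * pred)        # misclassified points get exponentially more weight
        weights /= weights.sum()
        learners.append(stump)
        alphas.append(alpha)
    return learners, alphas

def adaboost_predict(learners, alphas, X):
    H = sum(a * h.predict(X) for a, h in zip(alphas, learners))
    return np.sign(H)
```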
### Deep Learning and Neural Network
* Gradient Descent and Stochastic Gradient Descent
Neural Network is also known as multi-layer perceptrons and deep nets.
In batch gradient descent, one update requires going over all the data points, computing the full gradient, and taking the gradient step.
With the original gradient descent algorithm you take a few large steps, while with stochastic gradient descent you take many, many little steps and get close to the minimum faster.
You pick data points at random, compute a cheap approximation of the gradient, make the gradient update, and keep iterating.
The disadvantage of stochastic gradient descent is that it is hard to converge once you get very close to the exact minimum; you would still need countless steps to actually get there. However, the same property makes it good at avoiding being trapped in narrow local minima, and in machine learning people usually don't care about the exact minimum anyway.
People used to think you want a really deep hole, i.e. to minimize the loss as far as possible, which is what gradient descent is really good at. But the loss function changes at test time: very narrow local minima tend to be specific to the data set you trained on, whereas for a very wide minimum, when you change the data, the loss is unlikely to change a lot. So what you really want is to get as low as possible in your loss function within a really wide minimum, and that is exactly what SGD tends to find.
People use two tricks in practice: first, they don't take just one point at random but around 64 to 128 points (mini-batches), and second, they initially take a large learning rate, so the updates are really noisy. This prevents you from falling into small local minima while still getting you close to the area where the good minimum is (see the sketch below).
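A minimal mini-batch SGD sketch with a decaying learning rate; `grad` is a hypothetical callable that returns the gradient estimated on a mini-batch only, and the batch size and decay factor are illustrative.

```python
import numpy as np

def minibatch_sgd(grad, w0, X, y, lr=0.1, batch_size=64, epochs=10):
    # grad(w, X_batch, y_batch) returns the gradient estimated on the mini-batch only.
    w, n = w0, len(y)
    rng = np.random.default_rng(0)
    for _ in range(epochs):
        order = rng.permutation(n)
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]   # 64-128 random points instead of the full data
            w = w - lr * grad(w, X[idx], y[idx])    # cheap, noisy gradient step
        lr *= 0.9                                   # anneal the (initially large) learning rate
    return w
```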

* Neural Network
Every single dimension of x is drawn as a little circle, so here it is a five-dimensional vector. Each layer produces a new representation, written φ''(x), then φ'(x), then φ(x), and finally h(x). It is important to know that when people say they look at "a neuron" or "the value of a neuron", what they really mean is just one of the dimensions of one of these intermediate vectors.

[tensorflow playground ](https://playground.tensorflow.org/#activation=tanh&batchSize=10&dataset=circle®Dataset=reg-plane&learningRate=0.03®ularizationRate=0&noise=0&networkShape=4,2&seed=0.13684&showTestData=false&discretize=false&percTrainData=50&x=true&y=true&xTimesY=false&xSquared=false&ySquared=false&cosX=false&sinX=false&cosY=false&sinY=false&collectStats=false&problem=classification&initZero=false&hideText=false)
###### tags: `Digital Media` `WS20/21`