---
tags: Bachelor
---

# Summary - Algoritmi avanzati

**Index**

[TOC]

## Ch. 1 - Introduction (1-8)

**LION = Learning + Intelligent OptimizatioN**

### Optimization

Decision making picks the best among a given set of possible solutions; **optimization actively creates new solutions**.

Optimization fuels **automated creativity and innovation**.

Optimizing goodness(x):
- easy using automation
- hard to define goodness(x)
- HERE COMES MACHINE LEARNING

### Machine Learning

**Can build goodness(x) from abundant data.**

#### Method 1: Kriging

Krige -> mining -> Gold(x), x = position

**Using previous history as data for learning.**

**Estimate the highest point in space using history, dig, repeat.**

![](https://i.imgur.com/5k2gLXO.png)

#### Method 2: Feedback from decision makers

Solving many if not most real-world problems requires **iterative processes with learning involved**.

LION can:
- learn from the past
- learn while working
- cope with incomplete info
- quickly adapt to new situations

### Business Intelligence

- understand current and past performance
- predict the outcome of decisions
- improve profit by optimizing factors (during production, etc.)

From **descriptive** to **predictive**.

## Ch. 2 - Lazy learning: nearest neighbors (9-12)

### Supervised Learning

Uses **LABELED DATA**:
- a list of Examples
- Example = x (features) and y (label)

### Nearest Neighbor

- **LAZY**: stores all data, does nothing until interrogated with a new point (a new vector of features)
- Simple NN: label as the NN
- KNN: search the K nearest, decide based on majority / unanimity. If doing **regression**, simply **average the labels**
- Weighted KNN: find the KNN, apply magic formula (usually for regression)

![](https://i.imgur.com/k84VdNV.png)

## Ch. 3 - Learning requires a method (21-31)

Learning is:
- unifying different cases by discovering the underlying explanatory laws
- generalization, the capability of explaining new cases

Before applying ML:
- **feature selection**: meaningful subset of the original feature set
- **feature extraction**: combination (normalization/product/square, etc.) of features

Two classes of problems:
- **classification**: Y is a set of labels
- **regression**: Y is a numerical value

Sometimes classification between two labels is transformed into regression on **probability** or **fuzzy membership**.

### Minimization and generalization

Optimize a set of weights W (usually one for each feature). The free parameters are fixed by demanding that the learned model works correctly on the examples in the training set.

**Supervised learning => minimizing a specific error function**

### Bias-Variance dilemma

- too few parameters => inaccurate, large bias, lack of flexibility
- too many parameters => inaccurate, large variance (same as in numerical instability), too sensitive to example details, *possible overfitting (editor's note)*
- finding the **right balance** is key

![](https://i.imgur.com/A0c3G6Y.png)

### Two types of classification

- **generative** methods: why/how are these labels generated from the examples? *learn how to generate other examples (editor's note)*
- **discriminative** algorithms: given a new point, find its label

Examples of the latter: Perceptrons, Support Vector Machines, ...

**The second option implies that accurate classifiers can be built without knowing or building a detailed model of the generation process.**

### Learn :books:, Validate :microscope:, Test :boom:

1. split the data
2. check performance using **RMS** on the validation set

**Overfitting**: the **equivalent of memorizing** examples, without learning. Symptoms and causes:
- small training set
- trained too much => huge difference between training RMS and validation RMS, or simply training RMS close to zero
- too many parameters

**Cross validation**

Repeat training and validation using different subsets of the data, to check the model's performance.

### K-fold cross validation

1. split the data in K subsets
2. train on K-1 subsets, validate on the remaining 1
3. repeat step 2 with each of the K subsets as validation

**Stratified K-fold cross validation**

In stratification, a separate slice is kept for each class to **maintain the relative proportion of cases of the classes**:

1. split the data into one subset per label
2. split each label subset in K subsets
3. train on the stratified union of K-1 and validate on the stratified remaining 1
4. repeat step 3 with each of the K subsets as validation

![](https://i.imgur.com/bifM7gv.png)

### Test set

**Used only once, for the final performance measure.**

In the standard case of a single train-validation cycle, the terms validation and testing are often used as synonyms.

### Errors :red_circle:

![](https://i.imgur.com/KZL0sAD.png)

- Accuracy: how good at recognizing true values
- Precision: how many of an output are right
- Recall: how good at classifying a given input

### Confusion matrix

![](https://i.imgur.com/M2QdBlG.png)

### GIST

![](https://i.imgur.com/R74mSnd.png)

## Ch. 4 - Linear models (35-45)

### Finding a linear dependence

Given a vector of weights W, try to optimize W so that

![](https://i.imgur.com/a0mTM2s.png)

approximates the example data as well as possible, i.e.: **minimize the model error (least squares approximation)**

![](https://i.imgur.com/RqiwUac.png)

Usually the data is too big for zero-error optimization.
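A minimal least-squares fit can be sketched with NumPy (synthetic data; all names here are illustrative, not from the book):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic examples: y depends linearly on two features, plus noise.
X = rng.normal(size=(100, 2))
true_w = np.array([2.0, -1.0])
y = X @ true_w + 0.1 * rng.normal(size=100)

# Least squares: find w minimizing the squared model error ||X w - y||^2.
w, residuals, rank, _ = np.linalg.lstsq(X, y, rcond=None)
print(w)  # close to [2.0, -1.0]
```

With noisy data the system has no exact solution, so `lstsq` returns the weights with the smallest squared error instead.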
### Converting non-linear to linear

Simply add more dimensions to the input / modify existing ones. E.g.: **for squared relations, use the squares as extra inputs.**

**In general, f(X) becomes ![](https://i.imgur.com/dDCxJ6u.png) where $\varphi(x)$ is a function that transforms the original non-linear inputs to make them linear.**

### What about classification?

How to define the error function? In the case of **2 labels**, rebuild the label function so that

![](https://i.imgur.com/na2o4Ec.png)

Basically, **divide the classes using a hyperplane**.

*With more labels, use **logistic regression** (editor's note)*

### How precise does the hyperplane need to be?

(Still with **only 2 labels**)

By **forcing the outputs to be far apart** (+1 and -1) and **penalizing squared errors, the fragility caused by accepting any separating hyperplane disappears.**

![](https://i.imgur.com/ruvfUoV.png)

The margin of line A is larger than that of line B. A large margin increases the probability that new examples will fall on the right side of the separator line.

### Why linear models?

The deep reason why linear models are so popular is the ***smoothness*** underlying many if not most physical phenomena: **"Nature does not make jumps"**.

**Every smooth (differentiable) function can be approximated around an operating point because the first term of its Taylor series is always linear.**

### Minimizing the ModelError function

In an error-free ideal world, simply solve the linear system

![](https://i.imgur.com/uQ1jYXt.png)

with one equation for each example. In the real world, we need to **generalize the solution of a system of linear equations by allowing errors**.

1. **Using the pseudo-inverse**

   It's a **"one shot"** technique.

   ![](https://i.imgur.com/gAOrqJQ.png)

   Basically, it exploits the fact that to find the minimum of a squared function (ModelError) we need to solve a linear system.

2. **Gradient descent**

   In some cases, **if the number of examples is huge, an iterative technique based on gradient descent can be preferable.**

   Start from initial weights and keep moving them by executing small steps along the direction of the negative gradient.

   **Similar to our brain's iterative learning.**

### Physical analogy for linear regression

Minimizing this system's potential energy:

![](https://i.imgur.com/N7rBydk.png)

### Numerical instability and ridge regression

When the training examples are almost in line, the resulting line can shift a lot because of numerical instability.

Stable vs unstable:

![](https://i.imgur.com/G2A7uLZ.png)

**Stability means that small changes in the training data lead to small changes in the results.**

**RIDGE REGRESSION**

Add a **regularization term ($\lambda$)** to the error function:

![](https://i.imgur.com/oxdB0as.png)

![](https://i.imgur.com/MuqsaIw.png)

where $\lambda$ is a number and $I$ is the identity matrix.

![](https://i.imgur.com/hUVqS0Z.png)

This is the best way to cure **ill-posed problems**. Ill-posed problems have **no unique solution because of lacking data / data not spread enough**.

## Ch. 5 - Mastering generalized linear least-squares (47-51)

### Polynomial fitting

![](https://i.imgur.com/TKCNUnH.png)

Optimize using the **chi-squared** error function:

![](https://i.imgur.com/RGNiMew.png)

What degree, how many parameters?

![](https://i.imgur.com/DwUnGaH.png)

Too many parameters (degree = N of points) vs 3. On the left, **overfitting**.

## Ch. 6 - Rules, decision trees, and forests (59-67)

**Rules:** a way to condense nuggets of knowledge in a form amenable to human understanding.
- characterised by a `precondition` and a `conclusion`

If the set of rules gets large, **complexities** can appear, like rules leading to **contradictory classifications**.

**Automated generation of rules is the key:** `decision trees`.
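The ridge-regression cure for instability can be sketched in NumPy: solving $(X^TX + \lambda I)w = X^Ty$ on nearly collinear data (synthetic data; names and $\lambda$ value are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Nearly collinear examples: the second feature is almost a copy of the
# first, which makes the plain least-squares problem numerically unstable.
x = rng.normal(size=50)
X = np.column_stack([x, x + 1e-6 * rng.normal(size=50)])
y = 3.0 * x + 0.1 * rng.normal(size=50)

lam = 1e-3  # regularization strength lambda

# Ridge closed form: w = (X^T X + lambda I)^-1 X^T y
w = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
print(w)  # the two weights split the ~3.0 total slope
```

Without the $\lambda I$ term the two near-identical columns make the system ill-posed (no unique solution); the regularizer picks the small-norm solution, so the total slope is shared between the two weights.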
### Decision trees :evergreen_tree:

For a given input object, a decision tree **estimates an unknown property** of the object by asking successive questions about its known properties.
- each answer determines the next question
- the sequence of answers determines a **path**
- can be built in a **greedy manner** (`most informative questions first`)
- aims at getting the children subsets from each question **as pure as possible**
- **faster purification** of the set **means faster decisions**

![](https://i.imgur.com/gIxfAup.png)

**Decision reliability**
- **Pure subset:** when reaching the leaf, the only possible outcome is given
- **Impure subset:** in some cases one stops with slightly impure subsets, and outputs a probability for the different classes when cases reach that leaf node

**Measuring purity** of a subset: `information gain` (or mutual information) vs `Gini impurity`.

### Shannon entropy

Measures the statistical uncertainty in the obtained class: $H(Y)=−\sum_{y∈Y}\Pr(y)\log \Pr(y)$

![](https://i.imgur.com/jn4rI4w.png)

### Information gain

With this method the impurity of a set is **measured by the entropy** of the class probability distribution.
- let $S$ be the current set of examples, and let $S = S_{YES} ∪ S_{NO}$ be the split obtained after answering a question about an attribute:

$$IG = H(S) − \frac{|S_{YES}|}{|S|} H(S_{YES}) − \frac{|S_{NO}|}{|S|} H(S_{NO})$$

An optimal question maximizes IG, that is, reduces the average entropy as much as possible. The **ideal question** leaves no indecision:
- all elements in $S_{YES}$ are cases of one class, while all elements of $S_{NO}$ belong to another class, therefore the entropy of the two resulting subsets is zero

**NB:** information gain is useful but not perfect (possible overfitting, ...).

### Gini impurity

Measures how often a randomly chosen element from the set would be incorrectly labeled if it were randomly labeled according to the distribution of labels in the subset.
Produces **zero errors if the set is pure**, and a small error rate if a single class gets the largest share of the set.
- given $m$ classes and $f_i$ the fraction of items labeled with value $i$ in the set:

$$GI(f) = \sum_{i=1}^m f_i(1−f_i) = \sum^m_{i=1}(f_i − f_i^2) = \sum^m_{i=1}f_i − \sum^m_{i=1}f_i^2 = 1 − \sum^m_{i=1}f_i^2$$

- it is computed as the expected value of the **mistake probability**
- i.e., the sum over $i$ of (probability of choosing $i$) * (probability of a wrong label for $i$)

### Posing questions

In general, questions should have a **binary output**.
- for a **categorical variable**, the test can be based on the variable having (or not) a subset of the possible values
- for **real-valued variables**, the tests can be on a `single variable` or on a `simple combination` (linear) of a subset of variables

### Missing values

> In real-world data, **missing values** are abundant like snowflakes in winter - LIONBook

![](https://i.imgur.com/0KmgwiB.png)

As a final remark, **do not confuse** the **construction** of a decision tree (using the labeled examples, the purity measures, the choice of appropriate questions) with the **usage** of a built tree.

### Decision forests

Use an ensemble of trees to make decisions in a democratic way.
- train different trees on **different sets** of examples, using **bootstrap samples**
- **bagging** (`bootstrap aggregation`): the application of bootstrapping to the creation of the different sample sets
- allow for **randomized** choices during training

**Democracy:** an isolated tree will be rather weak; however, collecting all the outputs and `averaging` (regression) or `voting` (classification) will provide **reasonable answers**.

> A single tree casts a small shadow, hundreds of them can cool even the most torrid machine learning applications.
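The entropy and information-gain formulas used to grow the trees can be computed directly; a minimal sketch (illustrative names, binary YES/NO split as in the text):

```python
import math

def entropy(labels):
    """Shannon entropy H(Y) of a list of class labels, in bits."""
    n = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def information_gain(parent, yes_subset, no_subset):
    """IG = H(S) - |S_YES|/|S| H(S_YES) - |S_NO|/|S| H(S_NO)."""
    n = len(parent)
    return (entropy(parent)
            - len(yes_subset) / n * entropy(yes_subset)
            - len(no_subset) / n * entropy(no_subset))

# An ideal question separates the classes perfectly:
# entropy drops from 1 bit to 0, so IG = 1.0.
s = ["A", "A", "B", "B"]
print(information_gain(s, ["A", "A"], ["B", "B"]))  # 1.0
```

A question that leaves both subsets as mixed as the parent would give IG = 0; the greedy construction picks, at each node, the question with the largest IG.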
**Features:**
- like all trees, they naturally **handle problems** with more than two classes and with missing attributes
- they provide a `probabilistic output`: `probabilities` and `error bars`
- **good generalization** to previously unseen data, without risking over-training
- **fast and efficient** thanks to their `parallelism` and the `reduced set of tests` per data point

The **abundance of memory** and computing power permits training very large numbers of different trees.

## Ch. 7 - Ranking and selecting features (69-77)

**Feature extraction:** starts from an initial set of **raw data** and builds derived values (features) intended to be informative and non-redundant, facilitating the subsequent learning and generalization steps.
- is a **dimensionality reduction** process
- while still accurately and completely describing the original data set

### Feature selection

**Feature selection** (or attribute selection, or variable subset selection): the process of selecting a subset of **relevant features** to be used in model construction.
- one must be sure that the input data have sufficient information to predict the outputs, **without excessive redundancy**

**Selecting a `small number of informative features` has advantages:**
1. ➖ dimensionality
2. ➖ memory usage
3. ➕ generalization
4. ➕ human understanding

**Methods for feature selection:**
1. **designer intuition** and **existing knowledge**
2. estimate the **relevance** or **discrimination power** of the individual features
3. top-down / bottom-up methods
   - **top-down** (`truncation`): start with the complete set of features and progressively eliminate features while searching for the optimal performance point
   - **bottom-up** (`greedy inclusion`): gradually add the ranked features in the order of their individual discrimination power and stop when the error rate stops decreasing
4. other methods
   - **wrapper methods:** built "around" a specific predictive model (measure the error rate)
     - computationally intensive
     - usually provide the best-performing feature set for the specific model
   - **filter methods:** use a `proxy measure` instead of the error rate to score a feature subset
     - common proxy measures: **mutual information**, **correlation coefficient**
     - **NB:** measuring individual features in isolation discards mutual relationships, so the result is an approximation
   - **embedded methods:** perform feature selection as an integral part of the model construction process
     - **LASSO:** for constructing a linear model, `penalizes the regression coefficients`, shrinking many of them to zero so that the corresponding features can be eliminated
     - **Recursive Feature Elimination:** commonly used with Support Vector Machines; repeatedly constructs a model and removes the features with low weights

### Proxy measures

It is difficult to **rank individual features** without considering the specific `modeling method` and their `mutual relationships`.

#### Correlation coefficient 🔢▪🔢

**`Pearson` product-moment correlation coefficient:** the most widely used measure of the `linear relationship` between **numeric variables**.
- obtained by dividing the `covariance` of the two variables by the product of their `standard deviations`

$$ρ_{X_i,Y} = \frac{cov[X_i, Y]}{σ_{X_i}σ_Y} = \frac{E[(X_i − µ_{X_i})(Y − µ_Y)]}{σ_{X_i}σ_Y}$$

- $ρ_{X_i,Y}$ is the `correlation coefficient` between the $i$-th input feature $X_i$ and the classifier's outcome $Y$
- $µ_{X_i}$ and $µ_Y$ are the `expected values`
- $σ_{X_i}$ and $σ_Y$ are the `standard deviations`
- the division by the standard deviations makes the correlation coefficient **independent of units**
- the correlation value varies **from `−1` to `1`** (decreasing/increasing linear relationship)

**NB:** statistically independent variables have **zero correlation**, but zero correlation does not imply that the variables are independent.
- the correlation coefficient detects **only `linear` dependencies** between two variables

**Trust** the correlation coefficient only if you have reasons to suspect linear relationships.

#### Correlation ratio 🔢▪🔠

Measures the `relationship` between a **numeric input** and a **categorical output**.
- the correlation ratio between the $i$-th feature and the outcome:

$$η^2_{X_i,Y} = \frac{\sum\limits_{y∈Y}\ell_y(\bar{x}^{(i)}_y - \bar{x}^{(i)})^2}{\sum\limits_{y∈Y}\sum\limits^{\ell_y}_{j=1}(x^{(i)}_{jy} − \bar{x}^{(i)})^2}$$

- $\ell_y$ is the number of times that outcome $y∈Y$ appears, so that one can **partition the sample input vectors** by their output
- $x^{(i)}_{jy}$ is the $i$-th component (feature) of the $j$-th sample vector among the $\ell_y$ samples having outcome $y$
- $\bar{x}^{(i)}_y = \frac{1}{\ell_y}\sum\limits^{\ell_y}_{j=1}x^{(i)}_{jy}, \quad ∀y ∈ Y$ is the average of the $i$-th feature within each output class
- $\bar{x}^{(i)} = \frac{1}{\ell}\sum\limits_{y∈Y}\ell_y\bar{x}^{(i)}_y$ is the overall average

The correlation ratio **can capture nonlinear dependencies** but, if the relationship between the values of the $i$-th feature and the values of the outcome is `linear`, then $η^2_{X_i,Y} = ρ^2_{X_i,Y}$.

#### Chi-square test 🔠▪🔠

A **statistical hypothesis test:** a method of making statistical decisions using experimental data. Use **chi-square** to identify **possible dependencies** between inputs and output.

Measures the `relationship` between two **categorical features**:

$$\chi^2 = \sum_{c,t}\frac{[\text{count}_{c,t} − n·\text{Pr}(\text{class} = c)·\text{Pr}(\text{term} = t)]^2}{n·\text{Pr}(\text{class} = c)·\text{Pr}(\text{term} = t)}$$

- $\text{count}_{c,t}$ is the number of occurrences of the value $t$ given the class $c$
- the larger the $\chi^2$ value, the **stronger the belief** that the two events are **dependent**, and therefore that the feature is significant for predicting the output

Each term has the familiar form $\frac{(\text{observed} - \text{expected})^2}{\text{expected}}$.

For **feature ranking** no additional calculation is needed: one is **satisfied with the crude value**.
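The chi-square statistic above can be computed from a contingency table of counts; a minimal NumPy sketch (the table values are made up for illustration):

```python
import numpy as np

def chi_square(counts):
    """Chi-square statistic from a contingency table
    (rows = classes c, columns = terms t), as in the formula above."""
    counts = np.asarray(counts, dtype=float)
    n = counts.sum()
    # Expected counts under independence: n * Pr(class=c) * Pr(term=t)
    pr_class = counts.sum(axis=1) / n
    pr_term = counts.sum(axis=0) / n
    expected = n * np.outer(pr_class, pr_term)
    return ((counts - expected) ** 2 / expected).sum()

# A term concentrated in one class is a useful feature (large statistic);
# a term spread evenly across classes is useless (statistic = 0).
print(chi_square([[30, 10], [10, 30]]))  # 20.0
print(chi_square([[20, 20], [20, 20]]))  # 0.0
```

For feature ranking one simply sorts the features by this crude value, with no further significance testing.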
#### MIFS (Mutual Information for Feature Selection)

**Mutual information:** the amount by which the uncertainty in the output decreases after the input is known.

In classification, mutual information is related to upper and lower bounds on the optimal **Bayes error rate**.

Use mutual information **to estimate arbitrary dependencies** between qualitative or quantitative features.

## Ch. 9 - Specific nonlinear models (87-91)

### Logistic regression

Used for **predicting the probability** of the outcome of a **categorical variable** from a set of recorded past events.
- it really is a **technique for classification**, not regression

![](https://i.imgur.com/BQF8EvH.png)

A `logistic curve` is a common **sigmoid function**: $f(x) = \frac{1}{1 + e^{−x}}$

![](https://i.imgur.com/XE9mjXB.png)

### Locally-weighted regression (LWR)

A `lazy`, **memory-based** technique, meaning that all points and evaluations are stored and a specific model is built on demand only when a query point demands an output.

It is a **non-parametric** variant of linear regression (it requires the entire training set to make a prediction).

![](https://i.imgur.com/VIKkw2J.png)

## Ch. 10 - Neural networks: multi-layer perceptrons (95-101)

**Neural networks do not separate memory and processing** but operate via the flow of signals through the network connections.

This chapter is focused on **feed-forward multilayer perceptron neural networks**, without feedback loops.

**Neuron:** modeled as a simple computing unit, a scalar product $w\cdot x$ (`pattern matching`) followed by a sigmoidal (`logistic`) function. The **"squashing" functions** are essential to introduce `nonlinearities` into the system.

### Multilayer Perceptrons (MLP)

An MLP is composed of a large number of **highly interconnected units** (neurons) working in parallel to solve a specific problem, organized in **layers** with a feed-forward information flow (no loops).
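A minimal forward-pass sketch of such a network (one hidden layer, no bias terms, random illustrative weights):

```python
import numpy as np

def sigmoid(z):
    """Logistic "squashing" function: the source of nonlinearity."""
    return 1.0 / (1.0 + np.exp(-z))

def mlp_forward(x, w_hidden, w_out):
    """Each neuron computes a scalar product w . x followed by a sigmoid."""
    hidden = sigmoid(w_hidden @ x)   # hidden-layer activations
    return sigmoid(w_out @ hidden)   # output-layer activation

rng = np.random.default_rng(0)
w_hidden = rng.normal(size=(3, 2))  # 3 hidden neurons, 2 inputs
w_out = rng.normal(size=(1, 3))     # 1 output neuron

x = np.array([0.5, -1.0])
print(mlp_forward(x, w_hidden, w_out))  # a value in (0, 1)
```

Without the sigmoids the two layers would collapse into a single matrix product, i.e. a plain linear model; the squashing is what lets the MLP approximate nonlinear functions.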
![](https://i.imgur.com/CKAEire.png)

- the signals flow sequentially through the different layers, from the input to the output layer
- **hidden layers:** layers which are not visible at the input or at the output

MLPs are **universal approximators**: an MLP with one hidden layer can **approximate any smooth function** to any desired accuracy, given a sufficient number of hidden nodes.

### Learning via backpropagation

**Backpropagation of the error:** consists of a `gradient calculation` and a small step in the direction of the negative gradient.

Learning **optimal MLPs** from examples:
1. take a **`guiding` function** to be optimized
   - e.g.: the `sum-of-squared errors` on the training examples
2. use **gradient descent** with respect to the weights to find better and better weights
3. **stop** the descent when the results on a validation set are best
   - with **over-learning**, generalization can worsen

> Learning is not an end, but a means for generalizing.

The sum-of-squared-differences **energy** function is defined as:

$$E(w) = \frac12\sum^P_{p=1}E_p = \frac12\sum^P_{p=1}(t_p − o_p(w))^2$$

- $t_p$ and $o_p$ are the target and the current **output values** for pattern $p$, respectively

### Gradient-based backpropagation techniques

#### Batch backpropagation

**Standard version:** the weights at the next iteration $k + 1$ are updated as follows:

$$w_{k+1} = w_k − \epsilon\ g_k$$

- $g_k = ∇E(w_k)$ is the **gradient**
- $\epsilon$ is the **learning rate** (constant in the standard version)
- the **initial weights** are randomly distributed

**Adaptive learning rate version** (`bold driver`):
- **problem:** how do we select the learning rate $\epsilon$?
- small $\epsilon \implies$ the `learning time` increases
- big $\epsilon \implies$ the `energy` oscillates wildly
- **heuristic proposal:**

  ![](https://i.imgur.com/PhlFgw9.png)

  - $\rho$ is close to one ($\rho = 1.1$) in order to avoid frequent `"accidents"`
  - $\sigma$ is chosen to provide a `rapid reduction` ($\sigma = 0.5$)
  - $l$ is the minimum integer such that the reduced rate $\sigma^l\epsilon(t − 1)$ succeeds in diminishing the energy

**NB:** first the contributions $∇E_p(w_k)$ are summed, then the small step is taken (different from on-line BP).

#### On-line backpropagation (stochastic gradient descent)

The stochastic on-line backpropagation update is given by:

$$w_{k+1} = w_k − \epsilon∇E_p(w_k)$$

- the pattern $p$ is chosen **randomly** from the training set at each iteration; take a small step along a **single** negative $∇E_p(w_k)$ **immediately** after calculating it

**Pro:** using `partial gradients` is **faster**.
**Con:** less guarantee of convergence.

☀ END ☀
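As a closing sketch, the on-line update $w_{k+1} = w_k − \epsilon∇E_p(w_k)$ applied to the simplest possible model, a linear one (synthetic data; names and hyperparameters are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic training set for a linear model.
X = rng.normal(size=(200, 2))
true_w = np.array([1.0, -2.0])
y = X @ true_w + 0.05 * rng.normal(size=200)

w = np.zeros(2)
eps = 0.01  # learning rate
for _ in range(20000):
    p = rng.integers(len(X))     # pattern p chosen randomly
    error = X[p] @ w - y[p]      # o_p - t_p
    w -= eps * error * X[p]      # step along the single -grad E_p(w_k)
print(w)  # close to [1.0, -2.0]
```

Each step uses one pattern's partial gradient only, so iterations are cheap, but the trajectory is noisy: exactly the pro/con trade-off noted above.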