---
tags: Bachelor
---
# SUMMARY - Advanced Algorithms
**Indice**
[TOC]
## Ch. 1 - Introduction (1-8)
**LION = Learning + Intelligent OptimizatioN**
### Optimization
Decision making picks the best among a given set of possible solutions; **optimization actively creates new solutions**
Optimization fuels **automated creativity and innovation**
Optimizing goodness(x):
- easy using automation
- but goodness(x) itself is hard to define
- HERE COMES MACHINE LEARNING
### Machine Learning
**Can build goodness(x) from abundant data**
#### Method 1: Kriging
Krige (a mining engineer) -> mining -> Gold(x), x = position
**Using previous history as data for learning**
**Estimate the highest-value point in space using history, dig there, repeat** (sketched below)
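A minimal sketch of this estimate-dig-repeat loop; a nearest-neighbor estimate stands in for a real Kriging model, and the deposit function `gold`, the candidate grid, and the exploration rate are all invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
gold = lambda x: np.exp(-(x - 3.7) ** 2)      # hypothetical unknown deposit profile

candidates = np.linspace(0, 10, 201)           # positions where we could dig
xs = [rng.uniform(0, 10)]                      # first exploratory dig
ys = [gold(xs[0])]

for _ in range(30):
    hx, hy = np.array(xs), np.array(ys)
    # estimate Gold(x) at each candidate from the nearest dig in the history
    est = hy[np.abs(candidates[:, None] - hx[None, :]).argmin(axis=1)]
    if rng.random() < 0.3:                     # sometimes explore a fresh spot...
        x_next = rng.uniform(0, 10)
    else:                                      # ...otherwise dig near the best estimate
        x_next = candidates[np.argmax(est + rng.normal(0, 1e-3, est.shape))]
    xs.append(x_next)
    ys.append(gold(x_next))                    # dig, observe, extend the history

print(f"best position found: {xs[int(np.argmax(ys))]:.2f} (true peak at 3.70)")
```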

#### Method 2: Feedback from decision makers
Solving many if not most real-world problems requires **iterative processes with learning involved**
LION can:
- learn from the past
- learn while working
- cope with incomplete info
- quickly adapt to new situations
### Business Intelligence
- understand current and past performance
- predict outcome of decisions
- improve profit by optimizing factors (during production etc)
From **descriptive** to **predictive**
## Ch. 2 - Lazy learning: nearest neighbors (9-12)
### Supervised Learning
Uses **LABELED DATA**
- list of Examples
- Example = x (features) and y (label)
### Nearest Neighbor
- **LAZY**: stores all data, does nothing until interrogated with new point (new vector of features)
- Simple NN: copy the label of the single nearest neighbor
- KNN: find the K nearest neighbors, decide based on majority / unanimity. If doing **regression**, simply **average their labels**
- Weighted KNN: find the K nearest neighbors and weight their contributions by proximity, typically inverse distance (usually for regression); see the sketch below
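A minimal numpy sketch of plain and inverse-distance-weighted KNN regression (the toy data and `k` are made up for illustration):

```python
import numpy as np

def knn_predict(X_train, y_train, x_query, k=3, weighted=False):
    """Predict y at x_query by (optionally inverse-distance-weighted) KNN."""
    dists = np.linalg.norm(X_train - x_query, axis=1)   # distance to every stored example
    nearest = np.argsort(dists)[:k]                     # indices of the k closest
    if not weighted:
        return y_train[nearest].mean()                  # plain KNN: average the labels
    w = 1.0 / (dists[nearest] + 1e-12)                  # closer neighbors weigh more
    return np.dot(w, y_train[nearest]) / w.sum()

# toy data: y = 2*x1 + noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(50, 2))
y = 2 * X[:, 0] + rng.normal(0, 0.05, size=50)
print(knn_predict(X, y, np.array([0.5, 0.5]), k=5))
print(knn_predict(X, y, np.array([0.5, 0.5]), k=5, weighted=True))
```

Note the "lazy" character: nothing happens at training time, all the work is done inside `knn_predict` when a query arrives.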

## Ch. 3 - Learning requires a method (21-31)
Learning is:
- unifying different cases by discovering the underlying explanatory laws
- generalization, the capability of explaining new cases
Before applying ML:
- **feature selection**: meaningful subset of original feature set
- **feature extraction**: combination (normalization/product/square, etc) of features
Two classes of problems:
- **classification**: Y is a set of labels
- **regression**: Y is a numerical value
Sometimes classification between two labels is transformed into regression on **probability** or **fuzzy membership**
### Minimization and generalization
Optimizing a set of weights W (usually one for each feature)
The free parameters are fixed by demanding that the learned model works correctly on the examples in the training set.
**Supervised learning => minimizing a specific error function**
### Bias-Variance dilemma
- too few parameters => inaccurate, large bias, lack flexibility
- too many parameters => inaccurate, large variance (same as in numerical instability), too sensitive to example details, *possible overfitting (editor's note)*
- finding **right balance** is key

### Two types of classification
- **generative** methods: why/how are these labels generated from the examples? *learn how to generate other examples (editor's note)*
- **discriminative** algorithms: given a new point, find its label
Examples of the latter: perceptrons, Support Vector Machines, ...
**The second option implies that accurate classifiers can be built without knowing or building a detailed model of the generation process**
### Learn :books:, Validate :microscope:, Test :boom:
1. split data
2. check performance using **RMS** (root mean square) error on the validation set
**Overfitting**:
Is the **equivalent of memorizing** examples, without learning
- small training set
- trained too much => huge difference between training RMS and validation RMS, or simply training RMS is close to zero
- too many parameters
**Cross validation**
Repeat training and validation using different subsets of data, to check the model's performance
### K-fold cross validation
1. split data in K subsets
2. train on K-1 subsets, validate using 1
3. repeat step 2 for each of the K subsets as validation
**Stratified K-fold cross validation**
In stratification, a separate slice is kept for each class to **maintain the relative proportion of cases of the classes** (see the sketch after the steps below)
1. split data in each label subset
2. split each label-sub in K subsets
3. train on the stratified union of K-1 and validate on stratified remaining 1
4. repeat step 3 for each of the K subsets as validation
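A minimal sketch of both plain and stratified K-fold cross-validation using scikit-learn (the library, classifier, and dataset are placeholder choices, not from the notes):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
model = KNeighborsClassifier(n_neighbors=5)

for splitter in (KFold(n_splits=5, shuffle=True, random_state=0),
                 StratifiedKFold(n_splits=5, shuffle=True, random_state=0)):
    scores = []
    for train_idx, val_idx in splitter.split(X, y):         # stratified split also looks at y
        model.fit(X[train_idx], y[train_idx])               # train on K-1 folds
        scores.append(model.score(X[val_idx], y[val_idx]))  # validate on the held-out fold
    print(type(splitter).__name__, np.mean(scores))
```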

### Test set
**Used only once for final performance**
In the standard case of a single train-validation cycle, the terms validation and testing are often used as synonyms.
### Errors :red_circle:

- Accuracy: fraction of all predictions that are correct
- Precision: fraction of the items predicted as a given class that truly belong to it
- Recall: fraction of the items truly belonging to a given class that the model finds
### Confusion matrix
For two classes, rows are the actual classes and columns the predicted ones:

| actual \ predicted | positive | negative |
| --- | --- | --- |
| **positive** | TP | FN |
| **negative** | FP | TN |
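A minimal sketch computing the matrix and the error measures above from two label vectors (the example labels are made up):

```python
import numpy as np

y_true = np.array([1, 1, 1, 0, 0, 0, 1, 0])   # hypothetical actual labels
y_pred = np.array([1, 0, 1, 0, 1, 0, 1, 0])   # hypothetical predicted labels

tp = np.sum((y_true == 1) & (y_pred == 1))
fn = np.sum((y_true == 1) & (y_pred == 0))
fp = np.sum((y_true == 0) & (y_pred == 1))
tn = np.sum((y_true == 0) & (y_pred == 0))

print("confusion matrix:\n", np.array([[tp, fn], [fp, tn]]))
print("accuracy :", (tp + tn) / len(y_true))
print("precision:", tp / (tp + fp))
print("recall   :", tp / (tp + fn))
```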
### GIST

## Ch. 4 - Linear models (35-45)
### Finding a linear dependence
Given W a vector of weights, try to optimize W so that
$$f(x) = w^T x = \sum_{i=1}^d w_i x_i$$
approximates the example data as best as possible, i.e.:
**Minimize the model error (least squares approximation)**
$$\text{ModelError}(w) = \sum_{p=1}^{\ell}\left(y_p - w^T x_p\right)^2$$
Usually data is too big for zero-error optimization.
### Converting non-linear to linear
Simply add more dimensions to the input / modify existing ones:
EG: **for squared relations, use the squares as extra inputs**
**In general, f(X) becomes
$$f(x) = w^T \varphi(x)$$
where $\varphi(x)$ is a function that transforms the original non-linear inputs so that the dependence becomes linear.**
### What about classification?
How to define the error function?
In the case of **2 labels**, rebuild the label function so that the labels become $y_p = ±1$ and
$$h(x) = \operatorname{sign}(w^T x)$$
Basically **dividing the classes using a hyperplane**
*With more labels, use **logistic regression** (editor's note)*
### How precise does the hyperplane need to be?
(Still with **only 2 labels**)
By **forcing the outputs to be far** (+1 and -1) and **penalizing squared errors, the fragility caused by accepting any separating hyperplane disappears.**
*(Figure: two separating lines; the margin of line A is larger than that of line B.)*
A large margin increases the probability that new examples will fall in the right side of the separator line.
### Why linear models?
The deep reason why linear models are so popular is the ***smoothness*** underlying many if not most of the physical phenomena:
**“Nature does not make jumps”**.
**Every smooth (differentiable) function can be approximated around an operating point by the first-order term of its Taylor expansion, which is linear**
### Minimizing the ModelError function
In an error-free ideal world, simply solve the linear system
$$w^T x_p = y_p,$$ one equation for each example.
In the real world, we need to **generalize the solution of a system of linear equations by allowing errors**
1. **Using the pseudo-inverse**
It's a **"one shot"** technique:
$$w^* = (X^T X)^{-1} X^T y$$
where $X$ is the matrix whose rows are the example inputs and $y$ is the vector of their outputs.
Basically exploits the fact that to find the minimum of a squared function (ModelError) we need to solve a linear system.
2. **Gradient descent**
In some cases, **if the number of examples is huge, an iterative technique based on gradient descent can be preferable**
Start from initial weights and keep moving them by executing small steps along the direction of the negative gradient
**Similar to our brain's iterative learning**
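A minimal sketch contrasting the two techniques on the same toy data (the data, step size, and iteration count are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                  # 100 examples, 3 features
w_true = np.array([1.5, -2.0, 0.5])
y = X @ w_true + rng.normal(0, 0.1, size=100)  # noisy linear data

# 1. "one shot": pseudo-inverse solution (here via numpy's least-squares solver)
w_pinv, *_ = np.linalg.lstsq(X, y, rcond=None)

# 2. iterative: gradient descent on ModelError(w) = sum_p (y_p - w.x_p)^2
w = np.zeros(3)
lr = 0.001                                     # learning rate
for _ in range(500):
    grad = -2 * X.T @ (y - X @ w)              # gradient of the squared error
    w -= lr * grad                             # small step along the negative gradient

print(w_pinv, w, sep="\n")                     # the two should nearly coincide
```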
### Physical analogy for linear regression
Attach each example point to the line with a spring: minimizing the squared error is minimizing this system's elastic potential energy, so the best-fit line is the minimum-energy configuration.
### Numerical Instability and Ridge Regression
When the training examples are almost in line, the resulting line can shift a lot because of numerical instability.
*(Figure: a stable fit vs. an unstable one.)*
**Stability means that small changes in training data lead to small changes in the results.**
**RIDGE REGRESSION**
Add a **regularization term ( $\lambda$ )** to the error function and minimize
$$\text{ModelError}(w) + \lambda \|w\|^2$$
so that the one-shot solution becomes
$$w^* = (X^T X + \lambda I)^{-1} X^T y$$
where $\lambda$ is a (small, positive) number and $I$ is the identity matrix.
This is the best way to cure **ill-posed problems**.
Ill-posed problems have **no unique solution because of lacking data / data not spread enough**
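A minimal numpy sketch of the ridge solution next to the plain one, on nearly collinear data (the data and $\lambda$ are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
x2 = x1 + rng.normal(0, 0.01, size=50)        # almost collinear with x1: ill-posed
X = np.column_stack([x1, x2])
y = x1 + x2 + rng.normal(0, 0.1, size=50)

lam = 0.1
w_plain = np.linalg.solve(X.T @ X, X.T @ y)                    # unstable: huge weights
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)  # regularized: tame weights
print("plain:", w_plain)
print("ridge:", w_ridge)
```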
## Ch. 5 - Mastering generalized linear least-squares (47-51)
### Polynomial fitting
$$f(x) = \sum_{k=0}^{M} c_k x^k$$
Optimize using the **chi-squared** error function
$$\chi^2 = \sum_{p=1}^{\ell}\left(\frac{y_p - f(x_p)}{\sigma_p}\right)^2$$
where $\sigma_p$ is the measurement error on example $p$.
What degree, how many parameters?
*(Figure: too many parameters (degree = number of points) vs. degree 3; on the left, **overfitting**.)*
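A minimal sketch of the degree trade-off with numpy's polynomial fitter (the toy data and the two degrees are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.1, size=10)   # noisy samples of a smooth curve

x_val = np.linspace(0, 1, 100)                 # held-out points to expose overfitting
y_val = np.sin(2 * np.pi * x_val)

for degree in (3, 9):                          # 9 = number of points - 1: interpolates noise
    coeffs = np.polyfit(x, y, degree)          # least-squares fit (chi-squared, unit errors)
    rms = np.sqrt(np.mean((np.polyval(coeffs, x_val) - y_val) ** 2))
    print(f"degree {degree}: validation RMS = {rms:.3f}")
```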
## Ch. 6 - Rules, decision trees, and forests (59-67)
**Rules:** way to condense nuggets of knowledge in a way amenable to human understanding.
- characterised by `precondition` and `conclusion`.
If the set of rules gets large, **complexities** can appear, like rules leading to **contradictory classifications**.
**Automated generation of rules is the key:** `decision trees`.
### Decision trees :evergreen_tree:
For a given input object, a decision tree **estimates an unknown property** of the object by asking successive questions about its known properties.
- each answer determines the next question
- the sequence of answers determines a **path**
- can be built in a **greedy manner**: (`most informative questions first`)
- aims at getting the children subsets from each question **as pure as possible**
- **faster purification** of the set **means faster decision**

**Decision reliability**
- **Pure subset:** when reaching the leaf, the only possible outcome is given
- **Impure subset:** in some cases one stops with slightly impure subsets and outputs a probability for the different classes when a case reaches that leaf node
**Measuring purity** of a subset: `information gain` (or mutual information) vs `Gini impurity`.
### Shannon entropy
Measures the statistical uncertainty in the obtained class: $H(Y) = -\sum_{y∈Y} \Pr(y)\, \log \Pr(y)$

### Information gain
With this method the impurity of a set is **measured by the entropy** of the class probability distribution.
- let $S$ be the current set of examples, and let $S = S_{YES} ∪ S_{NO}$ be the splitting obtained after answering a question about an attribute:
$$IG = H(S) − \frac{|S_{YES}|}{|S|} H(S_{YES}) − \frac{|S_{NO}|}{|S|} H(S_{NO})$$
An optimal question maximizes IG, that is, reduces the average entropy as much as possible.
The **ideal question** leaves no indecision:
- all elements in $S_{YES}$ are cases of one class, while elements of $S_{NO}$ belong to another class, therefore the entropy of the two resulting subsets is zero
**NB:** the information gain is useful but not perfect (possible over-fitting,...).
### Gini impurity
Measures how often a randomly chosen element from the set would be incorrectly labeled if it were randomly labeled according to the distribution of labels in the subset.
Produces **zero errors if the set is pure**, and a small error rate if a single class gets the largest share of the set.
- given $m$ classes and $f_i$ the fraction of items labeled with value $i$ in the set:
$$GI(f) = \sum_{i=1}^mf_i(1−f_i)= \sum^m_{i=1}(f_i − f_i^2) = \sum^m_{i=1}f_i − \sum^m_{i=1}f_i^2 = 1 −\sum^m_{i=1}f_i^2$$
- it is computed as the expected value of the **mistake probability**
- i.e., the sum over classes $i$ of (probability of choosing $i$) · (probability of a wrong label for $i$); see the sketch below
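A minimal sketch computing entropy, information gain for a split, and Gini impurity from class-label arrays (the example split is made up):

```python
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))             # H(Y) = -sum Pr(y) log Pr(y)

def gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)                # GI = 1 - sum f_i^2

S = np.array([1, 1, 1, 1, 0, 0, 0, 0])         # current set of example labels
S_yes = np.array([1, 1, 1, 0])                 # split after answering a question
S_no = np.array([1, 0, 0, 0])

ig = entropy(S) - (len(S_yes) / len(S)) * entropy(S_yes) \
               - (len(S_no) / len(S)) * entropy(S_no)
print(f"IG = {ig:.3f}, Gini(S) = {gini(S):.3f}")
```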
### Posing questions
In general, questions should have a **binary output**.
- for a **categorical variable**, the test can be based on the variable having (or not) a subset of the possible values
- for **real-valued variables**, the tests can be on a `single variable` or on `simple combination` (linear) of a subset of variables
### Missing values
> In real-world data, **missing values** are abundant like snowflakes in winter - LIONBook

As a final remark, **do not confuse** the **construction** of the decision trees (using the labeled examples, the purity measures, the choice of appropriate questions) with the **usage** of a built tree.
### Decision forests
Use an ensemble of trees to make decisions in a democratic way.
- train different trees on **different sets** of examples with **bootstrap samples**
- **bagging (**`bootstrap aggregation`**):** the application of bootstrapping to the creation of different sample sets
- allow for **randomized** choices during training
**Democracy:** an isolated tree will be rather weak, however, collecting all outputs and `averaging` (regression) or `voting` (classification) will provide **reasonable answers**.
>A single tree casts a small shadow, hundreds of them can cool even the most torrid machine learning applications.
**Features:**
- like all trees they naturally **handle problems** with more than two classes and with missing attributes
- they provide a `probabilistic output`, with `probabilities` and `error bars`
- **good generalization** to previously unseen data without risking over-training
- **fast and efficient** thanks to their `parallelism` and `reduced set of tests` per data point
The **abundance of memory** and computing power permits training very large numbers of different trees.
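A minimal sketch using scikit-learn's forest implementation (the library, dataset, and parameters are placeholder choices, not from the notes):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# each tree is trained on a bootstrap sample with randomized feature choices (bagging)
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_tr, y_tr)

print("accuracy:", forest.score(X_te, y_te))
print("class probabilities (tree vote):", forest.predict_proba(X_te[:1]))
```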
## Ch. 7 - Ranking and selecting features (69-77)
**Feature extraction:** starts from an initial set of **raw data** and builds derived values (features) intended to be informative and non-redundant, facilitating the subsequent learning and generalization steps
- is a **dimensionality reduction** process
- while still accurately and completely describing the original data set
### Feature selection
**Feature selection (** or attribute selection, or variable subset selection **):** the process of selecting a subset of **relevant features** to be used in model construction.
- one must be sure that the input data have sufficient information to predict the outputs, **without excessive redundancy**
**Selecting a `small number of informative features` has advantages:**
1. ➖ dimensionality
2. ➖ memory usage
3. ➕ generalization
4. ➕ human understanding
**Methods for feature selection:**
1. **designer intuition** and **existing knowledge**
2. estimate the **relevance** or **discrimination power** of the individual features
3. top-down / bottom-up methods
- **top-down (**`truncation`**):** one starts with the complete set of features and progressively eliminates features while searching for the optimal performance point
- **bottom-up (**`greedy inclusion`**):** one gradually adds the ranked features in the order of their individual discrimination power and stops when the error rate stops decreasing
4. other methods
- **wrapper methods:** are built “around” a specific predictive model (measure error rate)
- computationally intensive
- usually provide the best performing feature set for the specific model
- **filter methods:** use a `proxy measure` instead of the error rate to score a feature subset
- common proxy measures: **mutual Information**, **correlation coefficient**
- **NB:** measuring individual features in isolation will discard mutual relationships and therefore the result will be approximated
- **embedded methods:** perform feature selection as an integral part of the model construction process
- **LASSO:** for constructing a linear model, `penalizes the regression coefficients`, shrinking many of them to zero so that the corresponding features can be eliminated
- **Recursive Feature Elimination:** commonly used with Support Vector Machines to repeatedly construct a model and remove features with low weights
### Proxy measures(?)
It is difficult to **rank individual features** without considering the specific `modeling method` and their `mutual relationships`.
#### Correlation coefficient 🔢▪🔢
**`Pearson` product-moment correlation coefficient:** the most widely used measure of `linear relationship` between **numeric variables**.
- is obtained by dividing the `covariance` of the two variables by the product of their `standard deviations`
$$ ρ_{X_i,Y} = \frac{cov[X_i , Y ]}{σ_{X_i}σ_Y} = \frac{E[(X_i − µ_{X_i}) (Y − µ_Y )]}{σ_{X_i}σ_Y} $$
- $ρ_{X_i,Y}$ is the `correlation coefficient` between the $i$-th input feature $X_i$ and the classifier’s outcome $Y$
- $µ_{X_i}$ and $µ_Y$ are the `expected values`
- $σ_{X_i}$ and $σ_Y$ are `standard deviations`
- the division by the standard deviations makes the correlation coefficient **independent of units**
- the correlation value varies **from `−1` to `1`** (perfectly decreasing to perfectly increasing linear relationship)
**NB:** statistically independent variables have **zero correlation**, but zero correlation does not imply that the variables are independent.
- correlation coefficient detects **only `linear` dependencies** between two variables
**Trust** the correlation coefficient only if you have reasons to suspect linear relationships.
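A minimal sketch ranking features by their Pearson coefficient with `np.corrcoef` (the toy data are made up; feature 2 is deliberately irrelevant):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 3))
y = 2 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 0.1, size=n)   # feature 2 plays no role

for i in range(X.shape[1]):
    rho = np.corrcoef(X[:, i], y)[0, 1]    # cov(X_i, Y) / (sigma_Xi * sigma_Y)
    print(f"feature {i}: rho = {rho:+.3f}")
```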
#### Correlation ratio 🔢▪🔠
Measures a `relationship` between a **numeric input** and a **categorical output**.
- the correlation ratio between the $i$-th feature and the outcome
$$η^2_{X_i,Y} = \frac{\sum\limits_{y∈Y}\ell_y(\bar{x}^{(i)}_y - \bar{x}^{(i)})^2}{\sum\limits_{y∈Y} \sum\limits^{\ell_y}_{j=1}(x^{(i)}_{jy} − \bar{x}^{(i)})^2}$$
- $\color{black}{\ell_y}$ is the number of times that outcome $y∈Y$ appears, so that one can **partition the sample input vectors** by their output
- $\color{black}{x^{(i)}_{jy}}$ is the $i$-th component (feature) of the $j$-th sample vector among the $\ell_y$ samples having outcome $y$
- $\color{black}{\bar{x}^{(i)}_y =\frac{1}{\ell_y}\sum\limits^{\ell_y}_{j=1}x^{(i)}_{jy}, \quad ∀y ∈ Y}$ is the average of the $i$-th feature within each output class
- $\color{black}{\bar{x}^{(i)} = \frac{1}{\ell}\sum\limits_{y∈Y}\ell_y\bar{x}^{(i)}_y}$ is the overall average
The correlation ratio **can capture nonlinear dependencies** but, if the relationship between values of the $i$-th feature component and values of the outcome is `linear`, then $η^2_{X_i,Y} = ρ^2_{X_i,Y}$.
#### Chi-square test 🔠▪🔠
Is a **statistical hypothesis test:** a method of making statistical decisions by using experimental data.
Use **chi-square** to identify **possible dependencies** between inputs and output.
Measures a `relationship` between two **categorical features**:
$$\chi^2 = \sum_{c,t}\frac{[\text{count}_{c,t}−n·\text{Pr}(\text{class} = c) · \text{Pr}(\text{term} = t)]^2}{n · \text{Pr}(\text{class} = c) · \text{Pr}(\text{term} = t)}$$
- $\text{count}_{c,t}$ is the number of occurrences of the value $t$ given the class $c$
- The larger the $\chi^2$ value, the **higher the belief** that the two events are **dependent**, and that therefore the feature is significant to predict the output. Each term in the sum has the familiar form $\frac{(\text{observed} - \text{expected})^2}{\text{expected}}$.
For **feature ranking** no additional calculation is needed and one is **satisfied with the crude value**.
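A minimal sketch of the statistic for a categorical feature vs. a categorical class, using scipy as a placeholder implementation (the contingency counts are made up):

```python
import numpy as np
from scipy.stats import chi2_contingency

# rows: values t of the feature, columns: classes c; entries are count_{c,t}
counts = np.array([[30, 10],
                   [ 5, 35]])

# correction=False: no continuity correction, matching the raw formula above
chi2, p_value, dof, expected = chi2_contingency(counts, correction=False)
print(f"chi2 = {chi2:.2f}")   # for feature ranking, the crude value is enough
```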
#### MIFS (Mutual Information for Feature Selection)
**Mutual information:** the amount by which the uncertainty in the output decreases after the input is known.
In classification, Mutual Information is related to upper and lower bounds on the optimal **Bayes error rate**.
Use mutual information **to estimate arbitrary dependencies** between qualitative or quantitative features.
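A minimal sketch of mutual-information feature ranking with scikit-learn (a placeholder choice, not referenced in the notes):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import mutual_info_classif

X, y = load_iris(return_X_y=True)
mi = mutual_info_classif(X, y, random_state=0)   # estimated MI between each feature and y
for i, score in enumerate(mi):
    print(f"feature {i}: MI = {score:.3f}")
```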
## Ch. 9 - Specific nonlinear models (87-91)
### Logistic regression
Used for **predicting the probability** of the outcome of a **categorical variable** from a set of recorded past events.
- it really is a **technique for classification**, not regression

A `logistic curve` is a common **sigmoid function**: $f(x) = \frac{1}{(1 + e^{−x})}$

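A minimal sketch: the sigmoid maps the linear score $w^T x$ to a probability, and scikit-learn (a placeholder choice, not from the notes) fits the weights:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))   # the logistic curve f(x)

X, y = load_iris(return_X_y=True)
X, y = X[y < 2], y[y < 2]                      # keep 2 classes: binary classification

clf = LogisticRegression().fit(X, y)
# predicted probability = sigmoid of the linear score w.x + b
score = X[:1] @ clf.coef_.ravel() + clf.intercept_[0]
print(sigmoid(score), clf.predict_proba(X[:1])[0, 1])   # the two should match
```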
### Locally-weighted regression (LWR)
Is a `lazy` **memory-based** technique, meaning that all points and evaluations are stored and a specific model is built on-demand only when a specified query point demands an output.
Is a **non-parametric** variant of linear regression (it requires the entire training set to make a prediction)

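A minimal numpy sketch of LWR: for each query, examples are weighted by a Gaussian kernel centered on the query, and a local weighted least-squares fit is solved on demand (the kernel width `tau` and the toy data are made up):

```python
import numpy as np

def lwr_predict(X, y, x_query, tau=0.3):
    """Fit a linear model on demand, weighting examples near the query."""
    Xb = np.column_stack([np.ones(len(X)), X])          # add a bias column
    xq = np.concatenate([[1.0], np.atleast_1d(x_query)])
    d2 = np.sum((X - x_query) ** 2, axis=1)
    w = np.exp(-d2 / (2 * tau ** 2))                    # Gaussian weights around the query
    W = np.diag(w)
    beta = np.linalg.solve(Xb.T @ W @ Xb, Xb.T @ W @ y) # weighted least squares
    return xq @ beta

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(100, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=100)      # nonlinear target
print(lwr_predict(X, y, np.array([1.0])), np.sin(1.0))  # local fit vs. truth
```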
## Ch. 10 - Neural networks: multi-layer perceptrons (95-101)
**Neural networks do not separate memory and processing but operate** via the flow of signals through the network connections.
This chapter is focused on **feed-forward multilayer perceptron neural networks**, without feedback loops.
**Neuron:** is modeled as a simple computing unit, a scalar product $w\cdot x$ (`pattern matching`) followed by a sigmoidal (`logistic`) function.
The **”squashing” functions** are essential to introduce `nonlinearities` in the system.
### Multilayer Perceptrons (MLP)
Is composed of a large number of **highly interconnected units** (neurons) working in parallel to solve a specific problem and organized in **layers** with a feed-forward information flow (no loops).

- the signals flow sequentially through the different layers from the input to the output layer
- **hidden layers:** layers which are not visible at the input or at the output
MLPs are **universal approximators**: an MLP with one hidden layer can **approximate any smooth function** to any desired accuracy, subject to a sufficient number of hidden nodes.
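A minimal sketch of one forward pass through a one-hidden-layer MLP (the layer sizes and random weights are arbitrary illustrations):

```python
import numpy as np

sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))     # the "squashing" nonlinearity

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))   # input (3 features) -> hidden layer (4 units)
W2 = rng.normal(size=(1, 4))   # hidden layer -> single output

x = np.array([0.2, -1.0, 0.5])
hidden = sigmoid(W1 @ x)       # each unit: scalar product w.x, then squashing
output = sigmoid(W2 @ hidden)  # signals flow forward layer by layer, no loops
print(output)
```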
### Learning via backpropagation
**Backpropagation of the error:** consists in `gradient calculation` and small step in the direction of the negative gradient.
Learning **optimal MLPs** from examples:
1. take a **`guiding` function** to be optimized
- e.g.: `sum-of-squared errors` on the training examples
2. Use **gradient descent** with respect to the weights to find better and better weights
3. **Stop** the descent when results on a validation set are best
- if **over-learning**, generalization can worsen.
>Learning is not an end, but a means for generalizing.
Sum-of-squared-differences **energy** function defined as:
$$E(w) = \frac12\sum^P_{p=1}E_p =\frac12\sum^P_{p=1}(t_p − o_p(w))^2$$
- $t_p$ and $o_p$ are the target and the current **output values** for pattern $p$, respectively
### Gradient-based backpropagation techniques
#### Batch backpropagation
**Standard version:** the weights at the next iteration $k + 1$ are updated as follows:
$$w_{k+1} = w_k − \epsilon\ g_k$$
- $g_k = ∇E(w_k)$ is the **gradient**
- $\epsilon$ is the **learning rate** (constant in the standard version)
- let the **initial weights** be randomly distributed
**Adaptive learning rate version (**`bold driver`**):**
- **problem:** How do we select the learning rate ε ?
- small $\epsilon \implies$`learning time` increases
- big $\epsilon \implies$`energy` oscillates wildly
- **heuristic proposal:**
$$\epsilon(t) = \begin{cases} \rho\ \epsilon(t - 1) & \text{if the energy decreased} \\ \sigma^l\ \epsilon(t - 1) & \text{if the energy increased (step canceled)} \end{cases}$$
- $\rho$ is close to one ($\rho$ = 1.1) in order to avoid frequent ``“accidents”``
- $\sigma$ is chosen to provide a `rapid reduction` (σ = 0.5)
- $l$ is the minimum integer such that the reduced rate $σ^l\epsilon(t − 1)$ succeeds in diminishing the energy
**NB:** first the contributions $∇E_p(w_k)$ are summed, then the small step is taken. (different from on-line BP)
#### On-line backpropagation (stochastic gradient descent)
The stochastic on-line backpropagation update is given by:
$$w_{k+1} = w_k − \epsilon∇E_p(w_k)$$
- the pattern $p$ is chosen **randomly** from the training set at each iteration, and a small step is taken along a **single** negative $∇E_p(w_k)$ **immediately** after calculating it
**Pro:** using `partial gradients` is **faster**.
**Con:** less guarantee of convergence.
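A minimal numpy sketch contrasting the batch and on-line updates on a linear model with the sum-of-squared-errors energy (the data and learning rates are made up; a full MLP would backpropagate through its layers in the same spirit):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
w_true = np.array([1.0, -2.0, 0.5])
t = X @ w_true                                  # targets t_p

def grad_p(w, p):                               # gradient of E_p = (t_p - o_p)^2 / 2
    return -(t[p] - X[p] @ w) * X[p]

# batch: sum all contributions, then take one small step
w_batch = np.zeros(3)
for k in range(200):
    g = sum(grad_p(w_batch, p) for p in range(len(X)))
    w_batch -= 0.001 * g

# on-line (stochastic): step immediately after each single partial gradient
w_sgd = np.zeros(3)
for k in range(2000):
    p = rng.integers(len(X))                    # pattern chosen randomly
    w_sgd -= 0.05 * grad_p(w_sgd, p)

print(w_batch, w_sgd, sep="\n")                 # both approach w_true
```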
☀ END ☀