---
tags: Master
---
# Machine Learning - SUNTO
[TOC]
## 1. Introduction [[slides](http://disi.unitn.it/~passerini/teaching/2019-2020/MachineLearning/slides/01_machine_learning/talk.pdf)] 🖖
**What is Machine Learning:**
> A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.
>
> T. M. Mitchell
`Learning problem:` task, performance measure, training experience.
### Designing a machine learning system
1. `Formalize the learning task:` choose an appropriate **performance measure**
2. `Collect data:` often the most **cumbersome** part
3. `Extract features:` to provide relevant input to the learning system
- **too few** can lower the performance because of missing relevant information
- **too many** can lower the performance because of the increased probability of having noisy and useless features
4. `Choose class of learning models:` bias-variance trade-off
5. `Train model:` improve generalization, **avoid overfitting**
6. `Evaluate model:` according to its ability to **generalize** to unseen examples
- can provide insights into the model weaknesses
- can imply comparing different models/learners
### Choice of Learning Algorithms
**information available:**
- **Full knowledge of probability distributions of data** $\Longrightarrow$
- `Bayesian decision theory` (Ch.7)
- **Form of probabilities known, parameters unknown** $\Longrightarrow$
- `Parameter estimation` from training data (Ch.8)
- **Form of probabilities unknown, training examples available** $\Longrightarrow$
- `discriminative methods:` directly learn a function predicting the desired output given the input, without modeling the input distribution
- `generative methods:` model the joint probability distribution of inputs and outputs
- Form of probabilities unknown, training examples unavailable (**only inputs**) $\Longrightarrow$
- `unsupervised methods:` clustering, dimensionality reduction (e.g. PCA), novelty detection
## 2. Decision Trees [[slides](http://disi.unitn.it/~passerini/teaching/2019-2020/MachineLearning/slides/02a_decision_trees/talk.pdf)] 🌲
- the **leaf** contains the label to be assigned to instances reaching it
- **logical formula** represented by the tree: the disjunction of all paths (conjunctions)

**Appropriate problems for decision trees:**
- binary or multiclass **classification tasks** (extensions to regressions also exist)
- instances represented as `attribute-value pairs`
- different explanations for the concept are possible (`disjunction`)
- some instances have `missing attributes`
- need for an `interpretable explanation` of the output
**Greedy top-down strategy for learning:**
$\forall$ node
1. choose `best attribute` to be evaluated
2. $\forall$ attribute value add a child
3. split node training set into children according to value of chosen attribute
`stop splitting` a node:
- if it contains examples from a single class (`pure`)
- if there are no more attributes to test
### Choosing the best attribute
#### Entropy
A measure of the amount of information contained in a collection of `instances` $S$ whose labels can take $c$ `possible values`.
$$
H(S) = -\sum\limits_{i=1}^{c} p_i \log_2(p_i)
$$
- in our case instances are `training examples` and values $i$ are `class labels`
- the entropy of a set of labeled examples **measures its label inhomogeneity**
#### Information gain
Expected reduction in entropy obtained by partitioning a `set` $S$ according to the value of a certain `attribute` $A$.
$$
\text{IG}(S, A) = H(S) - \sum\limits_{v\,\in\,Values(A)} \frac{|S_v|}{|S|} H(S_v)
$$
- an attribute with high information gain tends to produce homogeneous groups in terms of labels, thus favoring their classification
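A minimal Python sketch of the two formulas above; the toy dataset and the `outlook` attribute name are made up for illustration:
```python
# Entropy and information gain for a categorical attribute (toy sketch).
from collections import Counter
import math

def entropy(labels):
    """H(S) = -sum_i p_i * log2(p_i), computed over the class labels in S."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(examples, labels, attribute):
    """IG(S, A) = H(S) - sum_v |S_v|/|S| * H(S_v)."""
    n = len(labels)
    remainder = 0.0
    for value in set(x[attribute] for x in examples):
        subset = [y for x, y in zip(examples, labels) if x[attribute] == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

# toy usage: 'outlook' is a made-up attribute
X = [{"outlook": "sunny"}, {"outlook": "sunny"}, {"outlook": "rain"}, {"outlook": "rain"}]
y = ["no", "no", "yes", "yes"]
print(information_gain(X, y, "outlook"))  # 1.0: the attribute splits the classes perfectly
```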
### Issues in decision tree learning
- **`Overfitting`:** requiring pure leaves can imply very **complex trees**; pruning can be used, labeling the pruned node with the majority class.
- `pre-pruning:` decide pruning while learning the tree
- `post-pruning:` learn the full tree first
- **`Continuous valued attributes:`** need to be **discretized** in order to be used in internal nodes tests
- using the threshold producing the **highest information gain**
- **`Attributes with missing values`:** when attribute $A$ is to be tested at node $n$ and example $x$ has no value for it
- `simple solution:` assign to $x$ the **most common** attribute value among examples in $n$
- `complex solution:` propagate $x$ to each of the children of $n$, with a fractional value equal to the proportion of examples with the corresponding attribute value
### Random forests
**Training:**
1. given a training set of $N$ examples, **sample** $N$ examples **`with replacement`**
(i.e. the same example can be selected multiple times)
2. train a decision tree on the sample, selecting at each node $m$ `features at random among which to choose the best one`
3. repeat steps $1$ and $2$ $M$ times in order to generate a forest of $M$ trees
**Testing:**
1. test the example with each tree in the forest
2. return the `majority class` among the predictions
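One possible way to run the procedure above with an off-the-shelf implementation (scikit-learn); the dataset is just a toy example:
```python
# Random forest: M trees, m random features per split, bootstrap sampling.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,     # M trees
    max_features="sqrt",  # m features sampled at random at each node
    bootstrap=True,       # sample N examples with replacement
    random_state=0,
)
forest.fit(X_tr, y_tr)
print(forest.score(X_te, y_te))  # accuracy of the aggregated (voting) prediction
```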
## 3. K-nearest neighbors [[slides](http://disi.unitn.it/~passerini/teaching/2019-2020/MachineLearning/slides/02b_nearest_neighbour/talk.pdf)] 🎯
…
## 4. Linear algebra [[slides](http://disi.unitn.it/~passerini/teaching/2019-2020/MachineLearning/slides/03_linear_algebra/talk.pdf)] 📈
- Some linear algebra recap…
- PCA (p. 30)
## 5. Probability theory [[slides](http://disi.unitn.it/~passerini/teaching/2019-2020/MachineLearning/slides/04_probability_theory/talk.pdf)] 🎲
- Some probability theory recap… mainly how to play with some rules
- variables, distributions, theorems (Bayes), laws (central limit), …
## 6. Evaluation [[slides](http://disi.unitn.it/~passerini/teaching/2019-2020/MachineLearning/slides/19_evaluation/talk.pdf)] 🔬
**Performance evaluation is needed for:**
- `tuning hyperparameters` of learning method (e.g. type of kernel and parameters, learning rate of perceptron)
- evaluating quality of learned predictor
- computing statistical significance of difference between learning algorithms
The **training loss function** measures the cost paid for predicting $f(x)$ when the true output is $y$.
### Performance measures
- **Binary classification:**
- `ACC,Pre,Rec,F1` all measure performance of a classifier for a specific threshold
- `Precision-recall curve:` It is possible to change the threshold if interested in maximizing a specific performance (e.g. recall)
- **Multiclass classification:**
- more generic version of the previous one
- `ACC,Pre,Rec,F1` carry over to a **per-class measure** considering as negatives examples from other classes
- **Regression:** e.g. `root mean squared error` (RMSE)
- **Hold-out procedure:** computing the performance measure on the training set would be biased, so we need:
- `Validation set:` when used to estimate **performance of different algorithmic settings** (i.e. hyperparameters)
- `Test set:` when used to estimate **final performance of selected model**
- **K-fold cross validation**
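A small sketch of the idea with scikit-learn: a 5-fold cross validation of a placeholder model (`SVC` here is just an example classifier):
```python
# 5-fold cross validation: average performance over 5 train/test splits.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
scores = cross_val_score(SVC(kernel="rbf"), X, y, cv=5)  # one score per fold
print(scores.mean(), scores.std())  # estimated performance and its variability
```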
### Comparing learning algorithms
- **Hypothesis testing:** allows testing the statistical significance of a hypothesis

- **T-test:** the test statistic is given by the **standardized** (also called `studentized`) **mean**:
$$
t = \frac{\bar{X} - \mu_0}{\sqrt{\widetilde{Var}[\bar{X}]}}
$$
where $\widetilde{Var}[\bar{X}]$ is the `approximated variance` (using the unbiased sample variance)
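A hedged sketch of how such a test could be run in practice with `scipy`: a paired t-test on the per-fold scores of two algorithms (the score values below are invented):
```python
# Paired t-test on the per-fold scores of two learning algorithms.
from scipy import stats

scores_a = [0.81, 0.79, 0.84, 0.80, 0.83]  # algorithm A, one score per fold
scores_b = [0.78, 0.75, 0.80, 0.77, 0.79]  # algorithm B, same folds

t, p = stats.ttest_rel(scores_a, scores_b)  # t statistic of the score differences
print(t, p)  # small p-value -> the difference is statistically significant
```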
## 7. Bayesian decision theory [[slides](http://disi.unitn.it/~passerini/teaching/2019-2020/MachineLearning/slides/05_bayesian_decision/talk.pdf)] 📚
- 💥Bayesian decision theory allows taking **optimal decisions in a fully probabilistic setting**
- 💥It assumes **all relevant probabilities are known**
- it allows `providing upper bounds on achievable errors` and evaluating classifiers accordingly
- Bayesian reasoning **can be generalized** to cases when the probabilistic structure is not entirely known (see “Parameter estimation” chapter)

💠**Bayes rule** allows us to `predict the class given the input`, in other words, Bayes rule allows to compute the posterior probability as:
$$
posterior = \dfrac{likelihood × prior}{evidence}
$$

- **posterior:** $P(y|x)$ is the probability that class is $y$ given that $x$ was observed
- **likelihood:** $P(x|y)$ is the probability of observing $x$ given that its class is $y$
- **prior:** $P(y)$ is the prior probability of the class, without any evidence
- **evidence:** $P(x)$ is the probability of the observation, and by the `law of total probability` can be computed as the **sum of marginal probabilities over the possible classes**:
$$
P(x) = \sum^c_{i=1}P(x|y_i)P(y_i)
$$
**Bayes decision rule:** minimizes the probability that an example is incorrectly labeled: choose the hypothesis with the largest probability given the evidence.
- $y_B = \text{argmax}_{i∈\{1,...,c\}}g_i(x)$: picks the class whose `discriminant function` $g_i(x)$ is largest; there are many possible choices for the function:
- $g_i(x) = p(y_i|x) \propto p(x|y_i)P(y_i)$ (`Bayesian decision rule`)
- $P(x)$ in the denominator is ignored because it is a constant
- $g_i(x) = \ln p(x|y_i) + \ln P(y_i)$
- …
**Optimal rule:** probability of error given $x$
$$
P(\text{error}|x) = 1 − P(y_B|x)
$$
👁🗨 the `minimum error` is the complement of the maximum posterior probability
- Bayes decision rule maximizes $P(y_B|x)$, thus minimizing the probability of error
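A toy numeric example of the rule above (likelihoods and priors are made up):
```python
# Bayes decision rule on a single observation x with two classes.
priors = {"y1": 0.7, "y2": 0.3}        # P(y)
likelihoods = {"y1": 0.2, "y2": 0.6}   # p(x|y) for the observed x

evidence = sum(likelihoods[y] * priors[y] for y in priors)               # p(x)
posteriors = {y: likelihoods[y] * priors[y] / evidence for y in priors}  # P(y|x)
y_B = max(posteriors, key=posteriors.get)                                # argmax_i g_i(x)
print(posteriors)                 # {'y1': 0.4375, 'y2': 0.5625}
print(y_B, 1 - posteriors[y_B])   # predicted class and P(error|x)
```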
### Classifiers representation

The feature space is divided into `decision regions` $R_1, . . . , R_c$ such that:
$$
x ∈ \mathcal{R}_i \quad \text{if } g_i(x) > g_j(x) \quad ∀ j \ne i
$$
👁🗨 Each discriminant function dominates over the others in its own decision region.
### Discriminant functions for multivariate normal density

**Normal density function** over $d$ variables:
$$
f_X(x_1,\dots,x_d)= \frac{\exp\bigl(−\frac{1}{2}(x − µ)^tΣ^{−1}(x − µ)\bigr)}{\sqrt{(2π)^d|Σ|}}
$$
- $Σ$ is always **symmetric** and positive semi-definite
Choosing $g_i(x) = \ln p(x|y_i) + \ln P(y_i)$ as `discriminant function` for the normal density:
$$
g_i(x)=−\frac{1}{2}(x − µ_i)^t Σ^{−1}_i (x − µ_i) −\frac{d}{2} \ln 2π −\frac{1}{2} \ln |Σ_i| + \ln P(y_i)
$$
- 👁🗨 the first term comes from the numerator, the 2nd and 3rd from the denominator
- $− \frac{d}{2} \ln 2π$ is a constant, so it does not affect the $\text{argmax}()$
**Special cases of covariance matrix $Σ_i$:**
- case $Σ_i = σ^2I$
- features are `statistically independent`
- all features have `same variance` $σ^2$
- covariance determinant $|Σ_i| = σ^{2d}$ can be ignored being independent of $i$
- the discriminant function becomes (see the sketch after these cases):
$$
g_i(x) = − \frac{||x − µ_i||^2}{2σ^2}+ \ln P(y_i)
$$

- case $Σ_i = Σ$

- case $Σ_i$ arbitrary
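A sketch of the first case above ($Σ_i = σ^2I$): the discriminant reduces to a prior-corrected nearest-mean classifier. Means, priors and $σ^2$ below are made up:
```python
# Nearest-mean discriminant: g_i(x) = -||x - mu_i||^2 / (2 sigma^2) + ln P(y_i).
import numpy as np

means = {0: np.array([0.0, 0.0]), 1: np.array([3.0, 3.0])}  # mu_i per class
priors = {0: 0.6, 1: 0.4}                                   # P(y_i)
sigma2 = 1.0                                                # shared variance

def g(x, i):
    return -np.sum((x - means[i]) ** 2) / (2 * sigma2) + np.log(priors[i])

x = np.array([1.0, 1.2])
print(max(means, key=lambda i: g(x, i)))  # predicted class = argmax_i g_i(x)
```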

## 8. Parameter estimation [[slides](http://disi.unitn.it/~passerini/teaching/2019-2020/MachineLearning/slides/06_07_bayesian_learning/talk.pdf)] 👓
**Setting:**
- data is sampled from a **probability distribution** $p(x, y)$
- probability distribution $p$ **form is known**, but **parameters are unknown**
- training set $\mathcal{D} = \{(x_1, y_1), . . . ,(x_m, y_m)\}$ of examples sampled i.i.d. (**independent identically distributed**) according to $p(x, y)$
**Goal:** `estimate the unknown parameters` of $p$ from training data $\mathcal{D}$.
**i.i.d. sampling:**
- _independent:_ each example is sampled independently from the others
- _identically distributed:_ all examples are `sampled from the same distribution`
### Multi-class classification setting
- **divide** training set $\mathcal{D}$ based on `classes` (one for each)
- given a new example $x$ (not in training set) and $\mathcal{D}$, we compute the **posterior probability** of class $y_i$:
$$
P(y_i|x, \mathcal{D}) = \frac{p(x|y_i,\mathcal{D}_i)p(y_i|\mathcal{D})}{p(x|\mathcal{D})}
$$
👁🗨 We must estimate class-dependent parameters $θ_i$ for $p(x|y_i , \mathcal{D}_i)$.
### Maximum Likelihood vs Bayesian estimation
- **Maximum likelihood/a-posteriori estimation:** find the $\theta$ that maximizes the probability of the observed data, in other words, find the distribution parameters under which the given data is most likely
- assumes parameters $θ$ as `unknown fixed`
- obtained values are used to compute `probability for new examples` (update $\theta_i \forall x$)
- 👎optimization problem can get very complicated
- 👍as the number of training examples $\rightarrow \infty$, MLE is `unbiased` and achieves the `smallest possible variance`
- **Bayesian estimation:**
- assumes $θ_i$ parameters as:
- `random variables`
- with some `known prior distribution`
- observing examples turns prior distribution over parameters into a `posterior distribution`
- $p(x|y_i , \mathcal{D}_i) = \int_{θ_i} p(x, θ_i |y_i , \mathcal{D}_i)dθ_i$
👁🗨 $p(x|y_i , \mathcal{D}_i) ≈ p(x|θ_i)$
### Maximum-likelihood estimation
$\theta_i^*$ is the set of best parameters that maximizes the probability of a dataset of class $i$.
- **Maximum a-posteriori estimation:** assumes a `prior distribution` for the parameters $p(θ_i)$ is available
- $θ^∗_i = \text{argmax}_{θ_i} p(θ_i |\mathcal{D}_i , y_i) = \text{argmax}_{θ_i} p(\mathcal{D}_i , y_i |θ_i)p(θ_i)$
- **Maximum likelihood estimation:** `no assumption` about prior distributions for parameters
- $θ^∗_i= \text{argmax}_{θ_i} p(\mathcal{D}_i , y_i |θ_i)$
- given $n$ inputs of class $y$:
- $θ^* = \text{argmax}_θ p(\mathcal{D}|θ) = \text{argmax}_θ \prod \limits ^n_{j=1} p(x_j |θ)$
- **Maximum log-likelihood estimation:** usually easier to maximize
- $θ^∗ = \text{argmax}_θ \ln p(\mathcal{D}|θ) = \text{argmax}_θ \sum \limits ^n_{j=1} \ln p(x_j |θ)$
👁🗨 Each class $y_i$ is treated **independently**; replace $y_i , \mathcal{D}_i → \mathcal{D}$ for `simplicity`.

💥 Maximum likelihood estimates for Gaussian parameters are simply their empirical estimates over the samples:
- Gaussian mean is the `sample mean`
- Gaussian covariance matrix is the mean of the `sample covariances`
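A quick numpy check of this fact on randomly generated data (the true mean and covariance below are made up):
```python
# ML estimates for a multivariate Gaussian = empirical mean and covariance.
import numpy as np

rng = np.random.default_rng(0)
D = rng.multivariate_normal(mean=[1.0, -2.0], cov=[[2.0, 0.3], [0.3, 1.0]], size=500)

mu_hat = D.mean(axis=0)                          # sample mean
Sigma_hat = np.cov(D, rowvar=False, bias=True)   # MLE covariance (divides by n)
print(mu_hat)     # close to [1, -2]
print(Sigma_hat)  # close to [[2, 0.3], [0.3, 1]]
```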
### Bayesian estimation
Predictions for new examples are obtained integrating over all possible values for the parameters:
$$
p(x|\mathcal{D}) = \int_θ p(x, θ|\mathcal{D})dθ = \int p(x|θ)p(θ|\mathcal{D})dθ
$$
- $p(x|θ)$ can be **easily computed** because we have both form and parameters of distribution (Gaussian,…)
- need to **estimate the parameter posterior density** given the training set:
$$
p(θ|\mathcal{D}) = \frac{p(\mathcal{D}|θ)p(θ)}{p(\mathcal{D})}
$$
- 👁🗨 $p(\mathcal{D})$ is a constant independent of $θ$, so it has no influence on the Bayesian decision
#### Bayesian estimation cases
- **Univariate normal case:** unknown $\mu$, known $σ^2$
- $p(x|\mu) ∼ N(\mu, σ^2 )$
- Mean posterior distribution for varying sample size: the posterior $p(\mu|\mathcal{D})$ sharpens around the sample mean as $n$ grows (see the sketch after this list)
- **Multivariate normal case:** unknown $\mu$, known $\Sigma$
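A sketch of the univariate case above, using the standard closed-form posterior for this conjugate setting (prior and data values are made up):
```python
# Posterior over mu with known sigma^2 and Gaussian prior N(mu0, sigma0^2).
import numpy as np

mu0, sigma0_sq = 0.0, 4.0      # prior p(mu)
sigma_sq = 1.0                 # known variance of p(x|mu)
data = np.array([2.1, 1.8, 2.4, 2.0, 1.9])

n, xbar = len(data), data.mean()
mu_n = (n * sigma0_sq * xbar + sigma_sq * mu0) / (n * sigma0_sq + sigma_sq)
sigma_n_sq = (sigma0_sq * sigma_sq) / (n * sigma0_sq + sigma_sq)
print(mu_n, sigma_n_sq)  # the posterior sharpens around the sample mean as n grows
```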

### Sufficient statistics
**Statistic:** any function on a set of samples $\mathcal{D}$.
- a statistic $\textbf{s} = \phi(\mathcal{D})$ is sufficient for some parameters $θ$ if:
$$
P(\mathcal{D}|\textbf{s}, θ) = P(\mathcal{D}|\textbf{s})
$$
- a `sufficient statistic` allows **compressing a sample** $\mathcal{D}$ into (possibly few) values
👁🗨 Sample `mean` and `covariance` are **sufficient statistics** for true mean and covariance of the **Gaussian distribution**.
### Conjugate priors
A prior distribution $p(θ)$ **is a conjugate prior for a likelihood function** $p(x|θ)$ if the posterior distribution $p(θ|x)$ is in the `same family` as the prior $p(θ)$.

## 9. Bayesian Networks [[slides](http://disi.unitn.it/~passerini/teaching/2019-2020/MachineLearning/slides/08_bayesian_networks/talk.pdf)] 🖇
Why:
- give `intuitive understanding` of the probabilistic model
- `discover properties` of the model by inspecting the graph (conditional independences, …)
- represent multiple probability distributions with the same graph, `abstracting from the quantitative aspects`
**Independence recap:**
- $a ⊥ b | ∅$ : two variables $a, b$ are `independent` if: $p(a, b) = p(a)p(b)$
- $a ⊥ b | c$ : two variables $a, b$ are `conditionally independent` given $c$ if: $p(a, b|c) = p(a|c)p(b|c)$

**Factorized probability:** $p(x_1,...,x_m)= \prod\limits_{i=1}^m p(x_i|Pa_{x_i})$
- example: the `factorization` of a 7-node graph $\mathcal{G}$ could be $p(x_1,...,x_7)=p(x_1)p(x_2)p(x_3)p(x_4|x_1,x_2,x_3)p(x_5|x_1,x_3)p(x_6|x_4)p(x_7|x_4,x_5)$
👁🗨 **Independence assumptions:** each variable is independent (⊥) of its non-descendants given its parents.
- $\mathcal{I}_\ell(\mathcal{G}) = \{∀i \ \ x_i ⊥ \text{NonDescendants}_{x_i}|Parents_{x_i} \}$
$\mathcal{G}$ is an **independency map** (`I-map`) for $p$ if $p$ satisfies the local independences in $\mathcal{G}$:
$$
\mathcal{I}_\ell(\mathcal{G}) ⊆ \mathcal{I}(p)
$$
- $p$ is the **joint distribution** over variables $\mathcal{X}$
- $\mathcal{I}(p)$ is the `set of independence assertions` holding in $p$
- **Minimal I-map:** cannot be reduced
- a minimal I-map for $p$ does not necessarily capture all the independences in $p$
**P-map (**perfect map**):** $\mathcal{I}(G) = \mathcal{I}(p)$, captures all and only the independences of $p$
- not all distributions have a P-map, some cannot be modeled exactly by the BN formalism
**I-equivalence:** $\mathcal{I}(G) = \mathcal{I}(G')$, if the graphs have same `skeleton` (undirected graph) and same set of `v-structures`

### D-Separation
Instead of `verifying independence` assumptions by repeated applications of the sum and product rules, d-separation rules can be used to verify them directly from the graph.
**Markov Condition:** each variable is conditionally independent of its `non-descendants` given its `parents`.
- **tail-to-tail:**
- `joint` dist. $p(a,b,c)= p(a|c)p(b|c)p(c)$
- ⬜ $c$ not given $\implies$ $a$ and $b$ not independent
- 🟪 $c$ given $\implies$ $a$ and $b$ conditionally independent
- **head-to-tail:**
- `joint` dist. $p(a,b,c)= p(b|c)p(c|a)p(a) = p(b|c)p(a|c)p(c)$
- ⬜ $c$ not given $\implies$ $a$ and $b$ not independent
- 🟪 $c$ given $\implies$ $a$ and $b$ conditionally independent
- **head-to-head:**
- `joint` dist. $p(a,b,c)= p(c|a,b)p(a)p(b)$
- ⬜ $c$ not given $\implies$ $a$ and $b$ independent
- 🟪 $c$ given $\implies$ $a$ and $b$ not conditionally independent

👁🗨d-separation $\implies$ conditional independence between two variables.
**Markov Blanket:** a variable becomes `conditionally independent of any other variable` given the evidence of its Markov blanket.
- $\text{MarkovBlanket}(X) = \{\text{Parents}(X), \text{Children}(X), \text{Parents}(\text{Children}(X))\}$
## 10. Inference in BN [[slides](http://disi.unitn.it/~passerini/teaching/2019-2020/MachineLearning/slides/09_inference_in_bn/talk.pdf)] 💡
**Inference:** in practice, it is the execution of a query.
The most common query type is computing the `posterior probability` of a subset $X$ of the non-observed variables given the observations
- $p(X|E = e)$
- working with joint probabilities like $p(X|E = e) = \frac{p(X, E = e)}{p(E = e)}$ is $O(2^n)$ in the number of variables, more efficient inference is needed
Each outgoing message is obtained by multiplying the incoming message by the “local” probability and summing over the node values:

- $µ_α(X_n) = \sum\limits_{X_{n−1}} p(X_n|X_{n−1})µ_α(X_{n−1})$
- $µ_β(X_n) = \sum\limits_{X_{n+1}} p(X_{n+1}|X_n)µ_β(X_{n+1})$
**Marginal probability:** $p(X_n) = µ_α(X_n)µ_β(X_n)$
- is the product of the contributions coming from both ends
**Marginal probability with evidence:** if there are observed nodes, simply use their observed values instead of summing over all possible values when computing their messages. The message passing procedure computes the `joint probability` of the variable and the evidence.

**Conditional probability given evidence:** the message passing procedure computes the joint probability of the variable and the evidence, then normalized.
- $p(X_n|X_e = x_e) = \dfrac{p(X_n, X_e = x_e)}{ \sum_{x_n} p(X_n, X_e = x_e)}$
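A small numpy sketch of the message passing equations above on a binary chain $X_1 \rightarrow X_2 \rightarrow X_3$ (the CPTs are made up), computing $p(X_2|X_3=1)$:
```python
# Forward/backward messages on a chain and conditioning on evidence X3 = 1.
import numpy as np

p_x1 = np.array([0.6, 0.4])              # p(X1)
T = np.array([[0.7, 0.3], [0.2, 0.8]])   # T[i, j] = p(X_{n+1}=j | X_n=i)

# forward messages: mu_alpha(X_n) = sum_{X_{n-1}} p(X_n|X_{n-1}) mu_alpha(X_{n-1})
alpha1 = p_x1
alpha2 = alpha1 @ T
alpha3 = alpha2 @ T                      # = marginal p(X3) without evidence

# backward messages with evidence: only the observed value X3 = 1 is kept
beta3 = np.array([0.0, 1.0])
beta2 = T @ beta3                        # = p(X3 = 1 | X2)

joint2 = alpha2 * beta2                  # joint p(X2, X3 = 1)
print(joint2 / joint2.sum())             # conditional p(X2 | X3 = 1)
print(alpha3)
```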
### Exact Inference on Trees

Efficient inference algorithms only for `tree-structured models`: only one possible path between two nodes.
- **Sum-product:** (_note: explained from here on using factor graphs_)
- for `tree-structured graphs`
- `message-passing` algorithm
- applicable to both `directed models` (Bayesian networks) and `undirected ones` (Markov networks)
- **Max-product:** to find the **most probable configuration**
- for `tree-structured graphs`
- `message-passing` algorithm
- replace sum with max in the sum-product algorithm
- products of many small probabilities can lead to underflow problems $\rightarrow$ use $log$
- **Junction tree algorithm:**
- `message-passing` algorithm
- extension on `generic graphs`, but has **exponential complexity**
#### Factor Graphs
**Factor graph:** a better graphical representation for describing inference algorithms.

- **Factor** $f(X)$: a function over a set of variables $X= \{ X_1, X_2, . . . , X_k \}$ called `scope`, mapping each possible set of inputs to a real number
- can be thought of as a table of values (not necessarily probabilities)
- in `Bayesian networks`, the factors correspond to **conditional probability distributions**
- 👁🗨_Conditional Probability Tables_ (`CPTs`) are factors
- in `Markov Random Fields`, the factors encode an **unnormalized distribution**
**Factor message:** starts from a factor, is the `product of node messages` incoming to the factor, `summed over all the possible values` of the factor variables other than $X$ (marginalization)
- $µ_{f_s→X} (X) = \text{marginalized } f_s \cdot \prod\limits_{X_{neighbour}\backslash X} (\text{incoming node messages})$
- `factor marginalization:` locally eliminates a set of variables from a factor, $f(X)=\sum\limits_{X_1}...\sum\limits_{X_n}f(X,X_1,...,X_n)$
- example: $f_s(X,Y)=P(X|Y)$ $\longrightarrow$ $f_{s}(X) = \sum\limits_YP(X|Y)$

**Node message:** starts from a node, is the `product of factor messages` incoming to the node
- $µ_{X_m→f_s} (X_m) = \prod\limits_{f_{neighbour} \backslash f_s}(\text{incoming factor messages})$
INFERENCE EXAMPLE WITH FACTOR GRAPH:

**Variable elimination algorithms:** eliminate, one by one, the variables irrelevant for the inference query, summing them out into intermediate **factor functions**.
- uses `dynamic programming`, $O(nk^2)$
### Approximate inference
A number of techniques for approximate inference exist:
- **loopy belief propagation:** `message passing` on the original graph even if it contains loops (it is not guaranteed to provide an exact solution)
- set a `message passing schedule`, information flows multiple times because of the loops, stop on convergence
- **variational methods:** deterministic approximations, assuming the posterior probability (`given the evidence`) factorizes in a particular way
- **sampling methods:** approximate posterior is obtained `sampling` from the network
## 11. Learning BN [[slides](http://disi.unitn.it/~passerini/teaching/2019-2020/MachineLearning/slides/10_learning_bn/talk.pdf)] 👨🏫
**Goal:** `induce a Bayesian Network` that best matches a dataset $\mathcal{D}$, given some prior information or none (BN fragment, prior probabilities, ...).

### Complete Data and Known Structure Case
**Parameter estimation:** full BN + full data $\mathcal{D}$ to `estimate conditional probability tables` (one table for each node).
- estimator approaches:
- **MLE estimator (**frequentist**):**
- tends to `over-fit` the training set
- configurations not appearing in the training set will receive `zero probability`
- **Bayesian estimator**
### Incomplete Data and Known Structure Case
We need `approximate methods` to deal with incomplete data.
**Expectation-Maximization (**`EM`**):** iterative method
- **e-step:** fill in missing data by inferring it `using current parameters` (solve an inference problem to get expected counts)
- **m-step:** `recompute parameters` maximizing likelihood of the complete dataset (using expected counts)
- can either use ML (Maximum Likelihood) or MAP (Maximum A Posteriori)
### Learning The Structure
**Learning graphical models structures:**
- `score-based`: score each possible structure, and find the best one
- `constraint-based`: test conditional independencies on the data and construct a model satisfying them
- `model-averaging`: weighted average over all possible structures
- **problem:** averaging over all possible structures is too expensive
## 12. Naive Bayes Classifier [[slides](http://disi.unitn.it/~passerini/teaching/2019-2020/MachineLearning/slides/11_naive_bayes/talk.pdf)] 🐵
Named after its `naive independence assumption`: all the attributes are **conditionally independent given the class**.
- to `simplify calculations`
- in practice the assumption is violated $\implies$ `high bias`
- bias of the parameter estimates can be reduced by improving the probability estimates
- but naive Bayes has `low variance`
**Classification rule:** $y^*=\text{argmax}_{y_i \in Y}P(y_i|a_1,...,a_m) = \text{argmax}_{y_i \in Y}P(y_i)P(a_1,...,a_m|y_i) \ \ \underrightarrow{\text{cond. ind.}} \ \ \text{argmax}_{y_i \in Y}P(y_i)\prod\limits_{j=1}^m P(a_j|y_i)$
- the task is predicting the MAP (Maximum A Posteriori) target value given the instance
- classification $\implies$ use $\text{argmax}$
### Single Multinomial Distribution Case
If all attribute values come from the same distribution, the probability of an attribute value given the class can be modeled as a multinomial distribution over the $K$ possible values.
**Classifier:** $y^*= \text{argmax}_{y_i \in Y}P(y_i)\prod\limits_{j=1}^m P(a_j|y_i) = \text{argmax}_{y_i \in Y}\dfrac{|D_i|}{|D|}\prod\limits_{j=1}^{|X|}\prod\limits_{k=1}^{K} \theta_{ky_i}^{z_k(x[j])}$
**Maximum-likelihood estimator (MLE)** for $\theta_{kc}=\dfrac{N_{kc}}{N_c}$, where $N_{kc}$ is the number of times the value $v_k$ was observed among training examples of class $c$.
**Bayesian posterior distribution estimator** for $\theta_{kc}=\dfrac{N_{kc}+\alpha_{kc}}{N_c+\alpha_c}$
👁🗨 With multiple distributions, attributes from different distributions have to be considered separately for parameter estimation.
- $\theta_{jkc}$ represents the probability of observing value $v_{jk}$ for the $j$-th attribute given class $c$
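A minimal sketch of such a classifier for categorical attributes, with a simplified Laplace-style smoothing in place of the $\alpha_{kc}$ priors; the toy data is made up:
```python
# Categorical naive Bayes: argmax_y log P(y) + sum_j log P(a_j | y).
import math
from collections import Counter, defaultdict

def train(X, y):
    classes = Counter(y)                     # N_c, used for the priors |D_c|/|D|
    counts = defaultdict(Counter)            # (attribute j, class c) -> value counts
    for x, c in zip(X, y):
        for j, v in enumerate(x):
            counts[(j, c)][v] += 1
    return classes, counts

def predict(x, classes, counts, alpha=1.0):
    total = sum(classes.values())
    best, best_score = None, float("-inf")
    for c, n_c in classes.items():
        score = math.log(n_c / total)        # log prior
        for j, v in enumerate(x):
            n_values = len(counts[(j, c)]) or 1
            score += math.log((counts[(j, c)][v] + alpha) / (n_c + alpha * n_values))
        if score > best_score:
            best, best_score = c, score
    return best

X = [("sunny", "hot"), ("sunny", "mild"), ("rain", "mild"), ("rain", "cool")]
y = ["no", "no", "yes", "yes"]
print(predict(("rain", "hot"), *train(X, y)))  # 'yes'
```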
## 13. Linear discriminant functions [[slides](http://disi.unitn.it/~passerini/teaching/2019-2020/MachineLearning/slides/14_linear_discriminant/talk.pdf)] 📉

- `generative learning`: assumes knowledge of the distribution governing the data
- 👍 more flexible models
- 👍 can perform arbitrary inference tasks
- 👁🗨 generative classifiers:
- Naive Bayes
- Bayesian networks
- Markov random fields
- Hidden Markov Models (HMM)
- `discriminative learning`: focuses on directly modeling the **discriminant function**
- 👍 feasible even when modeling data distribution is difficult due to data complexity
- 👍 focuses parameters on the desired goal
- 👎 less flexible models
- 👎 can't do arbitrary inference tasks
- 👁🗨 discriminative classifiers:
- Logistic regression
- Support Vector Machine (SVM)
- Traditional neural networks
- Nearest neighbor
- Conditional Random Fields (CRF)s
**Linear discriminant function:** $f(x) = w^T x + w_0$, a linear combination of the example features; $f(x)=0$ defines the decision boundary
- $w_0$ is the `threshold` (**bias**)
**Linear binary classifier:** $f(x) = \text{sign}(w^T x + w_0)$

- the boundary is a `hyperplane` $H$
- weight vector $w$ is orthogonal to $H$
- **functional margin:** $f(x)$ value $\rightarrow$ **prediction confidence**
- **geometric margin:** $r^x = \dfrac{f(x)}{||w||}$, distance from $x$ to $H$
- 👁🗨 the further a point is from the hyperplane, the higher the prediction confidence
### Perceptron
A **Perceptron** (neural node) is composed of:
- `linear combiner:` computes the weighted sum of the inputs ($v_k$ in the slides)
- `weights:` one for each input
- `bias` value
- `activation function:` the final output, called $\phi$

**Parameter learning:** model weights and biases must be adjusted; we need a procedure to optimize them given an error (loss) function
- `gradient descent`:

...
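A minimal sketch of perceptron training: the classic error-driven update $w \leftarrow w + \eta\, y\, x$ on misclassified examples (a stochastic gradient step on the perceptron loss); the toy linearly separable data is made up:
```python
# Perceptron learning on a small linearly separable dataset.
import numpy as np

X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])

w, w0, eta = np.zeros(2), 0.0, 0.1
for epoch in range(100):
    errors = 0
    for xi, yi in zip(X, y):
        if yi * (w @ xi + w0) <= 0:   # misclassified (or on the boundary)
            w += eta * yi * xi        # gradient step on the perceptron loss
            w0 += eta * yi
            errors += 1
    if errors == 0:                   # converged: every point correctly classified
        break

print(w, w0)
print(np.sign(X @ w + w0))            # predictions: [ 1.  1. -1. -1.]
```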
## 14. Linear Support Vector Machines [[slides](http://disi.unitn.it/~passerini/teaching/2019-2020/MachineLearning/slides/15_svm/talk.pdf)] 📊
- linear SVMs have linear complexity in the number of dimensions
- thanks to margin maximization they tend not to suffer much from overfitting
**Support vectors:** points lying on the `minimal confidence` hyperplanes (the margin hyperplanes).
- SVM are generally `sparse` (few support vectors)
- only they contribute to the `final decision`
### Hard margin SVM

- the constraints guarantee that all points are correctly classified
- minimizing $||w||^2$ corresponds to maximizing the (squared) margin
**Constrained optimization:**
- minimal confidence hyperplane: $y_i(w^T x_i + w_0 ) = 1$

**Lagrange multiplier (**$\lambda$**):** idea: look for the point of tangency between the contour lines of the objective function and the constraint function.
$$
\nabla f(w) = \lambda \nabla g(w)
$$
- after fixing the constraint boundary $g(w)=c$, solving the system formed by this and the previous formula gives the $\lambda$ and $w$ values of the best solution(s)
**Lagrangian:** nice compact way to solve the constrained optimization problem
$$
\mathcal{L}(w,\lambda)=f(w)-\lambda \big(g(w)-c\big)
$$
- `equivalent solution:` find where $\nabla \mathcal{L}=0$ (solution is a `saddle point`)
- consider $(w^*,\lambda^*)$ as the tuple that gives the best solution for $f(w)$
- after taking partial derivatives w.r.t. the primal variables, setting them to zero, and substituting back into $\mathcal{L}$, what remains is $\mathcal{L}(\lambda)$, the **equivalent constrained problem** (the `dual formulation`), subject to constraints on $\lambda$
**Karush-Kuhn-Tucker conditions (**`KKT`**):** a constrained optimization problem can be addressed by converting it into an unconstrained problem with the same solution...
#### Primal formulation

- maximizing the margin corresponds to minimizing $||w||$
- `constraints` tell to have all predictions out of the boundaries (always correct classification)
- $y_i=\pm1$, each data point must lie on the correct side of the margin
#### Dual formulation
$\mathcal{L}(\lambda)=$ (the slides use $\alpha$ for the multipliers):
$$
\mathcal{L}(\alpha) = \sum\limits_{i=1}^{m}\alpha_i - \frac{1}{2}\sum\limits_{i,j=1}^{m}\alpha_i\alpha_j y_i y_j x_i^T x_j \qquad \text{subject to } \alpha_i \ge 0, \ \sum\limits_{i=1}^{m}\alpha_i y_i = 0
$$
Decision function: $f(x)=w^Tx+w_0= \sum\limits _{i=1}^{m}\alpha_iy_ix_i^Tx+w_0$
- where $w=\sum\limits _{i=1}^{m}\alpha_iy_ix_i$
- 👍 `decision function:` linear combination of dot products
- **dot product** is kind of similarity between points
### Soft margin SVM
Address the problem of `outliers`.

#### Primal formulation

**Slack variables ($ξ$):** represent penalty for example $i$ if it does not satisfy margin constraints
**Regularization ($C$):** the coefficient $C$ trades off margin maximization (regularization) and margin violations; regularization is a standard approach to **prevent overfitting** (smaller weights, lower complexity, ...)
**Hinge loss:** zero if the example satisfies the margin constraint, otherwise it grows linearly with the violation
- $\ell\big(y_i , f(x_i)\big) = |1 − y_i f(x_i)|_+ = \max\big(0, 1 − y_i f(x_i)\big)$
- the loss value increases linearly as the data point moves further into the wrong side of the margin

- the hinge loss corresponds to the slack variable $ξ_i$
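The slides solve the problem through the dual; as an alternative illustration only, here is a simple subgradient descent on the primal objective $\frac{1}{2}||w||^2 + C\sum_i \ell(y_i, f(x_i))$, just to show the role of the hinge loss and of $C$ (toy data is made up):
```python
# Subgradient descent on the soft-margin primal (illustrative, not the dual solver).
import numpy as np

X = np.array([[2.0, 2.0], [1.0, 2.5], [-1.0, -1.0], [-2.0, -1.5], [0.5, -0.2]])
y = np.array([1, 1, -1, -1, 1])       # the last point plays the role of an outlier

w, w0, C, eta = np.zeros(2), 0.0, 1.0, 0.01
for _ in range(2000):
    margins = y * (X @ w + w0)
    viol = margins < 1                 # points with non-zero hinge loss
    grad_w = w - C * (y[viol, None] * X[viol]).sum(axis=0)
    grad_w0 = -C * y[viol].sum()
    w, w0 = w - eta * grad_w, w0 - eta * grad_w0

slack = np.maximum(0, 1 - y * (X @ w + w0))  # hinge losses = slack variables
print(w, w0, slack)
```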
#### Dual formulation

## 15. Non-linear Support Vector Machines [[slides](http://disi.unitn.it/~passerini/teaching/2019-2020/MachineLearning/slides/16_nonlinear_svm/talk.pdf)] 🎢
❔`Problem:` **linear SVMs can only address linearly separable problems**
❕`Solution:` **map** input examples in a higher dimensional feature space and do linear classification in this higher dimensional space
- a linear separation in feature space corresponds to a non-linear separation in input space
- $\phi:\mathcal{X}\rightarrow\mathcal{H}$ maps inputs to higher dimensions
- SVM: $f(x) = w^T \phi(x) + w_0$

**$\epsilon$-insensitive loss:** zero loss within an $\epsilon$ margin around the target; the loss increases linearly only outside it

**Primal formulation:**

- 💭 MEMO: the slack variable $ξ$ is the loss function
- one constraint for every side of the margin
**Dual formulation:**


**Smallest Enclosing Hypersphere:** one-class classification

## 16. Kernel Machines [[slides](http://disi.unitn.it/~passerini/teaching/2019-2020/MachineLearning/slides/17_kernel_machines/talk.pdf)] 🥜
**Kernel Trick:** find the `linear` decision hyperplane in higher dimensions than input without actually transforming the data, possible because SVMs need only to work with dot products
- SVMs don’t need the actual vectors to work, dot products between them are sufficient; the expensive computation of the new high-dimensional features can be avoided
- 💥 only change the dot product to that of the desired space
Kernels are a **generalization of dot products** to arbitrary domains: they `make the problem linearly separable in feature space`, so we can use linear SVMs.
- a `dot product` can be seen as a measure of similarity between objects
- it is possible to design kernels over structured objects like sequences, trees or graphs
- ⚠ **Careful:** mapping to higher dimensions can overfit the model.

**Basic Kernels:**
- linear: $k(x, x' ) = x^T x'$
- polynomial: $k_{d,c}(x, x' ) = (x^T x' + c)^d$

**Homogeneous Kernel:** when $k(x, x' ) = (x^T x' )^d$
**Valid Kernel:** when $k(x, x' ) = \phi(x)^T \phi(x' )$
- how to check validity:
- prove its `positive definiteness` (difficult)
- find a corresponding `feature map`
- use `kernel combination` properties
**Gram matrix:** the matrix of kernel values for every pair of inputs in a dataset: $K_{ij} = k(x_i, x_j)$
**Positive definite Kernel:** function that gives a positive definite Gram matrix
- 👁🗨 is `necessary` and `sufficient` condition for a kernel to correspond to a dot product of some feature map $\phi$
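A small check of these notions on toy data: build the Gram matrix of a polynomial kernel and verify that its eigenvalues are non-negative, as expected for a valid kernel:
```python
# Gram matrix of a polynomial kernel and a positive semi-definiteness check.
import numpy as np

def poly_kernel(x, z, c=1.0, d=3):
    return (x @ z + c) ** d            # k(x, x') = (x^T x' + c)^d

X = np.array([[1.0, 0.0], [0.5, 0.5], [-1.0, 2.0], [0.0, -1.0]])
K = np.array([[poly_kernel(xi, xj) for xj in X] for xi in X])  # K_ij = k(x_i, x_j)

eigvals = np.linalg.eigvalsh(K)        # symmetric matrix -> real eigenvalues
print(K)
print(eigvals, (eigvals >= -1e-9).all())  # all >= 0 (up to numerics)
```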
**Kernel Combinations:** allows to `design complex kernels` on structures from simpler ones.
- 💥 easiest way to build a **valid kernel**
- `sum`: $(k_1 + k_2)(x, x' ) = k_1(x, x' ) + k_2(x, x' ) = \phi_1(x)^T Φ_1(x' ) + \phi_2(x)^T Φ_2(x')$
- `product`: $(k_1 × k_2)(x, x' ) = k_1(x, x' )k_2(x, x' ) = \sum\limits^n_{i=1} \phi_{1i}(x)\phi_{1i}(x' ) \sum\limits^m_{j=1} \phi_{2j}(x)\phi_{2j}(x')$
- `linear combination`: $k_{sum}(x, x' ) = \sum\limits^K_{k=1} β_k k_k (x, x' )$
- amounts to performing _kernel learning_
- `decomposition:` ...
- `convolution`: ...
- `set kernel`: ...
- `normalization`: ...
- `composition`: ...
## 17. Kernels on Structures [[slides](http://disi.unitn.it/~passerini/teaching/2019-2020/MachineLearning/slides/18_kernels/talk.pdf)] 🔹
- kernels allow `generalizing the notion of dot product` (i.e. similarity) to arbitrary (non-vector) spaces
- 👁🗨 **decomposition kernels** suggest a constructive way to build kernels considering parts of objects
### Sequences
**Spectrum kernel:** gives an idea of sequence similarity
- the dot product between the feature space vectors is the similarity value
- **feature space:** space of all possible $k$-grams (substrings of length $k$)

- 👎`problem:` feature space representation tends to be very sparse $\implies$ an example is only similar to itself
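A minimal sketch of the spectrum kernel: count the $k$-grams of each sequence and take the dot product of the count vectors (the example strings are made up):
```python
# Spectrum kernel: dot product in the space of k-gram counts.
from collections import Counter

def spectrum(s, k):
    return Counter(s[i:i + k] for i in range(len(s) - k + 1))  # k-gram counts

def spectrum_kernel(s, t, k=3):
    a, b = spectrum(s, k), spectrum(t, k)
    return sum(a[g] * b[g] for g in a)   # dot product of the two feature vectors

print(spectrum_kernel("GATTACA", "TACAGAT"))  # 3 shared 3-grams: GAT, TAC, ACA
print(spectrum_kernel("GATTACA", "CCCCGGG"))  # 0: no common 3-gram
```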
**Mismatch string kernel:** the feature for a given $k$-gram counts all the $k$-grams in the sequence that match it with at most $m$ mismatches
- allows `approximate` matches between $k$-grams
- 👍 denser feature map
Spectrum vs Mismatch string:

### Trees
**Subset tree kernel:** corresponds to a feature map of all subset trees
- the result of the dot product is the number of all the matching sub-trees

- for each pair of matching non-pre-terminal nodes, the count of matching subset trees is computed recursively as $C(n_i , n'_j) = \prod\limits^{nc(n_i)}_{j=1} \big(1 + C(ch(n_i , j), ch(n'_j , j))\big)$
- $nc(n_i)$ is the number of children of $n_i$
- $ch(n_i,j)$ is the $j$-th child of $n_i$
- 👁🗨 kernel value depends on tree size $\implies$ normalize!
- `dominant diagonal problem:` different trees rarely share large identical portions, so each tree's similarity to itself is much higher than to any other tree
### Graphs
**Bag of subgraphs:** decompose the graph into one feature for each possible subgraph up to a certain size

- feature value is its number of occurrences
**Walk kernels:** give the number of common walks of length $n$ between two graphs
1. $n$-th power of the adjacency matrix gives the number of walks of length $n$ between any pair of nodes
2. construct direct product graph of $G$ and $G'$

3. count walks in product graph (because each walk is common to the original graphs)
**Weisfeiler-Lehman graph kernel:** based on the Weisfeiler-Lehman test of graph isomorphism
- iteratively relabels each node by compressing its label together with the labels of its neighbours, then compares the resulting label counts between the two graphs

## 18. Deep Networks [[slides](http://disi.unitn.it/~passerini/teaching/2019-2020/MachineLearning/slides/21_nn/talk.pdf)] 🕸
Recap:
- **Perceptron**
- models `only linear` functions
- **Kernel Machines**
- `non-linearity` given by kernels
- need **kernel selection** phase
- **Multilayer Perceptron (**`MLP`**)**
- Input layer (input features)
- $\ge1$ hidden layers in the middle (learned features)
- output layer (predictions)
**Activation function** for neuron output after the linear combination of input weights and nodes:
- `Threshold activation:` $0$ if negative, $1$ if positive
- not suitable for gradient descent
- `Sigmoid:` has an "S" shape, a smooth version of the threshold activation
- decision threshold at $0.5$
- `Softmax:` for **multi-class classification**, final layer has one output for each possible class
- decision: highest scoring class
**Loss Functions:**
- binary classification $\longrightarrow$ `cross entropy`
- multi-class classification $\longrightarrow$ `cross entropy`
- regression $\longrightarrow$ `mean squared error`
👁🗨 Minimizing the cross entropy $=$ maximizing the likelihood.
💥 **Gradient backpropagation:**
$$
\frac{\partial E}{\partial W_j} = \frac{\partial E}{\partial \phi_j}\frac{\partial \phi_j}{\partial W_j}
$$
Where:
- $\phi_i$ is the output of the $i$-th node
- $\dfrac{\partial E}{\partial W_j}$ is the computed error for the weight $W_j$
- product between how much the output function $\phi_j$ affects the error and how much the weight $W_j$ affects the output
- $\dfrac{\partial E}{\partial \phi_j}$ is the computed error for the output $\phi_j$
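A tiny numeric check of this chain rule for a single sigmoid unit with squared error $E=\frac{1}{2}(\phi - t)^2$; the values of $x$, $W$ and $t$ are made up:
```python
# Backpropagated gradient dE/dW for one sigmoid neuron, checked numerically.
import math

x, W, t = 1.5, 0.8, 1.0                 # input, weight, target
v = W * x                               # linear combination
phi = 1.0 / (1.0 + math.exp(-v))        # sigmoid output

dE_dphi = phi - t                       # how much the output affects the error
dphi_dv = phi * (1.0 - phi)             # sigmoid derivative
dE_dW = dE_dphi * dphi_dv * x           # chain rule: dE/dphi * dphi/dW

eps = 1e-6                              # finite-difference check
E = lambda w: 0.5 * (1.0 / (1.0 + math.exp(-w * x)) - t) ** 2
print(dE_dW, (E(W + eps) - E(W - eps)) / (2 * eps))  # the two values match
```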

**Stopping criterion:** use a separate validation set to estimate performance of the network and choose when to stop training.

### Training Tricks
**Vanishing gradient problem:** gradient tends to $0$ in lower layers because activation functions (sigmoid, ...) are easily saturated
- `initialize weights` to small random values around $0$
- `standardize` inputs to avoid saturating hidden units
- `shuffle` training data before each epoch
- solution: input regularization
**Rectifier:** activation function $f(x) = \text{max}(0, w^T x)$, gives better training for deep neural networks
- ✔ `sparse activation:` many nodes with output $0$
- ✔ scale invariant
- ✔ better gradient propagation: no easy saturation
- ✔ piecewise-linear function (**much faster** than sigmoid)
- **ReLU:** neuron employing the rectifier activation
- ❌ dying ReLU problem (akin to vanishing gradient): a node can die if it is inactive for all inputs
**Regularization:** add a penalty term on the weights to the loss
- 👍 good alternative to cross validation for feature selection (handling overfitting)
- **L1 (**`lasso regression`**):** penalty $\lambda\sum_i |w_i|$, encourages sparse weights (useful for feature selection)
- **L2 (**`ridge regression`**):** penalty $\lambda\sum_i w_i^2$, encourages small weights
**Initialization suggestions:**
- init random weights $\longrightarrow$ break symmetries between neurons
- sparse initialization (init many nodes to $0$) $\longrightarrow$ encourages diversity between neurons
**Batch size:** find a good trade-off between learning speed and reliable convergence
- `full batch:` update parameters once per epoch (after seeing all examples)
- `mini-batch:` update parameters after a small batch of examples (a mix of the two)
- `fully stochastic:` update parameters for each input
**Momentum:** accelerates the descent for dimensions whose gradients point in the same direction
- helps accelerating the descent, but can also skip a local minimum

**Adaptive gradient:** can constantly reduce the learning rate while approaching a local minimum, or adapt it based on the gradient slope (`Adagrad`).
**Batch normalization:** helps against the `covariate shift` problem (input distribution changes over time)
- normalize node activations + tune shift and scaling parameters
- 👍 more robust parameter initialization
- 👍 allows faster learning rates without divergence
- 👍 regularizes the model
**Layer-wise pre-training:** train each layer as a small auto-encoder (compressor) that reproduces its input at its output (can be done unsupervised)

- `global refinement:`
1. swap auto-encoder output layer with a preferred one (e.g. one-hot for multi-class)
2. train
### Popular Architectures
**Convolutional Networks:** location invariance + compositionality
- `convolution filters:` extract **local features**
- **pooling** to provide invariance to local variations
- hierarchy of filters to compose complex features from simpler ones (e.g. pixels to edges to shapes)
- fully connected layers for **final classification**

**Long Short-Term Memory Networks (**`LSTM`**):** used a lot for speech recognition
- cell state propagated along chain
- `forget gate:` selectively forgets parts of the cell state
- `input gate:` selectively chooses parts of the candidate for the cell update
- `output gate:` selectively chooses parts of the cell state for output

**Generative Adversarial Networks (**`GAN`**):**
- `generator network:` learns to generate items (e.g. images) from random noise
- `discriminator network:` learns to distinguish between real items and generated ones
- the two networks are jointly learned (`adversarial game`)
- **no human supervision** needed!!

## 19. Unsupervised Learning [[slides](http://disi.unitn.it/~passerini/teaching/2019-2020/MachineLearning/slides/20_clustering/talk.pdf)] 🙈
To group unlabeled data into `clusters`.
### K-means learning
**K-means learning:** assumes $k$ clusters, each represented by a mean point

1. assign each input to the cluster with the closest mean
2. recalculate the cluster means
3. repeat $1$ and $2$ until the assignments no longer change
- 👁🗨 need to choose a good similarity (distance) function: `Euclidean`, `Minkowski`, `Cosine similarity`, ...
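A minimal numpy sketch of the loop above with Euclidean distance and $k=2$ (the 2D toy data is randomly generated):
```python
# K-means: alternate closest-mean assignment and mean recomputation.
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(4, 0.5, (50, 2))])

k = 2
means = X[rng.choice(len(X), k, replace=False)]          # init: k random points
for _ in range(100):
    dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
    assign = dists.argmin(axis=1)                        # 1. assign to closest mean
    new_means = np.array([X[assign == j].mean(axis=0) for j in range(k)])
    if np.allclose(new_means, means):                    # 3. stop when nothing changes
        break
    means = new_means                                    # 2. recompute cluster means

print(means)  # close to the two generating centers (0, 0) and (4, 4)
```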
### Gaussian Mixture Model
**Gaussian Mixture Model (**`GMM`**):** clustering by mixture of Gaussian distributions.

- assumes $k$ Gaussians, estimate mean and possibly variance for each one
- parameter estimation:
- `Maximum likelihood` cannot be used because cluster assignment is unknown
- `Expectation-Maximization:`
1. compute expected cluster assignment given current parameter setting
2. estimate parameters given cluster assignment
3. iterate until convergence
### Number of Clusters
**Elbow method:** find the smallest number of clusters that still gives a good score

1. increase number of clusters for each run
2. plot a clustering evaluation metric
3. choose the number of clusters $k$ at the angle (elbow) of the plot
- Problem: **can be ambiguous**
**Average silhouette method:** idea, more clusters $\implies$ more homogeneous clusters

1. increase number of clusters for each run
2. plot average silhouette coefficient for different $k$
3. choose $k$ where the average silhouette coefficient is maximal
### Hierarchical clustering
Clustering does not need to be flat:
- `Top-down` approach (splitting)
- `Bottom-up` approach (aggregating)
**Dendrograms:** represent clusters by similarity score

👁🗨 **Similarity** between clusters can be given by `nearest-neighbor`, `furthest-neighbor`, or `average pairwise distance` between clusters elements.
🌞END🌞