---
tags: Master
---
# Machine Learning - SUNTO
[TOC]
## 1. Introduction [[slides](http://disi.unitn.it/~passerini/teaching/2019-2020/MachineLearning/slides/01_machine_learning/talk.pdf)] 🖖
**What is Machine Learning:**
> A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.
>
> T. M. Mitchell
`Learning problem:` task, performance measure, training experience.
### Designing a machine learning system
1. `Formalize the learning task:` choose an appropriate **performance measure**
2. `Collect data:` often the most **cumbersome** part
3. `Extract features:` to provide relevant input to the learning system
- **too few** can lower the performance because of missing relevant information
- **too many** can lower the performance because of the increased probability of having noisy and useless features
4. `Choose class of learning models:` bias-variance trade-off
5. `Train model:` improve generalization, **avoid overfitting**
6. `Evaluate model:` according to its ability to **generalize** to unseen examples
- can provide insights into the model weaknesses
- can imply comparing different models/learners
### Choice of Learning Algorithms
**information available:**
- **Full knowledge of probability distributions of data** $\Longrightarrow$
- `Bayesian decision theory` (Ch.7)
- **Form of probabilities known, parameters unknown** $\Longrightarrow$
- `Parameter estimation` from training data (Ch.8)
- **Form of probabilities unknown, training examples available** $\Longrightarrow$
- `discriminative methods:` directly learn a function predicting the desired output given the input, without modeling the input distribution
- `generative methods:` model the joint probability distribution of inputs and outputs
- Form of probabilities unknown, training examples unavailable (**only inputs**) $\Longrightarrow$
- `unsupervised methods:` clustering, dimensionality reduction (e.g. PCA), novelty detection
## 2. Decision Trees [[slides](http://disi.unitn.it/~passerini/teaching/2019-2020/MachineLearning/slides/02a_decision_trees/talk.pdf)] 🌲
- the **leaf** contains the label to be assigned to instances reaching it
- **logical formula** represented by the tree: the disjunction of all paths (conjunctions)

**Appropriate problems for decision trees:**
- binary or multiclass **classification tasks** (extensions to regressions also exist)
- instances represented as `attribute-value pairs`
- different explanations for the concept are possible (`disjunction`)
- some instances have `missing attributes`
- need for an `interpretable explanation` of the output
**Greedy top-down strategy for learning:**
$\forall$ node
1. choose `best attribute` to be evaluated
2. $\forall$ attribute value add a child
3. split node training set into children according to value of chosen attribute
`stop splitting` a node:
- if it contains examples from a single class (`pure`)
- if there are no more attributes to test
### Choosing the best attribute
#### Entropy
A measure of the amount of information contained in a collection of `instances` $S$ whose labels can take $c$ `possible values`.
$$
H(S) = -\sum\limits_{i=1}^{c} p_i \log_2(p_i)
$$
- in our case instances are `training examples` and values $i$ are `class labels`
- the entropy of a set of labeled examples **measures its label inhomogeneity**
#### Information gain
Expected reduction in entropy obtained by partitioning a `set` $S$ according to the value of a certain `attribute` $A$.
$$
\text{IG}(S, A) = H(S) - \sum\limits_{v\,\in\,Values(A)} \frac{|S_v|}{|S|} H(S_v)
$$
- an attribute with high information gain tends to produce homogeneous groups in terms of labels, thus favoring their classification
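A minimal Python sketch of the two formulas above; the toy dataset and the `outlook` attribute name are made up for illustration:
```python
# Entropy and information gain for a categorical attribute (toy sketch).
from collections import Counter
import math

def entropy(labels):
    """H(S) = -sum_i p_i * log2(p_i), computed over the class labels in S."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(examples, labels, attribute):
    """IG(S, A) = H(S) - sum_v |S_v|/|S| * H(S_v)."""
    n = len(labels)
    remainder = 0.0
    for value in set(x[attribute] for x in examples):
        subset = [y for x, y in zip(examples, labels) if x[attribute] == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

# toy usage: 'outlook' is a made-up attribute
X = [{"outlook": "sunny"}, {"outlook": "sunny"}, {"outlook": "rain"}, {"outlook": "rain"}]
y = ["no", "no", "yes", "yes"]
print(information_gain(X, y, "outlook"))  # 1.0: the attribute splits the classes perfectly
```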
### Issues in decision tree learning
- **`Overfitting`:** requiring pure leaves can imply very **complex trees**; pruning can be used, labeling the pruned node with the majority class.
- `pre-pruning:` decide pruning while learning the tree
- `post-pruning:` learn the full tree first
- **`Continuous valued attributes:`** need to be **discretized** in order to be used in internal nodes tests
- using the threshold producing the **highest information gain**
- **`Attributes with missing values`:** when attribute $A$ is to be tested at node $n$ and example $x$ has no value for it
- `simple solution:` assign to $x$ the **most common** attribute value among examples in $n$
- `complex solution:` propagate $x$ to each of the children of $n$, with a fractional value equal to the proportion of examples with the corresponding attribute value
### Random forests
**Training:**
1. given a training set of $N$ examples, **sample** $N$ examples **`with replacement`**
(i.e. the same example can be selected multiple times)
2. train a decision tree on the sample, selecting at each node $m$ `features at random among which to choose the best one`
3. repeat steps $1$ and $2$ $M$ times in order to generate a forest of $M$ trees
**Testing:**
1. test the example with each tree in the forest
2. return the `majority class` among the predictions
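One possible way to run the procedure above with an off-the-shelf implementation (scikit-learn); the dataset is just a toy example:
```python
# Random forest: M trees, m random features per split, bootstrap sampling.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,     # M trees
    max_features="sqrt",  # m features sampled at random at each node
    bootstrap=True,       # sample N examples with replacement
    random_state=0,
)
forest.fit(X_tr, y_tr)
print(forest.score(X_te, y_te))  # accuracy of the aggregated (voting) prediction
```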
## 3. K-nearest neighbors [[slides](http://disi.unitn.it/~passerini/teaching/2019-2020/MachineLearning/slides/02b_nearest_neighbour/talk.pdf)] 🎯
…
## 4. Linear algebra [[slides](http://disi.unitn.it/~passerini/teaching/2019-2020/MachineLearning/slides/03_linear_algebra/talk.pdf)] 📈
- Some linear algebra recap…
- PCA (p. 30)
## 5. Probability theory [[slides](http://disi.unitn.it/~passerini/teaching/2019-2020/MachineLearning/slides/04_probability_theory/talk.pdf)] 🎲
- Some probability theory recap… mainly how to play with some rules
- variables, distributions, theorems (Bayes), laws (central limit), …
## 6. Evaluation [[slides](http://disi.unitn.it/~passerini/teaching/2019-2020/MachineLearning/slides/19_evaluation/talk.pdf)] 🔬
**Performance evaluation is needed for:**
- `tuning hyperparameters` of learning method (e.g. type of kernel and parameters, learning rate of perceptron)
- evaluating quality of learned predictor
- computing statistical significance of difference between learning algorithms
The **training loss function** measures the cost paid for predicting $f(x)$ when the true output is $y$.
### Performance measures
- **Binary classification:**
- `ACC,Pre,Rec,F1` all measure performance of a classifier for a specific threshold
- `Precision-recall curve:` It is possible to change the threshold if interested in maximizing a specific performance (e.g. recall)
- **Multiclass classification:**
- more generic version of the previous one
- `ACC,Pre,Rec,F1` carry over to a **per-class measure** considering as negatives examples from other classes
- **Regression:** e.g. `root mean squared error` (RMSE)
- **Hold-out procedure:** computing the performance measure on the training set would be biased, so we need:
- `Validation set:` when used to estimate **performance of different algorithmic settings** (i.e. hyperparameters)
- `Test set:` when used to estimate **final performance of selected model**
- **K-fold cross validation**
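A small sketch of the idea with scikit-learn: a 5-fold cross validation of a placeholder model (`SVC` here is just an example classifier):
```python
# 5-fold cross validation: average performance over 5 train/test splits.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
scores = cross_val_score(SVC(kernel="rbf"), X, y, cv=5)  # one score per fold
print(scores.mean(), scores.std())  # estimated performance and its variability
```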
### Comparing learning algorithms
- **Hypothesis testing:** allows testing the statistical significance of a hypothesis

- **T-test:** the test statistic is given by the **standardized** (also called `studentized`) **mean**:
$$
t = \frac{\bar{X} - \mu_0}{\sqrt{\widetilde{Var}[\bar{X}]}}
$$
where $\widetilde{Var}[\bar{X}]$ is the `approximated variance` (using the unbiased sample variance)
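A hedged sketch of how such a test could be run in practice with `scipy`: a paired t-test on the per-fold scores of two algorithms (the score values below are invented):
```python
# Paired t-test on the per-fold scores of two learning algorithms.
from scipy import stats

scores_a = [0.81, 0.79, 0.84, 0.80, 0.83]  # algorithm A, one score per fold
scores_b = [0.78, 0.75, 0.80, 0.77, 0.79]  # algorithm B, same folds

t, p = stats.ttest_rel(scores_a, scores_b)  # t statistic of the score differences
print(t, p)  # small p-value -> the difference is statistically significant
```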
## 7. Bayesian decision theory [[slides](http://disi.unitn.it/~passerini/teaching/2019-2020/MachineLearning/slides/05_bayesian_decision/talk.pdf)] 📚
- 💥Bayesian decision theory allows taking **optimal decisions in a fully probabilistic setting**
- 💥It assumes **all relevant probabilities are known**
- it allows `providing upper bounds on achievable errors` and evaluating classifiers accordingly
- Bayesian reasoning **can be generalized** to cases when the probabilistic structure is not entirely known (see “Parameter estimation” chapter)

💠**Bayes rule** allows us to `predict the class given the input`, in other words, Bayes rule allows to compute the posterior probability as:
$$
posterior = \dfrac{likelihood × prior}{evidence}
$$

- **posterior:** $P(y|x)$ is the probability that class is $y$ given that $x$ was observed
- **likelihood:** $P(x|y)$ is the probability of observing $x$ given that its class is $y$
- **prior:** $P(y)$ is the prior probability of the class, without any evidence
- **evidence:** $P(x)$ is the probability of the observation, and by the `law of total probability` can be computed as the **sum of marginal probabilities over the possible classes**:
$$
P(x) = \sum^c_{i=1}P(x|y_i)P(y_i)
$$
**Bayes decision rule:** minimizes the probability that an example is incorrectly labeled: choose the hypothesis with the largest probability given the evidence.
- $y_B = \text{argmax}_{i∈\{1,...,c\}}g_i(x)$: picks the class whose `discriminant function` $g_i(x)$ is largest; there are many possible choices for the function:
- $g_i(x) = p(y_i|x) \propto p(x|y_i)P(y_i)$ (`Bayesian decision rule`)
- $P(x)$ in the denominator is ignored because it is a constant
- $g_i(x) = \ln p(x|y_i) + \ln P(y_i)$
- …
**Optimal rule:** probability of error given $x$
$$
P(\text{error}|x) = 1 − P(y_B|x)
$$
👁🗨 the `minimum error` is the complement of the maximum posterior probability
- Bayes decision rule maximizes $P(y_B|x)$, thus minimizing the probability of error
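A toy numeric example of the rule above (likelihoods and priors are made up):
```python
# Bayes decision rule on a single observation x with two classes.
priors = {"y1": 0.7, "y2": 0.3}        # P(y)
likelihoods = {"y1": 0.2, "y2": 0.6}   # p(x|y) for the observed x

evidence = sum(likelihoods[y] * priors[y] for y in priors)               # p(x)
posteriors = {y: likelihoods[y] * priors[y] / evidence for y in priors}  # P(y|x)
y_B = max(posteriors, key=posteriors.get)                                # argmax_i g_i(x)
print(posteriors)                 # {'y1': 0.4375, 'y2': 0.5625}
print(y_B, 1 - posteriors[y_B])   # predicted class and P(error|x)
```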
### Classifiers representation

The feature space is divided into `decision regions` $R_1, . . . , R_c$ such that:
$$
x ∈ \mathcal{R}_i \quad \text{if } g_i(x) > g_j(x) \quad ∀ j \ne i
$$
👁🗨 Each discriminant function dominates over the others in its own decision region.
### Discriminant functions for multivariate normal density

**Normal density function** over $d$ variables:
$$
f_X(x_1,\dots,x_d)= \frac{\exp\bigl(−\frac{1}{2}(x − µ)^tΣ^{−1}(x − µ)\bigr)}{\sqrt{(2π)^d|Σ|}}
$$
- $Σ$ is always **symmetric** and positive semi-definite
Choosing $g_i(x) = \ln p(x|y_i) + \ln P(y_i)$ as `discriminant function` for the normal density:
$$
g_i(x)=−\frac{1}{2}(x − µ_i)^t Σ^{−1}_i (x − µ_i) −\frac{d}{2} \ln 2π −\frac{1}{2} \ln |Σ_i| + \ln P(y_i)
$$
- 👁🗨 the first term comes from the numerator, the 2nd and 3rd from the denominator
- $− \frac{d}{2} \ln 2π$ is a constant, so it does not affect the $\text{argmax}()$
**Special cases of covariance matrix $Σ_i$:**
- case $Σ_i = σ^2I$
- features are `statistically independent`
- all features have `same variance` $σ^2$
- covariance determinant $|Σ_i| = σ^{2d}$ can be ignored being independent of $i$
- the discriminant function becomes (see the sketch after these cases):
$$
g_i(x) = − \frac{||x − µ_i||^2}{2σ^2}+ \ln P(y_i)
$$

- case $Σ_i = Σ$

- case $Σ_i$ arbitrary
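A sketch of the first case above ($Σ_i = σ^2I$): the discriminant reduces to a prior-corrected nearest-mean classifier. Means, priors and $σ^2$ below are made up:
```python
# Nearest-mean discriminant: g_i(x) = -||x - mu_i||^2 / (2 sigma^2) + ln P(y_i).
import numpy as np

means = {0: np.array([0.0, 0.0]), 1: np.array([3.0, 3.0])}  # mu_i per class
priors = {0: 0.6, 1: 0.4}                                   # P(y_i)
sigma2 = 1.0                                                # shared variance

def g(x, i):
    return -np.sum((x - means[i]) ** 2) / (2 * sigma2) + np.log(priors[i])

x = np.array([1.0, 1.2])
print(max(means, key=lambda i: g(x, i)))  # predicted class = argmax_i g_i(x)
```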

## 8. Parameter estimation [[slides](http://disi.unitn.it/~passerini/teaching/2019-2020/MachineLearning/slides/06_07_bayesian_learning/talk.pdf)] 👓
**Setting:**
- data is sampled from a **probability distribution** $p(x, y)$
- probability distribution $p$ **form is known**, but **parameters are unknown**
- training set $\mathcal{D} = \{(x_1, y_1), . . . ,(x_m, y_m)\}$ of examples sampled i.i.d. (**independent identically distributed**) according to $p(x, y)$
**Goal:** `estimate the unknown parameters` of $p$ from training data $\mathcal{D}$.
**i.i.d. sampling:**
- _independent:_ each example is sampled independently from the others
- _identically distributed:_ all examples are `sampled from the same distribution`
### Multi-class classification setting
- **divide** training set $\mathcal{D}$ based on `classes` (one for each)
- given a new example $x$ (not in training set) and $\mathcal{D}$, we compute the **posterior probability** of class $y_i$:
$$
P(y_i|x, \mathcal{D}) = \frac{p(x|y_i,\mathcal{D}_i)p(y_i|\mathcal{D})}{p(x|\mathcal{D})}
$$
👁🗨 We must estimate class-dependent parameters $θ_i$ for $p(x|y_i , \mathcal{D}_i)$.
### Maximum Likelihood vs Bayesian estimation
- **Maximum likelihood/a-posteriori estimation:** find the $\theta$ that maximizes the probability of the observed data, in other words, find the distribution parameters under which the given data is most likely
- assumes parameters $θ$ as `unknown fixed`
- obtained values are used to compute `probability for new examples` (update $\theta_i \forall x$)
- 👎optimization problem can get very complicated
- 👍as the number of training examples $\rightarrow \infty$, MLE is `unbiased` and achieves the `smallest possible variance`
- **Bayesian estimation:**
- assumes $θ_i$ parameters as:
- `random variables`
- with some `known prior distribution`
- observing examples turns prior distribution over parameters into a `posterior distribution`
- $p(x|y_i , \mathcal{D}_i) = \int_{θ_i} p(x, θ_i |y_i , \mathcal{D}_i)dθ_i$
👁🗨 $p(x|y_i , \mathcal{D}_i) ≈ p(x|θ_i)$
### Maximum-likelihood estimation
$\theta_i^*$ is the set of best parameters that maximizes the probability of a dataset of class $i$.
- **Maximum a-posteriori estimation:** assumes a `prior distribution` for the parameters $p(θ_i)$ is available
- $θ^∗_i = \text{argmax}_{θ_i} p(θ_i |\mathcal{D}_i , y_i) = \text{argmax}_{θ_i} p(\mathcal{D}_i , y_i |θ_i)p(θ_i)$
- **Maximum likelihood estimation:** `no assumption` about prior distributions for parameters
- $θ^∗_i= \text{argmax}_{θ_i} p(\mathcal{D}_i , y_i |θ_i)$
- given $n$ inputs of class $y$:
- $θ^* = \text{argmax}_θ p(\mathcal{D}|θ) = \text{argmax}_θ \prod \limits ^n_{j=1} p(x_j |θ)$
- **Maximum log-likelihood estimation:** usually easier to maximize
- $θ^∗ = \text{argmax}_θ \ln p(\mathcal{D}|θ) = \text{argmax}_θ \sum \limits ^n_{j=1} \ln p(x_j |θ)$
👁🗨 Each class $y_i$ is treated **independently**; replace $y_i , \mathcal{D}_i → \mathcal{D}$ for `simplicity`.

💥 Maximum likelihood estimates for Gaussian parameters are simply their empirical estimates over the samples:
- Gaussian mean is the `sample mean`
- Gaussian covariance matrix is the mean of the `sample covariances`
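A quick numpy check of this fact on randomly generated data (the true mean and covariance below are made up):
```python
# ML estimates for a multivariate Gaussian = empirical mean and covariance.
import numpy as np

rng = np.random.default_rng(0)
D = rng.multivariate_normal(mean=[1.0, -2.0], cov=[[2.0, 0.3], [0.3, 1.0]], size=500)

mu_hat = D.mean(axis=0)                          # sample mean
Sigma_hat = np.cov(D, rowvar=False, bias=True)   # MLE covariance (divides by n)
print(mu_hat)     # close to [1, -2]
print(Sigma_hat)  # close to [[2, 0.3], [0.3, 1]]
```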
### Bayesian estimation
Predictions for new examples are obtained integrating over all possible values for the parameters:
$$
p(x|\mathcal{D}) = \int_θ p(x, θ|\mathcal{D})dθ = \int p(x|θ)p(θ|\mathcal{D})dθ
$$
- $p(x|θ)$ can be **easily computed** because we have both form and parameters of distribution (Gaussian,…)
- need to **estimate the parameter posterior density** given the training set:
$$
p(θ|\mathcal{D}) = \frac{p(\mathcal{D}|θ)p(θ)}{p(\mathcal{D})}
$$
- 👁🗨 $p(\mathcal{D})$ is a constant independent of $θ$, so it has no influence on the Bayesian decision
#### Bayesian estimation cases
- **Univariate normal case:** unknown $\mu$, known $σ^2$
- $p(x|\mu) ∼ N(\mu, σ^2 )$
- Mean posterior distribution for varying sample size: the posterior $p(\mu|\mathcal{D})$ sharpens around the sample mean as $n$ grows (see the sketch after this list)
- **Multivariate normal case:** unknown $\mu$, known $\Sigma$
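A sketch of the univariate case above, using the standard closed-form posterior for this conjugate setting (prior and data values are made up):
```python
# Posterior over mu with known sigma^2 and Gaussian prior N(mu0, sigma0^2).
import numpy as np

mu0, sigma0_sq = 0.0, 4.0      # prior p(mu)
sigma_sq = 1.0                 # known variance of p(x|mu)
data = np.array([2.1, 1.8, 2.4, 2.0, 1.9])

n, xbar = len(data), data.mean()
mu_n = (n * sigma0_sq * xbar + sigma_sq * mu0) / (n * sigma0_sq + sigma_sq)
sigma_n_sq = (sigma0_sq * sigma_sq) / (n * sigma0_sq + sigma_sq)
print(mu_n, sigma_n_sq)  # the posterior sharpens around the sample mean as n grows
```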

### Sufficient statistics
**Statistic:** any function on a set of samples $\mathcal{D}$.
- a statistic $\textbf{s} = \phi(\mathcal{D})$ is sufficient for some parameters $θ$ if:
$$
P(\mathcal{D}|\textbf{s}, θ) = P(\mathcal{D}|\textbf{s})
$$
- a `sufficient statistic` allows **compressing a sample** $\mathcal{D}$ into (possibly few) values
👁🗨 Sample `mean` and `covariance` are **sufficient statistics** for true mean and covariance of the **Gaussian distribution**.
### Conjugate priors
A prior distribution $p(θ)$ **is a conjugate prior for a likelihood function** $p(x|θ)$ if the posterior distribution $p(θ|x)$ is in the `same family` as the prior $p(θ)$.

## 9. Bayesian Networks [[slides](http://disi.unitn.it/~passerini/teaching/2019-2020/MachineLearning/slides/08_bayesian_networks/talk.pdf)] 🖇
Why:
- give `intuitive understanding` of the probabilistic model
- `discover properties` of the model by inspecting the graph (conditional independences, …)
- represent multiple probability distributions with the same graph, `abstracting from the quantitative aspects`
**Independence recap:**
- $a ⊥ b | ∅$ : two variables $a, b$ are `independent` if: $p(a, b) = p(a)p(b)$
- $a ⊥ b | c$ : two variables $a, b$ are `conditionally independent` given $c$ if: $p(a, b|c) = p(a|c)p(b|c)$

**Factorized probability:** $p(x_1,...,x_m)= \prod\limits_{i=1}^m p(x_i|Pa_{x_i})$
- example: the `factorization` of a 7-node graph $\mathcal{G}$ could be $p(x_1,...,x_7)=p(x_1)p(x_2)p(x_3)p(x_4|x_1,x_2,x_3)p(x_5|x_1,x_3)p(x_6|x_4)p(x_7|x_4,x_5)$
👁🗨 **Independence assumptions:** each variable is independent (⊥) of its non-descendants given its parents.
- $\mathcal{I}_\ell(\mathcal{G}) = \{∀i \ \ x_i ⊥ \text{NonDescendants}_{x_i}|Parents_{x_i} \}$
$\mathcal{G}$ is an **independency map** (`I-map`) for $p$ if $p$ satisfies the local independences in $\mathcal{G}$:
$$
\mathcal{I}_\ell(\mathcal{G}) ⊆ \mathcal{I}(p)
$$
- $p$ is the **joint distribution** over variables $\mathcal{X}$
- $\mathcal{I}(p)$ is the `set of independence assertions` holding in $p$
- **Minimal I-map:** cannot be reduced
- a minimal I-map for $p$ does not necessarily capture all the independences in $p$
**P-map (**perfect map**):** $\mathcal{I}(G) = \mathcal{I}(p)$, captures all and only the independences of $p$
- not all distributions have a P-map, some cannot be modeled exactly by the BN formalism
**I-equivalence:** $\mathcal{I}(G) = \mathcal{I}(G')$, if the graphs have same `skeleton` (undirected graph) and same set of `v-structures`

### D-Separation
Instead of `verifying independence` assumptions by repeated applications of the sum and product rules, d-separation rules can be used to verify them directly from the graph.
**Markov Condition:** each variable is conditionally independent of its `non-descendants` given its `parents`.
- **tail-to-tail:**
- `joint` dist. $p(a,b,c)= p(a|c)p(b|c)p(c)$
- ⬜ $c$ not given $\implies$ $a$ and $b$ not independent
- 🟪 $c$ given $\implies$ $a$ and $b$ conditionally independent
- **head-to-tail:**
- `joint` dist. $p(a,b,c)= p(b|c)p(c|a)p(a) = p(b|c)p(a|c)p(c)$
- ⬜ $c$ not given $\implies$ $a$ and $b$ not independent
- 🟪 $c$ given $\implies$ $a$ and $b$ conditionally independent
- **head-to-head:**
- `joint` dist. $p(a,b,c)= p(c|a,b)p(a)p(b)$
- ⬜ $c$ not given $\implies$ $a$ and $b$ independent
- 🟪 $c$ given $\implies$ $a$ and $b$ not conditionally independent

👁🗨d-separation $\implies$ conditional independence between two variables.
**Markov Blanket:** a variable becomes `conditionally independent of any other variable` given the evidence of its Markov blanket.
- $\text{MarkovBlanket}(X) = \{\text{Parents}(X), \text{Children}(X), \text{Parents}(\text{Children}(X))\}$
## 10. Inference in BN [[slides](http://disi.unitn.it/~passerini/teaching/2019-2020/MachineLearning/slides/09_inference_in_bn/talk.pdf)] 💡
**Inference:** in practice, it is the execution of a query.
The most common query type is computing the `posterior probability` of a subset $X$ of the non-observed variables given the observations
- $p(X|E = e)$
- working with joint probabilities like $p(X|E = e) = \frac{p(X, E = e)}{p(E = e)}$ is $O(2^n)$ in the number of variables, more efficient inference is needed
Each outgoing message is obtained by multiplying the incoming message by the “local” probability and summing over the node values:

- $µ_α(X_n) = \sum\limits_{X_{n−1}} p(X_n|X_{n−1})µ_α(X_{n−1})$
- $µ_β(X_n) = \sum\limits_{X_{n+1}} p(X_{n+1}|X_n)µ_β(X_{n+1})$
**Marginal probability:** $p(X_n) = µ_α(X_n)µ_β(X_n)$
- is the product of the contributions coming from both ends
**Marginal probability with evidence:** if there are observed nodes, simply use their observed values instead of summing over all possible values when computing their messages. The message passing procedure computes the `joint probability` of the variable and the evidence.

**Conditional probability given evidence:** the message passing procedure computes the joint probability of the variable and the evidence, then normalized.
- $p(X_n|X_e = x_e) = \dfrac{p(X_n, X_e = x_e)}{ \sum_{x_n} p(X_n, X_e = x_e)}$
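A small numpy sketch of the message passing equations above on a binary chain $X_1 \rightarrow X_2 \rightarrow X_3$ (the CPTs are made up), computing $p(X_2|X_3=1)$:
```python
# Forward/backward messages on a chain and conditioning on evidence X3 = 1.
import numpy as np

p_x1 = np.array([0.6, 0.4])              # p(X1)
T = np.array([[0.7, 0.3], [0.2, 0.8]])   # T[i, j] = p(X_{n+1}=j | X_n=i)

# forward messages: mu_alpha(X_n) = sum_{X_{n-1}} p(X_n|X_{n-1}) mu_alpha(X_{n-1})
alpha1 = p_x1
alpha2 = alpha1 @ T
alpha3 = alpha2 @ T                      # = marginal p(X3) without evidence

# backward messages with evidence: only the observed value X3 = 1 is kept
beta3 = np.array([0.0, 1.0])
beta2 = T @ beta3                        # = p(X3 = 1 | X2)

joint2 = alpha2 * beta2                  # joint p(X2, X3 = 1)
print(joint2 / joint2.sum())             # conditional p(X2 | X3 = 1)
print(alpha3)
```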
### Exact Inference on Trees

Efficient inference algorithms only for `tree-structured models`: only one possible path between two nodes.
- **Sum-product:** (_note: explained from here on using factor graphs_)
- for `tree-structured graphs`
- `message-passing` algorithm
- applicable to both `directed models` (Bayesian networks) and `undirected ones` (Markov networks)
- **Max-product:** to find the **most probable configuration**
- for `tree-structured graphs`
- `message-passing` algorithm
- replace sum with max in the sum-product algorithm
- products of many small probabilities can lead to underflow problems $\rightarrow$ use $log$
- **Junction tree algorithm:**
- `message-passing` algorithm
- extension on `generic graphs`, but has **exponential complexity**
#### Factor Graphs
**Factor graph:** a better graphical representation for describing inference algorithms.

- **Factor** $f(X)$: a function over a set of variables $X= \{ X_1, X_2, . . . , X_k \}$ called `scope`, mapping each possible set of inputs to a real number
- can be thought of as a table of values (not necessarily probabilities)
- in `Bayesian networks`, the factors correspond to **conditional probability distributions**
- 👁🗨_Conditional Probability Tables_ (`CPTs`) are factors
- in `Markov Random Fields`, the factors encode an **unnormalized distribution**
**Factor message:** starts from a factor, is the `product of node messages` incoming to the factor, `summed over all the possible values` of the factor variables other than $X$ (marginalization)
- $µ_{f_s→X} (X) = \text{marginalized } f_s \cdot \prod\limits_{X_{neighbour}\backslash X} (\text{incoming node messages})$
- `factor marginalization:` locally eliminates a set of variables from a factor, $f(X)=\sum\limits_{X_1}...\sum\limits_{X_n}f(X,X_1,...,X_n)$
- example: $f_s(X,Y)=P(X|Y)$ $\longrightarrow$ $f_{s}(X) = \sum\limits_YP(X|Y)$

**Node message:** starts from a node, is the `product of factor messages` incoming to the node
- $µ_{X_m→f_s} (X_m) = \prod\limits_{f_{neighbour} \backslash f_s}(\text{incoming factor messages})$
INFERENCE EXAMPLE WITH FACTOR GRAPH:

**Variable elimination algorithms:** eliminate, one by one, the variables irrelevant for the inference query, summing them out into intermediate **factor functions**.
- uses `dynamic programming`, $O(nk^2)$
### Approximate inference
A number of techniques for approximate inference exist:
- **loopy belief propagation:** `message passing` on the original graph even if it contains loops (it is not guaranteed to provide an exact solution)
- set a `message passing schedule`, information flows multiple times because of the loops, stop on convergence
- **variational methods:** deterministic approximations, assuming the posterior probability (`given the evidence`) factorizes in a particular way
- **sampling methods:** approximate posterior is obtained `sampling` from the network
## 11. Learning BN [[slides](http://disi.unitn.it/~passerini/teaching/2019-2020/MachineLearning/slides/10_learning_bn/talk.pdf)] 👨🏫
**Goal:** `induce a Bayesian Network` that best matches a dataset $\mathcal{D}$, given some prior information or none (BN fragment, prior probabilities, ...).

### Complete Data and Known Structure Case
**Parameter estimation:** full BN + full data $\mathcal{D}$ to `estimate conditional probability tables` (one table for each node).
- estimator approaches:
- **MLE estimator (**frequentist**):**
- tends to `over-fit` the training set
- configurations not appearing in the training set will receive `zero probability`
- **Bayesian estimator**
### Incomplete Data and Known Structure Case
We need `approximate methods` to deal with incomplete data.
**Expectation-Maximization (**`EM`**):** iterative method
- **e-step:** fill in missing data by inferring it `using current parameters` (solve an inference problem to get expected counts)
- **m-step:** `recompute parameters` maximizing likelihood of the complete dataset (using expected counts)
- can either use ML (Maximum Likelihood) or MAP (Maximum A Posteriori)
### Learning The Structure
**Learning graphical models structures:**
- `score-based`: score each possible structure, and find the best one
- `constraint-based`: test conditional independencies on the data and construct a model satisfying them
- `model-averaging`: weighted average over all possible structures
- **problem:** averaging over all possible structures is too expensive
## 12. Naive Bayes Classifier [[slides](http://disi.unitn.it/~passerini/teaching/2019-2020/MachineLearning/slides/11_naive_bayes/talk.pdf)] 🐵
Named after its `naive independence assumption`: all the attributes are **conditionally independent given the class**.
- to `simplify calculations`
- in practice the assumption is violated $\implies$ `high bias`
- bias of the parameter estimates can be reduced by improving the probability estimates
- but naive Bayes has `low variance`
**Classification rule:** $y^*=\text{argmax}_{y_i \in Y}P(y_i|a_1,...,a_m) = \text{argmax}_{y_i \in Y}P(y_i)P(a_1,...,a_m|y_i) \ \ \underrightarrow{\text{cond. ind.}} \ \ \text{argmax}_{y_i \in Y}P(y_i)\prod\limits_{j=1}^m P(a_j|y_i)$
- the task is predicting the MAP (Maximum A Posteriori) target value given the instance
- classification $\implies$ use $\text{argmax}$
### Single Multinomial Distribution Case
If all attribute values come from the same distribution, the probability of an attribute value given the class can be modeled as a multinomial distribution over the $K$ possible values.
**Classifier:** $y^*= \text{argmax}_{y_i \in Y}P(y_i)\prod\limits_{j=1}^m P(a_j|y_i) = \text{argmax}_{y_i \in Y}\dfrac{|D_i|}{|D|}\prod\limits_{j=1}^{|X|}\prod\limits_{k=1}^{K} \theta_{ky_i}^{z_k(x[j])}$
**Maximum-likelihood estimator (MLE)** for $\theta_{kc}=\dfrac{N_{kc}}{N_c}$, where $N_{kc}$ is the number of times the value $v_k$ was observed among training examples of class $c$.
**Bayesian posterior distribution estimator** for $\theta_{kc}=\dfrac{N_{kc}+\alpha_{kc}}{N_c+\alpha_c}$
👁🗨 With multiple distributions, attributes from different distributions have to be considered separately for parameter estimation.
- $\theta_{jkc}$ represents the probability of observing value $v_{jk}$ for the $j$-th attribute given class $c$
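A minimal sketch of such a classifier for categorical attributes, with a simplified Laplace-style smoothing in place of the $\alpha_{kc}$ priors; the toy data is made up:
```python
# Categorical naive Bayes: argmax_y log P(y) + sum_j log P(a_j | y).
import math
from collections import Counter, defaultdict

def train(X, y):
    classes = Counter(y)                     # N_c, used for the priors |D_c|/|D|
    counts = defaultdict(Counter)            # (attribute j, class c) -> value counts
    for x, c in zip(X, y):
        for j, v in enumerate(x):
            counts[(j, c)][v] += 1
    return classes, counts

def predict(x, classes, counts, alpha=1.0):
    total = sum(classes.values())
    best, best_score = None, float("-inf")
    for c, n_c in classes.items():
        score = math.log(n_c / total)        # log prior
        for j, v in enumerate(x):
            n_values = len(counts[(j, c)]) or 1
            score += math.log((counts[(j, c)][v] + alpha) / (n_c + alpha * n_values))
        if score > best_score:
            best, best_score = c, score
    return best

X = [("sunny", "hot"), ("sunny", "mild"), ("rain", "mild"), ("rain", "cool")]
y = ["no", "no", "yes", "yes"]
print(predict(("rain", "hot"), *train(X, y)))  # 'yes'
```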
## 13. Linear discriminant functions [[slides](http://disi.unitn.it/~passerini/teaching/2019-2020/MachineLearning/slides/14_linear_discriminant/talk.pdf)] 📉

- `generative learning`: assumes knowledge of the distribution governing the data
- 👍 more flexible models
- 👍 can perform arbitrary inference tasks
- 👁🗨 generative classifiers:
- Naive Bayes
- Bayesian networks
- Markov random fields
- Hidden Markov Models (HMM)
- `discriminative learning`: focuses on directly modeling the **discriminant function**
- 👍 feasible even when modeling data distribution is difficult due to data complexity
- 👍 focuses parameters on the desired goal
- 👎 less flexible models
- 👎 can't do arbitrary inference tasks
- 👁🗨 discriminative classifiers:
- Logistic regression
- Support Vector Machine (SVM)
- Traditional neural networks
- Nearest neighbor
- Conditional Random Fields (CRF)s
**Linear discriminant function:** $f(x) = w^T x + w_0$, a linear combination of the example features; $f(x)=0$ defines the decision boundary
- $w_0$ is the `threshold` (**bias**)
**Linear binary classifier:** $f(x) = \text{sign}(w^T x + w_0)$

- the boundary is a `hyperplane` $H$
- weight vector $w$ is orthogonal to $H$
- **functional margin:** $f(x)$ value $\rightarrow$ **prediction confidence**
- **geometric margin:** $r^x = \dfrac{f(x)}{||w||}$, distance from $x$ to $H$
- 👁🗨 the further a point is from the hyperplane, the higher the prediction confidence
### Perceptron
A **Perceptron** (neural node) is composed of:
- `linear combiner:` computes the weighted sum of the inputs ($v_k$ in the slides)
- `weights:` one for each input
- `bias` value
- `activation function:` the final output, called $\phi$

**Parameter learning:** model weights and biases must be adjusted; we need a procedure to optimize them given an error (loss) function
- `gradient descent`:

...
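A minimal sketch of perceptron training: the classic error-driven update $w \leftarrow w + \eta\, y\, x$ on misclassified examples (a stochastic gradient step on the perceptron loss); the toy linearly separable data is made up:
```python
# Perceptron learning on a small linearly separable dataset.
import numpy as np

X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])

w, w0, eta = np.zeros(2), 0.0, 0.1
for epoch in range(100):
    errors = 0
    for xi, yi in zip(X, y):
        if yi * (w @ xi + w0) <= 0:   # misclassified (or on the boundary)
            w += eta * yi * xi        # gradient step on the perceptron loss
            w0 += eta * yi
            errors += 1
    if errors == 0:                   # converged: every point correctly classified
        break

print(w, w0)
print(np.sign(X @ w + w0))            # predictions: [ 1.  1. -1. -1.]
```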
## 14. Linear Support Vector Machines [[slides](http://disi.unitn.it/~passerini/teaching/2019-2020/MachineLearning/slides/15_svm/talk.pdf)] 📊
- linear SVMs have linear complexity in the number of dimensions
- thanks to margin maximization they tend not to suffer much from overfitting
**Support vectors:** points lying on the `minimal confidence` hyperplanes (the margin hyperplanes).
- SVM are generally `sparse` (few support vectors)
- only they contribute to the `final decision`
### Hard margin SVM

- the constraints guarantee that all points are correctly classified
- minimizing $||w||^2$ corresponds to maximizing the (squared) margin
**Constrained optimization:**
- minimal confidence hyperplane: $y_i(w^T x_i + w_0 ) = 1$

**Lagrange multiplier (**$\lambda$**):** idea: look for the point of tangency between the contour lines of the objective function and the constraint function.
$$
\nabla f(w) = \lambda \nabla g(w)
$$
- after fixing the constraint boundary $g(w)=c$, solving the system formed by this and the previous formula gives the $\lambda$ and $w$ values of the best solution(s)
**Lagrangian:** nice compact way to solve the constrained optimization problem
$$
\mathcal{L}(w,\lambda)=f(w)-\lambda \big(g(w)-c\big)
$$
- `equivalent solution:` find where $\nabla \mathcal{L}=0$ (solution is a `saddle point`)
- consider $(w^*,\lambda^*)$ as the tuple that gives the best solution for $f(w)$
- after taking partial derivatives w.r.t. the primal variables, setting them to zero, and substituting back into $\mathcal{L}$, what remains is $\mathcal{L}(\lambda)$, the **equivalent constrained problem** (the `dual formulation`), subject to constraints on $\lambda$
**Karush-Kuhn-Tucker conditions (**`KKT`**):** a constrained optimization problem can be addressed by converting it into an unconstrained problem with the same solution...
#### Primal formulation

- maximizing the margin corresponds to minimizing $||w||$
- `constraints` tell to have all predictions out of the boundaries (always correct classification)
- $y_i=\pm1$, each data point must lie on the correct side of the margin
#### Dual formulation
$\mathcal{L}(\lambda)=$ (the slides use $\alpha$ for the multipliers):
$$
\mathcal{L}(\alpha) = \sum\limits_{i=1}^{m}\alpha_i - \frac{1}{2}\sum\limits_{i,j=1}^{m}\alpha_i\alpha_j y_i y_j x_i^T x_j \qquad \text{subject to } \alpha_i \ge 0, \ \sum\limits_{i=1}^{m}\alpha_i y_i = 0
$$
Decision function: $f(x)=w^Tx+w_0= \sum\limits _{i=1}^{m}\alpha_iy_ix_i^Tx+w_0$
- where $w=\sum\limits _{i=1}^{m}\alpha_iy_ix_i$
- 👍 `decision function:` linear combination of dot products
- **dot product** is kind of similarity between points
### Soft margin SVM
Address the problem of `outliers`.

#### Primal formulation

**Slack variables ($ξ$):** represent penalty for example $i$ if it does not satisfy margin constraints
**Regularization ($C$):** the coefficient $C$ trades off margin maximization (regularization) and margin violations; regularization is a standard approach to **prevent overfitting** (smaller weights, lower complexity, ...)
**Hinge loss:** zero if the example satisfies the margin constraint, otherwise it grows linearly with the violation
- $\ell\big(y_i , f(x_i)\big) = |1 − y_i f(x_i)|_+ = \max\big(0, 1 − y_i f(x_i)\big)$
- the loss value increases linearly as the data point moves further into the wrong side of the margin

- the hinge loss corresponds to the slack variable $ξ_i$
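The slides solve the problem through the dual; as an alternative illustration only, here is a simple subgradient descent on the primal objective $\frac{1}{2}||w||^2 + C\sum_i \ell(y_i, f(x_i))$, just to show the role of the hinge loss and of $C$ (toy data is made up):
```python
# Subgradient descent on the soft-margin primal (illustrative, not the dual solver).
import numpy as np

X = np.array([[2.0, 2.0], [1.0, 2.5], [-1.0, -1.0], [-2.0, -1.5], [0.5, -0.2]])
y = np.array([1, 1, -1, -1, 1])       # the last point plays the role of an outlier

w, w0, C, eta = np.zeros(2), 0.0, 1.0, 0.01
for _ in range(2000):
    margins = y * (X @ w + w0)
    viol = margins < 1                 # points with non-zero hinge loss
    grad_w = w - C * (y[viol, None] * X[viol]).sum(axis=0)
    grad_w0 = -C * y[viol].sum()
    w, w0 = w - eta * grad_w, w0 - eta * grad_w0

slack = np.maximum(0, 1 - y * (X @ w + w0))  # hinge losses = slack variables
print(w, w0, slack)
```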
#### Dual formulation

## 15. Non-linear Support Vector Machines [[slides](http://disi.unitn.it/~passerini/teaching/2019-2020/MachineLearning/slides/16_nonlinear_svm/talk.pdf)] 🎢
❔`Problem:` **linear SVMs can only address linearly separable problems**
❕`Solution:` **map** input examples in a higher dimensional feature space and do linear classification in this higher dimensional space
- a linear separation in feature space corresponds to a non-linear separation in input space
- $\phi:\mathcal{X}\rightarrow\mathcal{H}$ maps inputs to higher dimensions
- SVM: $f(x) = w^T \phi(x) + w_0$

**$\epsilon$-insensitive loss:** zero loss within an $\epsilon$ margin around the target; the loss increases linearly only outside it

**Primal formulation:**

- 💭 MEMO: the slack variable $ξ$ is the loss function
- one constraint for every side of the margin
**Dual formulation:**


**Smallest Enclosing Hypersphere:** one-class classification

## 16. Kernel Machines [[slides](http://disi.unitn.it/~passerini/teaching/2019-2020/MachineLearning/slides/17_kernel_machines/talk.pdf)] 🥜
**Kernel Trick:** find the `linear` decision hyperplane in higher dimensions than input without actually transforming the data, possible because SVMs need only to work with dot products
- SVMs don’t need the actual vectors to work, dot products between them are sufficient; the expensive computation of the new high-dimensional features can be avoided
- 💥 only change the dot product to that of the desired space
Kernels are a **generalization of dot products** to arbitrary domains: they `make the problem linearly separable in feature space`, so we can use linear SVMs.
- a `dot product` can be seen as a measure of similarity between objects
- it is possible to design kernels over structured objects like sequences, trees or graphs
- ⚠ **Careful:** mapping to higher dimensions can overfit the model.

**Basic Kernels:**
- linear: $k(x, x' ) = x^T x'$
- polynomial: $k_{d,c}(x, x' ) = (x^T x' + c)^d$

**Homogeneous Kernel:** when $k(x, x' ) = (x^T x' )^d$
**Valid Kernel:** when $k(x, x' ) = \phi(x)^T \phi(x' )$
- how to check validity:
- prove its `positive definiteness` (difficult)
- find a corresponding `feature map`
- use `kernel combination` properties
**Gram matrix:** the matrix of kernel values for every pair of inputs in a dataset: $K_{ij} = k(x_i, x_j)$
**Positive definite Kernel:** function that gives a positive definite Gram matrix
- 👁🗨 is `necessary` and `sufficient` condition for a kernel to correspond to a dot product of some feature map $\phi$
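A small check of these notions on toy data: build the Gram matrix of a polynomial kernel and verify that its eigenvalues are non-negative, as expected for a valid kernel:
```python
# Gram matrix of a polynomial kernel and a positive semi-definiteness check.
import numpy as np

def poly_kernel(x, z, c=1.0, d=3):
    return (x @ z + c) ** d            # k(x, x') = (x^T x' + c)^d

X = np.array([[1.0, 0.0], [0.5, 0.5], [-1.0, 2.0], [0.0, -1.0]])
K = np.array([[poly_kernel(xi, xj) for xj in X] for xi in X])  # K_ij = k(x_i, x_j)

eigvals = np.linalg.eigvalsh(K)        # symmetric matrix -> real eigenvalues
print(K)
print(eigvals, (eigvals >= -1e-9).all())  # all >= 0 (up to numerics)
```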
**Kernel Combinations:** allows to `design complex kernels` on structures from simpler ones.
- 💥 easiest way to build a **valid kernel**
- `sum`: $(k_1 + k_2)(x, x' ) = k_1(x, x' ) + k_2(x, x' ) = \phi_1(x)^T Φ_1(x' ) + \phi_2(x)^T Φ_2(x')$
- `product`: $(k_1 × k_2)(x, x' ) = k_1(x, x' )k_2(x, x' ) = \sum\limits^n_{i=1} \phi_{1i}(x)\phi_{1i}(x' ) \sum\limits^m_{j=1} \phi_{2j}(x)\phi_{2j}(x')$
- `linear combination`: $k_{sum}(x, x' ) = \sum\limits^K_{k=1} β_k k_k (x, x' )$
- amounts to performing _kernel learning_
- `decomposition:` ...
- `convolution`: ...
- `set kernel`: ...
- `normalization`: ...
- `composition`: ...
## 17. Kernels on Structures [[slides](http://disi.unitn.it/~passerini/teaching/2019-2020/MachineLearning/slides/18_kernels/talk.pdf)] 🔹
- kernels allow `generalizing the notion of dot product` (i.e. similarity) to arbitrary (non-vector) spaces
- 👁🗨 **decomposition kernels** suggest a constructive way to build kernels considering parts of objects
### Sequences
**Spectrum kernel:** gives an idea of sequence similarity
- the dot product between the feature space vectors is the similarity value
- **feature space:** space of all possible $k$-grams (substrings of length $k$)

- 👎`problem:` feature space representation tends to be very sparse $\implies$ an example is only similar to itself
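A minimal sketch of the spectrum kernel: count the $k$-grams of each sequence and take the dot product of the count vectors (the example strings are made up):
```python
# Spectrum kernel: dot product in the space of k-gram counts.
from collections import Counter

def spectrum(s, k):
    return Counter(s[i:i + k] for i in range(len(s) - k + 1))  # k-gram counts

def spectrum_kernel(s, t, k=3):
    a, b = spectrum(s, k), spectrum(t, k)
    return sum(a[g] * b[g] for g in a)   # dot product of the two feature vectors

print(spectrum_kernel("GATTACA", "TACAGAT"))  # 3 shared 3-grams: GAT, TAC, ACA
print(spectrum_kernel("GATTACA", "CCCCGGG"))  # 0: no common 3-gram
```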
**Mismatch string kernel:** the feature for a given $k$-gram counts all the $k$-grams in the sequence that match it with at most $m$ mismatches
- allows `approximate` matches between $k$-grams
- 👍 denser feature map
Spectrum vs Mismatch string:

### Trees
**Subset tree kernel:** corresponds to a feature map of all subset trees
- the result of the dot product is the number of all the matching sub-trees

- for each pair of matching non-pre-terminal nodes, the count of matching subset trees is computed recursively as $C(n_i , n'_j) = \prod\limits^{nc(n_i)}_{j=1} \big(1 + C(ch(n_i , j), ch(n'_j , j))\big)$
- $nc(n_i)$ is the number of children of $n_i$
- $ch(n_i,j)$ is the $j$-th child of $n_i$
- 👁🗨 kernel value depends on tree size $\implies$ normalize!
- `dominant diagonal problem:` different trees rarely share large identical portions, so each tree's similarity to itself is much higher than to any other tree
### Graphs
**Bag of subgraphs:** decompose the graph into one feature for each possible subgraph up to a certain size

- feature value is its number of occurrences
**Walk kernels:** give the number of common walks of length $n$ between two graphs
1. $n$-th power of the adjacency matrix gives the number of walks of length $n$ between any pair of nodes
2. construct direct product graph of $G$ and $G'$

3. count walks in product graph (because each walk is common to the original graphs)
**Weisfeiler-Lehman graph kernel:** based on the Weisfeiler-Lehman test of graph isomorphism
- iteratively relabels each node by compressing its label together with the labels of its neighbours, then compares the resulting label counts between the two graphs

## 18. Deep Networks [[slides](http://disi.unitn.it/~passerini/teaching/2019-2020/MachineLearning/slides/21_nn/talk.pdf)] 🕸
Recap:
- **Perceptron**
- models `only linear` functions
- **Kernel Machines**
- `non-linearity` given by kernels
- need **kernel selection** phase
- **Multilayer Perceptron (**`MLP`**)**
- Input layer (input features)
- $\ge1$ hidden layers in the middle (learned features)
- output layer (predictions)
**Activation function** for neuron output after the linear combination of input weights and nodes:
- `Threshold activation:` $0$ if negative, $1$ if positive
- not suitable for gradient descent
- `Sigmoid:` has an "S" shape, a smooth version of the threshold activation
- decision threshold at $0.5$
- `Softmax:` for **multi-class classification**, final layer has one output for each possible class
- decision: highest scoring class
**Loss Functions:**
- binary classification $\longrightarrow$ `cross entropy`
- multi-class classification $\longrightarrow$ `cross entropy`
- regression $\longrightarrow$ `mean squared error`
👁🗨 Minimizing the cross entropy $=$ maximizing the likelihood.
💥 **Gradient backpropagation:**
$$
\frac{\partial E}{\partial W_j} = \frac{\partial E}{\partial \phi_j}\frac{\partial \phi_j}{\partial W_j}
$$
Where:
- $\phi_i$ is the output of the $i$-th node
- $\dfrac{\partial E}{\partial W_j}$ is the computed error for the weight $W_j$
- product between how much the output function $\phi_j$ affects the error and how much the weight $W_j$ affects the output
- $\dfrac{\partial E}{\partial \phi_j}$ is the computed error for the output $\phi_j$
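A tiny numeric check of this chain rule for a single sigmoid unit with squared error $E=\frac{1}{2}(\phi - t)^2$; the values of $x$, $W$ and $t$ are made up:
```python
# Backpropagated gradient dE/dW for one sigmoid neuron, checked numerically.
import math

x, W, t = 1.5, 0.8, 1.0                 # input, weight, target
v = W * x                               # linear combination
phi = 1.0 / (1.0 + math.exp(-v))        # sigmoid output

dE_dphi = phi - t                       # how much the output affects the error
dphi_dv = phi * (1.0 - phi)             # sigmoid derivative
dE_dW = dE_dphi * dphi_dv * x           # chain rule: dE/dphi * dphi/dW

eps = 1e-6                              # finite-difference check
E = lambda w: 0.5 * (1.0 / (1.0 + math.exp(-w * x)) - t) ** 2
print(dE_dW, (E(W + eps) - E(W - eps)) / (2 * eps))  # the two values match
```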

**Stopping criterion:** use a separate validation set to estimate performance of the network and choose when to stop training.

### Training Tricks
**Vanishing gradient problem:** gradient tends to $0$ in lower layers because activation functions (sigmoid, ...) are easily saturated
- `initialize weights` to small random values around $0$
- `standardize` inputs to avoid saturating hidden units
- `shuffle` training data before each epoch
- solution: input regularization
**Rectifier:** activation function $f(x) = \text{max}(0, w^T x)$, gives better training for deep neural networks
- ✔ `sparse activation:` many nodes with output $0$
- ✔ scale invariant
- ✔ better gradient propagation: no easy saturation
- ✔ piecewise-linear function (**much faster** than sigmoid)
- **ReLU:** neuron employing the rectifier activation
- ❌ dying ReLU problem (akin to vanishing gradient): a node can die if it is inactive for all inputs
**Regularization:** add a penalty term on the weights to the loss
- 👍 good alternative to cross validation for feature selection (handling overfitting)
- **L1 (**`lasso regression`**):** penalty $\lambda\sum_i |w_i|$, encourages sparse weights (useful for feature selection)
- **L2 (**`ridge regression`**):** penalty $\lambda\sum_i w_i^2$, encourages small weights
**Initialization suggestions:**
- init random weights $\longrightarrow$ break symmetries between neurons
- sparse initialization (init many nodes to $0$) $\longrightarrow$ encourages diversity between neurons
**Batch size:** find a good trade-off between learning speed and reliable convergence
- `full batch:` update parameters once per epoch (after seeing all examples)
- `mini-batch:` update parameters after a small batch of examples (a mix of the two)
- `fully stochastic:` update parameters for each input
**Momentum:** accelerates the descent for dimensions whose gradients point in the same direction
- helps accelerating the descent, but can also skip a local minimum

**Adaptive gradient:** can constantly reduce the learning rate while approaching a local minimum, or adapt it based on the gradient slope (`Adagrad`).
**Batch normalization:** helps against the `covariate shift` problem (input distribution changes over time)
- normalize node activations + tune shift and scaling parameters
- 👍 more robust parameter initialization
- 👍 allows faster learning rates without divergence
- 👍 regularizes the model
**Layer-wise pre-training:** train each layer as a small auto-encoder (compressor) that reproduces its input at its output (can be done unsupervised)

- `global refinement:`
1. swap auto-encoder output layer with a preferred one (e.g. one-hot for multi-class)
2. train
### Popular Architectures
**Convolutional Networks:** location invariance + compositionality
- `convolution filters:` extract **local features**
- **pooling** to provide invariance to local variations
- hierarchy of filters to compose complex features from simpler ones (e.g. pixels to edges to shapes)
- fully connected layers for **final classification**

**Long Short-Term Memory Networks (**`LSTM`**):** used a lot for speech recognition
- cell state propagated along chain
- `forget gate:` selectively forgets parts of the cell state
- `input gate:` selectively chooses parts of the candidate for the cell update
- `output gate:` selectively chooses parts of the cell state for output

**Generative Adversarial Networks (**`GAN`**):**
- `generator network:` learns to generate items (e.g. images) from random noise
- `discriminator network:` learns to distinguish between real items and generated ones
- the two networks are jointly learned (`adversarial game`)
- **no human supervision** needed!!

## 19. Unsupervised Learning [[slides](http://disi.unitn.it/~passerini/teaching/2019-2020/MachineLearning/slides/20_clustering/talk.pdf)] 🙈
To group unlabeled data into `clusters`.
### K-means learning
**K-means learning:** assumes $k$ clusters, each represented by a mean point

1. assign each input to the cluster with the closest mean
2. recalculate the cluster means
3. repeat $1$ and $2$ until the assignments no longer change
- 👁🗨 need to choose a good similarity (distance) function: `Euclidean`, `Minkowski`, `Cosine similarity`, ...
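A minimal numpy sketch of the loop above with Euclidean distance and $k=2$ (the 2D toy data is randomly generated):
```python
# K-means: alternate closest-mean assignment and mean recomputation.
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(4, 0.5, (50, 2))])

k = 2
means = X[rng.choice(len(X), k, replace=False)]          # init: k random points
for _ in range(100):
    dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
    assign = dists.argmin(axis=1)                        # 1. assign to closest mean
    new_means = np.array([X[assign == j].mean(axis=0) for j in range(k)])
    if np.allclose(new_means, means):                    # 3. stop when nothing changes
        break
    means = new_means                                    # 2. recompute cluster means

print(means)  # close to the two generating centers (0, 0) and (4, 4)
```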
### Gaussian Mixture Model
**Gaussian Mixture Model (**`GMM`**):** clustering by mixture of Gaussian distributions.

- assumes $k$ Gaussians, estimate mean and possibly variance for each one
- parameter estimation:
- `Maximum likelihood` cannot be used because cluster assignment is unknown
- `Expectation-Maximization:`
1. compute expected cluster assignment given current parameter setting
2. estimate parameters given cluster assignment
3. iterate until convergence
### Number of Clusters
**Elbow method:** find the smallest number of clusters that still gives a good score

1. increase number of clusters for each run
2. plot a clustering evaluation metric
3. choose the number of clusters $k$ at the angle (elbow) of the plot
- Problem: **can be ambiguous**
**Average silhouette method:** idea, more clusters $\implies$ more homogeneous clusters

1. increase number of clusters for each run
2. plot average silhouette coefficient for different $k$
3. choose $k$ where the average silhouette coefficient is maximal
### Hierarchical clustering
Clustering does not need to be flat:
- `Top-down` approach (splitting)
- `Bottom-up` approach (aggregating)
**Dendrograms:** represent clusters by similarity score

👁🗨 **Similarity** between clusters can be given by `nearest-neighbor`, `furthest-neighbor`, or `average pairwise distance` between clusters elements.
🌞END🌞