---
tags: Master
---
# ML Summary v2
## Decision trees
- Decision trees encode logical formulas
- Appropriate problems for decision trees:
	- classification or regression
	- instances represented as attribute-value pairs
	- different explanations for the concept are possible (disjunction)
	- some instances have missing attributes
	- there is need for an interpretable explanation of the output
### learning
- greedy top down strategy
- choose best attribute
- split
- repeat
### choosing best attribute
- entropy:
$$
H(S) = -\sum_{i=1}^{c} p_i \log_2 p_i
$$
- information gain:
$$
IG(S, A) = H(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} H(S_v)
$$
An attribute with high information gain tends to produce groups that are homogeneous in terms of labels, thus favouring their classification (see the sketch below).
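A minimal sketch of these two quantities in Python; the toy `labels` and `outlook` vectors are hypothetical, not from the course material:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """H(S) = -sum_i p_i * log2(p_i) over the label distribution."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(labels, attribute_values):
    """IG(S, A) = H(S) - sum_v |S_v|/|S| * H(S_v)."""
    total = len(labels)
    remainder = 0.0
    for v in set(attribute_values):
        subset = [y for y, a in zip(labels, attribute_values) if a == v]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder

labels = ["yes", "yes", "no", "no", "yes", "no"]
outlook = ["sun", "sun", "rain", "rain", "sun", "rain"]
print(information_gain(labels, outlook))  # 1.0: the split is perfectly pure
```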
### issues
Avoid overfitting:
- if each leaf must be pure => complex trees that overfit
- solution: keep impure leaves. How?
1. pre-pruning: decide whether to stop splitting even if the node is still impure
2. post-pruning:
	1. split the training set to get a validation set
	2. for each node, evaluate performance on the validation set as if the subtree rooted at that node were pruned
	3. if no pruning improves performance, STOP
	4. else, prune the node with the highest improvement: replace its subtree with a leaf labeled with the majority label
	5. go back to step 2
Alternative attribute tests:
- IG prefers attributes with many possible values (e.g., the ID of an example splits perfectly => highest IG, but a useless split)
- measure entropy with respect to the attribute values instead of the class:
$$
H_A(S) = -\sum_{v \in Values(A)} \frac{|S_v|}{|S|} \log_2 \frac{|S_v|}{|S|}
$$
- Information Gain Ratio uses that as a normalizer:
$$
IGR(S, A) = \frac{IG(S, A)}{H_A(S)}
$$
### Missing values
- simple: during splitting, assign missing values at the node based on the majority value
- complex: infer the missing value with a model (nn or cpt)
### Random forests
1. Train multiple trees on random training subsets, sampled with replacement (bagging)
2. At test time, query all trees and return the majority prediction
## K nearest neighbours
- Euclidean distance:
$$
d(x, x') = \sqrt{\sum_{j=1}^{d} (x_j - x'_j)^2}
$$
- **CLASSIFICATION:** the class of a new example $x$ is the most frequent class among its K nearest neighbours in the training set
	- the training set consists of pairs $(x_i, y_i)$
	- distance: Euclidean distance between $x$ and $x_i$
- **REGRESSION:** same as before, but return the average of the $y$ values of the K nearest neighbours
- instance-based learning
- lazy learning
	- all the work happens in the classification phase
- local learning
- uniform weighting of the neighbours
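A minimal sketch of K-nearest-neighbour classification, assuming a small hypothetical training set:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    # Euclidean distances from x to every training example
    dists = np.sqrt(((X_train - x) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]  # indices of the k closest examples
    # classification: majority vote among the k neighbours
    # (for regression, return np.mean([y_train[i] for i in nearest]) instead)
    return Counter(y_train[i] for i in nearest).most_common(1)[0][0]

X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_train = ["a", "a", "b", "b"]
print(knn_predict(X_train, y_train, np.array([0.2, 0.1])))  # -> "a"
```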
## Evaluation
- hyperparameter tuning
- compute statistical significance of different learning algorithms
- quality of learned predictor
### Training Loss
- the cost of predicting $f(x)$ when the true label is $y$
- designed to make learning algorithms efficient
	- e.g. hinge loss for SVMs
- NOT the misclassification cost, because that is piecewise constant (1, 0)
	- not differentiable => no gradient descent
### Binary Classification
- accuracy: (TP+TN) / (TP+TN+FP+FN)
	- not good with strongly unbalanced datasets
	- rebalance, e.g. with a different weight for positives
- precision: TP / (TP+FP)
- recall: TP / (TP+FN)
	- how well the learner covers the actual positives
- F-beta
	- precision and recall are complementary
	- F-beta combines precision and recall with a balance
	- beta sets the balance
	- usually beta = 1
	- harmonic mean of precision and recall:
$$
F_\beta = \frac{(1+\beta^2)\,\text{precision} \cdot \text{recall}}{\beta^2\,\text{precision} + \text{recall}}
$$
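A minimal sketch computing these metrics from hypothetical binary predictions:

```python
def binary_metrics(y_true, y_pred, beta=1.0):
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    acc = (tp + tn) / len(y_true)
    prec = tp / (tp + fp)
    rec = tp / (tp + fn)
    # F-beta: weighted harmonic mean of precision and recall
    fbeta = (1 + beta**2) * prec * rec / (beta**2 * prec + rec)
    return acc, prec, rec, fbeta

print(binary_metrics([1, 1, 0, 0, 1, 0], [1, 0, 0, 1, 1, 0]))
# ≈ (0.667, 0.667, 0.667, 0.667)
```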
### Multiclass Classification
- same as binary, with more classes
- accuracy, precision, etc. are computed for each class
- per-class scores are then aggregated (e.g. averaged) to obtain an overall score
### Regression
- Pearson correlation coefficient
- Root mean squared error:
$$
RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n} \big(f(x_i) - y_i\big)^2}
$$
### Precision recall Curve
- classifiers output a confidence score
	- e.g. the margin in SVMs
- accuracy, precision, etc. are computed for a given decision threshold
- varying the threshold traces a curve of precision vs recall
- the area under the curve aggregates performance over all thresholds
### Procedure
- computing performance on the training set -> biased
- split the dataset into training, validation, test
	- validation set -> tune hyperparameters and algorithms
	- test set -> compute the final performance
- **K-fold cross validation** (see the sketch below)
	- split D into k folds
	- in a loop, each fold is used as test set and the remaining ones as training/validation
	- compute the score of each predictor
	- compute the average score
	- we can estimate the variance of the average score with:
$$
Var[\bar{s}] \approx \frac{1}{k(k-1)}\sum_{i=1}^{k} (s_i - \bar{s})^2
$$
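A minimal sketch of the k-fold loop; `evaluate` is a hypothetical callback that trains on the given indices and returns a test score:

```python
import numpy as np

def k_fold_scores(n_examples, k, evaluate, seed=0):
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_examples), k)
    scores = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        scores.append(evaluate(train_idx, test_idx))
    scores = np.array(scores)
    # average score and variance of the average, as in the formula above
    var = ((scores - scores.mean()) ** 2).sum() / (k * (k - 1))
    return scores.mean(), var

# dummy evaluate: returns the fraction of data used for testing
print(k_fold_scores(100, 5, lambda tr, te: len(te) / 100))  # (0.2, 0.0)
```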
### Comparing algorithms
- run K-fold for each algorithm (same folds)
- compute the difference of scores $\delta_i$ on each fold
- compute the average score difference $\bar{\delta}$ (average of the $\delta_i$)
- compute the unbiased estimate of the standard deviation of $\bar{\delta}$:
$$
\tilde{S}_{\bar{\delta}} = \sqrt{\frac{1}{k(k-1)}\sum_{i=1}^{k} (\delta_i - \bar{\delta})^2}
$$
- $\bar{\delta}$ divided by this estimate is the **standardized mean error difference**
- set an alpha for the t distribution
	- the t distribution is similar to the Gaussian, with heavier tails
	- alpha is the significance level (e.g. 0.05)
- if $t_{k-1, \alpha}$ is smaller than the standardized mean error difference, the classifiers are significantly different
	- i.e. the null hypothesis (no difference between the classifiers) is rejected
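A minimal sketch of this paired t-test on hypothetical per-fold score differences:

```python
import numpy as np

deltas = np.array([0.03, 0.05, 0.02, 0.04, 0.06])  # hypothetical delta_i per fold
k = len(deltas)
mean_delta = deltas.mean()
# unbiased estimate of the standard deviation of the mean difference
std_mean = np.sqrt(((deltas - mean_delta) ** 2).sum() / (k * (k - 1)))
t_stat = mean_delta / std_mean  # standardized mean error difference
print(t_stat)                   # ~5.66
# Compare with the critical value t_{k-1, alpha} from a t-table
# (for k-1 = 4 degrees of freedom, alpha = 0.05 two-tailed: ~2.776).
# Since 5.66 > 2.776, the null hypothesis (equal classifiers) is rejected.
```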
## Parameter estimation
**Setting:**
- data is sampled from a **probability distribution** $p(x, y)$
- probability distribution $p$ **form is known**, but **parameters are unknown**
- training set $D = \{(x_1, y_1), . . . ,(x_m, y_m)\}$ of examples sampled i.i.d. (**independent identically distributed**) according to $p(x, y)$
**Goal: estimate the unknown parameters of $p$.**
i.i.d. = independent examples from the SAME DISTRIBUTION
### Multi-class classification setting
- **divide** training set $D$ based on classes ($D_i$ for class $i$)
- given a new example $x$, we compute the **posterior probability** of class $y_i$:
$$
P(y_i|x, D) = \frac{p(x|y_i,D_i)p(y_i|D)}{p(x|D)}
$$
We must estimate class-dependent parameters $θ_i$ for $p(x|y_i , D_i)$.
### Maximum Likelihood vs Bayesian estimation
- **Maximum likelihood/a-posteriori estimation:**
	- assumes the parameters $θ$ are **unknown but fixed**
	- find the distribution parameters $\theta$ that **maximize** the probability of the observed data
	- the obtained values are then used to compute the probability of new examples
	- the optimization problem can get very complicated
	- as the number of samples grows to infinity, the MLE is unbiased and achieves the smallest possible variance
- **Bayesian estimation:**
	- assumes the parameters $θ_i$ are:
		- random variables
		- with some known prior distribution (e.g. conjugate priors)
NB: observing examples turns the prior distribution over the parameters into a posterior distribution
### Maximum-likelihood estimation
$\theta_i^*$ is the set of parameters that maximizes the probability of a dataset of class $i$.
- **Maximum a-posteriori estimation**
- Assumes a prior distribution for the parameters $p(\theta_i)$ is available
- $θ^∗_i = \text{argmax}_{θ_i} p(θ_i |D_i , y_i) = \text{argmax}_{θ_i} p(D_i , y_i |θ_i)p(θ_i)$
- **Maximum likelihood estimation:**
- no assumption about prior distributions for parameters
- $θ^∗_i= \text{argmax}_{θ_i} p(D_i , y_i |θ_i)$
- given $n$ inputs of class $y$:
- $θ^* = \text{argmax}_θ p(D|θ) = \text{argmax}_θ \prod \limits ^n_{j=1} p(x_j |θ)$
- **Maximum log-likelihood estimation**:
- usually easier to maximize
- $θ^∗ = \text{argmax}_θ \text{ln}(p(D|θ)) = \text{argmax}_θ \sum \limits ^n_{j=1} \text{ln}(p(x_j |θ))$
### Parameters Estimation for Gaussian Distribution
- the MLE of the Gaussian mean is the sample mean: $\hat{\mu} = \frac{1}{n}\sum_{j=1}^{n} x_j$
- the MLE of the Gaussian covariance matrix is the average of the sample scatter: $\hat{\Sigma} = \frac{1}{n}\sum_{j=1}^{n} (x_j - \hat{\mu})(x_j - \hat{\mu})^T$
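A minimal numeric check of these two estimates on hypothetical synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=[1.0, -2.0], scale=1.0, size=(500, 2))  # n samples, d dims

mu_hat = X.mean(axis=0)                     # sample mean
centered = X - mu_hat
sigma_hat = centered.T @ centered / len(X)  # MLE covariance (divide by n, not n-1)
print(mu_hat)     # close to [1, -2]
print(sigma_hat)  # close to the 2x2 identity
```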
### Bayesian estimation
Predictions for new examples are obtained integrating over all possible values for the parameters:
$$
p(x|D) = \int_θ p(x, θ|D)dθ = \int p(x|θ)p(θ|D)dθ
$$
- $p(x|θ)$ can be **easily computed** because we have both form and parameters of distribution (Gaussian,…)
- need to **estimate the parameter posterior density** given the training set:
$$
p(θ|D) = \frac{p(D|θ)p(θ)}{p(D)}
$$
- $p(D)$ is a constant independent of $θ$, has no influence on the Bayesian decision
#### Bayesian estimation cases
- **Univariate normal case:** unknown $\mu$, known $σ^2$
	- $p(x|\mu) ∼ N(\mu, σ^2)$
	- the posterior over the mean sharpens around the sample mean as the sample size grows
- **Multivariate normal case:** unknown $\mu$, known $\Sigma$
### Sufficient statistics
**Statistic:** any function on a set of samples $D$.
- a statistic $\textbf{s} = \phi(D)$ is sufficient for some parameters $θ$ if:
$$
P(D|\textbf{s}, θ) = P(D|\textbf{s})
$$
- a sufficient statistic makes it possible to **compress a sample** $D$ into (possibly few) values
Sample mean and covariance are **sufficient statistics** for true mean and covariance of the **Gaussian distribution**.
### Conjugate priors
A prior distribution $p(θ)$ **is a conjugate prior for a likelihood function** $p(x|θ)$ if the posterior distribution $p(θ|x)$ is in the same family as the prior $p(θ)$.

## Bayesian networks
### Visualization
- visualize models in a simple and intuitive way
- properties of the model are shown in the graph
- complex computations become graphical manipulations
- multiple probability distributions in a single graph
### Network
- directed acyclic graph (DAG)
- each node is a variable $x$
- edges represent dependencies between variables
- each variable is independent of its non-descendants given its parents (local Markov property)
#### I-map
$P$ is the joint distribution of all the variables.
$I(P)$ is the set of independencies holding in $P$.
$G$ is an I-map of $P$ if its local independencies hold in $P$: $I_l(G) \subseteq I(P)$.
A minimal I-map is an I-map that cannot be reduced (by removing edges) without losing the I-map property (all local independencies are in $I(P)$).
A perfect I-map (P-map) captures all independencies; it is hard to find and not guaranteed to exist.
#### Factorization
$P$ factorizes according to $G$ when $P(x_1, \dots, x_n) = \prod_i p(x_i|\text{Parents}(x_i))$
**G is an I-map of P <=> P factorizes according to G**
### Definition
A Bayesian Network is a pair (G, p) where p factorizes over G and is represented as a set of conditional probability distributions (CPDs) associated with the nodes of G.
### D separation
Basic rules: the three canonical structures are tail-to-tail, head-to-tail and head-to-head.
Given 3 sets of nodes A, B, C, A and B are d-separated by C if all paths from A to B are blocked, i.e. each path:
- meets a tail-to-tail or head-to-tail node that is in C (observed), or
- meets a head-to-head node n such that neither n nor its descendants are in C (not observed)
A and B are independent given C if they are d-separated by C.
### Inference
- given some evidence **e** for a subset of variables **E**
- inference is computing the probability of a non-observed subset X given the evidence:
	- $p(X|E=e)$
	- this can be computed as $p(X, E)/p(E)$
	- difficult with many variables
- we can do inference on the graphical model instead
#### messages
Given a chain of variables $x_1 \rightarrow x_2 \rightarrow \dots \rightarrow x_N$:
- $p(x) = p(x_1)\,p(x_2|x_1)\cdots p(x_N|x_{N-1})$
- the marginalization over the variables after $x_n$ can be seen as a message coming up from the last element of the chain to $x_n$:
$$
\mu_\beta(x_n) = \sum_{x_{n+1}} p(x_{n+1}|x_n)\,\mu_\beta(x_{n+1}), \qquad \mu_\beta(x_N) = 1
$$
- the same holds for a message going from $x_1$ down to $x_n$:
$$
\mu_\alpha(x_n) = \sum_{x_{n-1}} p(x_n|x_{n-1})\,\mu_\alpha(x_{n-1}), \qquad \mu_\alpha(x_1) = p(x_1)
$$
- at the node $x_n$ for which we want to compute the marginal probability, this gives:
$$
p(x_n) = \mu_\alpha(x_n)\,\mu_\beta(x_n)
$$
- when a node is observed we don't need to iterate over all its possible values (no summation): the observed value is enough
- this generalizes to Bayesian networks: the sum-product algorithm (slides)
- it can also be used to obtain the configuration of the variables with the highest probability
	- simply use the max-product algorithm
	- this has underflow issues
		- work with logs, giving the max-sum algorithm
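A minimal sketch of message passing on a chain of binary variables; the prior and transition tables are hypothetical:

```python
import numpy as np

prior = np.array([0.6, 0.4])   # p(x1)
trans = np.array([[0.7, 0.3],  # trans[i, j] = p(x_next = j | x = i)
                  [0.2, 0.8]])

def marginal(n):
    # forward message: mu_alpha(x_n) = sum_{x_{n-1}} p(x_n|x_{n-1}) mu_alpha(x_{n-1})
    mu_alpha = prior.copy()
    for _ in range(n - 1):
        mu_alpha = mu_alpha @ trans
    # for a directed chain the backward message mu_beta is identically 1
    # (the suffix of the chain sums to one), so p(x_n) = mu_alpha(x_n)
    return mu_alpha

print(marginal(1))  # p(x1) = the prior
print(marginal(5))  # marginal of x5 without ever building the joint table
```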
### Approximate Inference
- when exact inference is not computable we resort to approximate inference
- loopy belief propagation:
	- use the original graph even if it has loops
	- messages are passed multiple times (because of the loops)
	- hoping the algorithm converges (no more changes to the messages)
- variational methods
	- deterministic approximation
	- assume a simplified form for the posterior probability
- sampling methods
	- approximate the posterior by sampling from the network
	- a sample is an instantiation of X according to its probability
### Learning parameters
Objective: given model structure, estimate probabilities from dataset
#### Maximum likelihood on complete data
The likelihood of the data decomposes over examples and over the nodes of the network:
$$
p(D|\theta) = \prod_{j=1}^{N} p(x_j|\theta) = \prod_{j=1}^{N} \prod_{i=1}^{n} p(x_{ij}\,|\,\mathrm{pa}_{ij}, \theta_i)
$$
- The parameters of each CPD can therefore be estimated independently:
$$
\theta_i^* = \text{argmax}_{\theta_i} \prod_{j=1}^{N} p(x_{ij}\,|\,\mathrm{pa}_{ij}, \theta_i)
$$
- A discrete CPD P(X|U) can be represented as a table, with:
	- rows for the configurations of X
	- columns for the configurations of U
	- each table entry $\theta_{x|u}$
	- each column sums to 1
	- the parameters of each column can be estimated independently
- zeroing the gradient of the likelihood gives:
$$
\theta_{x|u}^* = \frac{N_{x,u}}{\sum_{x'} N_{x',u}}
$$
where $N_{x,u}$ is the number of times the configuration $(x, u)$ was observed in the data.
**Therefore, the maximum likelihood parameters are simply the fraction of times in which the specific configuration was observed in the data**
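A minimal sketch of this counting estimate; `data` is a hypothetical list of (child value, parent configuration) observations:

```python
from collections import Counter

data = [("t", "a"), ("t", "a"), ("f", "a"), ("t", "b"), ("f", "b"), ("f", "b")]

joint = Counter(data)                 # N(x, u)
parent = Counter(u for _, u in data)  # N(u) = sum_x N(x, u)
cpt = {(x, u): n / parent[u] for (x, u), n in joint.items()}
print(cpt)  # ≈ {('t','a'): 0.67, ('f','a'): 0.33, ('t','b'): 0.33, ('f','b'): 0.67}
```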
##### Adding priors
- MLE tends to overfit
	- a configuration not appearing in the training set gets zero probability
- solution: use maximum a-posteriori estimation with a Dirichlet prior (the conjugate prior of the multinomial), whose pseudo-counts $\alpha_{x,u}$ act as imaginary observations:
$$
\theta_{x|u}^* = \frac{N_{x,u} + \alpha_{x,u}}{\sum_{x'} (N_{x',u} + \alpha_{x',u})}
$$
#### Estimation on incomplete data
- some examples have missing variables
	- cannot compute configuration frequencies
- cannot use full Bayesian estimation because it is too complex
- **need to approximate**
##### Expectation Maximization
- the sufficient statistics (counts) cannot be computed directly (missing data)
- fill in the missing data by inferring them with the current parameters (solve an inference problem to get the expected counts)
- compute the parameters maximizing the likelihood (or posterior) of such expected counts
- iterate the procedure to improve the quality of the parameters
Algorithm:
1. initialize $\theta$ to random values
2. **e-step**: compute the expected sufficient statistics using the distribution of the missing variables given $D$ and the current $\theta$
	- if a variable is observed, its contribution to a count is either 0 or 1
	- otherwise use Bayesian inference to compute its distribution
3. **m-step**: compute the parameters that maximize the expected statistics, using maximum likelihood or maximum a-posteriori
4. repeat the e and m steps until convergence
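A minimal sketch of the e/m alternation on a simpler model than a full BN: a 1D mixture of two unit-variance Gaussians with hypothetical synthetic data. The e-step computes soft assignments (expected counts), the m-step re-estimates the parameters from them:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 200), rng.normal(3, 1, 200)])

mu = np.array([-1.0, 1.0])   # initial means
pi = np.array([0.5, 0.5])    # initial mixing weights
for _ in range(50):
    # e-step: expected (soft) assignment of each point to each component
    dens = pi * np.exp(-0.5 * (x[:, None] - mu[None, :]) ** 2)
    resp = dens / dens.sum(axis=1, keepdims=True)
    # m-step: parameters maximizing the likelihood of the expected counts
    mu = (resp * x[:, None]).sum(axis=0) / resp.sum(axis=0)
    pi = resp.mean(axis=0)

print(mu, pi)  # means near (-2, 3), weights near (0.5, 0.5)
```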
### Learning structure
#### Full bayesian approach (aka model averaging)
- let S be the space of possible structures (DAGs) for the domain X
- let D be a dataset of observations
- predictions for a new instance are computed by marginalizing over both structures and parameters
- **too expensive**
#### Score based
- assign score to each structure
- ML score: leads to fully connected network => unusable
- MAP score: e.g. Bayesian-Dirichlet scoring
- choose best structure $S^*$
- assume $P(S^*|D)=1$
- **search approach**: start with random DAG, modify using hill climbing, best first, simulated annealing
#### Constraint based
- test conditional independencies on data
- choose any structure $S^*$ that satisfies them
- assume $P(S^*|D)=1$
### Naive bayes
- each instance x is described by a conjunction of attribute values
- the target function can take any value from a finite set Y
- the task is predicting the MAP (maximum a-posteriori) target value
- problem: the conditionals are hard to learn, since $P(a_1, \dots, a_m|y)$ has too many parameters
- solution: simplify by assuming the attributes independent given the class:
$$
P(a_1, \dots, a_m|y) = \prod_{i=1}^{m} P(a_i|y)
$$
- new MAP definition:
$$
y^* = \text{argmax}_{y \in Y}\; P(y) \prod_{i=1}^{m} P(a_i|y)
$$
- the probability of each attribute given the class, $P(a|y)$, can be modeled as a multinomial
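A minimal sketch of naive Bayes with categorical attributes and add-one smoothing; the toy weather-style dataset is hypothetical:

```python
from collections import Counter, defaultdict

train = [  # (attribute tuple, class)
    (("sun", "hot"), "no"), (("sun", "mild"), "yes"),
    (("rain", "mild"), "yes"), (("rain", "hot"), "no"),
]

class_counts = Counter(y for _, y in train)
attr_counts = defaultdict(Counter)  # attr_counts[(position, y)][value] = count
for attrs, y in train:
    for pos, a in enumerate(attrs):
        attr_counts[(pos, y)][a] += 1

def predict(attrs):
    best, best_score = None, -1.0
    for y, ny in class_counts.items():
        score = ny / len(train)  # P(y)
        for pos, a in enumerate(attrs):
            counts = attr_counts[(pos, y)]
            # P(a|y) with a simple add-one smoothing (avoids zero probabilities)
            score *= (counts[a] + 1) / (ny + len(counts) + 1)
        if score > best_score:
            best, best_score = y, score
    return best

print(predict(("sun", "mild")))  # -> "yes"
```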
## Discriminative learning
- generative learning assumes knowledge of the distribution of the data
- discriminative learning focuses on modelling the discriminant function (boundaries rather than distributions)
Pros:
- modelling complex data is difficult
- if discrimination is the goal, modelling the full distribution is useless
Cons:
- the learned model is less flexible
- no inference tasks
- can't generate new data
### Linear discriminant functions
- form: $f(x) = w^T x + w_0$
	- a linear combination of the features
	- $w_0$ is the bias
- simplest possible discriminant function
- it can be the best option, and is surely the first to try
### Linear binary classifier
- uses a linear discriminant function:
	- $f(x) = \text{sign}(w^T x + w_0)$
- decision boundary where $f(x) = 0$
- the boundary is a hyperplane $H$ orthogonal to $w$
- the geometric margin of a point is its signed distance from the boundary: $r = \frac{f(x)}{\|w\|}$
- a point can be expressed as its projection on the hyperplane plus its (signed) distance from it along $w$: $x = x_\perp + r\frac{w}{\|w\|}$
### Perceptron
1. linear combination of inputs (can incorporate bias into w vector)
2. threshold activation function (sign/sigmoid/..)
Representational power:
- linearly separable sets with 1 layer
- all logic functions with 2 layers (but an exponential number of hidden nodes for some functions)
Learning:
1. choose a function of the parameters to be optimized
	- e.g. the error on the training set (loss)
2. gradient descent:
	- initialize $w$ at random (e.g. w = 0)
	- iterate until the gradient approaches 0: $w \leftarrow w - \eta \nabla E(w)$
	- $\eta$ is the learning rate; it can be tuned / made dynamic for faster and more precise convergence
	- guarantees a **local** optimum
	- the loss function needs to be differentiable
#### Perceptron training rule
- the misclassification loss is piecewise constant
	- we can't use gradient descent on it
- use instead the sum of the functional margins of the incorrectly labeled examples:
$$
E(w) = -\sum_{(x, y) \in D_e} y\,f(x)
$$
- $D_e$ is the set of wrongly labeled examples (for them $y$ and $f(x)$ have different signs, so $y\,f(x)$ is negative)
#### Stochastic Perceptron training rule
1. initialize the weights at random
2. iterate until everything is labeled right: for each misclassified example $(x_i, y_i)$ update
$$
w \leftarrow w + \eta\, y_i x_i
$$
- the weights are adjusted at each iteration on the errors
	- a gradient step for each error
- steps are very fast
- can help avoid local minima
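A minimal sketch of the stochastic rule on hypothetical linearly separable data (the bias is folded into w via a constant first feature):

```python
import numpy as np

X = np.array([[1, 2.0, 2.0], [1, 1.0, 2.5], [1, -1.0, -1.5], [1, -2.0, -1.0]])
y = np.array([1, 1, -1, -1])  # first column of X is a constant 1 (bias)

w, eta = np.zeros(3), 0.1
while True:
    errors = 0
    for xi, yi in zip(X, y):
        if yi * (w @ xi) <= 0:  # misclassified (or on the boundary)
            w += eta * yi * xi  # one gradient step per error
            errors += 1
    if errors == 0:             # stop when everything is labeled right
        break
print(w)
```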
#### Perceptron Regression
- matrix of inputs X
- vector of outputs y
- regression as a set of linear equations:
	- $Xw = y$
- the X matrix has more rows than columns
	- more equations than unknowns
	- an exact solution does not always exist
#### Perceptron Regression: Mean squared error
$$
E(w) = \sum_{j=1}^{n} (y_j - w^T x_j)^2 = (y - Xw)^T (y - Xw)
$$
- can always be minimized by gradient descent
- can also be used as a classification loss
- closed form:
$$
w = (X^T X)^{-1} X^T y
$$
- the part multiplying $y$, $(X^T X)^{-1} X^T$, is the left inverse (pseudo-inverse) of X
- the features must be linearly independent for $X^T X$ to be invertible (remove redundant ones)
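A minimal numeric check of the closed form on hypothetical data; in practice `np.linalg.lstsq` is the numerically safer equivalent:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                # more rows (equations) than columns
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.01 * rng.normal(size=100) # noisy targets: no exact solution

w = np.linalg.inv(X.T @ X) @ X.T @ y         # left inverse applied to y
print(w)                                     # close to w_true
```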
#### Multiclass
### Generative linear classifiers
## SVMs
- linear classifiers maximizing the separation margin between classes
- solution depends on few training examples (support vectors)
- bounds/error based on margin
- easily extended to non-linear models (kernel machines)
### Maximum margin classifier
- measures:
	- confidence margin of an example: $y\,f(x)$; the confidence margin of a hyperplane is the minimal margin over the training set
	- geometric margin: $\frac{y\,f(x)}{\|w\|}$
- canonical hyperplane:
	- the hyperplane rescaled so that its confidence margin is 1 => geometric margin $\frac{1}{\|w\|}$ (distance between the two class boundaries: $\frac{2}{\|w\|}$)
### Learning Hard margin SVMs
Objective: minimize
$$
\frac{1}{2}\|w\|^2 \quad \text{subject to} \quad y_i(w^T x_i + w_0) \ge 1 \;\; \forall i
$$
**A quadratic problem with linear constraints**
#### Dual formulation
- using the Lagrangian we get a new formulation in the variables $\alpha$
- maximizing over $\alpha$ is equivalent to minimizing over w:
$$
\max_{\alpha \ge 0}\; \sum_{i=1}^{m} \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j\, x_i^T x_j \quad \text{subject to} \quad \sum_{i=1}^{m} \alpha_i y_i = 0
$$
- the primal formulation has d+1 variables (number of features + 1)
- the dual formulation has m variables (number of training examples)
- the dual formulation has simpler constraints (box), easier to solve
- **choose between the two based on the problem**
#### Support Vectors
At the saddle point of the Lagrangian, complementarity holds: $\alpha_i\,(y_i f(x_i) - 1) = 0$ for every example.
This can happen in two ways:
- $\alpha_i = 0$
	- the training example gives no contribution when classifying test examples
- or $y_i f(x_i) = 1$
	- these are the examples closest to the boundary, the **support vectors**
### Soft margin SVM
**Allow misclassifications**
- uses a slack variable $\xi_i$ as the penalty for violating the margin on example $x_i$
- a parameter $C \ge 0$ trades off data fitting and size of the margin
- objective of the optimization:
$$
\min_{w, w_0, \xi}\; \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{m}\xi_i \quad \text{subject to} \quad y_i(w^T x_i + w_0) \ge 1 - \xi_i,\;\; \xi_i \ge 0 \;\; \forall i
$$
- usually $\xi_i = loss(y_i, f(x_i))$
- a regularized loss minimization problem:
	- the margin maximization term (first term) accounts for regularization (solutions with larger margins are preferred)
	- the loss term (second term) accounts for error minimization
	- **prevents overfitting**
#### Hinge loss
**Common loss function**
$$
\ell(y, f(x)) = |1 - y\,f(x)|_+
$$
- $|z|_+ = z$ if $z > 0$ and $0$ otherwise (positive part)
- it corresponds to the slack variable $\xi_i$ (violation of the margin constraint)
- all examples not violating the margin constraint have zero loss (sparse set of support vectors)
#### Dual formulation
- same procedure as for the hard margin: take the derivatives of the Lagrangian with respect to w, $w_0$ and $\xi$
- substitute the derivatives back into the Lagrangian
- we obtain the same dual as in the hard-margin case
- **but:** $0 \le \alpha_i \le C$
	- because $C - \alpha_i - \beta_i = 0$ with $\beta_i \ge 0$
	- examples with $\alpha_i = C$ are margin violations (errors), while support vectors with $0 < \alpha_i < C$ lie exactly on the margin
### SGD for SVMs
**Way faster than primal/dual for large datasets**
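A minimal sketch of stochastic (sub)gradient descent on the regularized hinge loss, in the spirit of Pegasos; the two-blob dataset is hypothetical and the bias term is omitted for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(2, 1, (50, 2)), rng.normal(-2, 1, (50, 2))])
y = np.array([1] * 50 + [-1] * 50)

w, lam = np.zeros(2), 0.01
for t in range(1, 2001):
    i = rng.integers(len(X))
    eta = 1.0 / (lam * t)       # decreasing learning rate
    grad = lam * w              # gradient of the regularization term
    if y[i] * (w @ X[i]) < 1:   # margin violated: the hinge is active
        grad -= y[i] * X[i]     # subgradient of the hinge term
    w -= eta * grad
print((np.sign(X @ w) == y).mean())  # training accuracy, ~1.0
```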
## Kernels
Problem:
- hard-margin SVMs can address linearly separable problems
- soft-margin SVMs can address linearly separable problems with outliers
- non-linearly separable problems need a higher expressive power (i.e. more complex feature combinations)
- we do not want to lose the advantages of linear separators (i.e. large margin, theoretical guarantees)
Solution:
1. **map input examples to a higher dimensional feature space**
2. perform linear classification on this higher dim space
### Feature maps
$$
\Phi: \mathcal{X} \rightarrow \mathcal{H}
$$
- $\Phi$ is a **function mapping** each example to a higher dimensional space $\mathcal{H}$
- examples x are replaced with their feature mapping $\Phi(x)$
- the feature mapping should increase the expressive power of the representation (e.g. introducing features which are combinations of input features)
- examples should be (approximately) linearly separable in the mapped space
- the formula is the same, with $\Phi(x)$ in place of the old x:
	- $f(x) = w^T\Phi(x) + w_0$
#### Polynomial mapping
- homogeneous: all products of input features of degree exactly d
- inhomogeneous: all products of input features of degree up to d
#### Support Vector Regression
- $\epsilon$-insensitive loss:
$$
\ell(y, f(x)) = |y - f(x)|_\epsilon = \max(0,\; |y - f(x)| - \epsilon)
$$
	- tolerates small deviations (up to $\epsilon$) from the true value, with no penalty
	- defines an $\epsilon$-tube of insensitiveness around the true values
	- trades off function complexity with data fitting
**Primal**:
$$
\min_{w, w_0, \xi, \xi^*}\; \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{m}(\xi_i + \xi_i^*) \quad \text{subject to} \quad y_i - f(x_i) \le \epsilon + \xi_i,\;\; f(x_i) - y_i \le \epsilon + \xi_i^*,\;\; \xi_i, \xi_i^* \ge 0
$$
**Dual**:
$$
\max_{\alpha, \alpha^*}\; -\frac{1}{2}\sum_{i,j}(\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*)\,x_i^T x_j - \epsilon\sum_{i}(\alpha_i + \alpha_i^*) + \sum_{i} y_i(\alpha_i - \alpha_i^*)
$$
subject to $\sum_i(\alpha_i - \alpha_i^*) = 0$ and $0 \le \alpha_i, \alpha_i^* \le C$.
### Kernel Trick
- computing $\Phi(x)$ explicitly can be very expensive
- but it appears only inside dot products in the dual formulation
- replace the dot products with a kernel function:
$$
k(x, x') = \Phi(x)^T \Phi(x')
$$
$$
f(x) = \sum_{i=1}^{m} \alpha_i y_i\, k(x_i, x) + w_0
$$
#### Valid kernels
- a valid kernel is a (similarity) function defined on the cartesian product of the input space: $k: \mathcal{X} \times \mathcal{X} \rightarrow \mathbb{R}$
- it corresponds to a dot product in some feature space: $k(x, x') = \Phi(x)^T\Phi(x')$
- it produces a **Gram matrix**: the symmetric matrix K where $K_{ij} = k(x_i, x_j)$
- **kernel positive definite <=> there is a corresponding feature map (i.e. the kernel is valid)**
Options to prove a kernel valid:
- prove the kernel Gram matrix K is positive definite (hard)
**or**
- find the corresponding feature map (as in the polynomial case)
- use kernel combinations
### Kernel machines
- old stochastic perceptron:
	- $f(x) = w^T x$
	- iterate and update $w \leftarrow w + \eta\, y_i x_i$ on mislabeled examples
- new kernel perceptron:
	- $f(x) = \sum_{j} \alpha_j\, k(x_j, x)$
	- iterate and update $\alpha_i \leftarrow \alpha_i + \eta\, y_i$ on mislabeled examples (see the sketch below)
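A minimal sketch of the kernel perceptron with a degree-2 polynomial kernel; the XOR toy data (not linearly separable in input space) is hypothetical:

```python
import numpy as np

X = np.array([[0.0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([-1, 1, 1, -1])  # XOR labels

def k(a, b):
    return (a @ b + 1) ** 2   # inhomogeneous polynomial kernel, degree 2

K = np.array([[k(a, b) for b in X] for a in X])  # Gram matrix
alpha, eta = np.zeros(len(X)), 1.0
for _ in range(100):
    for i in range(len(X)):
        if y[i] * (alpha @ K[:, i]) <= 0:  # mistake on example i
            alpha[i] += eta * y[i]         # update only its coefficient
print(np.sign(K @ alpha))  # matches y: XOR is solved thanks to the kernel
```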
### Types of kernels
- linear kernel
	- no transformation:
$$
k(x, x') = x^T x'
$$
- polynomial kernel
$$
k(x, x') = (x^T x' + c)^d
$$
- Gaussian kernel
	- depends on the width parameter $\sigma$
	- small $\sigma$ -> behaves like nearest neighbour
	- universal kernel
$$
k(x, x') = \exp\left(-\frac{\|x - x'\|^2}{2\sigma^2}\right)
$$
### Kernel operations
**Simplest constructive approach to build valid kernels**
- Sum
	- same as the concatenation of the two feature spaces:
$$
(k_1 + k_2)(x, x') = k_1(x, x') + k_2(x, x')
$$
	- can sum kernels on feature spaces of different size
- Product
	- same as the cartesian product of the two feature spaces:
$$
(k_1 \times k_2)(x, x') = k_1(x, x')\, k_2(x, x')
$$
	- can multiply kernels on feature spaces of different size
- Linear combination
	- weighted sum of kernels:
$$
k(x, x') = \sum_{d} \beta_d\, k_d(x, x')
$$
	- learning the $\beta$ weights together with the $\alpha$'s is **kernel learning**
- Decomposition:
	- an operation to break a structure into parts
	- e.g. dividing strings into substrings
- Convolution:
	1. decompose $x^1$ and $x^2$ into D parts
	2. iterate over all combinations of decompositions
	3. sum the product of the kernels on each part $k_d(x^1_d, x^2_d)$:
$$
(k_1 \star \dots \star k_D)(x^1, x^2) = \sum_{(x^1_1, \dots, x^1_D) \in R(x^1)}\; \sum_{(x^2_1, \dots, x^2_D) \in R(x^2)}\; \prod_{d=1}^{D} k_d(x^1_d, x^2_d)
$$
- Set:
	- for each pair of members $\xi^1 \in X^1, \xi^2 \in X^2$ of the two sets, sum $k_{member}(\xi^1, \xi^2)$:
$$
k_{set}(X^1, X^2) = \sum_{\xi^1 \in X^1} \sum_{\xi^2 \in X^2} k_{member}(\xi^1, \xi^2)
$$
	- if $k_{member}$ is the delta kernel we count the pairs of matching members
	- normalization is needed, e.g. $\hat{k}(x, x') = \frac{k(x, x')}{\sqrt{k(x, x)\,k(x', x')}}$
- Composition:
	- given a kernel, it is always possible to use another kernel on top of it
### Kernels on structured data
- similarity between objects
	- graphs, trees, strings
- match (delta) kernel
	- the simplest kernel on structures
	- x can be anything (not necessarily a vector):
$$
k_\delta(x, x') = \begin{cases} 1 & \text{if } x = x' \\ 0 & \text{otherwise} \end{cases}
$$
- spectrum kernel (see the sketch below)
	- parameter k
	- the feature space is the frequency of each k-gram (substring)
	- very sparse (lots of 0s)
	- solution: mismatch string kernel
		- parameters k and m
		- frequency of each k-gram, allowing up to m mismatches
		- sum the contributions for mismatches from 0 up to m, e.g. k=3,m=0 + k=3,m=1 = 1 + 24 = 25
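A minimal sketch of the k-spectrum kernel; the example strings are hypothetical:

```python
from collections import Counter

def spectrum(s, k):
    """Feature map: frequency of every k-gram of s."""
    return Counter(s[i:i + k] for i in range(len(s) - k + 1))

def spectrum_kernel(s1, s2, k):
    f1, f2 = spectrum(s1, k), spectrum(s2, k)
    # dot product over the (few) k-grams appearing in both strings
    return sum(c * f2[g] for g, c in f1.items())

print(spectrum_kernel("abracadabra", "cadabra", 3))  # 7
```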
#### kernels on Trees
- Subset tree kernel
	- matches subtrees
	- summation of a recursive procedure C over all pairs of nodes of the trees $t_1$ and $t_2$:
$$
K(t_1, t_2) = \sum_{n_1 \in t_1} \sum_{n_2 \in t_2} C(n_1, n_2)
$$
- When do two subtrees match?
	- consider the roots $n_1$ and $n_2$ of the subtrees
	- if $n_1 = n_2$ and $children(n_1) = children(n_2)$ -> match
- The procedure C:
	- recursive
	- if $n_1$ and $n_2$ don't match -> 0
	- if $n_1$ and $n_2$ match and their children are all leaves -> 1
	- else:
$$
C(n_1, n_2) = \prod_{j=1}^{nc(n_1)} \big(1 + C(ch_j(n_1), ch_j(n_2))\big)
$$
- with large trees the values grow too much
	- add a downweighting factor: replace 1 with $0 \le \lambda \le 1$
#### kernels on graphs
1. The simplest way is similar to the spectrum kernel for strings
	- frequency of subgraphs of size up to k
	- dot product of the frequencies
	- good for small graphs
2. Walk kernel
	- a walk in a graph: a way to go from one node to another
	- to efficiently count walks of length n use $A^n$
		- where A is the adjacency matrix of the graph (see the sketch below)
	- compute, for each pair of labels (a, b), the number of walks of length n between nodes labeled a and b
	- the kernel is the dot product of the two count matrices
3. See also: the Weisfeiler-Lehman kernel
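A minimal sketch of counting walks with powers of the adjacency matrix; the 4-node cycle graph is hypothetical:

```python
import numpy as np

# adjacency matrix of a 4-node cycle: edges 0-1, 1-2, 2-3, 3-0
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]])

walks2 = np.linalg.matrix_power(A, 2)  # (A^2)[i, j] = walks of length 2 from i to j
print(walks2[0, 2])                    # 2 (via node 1 or node 3)
```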
# Deep networks
## Clustering