---
tags: Master
---

# Sunto ML v2

## Decision trees

- Decision trees encode logical formulas
- Appropriate problems for decision trees:
  - classification or regression
  - instances represented as attribute-value pairs
  - different explanations for the concept are possible (disjunction)
  - some instances have missing attributes
  - there is need for an interpretable explanation of the output

### Learning

- greedy top-down strategy
- choose the best attribute
- split
- repeat

### Choosing the best attribute

- entropy: ![](https://i.imgur.com/7exLbDM.png)
- information gain: ![](https://i.imgur.com/r5jlYOz.png)

An attribute with high information gain tends to produce groups that are homogeneous in terms of labels, thus favouring their classification.

### Issues

Avoid overfitting:
- requiring each leaf to be pure => complex trees that overfit
- solution: keep impure leaves. How?
  1. pre-pruning: decide whether to stop splitting even if the node is impure
  2. post-pruning:
     1. split the training set to get a validation set
     2. for each node, pretend to stop there and check performance on validation
     3. if no pruning improves performance, STOP
     4. else, remove the node with the highest improvement: replace its subtree with a leaf labelled with the majority label
     5. back to 2

Alternative attribute tests:
- IG prefers attributes with many possible values (e.g., the ID of an example splits perfectly => highest IG, but a useless split)
- measure entropy wrt the attribute instead of the class: ![](https://i.imgur.com/o0USFq9.png)
- Information Gain Ratio uses that: ![](https://i.imgur.com/479EPyz.png)

### Missing values

- simple: during splitting, assign missing values at the node based on the majority value
- complex: *(note to self: didn't understand this part)*

### Random forests

1. Train multiple trees on random training subsets (sampled with replacement)
2.
Test all trees and return the majority vote

## K nearest neighbours

- Euclidean distance: ![](https://i.imgur.com/p3LYts8.png =200x)
- **CLASSIFICATION:** the class of a new example x is the most frequent class among its k nearest neighbours in the training set
  - the training set consists of pairs (x_i, y_i)
  - distance: Euclidean distance between x and x_i
- **REGRESSION:** same as before, but return the average of the y values of the k nearest neighbours
- instance-based learning
- lazy learning: all the work happens in the classification phase
- local learning
- uniform weighting of the neighbours

## Evaluation

Used for:
- tuning hyperparameters
- computing the statistical significance of differences between learning algorithms
- assessing the quality of the learned predictor

### Training loss

- cost of predicting f(x) instead of y
- designed to boost the efficiency of learning algorithms
  - e.g. hinge loss for SVM
- NOT the misclassification cost, because that is piecewise constant (1, 0)
  - not differentiable => no gradient descent

### Binary classification

- accuracy: (TP+TN) / ALL
  - not good with strongly unbalanced datasets
  - fix by rebalancing, e.g. with a different weight for positives
- precision: TP / (TP+FP)
- recall: TP / (TP+FN)
  - coverage of the learner in predicting positives
- F-beta
  - precision and recall are complementary
  - combines precision and recall with a balance parameter beta
  - usually beta = 1 (F1): the harmonic mean of precision and recall

### Multiclass classification

- same as binary, with more classes
- accuracy, precision etc. are computed for each class
- aggregate (e.g. average) over all classes to obtain overall scores

### Regression

- Pearson correlation coefficient
- Root mean squared error ![](https://i.imgur.com/nxA3yFl.png =200x)

### Precision-recall curve

- classifiers output a confidence
  - e.g. the margin in SVM
- precision, recall etc. are computed for a given decision threshold on the confidence
- testing different thresholds gives a precision-recall curve
- the area under the curve is an aggregate value over all thresholds

### Procedure

- computing performance on the training set is biased
- split the dataset into training, validation and test sets
  - validation set -> tune hyperparameters and algorithms
  - test set -> compute the final performance
- **K-fold cross validation**
  - split D into k sets
  - in a loop, each fold is used as test set and the remaining ones as training/validation
  - compute the score of each predictor
  - compute the average score
  - with this formula we can also compute the variance: ![](https://i.imgur.com/zT8x1Wo.png =150x)

### Comparing algorithms

- run k-fold for each algorithm (same folds)
- compute the difference of scores $δ_i$ on each fold
- compute the average error difference (average of the $δ_i$): $\bar δ$
- compute the unbiased estimate of the standard deviation ![](https://i.imgur.com/wHK9C4g.png =200x)
- $\bar δ$ divided by this estimate is the **standardized mean error difference**
- set a significance level alpha for the t distribution (e.g. 0.05)
  - the t distribution is similar to the Gaussian
- if the standardized mean error difference is larger than $t_{k-1, \alpha}$, reject the null hypothesis (that there is no difference between the classifiers): the classifiers are significantly different

## Parameter estimation

**Setting:**
- data is sampled from a **probability distribution** $p(x, y)$
- the **form** of the distribution $p$ **is known**, but its **parameters are unknown**
- training set $D = \{(x_1, y_1), \dots, (x_m, y_m)\}$ of examples sampled i.i.d. (**independent and identically distributed**, i.e. independent examples from the SAME distribution) according to $p(x, y)$

**Goal: estimate the unknown parameters of $p$.**

### Multi-class classification setting

- **divide** the training set $D$ based on classes ($D_i$ for class $i$)
- given a new example $x$, compute the **posterior probability** of class $y_i$:

$$
P(y_i|x, D) = \frac{p(x|y_i,D_i)p(y_i|D)}{p(x|D)}
$$

We must estimate the class-dependent parameters $θ_i$ of $p(x|y_i, D_i)$.
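As a minimal sketch of this setting (all names illustrative): class priors $p(y_i|D)$ estimated as class frequencies, class-conditionals $p(x|y_i, D_i)$ assumed to be diagonal Gaussians with parameters estimated from the per-class data, and the posterior obtained via Bayes' rule.

```python
import numpy as np

def fit_class_params(X, y):
    """Per-class mean/variance (diagonal Gaussian assumption) and class priors."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        # (mean, variance + small floor for stability, prior p(y=c|D))
        params[c] = (Xc.mean(axis=0), Xc.var(axis=0) + 1e-9, len(Xc) / len(X))
    return params

def log_gaussian(x, mean, var):
    """Log-density of a diagonal Gaussian at x."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

def posterior(x, params):
    """P(y_i | x, D) via Bayes' rule; p(x|D) is just the normalizing constant."""
    logs = {c: log_gaussian(x, m, v) + np.log(prior)
            for c, (m, v, prior) in params.items()}
    mx = max(logs.values())
    unnorm = {c: np.exp(l - mx) for c, l in logs.items()}  # subtract max for stability
    Z = sum(unnorm.values())
    return {c: u / Z for c, u in unnorm.items()}
```

A new example is then assigned to the class with the highest posterior; the denominator $p(x|D)$ never needs to be computed explicitly, only the normalization over classes.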
### Maximum Likelihood vs Bayesian estimation

- **Maximum likelihood / maximum a-posteriori estimation:**
  - assumes the parameters $θ$ are **unknown but fixed**
  - finds the distribution parameters that **maximize** the probability of the given data
  - the obtained values are then used to compute probabilities for new examples
  - the optimization problem can get very complicated
  - as the number of measurements $\rightarrow \infty$, MLE is unbiased and achieves the smallest possible variance
- **Bayesian estimation:**
  - treats the parameters $θ_i$ as:
    - random variables
    - with some known prior distribution (see conjugate priors below)
  - NB: observing examples turns the prior distribution over the parameters into a posterior distribution

### Maximum-likelihood estimation

$\theta_i^*$ is the set of parameters that maximizes the probability of the data of class $i$.

- **Maximum a-posteriori estimation:**
  - assumes a prior distribution $p(\theta_i)$ for the parameters is available
  - $θ^∗_i = \text{argmax}_{θ_i}\, p(θ_i |D_i , y_i) = \text{argmax}_{θ_i}\, p(D_i , y_i |θ_i)p(θ_i)$
- **Maximum likelihood estimation:**
  - no assumption about prior distributions of the parameters
  - $θ^∗_i = \text{argmax}_{θ_i}\, p(D_i , y_i |θ_i)$
  - given $n$ inputs of class $y$:
    - $θ^* = \text{argmax}_θ\, p(D|θ) = \text{argmax}_θ \prod \limits ^n_{j=1} p(x_j |θ)$
- **Maximum log-likelihood estimation:**
  - usually easier to maximize
  - $θ^∗ = \text{argmax}_θ\, \ln p(D|θ) = \text{argmax}_θ \sum \limits ^n_{j=1} \ln p(x_j |θ)$

### Parameter estimation for the Gaussian distribution

- the ML estimate of the Gaussian mean is the sample mean
- the ML estimate of the Gaussian covariance matrix is the mean of the sample covariances

### Bayesian estimation

Predictions for new examples are obtained by integrating over all possible values of the parameters:

$$
p(x|D) = \int_θ p(x, θ|D)dθ = \int p(x|θ)p(θ|D)dθ
$$

- $p(x|θ)$ can be **easily computed** because we have both form and parameters of the distribution (Gaussian, …)
- we need to **estimate the parameter posterior density** given the training set:

$$
p(θ|D) = \frac{p(D|θ)p(θ)}{p(D)}
$$

- $p(D)$ is a constant independent of $θ$ and has no influence on the Bayesian decision

#### Bayesian estimation cases

- **Univariate normal case:** unknown $\mu$, known $σ^2$
  - $p(x|\mu) ∼ N(\mu, σ^2)$
  - the posterior distribution of the mean narrows as the sample size grows
- **Multivariate normal case:** unknown $\mu$, known $\Sigma$

### Sufficient statistics

**Statistic:** any function on a set of samples $D$.

- a statistic $\textbf{s} = \phi(D)$ is sufficient for some parameters $θ$ if:

$$
P(D|\textbf{s}, θ) = P(D|\textbf{s})
$$

- a sufficient statistic allows to **compress a sample** $D$ into (possibly few) values

The sample mean and sample covariance are **sufficient statistics** for the true mean and covariance of the **Gaussian distribution**.

### Conjugate priors

A prior distribution $p(θ)$ **is a conjugate prior for a likelihood function** $p(x|θ)$ if the posterior distribution $p(θ|x)$ is in the same family as the prior $p(θ)$.

![](https://i.imgur.com/iGdRiUd.png =400x)

## Bayesian networks

### Visualization

- visualize the model in a simple and intuitive way
- properties of the model are shown in the graph
- complex computations become graphical manipulations
- multiple probability distributions in a single graph

### Network

- directed acyclic graph
- each node is a variable x
- edges are dependencies between variables
- each variable is independent of its non-descendants given its parents (local Markov property)

#### I-map

- P is the joint distribution of all variables
- $I(P)$ is the set of independencies holding in P
- G is an I-map of P if its local independencies hold in P: $I_l(G) \subseteq I(P)$
- A minimal I-map is an I-map that cannot be reduced (by removing edges) without losing the I-map property (all local independencies are in $I(P)$)
- A perfect I-map (P-map) captures all independencies in P.
P-maps are hard to find and not guaranteed to exist.

#### Factorization

- P factorizes according to G when $P(x) = \prod p(x|Parents(x))$
- **G is an I-map of P <=> P factorizes according to G**

### Definition

A Bayesian network is a pair (G, p) where p factorizes over G and is represented as a set of conditional probability distributions (CPDs) associated with the nodes of G.

### D-separation

Basic rules: ![](https://i.imgur.com/i53NM0m.png =500x)

Given 3 sets of nodes A, B, C:
- A and B are d-separated by C if all paths from A to B are blocked, i.e. every path either:
  - contains a tail-to-tail or head-to-tail node that is in C (observed), or
  - contains a head-to-head node n such that neither n nor any of its descendants is in C (not observed)
- A and B are independent given C if they are d-separated by C

### Inference

- given some evidence **e** for a subset of variables **E**
- inference is computing the probability of a non-observed subset X given the evidence: p(X | E=e)
- this can be computed as p(X, E=e) / p(E=e)
  - difficult with many variables
  - we can do inference on the graphical model instead

#### Messages

Given a chain-structured graph: ![](https://i.imgur.com/lzhAtJL.png =200x)

- $p(\mathbf{x}) = p(x_1)\,p(x_2|x_1) \cdots p(x_N|x_{N-1})$
- the marginal of a node $x_n$ can be re-imagined as a message passed up from the last element of the chain towards $x_n$: ![](https://i.imgur.com/6V7bU0W.png =150x)
- and, symmetrically, a message going from $x_1$ down to $x_n$: ![](https://i.imgur.com/bi6eeq4.png =150x)
- at the node $x_n$ for which we want the marginal probability this gives: ![](https://i.imgur.com/n4gxRVp.png =150x)
- when a node is observed we don't need to iterate over all its possible values (no summation): the observed value is enough
- the same idea extends to Bayesian networks: the **sum-product algorithm** (see slides)
- it can also be used to obtain the configuration of the variables with the highest probability
  - simply use the **max-product algorithm**
  - this has underflow issues: take logs, obtaining the **max-sum algorithm**

### Approximate inference

When exact inference is not tractable we resort to approximate inference:
- loopy belief propagation:
  - use the original graph even if it has loops
  - messages are passed multiple times (because of the loops)
  - hoping the algorithm converges (messages stop changing)
- variational methods:
  - deterministic approximation
  - assume a simplified form for the posterior probability
- sampling methods:
  - approximate the posterior by sampling from the network
  - a sample is an instantiation of X according to its probability

### Learning parameters

Objective: given the model structure, estimate the probabilities from a dataset.

#### Maximum likelihood on complete data

![](https://i.imgur.com/5idgS7e.png =400x) ![](https://i.imgur.com/LR8ABRe.png =500x) ![](https://i.imgur.com/TI6ME95.png =500x)

- the parameters of each CPD can be estimated independently: ![](https://i.imgur.com/dp9naFo.png =400x)
- a discrete CPD P(X|U) can be represented as a table, with:
  - rows for configurations of X
  - columns for configurations of U
  - each table entry $\theta_{x|u}$
  - each column sums to 1
  - the parameters of each column can be estimated independently
- zeroing the gradient gives: ![](https://i.imgur.com/RLRwpFP.png =220x)

**Therefore, the maximum likelihood parameters are simply the fraction of times the specific configuration was observed in the data.**

##### Adding priors

- MLE tends to overfit
  - a configuration not appearing in the training set gets zero probability
- solution: use maximum a-posteriori estimation ![](https://i.imgur.com/7SKBSZ2.png =220x), using Dirichlet priors

#### Estimation on incomplete data

- some examples have missing variables
  - we cannot compute configuration frequencies
- full Bayesian estimation is too complex
- **we need to approximate**

##### Expectation Maximization

- the sufficient statistics (counts) cannot be computed directly (missing data)
- fill in the missing data by inferring it using the current parameters (solve an inference problem to get expected counts)
- compute the parameters maximizing the likelihood (or posterior) of such expected counts
- iterate the procedure to improve the quality of the parameters

Algorithm:
1. initialize $\theta$ to random values
2. **e-step**: compute the expected sufficient statistics using the joint probability of X given D and the current $\theta$
   - if a variable is observed, its contribution to the count is either 0 or 1
   - otherwise infer its distribution from the network
3. **m-step**: compute the parameters that maximize the likelihood (or posterior) of the expected statistics
4. repeat steps 2-3 until convergence

### Learning structure

#### Full Bayesian approach (aka model averaging)

- let S be the space of possible structures (DAGs) for the domain X
- let D be a dataset of observations
- predictions for a new instance are computed by marginalizing over both structures and parameters
- **too expensive**

#### Score based

- assign a score to each structure
  - ML score: leads to a fully connected network => unusable
  - MAP score: e.g. Bayesian-Dirichlet scoring
- choose the best structure $S^*$
- assume $P(S^*|D)=1$
- **search approach**: start with a random DAG, modify it using hill climbing, best first, or simulated annealing

#### Constraint based

- test conditional independencies on the data
- choose any structure $S^*$ that satisfies them
- assume $P(S^*|D)=1$

### Naive Bayes

- each instance x is described by a conjunction of attribute values
- the target function can take any value from a finite set Y
- the task is predicting the MAP (maximum a-posteriori) target value
- problem: the full conditional distribution is hard to learn ![](https://i.imgur.com/4ExVe9S.png =150x)
- solution: simplify by assuming the attributes are independent given the class: ![](https://i.imgur.com/NK6K3yV.png =250x)
- new MAP definition: ![](https://i.imgur.com/AXhkw4O.png =250x)
- the probability of each attribute given the class, P(a|y), can be modelled as a multinomial

## Discriminative learning

- generative learning assumes knowledge of the distribution
- discriminative learning focuses on modelling the discriminant function (boundaries rather than distributions)

Pros:
- modelling complex data distributions is difficult
- if discrimination is the goal, modelling the distribution is unnecessary

Cons:
- the learned model is less flexible
- no inference tasks
- cannot generate new data

### Linear discriminant functions

- form: ![](https://i.imgur.com/0HQjNoT.png)
  - a linear combination of the features
  - $w_0$ is the bias
- the simplest possible discriminant function
- it can be the best option, and is surely the first to try

### Linear binary classifier

- uses a linear discriminant function: ![](https://i.imgur.com/seMqnqM.png)
- decision boundary where $f(x)=0$
- the boundary is a hyperplane $H$ orthogonal to $w$
- the geometric margin is the distance from the boundary ![](https://i.imgur.com/bzQUt4b.png =120x)
- a point can be expressed as its projection on the hyperplane plus its (signed) distance from it

### Perceptron

1. linear combination of the inputs (the bias can be incorporated into the w vector)
2. threshold activation function (sign/sigmoid/…)
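The two steps above can be sketched in NumPy; this is a minimal illustration (function names are made up), using the sign activation and the classic perceptron update on misclassified examples.

```python
import numpy as np

def perceptron_predict(w, x):
    """Step 1-2: linear combination of the inputs, then sign threshold.
    The bias is incorporated into w by appending a constant-1 feature."""
    return np.sign(np.dot(w, np.append(x, 1.0)))

def perceptron_train(X, y, eta=0.1, epochs=100):
    """Illustrative training loop: adjust the weights only on
    misclassified examples, one step per error (y in {-1, +1})."""
    w = np.zeros(X.shape[1] + 1)  # +1 for the bias weight
    for _ in range(epochs):
        errors = 0
        for xi, yi in zip(X, y):
            if perceptron_predict(w, xi) != yi:
                w += eta * yi * np.append(xi, 1.0)  # move towards the correct side
                errors += 1
        if errors == 0:  # everything labelled right: stop
            break
    return w
```

On linearly separable data (e.g. the AND function with labels in {-1, +1}) this loop is guaranteed to converge in a finite number of mistakes.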
Representational power:
- linearly separable sets with 1 layer
- all logic functions with 2 layers (but an exponential number of hidden nodes is needed for some functions)

Learning:
1. choose a function of the parameters to be optimized ![](https://i.imgur.com/fD16kf8.png =80x)
   - e.g. the error on the training set (loss)
2. gradient descent:
   - initialize $w$ at random (or e.g. w=0)
   - iterate until the gradient approaches 0: ![](https://i.imgur.com/03wpjF9.png =200x)
   - $\eta$ is the learning rate; it can be tuned or made dynamic for faster and more precise convergence
   - guarantees a **local** optimum only
   - the loss function needs to be differentiable

#### Perceptron training rule

- the misclassification loss is piecewise constant
  - we can't use gradient descent on it
- use instead: ![](https://i.imgur.com/OSMLZE3.png =150x)
  - $D_e$ is the set of wrongly labelled examples ($y f(x)$ has discordant signs, so it is negative)
  - with this, the error is the sum of the functional margins of the incorrectly labelled examples

#### Stochastic perceptron training rule

1. initialize the weights at random
2. iterate until everything is labelled right: ![](https://i.imgur.com/jlZ8RqQ.png =150x)

- weights are adjusted for each error
  - one gradient step per error
  - steps are very fast
  - can help avoiding local minima

#### Perceptron regression

- matrix of inputs X, vector of outputs y
- regression is a set of linear equations: Xw = y
- X has more rows than columns
  - more equations than unknowns
  - an exact solution does not always exist

#### Perceptron regression: mean squared error

![](https://i.imgur.com/DTMsu8G.png =150x)

- can always be solved by gradient descent
- can also be used as a classification loss
- closed form: ![](https://i.imgur.com/LLcU7f9.png =150x)
  - the first part (excluding y) is the left inverse of X
  - features must be linearly independent (remove redundant ones)

#### Multiclass

### Generative linear classifiers

## SVMs

- linear classifiers maximizing the separation margin between the classes
- the solution depends on few training examples (support vectors)
- error bounds based on the margin
- easily extended to non-linear models (kernel machines)

### Maximum margin classifier

- measures:
  - confidence margin ![](https://i.imgur.com/mIRYHh2.png =200x): the minimal functional margin over the training set
  - geometric margin ![](https://i.imgur.com/2v3CGT4.png =200x)
- canonical hyperplane: the hyperplane rescaled so that the confidence margin is 1 => geometric margin 1/||w||

### Learning hard margin SVMs

Objective: minimize ![](https://i.imgur.com/HKKj5NW.png =300x)

**A quadratic problem with linear constraints.**

#### Dual formulation

- using the Lagrangian we obtain a new formulation with variables $\alpha$
  - maximizing over $\alpha$ is equivalent to minimizing over w ![](https://i.imgur.com/KPhgiTw.png =300x)
- the primal formulation has d+1 variables (number of features + 1) ![](https://i.imgur.com/BVPYeGL.png)
- the dual formulation has m variables (number of training examples) ![](https://i.imgur.com/aKrCHht.png =300x)
- the dual formulation has simpler (box) constraints, and is easier to solve ![](https://i.imgur.com/yr2oiyA.png =200x)
- **choose between the two based on the problem**

#### Support vectors

In the Lagrangian, the second summation is 0 for all values at the saddle point.
This summation can be zero either because:
- $\alpha=0$
  - the training example gives no contribution to classifying test examples
- or because $y f(x) = 1$
  - these are the examples closest to the margin: the **support vectors**

### Soft margin SVM

**Allow misclassifications:**
- uses slack variables $\xi_i$ as penalties for the misclassification of examples $x_i$
- a parameter $C \geq 0$ trades off data fitting against the size of the margin
- objective of the optimization: ![](https://i.imgur.com/gq7cjE1.png =420x)
- usually $\xi_i = loss(y_i, f(x_i))$ ![](https://i.imgur.com/Nsp55hc.png =350x)
- this is a regularized loss minimization problem:
  - the margin maximization term (first term) accounts for regularization (solutions with larger margin are preferred)
  - the loss term (second term) accounts for error minimization
  - **prevents overfitting**

#### Hinge loss

**A common loss function:** ![](https://i.imgur.com/Zn1E14W.png)

- $|z|_+ = z$ if $z > 0$ and $0$ otherwise (positive part)
- it corresponds to the slack variable $\xi_i$ (the violation of the margin constraint)
- all examples not violating the margin constraint have zero loss (sparse set of support vectors)

#### Dual formulation

![](https://i.imgur.com/IZeENOG.png =200x)

- same procedure as for the hard margin (take derivatives of the Lagrangian with respect to w, $w_0$ and $\xi$)
- we obtain the same dual as in the hard margin case
- **but:** $0 \leq \alpha \leq C$
  - because $C-\alpha-\beta=0$
  - examples with $\alpha = C$ violate the margin constraint (bound support vectors)

### SGD for SVMs

**Way faster than solving the primal/dual for large datasets.**

## Kernels

Problem:
- hard-margin SVMs can address linearly separable problems
- soft-margin SVMs can address linearly separable problems with outliers
- non-linearly separable problems need a higher expressive power (i.e. more complex feature combinations)
- we do not want to lose the advantages of linear separators (i.e. large margin, theoretical guarantees)

Solution:
1. **map input examples to a higher dimensional feature space**
2.
perform linear classification in this higher dimensional space

### Feature maps

![](https://i.imgur.com/Mm5CDqQ.png)

- $\Phi$ is a **function mapping** each example to a higher dimensional space H
- examples x are replaced with their feature mapping $\Phi(x)$
- the feature mapping should increase the expressive power of the representation (e.g. introducing features which are combinations of input features)
- examples should be (approximately) linearly separable in the mapped space
- the formula is the same, with $\phi(x)$ in place of the old x:
  - $f(x)=w^T\phi(x)+w_0$

#### Polynomial mapping

- homogeneous: all combinations of features of degree exactly d
- inhomogeneous: all combinations of features of degree up to d

#### Support vector regression

- ε-insensitive loss: ![](https://i.imgur.com/YtlY2rC.png)
  - tolerates small (up to ε) deviations from the true value with no penalty
  - defines an ε-tube of insensitiveness around the true values
  - trades off function complexity with data fitting

**Primal**: ![](https://i.imgur.com/deDrCBT.png =500x)

**Dual**: ![](https://i.imgur.com/M9yr4su.png =450x)

### Kernel trick

- computing $\phi(x)$ can be very expensive
- but it appears only inside dot products in the dual formulations
- replace those dot products with a kernel function:
  - ![](https://i.imgur.com/yZV1rCl.png)
  - ![](https://i.imgur.com/cFVLYFj.png)

#### Valid kernels

- a valid kernel is a (similarity) function defined on the cartesian product of the input space: ![](https://i.imgur.com/JhRns0X.png)
- it corresponds to a dot product in some feature space: ![](https://i.imgur.com/5xptgmc.png)
- it produces a **Gram matrix**: a symmetric matrix K where ![](https://i.imgur.com/j3jWc2w.png)
- **the kernel is positive (semi-)definite <=> there is a corresponding feature map (i.e. the kernel is valid)**

Options to verify validity:
- prove the kernel Gram matrix K is positive definite (hard) ![](https://i.imgur.com/6Jnym5A.png =300x) **or** ![](https://i.imgur.com/RpHAnKF.png)
- find the corresponding feature map (as in the polynomial case)
- use kernel combinations

### Kernel machines

- old stochastic perceptron:
  - ![](https://i.imgur.com/MG1VY17.png =150x)
  - iterate and update on mislabelled examples ![](https://i.imgur.com/2q1zGCy.png)
- new kernel perceptron:
  - ![](https://i.imgur.com/GHtouS1.png =200x)
  - iterate and update on mislabelled examples ![](https://i.imgur.com/6UrU8OO.png)

### Types of kernels

- linear kernel
  - no transformation ![](https://i.imgur.com/17c35cN.png =150x)
- polynomial kernel ![](https://i.imgur.com/UNgnH9t.png =150x)
- Gaussian kernel
  - depends on the width parameter $\sigma$
  - small $\sigma$ -> behaves similarly to nearest neighbour
  - a universal kernel ![](https://i.imgur.com/6jsDskr.png =250x)

### Kernel operations

**The simplest constructive approach to build valid kernels:**

- Sum
  - same as the concatenation of the two feature spaces ![](https://i.imgur.com/MEddKKg.png =400x)
  - can sum kernels defined on different spaces
- Product
  - same as the cartesian product of the two feature spaces [image?]
  - can multiply kernels defined on different spaces
- Linear combination
  - weighted sum of kernels: ![](https://i.imgur.com/1ZsSQhC.png =300x)
  - learning the β weights together with the alphas is **kernel learning**
- Decomposition:
  - an operation to break a structure into parts ![](https://i.imgur.com/81Uj36W.png =200x)
  - e.g. dividing strings: ![](https://i.imgur.com/qPtzTt4.png =300x)
- Convolution:
  1. decompose x1 and x2 into D parts
  2. iterate over all combinations of decompositions
  3.
sum the product of the kernels on each part, $k_d(x^1_d,x^2_d)$ ![](https://i.imgur.com/tnoZTqe.png =500x)
- Set:
  - for each combination of members $\xi^1 \in X^1, \xi^2 \in X^2$ of the two sets, sum $k_{member}(\xi^1,\xi^2)$
  - ![](https://i.imgur.com/x9LHpO5.png =400x)
  - if $k_{member}$ is the delta kernel we get ![](https://i.imgur.com/azBVGoI.png =200x)
  - normalization is needed
- Composition:
  - given a kernel it is always possible to use another kernel on top of it

### Kernels on structured data

- similarity between objects
  - graphs, trees, strings
- match (delta) kernel
  - the simplest kernel on structures
  - x can be anything (not necessarily a vector) ![](https://i.imgur.com/OkrKUSV.png =220x)
- spectrum kernel
  - parameter k
  - the feature space is the frequency of each k-gram (substring)
  - very sparse (lots of zeros)
  - solution: mismatch string kernel
    - parameters k and m
    - frequency of each k-gram, allowing up to m mismatches
    - sum over mismatch counts from 0 to m, e.g. (k=3, m=0) + (k=3, m=1) = 1 + 24 = 25 ![](https://i.imgur.com/Ev3QIAB.png =200x)

#### Kernels on trees

- Subset tree kernel
  - matches subtrees
  - summation of a recursive procedure C over all pairs of nodes of the trees t and t': ![](https://i.imgur.com/XSN1FVS.png =200x)
- When do subtrees match?
  - consider the roots n and n' of the subtrees
  - if n = n' and children(n) = children(n') -> match
- The procedure C:
  - recursive
  - if n and n' don't match -> 0
  - if n and n' match and their children are all leaves -> 1
  - else: ![](https://i.imgur.com/hCS0Xho.png =220x)
  - large trees dominate the kernel: add a weight, replacing the 1 with $0\le\lambda\le1$

#### Kernels on graphs

1. The simplest way is similar to the spectrum kernel for strings:
   - frequency of subgraphs of size up to k
   - dot product of the frequency vectors
   - good for small graphs
2.
Walk kernel:
   - a walk in a graph is a way to go from one node to another
   - to efficiently count walks of length n, use $A^n$
     - where A is the adjacency matrix of the graph
   - compute the walks between every pair of labels: ![](https://i.imgur.com/zCS4MWY.png =180x)
   - the kernel is the dot product of the two matrices ![](https://i.imgur.com/Gahvx8B.png =220x)
3. See also the Weisfeiler-Lehman kernel

# Deep networks

## Clustering