:::danger
TODO: check all X's; most (if not all) should be bold in LaTeX ($\mathbf{X}$)
:::
# organisational stuff
3 parts: lecture, homework, exam
## How to Points
- teams of 2-3 people
- need 50% of the points on both (exercises and feedback) to be admitted to the exam
- **weekly exercises**
    - 40 points per sheet (graded 0%, 25%, ..., 100%)
    - each task is graded individually
    - more difficult tasks are worth more points
    - 1 week to solve the homework
    - half a week to correct a peer's hand-in against the sample solution (comments only, no changing the code itself)
    - deadlines alternate between Monday and Thursday
- **cross-examination**
    - feedback should be helpful
    - 0 points if no feedback or unhelpful feedback
    - 1 point for good feedback
## Mampf
:::info
https://mampf.mathi.uni-heidelberg.de/lectures/193
:::
- publish homework & sample solutions & lecture blackboards
- upload your hand-ins
- hand in joint solutions on Mampf
- Python introduction
> uploading takes a few minutes, we should not ignore that
> agreed
## Müsli
- points
- notification emails
## HeiCo
- final results
## Discord
- discussions
- questions
- teamfinding
## Exam
- mini-research project
- implement an AI agent that performs well in a tournament of the [Bomberman](https://en.wikipedia.org/wiki/Bomberman) game
- use machine learning
- "reinforcement learning"
- teams of 2 or 3
- start: end of July
- effective working time: 3 weeks
- deadlines:
    - 23.09. for the agent code
    - 30.09. for the final report (similar to a bachelor thesis)
        - 4000 words per team member (roughly 10 pages per team member)
        - the report is very important for the final grade
- tournament in October
- award ceremony in November
- the tournament accounts for only a small percentage of the points; results there are not really relevant
# Chapter 1 - What is Machine Learning?
- **Setting:** We would like to know some quantity $y$, but cannot (easily) measure it directly (may be impossible, illegal, unethical, too expensive, only happens in the future, ...)
- **Problems:**
    - sometimes physically impossible (travel to the center of the Earth etc.)
    - sometimes too expensive
        - crash test for car stability (you don't necessarily want that on every car)
    - may be illegal (no forcing people to do things, no torture, no gaslighting :(, ...)
    - too time-consuming
    - measurement would kill the animal (we don't want that)
- **Solution:**
    - We measure different quantities $x$ and calculate $y=f(x)$
    - ML jargon: $y$ response, $x$ features (usually multiple $\implies x$ is a vector)
        - features should be informative about $y$
    - we are looking for a function $f$ to calculate $y$
- **traditional way** to get $f(x)$: ask an expert
    - expensive
    - slow
    - many problems where experts fail (e.g. biology, because it is too messy)
- **machine-learning way:** derive $f(x)$ from <font color="red">data</font> (+ prior knowledge)
- supervised learning: measure $y$ in a "lab" setting
- define a <font color="red">generic function family $F$</font> that contains many functions $f_1(x), f_2(x),\dots$
$\implies$ formula is fixed but coefficients can be chosen freely
e.g. neural network architecture
jargon: $\Theta$: coefficient ("weight") vector
$\implies$ idea: choose $F$ such that it contains a good $f(x)$ and learn $\Theta$ such that $f_\Theta(x)\approx y$
> are these lowercase $x$ and $y$ or uppercase $X$ and $Y$?
> genuinely not sure anymore
example: $F: f_{\Theta}(x) = a\cdot x^2 + b \cdot x + c~~~ (x\in \mathbb{R},~ \Theta = (a,b,c))$
dataset 1:
<!--- insert screenshot here later--->

$$\Theta_1=(a=1,b=0,c=0.5)$$
dataset 2:
<!--- insert screenshot here later--->
$$\Theta_2=(a=0,b=1.2,c=0)$$
Too many coefficients can lead to ["overfitting"](https://www.ibm.com/topics/overfitting)
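A minimal sketch of this example in Python (the synthetic data and the `numpy.polyfit` call are illustrative assumptions, not from the lecture): the family $F$ is fixed, and learning only picks the coefficients $\Theta=(a,b,c)$.

```python
import numpy as np

# hypothetical data in the spirit of dataset 1 (roughly quadratic)
rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=20)
y = 1.0 * x**2 + 0.0 * x + 0.5 + rng.normal(scale=0.1, size=20)

# F: f_Theta(x) = a*x^2 + b*x + c  ->  least-squares fit of the coefficients
a, b, c = np.polyfit(x, y, deg=2)
print(f"Theta_hat = (a={a:.2f}, b={b:.2f}, c={c:.2f})")  # roughly (1, 0, 0.5)
```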
## basic machine learning workflow
1. Collect data: $\underline{separate}$ training and test set
> most important message: two data sets, a training and a test set
> [name= köthe]
2. use prior knowledge to pick $F$
3. use training set $(TS)$ to learn $\hat{\Theta}$ (notation $\hat{\cdot}~\hat{=}$ predicted/learned solution)
4. use test set to evaluate the quality of $\hat{y}=f_{\hat{\Theta}}(x)$ (notation $\cdot ^{*}$: true value/solution); if not good enough, iterate (go back and):
    1. collect more or better data
    2. pick a better $F$
    3. use better training algorithms

    <font color="red">danger:</font> the test set eventually becomes part of the training
5. if model is good $\Rightarrow$ deploy in practice
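A minimal end-to-end sketch of the workflow in Python (the synthetic data and the degree-2 family are assumptions chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# 1. collect data and *separate* it into a training set and a test set
x = rng.uniform(-2, 2, size=100)
y = x**2 + rng.normal(scale=0.2, size=100)   # hypothetical ground truth
x_train, y_train = x[:80], y[:80]
x_test,  y_test  = x[80:], y[80:]

# 2. prior knowledge: pick F = polynomials of degree 2
# 3. learn Theta_hat on the training set only
theta_hat = np.polyfit(x_train, y_train, deg=2)

# 4. evaluate y_hat = f_Theta_hat(x) on the held-out test set
y_hat = np.polyval(theta_hat, x_test)
test_mse = np.mean((y_hat - y_test) ** 2)
print(f"test MSE: {test_mse:.3f}")           # if too large: iterate (steps 1-3)
```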
## Overfitting
- $f_{\hat{\Theta}}(x)$ also learns the unimportant properties of the training data, e.g. noise, or just memorises the TS
$\implies$ generalises badly to new data:
- training accuracy is high $\Leftarrow$ training accuracy is **not** informative
- out-of-training accuracy is low $\Leftarrow$ we need a test set to recognize this
<!--- drawing --->
$F$: polynomial of degree 7
TS: 8 points
$\implies$ fits the TS exactly
$f_{\hat{\Theta}}(x)$ interpolates the TS exactly $\equiv$ remaining training error $=0$
test points are far from $f_{\hat{\Theta}}(x)$: huge test error
this must be detected and avoided $\implies$ we need a separate test set
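A small sketch reproducing this failure mode (the underlying quadratic and the noise level are assumptions chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)

# TS: 8 points from a simple underlying function plus noise
x_train = np.linspace(-1, 1, 8)
y_train = x_train**2 + rng.normal(scale=0.1, size=8)

# F: polynomial of degree 7 -> 8 coefficients interpolate the 8 points exactly
theta_hat = np.polyfit(x_train, y_train, deg=7)
train_err = np.mean((np.polyval(theta_hat, x_train) - y_train) ** 2)

# fresh test points reveal the huge test error
x_test = rng.uniform(-1, 1, size=100)
y_test = x_test**2 + rng.normal(scale=0.1, size=100)
test_err = np.mean((np.polyval(theta_hat, x_test) - y_test) ** 2)

print(f"train error: {train_err:.2e}")   # ~0: interpolates the TS exactly
print(f"test  error: {test_err:.2e}")    # much larger: bad generalisation
```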
---
## Intro continues
ML: learn a function from features $x$ (observable) to response $y$ (not easily observable), $y=\hat{f}(x)$; but sometimes this is not good enough: the response may be uncertain
$\hat{=}$ no single correct $y$, but multiple possibilities
solution: the function returns a set: $\{y_1,y_2,y_3\}=f(x)$ (if there are 3 possible outcomes)
to also return plausibility of different possible outcomes: **probabilistic prediction**
$\implies$ learn a conditional probability distribution ($y$ discrete) / density ($y$ continuous)
:::info
$y$ was lowercase the whole time; from now on an uppercase $Y$ also exists
:::
$p(Y=y|X=x)$, where $y$ takes values from the set of possible outcomes
:::info
$x$: photograph
$y$: tree species $\in \{\text{oak, willow, birch}\}$
$X=x=$[insert drawing of willow]
$p(Y=\text{willow}|X=x)=0.9$
$p(Y=\text{birch}|X=x)=0.08$
$p(Y=\text{oak}|X=x)=0.02$
$X$: satellite image
$X=x=$[insert drawing of dots in box as satellite high res image]
$p(Y=\text{willow}|X=x)=0.6$
:::
probabilistic prediction is a strict generalization of deterministic prediction
- deterministic prediction is a special case
1. only one possible outcome: delta distribution: $p(Y=y|X=x)=\delta(y-f(x))$ [insert graph, see mampf]
2. we are only interested in the most plausible outcome:
how to get a deterministic prediction from a probabilistic one:
**arg max operator**
[insert graph that looks like a kinda boring rollercoaster]
$\underset{y}{\max}~p(Y=y|X=x)$
$\underset{y}{\text{argmax}}~p(Y=y|X=x)$
$\hat{f}(x)=\underset{y}{\text{argmax}}~p(Y=y|X=x)$
other deterministic reductions:
- mean: $$\hat{y}=f(x)=\mathbb{E}_{y\sim p(Y|X=x)}[Y]=\begin{cases}\sum^C_{k=1}k\,p(Y=k|X=x) & (Y\text{ discrete})\\\int y\,p(Y=y|X=x)\,dy & (Y\text{ continuous})\end{cases}$$
- median or any suitable set representative
[more drawings]
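A small sketch of the argmax and mean reductions, assuming a hypothetical probability vector over the $C=3$ tree species:

```python
import numpy as np

# hypothetical probability vector p(Y=k | X=x) for k = 1..C, C = 3
# classes: 1 = oak, 2 = willow, 3 = birch
p = np.array([0.02, 0.90, 0.08])

# argmax reduction: the most plausible outcome
y_hat = np.argmax(p) + 1          # +1: numpy indexes from 0, classes from 1
print(y_hat)                      # 2 (willow)

# mean reduction: only meaningful if the classes are ordinal (sortable)
y_mean = np.sum(np.arange(1, 4) * p)
print(y_mean)                     # 2.06
```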
## Where does the uncertainty come from?
1. **intrinsic non-determinism of the world**: quantum mechanics, chaos theory, ...
2. **data are not perfect**: noise, missing data, outliers
3. **data may not be fully informative**: finite datasets, information loss in $X$ relative to $Y$, quantities may not be measurable
4. **models ($\hat{=}$ our interpretation of the data) are not perfect**: they simplify reality, ignorance, nature changes $\Rightarrow$ model may be outdated, model family too weak or poorly converged
5. **uncertainty about the goals**: what exactly is the "best" solution?
$\implies$ proper treatment of uncertainty is key challenge of AI
## Notation
:::success
says he will be consistent
apologizes for the $X$ notation
most uppercase $X$ will have an index $_i$
also, all $X$ are bold in LaTeX ($\mathbf{X}$)
:::
- $\mathbf{x}$ features: $\mathbf{x}\in\mathbb{R}^D$
- $D$ number of features
- $N$ instances in training set TS, indexed by $i=1,\dots,N$ (or $i',i''$)
- $\mathbf{X}_i$ features of instance $i$; features of all instances: matrix $X\in\mathbb{R}^{N\times D}$ ($\mathbf X_i$: row $i$ of $X$)
- $\mathbf X_j$ column $j$ of $X$: values of feature $j$ for all instances, $\mathbf X_j\in\mathbb{R}^N$ ($j=1,\dots, D$)
- $y$
    - continuous: $y\in\mathbb{R}$ (scalar; could be a vector $y\in\mathbb{R}^M$, but we usually avoid this)
        - exception to "$y$ is a single value": output is a probability vector
        - $y_k=p(Y=k|\mathbf X=\mathbf x)$, so $y\in\mathbb{R}^C$
    - discrete:
        - ordinal = ordered discrete values, e.g. tiny < small < medium < big (means sortable)
        - categorical = not sortable, e.g. oak, willow, birch
        - $C$ number of categories (labels)
        - $Y=k$, $k=1, \dots, C$
:::info
| i | gender | height | weight ($j=3$) |
| :---: | :---: | :---: | :---: |
| 1 ($i=1$) | m | 1.8 m | 80 kg |
| 2 | f | 1.7 m | 65 kg |
| 3 | d | 1.6 m | 90 kg |

what to do if a feature is discrete: "one-hot encoding" == a vector that is all 0 except for a 1 at the true value

| i | m | f | d | height | weight |
| :---: | :---: | :---: | :---: | :---: | :---: |
| 1 | 1 | 0 | 0 | 1.8 m | 80 kg |
| 2 | 0 | 1 | 0 | 1.7 m | 65 kg |
| 3 | 0 | 0 | 1 | 1.6 m | 90 kg |
:::
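A minimal sketch of one-hot encoding the gender column from the table above (plain NumPy; library helpers such as scikit-learn's `OneHotEncoder` exist too):

```python
import numpy as np

# categorical feature: gender in {m, f, d}
categories = ["m", "f", "d"]
gender = ["m", "f", "d"]          # one value per instance i

# one-hot encoding: all 0 except a single 1 at the true value
one_hot = np.zeros((len(gender), len(categories)))
for i, g in enumerate(gender):
    one_hot[i, categories.index(g)] = 1.0

print(one_hot)
# [[1. 0. 0.]
#  [0. 1. 0.]
#  [0. 0. 1.]]
```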
## basic approaches to machine learning
1. supervised learning: in the TS, both $x$ and $y$ are known. TS$=\{(\mathbf X_i,Y_i)\}^N_{i=1}$
- where do the $Y_i$ come from?
- human annotation
- lab measurement that is not possible in the wild
- wait for the future to unfold (use historical data)
- fundamental tasks
- regression: $y$ continuous
- classification: $y$ discrete
2. unsupervised learning: TS$=\{\mathbf X_i\}^N_{i=1}$
- tasks
- hidden variable: determine $\hat{Y}_i$ for $i\in$ TS and $\hat{f}$ jointly (== we know what $y$ is)
- discovery: find any structure in the data, "AI scientist" ("datamining", careful, structure can happen accidentally)
- dimension reduction: $\overset{\sim}{x}\in\mathbb{R}^{D'}$ with $D'<D$
- clustering: group similar things
3. weakly supervised learning: reduce the cost of TS annotations (hot topic)
- semi-supervised:
    - $Y_i$ is known for some $i$
- weakly-supervised:
    - the full label $Y_i$ (e.g. a fully labeled image, 1 class per pixel) is unknown
    - only a weaker label $\overset{\sim}{Y}_i$ is given, e.g. one object category per image (it still helps to localize the object precisely)
- active:
- the algorithm has some $Y_i$ and actively asks for the $y$ of the most informative $i'$
- self-supervised:
- instead of solving $y=f(x)$, pick some other target response $\overset{\sim}{y}$ and learn $\overset{\sim}{y}=\overset{\sim}{f}(x)$ instead, such that $\overset{\sim}{y}_i$ is easy to obtain $\implies$ ChatGPT; for images: SimCLR
<!---26.04.2024--->
## Perceptron (Continuation)
DRAWING
separating line/plane:
"decision boundary"
goal: the line should separate the classes $\underbrace{\overset{\text{step 0}}{\to}}_{\text{centralize } \overset{\sim}{x}_j = x_j-\text{mean}(x_j)}$
DRAWING
$b=0\implies$decision line goes through origin $\underbrace{\overset{\text{step 1}}{\to}}_{\text{mirror data }}$
DRAWING
$\overset{\approx}{x}_i= y_i^*\cdot \overset{\sim}{x}_i$
goal: rotate decision line such that all points are on one side $\underbrace{\overset{\text{step 2}}{\to}}_{\text{find } \underbrace{\hat{\beta}}_{\beta \text{ normal vector of line/plane}}}$
DRAWING
$\bar{x}_{err}=\frac{1}{N_{err}} \sum\limits_{i:\hat{y}_i\neq y_i^*}\overset{\approx}{x}_i$
$\beta_{new}=\beta_{old}- \tau \left.\frac{\partial loss}{\partial \beta}\right|_{\beta_{old}}$ (gradient descent)
$=\beta_{old} + \frac{\tau}{N} \sum\limits_{i:\hat{y}_i\neq y_i^*}y_i^*\overset{\sim}{x}_i^T$
$=\beta_{old} + \underbrace{\tau \frac{N_{err}}{N}}_{=\tau_{eff}}\cdot \underbrace{\frac{1}{N_{err}}\sum\limits_{i:\hat{y}_i \neq y_i^*}\overset{\approx}{x}_i^T}_{\bar{x}_{err}^T}$
In formula:
$loss (\hat{y}_i,y_i^*)=\begin{cases} 0 & \text{if } \hat{y}_i = y_i^* \iff \underbrace{\text{sign}(x_i\beta) =y_i^*}_{1.}\iff y_i^*x_i\beta>0\\ \underbrace{|x_i\beta|}_{2.} & \text{if } \hat{y}_i \neq y_i^*\end{cases}$
1.
$\begin{split} &\text{sign}(x_i\beta) = y_i^*\\ \iff~ &y_i^*\,\text{sign}(x_i \beta)=1 >0 \\ \iff~ &y_i^* x_i\beta>0\end{split}$
2.
$|x_i\beta|=\left.\begin{matrix} x_i\beta & \text{ if } x_i\beta> 0 \iff \hat{y}_i = 1,~ y_i^*=-1\\ -x_i\beta & \text{ if } x_i\beta<0\iff \hat{y}_i =-1,~ y_i^* = 1\end{matrix}\right\} =-y_i^* x_i \beta$
$loss(\hat{y}_i,y_i^*)=\begin{cases} 0 & \text{ if } y_i^* x_i\beta>0\\ -y_i^* x_i\beta & \text{ else } \end{cases} = ReLU(-t) \text{ with } t= y_i^*x_i \beta$
DRAWING
$\frac{\partial loss (\hat{y}_i, y_i^*)}{\partial \beta}=\begin{cases} 0 \text{ if }y_i^* x_i\beta > 0 \\ -y_i^* x_i^T\text{ else } \end{cases}$
average loss gradient:
$\frac{\partial loss(TS)}{\partial \beta} = \frac{1}{N} \sum\limits_{i:\hat{y}_i\neq y_i^*}-y_i^*x_i^T$
:::success
$\beta_{new}=\beta_{old}-\tau \frac{\partial loss(TS)}{\partial \beta} = \beta_{old}+\frac{\tau}{N}\sum\limits_{i:\hat{y}_i\neq y_i^*} y_i^*x_i^T$
Iterate until no $i$ is incorrectly classified; this only converges if the data are linearly separable, otherwise it oscillates $\implies$ reduce $\tau$
:::
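A minimal sketch of this perceptron update in NumPy (the function name, learning rate, and epoch limit are assumptions; the data are assumed centered as in step 0, i.e. $b=0$):

```python
import numpy as np

def train_perceptron(x, y_true, tau=0.1, epochs=100):
    """x: (N, D) centered features, y_true: (N,) labels in {-1, +1}.
    Assumes b = 0, i.e. the decision plane passes through the origin."""
    n, d = x.shape
    beta = np.zeros(d)
    for _ in range(epochs):
        y_hat = np.sign(x @ beta)
        err = y_hat != y_true                 # misclassified instances
        if not err.any():                     # no i incorrectly classified: done
            break
        # beta_new = beta_old + tau/N * sum_{i: y_hat_i != y_i*} y_i* x_i^T
        beta = beta + tau / n * (y_true[err] @ x[err])
    return beta
```

For non-separable data the loop above keeps oscillating until the epoch limit; as noted in the box, shrinking `tau` over the iterations dampens that oscillation.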
## Linear Support Vector Machine (SVM)
DRAWING
Solution idea: keep safety margin around TS
DRAWING
- solutions intersecting safety margin are now forbidden
- the intuitive "middle line" $\underline{maximises}$ the margin
$\hat{=}$ maximize distance of nearest point
DRAWING
penalize if $sign(x_i\beta)=y_i^*$ but $sign((x_i+\varepsilon)\beta)\neq y_i^*$ for small $\varepsilon$, i.e. the margin $y_i^* x_i \beta$ is small
formalize the idea:
- $\beta$ is not uniquely defined, because $sign(x_i\beta) = sign(x_i(\alpha\cdot\beta))\text{ for } \alpha>0$
$\implies$ "equivalence classes" of solutions $H= \{ \beta: \beta=\alpha \cdot \underbrace{\beta_H}_{\text{"representative of H" arbitrary}} (\alpha>0)\}$
- calculate the distance of a point $x_i$ from the plane with normal vector $\beta$: $$dist(x_i,H) =\frac{|x_i\beta_H|}{||\beta_H||_2}$$
- distance of TS from H "Hausdorff distance" $\hat{=}$ distance of nearest point $m_H= \min\limits_i \overbrace{\frac{|x_i\beta_H|}{||\beta_H||}}^{m_H \text{only depends on angle}}$, nearest index $\hat{i}=\arg \min\limits_i\frac{|x_i\beta_H|}{||\beta_H||}$
- for the nearest point $\hat{i}$: $|x_{\hat{i}}\beta_H|= y_{\hat{i}}^* x_{\hat{i}}\beta_H= m_H \cdot ||\beta_H||\overset{!}{=}1\iff$ choose $\beta_H$ such that $||\beta_H||=\frac{1}{m_H}$
### learning objective
$\hat{H}=\arg\max\limits_H \underbrace{\min\limits_i \frac{|x_i\beta_H|}{||\beta_H||}}_{m_H}= \arg\max\limits_H\frac{1}{||\beta_H||}\underbrace{\min\limits_i|x_i\beta_H|}_{=1\text{ by construction of } \beta_H}$
$\hat{\beta_H}=\arg\max\limits_{\beta_H}\frac{1}{||\beta_H||}$ such that for all $i$: $y_i^*x_i \hat{\beta_H}\geq 1$
:::success
$||\beta_H||\, m_H=\min\limits_i y_i^*x_i\beta_H\overset{!}{=} 1$
:::
equivalent but simpler formulations
$\hat{\beta_H}=\arg\min\limits_{\beta_H} ||\beta_H||$ such that (same constraints as before) $\iff \hat\beta_H=\arg\min\limits_{\beta_H} \frac12 ||\beta_H||^2$ s.t. as before
incorporate constraints into objective via Lagrange multiplier $\lambda$
- rewrite the constraints in terms of ReLU: $y_i^*x_i\beta_H \geq 1 \iff ReLU (1-y_i^*x_i\beta_H)=0$
:::success
$\implies \hat{\beta}_H= \arg\min\limits_{\beta_H}\underbrace{\frac12 \beta_H^T\beta_H}_{\text{regularization term: picks the solution with the largest margin}} + \lambda \underbrace{\frac1N \sum\limits_{i=1}^N ReLU (1-y_i^*x_i\beta_H)}_{\text{data term: ensures correct classification}}$
:::
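A minimal sketch minimising this objective by (sub)gradient descent (the function name, step size, and iteration count are assumptions; a real implementation would rather use a QP solver or scikit-learn's `LinearSVC`):

```python
import numpy as np

def train_linear_svm(x, y_true, lam=1.0, tau=0.01, steps=1000):
    """Minimise 1/2 beta^T beta + lam/N * sum_i ReLU(1 - y_i* x_i beta)
    by (sub)gradient descent. x: (N, D), y_true: (N,) in {-1, +1}."""
    n, d = x.shape
    beta = np.zeros(d)
    for _ in range(steps):
        margins = y_true * (x @ beta)
        active = margins < 1                  # points violating the margin
        # subgradient: regularizer beta minus data-term sum over active points
        grad = beta - lam / n * (y_true[active] @ x[active])
        beta = beta - tau * grad
    return beta
```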