:::danger
TODO: check all X's; most (if not all) should be bold in LaTeX ($\mathbf{X}$)
:::
# organisational stuff
3 parts: lecture, homework, exam
## How to Points
- teams of 2-3 people
- need 50% of the points on both (exercises and feedback) to be admitted to the exam
- **weekly exercises**
    - 40 points per sheet (graded 0%, 25%, ..., 100%)
    - each task is graded individually
    - more difficult tasks are worth more points
    - 1 week to solve the homework
    - half a week to correct a peer's hand-in against the sample solution (comments only, no changing the code itself)
    - deadlines alternate between Monday and Thursday
- **cross-examination**
    - feedback should be helpful
    - 0 points if no feedback or unhelpful feedback
    - 1 point for good feedback
## Mampf
:::info
https://mampf.mathi.uni-heidelberg.de/lectures/193
:::
- publish homework & sample solutions & lecture blackboards
- upload your hand-ins
- hand in joint solutions on Mampf
- Python introduction
> uploading takes a few minutes, we should not ignore that
> agreed
## Müsli
- points
- notification emails
## HeiCo
- final results
## Discord
- discussions
- questions
- teamfinding
## Exam
- mini-research project
- implement an AI agent that performs well in a tournament of the [Bomberman](https://en.wikipedia.org/wiki/Bomberman) game
- use machine learning
- "reinforcement learning"
- teams of 2 or 3
- start: end of July
- effective working time: 3 weeks
- deadlines:
    - 23.09. for the agent code
    - 30.09. for the final report (similar to a bachelor thesis)
        - 4000 words per team member (roughly 10 pages per team member)
        - the report is very important for the final grade
- tournament in October
- award ceremony in November
- the tournament accounts for only a small percentage of the points; results there are not really relevant
# Chapter 1 - What is Machine Learning?
- **Setting:** We would like to know some quantity $y$, but cannot (easily) measure it directly (may be impossible, illegal, unethical, too expensive, only happens in the future, ...)
- **Problems:**
    - sometimes physically impossible (travel to the center of the Earth etc.)
    - sometimes too expensive
        - crash test for car stability (you don't necessarily want that on every car)
    - may be illegal (no forcing people to do things, no torture, no gaslighting :(, ...)
    - too time-consuming
    - measurement would kill the animal (we don't want that)
- **Solution:**
    - We measure different quantities $x$ and calculate $y=f(x)$
    - ML jargon: $y$ response, $x$ features (usually multiple $\implies x$ is a vector)
        - features should be informative about $y$
    - we are looking for a function $f$ to calculate $y$
- **traditional way** to get $f(x)$: ask an expert
    - expensive
    - slow
    - many problems where experts fail (e.g. biology, because it is too messy)
- **machine-learning way:** derive $f(x)$ from <font color="red">data</font> (+ prior knowledge)
- supervised learning: measure $y$ in a "lab" setting
- define a <font color="red">generic function family $F$</font> that contains many functions $f_1(x), f_2(x),\dots$
$\implies$ formula is fixed but coefficients can be chosen freely
e.g. neural network architecture
jargon: $\Theta$: coefficient ("weight") vector
$\implies$ idea: choose $F$ such that it contains a good $f(x)$ and learn $\Theta$ such that $f_\Theta(x)\approx y$
> are these lowercase $x$ and $y$ or uppercase $X$ and $Y$?
> genuinely not sure anymore
example: $F: f_{\Theta}(x) = a\cdot x^2 + b \cdot x + c~~~ (x\in \mathbb{R},~ \Theta = (a,b,c))$
dataset 1:
<!--- insert screenshot here later--->

$$\Theta_1=(a=1,b=0,c=0.5)$$
dataset 2:
<!--- insert screenshot here later--->
$$\Theta_2=(a=0,b=1.2,c=0)$$
Too many coefficients can lead to ["overfitting"](https://www.ibm.com/topics/overfitting)
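A minimal sketch of this example in Python (the synthetic data and the `numpy.polyfit` call are illustrative assumptions, not from the lecture): the family $F$ is fixed, and learning only picks the coefficients $\Theta=(a,b,c)$.

```python
import numpy as np

# hypothetical data in the spirit of dataset 1 (roughly quadratic)
rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=20)
y = 1.0 * x**2 + 0.0 * x + 0.5 + rng.normal(scale=0.1, size=20)

# F: f_Theta(x) = a*x^2 + b*x + c  ->  least-squares fit of the coefficients
a, b, c = np.polyfit(x, y, deg=2)
print(f"Theta_hat = (a={a:.2f}, b={b:.2f}, c={c:.2f})")  # roughly (1, 0, 0.5)
```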
## basic machine learning workflow
1. Collect data: $\underline{separate}$ training and test set
> most important message: two data sets, a training and a test set
> [name= köthe]
2. use prior knowledge to pick $F$
3. use training set $(TS)$ to learn $\hat{\Theta}$ (notation $\hat{\cdot}~\hat{=}$ predicted/learned solution)
4. use test set to evaluate the quality of $\hat{y}=f_{\hat{\Theta}}(x)$ (notation $\cdot ^{*}$: true value/solution); if not good enough, iterate (go back and):
    1. collect more or better data
    2. pick a better $F$
    3. use better training algorithms

    <font color="red">danger:</font> the test set eventually becomes part of the training
5. if model is good $\Rightarrow$ deploy in practice
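A minimal end-to-end sketch of the workflow in Python (the synthetic data and the degree-2 family are assumptions chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# 1. collect data and *separate* it into a training set and a test set
x = rng.uniform(-2, 2, size=100)
y = x**2 + rng.normal(scale=0.2, size=100)   # hypothetical ground truth
x_train, y_train = x[:80], y[:80]
x_test,  y_test  = x[80:], y[80:]

# 2. prior knowledge: pick F = polynomials of degree 2
# 3. learn Theta_hat on the training set only
theta_hat = np.polyfit(x_train, y_train, deg=2)

# 4. evaluate y_hat = f_Theta_hat(x) on the held-out test set
y_hat = np.polyval(theta_hat, x_test)
test_mse = np.mean((y_hat - y_test) ** 2)
print(f"test MSE: {test_mse:.3f}")           # if too large: iterate (steps 1-3)
```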
## Overfitting
- $f_{\hat{\Theta}}(x)$ also learns the unimportant properties of the training data, e.g. noise, or just memorises the TS
$\implies$ generalises badly to new data:
- training accuracy is high $\Leftarrow$ training accuracy is **not** informative
- out-of-training accuracy is low $\Leftarrow$ we need a test set to recognize this
<!--- drawing --->
$F$: polynomial of degree 7
TS: 8 points
$\implies$ fits the TS exactly
$f_{\hat{\Theta}}(x)$ interpolates the TS exactly $\equiv$ remaining training error $=0$
test points are far from $f_{\hat{\Theta}}(x)$: huge test error
this must be detected and avoided $\implies$ we need a separate test set
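A small sketch reproducing this failure mode (the underlying quadratic and the noise level are assumptions chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)

# TS: 8 points from a simple underlying function plus noise
x_train = np.linspace(-1, 1, 8)
y_train = x_train**2 + rng.normal(scale=0.1, size=8)

# F: polynomial of degree 7 -> 8 coefficients interpolate the 8 points exactly
theta_hat = np.polyfit(x_train, y_train, deg=7)
train_err = np.mean((np.polyval(theta_hat, x_train) - y_train) ** 2)

# fresh test points reveal the huge test error
x_test = rng.uniform(-1, 1, size=100)
y_test = x_test**2 + rng.normal(scale=0.1, size=100)
test_err = np.mean((np.polyval(theta_hat, x_test) - y_test) ** 2)

print(f"train error: {train_err:.2e}")   # ~0: interpolates the TS exactly
print(f"test  error: {test_err:.2e}")    # much larger: bad generalisation
```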
---
## Intro continues
ML: learn a function from features $x$ (observable) to response $y$ (not easily observable), $y=\hat{f}(x)$; but sometimes this is not good enough: the response may be uncertain
$\hat{=}$ no single correct $y$, but multiple possibilities
solution: the function returns a set: $\{y_1,y_2,y_3\}=f(x)$ (if there are 3 possible outcomes)
to also return plausibility of different possible outcomes: **probabilistic prediction**
$\implies$ learn a conditional probability distribution ($y$ discrete) / density ($y$ continuous)
:::info
$y$ was lowercase the whole time; from now on an uppercase $Y$ also exists
:::
$p(Y=y|X=x)$, where $y$ takes values from the set of possible outcomes
:::info
$x$: photograph
$y$: tree species $\in \{\text{oak, willow, birch}\}$
$X=x=$[insert drawing of willow]
$p(Y=\text{willow}|X=x)=0.9$
$p(Y=\text{birch}|X=x)=0.08$
$p(Y=\text{oak}|X=x)=0.02$
$X$: satellite image
$X=x=$[insert drawing of dots in box as satellite high res image]
$p(Y=\text{willow}|X=x)=0.6$
:::
probabilistic prediction is a strict generalization of deterministic prediction
- deterministic prediction is a special case
1. only one possible outcome: delta distribution: $p(Y=y|X=x)=\delta(y-f(x))$ [insert graph, see mampf]
2. we are only interested in the most plausible outcome:
how to get a deterministic prediction from a probabilistic one:
**arg max operator**
[insert graph that looks like a kinda boring rollercoaster]
$\underset{y}{\max}~p(Y=y|X=x)$
$\underset{y}{\text{argmax}}~p(Y=y|X=x)$
$\hat{f}(x)=\underset{y}{\text{argmax}}~p(Y=y|X=x)$
other deterministic reductions:
- mean: $$\hat{y}=f(x)=\mathbb{E}_{y\sim p(Y|X=x)}[Y]=\begin{cases}\sum^C_{k=1}k\,p(Y=k|X=x) & (Y\text{ discrete})\\\int y\,p(Y=y|X=x)\,dy & (Y\text{ continuous})\end{cases}$$
- median or any suitable set representative
[more drawings]
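A small sketch of the argmax and mean reductions, assuming a hypothetical probability vector over the $C=3$ tree species:

```python
import numpy as np

# hypothetical probability vector p(Y=k | X=x) for k = 1..C, C = 3
# classes: 1 = oak, 2 = willow, 3 = birch
p = np.array([0.02, 0.90, 0.08])

# argmax reduction: the most plausible outcome
y_hat = np.argmax(p) + 1          # +1: numpy indexes from 0, classes from 1
print(y_hat)                      # 2 (willow)

# mean reduction: only meaningful if the classes are ordinal (sortable)
y_mean = np.sum(np.arange(1, 4) * p)
print(y_mean)                     # 2.06
```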
## Where does the uncertainty come from?
1. **intrinsic non-determinism of the world**: quantum mechanics, chaos theory, ...
2. **data are not perfect**: noise, missing data, outliers
3. **data may not be fully informative**: finite datasets, information loss in $X$ relative to $Y$, quantities may not be measurable
4. **models ($\hat{=}$ our interpretation of the data) are not perfect**: they simplify reality, ignorance, nature changes $\Rightarrow$ model may be outdated, model family too weak or poorly converged
5. **uncertainty about the goals**: what exactly is the "best" solution?
$\implies$ proper treatment of uncertainty is key challenge of AI
## Notation
:::success
says he will be consistent
apologizes for the $X$ notation
most uppercase $X$ will have an index $_i$
also, all $X$ are bold in LaTeX ($\mathbf{X}$)
:::
- $\mathbf{x}$ features: $\mathbf{x}\in\mathbb{R}^D$
- $D$ number of features
- $N$ instances in training set TS, indexed by $i=1,\dots,N$ (or $i',i''$)
- $\mathbf{X}_i$ features of instance $i$; features of all instances: matrix $X\in\mathbb{R}^{N\times D}$ ($\mathbf X_i$: row $i$ of $X$)
- $\mathbf X_j$ column $j$ of $X$: values of feature $j$ for all instances, $\mathbf X_j\in\mathbb{R}^N$ ($j=1,\dots, D$)
- $y$
    - continuous: $y\in\mathbb{R}$ (scalar; could be a vector $y\in\mathbb{R}^M$, but we usually avoid this)
        - exception to "$y$ is a single value": output is a probability vector
        - $y_k=p(Y=k|\mathbf X=\mathbf x)$, so $y\in\mathbb{R}^C$
    - discrete:
        - ordinal = ordered discrete values, e.g. tiny < small < medium < big (means sortable)
        - categorical = not sortable, e.g. oak, willow, birch
        - $C$ number of categories (labels)
        - $Y=k$, $k=1, \dots, C$
:::info
| i | gender | height | weight ($j=3$) |
| :---: | :---: | :---: | :---: |
| 1 ($i=1$) | m | 1.8 m | 80 kg |
| 2 | f | 1.7 m | 65 kg |
| 3 | d | 1.6 m | 90 kg |

what to do if a feature is discrete: "one-hot encoding" == a vector that is all 0 except for a 1 at the true value

| i | m | f | d | height | weight |
| :---: | :---: | :---: | :---: | :---: | :---: |
| 1 | 1 | 0 | 0 | 1.8 m | 80 kg |
| 2 | 0 | 1 | 0 | 1.7 m | 65 kg |
| 3 | 0 | 0 | 1 | 1.6 m | 90 kg |
:::
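A minimal sketch of one-hot encoding the gender column from the table above (plain NumPy; library helpers such as scikit-learn's `OneHotEncoder` exist too):

```python
import numpy as np

# categorical feature: gender in {m, f, d}
categories = ["m", "f", "d"]
gender = ["m", "f", "d"]          # one value per instance i

# one-hot encoding: all 0 except a single 1 at the true value
one_hot = np.zeros((len(gender), len(categories)))
for i, g in enumerate(gender):
    one_hot[i, categories.index(g)] = 1.0

print(one_hot)
# [[1. 0. 0.]
#  [0. 1. 0.]
#  [0. 0. 1.]]
```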
## basic approaches to machine learning
1. supervised learning: in the TS, both $x$ and $y$ are known. TS$=\{(\mathbf X_i,Y_i)\}^N_{i=1}$
- where do the $Y_i$ come from?
- human annotation
- lab measurement that is not possible in the wild
- wait for the future to unfold (use historical data)
- fundamental tasks
- regression: $y$ continuous
- classification: $y$ discrete
2. unsupervised learning: TS$=\{\mathbf X_i\}^N_{i=1}$
- tasks
- hidden variable: determine $\hat{Y}_i$ for $i\in$ TS and $\hat{f}$ jointly (== we know what $y$ is)
- discovery: find any structure in the data, "AI scientist" ("datamining", careful, structure can happen accidentally)
- dimension reduction: $\overset{\sim}{x}\in\mathbb{R}^{D'}$ with $D'<D$
- clustering: group similar things
3. weakly supervised learning: reduce the cost of TS annotations (hot topic)
- semi-supervised:
    - $Y_i$ is known for some $i$
- weakly-supervised:
    - the full label $Y_i$ (e.g. a fully labeled image, 1 class per pixel) is unknown
    - only a weaker label $\overset{\sim}{Y}_i$ is given, e.g. one object category per image (it still helps to localize the object precisely)
- active:
- the algorithm has some $Y_i$ and actively asks for the $y$ of the most informative $i'$
- self-supervised:
- instead of solving $y=f(x)$, pick some other target response $\overset{\sim}{y}$ and learn $\overset{\sim}{y}=\overset{\sim}{f}(x)$ instead, such that $\overset{\sim}{y}_i$ is easy to obtain $\implies$ ChatGPT; for images: SimCLR
<!---26.04.2024--->
## Perceptron (Continuation)
DRAWING
separating line/plane:
"decision boundary"
goal: the line should separate the classes $\underbrace{\overset{\text{step 0}}{\to}}_{\text{centralize } \overset{\sim}{x}_j = x_j-\text{mean}(x_j)}$
DRAWING
$b=0\implies$decision line goes through origin $\underbrace{\overset{\text{step 1}}{\to}}_{\text{mirror data }}$
DRAWING
$\overset{\approx}{x}_i= y_i^*\cdot \overset{\sim}{x}_i$
goal: rotate decision line such that all points are on one side $\underbrace{\overset{\text{step 2}}{\to}}_{\text{find } \underbrace{\hat{\beta}}_{\beta \text{ normal vector of line/plane}}}$
DRAWING
$\bar{x}_{err}=\frac{1}{N_{err}} \sum\limits_{i:\hat{y}_i\neq y_i^*}\overset{\approx}{x}_i$
$\beta_{new}=\beta_{old}- \tau \left.\frac{\partial loss}{\partial \beta}\right|_{\beta_{old}}$ (gradient descent)
$=\beta_{old} + \frac{\tau}{N} \sum\limits_{i:\hat{y}_i\neq y_i^*}y_i^*\overset{\sim}{x}_i^T$
$=\beta_{old} + \underbrace{\tau \frac{N_{err}}{N}}_{=\tau_{eff}}\cdot \underbrace{\frac{1}{N_{err}}\sum\limits_{i:\hat{y}_i \neq y_i^*}\overset{\approx}{x}_i^T}_{\bar{x}_{err}^T}$
In formula:
$loss (\hat{y}_i,y_i^*)=\begin{cases} 0 & \text{if } \hat{y}_i = y_i^* \iff \underbrace{\text{sign}(x_i\beta) =y_i^*}_{1.}\iff y_i^*x_i\beta>0\\ \underbrace{|x_i\beta|}_{2.} & \text{if } \hat{y}_i \neq y_i^*\end{cases}$
1.
$\begin{split} &\text{sign}(x_i\beta) = y_i^*\\ \iff~ &y_i^*\,\text{sign}(x_i \beta)=1 >0 \\ \iff~ &y_i^* x_i\beta>0\end{split}$
2.
$|x_i\beta|=\left.\begin{matrix} x_i\beta & \text{ if } x_i\beta> 0 \iff \hat{y}_i = 1,~ y_i^*=-1\\ -x_i\beta & \text{ if } x_i\beta<0\iff \hat{y}_i =-1,~ y_i^* = 1\end{matrix}\right\} =-y_i^* x_i \beta$
$loss(\hat{y}_i,y_i^*)=\begin{cases} 0 & \text{ if } y_i^* x_i\beta>0\\ -y_i^* x_i\beta & \text{ else } \end{cases} = ReLU(-t) \text{ with } t= y_i^*x_i \beta$
DRAWING
$\frac{\partial loss (\hat{y}_i, y_i^*)}{\partial \beta}=\begin{cases} 0 \text{ if }y_i^* x_i\beta > 0 \\ -y_i^* x_i^T\text{ else } \end{cases}$
average loss gradient:
$\frac{\partial loss(TS)}{\partial \beta} = \frac{1}{N} \sum\limits_{i:\hat{y}_i\neq y_i^*}-y_i^*x_i^T$
:::success
$\beta_{new}=\beta_{old}-\tau \frac{\partial loss(TS)}{\partial \beta} = \beta_{old}+\frac{\tau}{N}\sum\limits_{i:\hat{y}_i\neq y_i^*} y_i^*x_i^T$
Iterate until no $i$ is incorrectly classified; this only converges if the data are linearly separable, otherwise it oscillates $\implies$ reduce $\tau$
:::
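A minimal sketch of this perceptron update in NumPy (the function name, learning rate, and epoch limit are assumptions; the data are assumed centered as in step 0, i.e. $b=0$):

```python
import numpy as np

def train_perceptron(x, y_true, tau=0.1, epochs=100):
    """x: (N, D) centered features, y_true: (N,) labels in {-1, +1}.
    Assumes b = 0, i.e. the decision plane passes through the origin."""
    n, d = x.shape
    beta = np.zeros(d)
    for _ in range(epochs):
        y_hat = np.sign(x @ beta)
        err = y_hat != y_true                 # misclassified instances
        if not err.any():                     # no i incorrectly classified: done
            break
        # beta_new = beta_old + tau/N * sum_{i: y_hat_i != y_i*} y_i* x_i^T
        beta = beta + tau / n * (y_true[err] @ x[err])
    return beta
```

For non-separable data the loop above keeps oscillating until the epoch limit; as noted in the box, shrinking `tau` over the iterations dampens that oscillation.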
## Linear Support Vector Machine (SVM)
DRAWING
Solution idea: keep safety margin around TS
DRAWING
- solutions intersecting safety margin are now forbidden
- the intuitive "middle line" $\underline{maximises}$ the margin
$\hat{=}$ maximize distance of nearest point
DRAWING
penalize if $sign(x_i\beta)=y_i^*$ but $sign((x_i+\varepsilon)\beta)\neq y_i^*$ for small $\varepsilon$, i.e. the margin $y_i^* x_i \beta$ is small
formalize the idea:
- $\beta$ is not uniquely defined, because $sign(x_i\beta) = sign(x_i(\alpha\cdot\beta))\text{ for } \alpha>0$
$\implies$ "equivalence classes" of solutions $H= \{ \beta: \beta=\alpha \cdot \underbrace{\beta_H}_{\text{"representative of H" arbitrary}} (\alpha>0)\}$
- calculate the distance of a point $x_i$ from the plane with normal vector $\beta$: $$dist(x_i,H) =\frac{|x_i\beta_H|}{||\beta_H||_2}$$
- distance of TS from H "Hausdorff distance" $\hat{=}$ distance of nearest point $m_H= \min\limits_i \overbrace{\frac{|x_i\beta_H|}{||\beta_H||}}^{m_H \text{only depends on angle}}$, nearest index $\hat{i}=\arg \min\limits_i\frac{|x_i\beta_H|}{||\beta_H||}$
- for the nearest point $\hat{i}$: $|x_{\hat{i}}\beta_H|= y_{\hat{i}}^* x_{\hat{i}}\beta_H= m_H \cdot ||\beta_H||\overset{!}{=}1\iff$ choose $\beta_H$ such that $||\beta_H||=\frac{1}{m_H}$
### learning objective
$\hat{H}=\arg\max\limits_H \underbrace{\min\limits_i \frac{|x_i\beta_H|}{||\beta_H||}}_{m_H}= \arg\max\limits_H\frac{1}{||\beta_H||}\underbrace{\min\limits_i|x_i\beta_H|}_{=1\text{ by construction of } \beta_H}$
$\hat{\beta_H}=\arg\max\limits_{\beta_H}\frac{1}{||\beta_H||}$ such that for all $i$: $y_i^*x_i \hat{\beta_H}\geq 1$
:::success
$||\beta_H||\, m_H=\min\limits_i y_i^*x_i\beta_H\overset{!}{=} 1$
:::
equivalent but simpler formulations
$\hat{\beta_H}=\arg\min\limits_{\beta_H} ||\beta_H||$ such that (same constraints as before) $\iff \hat\beta_H=\arg\min\limits_{\beta_H} \frac12 ||\beta_H||^2$ s.t. as before
incorporate constraints into objective via Lagrange multiplier $\lambda$
- rewrite the constraints in terms of ReLU: $y_i^*x_i\beta_H \geq 1 \iff ReLU (1-y_i^*x_i\beta_H)=0$
:::success
$\implies \hat{\beta}_H= \arg\min\limits_{\beta_H}\underbrace{\frac12 \beta_H^T\beta_H}_{\text{regularization term: picks the solution with the largest margin}} + \lambda \underbrace{\frac1N \sum\limits_{i=1}^N ReLU (1-y_i^*x_i\beta_H)}_{\text{data term: ensures correct classification}}$
:::
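A minimal sketch minimising this objective by (sub)gradient descent (the function name, step size, and iteration count are assumptions; a real implementation would rather use a QP solver or scikit-learn's `LinearSVC`):

```python
import numpy as np

def train_linear_svm(x, y_true, lam=1.0, tau=0.01, steps=1000):
    """Minimise 1/2 beta^T beta + lam/N * sum_i ReLU(1 - y_i* x_i beta)
    by (sub)gradient descent. x: (N, D), y_true: (N,) in {-1, +1}."""
    n, d = x.shape
    beta = np.zeros(d)
    for _ in range(steps):
        margins = y_true * (x @ beta)
        active = margins < 1                  # points violating the margin
        # subgradient: regularizer beta minus data-term sum over active points
        grad = beta - lam / n * (y_true[active] @ x[active])
        beta = beta - tau * grad
    return beta
```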