---
title: Machine Learning - Questions from lecture
---
<span style="color:red">Wir sollten wohl auch die Algorithmen reinwerfen:</span>
<img src="https://hackmd.io/_uploads/H1sEbC9Dn.png" style="border:1px solid red;">
---
- What is the main difference between regression and classification? #VO1 #s20
<details>
<summary style="font-weight:bold">Solution</summary>
The result of a classification is a discrete value (a class label), while the result of a regression is a continuous value.
</details>
- Prove that the parameters that **maximize the likelihood** are the same as those that **minimize the least squares error**. <span style="color:blue">Hint: use the log-likelihood instead of the likelihood.</span> $$L(\boldsymbol{w}) = \prod\limits^n_{i=1}\frac1{\sqrt{2\pi}\sigma}\exp\left(-\frac{(y_i-\boldsymbol{x_i}\boldsymbol{w})^2}{2\sigma^2}\right)$$ #VO2 #s52
<details>
<summary style="font-weight:bold">Solution</summary>
$$\begin{aligned}L(\boldsymbol{w}) &= \underbrace{\dfrac 1{(\sqrt{2\pi}\sigma)^n}}_{const}\exp\left(-\sum\limits_{i=1}^n\frac{(y_i-\boldsymbol{x_i}\boldsymbol{w})^2}{2\sigma^2}\right) \\\\
\ln L(\boldsymbol{w}) &= const \color{blue}{-}\sum\limits_{i=1}^n\frac{(y_i-\boldsymbol{x_i}\boldsymbol{w})^2}{2\sigma^2} \\
&= const - \dfrac 1{2\sigma^2}\left(\boldsymbol{y}-\boldsymbol{X}\boldsymbol{w}\right)^T\left(\boldsymbol{y}-\boldsymbol{X}\boldsymbol{w}\right) \\
&= const - \dfrac 1{2\sigma^2}\left(\boldsymbol{y}^T\boldsymbol{y} - \boldsymbol{y}^T\boldsymbol{X}\boldsymbol{w} -\boldsymbol{w}^T\boldsymbol{X}^T\boldsymbol{y}+\boldsymbol{w}^T\boldsymbol{X}^T\boldsymbol{X}\boldsymbol{w}\right) \\
&= const - \dfrac 1{2\sigma^2}\left(\boldsymbol{y}^T\boldsymbol{y} - 2\boldsymbol{w}^T\boldsymbol{X}^T\boldsymbol{y} +\boldsymbol{w}^T\boldsymbol{X}^T\boldsymbol{X}\boldsymbol{w}\right) \\
\dfrac \partial{\partial\boldsymbol{w}}\ln L(\boldsymbol{w}) &= -\dfrac 1{2\sigma^2}\left(-2\boldsymbol{X}^T\boldsymbol{y}+\left(\boldsymbol{X}^T\boldsymbol{X}\right)^T\boldsymbol{w}+\boldsymbol{X}^T\boldsymbol{X}\boldsymbol{w}\right) \\
&= \dfrac 1{\sigma^2}\left(\boldsymbol{X}^T\boldsymbol{y} - \boldsymbol{X}^T\boldsymbol{X}\boldsymbol{w}\right) \overset{!}{=} 0 \\
\boldsymbol{X}^T\boldsymbol{X}\boldsymbol{w} &= \boldsymbol{X}^T\boldsymbol{y} \\
\boldsymbol{w}^\ast &= \left(\boldsymbol{X}^T\boldsymbol{X}\right)^{-1}\boldsymbol{X}^T\boldsymbol{y}
\end{aligned}$$
$\color{blue}{-}$ and the sum appear because of the $\ln$ rule ($\ln$ of a product = sum of the $\ln$s)
$\boldsymbol{X}$ stacks all $x$ values (design matrix)
$(\boldsymbol{A}\boldsymbol{B})^T = \boldsymbol{B}^T\boldsymbol{A}^T$
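As a quick numerical sanity check, a sketch with made-up data (the variable names are my own, not from the lecture): the solution of the normal equations also maximizes the Gaussian log-likelihood.
```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, 2))])   # design matrix with a bias column
w_true = np.array([1.0, 2.0, -0.5])
y = X @ w_true + 0.1 * rng.normal(size=n)                    # noisy linear observations

# w* = (X^T X)^{-1} X^T y, computed via the normal equations
w_star = np.linalg.solve(X.T @ X, X.T @ y)

def log_likelihood(w, sigma=0.1):
    r = y - X @ w
    return -n * np.log(np.sqrt(2 * np.pi) * sigma) - (r @ r) / (2 * sigma**2)

# random perturbations of w* never increase the log-likelihood
assert all(log_likelihood(w_star + 0.01 * rng.normal(size=3)) <= log_likelihood(w_star)
           for _ in range(1000))
print(w_star)
```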
</details>
- Bayes Risk: An Example (see also <a href="#emu1">Emu1</a>)<span id="marcus1"></span> </br> Consider a classification problem with two classes defined by the following likelihood functions. </br> When $p(\omega_1)=p(\omega_2)=0.5$, $c_{11}=c_{22}=0$, $c_{12}=1$ and $c_{21}=3^{\frac12}$, determine a decision rule that minimizes the Bayes risk. #VO3 #s82
<details>
<summary style="font-weight:bold">Solution</summary>
$$\begin{align}
\frac{p(x|\omega_1)}{p(x|\omega_2)} = \frac{(c_{12}-c_{22})p(\omega_2)}{(c_{21}-c_{11})p(\omega_1)} \\
\frac{1}{\sqrt{3}}\frac{\exp\left(-\frac 12\frac{x^2}3\right)}{\exp\left(-\frac 12(x-2)^2\right)} = \frac{1}{\sqrt{3}} \\
\frac {x^2}3 = (x-2)^2 \\
x^2=3x^2-12x+12 \\
0 = x^2-6x+6 \\
x = 3\pm\sqrt{3} \approx \{1.27,\ 4.73\}
\end{align}$$
<a href="http://tpcg.io/_0IQQ0Q"></span>
<img src="https://hackmd.io/_uploads/BJUfcgqvh.png" width=500px></a>
</details>
- How to choose the "worst" cluster? #VO4 #s130
<details>
<summary style="font-weight:bold">Solution</summary>
We choose the cluster with the largest size (spread/distortion) as the worst cluster.
</details>
- How to split clusters? #VO4 #s130
<details>
<summary style="font-weight:bold">Solution</summary>
With the non-uniform binary split algorithm:
<img src="https://hackmd.io/_uploads/H1GajZlOh.png" width=600px>
</details>
- What is the goal of LDA? #VO5 #s157
<details>
<summary style="font-weight:bold">Solution</summary>
Perform dimension reduction while preserving as much of the class discriminatory information as possible
LBG algorithm
- Combination of k-means and Binary split
- Proposed by Linde, Buzo, and Gray (1980)
- Instead of random initial codebook, let’s use codebook from binary split
- Faster convergence and better quality codebook than k-means algorithm
<img src="https://hackmd.io/_uploads/SJRBhWl_3.png" width=600px>
</details>
- PCA - simple example <span id="marcus2"></span></br> Compute the principal components for the following two-dimensional dataset $$\boldsymbol X = (x_1,x_2)=(1,2),(3,3), (3,5), (5,4), (5,6), (6,5), (8,7), (9,8)$$ #IPL5 #s182 (Answer: see <a href="https://hackmd.io/RpAEKfWtRXWbRfVgcgd2CA?view#Lec-5-Dimensionality-Reduction-Quiz-Questions">Lec 5 Question 3</a>)
- Example 1: Gaussian with unknown mean #VO6 #s193
<details>
<summary style="font-weight:bold">Solution</summary>
<img src="https://hackmd.io/_uploads/BkYK3Wx_h.png" width=600px>
</details>
- Example 2: Univariate Gaussian with unknown $\mu, \sigma$ #VO6 #s194
<details>
<summary style="font-weight:bold">Solution</summary>
<img src="https://hackmd.io/_uploads/Bkw23Wgdn.png" width=600px>
</details>
- Example 3: Multivariate Gaussian with unknown $\mu, \Sigma$ #VO6 #s195
<details>
<summary style="font-weight:bold">Solution</summary>
<img src="https://hackmd.io/_uploads/Sk6pnZeO2.png" width=600px>
</details>
- What are limitations and challenges with the EM algorithm? #IPL7 #s233
<details>
<summary style="font-weight:bold">Solution</summary>
- Initialization
- Local optima
</details>
- Where can the EM algorithm be used? #IPL7 #s233
<details>
<summary style="font-weight:bold">Solution</summary>
- Missing data imputation
- Clustering
- Density estimation
- Network models
</details>
- Toy Example of EM
<details>
<summary style="font-weight:bold">Solution</summary>
<img src="https://hackmd.io/_uploads/ryTE6bxun.png" width=600px>
<img src="https://hackmd.io/_uploads/SyBUabl_n.png" width=600px>
</details>
- Is Maximum Likelihood Estimate (MLE) unbiased estimate? #IPL7 #s234
<details>
<summary style="font-weight:bold">Solution</summary>
- In general, MLE is an unbiased estimator if the estimator is consistent and has finite variance.
- MLE can be biased due to the presence of high-leverage points, outliers, or other data anomalies. MLE can be biased for small sample sizes or when the sample size is comparable to the number of parameters being estimated.
- The ML estimate of the Gaussian mean is unbiased.
- The ML estimate of the Gaussian variance is biased (it divides by $n$ instead of $n-1$).
- An estimator of a parameter is unbiased if the expected value of the estimate is the same as the true value of the parameter.
</details>
- Parzen Estimator Simple Example </br> Given the dataset below, use Parzen windows to estimate the density $p(x)$ at $x = 3,10,15$. Use a bandwidth of $h=4$.
$$\boldsymbol{X}=\{x_1,x_2,...x_n\}=\{4,5,5,6,12,14,15,15,16,17\}$$
- Estimate $p(x=3), p(x=10), p(x=15)$ #VO8 #s246
<details>
<summary style="font-weight:bold">Solution</summary>
<img src="https://hackmd.io/_uploads/HJ9gOr2vh.png" width=600px>
The formula for the Parzen window estimate is:
$\hat{p}(x) = \frac{1}{nh}\sum\limits_{i=1}^n k\left(\frac{x-x^{(i)}}{h}\right)$
where $n$ is the number of data points, $x^{(i)}$ are the data points, and $k$ is the rectangular kernel:
$k(u) = \begin{cases} 1 & \text{if } |u| \leq 0.5 \\ 0 & \text{otherwise} \end{cases}$
Plugging in the values of $n=10$, $h=4$, we get:
$\hat{p}(x) = \frac{1}{40}\sum\limits_{i=1}^{10} k\left(\frac{x-x^{(i)}}{4}\right)$
Now we can evaluate this expression at x = 3, 10, 15:
<iframe src="https://mathembed.online/embed/CGGe0RC5qxinXnthfX3J" height="550px" width="1300px" title="Solution of Parzen Example"></iframe>
These are the Parzen window estimates of the density at $x = 3, 10, 15$ using a rectangular window function and a bandwidth of $h = 4$.
<span style="color:red;font-weight:bold">Tipp:</span>
You can also solve it simply by head by sliding the rectangle with the width $h=4$ over the entire data and always counting how often at least one data point is within the parzen window at the same time as the point you are looking for (in our case $x=3,10,15$). This then corresponds exactly to the sum $\sum\limits_{i=1}^{N}k\Big(\frac{x-x^{(i)}}{h}\Big)$
<img src="https://hackmd.io/_uploads/Sy3EQaCP3.png">
</details>
# EMU
- All Slides + In-presence Lecture in one PDF < 120 pages, < 10 MB: <a href="https://www.dropbox.com/s/26r0r2ysqjze7z0/slides_ml_plus_inl_OCR_4_1_co.pdf?dl=1">Download</a>. Good for <a href="https://www.chatpdf.com/">Chatpdf</a>.
- All Slides + In-presence Lecture in one PDF + table of content: <a href="https://www.dropbox.com/s/ol62ruhsxi6tfnc/slides_ml_plus_inl_OCR_inhaltsverz.pdf?dl=1">Download</a>.
---
## Lec 1 (Linear Regression, Gradient Descent): ==Quiz Questions==
1. Is a spam filter an example for machine learning?
<details>
<summary style="font-weight:bold">Solution</summary>
True: a spam filter is an example of machine learning.
</details>
2. Is download the whole wikipedia an example for machine learning?
<details>
<summary style="font-weight:bold">Solution</summary>
False: merely downloading Wikipedia is not an example of machine learning.
</details>
3. Difference between regression and classification
<details>
<summary style="font-weight:bold">Solution</summary>
- Her Answer:
- Regression: approximation of function
- Classification: to detect classes
- Regression:
- Goal: to make quantitative (real valued) predictions on the basis of a vector of features or attributes
- Examples: house prices, stock values, survival time, fuel efficiency of cars, etc.
- Classification:
- Goal: to partition feature space into class-labeled decision regions
- A classifier can be represented as a set of discriminant functions
</details>
4. What is the overfitting, what is underfitting?
<details>
<summary style="font-weight:bold">Solution</summary>
<img src="https://hackmd.io/_uploads/Bk2uyzeO3.png" width=800px>
</details>
---
## Lec 2 (Regression):
<span style="color:red">Anmk.: Es waren nicht wirklich Quiz Questions, sondern wir haben das nur kurz in der Vorlesung besprochen.</span>
1. Prove that the parameters that maximize the likelihood are the same as those that minimize the least squares error. Hint: use the log-likelihood instead of the likelihood ==#in_lecture#==
$$L(\mathbf w) = \prod\limits_{i=1}^n \frac{1}{\sqrt{2\pi}\sigma}\exp\Big(-\frac{(y^{(i)} - \mathbf x^{(i)}\mathbf w)^2}{2\sigma^2}\Big)$$
- Solution: see <a href="#marcus1">Marcus 1</a>
<details>
<summary style="font-weight:bold">Solution</summary>
<img src = "https://hackmd.io/_uploads/HkfaJnFDh.png">
Here one can see that $l(\mathbf w)$ depends on $-E(\mathbf w)$. From optimization we know that a maximization problem can be turned into a minimization problem by negating the cost function. Therefore, maximizing $l(\mathbf w)$ corresponds to minimizing $E(\mathbf w)$: the maximizer of $l(\mathbf w)$ is the same $\mathbf w^*$ as the minimizer of $E(\mathbf w)$. Hence, maximizing the likelihood function is equivalent to minimizing the quadratic error.
</details>
2. What can we do in case of overfitting or underfitting? ==#discussed_in_lecture==
<details>
<summary style="font-weight:bold">Solution</summary>
In case of overfitting:
- Get more training data
- Decrease the complexity of the model
- Use regularization methods<br><br>
In case of underfitting:
- Increase the order / complexity of the model
- Adjust the input features ⇒ the features don't capture the concept!
- Decrease the regularization; the regularization is too strong
</details>
3. What are regression problems? ==#discussed_in_lecture==
<details>
<summary style="font-weight:bold">Solution</summary>
Goal: make predictions based on a vector of features or attributes. These kinds of problems can be linear or non-linear. It is an optimization problem where a loss function has to be minimized with respect to the model parameters $\mathbf{w} = [w_0,w_1, ..., w_m]^T$.
- Goal: make quantitative (real valued) predictions on the basis of a vector of features or attributes
- Examples: house prices, stock values, survival time, fuel efficiency of cars, …
- Questions:
- what can we assume
- how to formalise
- how to evaluate predictions
</details>
4. How does gradient descent work? ==#in_lecture==
<details>
<summary style="font-weight:bold">Solution</summary>
Matrix notation of the problem
$$
\mathbf{w} := \mathbf{w} - \alpha\dfrac{\partial}{\partial \mathbf{w}}E(\mathbf{w})
$$
With the partial derivative calculation $\frac{\partial}{\partial \mathbf{w}}E(\mathbf{w}) = 2\mathbf{X}^T\left(\mathbf{X}\mathbf{w} - \mathbf{y}\right)$ the gradient descent simplifies to
$$
\mathbf{w} := \mathbf{w} - 2\alpha\mathbf{X}^T\left(\mathbf{X}\mathbf{w} - \mathbf{y}\right)
$$
<img src="https://hackmd.io/_uploads/rJbpKTcD3.png">
</details>
5. What is linear regression? How can the weight be calculated by using least squares? ==#in_lecture==
<details>
<summary style="font-weight:bold">Solution</summary>
- Linear regression assumes that the expected value of the output given an input is linear. This means that we have a linear relation between input $x$ and output $y$, where $\epsilon$ is the noise.
The equation for this is: $y^{(i)}=wx^{(i)} + \epsilon$
- The superscript $i$ means the $i$-th input/output
- The training set size is $n$
- There might be some noise $\epsilon$ in the data, which represents the deviation of the response from its mean
- Properties
- Linear regression is extremely sensitive to noise in the training data measurements, so you cannot usually trust the resulting model
- There are some assumptions that need to be made about the noise term in order to specify the model mathematically. A common assumption is that the noise follows a Gaussian distribution with zero mean and variance
- Linear regression also assumes equal variance of $y$ ($\sigma$ is the same for all values of $x$)
- The sample data used for regression are the observed values of $y$ and $x$. The response $y$ to a given $x$ is a random variable, and the regression model describes the mean and standard deviation of this random variable
- Formulas
The relation between output $\mathbf{y}^{(i)}$ and input $\mathbf{x}^{(i)}$ is linear. There might be some noise $\boldsymbol{\epsilon}$ .
$$
\mathbf{y}^{(i)} = w\mathbf{x}^{(i)} + \boldsymbol{\epsilon}
$$
The common approach for the loss function is the least squares
$$
\begin{aligned}
E &= \dfrac 1 N||\mathbf{y}-w\mathbf{x}||^2 = \dfrac 1 N(\mathbf{y}-w\mathbf{x})^T(\mathbf{y}-w\mathbf{x}) \\
&= \dfrac 1 N \left(\mathbf{y}^T\mathbf{y} - w\mathbf{x}^T\mathbf{y} -w\mathbf{y}^T\mathbf{x}+ w^2\mathbf{x}^T\mathbf{x} \right)\\
&= \dfrac 1 N\left(\mathbf{y}^T\mathbf{y} - 2w\mathbf{x}^T\mathbf{y} + w^2\mathbf{x}^T\mathbf{x}\right)
\end{aligned}
$$
To find the optimum, one needs to differentiate this function with respect to the parameter $w$ and set the derivative to 0.
$$
\begin{aligned}
\dfrac{\partial E}{\partial w} &= \dfrac 1N\left(-2\mathbf{x}^T\mathbf{y} + 2w\mathbf{x}^T\mathbf{x}\right) \overset{!}{=} 0 \\
&-\mathbf{x}^T\mathbf{y} + w\mathbf{x}^T\mathbf{x} = 0 \\
& w = \dfrac{\mathbf{x}^T\mathbf{y}}{\mathbf{x}^T\mathbf{x}} = \dfrac{\sum\limits_{i=1}^n \mathbf{x}^{(i)}\mathbf{y}^{(i)}}{\sum\limits_{i=1}^n \left(\mathbf{x}^{(i)}\right)^2}
\end{aligned}
$$
The general model is
$$
\mathbf{y} = \mathbf{X}\mathbf{w}
$$
with $\mathbf{X} = [1,\mathbf{x}^{(1)}, ... ,\mathbf{x}^{(m)}]$ . The quadratic loss function for this model type is
$$
\begin{aligned}
E &= \dfrac 1 N ||\mathbf{y}-\mathbf{X}\mathbf{w}||^2 = \dfrac 1 N \left(\mathbf{y}-\mathbf{X}\mathbf{w}\right)^T\left(\mathbf{y}-\mathbf{X}\mathbf{w}\right) \\
&= \dfrac 1 N \left(\mathbf{y}^T\mathbf{y} - 2\mathbf{w}^T\mathbf{X}^T\mathbf{y}+\mathbf{w}^T\mathbf{X}^T\mathbf{X}\mathbf{w}\right)
\end{aligned}
$$
Analogous to the previous calculation, the optimal $\mathbf{w}$ is
$$
\begin{aligned}
\dfrac{\partial E}{\partial \mathbf{w}} &= \dfrac 1N\left(-2\mathbf{X}^T\mathbf{y} + \left(\mathbf{X}^T\mathbf{X}\right)^T\mathbf{w} + \mathbf{X}^T\mathbf{X}\mathbf{w}\right) \\
&= \dfrac 1N\left(-2\mathbf{X}^T\mathbf{y} + 2\mathbf{X}^T\mathbf{X}\mathbf{w}\right) \overset{!}{=} 0 \\
\mathbf{w}^\ast &= \left(\mathbf{X}^T\mathbf{X}\right)^{-1}\mathbf{X}^T\mathbf{y}
\end{aligned}
$$
This type of minimization is one possible approach to solving the problem. Another way is gradient descent.
<img src="https://hackmd.io/_uploads/Sy7SATqw3.png">
</details>
---
## Lec 3 (Classification):
<span style="color:red">Anmk.: Es waren nicht wirklich Quiz Questions, ich hab mir hier nichts notiert, daher stellen die Fragen hier eher eine kurze Zusammenfassung der Folien dar</span>
1. What is a feature?
- A vector with information about a problem
- Any distinctive aspect, quality, or characteristic
2. What is a classifier? What is a linear classifier?
- The task of a classifier is to partition feature space into class-labeled decision regions
- A classifier can be represented as a set of discriminant functions
- A linear classifier assigns a feature $\mathbf x$ to a class $\omega$ using linear discriminant functions
- i.e., the decision boundaries of a linear classifier are hyperplanes (linear functions of $\mathbf x$) in feature space
3. How does the Bayesian decision theory work?
<details>
<summary style="font-weight:bold">Solution</summary>
Bayesian Decision Theory
- Bayesian Decision Theory is a fundamental statistical approach to the problem of pattern classification.
- It quantifies the tradeoffs between various classifications using probabilities and the costs that accompany such decisions/classifications.
- Assumptions
- Decision problem is posed in probabilistic terms.
- All relevant probability values are known.
<img src="https://hackmd.io/_uploads/S1-z133Dh.png" width = 500px>
Important: you have to know all the probabilities
</details>
4. What is the decision boundary?
<details open>
<summary style="font-weight:bold">Solution</summary>
The decision boundary is the border between decision regions.
<img src="https://hackmd.io/_uploads/SkyjGMgO2.png" width =400px>
<img src="https://hackmd.io/_uploads/H1cxMzl_n.png">
<img src="https://hackmd.io/_uploads/HyTGfzedn.png">
</details>
5. What is the likelihood ratio test (LRT)?
<details>
<summary style="font-weight:bold">Solution</summary>
<img src="https://hackmd.io/_uploads/ry7pJ_yd2.png" width = 500px>
</details>
6. What is Bayes risk?
<details>
<summary style="font-weight:bold">Solution</summary>
<img src="https://hackmd.io/_uploads/rJrMXzluh.png" width = 500px>
</details>
7. Bayes Risk: An Example: Consider a classification problem with two classes defined by the following likelihood functions. <span id="emu1"></span> (see <a href="#marcus1">Marcus 1</a>)
<a href="http://tpcg.io/_K7T3WY"><img src="https://hackmd.io/_uploads/ByD3Cx5P2.png" width=300px></a>
When the following is given:
$p(\omega_1) = p(\omega_2) = 0.5,\\ c_{11} = c_{22}=0, c_{12} = 1$ and
$c_{21}=\sqrt3$.
Determine a decision rule that minimizes the Bayes risk
<details>
<summary style="font-weight:bold">Solution</summary>
- see Marcus solution<span id="emu_bayes"></span>:
From the figure one reads off the following likelihoods (we assume Gaussians with $\mu_{1} = 0$, $\sigma_1^2 = 3$ and $\mu_{2} = 2$, $\sigma_2^2 = 1$, consistent with the likelihood ratio used below):
$$\begin{aligned}
p(x|\omega_1) &= \frac{1}{\sqrt{2\pi\cdot 3}}\exp\left(-\frac{x^2}{6}\right) \\
p(x|\omega_2) &= \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{(x-2)^2}{2}\right)
\end{aligned}$$
After that we use the Bayes risk (likelihood ratio) equation
$$\frac{p(x|\omega_1)}{p(x|\omega_2)} = \frac{(c_{12}-c_{22})p(\omega_2)}{(c_{21}-c_{11})p(\omega_1)}$$
and by inserting the likelihoods from above we get
$$\begin{aligned}
\frac{1}{\sqrt{3}}\frac{\exp\left(-\frac 12\frac{x^2}3\right)}{\exp\left(-\frac 12(x-2)^2\right)} &= \frac{1}{\sqrt{3}} \\
\frac {x^2}3 &= (x-2)^2 \\
x^2&=3x^2-12x+12 \\
0 &= x^2-6x+6
\end{aligned}$$
By solving this quadratic equation we find that if
$$x \in(1.27, 4.73)$$
then $x$ belongs to $\omega_{2}$.
Therefore, we can see that from the equation
$$\frac{p(x|\omega_1)}{p(x|\omega_2)}\ \underset{\omega_2}{\overset{\omega_1}{\gtrless}}\ \frac{(c_{12}-c_{22})p(\omega_2)}{(c_{21}-c_{11})p(\omega_1)}$$
that if the inequality
$$\frac{p(x|\omega_1)}{p(x|\omega_2)} > \frac{(c_{12}-c_{22})p(\omega_2)}{(c_{21}-c_{11})p(\omega_1)}$$
holds, then $x$ belongs to $\omega_{1}$, and if the inequality
$$\frac{p(x|\omega_1)}{p(x|\omega_2)} < \frac{(c_{12}-c_{22})p(\omega_2)}{(c_{21}-c_{11})p(\omega_1)}$$
holds, then $x$ belongs to $\omega_{2}$.
<a href="http://tpcg.io/_0IQQ0Q"><img src="https://hackmd.io/_uploads/BJUfcgqvh.png" width=500px></a>
</details>
---
## Lec 4 (Clustering):
<span style="color:red">Anmk.: Es waren nicht wirklich Quiz Questions, ich hab mir hier nichts notiert, daher stellen die Fragen hier eher eine kurze Zusammenfassung der Folien dar</span>
1. Supervised vs Unsupervised Learning
2. Which approaches of unsupervised learning exists?
3. What are similarity measures? Which methods are there?
4. ... see [Notion 2](https://www.notion.so/Ausarbeitung-von-Machine-Learning-SS2023-8e723ca8bec24ed4b133819c0a4d529d?pvs=4#202f0a971831478caceff14965e51a07)
---
## Lec 5 (Dimensionality Reduction): ==Quiz Questions==
1. How many Dimensions are necessary? ==#in_lecture==
<details>
<summary style="font-weight:bold">Solution</summary>
→ Eigenvalue elbow over the number of dimensions.
(Gavish and Donoho 2014 paper)
- From Slides:
<img src = "https://hackmd.io/_uploads/HyZ1SnFw2.png" width=500px>
- From in_lecture l5:
<img src = "https://hackmd.io/_uploads/By-77Rqvh.png">
<img src = "https://hackmd.io/_uploads/r1UJNA5Dh.png">
</details>
2. How does the maximum variance formulation work? ==#in_lecture==
<details>
<summary style="font-weight:bold">Solution</summary>
<img src = "https://hackmd.io/_uploads/rJtc3giP2.png">
</details>
3. PCA - simple example (see <a href="#marcus2">Marcus 2</a>) <span id="emu2"></span>
Compute the principal components for the following two-dimensional dataset $$\boldsymbol X = (x_1,x_2)=(1,2),(3,3), (3,5), (5,4), (5,6), (6,5), (8,7), (9,8)$$
<span style="color:red">We should do this by hand</span>
<details>
<summary style="font-weight:bold">Solution</summary>
If $\mathbf X$ is the data matrix and consists of a true matrix plus a noise matrix (the noise is meant to be zero mean with unit variance, $\mathcal N(0,1)$):
$\mathbf X=\mathbf X_{true} + \gamma \mathbf X_{noise}$
where $\mathbf X_{true}$ is a low-dimensional (low-rank) matrix. One then compares the eigenvalues of $\mathbf X$ with the largest eigenvalue of $\mathbf X_{noise}$ and keeps only the components that lie above this noise level.
- Solution by hand:
<img src = "https://hackmd.io/_uploads/H1lLxljv3.png">
- Slide:
<img src = "https://hackmd.io/_uploads/BJR27xiP3.png">
- Numpy:
<img src = "https://hackmd.io/_uploads/r15GHnFvh.png" width=400px>
<a href="http://tpcg.io/_R51V34"><img src = "https://hackmd.io/_uploads/r1TaDxowh.png" width=400px></a>
</details>
---
## Lec 6 & 7 (Density Estimation): ==Quiz Questions== and ==Exam Questions==
1. What are limitations and challenges with the EM algorithm?
<details>
<summary style="font-weight:bold">Solution</summary>
The EM algorithm has several limitations and challenges, including:
- Convergence to local optima: The EM algorithm is not guaranteed to converge to the global maximum likelihood estimate, and can get stuck in local optima.
- Sensitivity to initialization: The algorithm's performance can be sensitive to the initial values of the parameters, and different initializations can lead to different results.
- Computationally expensive: The EM algorithm can be computationally expensive, especially for large datasets or complex models [Good Page 1](https://www.javatpoint.com/em-algorithm-in-machine-learning).
- Requires complete data: The algorithm assumes that the data is complete, meaning that there are no missing values. If there are missing values, the algorithm may not work well. [Good Page 2](https://www.geeksforgeeks.org/ml-expectation-maximization-algorithm/).
</details>
2. Where can the EM algorithm be used?
<details>
<summary style="font-weight:bold">Solution</summary>
Slide 233:
- Missing data imputation: The algorithm can be used to fill in missing data in a sample.
- Clustering: The algorithm can be used to cluster data into groups based on their similarity.
- Density estimation: The algorithm can be used to estimate the probability density function of a dataset.
- Network models: The algorithm can be used to model complex networks, such as social networks and biological networks.<br><br>
optional:
- Unsupervised learning of clusters: The algorithm can be used as the basis of unsupervised learning of clusters.
- Image processing: The algorithm can be used for image processing tasks such as image segmentation and object recognition.
- Natural language processing: The algorithm can be used for natural language processing tasks such as part-of-speech tagging and named entity recognition.
</details>
3. Is Maximum Likelihood Estimate (MLE) unbiased estimate?
<details>
<summary style="font-weight:bold">Solution</summary>
<img src="https://hackmd.io/_uploads/SJmrlrjD2.png" width="500px">
The maximum likelihood estimate (MLE) is not necessarily an unbiased estimate. In fact, MLE can be biased in some cases. However, under certain conditions, MLE can be shown to be asymptotically unbiased, meaning that as the sample size increases, the bias approaches zero.<br><br>
Her Answer:
- In general MLE is an unbiased estimator if the estimator is consistent and has finite variance
- MLE can be biased, due to the presence of high leverage points, outliers, or other data anomalies. MLE can be biased in small sample sizes or when sample size is comparable to the number of parameters being estimated.
<img src="https://hackmd.io/_uploads/SJcceBsD2.png" width=500px>
</details>
4. Density estimation 1: ==#exam question==
<img src="https://hackmd.io/_uploads/HyEbOhYDn.png" width="400px">
Given the following set of data points in a two-dimensional case, we want to build a probability distribution that represents them. For this aim, we have chosen a mixture model with two Gaussians,
$$\sum\limits_{k=1}^2\pi_k \mathcal N (x|\mu_k , \sigma_k^2 I)$$
and to optimize its parameters we will use the EM algorithm. The initial random means for the Gaussians are given by $\mu_1$ and $\mu_2$ and the covariance matrices are initially equal, but constrained to the form $\sigma_k^2 I$.
a) Explain the EM algorithm steps and draw on the figure approximately the component parameters (the mean and the shape of the covariance) after one iteration. ( Points)
<details>
<summary style="font-weight:bold">Solution of 4a)</summary>
<ul>
- From Book: Richard O. Duda Chap 3.8:
<img src = "https://hackmd.io/_uploads/r19ygWiD3.png" width=400px>
- Explanation: (<a href="https://www.javatpoint.com/em-algorithm-in-machine-learning">source</a>)
1. Step: The very first step is to initialize the parameter values. Further, the system is provided with incomplete observed data with the assumption that data is obtained from a specific model.
2. Step: This step is known as Expectation or E-Step, which is used to estimate or guess the values of the missing or incomplete data using the observed data. Further, E-step primarily updates the variables.
3. Step: This step is known as Maximization or M-step, where we use complete data obtained from the 2nd step to update the parameter values. Further, M-step primarily updates the hypothesis.
4. Step: The last step is to check if the values of latent variables are converging or not. If it gets "yes", then stop the process; else, repeat the process from step 2 until the convergence occurs.
<img src="https://hackmd.io/_uploads/S16L5WCDn.png" width = 500px>
<img src="https://hackmd.io/_uploads/SJgH0Eivn.png" width = 500px>
- Slide:
<img src = "https://hackmd.io/_uploads/ryEGxbswh.png" width=500px>
- Input Data:
<a href="http://tpcg.io/_4AE6TH"><img src="https://hackmd.io/_uploads/Syjt8zjvh.png"></a>
- Result after 1 Iteration:
<iframe src="https://trinket.io/embed/python3/6855df2fe5?toggleCode=true&start=result" width="100%" height="550" frameborder="0" marginwidth="0" marginheight="0" allowfullscreen></iframe>
- Result if converged:
<iframe src="https://trinket.io/embed/python3/3234702f98?toggleCode=true&start=result" width="100%" height="550" frameborder="0" marginwidth="0" marginheight="0" allowfullscreen></iframe>
</ul>
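A minimal NumPy sketch of one EM iteration for the two-component spherical GMM of this problem (the function name and the toy data are my own; this is not the code behind the embedded results above):
```python
import numpy as np

def em_gmm_spherical(X, mu, n_iter=1):
    """EM for a K-component GMM with spherical covariances sigma_k^2 * I."""
    n, d = X.shape
    K = len(mu)
    sigma2 = np.ones(K)                  # equal initial variances, as in the problem
    pi = np.ones(K) / K                  # equal initial mixing coefficients

    for _ in range(n_iter):
        # E-step: responsibilities gamma_ik = p(omega_k | x_i, theta)
        sq = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)          # (n, K) squared distances
        log_gauss = -0.5 * (d * np.log(2 * np.pi * sigma2) + sq / sigma2)
        w = pi * np.exp(log_gauss)
        gamma = w / w.sum(axis=1, keepdims=True)

        # M-step: re-estimate pi_k, mu_k and sigma_k^2 (the update derived in 4b)
        Nk = gamma.sum(axis=0)
        pi = Nk / n
        mu = (gamma.T @ X) / Nk[:, None]
        sq = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)
        sigma2 = (gamma * sq).sum(axis=0) / (d * Nk)
    return pi, mu, sigma2, gamma

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 1, (30, 2)), rng.normal([4, 4], 1, (30, 2))])
pi, mu, sigma2, gamma = em_gmm_spherical(X, mu=np.array([[0.0, 1.0], [3.0, 3.0]]), n_iter=1)
print(pi, mu, sigma2)
```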
</details>
b) Derive the maximum likelihood re-estimation of the component parameters $\sigma_k$ given a set of data points.
$$\mathcal N (x|\mu, \Sigma) = \frac{1}{(2\pi)^{D/2} |\Sigma|^{1/2}} e^{-\frac12(x - \mu)^T \Sigma^{-1} (x - \mu)}$$
<details open>
<summary style="font-weight:bold">Solution of 4b)</summary>
To derive the maximum likelihood re-estimation of the component parameters $\sigma_k$, we start with the log-likelihood function of the GMM:
$l(\theta) = \ln p(\mathbf X|\boldsymbol\pi,\boldsymbol\mu,\boldsymbol\sigma) = \sum\limits_{i=1}^n \ln \sum\limits_{k=1}^K \pi_k \mathcal{N}(x^{(i)}|\mu_k, \sigma_k^2I)$
where $\theta$ represents the parameters of the GMM, including the mixing coefficients $\pi_k$, means $\mu_k$, and covariance matrices $\sigma_k^2I$.
<img src="https://hackmd.io/_uploads/SyLayYWO3.png" width=500px>
- Responsibility term:
<img src="https://hackmd.io/_uploads/BkO5lFWO2.png" width=150px>
- Normalization Term:
<img src="https://hackmd.io/_uploads/SJ1sxK-_h.png" width=150px>
- Bayes Rule:
<img src="https://hackmd.io/_uploads/rydoxKWu2.png" width=600px>
To find the maximum likelihood estimate of the component parameters $\sigma_k$, we differentiate the log-likelihood function with respect to $\sigma_k$ and set it to zero:
$\frac{\partial l(\theta)}{\partial \sigma_k} = 0$
Taking the derivative with respect to $\sigma_k$, we get:
<img src="https://hackmd.io/_uploads/B1J9ou-dh.png">
⇒ The key thing to remember here is that in the outer derivative the whole sum is the argument of the logarithm, i.e.
$\ln(\sum\limits_{k=1}^{K}\pi_k\mathcal N(x^{(i)}|\mu_k,\sigma_k^2 I)) = \ln(z)$
and therefore
$\frac{\mathrm d}{\mathrm dz}\ln(z) = \frac{1}{z} = \frac{1}{\sum\limits_{k=1}^{K}\pi_k\mathcal N(x^{(i)}|\mu_k,\sigma_k^2 I)} = \frac{1}{p(x^{(i)}|\theta)}$
remains.
With this we obtain:
$\begin{aligned}\frac{\partial l(\theta)}{\partial \sigma_k} &= \underbrace{\frac{\partial}{\partial z}\ln\left(z\right)\Big{|}_{z=\sum\limits_{k=1}^{K}\pi_k\mathcal N(x^{(i)}|\mu_k,\sigma_k^2 I)}}_{\text{outer derivative}}\cdot \underbrace{\frac{\partial}{\partial \sigma_k} \sum\limits_{k=1}^{K}\pi_k\mathcal N(x^{(i)}|\mu_k,\sigma_k^2 I)}_{\text{inner derivative}}\\&= \sum\limits_{i=1}^n \frac{\pi_k \mathcal{N}(x^{(i)}|\mu_k, \sigma_k^2I)}{p(x^{(i)}|\theta)}\left(\frac{1}{\sigma_k}\right) \left[\left(\frac{1}{\sigma_k^2}\right) (x^{(i)} - \mu_k)^2 -1\right]\\ &=\sum\limits_{i=1}^n p(\omega_k|x^{(i)},\theta) \left(\frac{1}{\sigma_k}\right) \left[\left(\frac{1}{\sigma_k^2}\right) (x^{(i)} - \mu_k)^2 -1\right] \stackrel{!}{=} 0\end{aligned}$
Because we differentiate only with respect to the $k$-th element $\sigma_k$, the sum over $k$ disappears; only the $k$-th component remains. We can multiply the last equation by $\sigma_k$ to get the following equation:
$\sum\limits_{i=1}^n p(\omega_k|x^{(i)},\theta) (x^{(i)} - \mu_k)^2 = \sigma_k^2\sum\limits_{i=1}^n p(\omega_k|x^{(i)},\theta)$
We isolate the component parameters $\sigma_k^2$:
$\sigma_k^2 = \frac{\sum\limits_{i=1}^n p(\omega_k|x^{(i)},\theta) (x^{(i)} - \mu_k)^2}{\sum\limits_{i=1}^n p(\omega_k|x^{(i)},\theta)}$
- or see this wonderful slide:
<img src="https://hackmd.io/_uploads/S1MHSUiwh.png">
</details>
c) Explain the difference between the re-estimation of the mean of the clusters in the K-means algorithm and the EM algorithm.
<details>
<summary style="font-weight:bold">Solution of 4c)</summary>
- In Short:
- K-means: The re-estimation of the mean of the clusters in the K-means algorithm involves computing the mean of the data points assigned to each centroid, under the optimization criterion that the sum of the within-cluster variances is minimized.
- EM: The re-estimation of the mean of the clusters in the EM algorithm involves computing the weighted mean of the data points assigned to each component, where the weights are given by the posterior probabilities
- In summary:
- The K-means algorithm performs a direct calculation of the mean based on assigned data points, while the EM algorithm incorporates probabilistic modeling and updates the mean based on the expected values of the latent variables.
- K-means algorithm assigns each data point to a single cluster (hard assignment), while EM algorithm assigns each data point to multiple clusters with different probabilities (soft assignment).
<details>
<summary style="font-weight:bold">More Info</summary>
- Chatpdf:
- In the K-means algorithm, the re-estimation of the mean of the clusters involves computing the centroid of each cluster based on the data points assigned to that cluster. The centroid is the arithmetic mean of all the data points in the cluster. This process is performed in the M-step of the algorithm.
- On the other hand, in the EM (Expectation-Maximization) algorithm, the re-estimation of the mean of the clusters is also part of the M-step. In the E-step, the algorithm computes the posterior probabilities or responsibilities of each data point belonging to each cluster. These probabilities are then used in the M-step to calculate the weighted mean of the data points for each cluster, where the weights are given by the posterior probabilities.
- The main difference between the two algorithms lies in their underlying assumptions and objectives.
K-means is a hard clustering algorithm that aims to minimize the within-cluster sum of squares, seeking to assign each data point to a single cluster. In K-means, the mean re-estimation is a straightforward calculation based on the assigned data points.
On the other hand, the EM algorithm is a soft clustering algorithm that works with probabilistic models, such as Gaussian Mixture Models (GMMs). It aims to estimate the parameters of the model, including the means of the clusters, by maximizing the likelihood of the observed data. The EM algorithm iteratively updates the parameters based on the expected values of the latent variables (responsibilities) in the E-step and then maximizes the likelihood in the M-step. The mean re-estimation in the EM algorithm takes into account the uncertainty of data point assignments through the posterior probabilities.
- In summary, the K-means algorithm performs a direct calculation of the mean based on assigned data points, while the EM algorithm incorporates probabilistic modeling and updates the mean based on the expected values of the latent variables.
- Perplexity:
The K-means algorithm and the Expectation-Maximization (EM) algorithm are both clustering algorithms used to group similar data points together. However, there are some differences between the re-estimation of the mean of the clusters in the K-means algorithm and the EM algorithm:
- K-means algorithm:
- The K-means algorithm attempts to detect clusters within the dataset under the optimization criterion that the sum of the within-cluster variances is minimized.
- The algorithm starts by randomly selecting k initial centroids, where k is the number of clusters.
- Each data point is then assigned to the nearest centroid, and the mean of the data points assigned to each centroid is computed.
- The centroids are then updated to the new means, and the process is repeated until convergence.
- The re-estimation of the mean of the clusters in the K-means algorithm involves computing the mean of the data points assigned to each centroid.
- EM algorithm:
- The EM algorithm is an iterative numerical optimization technique used to maximize the likelihood function with respect to the parameters of the Gaussian mixture model (GMM).
- The algorithm starts by randomly selecting initial values for the parameters of the GMM, including the mixing coefficients, means, and covariance matrices.
- In the E-step, the posterior probability of each data point belonging to each component is computed, given the current estimate of the component parameters.
- In the M-step, the component parameters are updated to maximize the expected value of the log-likelihood function with respect to the component parameters.
- The re-estimation of the mean of the clusters in the EM algorithm involves computing the weighted mean of the data points assigned to each component, where the weights are given by the posterior probabilities.
- Differences:
- K-means algorithm is a simpler algorithm than EM algorithm.
- K-means algorithm is limited to spherical clusters, while EM algorithm can handle different covariance matrices, allowing for more flexibility in cluster shapes.
- K-means algorithm assigns each data point to a single cluster, while EM algorithm assigns each data point to multiple clusters with different probabilities.
- K-means algorithm is faster than EM algorithm, but EM algorithm is more accurate and flexible.
</details>
</details>
5. Density Estimation 2: ==#exam_question==
Given an m-dimensional dataset, we desire to estimate the density of the set at a point x. By using non-parametric density estimation, the density at x is given by
$$p(x) = \frac{K}{NV} \qquad\qquad(1)$$
a) Please explain the variables involved in (1) $(K, N, V )$ and all the assumptions under which this relationship is valid.
<details>
<summary style="font-weight:bold">Solution of 5b)</summary>
In equation (1), the density estimation at point $\mathbf x$ is given by:
$p(x) = \frac{K}{NV}$
Let's break down the variables involved and the assumptions under which this relationship is valid:
- $K\ldots$ The total number of points inside the hypercube, given by
$K=\sum\limits_{i=1}^N k(\frac{\mathbf x-\mathbf x^{(i)}}{h})$
using a kernel function $k$. $K$ can be also understood as the total number of observations that fall **in the region $\mathcal R$** containing $\mathbf x$
- $N\ldots$ The total number of data points **in the dataset** (= total number of examples or the total number of observations in the dataset).
- $V \ldots$ This represents the volume of the region $\mathcal R$ around point $\mathbf x$. $V$ is the volume of the m-dimensional hypercube in which the data points are distributed, expressed as
$V=h^m$, where $h$ is the bandwidth and $m$ is the dimensionality of the data.<br><br>
The assumptions under which this relationship is valid are as follows:
- The region $\mathcal R$ containing $\mathbf x$ is small enough that the density function $p(\mathbf x)$ is approximately constant within it.
- The region $\mathcal R$ containing $\mathbf x$ with volume $V$ is large enough that $K$ is sufficiently large for a reliable estimate.
- The region $\mathcal R$ containing $\mathbf x$ has a shape that does not depend on the location of $\mathbf x$.
- The kernel function $k$ is non-negative, continuous, bounded and integrates to 1.<br><br>
The volume V is chosen appropriately to strike a balance between including enough examples within the region and supporting the assumption that the density is constant within that region.
The non-parametric density estimation method used to derive equation (1) is called kernel density estimation (KDE). KDE is a non-parametric method that estimates the probability density function of a random variable based on a sample of data points. The basic idea behind KDE is to place a kernel function at each data point and then sum up the contributions of all the kernels to estimate the density at any given point in the space. The bandwidth parameter of the kernel function determines the width of the kernel and thus controls the smoothness of the estimated density.
</details>
b) Based on (1), explain two different approaches to estimate the density of a dataset at a point and discuss on the convergence of the approaches.
<details>
<summary style="font-weight:bold">Solution of 5b)</summary>
- Two approaches (Slide l8#9)
- Fix $V$ and determine $K$ from the data: Kernel Density Estimation (KDE)
- Fix $K$ and determine $V$ from the data: k Nearest Neighbor (kNN)
- As $n \to \infty$, both approaches become close to the true probability density (see the small sketch after the long answer below)<br><br>
- Convergence:
- KDE
- Convergence depends on the choice of kernel function and bandwidth parameter
- As bandwidth parameter increases, the estimate becomes smoother and convergence improves
- As dataset size increases, convergence improves
- kNN
- Convergence depends on the choice of k (number of neighbors)
- As k increases, the estimate becomes smoother and convergence improves
- As dataset size increases, convergence improves
<details>
<summary>Long answer</summary>
1. Kernel Density Estimation (KDE):
In KDE, the **volume $V$ is fixed**, and the value of $K$ is determined from the data. The idea behind KDE is to estimate the density at a point $\mathbf x$ by considering the contribution of neighboring data points within a certain region around $\mathbf x$. The density estimate is obtained by dividing the count of data points within the region ($K$) by the product of the total number of data points ($N$) and the fixed volume of the region ($V$). As the number of data points ($N$) increases, the density estimate becomes more accurate, and the KDE approach converges to the true probability density as $N$ approaches infinity.
2. k Nearest Neighbor (kNN) approach:
In the kNN approach, **the value of $K$ is fixed**, and the volume $V$ is determined from the data. The idea behind kNN is to estimate the density at a point $\mathbf x$ by considering the density of the k nearest neighbors to $\mathbf x$. The density estimate is obtained by dividing the fixed count of neighbors ($K$) by the product of the total number of data points (N) and the variable volume of the region ($V$) determined by the k nearest neighbors. Similar to KDE, as the number of data points ($N$) increases, the density estimate becomes more accurate, and the kNN approach converges to the true probability density as $N$ approaches infinity.
3. Histogram (optional): The region containing x is a fixed bin of width h. The density estimate is constant within each bin and proportional to the frequency of observations in that bin. The convergence rate of histogram depends on the choice of h and the smoothness of the true density.<br><br>
Both approaches, KDE and kNN, converge to the true probability density as the number of data points ($N$) tends to infinity. However, the choice between these approaches depends on the specific characteristics of the dataset and the desired trade-off between computational complexity and accuracy. KDE provides a smooth estimate of the density, while kNN provides a more localized estimate based on the k nearest neighbors. Histogram is simple and easy to implement, but it suffers from discontinuity and sensitivity to bin boundaries. Kernel is smooth and flexible, but it requires more computation and bandwidth selection.
</details>
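A small sketch contrasting the two approaches at a single query point, using the 1-D dataset from part c) below (the rectangular kernel and the choice $K=2$ are just for illustration):
```python
import numpy as np

X = np.array([3, 8, 10, 12, 15, 9], float)    # 1-D data, so m = 1
n = len(X)

def kde_density(x, h=4.0):
    # fix the volume V = h and count K with a rectangular kernel
    K = np.sum(np.abs((x - X) / h) <= 0.5)
    return K / (n * h)

def knn_density(x, K=2):
    # fix K and take V as the width of the smallest interval around x containing K points
    V = 2 * np.sort(np.abs(x - X))[K - 1]
    return K / (n * V)

print(kde_density(5.0), knn_density(5.0))      # both estimate p(x) = K / (N V) at x = 5
```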
</details>
c) A dataset $X$ is given $X = \{3, 8, 10, 12, 15, 9\}$. Estimate the density of the dataset at $x = 5$ by using Parzen windows with width $h = 4$. How will the probability density be modified if the width of the Parzen window increases?
Tip: The density estimate by Parzen windows is given by
$$p(x) = \frac{1}{Nh^m}\sum\limits_{i=1}^{N}k\Big(\frac{x-x^{(i)}}{h}\Big)$$
where
$$k(u) =\bigg\{ \begin{array}{ll} 1,\quad|u_j| \le 0.5, \forall j=1,\ldots,m\\0,\quad\text{ otherwise}
\end{array}.$$
<details>
<summary style="font-weight:bold">Solution of 5c)</summary>
To estimate the density $\hat{p}(x)$ at $x = 5$ using Parzen windows with a bandwidth of $h = 4$ and the given dataset $X=\{3,8,10,12,15,9\}$, we need to calculate $\hat{p}(5)$ as follows:
<iframe src="https://mathembed.online/embed/uSd47Iq5dbvGS3QOmca4" height="200px" width="800px" title="Solution of Parzen Exam"></iframe>
where $k(u)$ is the kernel function, which is equal to 1 if $|u| \leq 0.5$ and 0 otherwise.
Simplifying the above equation, we get:
$$\hat{p}(5) = \frac{1}{24}\left(\color{green}1 + \color{red}0 + \color{red}0 + \color{red}0 + \color{red}0 + \color{red}0\right) = \frac{1}{24}$$
Therefore, the estimated density at $x=5$ is $\hat{p}(5) = \frac{1}{24}$.
<span style="color:red;font-weight:bold">Tipp:</span>
You can also solve it simply by head by sliding the rectangle with the width $h=4$ over the entire data and always counting how often at least one data point is within the parzen window at the same time as the point you are looking for (in our case $x=5$). This then corresponds exactly to the sum $\sum\limits_{i=1}^{N}k\Big(\frac{x-x^{(i)}}{h}\Big)$
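The same counting argument as a short NumPy check (a sketch):
```python
import numpy as np

X = np.array([3, 8, 10, 12, 15, 9])
h = 4
K = np.sum(np.abs((5 - X) / h) <= 0.5)   # only x_i = 3 lies inside the window around x = 5
print(K / (len(X) * h))                  # 1/24
```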
</details>
6. Problem PCA: ==#exam_question==
Use proof by induction to show that the linear projection onto an M-dimensional subspace that maximizes the variance of the projected data is defined by the $M$ eigenvectors of the data covariance matrix $S$ corresponding to the $M$ largest eigenvalues. The covariance matrix is given by
$S = \frac{1}{N}\sum\limits_{n=1}^N(x_n-\tilde x)(x_n-\tilde x)^T$
where $\tilde x$ is the sample set mean
<details>
<summary style="font-weight:bold">Solution</summary>
{%hackmd HyVA9TaPh %}
</details>
<details>
<summary style="font-weight:bold">Remember: Proof of induction</summary>
The proof by induction is a method used to prove a statement for all natural numbers (usually starting from a specific value). It involves two main steps: the base case and the induction step.
1. Base case: The base case is the initial step where we prove that the statement holds for a specific value of the variable. In this case, the base case is often shown for M = 1, which serves as the starting point for the induction.
2. Induction step: After establishing the base case, we assume that the statement holds for a particular value of the variable, often denoted as k. This is called the induction hypothesis. Then, we aim to prove that the statement also holds for the next value, which is k + 1.
In the induction step, we assume that the statement holds for M = k, meaning that we assume the given statement is true for a k-dimensional subspace. Using this assumption, we then prove that the statement also holds for M = k + 1, thus extending the validity of the statement to the next value.
By showing that the statement holds for the base case (M = 1) and proving the induction step (M = k to M = k + 1), we can conclude that the statement holds for all natural numbers (all values of M greater than or equal to the base case).
In the specific context of the proof you mentioned, the base case is established by showing that the linear projection onto a 1-dimensional subspace that maximizes the variance is defined by the eigenvector corresponding to the largest eigenvalue. Then, by assuming that the statement holds for M = k (induction hypothesis), the proof proceeds to show that the statement also holds for M = k + 1, extending the validity of the statement to higher dimensions.
Overall, the proof by induction allows us to generalize a statement from a specific case to an infinite set of cases, building upon the truth of each previous case.
</details>
---
## Lec 8 (Nonparametric Density Estimation): ==Quiz Questions==
1. How to choose the bandwidth parameter in kernel density estimation (KDE)?
<details>
<summary style="font-weight:bold">Solution</summary>
Important: The kernel function determines the shape of the bumps. The parameter $h$, also called the smoothing parameter or **bandwidth**, determines their width => <span style="color:red">$h$ is the bandwidth</span>!
* a. kernel $k(u)$ (the kernel depends on the bandwidth $h$)
* b. Cross-validation
* c. rule of thumb (e.g. Silverman’s rule of thumb; see the sketch below)
Don't forget these formula in context of this questions:
$p(\mathbf x) = \frac{K}{nh^m} \stackrel{\text{in case of KDE}}{=} \frac{1}{nh^m}\sum\limits_{i=1}^{n}k(\mathbf u) \stackrel{\text{in case of parzen window}}{=} \frac{1}{nh^m}\sum\limits_{i=1}^{n}k(\frac{(\mathbf x - \mathbf x^{(i)})}{h})$
<details>
<summary>Further reading to the bullet points:</summary>
One can use a rule of thumb such as Silverman’s rule of thumb, which is a widely used heuristic for selecting the bandwidth in kernel density estimation
However, it is important to note that this rule of thumb may not always be optimal for all datasets and may need to be adjusted based on the specific characteristics of the data being analyzed.
To choose the bandwidth parameter in kernel density estimation, one needs to consider the trade-off between bias and variance. A smaller bandwidth results in a higher variance and lower bias, while a larger bandwidth results in a lower variance and higher bias.
A common approach for selecting an appropriate bandwidth parameter is through cross-validation techniques such as leave-one-out cross-validation or local likelihood optimization
Cross-validation involves dividing the data into training and testing sets, and then using the training set to estimate the density and the testing set to evaluate the performance of the density estimator. The bandwidth that minimizes the cross-validation error is then selected as the optimal bandwidth
Another method is to use plug-in estimates based on moment assumptions about the data generating process
This involves estimating the optimal bandwidth based on the sample mean and variance of the data.
Ultimately, choosing an optimal bandwidth can be viewed as a tradeoff between goodness of fit and complexity/flexibility
</details>
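As a concrete example of the rule-of-thumb bullet above, a sketch of Silverman's rule of thumb for a 1-D sample (the $0.9\min(\hat\sigma, \mathrm{IQR}/1.34)\,n^{-1/5}$ form; the test data is arbitrary):
```python
import numpy as np

def silverman_bandwidth(x):
    # h = 0.9 * min(std, IQR / 1.34) * n^(-1/5)
    n = len(x)
    std = np.std(x, ddof=1)
    iqr = np.subtract(*np.percentile(x, [75, 25]))
    return 0.9 * min(std, iqr / 1.34) * n ** (-1 / 5)

rng = np.random.default_rng(0)
print(silverman_bandwidth(rng.normal(size=200)))   # bandwidth suggestion for a standard-normal sample
```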
</details>
2. What is the bandwidth parameter in kNN density estimation?
<details>
<summary style="font-weight:bold">Solution</summary>
- In kNN density estimation, the bandwidth is not a parameter that is explicitly defined like in kernel density estimation. **Instead, the bandwidth is determined implicitly by the choice of the number of neighbors, K**.
- In the case of kNN the number of data points $K$ inside the region $\mathcal R$ is held constant, and the volume $V$ is adjusted until the region $\mathcal R$ (equivalently the volume $V$) contains exactly the desired $K$ points.
Don't forget these formula in context of this questions:
$\qquad\displaystyle p(\mathbf x)=\frac{K}{nh^m} \stackrel{\text{in case of kNN}}{=}\frac{K}{n V}$
</details>
3. What are the properties of kernel functions in nonparametric density estimation?
<details>
<summary style="font-weight:bold">Solution</summary>
Properties:
a. Non-negativity: A kernel function must be non-negative, meaning that its values are always greater than or equal to zero.
b. Integration: A kernel function should integrate to 1 over its support. This property ensures that the kernel function represents a valid probability density.
c. Symmetry: Many commonly used kernel functions, such as the Gaussian (normal) kernel, are symmetric. Symmetry means that the kernel function is the same when its arguments are swapped.
d. Bandwidth control: The kernel function is scaled by a bandwidth parameter, which determines the width or spread of the kernel. The choice of bandwidth parameter affects the smoothness and accuracy of the density estimate.
e. Shape: Different kernel functions have different shapes, which can impact the smoothness and flexibility of the density estimate. Common kernel functions include the Gaussian (normal) kernel, the Epanechnikov kernel, the uniform kernel, and the triangular kernel.
f. Optimal properties: Some kernel functions, such as the Epanechnikov kernel, are optimal in the sense that they minimize the mean integrated squared error (MISE) of the density estimate. These optimal kernels strike a balance between accuracy and smoothness.
g. Flexibility: The choice of kernel function allows for flexibility in modeling different types of data distributions. Different kernel functions may be more suitable for different types of data.
</details>
4. What are the examples of kernel functions in nonparametric density estimation?
<details>
<summary style="font-weight:bold">Solution</summary>
Examples of kernel functions include the uniform, triangular, Epanechnikov, Gaussian, and biweight (quartic) kernels:
- Uniform (box) kernel:
$k(u) = \begin{cases} \frac12 & \text{if } |u| \le 1 \\ 0 & \text{otherwise} \end{cases}$
- Triangular kernel:
$k(u) = \begin{cases} 1 - |u| & \text{if } |u| \le 1 \\ 0 & \text{otherwise} \end{cases}$
- Epanechnikov kernel (optimal in the MISE sense):
$k(u) = \begin{cases} \frac34\left(1 - u^2\right) & \text{if } |u| \le 1 \\ 0 & \text{otherwise} \end{cases}$
- Gaussian kernel:
$k(u) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{u^2}{2}\right)$
- Biweight (quartic) kernel:
$k(u) = \begin{cases} \frac{15}{16}\left(1 - u^2\right)^2 & \text{if } |u| \le 1 \\ 0 & \text{otherwise} \end{cases}$
Here $u = \frac{x - x^{(i)}}{h}$, so the bandwidth $h$ controls the width of each kernel bump.
</details>
5. Compare nonparametric vs. parametric approaches of the density estimation. What are the underlying differences? What are the pros and cons?
<details>
<summary style="font-weight:bold">Solution</summary>
- Nonparametric density estimation is a data-driven approach that does not assume a specific functional form for the density, while parametric density estimation assumes a specific functional form
- The choice between nonparametric and parametric approaches depends on the data and the research question. Nonparametric methods are more flexible but require more data, while parametric methods are more efficient but require stronger assumptions
- Parametric and nonparametric density estimation differ fundamentally in terms of prior knowledge and modeling assumptions made about the true underlying data-generating mechanism. Parametric methods assume a functional form for the unknown density (or PDF) that matches certain structural properties of the population, such as monotonicity, unimodality, concavity, etc. They offer mathematical tractability, computational simplicity, and interpretability via familiar statistical quantities such as moments and cumulants, leading to precise inference tools. Nonparametric methods abandon these restrictive assumptions and instead directly estimate the shape of the density without making strong structural claims. This leads to greater flexibility in capturing complex patterns and shapes, and to simpler algorithms that require fewer tunable hyperparameters. However, nonparametric approaches lack this theoretical tractability and typically require substantially larger sample sizes.
</details>
6. Compare nonparametric vs. parametric approaches of density estimation. What are the underlying differences? What are the pros and cons?
<details>
<summary style="font-weight:bold">Solution</summary>
1. Advantages
a. We only have hyperparameters, no model parameters; we do not need a model assumption
b. We are much more flexible: an easy and flexible method
c. It can adapt to any shape of the density
2. Disadvantages
a. For nonparametric estimation we need many more data points to obtain the same quality of results as a parametric model that exploits prior knowledge about the distribution
b. Curse of dimensionality
c. Need for a large sample size
</details>
---
## Lec 9 (Gaussian Processes): ==Quiz Questions==
1. Validate that $k(\mathbf x, \mathbf x') = (1+\mathbf x^\mathrm T\mathbf x')^2$ is a valid kernel.
Tip:
Assume $\mathbf x=\begin{bmatrix}x_1\\x_2\end{bmatrix}$
<details>
<summary style="font-weight:bold">Solution</summary>
<img src="https://hackmd.io/_uploads/SkNqtqnvn.png" width=700px>
Show by construction that $k(\mathbf x,\mathbf x')=\phi(\mathbf x)^\mathrm T\phi(\mathbf x')$:
<img src="https://hackmd.io/_uploads/r1GpT2YDn.png" width=500px>
<img src="https://hackmd.io/_uploads/H1bbA3KDh.png" width=500px>
<img src="https://hackmd.io/_uploads/HJuDC2Fv3.png" width=500px>
<img src="https://hackmd.io/_uploads/Syi_A3tD3.png" width=500px>
</details>
2. What is the kernel trick?
<details>
<summary style="font-weight:bold">Solution</summary>
- Kernel trick: Her answer
- GP use kernels to lift input data in a higher dimensional feature
space (kernel trick)
As in the example
$\phi(\mathbf x) = [x_1^2, x_2^2 , \sqrt 2x_1x_2, \sqrt 2 x_1, \sqrt 2x_2, 1]$
→ We can calculate $k(\mathbf x,\mathbf x’)$ without ever calculating the 6-dimensional $\phi(\mathbf x)$.
Why is it better?
We effectively compute in the higher-dimensional space, so a linear model can handle nonlinear data, yet we never calculate $\phi(\mathbf x)$ explicitly.
E.g. the radial basis function (RBF) kernel: here $\phi(\mathbf x)$ would be infinite-dimensional, so $\phi(\mathbf x)^T \phi(\mathbf x')$ could not be computed explicitly. The kernel trick leads to computational savings when the dimensionality of the feature space is large compared to the number of data points.
- From Literature:
- "If an algorithm is defined solely in terms of inner products in input space then it can be lifted into feature space by replacing occurrences of those inner products by $k(\mathbf x, \mathbf x_0 )$; this is sometimes called the kernel trick. This technique is particularly valuable in situations where it is more convenient to compute the kernel than the feature vectors themselves. As we will see in the coming sections, this often leads to considering the kernel as the object of primary interest, and its corresponding feature space as having secondary practical importance. (...) Thus by using the kernel trick we can replace occurrences of the inner product by the kernel to obtain the equivalent result in feature space." <a href="https://gaussianprocess.org/gpml/chapters/RW.pdf">source</a>
- Chatpdf:
- The kernel trick is a technique used in machine learning, particularly in kernel methods, to implicitly map data into a higher-dimensional feature space without explicitly calculating the transformed feature vectors. It allows algorithms to operate in the original input space while effectively utilizing the benefits of working in a higher-dimensional space.
- In simple terms, the kernel trick enables us to compute the dot product between two feature vectors in the higher-dimensional space without explicitly calculating the transformation. This is achieved by defining a kernel function that measures the similarity between two data points in the original input space.
- By using the kernel trick, we can apply linear algorithms, such as support vector machines (SVMs), to nonlinear problems. The kernel function effectively captures the nonlinear relationships between data points by implicitly mapping them into a higher-dimensional space where the data becomes linearly separable.
- The most commonly used kernel functions include the linear kernel, polynomial kernel, Gaussian (RBF) kernel, and sigmoid kernel. These kernel functions allow us to perform complex computations in the higher-dimensional space without explicitly transforming the data, which can be computationally expensive or even infeasible for high-dimensional or infinite-dimensional feature spaces.
- Overall, the kernel trick is a powerful technique that enables us to leverage the benefits of working in a higher-dimensional feature space without explicitly calculating the transformed feature vectors, making it a key component in various machine learning algorithms.
<img src="https://hackmd.io/_uploads/S1487ohw2.png" width=500px>
<img src="https://hackmd.io/_uploads/H1rQWTKw3.png" width=500px>
</details>
---
## Lec 10 (Hidden Markov Models):
<span style="color:red">Note: These were not really quiz questions and I did not take notes here, so the questions below are more of a short summary of the slides.</span>
1. Basic Definition of Markov Processes
2. What are the 3 basic HMM problems?
<details>
<summary style="font-weight:bold">Solution</summary>
<img src="https://hackmd.io/_uploads/Byc_fTYw2.png" width=500px>
</details>
3. How do they work?
4. What are the types of HMM structure?
5. What is a continuous HMM?
6. What is the difference between GMM and HMM?
---
## Lec 11 (Markov Decision Processes): ==Quiz Questions==
1. What is the q-state?
<details>
<summary style="font-weight:bold">Solution</summary>
In the context of Markov decision processes, the q-state is a state-action pair, denoted as $(s, a)$, where $s$ is a state and $a$ is an action that can be taken in that state.
</details>
2. Why do we use discounting?
<details>
<summary style="font-weight:bold">Solution</summary>
The main reason for discounting in Markov decision processes is to give more weight to immediate rewards and less weight to future rewards. This is because future rewards are uncertain and may not be realized, while immediate rewards are more certain. By discounting future rewards, we can ensure that the sum of discounted rewards converges to a finite value, which is necessary for the optimization algorithms to converge. Discounting is done by weighting the reward received at time step $t$ with $\gamma^t$, where the discount factor satisfies $0 < \gamma < 1$; this progressively reduces the value of future rewards.
Discounting does not guarantee convergence in Markov decision processes, but it is a necessary condition for convergence. Other factors such as the choice of optimization algorithm, the structure of the problem, and the quality of the data can also affect convergence. **However, without discounting, the sum of rewards may diverge to infinity, which would make it impossible to optimize the policy.**
Convergence in a Markov decision process refers to the process of finding an optimal policy that maximizes the expected utility or sum of discounted rewards over an infinite horizon. It means that as the algorithm iteratively updates the value function and policy, the values and actions converge to a stable and optimal solution. In other words, the algorithm reaches a point where further iterations do not significantly change the value function or policy.
Convergence is important because it indicates that the algorithm has found the best possible policy given the current model and assumptions. It ensures that the agent is making the best decisions in the given environment to maximize its long-term rewards or utility.
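To make the convergence argument concrete, here is a minimal sketch (illustrative numbers, not from the lecture) comparing the undiscounted and the discounted sum of a constant per-step reward; only the discounted sum approaches the geometric-series limit $R/(1-\gamma)$:

```python
# With a bounded reward (here R = 1 per step), the undiscounted sum grows without
# bound as the horizon grows, while the discounted sum converges to R / (1 - gamma).
gamma, R = 0.9, 1.0

undiscounted = sum(R for t in range(1000))
discounted = sum((gamma ** t) * R for t in range(1000))

print(undiscounted)     # 1000.0 -> keeps growing with the horizon
print(discounted)       # ~10.0
print(R / (1 - gamma))  # geometric-series limit: 10.0
```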
</details>
3. Discussion about the finite and infinite horizon Slide.
<details>
<summary style="font-weight:bold">Solution</summary>
In the context of Markov decision processes, the difference between finite and infinite horizon is that in the finite horizon case, the agent has a fixed number of time steps to achieve its goal, while in the infinite horizon case, the agent can continue to take actions indefinitely. This affects the way that the agent's policy is optimized and the way that the value function is calculated.<br><br>
<span style="color:red">Her answer</span>:
The problem is that if you know the game ends earlier, the decisions change; the optimal actions can become completely different. She gives an example: you would act completely differently if you knew you were going to die in 10 years. The other issue is that the solution is no longer stationary, because the optimal action now depends on how much time is left.
$V([s_0,s_1,s_2,\ldots])=\underbrace{R(s_0)}_{\text{current rewards}}+\underbrace{\gamma R(s_1)+\gamma^2R(s_2)+\ldots}_{\text{future rewards}}$
In our case we do not use the Bellman equation directly; we use the Bellman update and update the value in every iteration. This means that $V_1(s)$ looks only 1 step into the future, $V_2(s)$ two steps, and so on.
</details>
4. Value Iteration slide (23): Discussion of why states close to the optimal point reach a stationary solution earlier than fields that are further away.
<details>
<summary style="font-weight:bold">Solution</summary>
Value iteration converges faster for states that are closer to the optimal point than for states that are further away. This is because the value function of the closer states is more accurate and stable than the value function of the farther states. **The closer states are more likely to receive immediate rewards and have higher transition probabilities to other rewarding states**. The farther states are more likely to receive delayed rewards and have lower transition probabilities to other rewarding states. Therefore, the value function of the closer states changes less in each iteration than the value function of the farther states.
Reminder: Value iteration is a dynamic programming algorithm that iteratively updates the value function of a Markov decision process (MDP) until convergence. The value function represents the expected long-term reward of being in a state and following an optimal policy. Value iteration does not explicitly compute or store a policy, but rather uses a one-step look-ahead to find the best action at each state.
[Markov Decision Process: How Does Value Iteration Work?](https://www.baeldung.com/cs/mdp-value-iteration)
[Value Iteration vs. Policy Iteration](https://www.baeldung.com/cs/ml-value-iteration-vs-policy-iteration)
</details>
5. Why do the utility estimates go down at the beginning?
<details>
<summary style="font-weight:bold">Solution</summary>
The utility estimates may go down at the beginning because the agent has not yet learned the optimal policy and is still exploring the state space (utility function estimate is inaccurate at the beginning). As the agent gains more experience and learns which actions lead to higher rewards, the utility estimates will increase.
Her answer:
⇒ The reason is the living penalty: states that are farther away accumulate a larger penalty simply because they are "alive" longer before reaching the +1 state than states that are close to it.
<img src="https://hackmd.io/_uploads/B13Hz3hPn.png" width=500px>
</details>
6. Policy evaluation: how to find optimal policies (overview of all methods: Value Iteration, Policy Iteration, Q-Learning, Direct Evaluation: Monte Carlo Methods, Temporal Difference Learning).
<details>
<summary style="font-weight:bold">Solution</summary>
{%hackmd SkXf9zg_n %}
</details>
7. Quiz: Calculate the Transition Value
States are $\{a,b,c,d,e\}$
The Reward function $R(s)$ is given as $R=\begin{bmatrix}10&0&0&0&1\end{bmatrix} = \begin{bmatrix}a&b&c&d&e\end{bmatrix}$
Actions: East, West and Exit (only available in exit states a, e)
Transitions: deterministic
For a discount factor of $\gamma = 0.1$, what is the optimal policy? $\begin{bmatrix}10|&| &| & |&1\end{bmatrix}$
You start in state $d$.
<details>
<summary style="font-weight:bold">Solution</summary>
$\begin{bmatrix}10|&\leftarrow|\leftarrow&|&\rightarrow|&1\end{bmatrix}$
Because we know that only two policies make sense, i.e.
$\pi_1 = \{\leftarrow,\leftarrow,\leftarrow, \text{exit}\}$
or
$\pi_2=\{\rightarrow,\text{exit}\}$
<s>we can use the policy iteration equation</s>
$V^\pi_{k+1}(s) = R(s) + \gamma \sum\limits_{s'} P(s'|s, \pi(s)) V^\pi(s')$
It is better to use this equation:
$V([s_0,s_1,s_2,\ldots])=\underbrace{R(s_0)}_{\text{immediate rewards}}+\underbrace{\gamma R(s_1)+\gamma^2R(s_2)+\ldots}_{\text{future rewards}}$
which can be used to calculate the value function for a given policy.
- Case 1: using policy $\pi_1 = \{\leftarrow,\leftarrow,\leftarrow, \text{exit}\}$:
$V^{\pi_1}(d,c,b,a) = R(d)+\gamma R(c)+\gamma^2R(b)+\gamma^3R(a) = \gamma^3 R(a) = (0.1)^3\cdot 10 = 0.01$
- Case 2: using policy $\pi_2=\{\rightarrow,\text{exit}\}$
$V^{\pi_2}(d,e) = R(d)+\gamma R(e) = (0.1)\cdot 1 = 0.1$
Therefore we should use the policy $\pi_2$ to maximize the Value function.
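A small sketch (not from the slides) reproducing the two case calculations above by evaluating each candidate policy as a discounted sum of rewards along its state sequence:

```python
# Deterministic 5-state chain with R = [10, 0, 0, 0, 1], gamma = 0.1, start in d.
gamma = 0.1
R = {"a": 10.0, "b": 0.0, "c": 0.0, "d": 0.0, "e": 1.0}

def trajectory_value(states):
    """Discounted return V = R(s0) + gamma*R(s1) + gamma^2*R(s2) + ..."""
    return sum((gamma ** t) * R[s] for t, s in enumerate(states))

print(trajectory_value(["d", "c", "b", "a"]))  # pi_1 (go left, then exit): 0.01
print(trajectory_value(["d", "e"]))            # pi_2 (go right, then exit): 0.1
```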
<img src="https://hackmd.io/_uploads/HJEgR33D3.png" width=500px>
<img src="https://hackmd.io/_uploads/BJtb0nhD2.png" width=500px>
<img src="https://hackmd.io/_uploads/rk3M0nnPh.png" width=500px>
</details>
8. Quiz: Which is the better policy?
<img src="https://hackmd.io/_uploads/BJ8IAn2wh.png" width=500px>
<details>
<summary style="font-weight:bold">Solution</summary>
In the first case the optimal policy is the left route.
</details>
9. Quiz: Bellman equation: Is the utility $V(s)$ the optimal value?
<img src="https://hackmd.io/_uploads/Byi50h3Dh.png" width=900px>
Here we use the chances of 0.8 for North and 0.1 for left and right. We should use the stopping criterion $|V_{k+1}-V_k|<\epsilon = 0.01$ to check whether we have converged.
<details>
<summary style="font-weight:bold">Solution</summary>
We can argue by using this graph:
<img src="https://hackmd.io/_uploads/B1A3R33wn.png" width=400px>
If we take the field with value 0.705, which has the largest distance to the +1 state, we can check whether it has converged:
$V(1,1) = -0.04+\gamma\max\limits_a\begin{cases}\text{up:} & 0.8\cdot \underbrace{V(1,2)}_{0.762} + 0.1\cdot \underbrace{V(1,1)}_{0.705} + 0.1\cdot \underbrace{V(2,1)}_{0.655} \quad\Rightarrow\ \text{highest!}\\ \text{left, right, down:} & \text{lower expected values}\end{cases}$
With $\gamma = 1$: $V(1,1) = -0.04 + 0.8\cdot 0.762 + 0.1\cdot 0.705 + 0.1\cdot 0.655 = 0.7056 \approx 0.705$, so this state has essentially converged.
</details>
10. Reward questions:
- What is the delayed reward?
- What is the meaning of gamma (discount factor)?
- Why is gamma less than 1?
<br>
<details>
<summary style="font-weight:bold">Solution</summary>
- What is the delayed reward?
- The delayed reward refers to the concept that the reward received by an agent in a reinforcement learning problem may not be immediate, but rather delayed until a later time step. This means that the agent may need to take a sequence of actions before receiving a reward, and the reward may depend on the entire sequence of actions taken.
- What is the meaning of gamma (discount factor)?
Earlier rewards are weighted more heavily: Gamma (the discount factor) is a parameter used in reinforcement learning algorithms to determine the importance of future rewards relative to immediate rewards. It is a value between 0 and 1 that determines the weight given to future rewards in the value function update. A higher gamma value means that future rewards are given more weight, while a lower gamma value means that immediate rewards are given more weight.
- Why is gamma less than 1?
- We can guarantee convergence (geometric series; see the short derivation after this list): The reason why gamma is less than 1 is because it allows the value function to converge to a finite value, even in infinite-horizon problems. If gamma were equal to 1, the value function would not converge, and the agent would not be able to make optimal decisions. By discounting future rewards, the agent is encouraged to take actions that lead to immediate rewards, while still considering the potential future rewards. Additionally, a lower gamma value can help the agent to focus on short-term goals and avoid getting stuck in suboptimal long-term strategies.
<img src="https://hackmd.io/_uploads/rk4j7pKwn.png" width=500px>
- What is the difference between current reward and immediate reward? (unsure, chatpdf)
The terms "current reward" and "immediate reward" are often used interchangeably in the context of reinforcement learning, and they both refer to the reward received by an agent for taking a particular action in a particular state.
However, some sources may use the term "current reward" to refer specifically to the reward received at the current time step, while "immediate reward" may refer to the reward received at the next time step. In this case, the "current reward" would be the reward that the agent receives immediately after taking an action in a state, while the "immediate reward" would be the reward that the agent receives after transitioning to the next state.
In general, the terms "current reward" and "immediate reward" are used to emphasize the fact that the reward is received immediately after taking an action, as opposed to a delayed reward that may be received at a later time step. However, the specific usage of these terms may vary depending on the context and the source of the information.
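Regarding the geometric-series argument referenced above: assuming the rewards are bounded by some $R_{\max}$, the discounted return is bounded for any $0 \le \gamma < 1$:
$$\left|\sum_{t=0}^{\infty}\gamma^t R(s_t)\right| \le \sum_{t=0}^{\infty}\gamma^t R_{\max} = \frac{R_{\max}}{1-\gamma} < \infty$$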
</details>
11. Is downloading a copy of all Wikipedia articles an example of machine learning?
<details>
<summary style="font-weight:bold">Solution</summary>
Downloading a copy of all Wikipedia articles is not an example of machine learning. Machine learning involves using algorithms to learn patterns in data and make predictions or decisions based on those patterns. Downloading Wikipedia articles is simply a data collection task and does not involve any learning or decision-making.
</details>
12. Calculation: consider the dataset $X=\{\}$. Conduct PCA. Calculate the matrix $W$ which transforms a data point $x$ into $y = W^T x$ ==#possible_exam_question==
13. Describe the difference between GMM and HMM ==#possible_exam_question==
<details>
<summary style="font-weight:bold">Solution</summary>
The main difference between Gaussian Mixture Models (GMM) and Hidden Markov Models (HMM) lies in their underlying structures and applications.
GMM is a probabilistic model used for clustering and density estimation. It assumes that the data is generated from a mixture of Gaussian distributions, where each Gaussian component represents a cluster. GMMs are often used in tasks such as image segmentation, speech recognition, and anomaly detection. In GMM, the observed data points are directly modeled as a mixture of Gaussians, and the goal is to estimate the parameters of the Gaussian components.
On the other hand, HMM is a statistical model used for modeling sequential data with hidden states. It assumes that there is an underlying Markov process with hidden states, and the observed data is generated based on the current hidden state. HMMs are widely used in speech recognition, natural language processing, and bioinformatics. In HMM, the hidden states represent the underlying system dynamics, and the observed data is used to infer the sequence of hidden states.
In summary, GMM is used for clustering and density estimation of observed data, while HMM is used for modeling sequential data with hidden states.
</details>
14. Given the model (…), compute the probability to observe… ==#possible_exam_question==
15. Why do we do “that” in the algorithm, Why do we use the largest eigenvalue to calculate the principal axis… ==#possible_exam_question==
16. draw a HMM model (...) ==#possible_exam_question==
<details>
<summary style="font-weight:bold">Solution</summary>
<img src="https://hackmd.io/_uploads/BJwltuJu3.png">
<br><br>
We do not know the states, but only the observations. Due to the observations we can reconstruct what the hidden states of the HMM could be.
<img src="https://hackmd.io/_uploads/H1WUKdkO2.png">
</details>
17. Example to describe or draw a markov model, with some informations, states, nodes, edges, … ==#possible_exam_question==
<details>
<summary style="font-weight:bold">Solution</summary>
To describe a Markov model, let's consider a simple example of a weather model with two states: "Sunny" and "Rainy".
- States: The states represent the different conditions or situations in the model. In this case, we have two states: "Sunny" and "Rainy".
- Nodes: Each state is represented by a node in the Markov model. So, we will have two nodes, one for "Sunny" and one for "Rainy".
- Edges: The edges represent the transitions between states. In this example, we can have an edge from the "Sunny" node to the "Rainy" node, indicating a transition from a sunny day to a rainy day. Similarly, we can have an edge from the "Rainy" node to the "Sunny" node, indicating a transition from a rainy day to a sunny day.
- Transition probabilities: Each edge is associated with a transition probability, which represents the likelihood of transitioning from one state to another. For example, the transition probability from "Sunny" to "Rainy" might be 0.3, indicating a 30% chance of a sunny day turning into a rainy day.
- Initial probabilities: The initial probabilities represent the probabilities of starting in each state. In this example, we can assume that the initial probability of starting in the "Sunny" state is 0.6, and the initial probability of starting in the "Rainy" state is 0.4.
- State transition matrix: The state transition matrix is a matrix that contains the transition probabilities between all pairs of states. In this example, the state transition matrix would be a 2x2 matrix, where the (i,j)th element represents the probability of transitioning from state i to state j. For example, the state transition matrix might look like:<br>
$\begin{array}{c|c|c} &\text{Sunny}&\text{Rainy}\\\text{Sunny}&0.7&0.3&\\\text{Rainy}&0.4&0.6&\end{array}$<br>
This matrix indicates that there is a 70% chance of staying in the "Sunny" state and a 30% chance of transitioning to the "Rainy" state, and a 40% chance of transitioning from the "Rainy" state to the "Sunny" state and a 60% chance of staying in the "Rainy" state.
- Observations: In a Markov model, observations are generated based on the current state. In this example, we can assume that the observation at each time step is either "Sunny" or "Rainy", depending on the current state.
Overall, a Markov model is a graphical representation of a system that can be in different states, and the transitions between these states are governed by probabilities. The model can be used to predict the future state of the system based on the current state and the transition probabilities.
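A minimal simulation sketch (not part of the lecture) of exactly this two-state weather chain, using the transition and initial probabilities from the description above; the horizon of 10 days is arbitrary:

```python
import random

states = ["Sunny", "Rainy"]
P = {"Sunny": {"Sunny": 0.7, "Rainy": 0.3},       # transition probabilities
     "Rainy": {"Sunny": 0.4, "Rainy": 0.6}}
initial = {"Sunny": 0.6, "Rainy": 0.4}            # initial state distribution

random.seed(0)
state = random.choices(states, weights=[initial[s] for s in states])[0]
trajectory = [state]
for _ in range(9):                                # simulate 9 more days
    state = random.choices(states, weights=[P[state][s] for s in states])[0]
    trajectory.append(state)

print(trajectory)   # one sampled 10-day weather sequence
```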
</details>
18. She gives us a line of code containing mistakes, e.g. a wrong index, and asks why it is wrong. ==#possible_exam_question==
19. Value Iteration: Perform 1 iteration of value iteration algorithm. ==#quiz_question==
<img src="https://hackmd.io/_uploads/rJck9d1dn.png" width=600px>
<details>
<summary style="font-weight:bold">Solution</summary>
1. Label every state.<br>
2. Assumptions:
- $P(s'|s, \text{chosen_direction}) = 0.8$
- $P(s'|s, \text{wrong_direction}) = 0.1$
- $P(s'|s, \text{opposite_direction}) = 0$<br>
3. For every action
$a \in A = \{\text{chosen_direction}, \text{wrong_direction}, \text{opposite_direction}\}$
apply the iterative formula.<br><br>
- We have:<br>
- $\gamma = 1$
- $R(s) = -0.04$ for all states, except terminal states
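A minimal sketch (not from the slides) of a single Bellman backup under these assumptions; the successor values in the usage example are the ones quoted in question 9 for state (1,1), and only the dominant "up" action is spelled out, the other actions being assumed to have lower expected value:

```python
GAMMA = 1.0
LIVING_REWARD = -0.04

def bellman_backup(action_outcomes, reward=LIVING_REWARD, gamma=GAMMA):
    """One value-iteration update for a single state.

    action_outcomes: for each action, a list of (probability, successor value)
    pairs that already encode the 0.8 / 0.1 / 0.1 slip model and walls.
    """
    best_q = max(sum(p * v for p, v in outcomes) for outcomes in action_outcomes)
    return reward + gamma * best_q

# Illustrative check for state (1,1) using V(1,2)=0.762, V(1,1)=0.705, V(2,1)=0.655:
up = [(0.8, 0.762), (0.1, 0.705), (0.1, 0.655)]
print(round(bellman_backup([up]), 4))   # 0.7056, i.e. essentially converged
```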
<img src="https://hackmd.io/_uploads/rkfXBH1On.png" width = 800px>
<img src="https://hackmd.io/_uploads/r1PcdwJu3.png" width = 800px>
</details>
20. Toy Example - Hidden Markov Model (HMM): A professor does not really tell the students what the scope and format of the exam will be; he just says there might be some derivations or there might not, there could be theoretical, illustrative, or purely descriptive questions, or there could also be purely true/false questions. There is no other information per se, the exam is also taking place for the first time, and there are no previous exams. Therefore, it may only become clear what the hidden states are through the observations made during the exam itself.
<details>
<summary style="font-weight:bold">Solution</summary>
{%hackmd SJZofQedh %}
</details>
21. An example: A three state Markov model of the weather:
Given are the states
- State 1: Rain or Snow
- State 2: Cloudy
- State 3: Sunny
as well as the transitions matrix
$\mathbf A = \{a_{ij}\} = \begin{bmatrix}0.4&0.3&0.3\\0.2&0.6&0.2\\0.1&0.1&0.8\end{bmatrix}$
a) Draw the state diagram
b) Given that the weather on day $t=1$ is sunny, what is the probability that the weather for the next 7 days will be "sun, sun, rain, rain, sun, clouds, sun"?
<details>
<summary style="font-weight:bold">Solution</summary>
<img src="https://hackmd.io/_uploads/Bk53kExd2.png" width=600px>
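A small sketch (not from the slides) computing this probability directly as the product of the corresponding entries of $\mathbf A$, with the start state fixed to sunny:

```python
import numpy as np

# Transition matrix from the question; states: 1 = rain/snow, 2 = cloudy, 3 = sunny.
A = np.array([[0.4, 0.3, 0.3],
              [0.2, 0.6, 0.2],
              [0.1, 0.1, 0.8]])

# Day 1 is sunny; the next 7 days are sun, sun, rain, rain, sun, clouds, sun.
states = [3, 3, 3, 1, 1, 3, 2, 3]                 # 1-based state indices
prob = 1.0
for s, s_next in zip(states, states[1:]):
    prob *= A[s - 1, s_next - 1]

print(prob)   # a33*a33*a31*a11*a13*a32*a23 ≈ 1.536e-04
```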
</details>
22. An example: A three state Markov model of the weather:
Given is the state diagram of a markov model
<img src="https://hackmd.io/_uploads/H1Y-ZEeu2.png" width = 250px>
with the states
- State 1: Rain or Snow
- State 2: Cloudy
- State 3: Sunny
a) Calculate the transitions matrix $\mathbf A$.
b) Given that the weather on day $t=1$ is sunny, what is the probability that the weather for the next 7 days will be "sun, sun, rain, rain, sun, clouds, sun"?
<details>
<summary style="font-weight:bold">Solution</summary>
<img src="https://hackmd.io/_uploads/Bk53kExd2.png" width=600px>
</details>
23. How to calculate the utility of state sequences?
<details>
<summary style="font-weight:bold">Solution </summary>
<img src="https://hackmd.io/_uploads/BJJeYHg_n.png" width = 500px>
</details>
24. An example: Grid World:
<img src="https://hackmd.io/_uploads/r1Hb7VgOh.png" width=500px>
In a stochastic world, how likely is it that an agent reaches the goal state at (4,3) with the
action sequence [up, up, right, right, right]?
<details open>
<summary style="font-weight:bold">Solution </summary>
<span style="color:red">It is important to calculate the LIKELIHOOD NOT THE REWARD 🤦♂....</span>
$\begin{aligned}P({s_{34}|s_{11}}) &= P({s_{21}|s_{11}})P({s_{31}|s_{21}})P({s_{32}|s_{31}})P({s_{33}|s_{32}})P({s_{34}|s_{33}}) \\&+ P({s_{12}|s_{11}})P({s_{13}|s_{12}})P({s_{23}|s_{13}})P({s_{33}|s_{23}})P({s_{34}|s_{33}}) \\ &=0.8^5+0.1^4\cdot 0.8 = 0.32776\end{aligned}$
Actually interesting: in a stochastic world there are, although only 5 actions are executed in total, exactly two paths leading to $s_{34}$ that the robot can take when receiving exactly these 5 actions, and therefore the total probability is larger!
<img src="https://hackmd.io/_uploads/H1mlJBedn.png">
<img src="https://hackmd.io/_uploads/rJVTMVx_3.png" width = 600px>
</details>
---
## Lec 12 (Reinforcement Learning): ==Quiz Questions==
1. Are these statements true or false?
a. Temporal Difference Learning can be used only in episodic
settings.
<details>
<summary style="font-weight:bold">Solution</summary>
<ul style="list-style-type: '⇒ ';">
<li>You can also update immediately after every step, therefore the statement is false.</li>
</ul>
- Perplexity:
- Temporal difference (TD) learning is a class of model-free reinforcement learning methods that learn by bootstrapping from the current estimate of the value function
- These methods sample from the environment, like Monte Carlo methods, and perform updates based on current estimates, like dynamic programming methods
- TD learning can be used in continuous problems as well and can learn from incomplete episodes
- Therefore, the statement "Temporal difference learning can be used only in episodic settings" is false. TD learning can be used in both episodic and continuous settings.
</details>
b. Monte Carlo Policy Evaluation (Direct Policy Evaluation) can be
used only in episodic settings.
<details>
<summary style="font-weight:bold">Solution</summary>
<ul style="list-style-type: '⇒ ';">
<li>True, it only works in episodic settings: we need complete episodes (from start to end) to compute the returns, so it is not possible otherwise.</li>
</ul>
</details>
c. Both Direct Policy Evaluation and TD learning are sampling-based
learning approaches.
<details>
<summary style="font-weight:bold">Solution</summary>
<ul style="list-style-type: '⇒ ';">
<li>True</li>
</ul>
</details>
2. Describe the differences between Temporal Difference Learning
and Q learning.
<details>
<summary style="font-weight:bold">Solution</summary>
- For a given policy, TD value learning estimates the state values $V(s)$; Q-learning instead estimates Q-values for all pairs of state $s$ and action $a$, so it directly learns about actions.
- Q-learning therefore contains more "action learning".
- TD (value) learning depends on the policy that generates the samples.
- Q-learning does not depend on that policy: it is off-policy (a small sketch of both update rules follows below).
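A minimal sketch (not from the lecture) of the two update rules side by side; here `V` is a dict of state values, `Q` a dict of per-state action values, and `alpha`/`gamma` are the usual learning rate and discount factor:

```python
def td0_update(V, s, r, s_next, alpha, gamma):
    """TD(0): evaluates the policy that generated the transition (on-policy)."""
    V[s] += alpha * (r + gamma * V[s_next] - V[s])

def q_learning_update(Q, s, a, r, s_next, alpha, gamma):
    """Q-learning: bootstraps from the best next action (off-policy)."""
    best_next = max(Q[s_next].values())
    Q[s][a] += alpha * (r + gamma * best_next - Q[s][a])
```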
</details>
3. What is unique/different in direct policy evaluation approach,
compared to dynamic programming, TD learning and Q learning?
<details>
<summary style="font-weight:bold">Solution</summary>
- Direct (Monte Carlo) evaluation does not exploit the Markov model: it ignores the relation between $V(s)$ and the successor values $V(s')$ and estimates each state from complete returns.
- TD learning, in contrast, bootstraps: it does use this Markov relation between successive states.
- In Q-learning, the maximum over next actions in $\max_{a'}Q(s',a')$ does not commit to the action actually taken by the behaviour policy.
</details>
4. Quiz: Q-learning
- States are $\{a,b,c,d,e\}$
- $R(s)$ is given as $R=\begin{bmatrix}10&0&0&0&-10\end{bmatrix} = \begin{bmatrix}a&b&c&d&e\end{bmatrix}$
- Actions: East, West and Exit (only available in exit states a, e)
- Assume: $\gamma = 1$, $\alpha=1/2$
- Start with zero inital Q values
- Robot experienced:
- (b, right, 0, c, right, 0, d, right, 0, e, exit, -10)
- (b, right, 0, c, right, 0, d, right, 0, e, exit, -10)
- Robot repeated the same multiple times.
What would be Q value of state c?
<details>
<summary style="font-weight:bold">Image from slide</summary>
<img src="https://hackmd.io/_uploads/S1PLdTtPh.png" width=400px>
</details>
<details>
<summary style="font-weight:bold">Solution</summary>
$Q(c,left)=0$
$Q(c,right)=0$
⇒ Because the Q values of the neighbouring states b and d are still zero, we start from zero, and the update takes the maximum of those Q values, so we get 0
$R=\begin{bmatrix}10&0|0&0|0&0|-5&-10\end{bmatrix} = \begin{bmatrix}a&b&c&d&e\end{bmatrix}$
⇒ Important: The policy states that we only go to the right!
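A small sketch (not from the slides) replaying the two listed episodes with the tabular Q-learning update under the stated assumptions ($\gamma = 1$, $\alpha = 1/2$, zero initialisation); after these two episodes $Q(c, \text{right})$ is indeed still 0:

```python
from collections import defaultdict

gamma, alpha = 1.0, 0.5
Q = defaultdict(lambda: defaultdict(float))   # all Q values start at zero

episode = [("b", "right",   0.0, "c"),
           ("c", "right",   0.0, "d"),
           ("d", "right",   0.0, "e"),
           ("e", "exit",  -10.0, None)]       # exiting in e gives reward -10

for _ in range(2):                            # the two experienced episodes
    for s, a, r, s_next in episode:
        future = 0.0 if s_next is None else max(Q[s_next].values() or [0.0])
        Q[s][a] += alpha * (r + gamma * future - Q[s][a])

print(dict(Q["c"]))   # {'right': 0.0} -> Q(c, right) is still 0
print(dict(Q["d"]))   # Q(d, right) has started to pick up the -10 exit reward
```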
<img src="https://hackmd.io/_uploads/SJvSVyyO2.jpg" width=500px>
</details>
5. Problems with Direct Evaluation 1:
- What's good about direct evaluation?
- What is bad about it?
<details>
<summary style="font-weight:bold">Solution</summary>
<ul style="list-style-type: '⇒ ';">
<li>Important: It wastes information about the connections between states; each state is learned separately.</li>
<li>This means we do not use the Markov property!</li>
</ul>
</details>
6. Problems with Direct Evaluation 2: Example Question:
<img src="https://hackmd.io/_uploads/HJ-G3SeO3.png" width=250px>
<details>
<summary style="font-weight:bold">Solution</summary>
The policy from state B is $\pi_1 = \{right, right, exit\}$.
The policy from state E is $\pi_2 = \{up, right, exit\}$.
The reason is that the observations under policy $\pi_2$ have to be completely different in comparison to those under policy $\pi_1$ (unsure).
See Lec 12 Question 12.
</details>
7. Differences between Q Learning and TD Learning
<details>
<summary style="font-weight:bold">Solution</summary>
Q-learning is off-policy: we learn something that does not have to reflect the current (behaviour) policy, even while we are following it.
</details>
8. What are examples of Reinforcement Learning?
<details>
<summary style="font-weight:bold">Solution</summary>
- Walking Robots
- Video Games
- Modelling of friction<br><br>
**Perplexity:**
- Healthcare: RL can be used to find optimal policies for patients to receive treatment
- RL can also enable improvement of long-term outcomes by factoring the delayed effects of treatments
- Engineering: RL can be used to optimize the design of products and systems
- Industrial automation: RL can be used to optimize the performance of industrial processes
- Trading stock prices forecasting: RL can be used to forecast stock prices and optimize trading strategies
- News recommendations: RL can be used to recommend news articles to users based on their preferences.
</details>
9. What are the 3 equations of the MDP algorithms and what do they mean? Explain all the quantities that occur.
<details>
<summary style="font-weight:bold">Solution</summary>
<img src="https://hackmd.io/_uploads/BkAN3_1O3.png" width = 600px>
<span style="font-weight:bold">Perplexity (unchecked)</span> [src](https://www.perplexity.ai/search/59aafdeb-9c91-41a9-bd7b-2b8e62df22f6?s=u)
<div>
The Markov Decision Process (MDP) algorithm involves three equations that are used to find the optimal policy for a given MDP. The three equations are:<br><br>
1. Bellman Expectation Equation: This equation expresses the expected value of the reward for a given state and action in terms of the expected value of the rewards for the next state and action. The equation is given by:
$$V(s) = \sum_{a} \pi(a|s) \sum_{s'} P(s'|s,a)[R(s,a,s') + \gamma V(s')]$$
where:
- V(s) is the expected value of the reward for state s
- a is the action taken in state s
- $\pi(a|s)$ is the policy that specifies the probability of taking action a in state s
- s' is the next state
- P(s'|s,a) is the probability of transitioning to state s' from state s when action a is taken
- R(s,a,s') is the reward obtained for transitioning from state s to state s' when action a is taken
- $\gamma$ is the discount factor that determines the importance of future rewards
<br>
2. Bellman Optimality Equation: This equation expresses the optimal value of the reward for a given state in terms of the optimal value of the rewards for the next state and action. The equation is given by:
$$V^*(s) = \max_{a} \sum_{s'} P(s'|s,a)[R(s,a,s') + \gamma V^*(s')]$$
where:
- $V^*(s)$ is the optimal value of the reward for state s
- a is the action taken in state s
- s' is the next state
- P(s'|s,a) is the probability of transitioning to state s' from state s when action a is taken
- R(s,a,s') is the reward obtained for transitioning from state s to state s' when action a is taken
- $\gamma$ is the discount factor that determines the importance of future rewards
<br>
3. Policy Iteration Algorithm: This algorithm is an iterative algorithm that involves evaluating and improving a policy until an optimal policy is found. The algorithm involves two steps:
- Policy Evaluation: This step involves evaluating the expected reward for each state and action under the current policy. The equation for policy evaluation is the same as the Bellman Expectation Equation.
- Policy Improvement: This step involves selecting the action that maximizes the expected reward for each state under the current policy. The equation for policy improvement is given by:
$$\pi'(s) = \arg\max_{a} \sum_{s'} P(s'|s,a)[R(s,a,s') + \gamma V^{\pi}(s')]$$
where:
- $\pi'(s)$ is the new policy for state s
- a is the action taken in state s
- s' is the next state
- P(s'|s,a) is the probability of transitioning to state s' from state s when action a is taken
- R(s,a,s') is the reward obtained for transitioning from state s to state s' when action a is taken
- $\gamma$ is the discount factor that determines the importance of future rewards
- $V^{\pi}(s')$ is the expected value of the reward for state s' under the current policy $\pi$
These equations are used to find the optimal policy that maximizes the expected reward over time by taking into account the optimal value of the rewards for the next state and action.
</div>
</details>
10. Discussion about MDP vs. Reinforcement Learning
<details>
<summary style="font-weight:bold">Solution</summary>
<img src="https://hackmd.io/_uploads/HJPA3uJOn.png">
<img src="https://hackmd.io/_uploads/SJsgpOy_n.png">
<img src="https://hackmd.io/_uploads/BypQT_kdh.png">
<img src="https://hackmd.io/_uploads/BJCc6OJun.png" width = 500px>
<span style="font-weight:bold">Perplexity (unchecked)</span> [src](https://www.perplexity.ai/search/59aafdeb-9c91-41a9-bd7b-2b8e62df22f6?s=u)
<div>
Reinforcement Learning (RL) is a type of machine learning algorithm that focuses on training models to make decisions in an environment to maximize a reward. Here are some reasons why we should use RL:<br><br>
**Benefits of Reinforcement Learning:**
- RL can be used to solve very complex problems that cannot be solved by conventional techniques
- RL algorithms maintain a balance between exploration and exploitation, which is the process of trying different things to see if they are better than what has been tried before and trying the things that have worked best in the past, respectively
- RL can be useful when the only way to collect information about the environment is to interact with it
- RL can gain experience and feedback (rewards) from its actions, which helps it to improve its results
- RL can find optimal policies using previous experiences without the need for previous information on the mathematical model of biological systems
- RL can be used for tasks with objectives such as robots playing soccer or self-driving cars getting to their destinations
</div>
</details>
11. Example: Model-Based Learning
<img src="https://hackmd.io/_uploads/SyVhtApw3.png" width=600px>
<details>
<summary style="font-weight:bold">Solution</summary>
<img src="https://hackmd.io/_uploads/HyMzNkAw3.png" width=800px>
</details>
12. Example: Model-Based Learning
<img src="https://hackmd.io/_uploads/ByAGpzRwh.png" width=600px>
<details>
<summary style="font-weight:bold">Solution</summary>
<img src="https://hackmd.io/_uploads/ryOsnMAP3.png" width=800px>
</details>
13. Example: Model-Based Learning
<img src="https://hackmd.io/_uploads/BkqrRM0Dh.png" width=600px>
<details>
<summary style="font-weight:bold">Solution</summary>
<img src="https://hackmd.io/_uploads/SkMLi7APn.png" width=400px>
</details>
14. HMM: Given a completely specified HMM $\lambda = (A, B, \pi)$, how can an observation sequence $O = \{o_1, o_2, \ldots, o_T\}$ be generated?
<details>
<summary style="font-weight:bold">Solution</summary>
<img src="https://hackmd.io/_uploads/HJROFSeOn.png" width=500px>
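A minimal sketch (not from the slides) of the generation procedure: draw the initial state from $\pi$, emit a symbol from the corresponding row of $B$, transition via $A$, and repeat $T$ times; the toy matrices in the usage example are made up:

```python
import numpy as np

def generate_observations(A, B, pi, T, seed=0):
    """Sample an observation sequence O = (o_1, ..., o_T) from an HMM (A, B, pi).

    A[i, j] = P(next state j | state i), B[i, k] = P(symbol k | state i),
    pi[i] = P(initial state i).
    """
    rng = np.random.default_rng(seed)
    states, obs = [], []
    s = rng.choice(len(pi), p=pi)                   # 1. draw the initial state from pi
    for _ in range(T):
        states.append(s)
        obs.append(rng.choice(B.shape[1], p=B[s]))  # 2. emit a symbol from B
        s = rng.choice(len(pi), p=A[s])             # 3. move to the next state via A
    return states, obs

# Made-up toy model with 2 hidden states and 3 observation symbols.
A  = np.array([[0.7, 0.3], [0.4, 0.6]])
B  = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
pi = np.array([0.6, 0.4])
print(generate_observations(A, B, pi, T=5))
```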
</details>
15. RL: How to choose actions while still learning
<details>
<summary style="font-weight:bold">Solution</summary>
- A: Exploration vs. Exploitation:
How to choose actions while still learning?
- Simplest: random actions ($\varepsilon$-greedy)
- With (small) probability $\varepsilon$, act randomly
- With (large) probability $1-\varepsilon$, act on current policy
- After learning sufficiently, we do not want to keep acting randomly any more.
- You do eventually explore the space, but keep thrashing around once learning is
done
- One solution: lower $\varepsilon$ over time
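A minimal sketch (not from the slides) of $\varepsilon$-greedy action selection with a decaying $\varepsilon$; the Q-table layout and the decay schedule are illustrative:

```python
import random

def epsilon_greedy(Q, state, actions, epsilon):
    """With probability epsilon act randomly, otherwise follow the current policy."""
    if random.random() < epsilon:
        return random.choice(actions)                          # explore
    return max(actions, key=lambda a: Q.get((state, a), 0.0))  # exploit

def decayed_epsilon(episode, eps_start=1.0, eps_min=0.05, decay=0.995):
    """Lower epsilon over time so the agent stops thrashing once learning is done."""
    return max(eps_min, eps_start * decay ** episode)
```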
<br><img src="https://hackmd.io/_uploads/BJ37CBe_3.png" width=500px>
</details>