# Machine Learning Homework 3 (10.12.2020)
## Timeline
* 14:15 - 14:25 -> problem assignment
* 14:25 - 15:00 -> write down your solutions
* 15:00 - 15:45 -> presentations
## Declarations
Please write down problem numbers you wish to declare next to your name. Put "-" if you have nothing to declare:
:::warning
| |1|2|3|4|5|6|7|8|9|
| ---:| --- | --- | --- | --- | --- | --- | --- | --- | --- |
|Zbigniew Drozd | |A|X| | | |X|X| |
|Michał Grzymek | | | | | | | | | |
|Michał Kępa | | | | | | | | | |
|Jakub Kuciński |A|X|X| |X|X|X| | |
|Karol Kuczmarz |X|A|X| |X|X|X| | |
|Michał Maras |X|X|A| |X|X|X|X| |
|Hubert Obrzut |X|X|A| |X|X|X|X| |
|Michał Opanowicz |X|X|X| |A|X|X|X| |
|Mateusz Orda |X|X|X| |X| |A|X| |
|Wiktor Pilarczyk |X|X|X| |X|A|X|X| |
|Piotr Słowik |X|X| | |A|X| | | |
|Bartosz Stefaniak |-|-|-|-|-|-|-|-|-|
|Michał Syposz |X|X|X| |X|X|X|A| |
|Aleksander Szymański|X|X|X| |A| |X|X| |
|Bartosz Troszka |A|X| | | | | | | |
:::
## Solutions
## Problem 1 (Jakub Kuciński, Bartosz Troszka)
#### Jakub Kuciński
$$L_t =\frac{1}{N}\left[ \sum_i w_t^i e^{-\alpha_t} + \sum_{i:y^i \neq f_t(x^i)} w_t^i(e^{\alpha_t} - e^{-\alpha_t}) \right]$$
We can take the derivative of $L_t$ with respect to $\alpha_t$ and set it to zero to find the $\alpha_t$ that minimizes $L_t$.
$$\frac{\partial L_t}{\partial \alpha_t} = -\frac{1}{N} e^{-\alpha_t} \sum_i w_t^i + \frac{1}{N}(e^{\alpha_t} + e^{-\alpha_t}) \sum_{i:y^i \neq f_t(x^i)} w_t^i = 0$$
$$- e^{-\alpha_t} \sum_i w_t^i + (e^{\alpha_t} + e^{-\alpha_t}) \sum_{i:y^i \neq f_t(x^i)} w_t^i = 0$$
$$e^{-\alpha_t} \sum_i w_t^i - (e^{\alpha_t} + e^{-\alpha_t}) \sum_{i:y^i \neq f_t(x^i)} w_t^i = 0$$
$$e^{-\alpha_t} (\sum_i w_t^i - \sum_{i:y^i \neq f_t(x^i)} w_t^i) - e^{\alpha_t} \sum_{i:y^i \neq f_t(x^i)} w_t^i = 0$$
$$e^{-\alpha_t} (\sum_i w_t^i - \sum_{i:y^i \neq f_t(x^i)} w_t^i) = e^{\alpha_t} \sum_{i:y^i \neq f_t(x^i)} w_t^i$$
$$\frac{e^{\alpha_t}}{e^{-\alpha_t}} = \frac{\sum_i w_t^i - \sum_{i:y^i \neq f_t(x^i)} w_t^i}{\sum_{i:y^i \neq f_t(x^i)} w_t^i}$$
$$2\alpha_t = \ln{\frac{\sum_i w_t^i - \sum_{i:y^i \neq f_t(x^i)} w_t^i}{\sum_{i:y^i \neq f_t(x^i)} w_t^i}}$$
$$\alpha_t = \frac{1}{2} \ln{\left(\frac{\sum_i w_t^i }{\sum_{i:y^i \neq f_t(x^i)} w_t^i} -1\right)}$$
Denoting the weighted error $\epsilon_t = \frac{\sum_{i:y^i \neq f_t(x^i)} w_t^i}{\sum_i w_t^i}$, this becomes
$$\alpha_t = \frac{1}{2} \ln{\left(\frac{1}{\epsilon_t} -1\right)}$$
$$\alpha_t = \frac{1}{2} \ln{\left(\frac{1 - \epsilon_t}{\epsilon_t}\right)}$$
#### Bartosz Troszka
To minimize $L_t$, let's calculate its derivative and set it to zero:

$$\frac{d}{d\alpha_t}\frac{1}{N}\left[\left(\sum_{i: y^i \neq H_t(x^i)} w_t^i\right)e^{\alpha_t} + \left(\sum_{i: y^i = H_t(x^i)} w_t^i\right)e^{-\alpha_t}\right] = 0$$
$$\left(\sum_{i: y^i \neq H_t(x^i)} w_t^i\right)e^{\alpha_t} - \left(\sum_{i: y^i = H_t(x^i)} w_t^i\right)e^{-\alpha_t} = 0$$
$$e^{2\alpha_t} = \frac{\sum_{i: y^i = H_t(x^i)} w_t^i}{\sum_{i: y^i \neq H_t(x^i)} w_t^i}$$
$$2\alpha_t = \ln\left(\frac{\sum_{i: y^i = H_t(x^i)} w_t^i}{\sum_{i: y^i \neq H_t(x^i)} w_t^i}\right)$$
$$\alpha_t = \frac{1}{2}\ln\left(\frac{\sum_{i: y^i = H_t(x^i)} w_t^i}{\sum_{i: y^i \neq H_t(x^i)} w_t^i}\right)$$
$$\alpha_t = \frac{1}{2}\ln\left(\frac{\sum_{i=1}^n w_t^i - \sum_{i: y^i \neq H_t(x^i)} w_t^i}{\sum_{i: y^i \neq H_t(x^i)} w_t^i}\right)$$
$$\alpha_t = \frac{1}{2}\ln\left(\frac{\frac{1}{\sum_{i=1}^n w_t^i}}{\frac{1}{\sum_{i=1}^n w_t^i}} \cdot \frac{\sum_{i=1}^n w_t^i - \sum_{i: y^i \neq H_t(x^i)} w_t^i}{\sum_{i: y^i \neq H_t(x^i)} w_t^i}\right)$$
$$\alpha_t = \frac{1}{2}\ln\left(\frac{1 - \frac{\sum_{i: y^i \neq H_t(x^i)}w_t^i}{\sum_{i=1}^n w_t^i}}{\frac{\sum_{i: y^i \neq H_t(x^i)}w_t^i}{\sum_{i=1}^n w_t^i}}\right)$$
$$\alpha_t = \frac{1}{2}\ln\left(\frac{1-\epsilon_t}{\epsilon_t}\right)$$
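As a sanity check, the closed-form $\alpha_t$ can be compared against a numerical minimizer of $L_t$. This is a minimal sketch assuming NumPy and SciPy are available; the weights `w` and the misclassification mask `miscl` below are made up for illustration.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Made-up weights w_t^i and a mask of misclassified samples (i : y^i != f_t(x^i)).
w = np.array([0.10, 0.05, 0.20, 0.15, 0.25, 0.10, 0.15])
miscl = np.array([False, True, False, False, True, False, False])
N = len(w)

def L(alpha):
    """Per-round loss L_t as a function of alpha_t."""
    return (np.sum(w) * np.exp(-alpha)
            + np.sum(w[miscl]) * (np.exp(alpha) - np.exp(-alpha))) / N

eps = np.sum(w[miscl]) / np.sum(w)            # weighted error epsilon_t
alpha_closed = 0.5 * np.log((1 - eps) / eps)  # the derived formula
alpha_numeric = minimize_scalar(L).x          # direct numerical minimization

print(alpha_closed, alpha_numeric)            # both ~0.4236
```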
## Problem 2 (Zbigniew Drozd, Karol Kuczmarz)
#### Zbigniew Drozd
Let's consider what is returned by a linreg base learner: it is a linear function `y = ax + b`. AdaBoost will join these learners as a linear combination, and a linear combination of linear functions is still a linear function, never a polynomial of higher degree. Thus, applying AdaBoost to linreg gives no real gain in terms of building a higher-degree model.
Moreover, since linreg outputs the optimal linear function (the one with the smallest loss), running AdaBoost in this setting actually has the possibility of decreasing accuracy, since reweighting the points might hurt the model's performance.
#### Karol Kuczmarz
In boosting, the final solution is a linear combination of the base learners' results:
$$F_T =\sum_{j=1}^T \alpha_j f_j$$
where each $f_j$ is a linear function. We know that a linear combination of linear functions is again a linear function.
Example for two functions:
$$\alpha_1 (a_1x + b_1) + \alpha_2(a_2 x + b_2) = (\alpha_1a_1 + \alpha_2 a_2)x + (\alpha_1 b_1 + \alpha_2 b_2)$$
So $F_T$ is a linear function too. This doesn't make much sense, because linear regression already has an exact (optimal) solution, so we end up with essentially the same model while having to run multiple iterations.
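To make the collapse explicit, here is a minimal sketch (assuming NumPy; the weights $\alpha_j$ and coefficients $a_j, b_j$ below are made up) showing that a boosted ensemble of linear base learners is itself just one linear function.

```python
import numpy as np

# Hypothetical boosted ensemble of linear base learners f_j(x) = a_j * x + b_j.
alphas = np.array([0.5, 0.3, 0.2])   # boosting weights alpha_j
a = np.array([2.0, -1.0, 0.5])       # slopes of the base learners
b = np.array([1.0, 0.0, -2.0])       # intercepts of the base learners

def ensemble(x):
    """F_T(x) = sum_j alpha_j * (a_j * x + b_j)."""
    return np.sum(alphas * (a * x + b))

# The same ensemble collapsed into a single linear function:
a_combined = np.sum(alphas * a)
b_combined = np.sum(alphas * b)

for x in [-1.0, 0.0, 3.0]:
    assert np.isclose(ensemble(x), a_combined * x + b_combined)
```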
## Problem 3 (Michał Maras, Hubert Obrzut)
#### Hubert Obrzut
If you use AdaBoost with fully grown trees, you get a $0\%$ training error rate after the first round, since fully grown trees keep splitting nodes until the leaves are pure. The ensemble would then consist of only one tree - we should stop after the first round, which defeats the whole purpose of applying AdaBoost.
#### Michał Maras
The fully grown decision tree will split the data into pure leaves. The error rate on the training set will be 0, so there is no point in continuing AdaBoost (we can either stop or keep running; the latter will just produce several copies of exactly the same tree). AdaBoost gives no advantage whatsoever here.
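A quick illustration of this (a sketch assuming scikit-learn and a made-up toy dataset, not part of the original solution): a fully grown tree fits the training set perfectly, so the weighted error of the very first round is already $\epsilon_1 = 0$.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Toy dataset; any reasonably clean data works here.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

tree = DecisionTreeClassifier()          # max_depth=None -> fully grown
tree.fit(X, y)

w = np.full(len(y), 1.0 / len(y))        # uniform initial AdaBoost weights
miscl = tree.predict(X) != y
eps_1 = np.sum(w[miscl])                 # weighted training error of round 1
print(eps_1)                             # 0.0 -> alpha_1 is unbounded; nothing left to boost
```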
## Problem 4
## Problem 5 (Michał Opanowicz, Piotr Słowik, Aleksander Szymański)
#### Michał Opanowicz
Loss function:
$$L(f_x)=\mathbb{E}_{y|x}[e^{-yf_x}] = \sum_{y \in \{-1,1\}} e^{-yf_x}\mathbb{P}(y|x) = e^{-f_x}\mathbb{P}(y=1|x) + e^{f_x}\mathbb{P}(y=-1|x)$$
Let $a = \mathbb{P}(y=1|x), b = \mathbb{P}(y=-1|x)$
$$L(f_x) = e^{-f_x}a + e^{f_x}b$$
$$\frac{\partial L}{\partial f_x} = -e^{-f_x}a + e^{f_x}b$$
$$\frac{\partial L}{\partial f_x} = 0 $$
$$e^{-f_x}a = e^{f_x}b $$
$$e^{2f_x} = \frac{a}{b}$$
$$f_x = \frac12 \ln\left(\frac{a}{b}\right)$$
$$f_x = \frac12 \ln\left( \frac{\mathbb{P}(y=1|x)}{\mathbb{P}(y=-1|x)} \right)$$
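A quick numeric check of this minimizer (a sketch assuming NumPy/SciPy; the probabilities $a$ and $b$ below are made up):

```python
import numpy as np
from scipy.optimize import minimize_scalar

a, b = 0.7, 0.3                     # made-up values of P(y=1|x) and P(y=-1|x)

def expected_exp_loss(f):
    """E_{y|x}[exp(-y f)] = a*exp(-f) + b*exp(f)."""
    return a * np.exp(-f) + b * np.exp(f)

f_closed = 0.5 * np.log(a / b)      # the derived minimizer
f_numeric = minimize_scalar(expected_exp_loss).x

print(f_closed, f_numeric)          # both ~0.4236
```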
#### Piotr Słowik
We want to minimize the expected loss; we calculate its derivative and set it to 0.
$$ \sum_{y=\pm1} e^{-yf_x} p(y|x) = e^{f_x} p(-1|x)+e^{-f_x}p(1|x)$$ $$\frac{\partial}{\partial f_x} \left[e^{f_x} p(-1|x)+ e^{-f_x}p(1|x)\right] = e^{f_x} p(-1|x) - e^{-f_x}p(1|x)$$
Now we set the derivative to 0 and find the optimal value of $f_x$:
$$e^{f_x} p(-1|x)-e^{-f_x}p(1|x) = 0$$
Multiplying by $e^{f_x}$ we get: $$e^{2f_x} p(-1|x) - p(1|x)= 0$$ $$e^{2f_x} =\frac{p(1|x)}{p(-1|x)}$$ $$f_x = \frac{1}{2}\ln{\frac{p(1|x)}{p(-1|x)}} = \frac{1}{2}\ln{\frac{p(1|x)}{1 - p(1|x)}}$$
#### Aleksander Szymański

## Problem 6 (Wiktor Pilarczyk)



According to the Gini index, tree B is purer.
According to information gain, tree A is purer.
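The trees A and B from the problem figure are not reproduced here, but the fact that the two criteria can rank purity differently is easy to check numerically. The class distributions below are hypothetical, chosen only to show Gini and entropy disagreeing:

```python
import numpy as np

def gini(p):
    """Gini impurity 1 - sum_k p_k^2 of a class distribution p."""
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sum(p ** 2)

def entropy(p):
    """Entropy -sum_k p_k log2 p_k (in bits) of a class distribution p."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Hypothetical leaf distributions over three classes (not the actual trees A/B).
p = [0.7, 0.15, 0.15]
q = [0.5, 0.5, 0.0]

print(gini(p), gini(q))        # 0.465 < 0.500 -> Gini prefers p
print(entropy(p), entropy(q))  # 1.181 > 1.000 -> entropy prefers q
```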
## Problem 7 (Mateusz Orda)
Let's denote the output of a weak classifier as a binary sequence of length $N$, where the $i$-th bit indicates whether the $i$-th training sample was classified correctly.
For $N = 4$ and classifiers:
$1110$
$0111$
$1011$
the majority vote gives $1111$, which is better (no errors, while each individual classifier makes one).
However, for $N = 8$ and classifiers:
$00011111$
$01100111$
$10001111$
the majority vote gives $00001111$, which is worse (four errors, while each individual classifier makes only three).
So, in general, combining classifiers by majority voting can change the error rate in either direction.
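A small sketch (assuming NumPy and binary labels, so that the majority vote is correct on a sample exactly when a majority of the classifiers are correct on it) reproducing both examples:

```python
import numpy as np

def majority_vote(classifier_outputs):
    """Bitwise majority over rows of correct(1)/incorrect(0) indicators."""
    votes = np.array(classifier_outputs)
    return (votes.sum(axis=0) * 2 > votes.shape[0]).astype(int)

def errors(bits):
    return int(np.sum(np.array(bits) == 0))

good = [[1, 1, 1, 0], [0, 1, 1, 1], [1, 0, 1, 1]]
bad  = [[0, 0, 0, 1, 1, 1, 1, 1], [0, 1, 1, 0, 0, 1, 1, 1], [1, 0, 0, 0, 1, 1, 1, 1]]

print(majority_vote(good), errors(majority_vote(good)))  # [1 1 1 1], 0 errors (each voter had 1)
print(majority_vote(bad),  errors(majority_vote(bad)))   # [0 0 0 0 1 1 1 1], 4 errors (each voter had 3)
```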
## Problem 8 (Michał Syposz)

(^) The bounds hold, assuming $\sum_{t=1}^T \alpha_t = 1$ and $\alpha_t \geq 0$:
$$F(x) = \sum_{t=1}^T \alpha_t f_t(x) \le \sum_{t=1}^T \alpha_t \max_t f_t(x) = \max_t f_t(x)$$
$$F(x) = \sum_{t=1}^T \alpha_t f_t(x) \ge \sum_{t=1}^T \alpha_t \min_t f_t(x) = \min_t f_t(x)$$
(v) Both assumptions are necessary:
1. If the sum over the $\alpha$s isn't 1, the inequalities won't hold. For example, take all $f_t(x) = 1$: then $F(x) = \sum_t \alpha_t$, so if the sum is smaller than 1 the $\min$ inequality fails, and if it is bigger the $\max$ inequality fails.
2. Let's assume there is an $\alpha_i < 0$ (with the weights still summing to 1). Set $f_i = 1$ and $f_j = 0$ for $j \neq i$. We get
$$0 = \min_t f_t(x) \leq F(x) = \alpha_i f_i(x) = \alpha_i < 0,$$
which is a contradiction.
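A small numeric check of both parts (a sketch with made-up values of $f_t(x)$ and $\alpha_t$):

```python
import numpy as np

def F(alphas, f):
    """Weighted vote F(x) = sum_t alpha_t * f_t(x) for a fixed x."""
    return float(np.sum(np.asarray(alphas) * np.asarray(f)))

# Bounds hold for a convex combination (made-up values of f_t(x)):
f = [0.2, -1.5, 3.0]
alphas = [0.5, 0.3, 0.2]                     # non-negative, sum to 1
assert min(f) <= F(alphas, f) <= max(f)

# Counterexample 1: weights sum to less than 1, all f_t(x) = 1.
print(F([0.3, 0.3, 0.3], [1, 1, 1]))         # 0.9 < min_t f_t(x) = 1

# Counterexample 2: a negative weight (still summing to 1), f_i = 1, others 0.
print(F([-0.5, 0.75, 0.75], [1, 0, 0]))      # -0.5 < min_t f_t(x) = 0
```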
## Problem 9