# Homework 3
## Declarations
| Name | P1 | P2 | P3 | P4 | P5 | P6 | P7 | P8 | P9 | P10 | P11 |
| -------- | -------- | -------- | -------- | -------- | -------- | -------- | -------- | -------- | -------- | -------- | -------- |
| Michał Kępa | 1.5 | 0.5 | 0.5 | 1.5 | 1 | 0.5 | 0.5 | 1 | 0 | | |
| Michał Łazowski | | | | | | | | | | | |
| Maurycy Borkowski | 1.5 | 0.5 | 0.5 | 1.5 | 1 | 0.5 | 0.5 | 1 | 0 | | |
| Andrzej Herman | 1.5 | 0.5 | 0.5 | 0 | 1 | 0.5 | 0.5 | 1 | 0 | | |
| Andrzej Kołacz | | | | | | | | | | | |
| Szymon Wieczorek | 1.5 | 0.5 | 0.5 | 1.5 | 1 | 0.5 | 0.5 | 0 | 0 | 0 | |
| Artur Skarżyński | 1.5 | 0.5 | 0.5 | 0 | 1 | 0.5 | 0.5 | 0 | 0 | | |
| Dominik Trautman | 0 | 0.5 | 0.5 | 0 | 0 | 0.5 | 0.5 | 0 | 0 | 0 | |
| Maciej Okupnik | 0 | 0.5 | 0.5 | 0 | 1 | 0.5 | 0.5 | 0 | 0 | | |
| Małgorzata Taterka | | | | | | | | | | | |
| Kacper Olejak | 0 | 0.5 | 0.5 | 0 | 1 | 0.5 | 0 | 0 | 0 | | |
| Jakub Pacierpnik | 1.5 | 0 | 0.5 | 0 | 1 | 0.5 | 0.5 | 0 | 0 | | |
| Who? | Wieczorek | Trautman | Skarżyński | Borkowski | Pacierpnik, Okupnik | Olejak | Kępa | Herman | | | |
## Problem 1 [1.5p]. Solved by: Szymon Wieczorek
Consider AdaBoost. Prove that $\alpha_t = \frac{1}{2}\log\frac{1-\epsilon_t}{\epsilon_t}$, where $\epsilon_t$ is the error rate, is the optimal coefficient for the $t$-th weak learner under the exponential loss.
### Solution:

From the lecture: 
$L = \sum_ie^{-\alpha_t}w_t^i + \sum_{i:y_i \neq f_t(x_i)}w_t^i(e^{\alpha_t}-e^{-\alpha_t}) = e^{-\alpha_t}\sum_iw_t^i + e^{\alpha_t}\sum_{i:y_i \neq f_t(x_i)}w_t^i-e^{-\alpha_t}\sum_{i:y_i \neq f_t(x_i)}w_t^i\\
\dfrac{dL}{d\alpha_t} = -e^{-\alpha_t}\sum_iw_t^i + e^{\alpha_t}\sum_{i:y_i \neq f_t(x_i)}w_t^i+e^{-\alpha_t}\sum_{i:y_i \neq f_t(x_i)}w_t^i = 0\ \big|\ \div\sum_iw_t^i\\
-e^{-\alpha_t} + e^{\alpha_t}\epsilon_t + e^{-\alpha_t}\epsilon_t = 0,\quad\text{where } \epsilon_t = \dfrac{\sum_{i:y_i\neq f_t(x_i)}w_t^i}{\sum_iw_t^i}\ \text{ is the weighted error rate of } f_t\\
e^{\alpha_t}\epsilon_t = e^{-\alpha_t}-e^{-\alpha_t}\epsilon_t\\
e^{\alpha_t}\epsilon_t = e^{-\alpha_t}(1-\epsilon_t)\ \big|\ \div\epsilon_t\\
e^{\alpha_t}=e^{-\alpha_t}\cdot\dfrac{1-\epsilon_t}{\epsilon_t}\ \big|\ \log\\
\alpha_t = \log(e^{-\alpha_t})+\log\dfrac{1-\epsilon_t}{\epsilon_t}\\
\alpha_t = -\alpha_t+\log\dfrac{1-\epsilon_t}{\epsilon_t}\\
\alpha_t = \dfrac{1}{2}\log\dfrac{1-\epsilon_t}{\epsilon_t}$
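As a quick numerical sanity check of this result (not part of the proof), the sketch below minimizes the weighted exponential loss over a grid of $\alpha_t$ and compares the minimizer with the closed form; the weights and the set of misclassified samples are made-up toy values.

```python
import numpy as np

# Made-up toy values: weights w_t^i and which samples the weak learner f_t gets wrong
w = np.array([0.2, 0.1, 0.3, 0.05, 0.15, 0.2])
wrong = np.array([True, False, False, True, False, False])  # y_i != f_t(x_i)
eps = w[wrong].sum() / w.sum()                              # weighted error rate

def exp_loss(alpha):
    # e^{-alpha} on correctly classified samples, e^{alpha} on misclassified ones
    return (w[~wrong] * np.exp(-alpha)).sum() + (w[wrong] * np.exp(alpha)).sum()

grid = np.linspace(-3.0, 3.0, 20001)
alpha_grid = grid[np.argmin([exp_loss(a) for a in grid])]
alpha_formula = 0.5 * np.log((1 - eps) / eps)
print(alpha_grid, alpha_formula)  # should agree up to the grid resolution
```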
## Problem 2 [0.5p] Solved by Dominik Trautman (Wrong)
*Question:* Would it make sense to apply boosting in a regression setting using linear regression as the base learner?
[sklearn adaboost](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostRegressor.html) (it's actually not exactly the same thing, but it can be used this way)
*Answer:* Yes, it would; in fact, this version of boosting is commonly used and has its own implementation in sklearn.
*How it behaves:* in each boosting iteration the samples are re-weighted based on the error of the current prediction, so each iteration focuses on the samples with the largest errors. The final result resembles polynomial regression.
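A minimal sketch of this setup using the AdaBoostRegressor linked above with linear regression as the base learner (the constructor argument is `estimator` in scikit-learn ≥ 1.2 and `base_estimator` in older versions); the 1-D data is made up and only illustrates the mechanics.

```python
import numpy as np
from sklearn.ensemble import AdaBoostRegressor
from sklearn.linear_model import LinearRegression

# Made-up 1-D data with a non-linear target, which a single linear model cannot fit well
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)

boosted = AdaBoostRegressor(estimator=LinearRegression(), n_estimators=50, random_state=0)
boosted.fit(X, y)

single = LinearRegression().fit(X, y)
print("single linear model R^2 :", single.score(X, y))
print("boosted linear models R^2:", boosted.score(X, y))
```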
## Problem 3 [0.5p] Solved by Artur Skarżyński
Suppose you apply AdaBoost using fully grown trees as the base learner. What would happen? What would the final ensemble look like?
So, AdaBoost works on the assumption that weak learners make *some* mistakes. If we use a fully grown tree as the weak learner, it classifies all the training data correctly, so no mistakes are made and no weights get updated (with $\epsilon_1 = 0$, the coefficient $\alpha_1$ from Problem 1 is not even well defined, and boosting effectively stops after the first round). The final ensemble does exactly the same work as a single tree.
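A small sketch illustrating this, assuming scikit-learn's usual behaviour of stopping the boosting loop early once a round reaches zero training error; the dataset is made up, and with a fully grown tree we would expect `estimators_` to contain a single tree.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Fully grown tree (max_depth=None) instead of the usual depth-1 stump
full_tree = DecisionTreeClassifier(max_depth=None, random_state=0)
ada = AdaBoostClassifier(estimator=full_tree, n_estimators=50, random_state=0)
ada.fit(X, y)

# The first tree already fits the training data perfectly, so boosting should stop there
print("trees actually fitted:", len(ada.estimators_))
print("training accuracy:", ada.score(X, y))
```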
## Problem 4 [1.5p] Solved by Maurycy Borkowski
From the lecture:
$$
L_T = \sum_{i=1}^N \left(r_T^i - \alpha_T f_T(x^i)\right)^2 = \sum_{i=1}^N \left({r_T^i}^2 - 2r_T^i\alpha_T f_T(x^i) + \left(\alpha_T f_T(x^i)\right)^2\right) \\
= \sum_{i=1}^N {r_T^i}^2 - 2 \sum_{i=1}^N r_T^i\alpha_T f_T(x^i) + \sum_{i=1}^N \left(\alpha_T f_T(x^i)\right)^2
$$
$r_T^i$ is independent of $f_T$, and $f_T(x^i) = \pm 1$, so $\left(\alpha_T f_T(x^i)\right)^2 = \alpha_T^2$, which is also independent of $f_T$;
thus:
$$
\operatorname{argmin}_{f_T}(L_T) = \operatorname{argmin}_{f_T} \left( -\sum_{i=1}^N r_T^i\alpha_T f_T(x^i) \right)
$$
The correlation is maximal when the covariance is maximal, so:
$$
\operatorname{argmax}_{f_T}(\operatorname{cov}) = \operatorname{argmax}_{f_T} \left(\frac{1}{N}\sum_{i=1}^N (F_{T-1}(x_i)-y_i)(F_{T}(x_i)-y_i)\right) \\
= \operatorname{argmax}_{f_T}\left( \sum_{i=1}^N (F_{T-1}(x_i)-y_i)\left(F_{T-1}(x_i) + \alpha_T f_T(x_i)-y_i\right) \right) \\
= \operatorname{argmax}_{f_T}\left( \sum_{i=1}^N (F_{T-1}(x_i)-y_i)\,\alpha_T f_T(x_i) \right) \\
= \operatorname{argmax}_{f_T}\left( \sum_{i=1}^N r_T^i\,\alpha_T f_T(x_i) \right) =
\operatorname{argmin}_{f_T} \left( -\sum_{i=1}^N r_T^i\alpha_T f_T(x_i) \right) = \operatorname{argmin}_{f_T}(L_T)
$$
(In the step from the second to the third line, the term $\sum_{i=1}^N (F_{T-1}(x_i)-y_i)^2$ was dropped because it does not depend on $f_T$.) Now we see that maximizing the correlation between the residuals of $F_{T-1}$ and $F_T$ is the same as minimizing the loss function $L_T$. $\square$
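A brute-force numeric sketch of this equivalence on made-up data: it enumerates all $\pm 1$ learners on a tiny sample, holds $\alpha_T$ fixed, uses the residual convention $r_T^i = F_{T-1}(x_i) - y_i$ from the substitution above, and checks that the minimizer of $L_T$ is also the maximizer of the covariance.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
N, alpha = 6, 0.7                         # toy sample size and a fixed alpha_T
y = rng.normal(size=N)
F_prev = rng.normal(size=N)               # F_{T-1}(x_i)
r = F_prev - y                            # r_T^i, as used in the substitution above

# All possible +-1 weak learners on these N points
candidates = [np.array(f) for f in product([-1.0, 1.0], repeat=N)]

L = [np.sum((r - alpha * f) ** 2) for f in candidates]             # squared loss L_T
cov = [np.mean(r * (F_prev + alpha * f - y)) for f in candidates]  # cov of old vs new residuals

print(np.argmin(L) == np.argmax(cov))     # same candidate wins -> True
```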
## Problem 5 [1p] Solved by Jakub Pacierpnik
Consider a data sample $x$ with label distribution $p(y|x)$ (thus we assume that the sample may be ambiguous). Assume the boosted ensemble returns for this sample a score $f_x$.
The expected loss on this sample is
$\mathbb{E}_{y|x}\left[e^{-yf_x}\right] = \sum_{y\in\pm 1}e^{-yf_x}p(y|x)$
Find the value of the model output $f_x$ which minimizes $\mathbb{E}_{y|x}\left[e^{-yf_x}\right]$ and express it in terms of $p(y=1|x)$ and $p(y=-1|x) = 1 - p(y=1|x)$.
### Solution:
$L=\sum_{y\in\pm 1}e^{-yf_x}p(y|x) = e^{-f_x}p(y=1|x)+e^{f_x}p(y=-1|x) =\\
=e^{-f_x}p(y=1|x)+e^{f_x}(1-p(y=1|x))$

Calculate the derivative and set it to zero (for convenience let $P = p(y=1|x)$):

$\dfrac{dL}{df_x}=-e^{-f_x}P+e^{f_x}-e^{f_x}P = 0\\
e^{f_x}=e^{-f_x}P+e^{f_x}P\ \big|\ \div e^{f_x}P\\
\dfrac{1}{P} = \dfrac{1}{e^{2f_x}}+1\\
\dfrac{1}{e^{2f_x}} = \dfrac{1}{P}-1\ \big|\ \log\\
-2f_x = \log\left(\dfrac{1}{P} - 1\right)\\
f_x = -\dfrac{1}{2}\log\left(\dfrac{1}{p(y=1|x)}-1\right) = \dfrac{1}{2}\log\dfrac{p(y=1|x)}{1-p(y=1|x)}$
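A quick numeric sanity check (with a made-up value of $p(y=1|x)$, not part of the derivation): minimize the expected exponential loss over a grid of $f_x$ and compare with the closed form.

```python
import numpy as np

P = 0.7                                    # made-up value of p(y = 1 | x)
f = np.linspace(-5.0, 5.0, 20001)
expected_loss = np.exp(-f) * P + np.exp(f) * (1 - P)

f_grid = f[np.argmin(expected_loss)]
f_formula = -0.5 * np.log(1 / P - 1)       # = 0.5 * log(P / (1 - P))
print(f_grid, f_formula)                   # should agree up to the grid resolution
```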
### Solution by Maciej Okupnik:

## Problem 6 [0.5p] Solved by: Kacper Olejak
Consider a dataset with 400 examples of class C1 and 400 of class C2. Let tree A have 2 leaves with class distributions:
| Tree A | C1 | C2 |
|--------|-----|-----|
| Leaf 1 | 100 | 300 |
| Leaf 2 | 300 | 100 |
and let tree B have 2 leaves with class distribution:
| Tree B | C1 | C2 |
|--------|-----|-----|
| Leaf 1 | 200 | 400 |
| Leaf 2 | 200 | 0 |
What is the misclassification rate for both trees? Which tree is more pure according to Gini or Infogain?
### Solution:
The misclassification rate of both trees is 1/4.
#### Tree A:
Gini:
$\dfrac{400(1 - 1/16 - 9/16) + 400(1 - 1/16 - 9/16)}{800} = 3/8 = 0.375$
Entropy:
$\dfrac{400(-1/4\log_2(1/4)-3/4\log_2(3/4)) + 400(-1/4\log_2(1/4)-3/4\log_2(3/4))}{800} = -1/4\log_2(1/4)-3/4\log_2(3/4) \approx 0.8113$
Infogain $= 1 - 0.8113 = 0.1887$
#### Tree B:
Gini:
$\dfrac{600(1 - 1/9 - 4/9) + 200(1 - 1)}{800} = \dfrac{3}{4}\cdot\dfrac{4}{9} = 1/3 \approx 0.3333$
Entropy:
$\dfrac{600(-1/3\log_2(1/3) - 2/3\log_2(2/3)) + 200(-1\log_2(1))}{800} = \dfrac{3}{4}\left(-1/3\log_2(1/3) - 2/3\log_2(2/3)\right) \approx 0.68872$
Infogain $= 1 - 0.68872 = 0.31128$
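The same numbers can be rechecked with a short script; `leaf_impurities` below is a hypothetical helper written only for this check, taking the per-leaf class counts from the tables above.

```python
import numpy as np

def leaf_impurities(counts):
    """counts: one row of class counts [C1, C2] per leaf."""
    counts = np.asarray(counts, dtype=float)
    p = counts / counts.sum(axis=1, keepdims=True)       # class proportions per leaf
    weights = counts.sum(axis=1) / counts.sum()          # fraction of samples in each leaf
    gini = 1.0 - (p ** 2).sum(axis=1)
    logp = np.where(p > 0, np.log2(np.where(p > 0, p, 1.0)), 0.0)
    entropy = -(p * logp).sum(axis=1)
    misclassification = 1.0 - p.max(axis=1)
    return {"misclassification": weights @ misclassification,
            "gini": weights @ gini,
            "entropy": weights @ entropy}

print("Tree A:", leaf_impurities([[100, 300], [300, 100]]))
print("Tree B:", leaf_impurities([[200, 400], [200, 0]]))
```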
## Problem 6 [0.5p] Solved by: Dominik Trautman
Misclassification for tree A:
$(100+100)/800 = 1/4$
Misclassification for tree B:
$200/800 = 1/4$
Gini A:
$\dfrac{400(1 - (3/4)^2 - (1/4)^2)\cdot 2}{800} = 1 - (3/4)^2 - (1/4)^2 = 0.375$
Gini B:
$\dfrac{600(1-(1/3)^2 - (2/3)^2) + 200\cdot 0}{800} = \dfrac{3}{4}\cdot\dfrac{4}{9} = 1/3 \approx 0.333$
Entropy A:
$\dfrac{400(-(1/4)*log_2(1/4) - (3/4)*log_2(3/4))*2}{800} = -(1/4)*log_2(1/4) - (3/4)*log_2(3/4) \approx 0.8112781244$
Infogain A = $1 - 0.8112781244 = 0.188721875$
Entropy B:
$\dfrac{600(-1/3*log_2(1/3) - 2/3*log_2(2/3)) + 0}{800} = 3/4*(-1/3*log_2(1/3) - 2/3*log_2(2/3)) \approx 0.6887218755408$
Infogain B = $1 - 0.6887218755408 = 0.3112781244592$
As we can see, the misclassification rate of both trees is equal, but the second tree is more pure according to both Gini and information gain.
## Problem 7 [0.5p] Solved by Michał Kępa
### Example when combined classifier is better:
| Classifier | X_1 | X_2 | X_3 |
|-----|---|---|---|
| A | 1 | 1 | 0 |
| B | 0 | 1 | 1 |
| C | 1 | 0 | 1 |
| Combined | 1 | 1 | 1 |

Each entry indicates whether the classifier predicts the sample correctly (1) or not (0); "Combined" is the majority vote. Each classifier on its own classifies 2/3 of the samples correctly, while the combined one classifies everything correctly.
### Example when combined classifier is worse:
| Classifier | X_1| X_2| X_3| X_4| X_5| X_6|
|-----|---|---|---|---|---|---|
| A | 1 | 1 | 1 | 1 | 0 |0 |
| B | 1 | 1 | 1 | 0 | 1 | 0 |
| C | 1 | 1 | 1 | 0 | 0 | 1 |
| Combined | 1 | 1 | 1 | 0 | 0 | 0 |
Each classifier on its own classifies 2/3 of the samples correctly, while the combined one classifies only half of them correctly.
## Problem 7 [0.5p] Solved by Dominik Trautman
Example where voting gives a better result (entries are the models' predicted labels):
| target| t1| t2| t3|Voting|
|-----|---|---|---|---|
| 1 | 0 | 1 | 1 | 1 |
| 0 | 1 | 0 | 0 | 0 |
| 0 | 0 | 1 | 0 | 0 |
| 1 | 1 | 1 | 1 | 1 |
| 1 | 1 | 0 | 0 | 0 |
| 1 | 1 | 1 | 1 | 1 |
| ErrRate| 0.33 | 0.33 | 0.17 | 0.17 |
Example where voting gives a worse result:
| target| t1| t2| t3|Voting|
|-----|---|---|---|---|
| 1 | 0 | 1 | 0 | 0 |
| 0 | 1 | 1 | 0 | 1 |
| 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 1 | 1 | 1 |
| 1 | 1 | 0 | 0 | 0 |
| 1 | 1 | 1 | 1 | 1 |
| ErrRate| 0.33 | 0.33 | 0.33 | 0.5 |
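Both tables can be rechecked with a short script; `err_rates` is a hypothetical helper written only for this check, taking the targets and the three models' predictions from the tables above and printing the error rates of t1, t2, t3 and of the majority vote.

```python
import numpy as np

def err_rates(target, preds):
    target = np.array(target)
    preds = np.array(preds)                                      # shape (n_models, n_samples)
    vote = (preds.sum(axis=0) > preds.shape[0] / 2).astype(int)  # majority vote per sample
    rates = [(p != target).mean() for p in preds] + [(vote != target).mean()]
    return np.round(rates, 2)

# First example (voting better than t1 and t2)
print(err_rates([1, 0, 0, 1, 1, 1], [[0, 1, 0, 1, 1, 1],   # t1
                                     [1, 0, 1, 1, 0, 1],   # t2
                                     [1, 0, 0, 1, 0, 1]])) # t3
# Second example (voting worse than every single model)
print(err_rates([1, 0, 0, 1, 1, 1], [[0, 1, 0, 1, 1, 1],   # t1
                                     [1, 1, 0, 1, 0, 1],   # t2
                                     [0, 0, 0, 1, 0, 1]])) # t3
```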
## Problem 8 [1p] Solved by Andrzej Herman
Consider an ensemble model in which we allow unequal weights for the constituent models, so that:
$$
F(x) = \sum_{t=1}^T \alpha_t f_t(x).
$$
We want to ensure that the value of $F$ is bounded by the maximal and minimal output of the individual models, that is:$$
\min_t f_t(x) \leq F(x) \leq \max_t f_t(x).
$$
Show that a necessary and sufficient condition for this constraint is that the coefficients satisfy:$$
\alpha_t \geq 0\quad\text{ and }\quad \sum_t\alpha_t = 1.
$$
**Conditions are necessary:**
If $\alpha_t \geq 0$ does not hold, we may have, for example, two models with $f_1(x) = 0$, $f_2(x) = -5$ and weights $\alpha_1 = 2$, $\alpha_2 = -1$:
$F(x) = 2 \cdot 0 + (-1) \cdot (-5) = 5$, and now $\max_t f_t(x) = 0 < 5 = F(x)$.
If $\sum_t \alpha_t = 1$ does not hold, we may have, for example, $f_1(x) = 0$, $f_2(x) = 5$ and $\alpha_1 = \alpha_2 = 2$:
$F(x) = 2 \cdot 0 + 2 \cdot 5 = 10$, and now $\max_t f_t(x) = 5 < 10 = F(x)$.
**Conditions are sufficient:**
Assume that $\alpha_t \geqslant 0$ for all $t$ and $\sum_t \alpha_t = 1$. Then
$F(x) = \sum_t \alpha_t f_t(x) \geqslant \sum_t \alpha_t \min_{s}f_s(x) = \min_{t}f_t(x)$
$F(x) = \sum_t \alpha_t f_t(x) \leqslant \sum_t \alpha_t \max_{s}f_s(x) = \max_{t}f_t(x)$
(To complete the necessity argument for $\sum_t \alpha_t = 1$ in general: take $f_t(x) = 1$ for all $t$. Then
$F(x) = \sum_t \alpha_t f_t(x) = \sum_t\alpha_t \geq \min_t f_t(x) = 1$
$F(x) = \sum_t \alpha_t f_t(x) = \sum_t\alpha_t \leq \max_t f_t(x) = 1$
and it follows that $\sum_t\alpha_t=1$.)
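A tiny numeric sketch of both directions, with made-up model outputs: convex weights keep $F(x)$ inside the range of the $f_t(x)$, while the two counterexamples above break the bound.

```python
import numpy as np

rng = np.random.default_rng(0)
f = rng.normal(size=5)           # outputs f_t(x) of five models at some fixed x

# Sufficiency: alpha_t >= 0 and sum(alpha) = 1 keep F(x) within [min f_t(x), max f_t(x)]
alpha = rng.random(5)
alpha /= alpha.sum()
F = alpha @ f
print(f.min() <= F <= f.max())   # True

# Necessity: the two counterexamples from above violate the upper bound
print(np.dot([2, -1], [0, -5]), ">", max(0, -5))  # 5 > 0   (a negative weight)
print(np.dot([2, 2], [0, 5]), ">", max(0, 5))     # 10 > 5  (weights not summing to 1)
```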