# Machine Learning Homework 2 (19.11.2020)

## Timeline

* 14:15 - 14:25 -> problem assignment
* 14:25 - 15:00 -> write down your solutions
* 15:00 - 15:45 -> presentations

## Declarations

Please write down the problem numbers you wish to declare next to your name. Put "-" if you have nothing to declare:

:::warning
| |1|2|3|4|5|6|7|8|
| ---:| --- | --- | --- | --- | --- | --- | --- | --- |
|Zbigniew Drozd | | |X|X| | |A| |
|Zofia Dziedzic |A|X| |X|X| | | |
|Michał Grzymek | | | | | | | | |
|Michał Kępa | |X|X|X|A|X|X| |
|Mikołaj Korobczak | |X|-|X|X| |X| |
|Jakub Kuciński | |X|X|A|X|X|X| |
|Karol Kuczmarz | |X|X|X|X|X|A| |
|Michał Maras |A|X|X|X|X|X|X| |
|Hubert Obrzut | |A|X|X|X|X|X| |
|Michał Opanowicz | |X|X|X|X|X|X|A|
|Mateusz Orda | |X|X|A|X|X|X|X|
|Wiktor Pilarczyk | |X|X|X|X|A|X| |
|Piotr Słowik | | |-|X| | | | |
|Bartosz Stefaniak | |A| |X|X|X| | |
|Michał Syposz | |X|X|X|X|A|X| |
|Aleksander Szymański|X|X|X|A|X| | | |
|Bartosz Troszka | |X|X|X|A| | | |
:::

## Solutions

## Problem 1

**Michał Maras**

First, let's write down the probability of $X$ being in class $0$ under logistic regression:

$P(class = 0 | X) = \sigma(X \Theta + b)$

where $X$ is a row feature vector and $\Theta$ is the trained parameter vector. The decision boundary consists of those $X$ for which $P(class = 0 | X) = 0.5$. We can easily see that this boundary is linear:

$0.5 = \sigma(X \Theta + b)$, so $0 = X \Theta + b$

Now let's look more closely at the Naive Bayes classifier. We assume that the classifier is already trained. We choose the class for $X$ as follows (for equal class priors; unequal priors only add a constant term):

$Class = \arg\max_{c \in \{0,1\}} \prod_i P(x_i | class = c)$

If $X$ lies on the decision boundary, then $\prod_i P(x_i | class = 0) = \prod_i P(x_i | class = 1)$.

We can show that this boundary is again linear. With binary features, $\prod_i P(x_i | class = 0) = \prod_i p_i^{x_i} (1 - p_i)^{1 - x_i}$, so

$$\log\left(\prod_i p_i^{x_i} (1 - p_i)^{1 - x_i}\right) = \sum_i x_i \log(p_i) + (1 - x_i) \log(1 - p_i) = X \Theta + b$$

i.e. the log-likelihood of each class is linear in $X$, and the set where the two log-likelihoods are equal is a hyperplane.

In addition, we can show that logistic regression and the Naive Bayes classifier share the same functional form. In Naive Bayes we use the following formula:

$P(class = 0 | X) = \frac{P(class = 0) P(X | class = 0)}{P(X)}$

Let's look at the odds:

$Odds = \frac{P(class = 0 | X)}{P(class = 1 | X)}$

The sigmoid function is the inverse of the log-odds, so:

$P(class = 0 | X) = \sigma\left(\log\frac{P(class = 0 | X)}{P(class = 1 | X)}\right)$

$$P(class = 0 | X) = \sigma\left(\log\frac{P(class = 0)P(X | class = 0)}{P(class = 1)P(X | class = 1)}\right) = \sigma\left(\log\frac{P(X | class = 0)}{P(X | class = 1)} + \log P(class = 0) - \log P(class = 1)\right)$$

As we've already shown, $\log P(X | class = 0) - \log P(X | class = 1)$ is linear in $X$, so:

$P(class = 0 | X) = \sigma(X \Theta + b)$

which is the same functional form as in logistic regression.

**Zofia Dziedzic**

$X=(X_1,X_2,X_3,X_4,X_5)\in\{0,1\}^5$

Let us start with the Bayes classifier:

\begin{align*}
\mathbb{P}(Y=1|X)=&\frac{\mathbb{P}(X|Y=1)\mathbb{P}(Y=1)}{\mathbb{P}(X|Y=1)\mathbb{P}(Y=1) + \mathbb{P}(X|Y=0)\mathbb{P}(Y=0)}=\\
=&\frac{1}{1+\frac{\mathbb{P}(X|Y=0)\mathbb{P}(Y=0)}{\mathbb{P}(X|Y=1)\mathbb{P}(Y=1)}}=\frac{1}{1+\exp\{\ln\frac{\mathbb{P}(X|Y=0)\mathbb{P}(Y=0)}{\mathbb{P}(X|Y=1)\mathbb{P}(Y=1)}\}}.
\end{align*}

Inside the exponent in the denominator we have

$$
\ln\frac{\mathbb{P}(X|Y=0)\mathbb{P}(Y=0)}{\mathbb{P}(X|Y=1)\mathbb{P}(Y=1)} = \ln\frac{\mathbb{P}(Y=0)}{\mathbb{P}(Y=1)} +\ln\frac{\mathbb{P}(X|Y=0)}{\mathbb{P}(X|Y=1)}.
$$

The second term we can write as

$$
\ln\frac{\mathbb{P}(X|Y=0)}{\mathbb{P}(X|Y=1)}=\ln\frac{\mathbb{P}(X_1|Y=0)}{\mathbb{P}(X_1|Y=1)} +\ldots+ \ln\frac{\mathbb{P}(X_5|Y=0)}{\mathbb{P}(X_5|Y=1)}
$$

and next let us look at the first summand:

\begin{align*}
\ln\frac{\mathbb{P}(X_1|Y=0)}{\mathbb{P}(X_1|Y=1)} =&\ln\left(\frac{\mathbb{P}(X_1=0|Y=0)}{\mathbb{P}(X_1=0|Y=1)}\right)^{(1-x_1)}\left(\frac{\mathbb{P}(X_1=1|Y=0)}{\mathbb{P}(X_1=1|Y=1)}\right)^{x_1}=\\
=&(1-x_1)\ln\left(\frac{\mathbb{P}(X_1=0|Y=0)}{\mathbb{P}(X_1=0|Y=1)}\right)+x_1\ln\left(\frac{\mathbb{P}(X_1=1|Y=0)}{\mathbb{P}(X_1=1|Y=1)}\right) =\\
=&x_1\left(\ln\left(\frac{\mathbb{P}(X_1=1|Y=0)}{\mathbb{P}(X_1=1|Y=1)}\right)-\ln\left(\frac{\mathbb{P}(X_1=0|Y=0)}{\mathbb{P}(X_1=0|Y=1)}\right)\right) + \\
+&\ln\left(\frac{\mathbb{P}(X_1=0|Y=0)}{\mathbb{P}(X_1=0|Y=1)}\right)
\end{align*}

Thus we obtain the formula for $p = \mathbb{P}(Y=1|X)$ used in logistic regression, i.e.

$$
\mathbb{P}(Y=1|X)=\frac{1}{1+\exp\{w_0+w_1x_1+w_2x_2+w_3x_3+w_4x_4+w_5x_5\}},
$$

where

$$
w_i=\ln\left(\frac{\mathbb{P}(X_i=1|Y=0)}{\mathbb{P}(X_i=1|Y=1)}\right)-\ln\left(\frac{\mathbb{P}(X_i=0|Y=0)}{\mathbb{P}(X_i=0|Y=1)}\right)
$$

and

$$
w_0=\ln\frac{\mathbb{P}(Y=0)}{\mathbb{P}(Y=1)}+\ln\left(\frac{\mathbb{P}(X_1=0|Y=0)}{\mathbb{P}(X_1=0|Y=1)}\right)+\ldots+\ln\left(\frac{\mathbb{P}(X_5=0|Y=0)}{\mathbb{P}(X_5=0|Y=1)}\right)
$$
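A minimal numerical sketch of the correspondence derived above (not part of the original solutions; the prior and the per-feature probabilities below are arbitrary made-up values): for a Bernoulli Naive Bayes model, the exact posterior $\mathbb{P}(Y=1|X)$ coincides with $1/(1+\exp\{w_0 + \sum_i w_i x_i\})$ using the weights from Zofia's formulas.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
prior1 = 0.3                    # P(Y=1), arbitrary
p1 = rng.uniform(0.1, 0.9, 5)   # P(X_i=1 | Y=1), arbitrary
p0 = rng.uniform(0.1, 0.9, 5)   # P(X_i=1 | Y=0), arbitrary

# Weights w_i and bias w_0 as derived above
w = np.log(p0 / p1) - np.log((1 - p0) / (1 - p1))
w0 = np.log((1 - prior1) / prior1) + np.sum(np.log((1 - p0) / (1 - p1)))

for x in itertools.product([0, 1], repeat=5):
    x = np.array(x)
    joint1 = prior1 * np.prod(p1**x * (1 - p1)**(1 - x))         # P(Y=1, X=x)
    joint0 = (1 - prior1) * np.prod(p0**x * (1 - p0)**(1 - x))   # P(Y=0, X=x)
    posterior = joint1 / (joint0 + joint1)                       # exact P(Y=1 | x)
    logistic = 1 / (1 + np.exp(w0 + w @ x))                      # logistic-regression form
    assert np.isclose(posterior, logistic)

print("Naive Bayes posterior matches the logistic form on all 32 inputs")
```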
## Problem 2

Let $W$ be the diagonal matrix with $w^{(i)}$ on the diagonal and $0$ elsewhere.

$F(\Theta)=\frac{1}{2}\sum_i w^{(i)}(x^{(i)}\Theta-y^{(i)})^2$

$F(\Theta) = \frac{1}{2}(X\Theta-Y)^T W (X\Theta-Y) = \frac{1}{2}(\Theta^T X^T W X\Theta - Y^T W X\Theta - \Theta^T X^T W Y + Y^T W Y)$

Since $\Theta^T X^T W Y$ is a scalar, it equals its transpose $Y^T W X \Theta$, so

$F(\Theta) = \frac{1}{2}(\Theta^T X^T W X\Theta - 2 Y^T W X\Theta + Y^T W Y)$

$\frac{\partial F(\Theta)}{\partial \Theta} = \frac{1}{2}(\Theta^T(X^T W X + (X^T W X)^T) - 2Y^T W X) = \frac{1}{2}(2\Theta^T X^T W X - 2Y^T W X) = \Theta^T X^T W X - Y^T W X$

Setting the gradient to zero:

$0 = \Theta^T X^T W X - Y^T W X$

$\Theta^T X^T W X = Y^T W X$

$\Theta^T = Y^T W X (X^T W X)^{-1}$

$\Theta = (X^T W X)^{-1} X^T W Y$

**Hubert Obrzut**

![](https://i.imgur.com/66JVhK1.jpg)
![](https://i.imgur.com/r4DboTU.jpg)
![](https://i.imgur.com/597vr8e.jpg)

**LOESS method**

* Problem: for a given $x$ we want to estimate $f(x)$.
* We take into account some fraction of the nearest neighbours of the current point: we choose $k=\lfloor n \cdot x_0\rfloor$, where $n$ is the number of samples (our data) and $x_0$ is the smoothing parameter.
* We assign weights to the nearest neighbours; a typical choice is the tricube formula $w_i=C\cdot\left(1-\left(\frac{d_i}{D}\right)^3\right)^3$, where $C=\frac{32}{5}$, $d_i$ is the distance to the $i$-th point, and $D$ is the maximum of these distances.
* Finally we perform a weighted least squares regression on the selected points, using the closed-form solution derived above (see the sketch after this list).
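A small sketch (not from the original write-up) that checks the closed form $\Theta = (X^T W X)^{-1} X^T W Y$ on synthetic data; the design matrix, weights, and `true_theta` below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, 3))])  # design matrix with a bias column
true_theta = np.array([1.0, 2.0, -3.0, 0.5])
y = X @ true_theta + rng.normal(scale=0.1, size=n)

w = rng.uniform(0.1, 1.0, size=n)   # per-sample weights, e.g. LOESS tricube weights
W = np.diag(w)

# Theta = (X^T W X)^{-1} X^T W Y, solved without forming the explicit inverse
theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
print(theta)   # close to true_theta
```

Using `np.linalg.solve` instead of `np.linalg.inv` is the usual choice here: it solves the normal equations directly, which is cheaper and numerically more stable than forming the inverse.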
## Problem 3

### Mikołaj Korobczak

We number the balls from 1 to 12 and split them into three groups: $\{1,2,3,4\}, \{5,6,7,8\}, \{9,10,11,12\}$. At the beginning we put the first and second group of balls on the balance. Consider the cases:

#### Balls 1,2,3,4 and 5,6,7,8 balanced.

That means that the odd ball is in the last group. Now we weigh 6,7,8 against 9,10,11.

* 6,7,8 and 9,10,11 balanced $\to$ 12 is the odd one,
* 9,10,11 is heavier $\to$ the heavy ball is among 9,10,11 and we can determine which one it is in one more weighing (weigh two of them; if they balance, the third is the heavy one, otherwise the heavier pan gives it away),
* 9,10,11 is lighter $\to$ the light ball is among 9,10,11 and we can find it in one more weighing in the same way.

#### Balls 1,2,3,4 are heavier.

That means that either the light ball is among 5,6,7,8 or the heavy ball is among 1,2,3,4. Now let's weigh 1,2,5 against 3,6,12.

* 1,2,5 and 3,6,12 balanced $\to$ either 4 is heavy or one of 7,8 is light. We weigh 7 against 8: if they balance, 4 is the odd one; if not, the lighter of 7,8 is the answer.
* 3,6,12 is lighter $\to$ either 6 is light or one of 1,2 is heavy. We can determine which in one more weighing (weigh 1 against 2).
* 3,6,12 is heavier $\to$ either 3 is heavy or 5 is light. We can determine which in one more weighing (e.g. weigh 3 against a known-good ball).

#### Balls 5,6,7,8 are heavier.

This case is symmetrical to the previous one; we now weigh 5,6,1 against 7,2,12.

Therefore we can determine the odd ball in exactly 3 weighings.

### Piotr Słowik

First let's assign the numbers 1, 2, ..., 12 to the balls. Then we weigh [1,2,3,4] vs [5,6,7,8]. We have 3 options:

#### w([1,2,3,4]) = w([5,6,7,8])

Then we know that [9,10,11,12] contains the light or heavy ball. We weigh [1,2,3] vs [9,10,11].
If it balances, then 12 has the odd weight; we can compare it to any other ball to tell whether it is light or heavy.
Else if [9,10,11] is heavier, then it contains the heavy ball. We weigh 9 against 10: if it balances, 11 is the heavy ball, else the heavier of 9,10 is the heavy ball.
Else if [9,10,11] is lighter, then it contains the light ball; this case is analogous to the one above.

#### w([1,2,3,4]) < w([5,6,7,8])

Then either [1,2,3,4] contains the light ball or [5,6,7,8] contains the heavy ball. Now we weigh [1,2,5] vs [3,6,12].
If it balances, then either 4 is light or one of 7,8 is heavy. We weigh 7 vs 8: if it balances, 4 is the light ball, else the heavier of 7,8 is the heavy ball.
Else if [3,6,12] is heavier, then either 6 is heavy or one of 1,2 is light. We weigh 1 vs 2: if it balances, 6 is the heavy ball, else the lighter of 1,2 is the light ball.
Else if [3,6,12] is lighter, then either 3 is light or 5 is heavy. We weigh 1 vs 3: if it balances, 5 is the heavy ball, else 3 is the light ball.

#### w([1,2,3,4]) > w([5,6,7,8])

Then either [1,2,3,4] contains the heavy ball or [5,6,7,8] contains the light ball. Now we weigh [1,5,6] vs [2,7,12].
If it balances, then either 8 is light or one of 3,4 is heavy. We weigh 3 vs 4: if it balances, 8 is the light ball, else the heavier of 3,4 is the heavy ball.
Else if [1,5,6] is heavier, then either 7 is light or 1 is heavy. We weigh 1 vs 2: if it balances, 7 is the light ball, else 1 is heavy.
Else if [2,7,12] is heavier, then either 2 is heavy or one of 5,6 is light. We weigh 5 vs 6: if it balances, 2 is the heavy ball, else the lighter of 5,6 is the light ball.

Thus we need only 3 weighings to determine which ball is the odd one.

## Problem 4

*Aleksander Szymański*

Let $M = \min(X, Y)$, with $X, Y$ independent and uniform on $[0,1]$.

\begin{align}
F_M(m) &= P(M \leq m) \\
&= 1 - P(M > m) \\
&= 1 - P(X > m)P(Y > m) \\
&= 1 - (1 - P(X \leq m))(1 - P(Y \leq m)) \\
&= 1 - (1 - m)^2 \\
&= 1 - (1 - 2m + m^2) = 2m - m^2
\end{align}

Therefore $f_M(m) = \frac{d}{dm} (2m - m^2) = 2 - 2m$

$\mathbb{E}(M) = \int_0^1 f_M(m)\, m\, dm = \int_0^1 2m - 2m^2\, dm = \Big[m^2 - \frac{2}{3}m^3\Big]^1_0 = 1 - \frac{2}{3} = \frac{1}{3}$

***Mateusz Orda***

$X, Y \sim U(0, 1)$

I assume that $X$ and $Y$ are independent; otherwise we can obtain different results depending on how they depend on each other.

Let $M = \min(X, Y)$, let $F$ be the cdf of $M$ and $f$ the pdf of $M$.

We are interested in $F(k)$ for $k \in (0, 1)$; it will give us the density and the expected value at the end.

We have $F(k) = P(M \leq k) = 1 - P(M > k) = 1 - P(X > k) \cdot P(Y > k) = 1 - (1 - k)^2 = 2k - k^2$

So $f(k) = F'(k) = 2 - 2k$

Finally, $\mathbb{E}M = \int_{-\infty}^{\infty} x \cdot f(x)\, dx = \int_{0}^{1} 2x - 2x^2\, dx = \left(x^2 - \frac{2}{3}x^3\right)\Big|_0^1 = \frac{1}{3}$

***Jakub Kuciński***

Writing the minimum with indicators and using the symmetry between $X$ and $Y$:

$$E[\min(X,Y)] = E[X\cdot\mathbf{1}_{X\leq Y}] + E[Y\cdot\mathbf{1}_{Y< X}] = 2\cdot E[X\cdot\mathbf{1}_{X\leq Y}] = 2\cdot \int_0^1 \int_0^y x f(x)\, dx\, dy = 2\cdot\int_0^1 \int_0^y x\, dx\, dy = 2\cdot\int_0^1 \frac{y^2}{2}\, dy = 2\cdot\frac{1}{6} = \frac{1}{3}$$
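A quick Monte Carlo sanity check of the result above (illustration only, assuming $X$ and $Y$ independent and uniform on $[0,1]$): the sample mean of $\min(X, Y)$ should be close to $\frac{1}{3}$.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(size=1_000_000)
y = rng.uniform(size=1_000_000)

# Empirical E[min(X, Y)]; should be close to 1/3
print(np.minimum(x, y).mean())
```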
## Problem 5

***Michał Kępa***

$$f(x) = - \log \left( S(x)_j \right) = - \log \left(\frac{e^{x_j}}{\sum_i e^{x_i}} \right) = - \log \left(e^{x_j} \right) + \log \left(\sum_i e^{x_i} \right) = -x_j + \log \left(\sum_i e^{x_i} \right)$$

Derivative with respect to $x_j$:

$$-1 + \frac{\frac{\partial}{\partial x_j}\left(\sum_i e^{x_i} \right)}{\sum_i e^{x_i}} = -1 + \frac{e^{x_j}}{\sum_i e^{x_i}} = S(x)_j - 1$$

Derivative with respect to $x_k$, where $k \neq j$:

$$0 + \frac{\frac{\partial}{\partial x_k}\left(\sum_i e^{x_i}\right)}{\sum_i e^{x_i}} = \frac{e^{x_k}}{\sum_i e^{x_i}} = S(x)_k$$

**Bartosz Troszka**

$S(x)_j = \frac{e^{x_j}}{\sum_{i=1}^{n}e^{x_i}}$

By the chain rule and the quotient rule (differentiating with respect to $x_j$):

$$-\frac{\partial}{\partial x_j}\log(S(x)_{j}) = -\frac{1}{S(x)_j}\cdot\frac{\partial S(x)_j}{\partial x_j} = -\frac{\sum_{i=1}^{n}e^{x_i}}{e^{x_j}} \cdot \frac{e^{x_j}\left(\sum_{i=1}^{n}e^{x_i}\right) - e^{x_j}\, e^{x_j}}{\left(\sum_{i=1}^{n}e^{x_i}\right)^{2}} =$$

$$= -\frac{\left(\sum_{i=1}^{n}e^{x_i}\right) - e^{x_j}}{\sum_{i=1}^{n}e^{x_i}} = -\left(1 - \frac{e^{x_j}}{\sum_{i=1}^{n}e^{x_i}}\right) = S(x)_{j} - 1$$

## Problem 6

**Michał Syposz**

$h_{M}(x) = \frac{1}{M} \sum_{m=1}^M h_m (x)$

$e_M(x) = h_M(x) - y = \frac{1}{M} \sum_{m=1}^M (h_m(x) - y) = \frac{1}{M} \sum_{m=1}^M e_m(x)$

$e_M(x)^2 = \frac{1}{M^2} \sum_{m=1}^M \sum_{n=1}^M e_m(x)e_n(x)$

Assuming the individual errors have zero mean and are uncorrelated, the cross terms vanish in expectation, so

$\mathbb{E}[e_M(x)^2] = \frac{1}{M^2} \sum_{m=1}^M \mathbb{E}[e_m(x)^2]$

$\mathbb{E}[e_M(x)^2] = \frac{1}{M} E_{AV}$, where $E_{AV} = \frac{1}{M}\sum_{m=1}^M \mathbb{E}[e_m(x)^2]$ is the average error of the individual models.

**Wiktor Pilarczyk**

![](https://i.imgur.com/I0TtlEb.jpg)
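A Monte Carlo illustration of the result above (not part of the original solutions), under the idealized assumption that the individual model errors $e_m(x)$ are independent with zero mean; the Gaussian errors below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
M, trials = 10, 100_000

# Simulated per-model errors e_m(x): independent, zero mean, unit variance
e = rng.normal(size=(trials, M))

E_av = (e**2).mean()                      # average squared error of a single model
E_ensemble = (e.mean(axis=1)**2).mean()   # squared error of the averaged model

print(E_av, E_ensemble, E_av / M)         # E_ensemble is close to E_av / M
```

With correlated errors the cross terms no longer vanish, so in practice the reduction is smaller than the factor of $M$.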
## Problem 7

**Zbigniew Drozd**

When computing log-sum-exp we might run into floating point failures: for very large inputs (e.g. $e^{1000}$) the sum overflows and we end up trying to compute $\log(\infty)$, and for very small inputs the sum underflows to $0$ and we compute $\log(0)$.

We can rewrite the value that we normalize by as follows:

$y = \log \sum^N_{n=1}e^{x_n}$

$e^y = \sum^N_{n=1}e^{x_n}$

$e^y = e^o\sum^N_{n=1}e^{x_n - o}$

$y = o + \log \sum^N_{n=1}e^{x_n-o}$

where $o = \max(x_1, x_2, \dots, x_N)$.

After this transformation every term of the sum lies in $[0, 1]$ (at least one term becomes $e^0 = 1$, and all others are $e^{x_n - o}$ with $x_n - o \leq 0$). The sum itself therefore lies in $[1, N]$, where we can expect much better numerical stability.

A softmax that uses this fix is defined as follows:

```python=
import numpy as np

def logsumexp(x):
    c = x.max()
    return c + np.log(np.sum(np.exp(x - c)))

x = np.array([1000, 1000, 1000])

>>> np.exp(x - logsumexp(x))
array([0.33333333, 0.33333333, 0.33333333])
```

**Karol Kuczmarz**

In our problem we want to calculate the log-sum-exp

$$\log \sum_{j=1}^n \exp (x_j)$$

Log-sum-exp may fail numerically when some $x_j$ is large, because we have to calculate $e^{x_j}$ and the outcome might overflow to `inf`. The solution is to multiply by $1$ written as $\exp(x^\ast)\exp(-x^\ast)$.

Let $x^\ast = \max \{ x_1, \ldots, x_n \}$. Then

$$\log \sum_{j=1}^n \exp (x_j) = \log \left(\exp(x^\ast) \sum_{j=1}^n \exp(x_j - x^\ast)\right) = x^\ast + \log \sum_{j=1}^n \exp(x_j - x^\ast) $$

Now all exponents are $\leq 0$.

## Problem 8

**Michał Opanowicz**

We see ~~an enemy tank~~ a train with a serial number on it. We want to guess how many were produced.

$N$ - number of trains produced
$n$ - observed serial number

Maximum likelihood estimation won't work well, because the MLE of $N$ (assuming serial numbers are uniform) is $n$ itself, which is almost guaranteed to be significantly below the true $N$. That would mean a significant negative bias in our estimate, which might prove inconvenient in the future if we are counting tanks. MLE generally works poorly when the parameter we are estimating changes the support of the distribution, because it tends to clamp the support to be as small as possible in order to maximize the likelihood of the observed points.

So we use Bayesian reasoning. Our prior for $N$ is:

$P(N) = \frac{1}{N^{\alpha} \zeta(\alpha,1)}$

We also assume that for any given $N$, $P(n|N) = \frac 1N$ (for $n \leq N$).

$P(N|n) =\frac{P(n|N)P(N)}{P(n)} = \frac{P(N)}{Nc}$

$P(N|n) =\frac{1}{N^{\alpha+1} \zeta(\alpha,1)\, c}$

$c = \sum_{N\geq n}\frac{1}{N^{\alpha+1} \zeta(\alpha,1)}$

$\mathbb{E}N = \frac 1c\sum_{N\geq n}\frac{1}{N^{\alpha} \zeta(\alpha,1)} = \frac{1}{\sum_{N\geq n}\frac{1}{N^{\alpha+1}}}\sum_{N\geq n}\frac{1}{N^{\alpha}}$

If we observe $k$ trains, we need to change our equations a bit. The conditional probability of any single observation, $P(n_i|N) = \frac 1N$, stays the same, but a possible $N$ must satisfy $N \geq \max(n_1,n_2,...,n_k)$:

$P(N|n_1,n_2,...,n_k) =\frac{P(n_1|N)\cdots P(n_k|N)\, P(N)}{P(n_1,...,n_k)} = \frac{P(N)}{N^k c}$

$P(N|n_1,...,n_k) =\frac{1}{N^{\alpha+k} \zeta(\alpha,1)\, c}$

$\mathbb{E}N = \frac 1c\sum_{N\geq \max(n_1,...,n_k)}\frac{1}{N^{\alpha+k-1} \zeta(\alpha,1)} = \frac{1}{\sum_{N\geq \max(n_1,...,n_k)}\frac{1}{N^{\alpha+k}}}\sum_{N\geq \max(n_1,...,n_k)}\frac{1}{N^{\alpha+k-1}}$

To calculate this we can use an extremely inefficient Python program:

```python
a = 2
observed_trains = 1
max_observed_n = 200
max_train_count = 1000000

normalizer = sum([(1/i)**(a + observed_trains) for i in range(max_observed_n, max_train_count)])
value = sum([(1/i)**(a + observed_trains - 1) for i in range(max_observed_n, max_train_count)])
print(value/normalizer)
# result ~ 398.92
```

***Mateusz Orda***

Let $N$ be the total number of trains.

The MLE given our samples $\{a_i\}$ is $a = \max_i a_i$, since $P(a_i | N = m) = 0$ for $m < a$ and $P(a_i | N = m) = \frac{1}{m}$ for $m \geq a$, so the likelihood is maximized when $m = a$. This estimate generally underestimates $N$, since it assumes that we have already seen the train with the biggest number, which is unlikely unless we have seen many of them. We can use Bayes' theorem to obtain a better result.

Let's assume $a_1 \geq a_2 \geq \ldots \geq a_n$ (without loss of generality). We have:

$P(N | a_1, ..., a_n) = \frac{P(N) P(a_1, ..., a_n | N)}{P(a_1, ..., a_n)}$

Our observations are independent given $N$, so we can rewrite it as (let $c = \frac{1}{P(a_1, ..., a_n)}$)

$P(N | a_1, ..., a_n) = c\, P(N) P(a_1 | N) \cdots P(a_n | N)$

which