# HW4 Answer

## Problem 1

$t$ | 1 | 2 | 3 | 4
--- | --- | --- | --- | ---
$z_i$ | 90 | 90 | 190 | 90
$f(z_i)$ | 1 | 1 | 1 | 1
$z_f$ | 10 | 10 | -90 | 10
$f(z_f)$ | 1 | 1 | 0 | 1
$z_o$ | -10 | 90 | 90 | 90
$f(z_o)$ | 0 | 1 | 1 | 1
$z$ | 3 | -2 | 4 | 0
$c'$ | 3 | 1 | 4 | 4
$y$ | 0 | 1 | 4 | 4

The rows are consistent with the memory-cell updates $c'_t = f(z_f)\,c'_{t-1} + f(z_i)\,z_t$ and $y_t = f(z_o)\,c'_t$.

## Problem 2

First, consider a toy example: a fully connected layer with $\tanh$ as the nonlinear activation, e.g.
$$
\begin{aligned}
Y & =W X+B \\
Z & =\tanh (Y) \\
{\left[\begin{array}{l}
y_1 \\
y_2
\end{array}\right] } & =\left[\begin{array}{ll}
w_1 & w_2 \\
w_3 & w_4
\end{array}\right]\left[\begin{array}{l}
x_1 \\
x_2
\end{array}\right]+\left[\begin{array}{l}
b_1 \\
b_2
\end{array}\right] \\
{\left[\begin{array}{l}
z_1 \\
z_2
\end{array}\right] } & =\left[\begin{array}{l}
\tanh \left(y_1\right) \\
\tanh \left(y_2\right)
\end{array}\right]
\end{aligned}
$$
If $L$ is the loss of the network and $\frac{\partial L}{\partial Z}$ is given (backpropagated from the following layer), then
$$
\begin{aligned}
\frac{\partial L}{\partial Y} & =\left[\begin{array}{l}
\frac{\partial L}{\partial y_1} \\
\frac{\partial L}{\partial y_2}
\end{array}\right]
=\left[\begin{array}{l}
\frac{\partial L}{\partial z_1}\left(1-z_1^2\right) \\
\frac{\partial L}{\partial z_2}\left(1-z_2^2\right)
\end{array}\right]
=\left[\begin{array}{l}
\left(1-z_1^2\right) \\
\left(1-z_2^2\right)
\end{array}\right] \odot\left[\begin{array}{c}
\frac{\partial L}{\partial z_1} \\
\frac{\partial L}{\partial z_2}
\end{array}\right] \\
& =\left[\begin{array}{cc}
\left(1-z_1^2\right) & 0 \\
0 & \left(1-z_2^2\right)
\end{array}\right]\left[\begin{array}{c}
\frac{\partial L}{\partial z_1} \\
\frac{\partial L}{\partial z_2}
\end{array}\right]
=\operatorname{diag}\left(1-Z^2\right) \frac{\partial L}{\partial Z} \\
\frac{\partial L}{\partial W} & =\frac{\partial L}{\partial Y} X^T \\
\frac{\partial L}{\partial X} & =W^T \frac{\partial L}{\partial Y}
\end{aligned}
$$
To write the recurrence explicitly, let $W_1 = W_2 = W_h$ and $U_1 = U_2 = W_i$; then
$$
\begin{aligned}
h_1 & =\tanh \left(W_1 h_0+U_1 x_1\right) \\
h_2 & =\tanh \left(W_2 h_1+U_2 x_2\right)
\end{aligned}
$$
Let $z = W_o h_2$ and $\hat{y} = \sigma(z)$, where $\sigma$ is the logistic sigmoid (so $\frac{\partial \hat{y}}{\partial z} = \hat{y}(1-\hat{y})$). By the chain rule, we can derive the results as follows:
$$
\begin{aligned}
\frac{\partial L}{\partial W_o} &= \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z} \cdot \frac{\partial z}{\partial W_o}\\
\frac{\partial L}{\partial W} & =\frac{\partial L}{\partial W_1}+\frac{\partial L}{\partial W_2}
=\operatorname{diag}\left(1-h_1^2\right) \frac{\partial L}{\partial h_1} h_0^T+\operatorname{diag}\left(1-h_2^2\right) \frac{\partial L}{\partial h_2} h_1^T \\
\frac{\partial L}{\partial U} & =\frac{\partial L}{\partial U_1}+\frac{\partial L}{\partial U_2}
=\operatorname{diag}\left(1-h_1^2\right) \frac{\partial L}{\partial h_1} x_1^T+\operatorname{diag}\left(1-h_2^2\right) \frac{\partial L}{\partial h_2} x_2^T
\end{aligned}
$$
Note that $W_i, W_h\in \mathbb{R}^{n\times n}$, $h_1, h_2\in \mathbb{R}^{n}$, and $W_o\in \mathbb{R}^{1\times n}$. Let $L(y, \hat{y}) = -y\log{\hat{y}} - (1-y)\log{(1-\hat{y})}$. Then:

1. $\frac{\partial L(y, \hat{y})}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z} = -\left[y \cdot \frac{1}{\hat{y}} + (1-y) \cdot \frac{-1}{1 - \hat{y}}\right] \cdot \hat{y}(1 - \hat{y}) = \hat{y} - y$
2. $\frac{\partial z}{\partial W_o} = h_2^T$
3. $\frac{\partial z}{\partial h_2} = W_o^T$

By (1) and (2), we get $\frac{\partial L(y, \hat{y})}{\partial W_o} = (\hat{y} - y)\, h_2^T$. Now, using (3),
$$
\begin{aligned}
\frac{\partial L}{\partial h_2} & =W_o^T \frac{\partial L}{\partial \hat{y}}\frac{\partial \hat{y}}{\partial z} = W_o^T(\hat{y} - y) \\
\frac{\partial L}{\partial h_1} & =W_h^T \operatorname{diag}\left(1-h_2^2\right) \frac{\partial L}{\partial h_2}
\end{aligned}
$$
Substituting these back, we get
$$
\frac{\partial L(y, \hat{y})}{\partial W_h} = (\hat{y} - y)\operatorname{diag}\left(1-h_2^2\right) W_o^{T} h_1^T
$$
(the term $\operatorname{diag}(1-h_1^2)\frac{\partial L}{\partial h_1}h_0^T$ drops out when the initial state $h_0 = \mathbf{0}$), and
$$
\frac{\partial L(y, \hat{y})}{\partial W_i} = (\hat{y} - y)\left[\operatorname{diag}\left(1-h_1^2\right) W_h^T \operatorname{diag}\left(1-h_2^2\right) W_o^{T} x_1^T + \operatorname{diag}\left(1-h_2^2\right) W_o^{T} x_2^T\right].
$$
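As a quick sanity check (not part of the required answer), the closed-form gradients above can be verified against finite differences. The sketch below assumes $h_0 = \mathbf{0}$, a sigmoid output, and an arbitrary size $n = 3$; all concrete values and variable names are illustrative.

```python
# Finite-difference check of the 2-step RNN gradients derived above.
# Assumptions beyond the write-up: h0 = 0, random weights/inputs, n = 3.
import numpy as np

rng = np.random.default_rng(0)
n = 3
Wh, Wi = rng.normal(size=(n, n)), rng.normal(size=(n, n))
Wo = rng.normal(size=(1, n))
x1, x2 = rng.normal(size=(n, 1)), rng.normal(size=(n, 1))
h0 = np.zeros((n, 1))
y = 1.0  # binary label

def forward():
    h1 = np.tanh(Wh @ h0 + Wi @ x1)
    h2 = np.tanh(Wh @ h1 + Wi @ x2)
    y_hat = 1.0 / (1.0 + np.exp(-(Wo @ h2).item()))  # sigmoid output
    L = -y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat)
    return L, h1, h2, y_hat

L, h1, h2, y_hat = forward()

# Closed-form gradients from the derivation above.
dz = y_hat - y                  # dL/dz (scalar)
dWo = dz * h2.T                 # (1, n)
d2 = (1 - h2**2) * (dz * Wo.T)  # diag(1 - h2^2) dL/dh2
d1 = (1 - h1**2) * (Wh.T @ d2)  # diag(1 - h1^2) dL/dh1
dWh = d1 @ h0.T + d2 @ h1.T     # first term vanishes since h0 = 0
dWi = d1 @ x1.T + d2 @ x2.T

# Central finite differences over each parameter matrix.
def num_grad(param):
    g, eps = np.zeros_like(param), 1e-6
    for idx in np.ndindex(param.shape):
        old = param[idx]
        param[idx] = old + eps; Lp = forward()[0]
        param[idx] = old - eps; Lm = forward()[0]
        param[idx] = old
        g[idx] = (Lp - Lm) / (2 * eps)
    return g

for name, analytic, param in [("Wo", dWo, Wo), ("Wh", dWh, Wh), ("Wi", dWi, Wi)]:
    print(name, np.max(np.abs(analytic - num_grad(param))))  # all ~1e-9
```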
## Problem 3

Given $\mathbf{g}_{t-1}=\left\{g_{t-1}^k\right\}_{k=1}^K$, we update $\mathbf{g}_t=\left\{g_{t-1}^k+\alpha_t f_t^k\right\}_{k=1}^K=\mathbf{g}_{t-1}+\alpha_t \mathbf{f}_t$ as follows:
$$
\begin{aligned}
\mathbf{f}_t & \in \underset{f \in \mathcal{F}}{\operatorname{argmin}} \left.\frac{\partial}{\partial \alpha} L\left(\mathbf{g}_{t-1}+\alpha \mathbf{f}\right)\right|_{\alpha=0} \\
& =\left.\underset{f \in \mathcal{F}}{\operatorname{argmin}} \frac{\partial}{\partial \alpha} \sum_{i=1}^n \exp \left(\left(\frac{1}{K-1} \sum_{k \neq \hat{y}_i} g_{t-1}^k\left(x_i\right)-g_{t-1}^{\hat{y}_i}\left(x_i\right)\right)+\alpha\left(\frac{1}{K-1} \sum_{k \neq \hat{y}_i} f^k\left(x_i\right)-f^{\hat{y}_i}\left(x_i\right)\right)\right)\right|_{\alpha=0} \\
& =\underset{f \in \mathcal{F}}{\operatorname{argmin}} \sum_{i=1}^n \exp \left(\frac{1}{K-1} \sum_{k \neq \hat{y}_i} g_{t-1}^k\left(x_i\right)-g_{t-1}^{\hat{y}_i}\left(x_i\right)\right)\left(\frac{1}{K-1} \sum_{k \neq \hat{y}_i} f^k\left(x_i\right)-f^{\hat{y}_i}\left(x_i\right)\right) \\
& =\underset{f \in \mathcal{F}}{\operatorname{argmin}} Z_t \mathbb{E}_{i \sim D_t}\left[\frac{1}{K-1} \sum_{k \neq \hat{y}_i} f^k\left(x_i\right)-f^{\hat{y}_i}\left(x_i\right)\right]
=\underset{f \in \mathcal{F}}{\operatorname{argmin}} Z_t \mathbb{E}_{i \sim D_t}\left[\frac{1}{K-1} \cdot 1\left\{f\left(x_i\right) \neq \hat{y}_i\right\}-1\left\{f\left(x_i\right)=\hat{y}_i\right\}\right] \\
& =\underset{f \in \mathcal{F}}{\operatorname{argmin}} Z_t\left(\frac{K}{K-1} \mathbb{P}_{i \sim D_t}\left[f\left(x_i\right) \neq \hat{y}_i\right]-1\right)=\underset{f \in \mathcal{F}}{\operatorname{argmin}} \mathbb{P}_{i \sim D_t}\left[f\left(x_i\right) \neq \hat{y}_i\right]
\end{aligned}
$$
$$
\begin{aligned}
\alpha_t & \in \underset{\alpha \in \mathbb{R}}{\operatorname{argmin}} L\left(\mathbf{g}_{t-1}+\alpha \mathbf{f}_t\right) \\
& =\underset{\alpha \in \mathbb{R}}{\operatorname{argmin}} \sum_{i=1}^n \exp \left(\left(\frac{1}{K-1} \sum_{k \neq \hat{y}_i} g_{t-1}^k\left(x_i\right)-g_{t-1}^{\hat{y}_i}\left(x_i\right)\right)+\alpha\left(\frac{1}{K-1} \sum_{k \neq \hat{y}_i} f_t^k\left(x_i\right)-f_t^{\hat{y}_i}\left(x_i\right)\right)\right) \\
& =\underset{\alpha \in \mathbb{R}}{\operatorname{argmin}} Z_t \mathbb{E}_{i \sim D_t}\left[e^{\alpha\left(\frac{1}{K-1} \sum_{k \neq \hat{y}_i} f_t^k\left(x_i\right)-f_t^{\hat{y}_i}\left(x_i\right)\right)}\right] \\
& =\underset{\alpha \in \mathbb{R}}{\operatorname{argmin}} Z_t \mathbb{E}_{i \sim D_t}\left[e^{\frac{\alpha}{K-1}} \cdot 1\left\{f_t\left(x_i\right) \neq \hat{y}_i\right\}+e^{-\alpha} \cdot 1\left\{f_t\left(x_i\right)=\hat{y}_i\right\}\right] \\
& =\underset{\alpha \in \mathbb{R}}{\operatorname{argmin}} Z_t\left(\epsilon_t e^{\frac{\alpha}{K-1}}+\left(1-\epsilon_t\right) e^{-\alpha}\right)=\left\{\frac{K-1}{K} \log \frac{(K-1)\left(1-\epsilon_t\right)}{\epsilon_t}\right\}
\end{aligned}
$$
where
$$
Z_t = \sum_{i=1}^n \exp \left(\frac{1}{K-1} \sum_{k \neq \hat{y}_i} g_{t-1}^k\left(x_i\right)-g_{t-1}^{\hat{y}_i}\left(x_i\right)\right),
$$
$D_t$ is the probability distribution over the sample indices $i = 1, \cdots, n$ given by
$$
D_t(i)=\frac{1}{Z_t} \exp \left(\frac{1}{K-1} \sum_{k \neq \hat{y}_i} g_{t-1}^k\left(x_i\right)-g_{t-1}^{\hat{y}_i}\left(x_i\right)\right),
$$
and $\epsilon_t=\mathbb{P}_{i \sim D_t}\left[f_t\left(x_i\right) \neq \hat{y}_i\right]$ is the error of $f_t$ on the training sample, weighted by the distribution $D_t$.
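As an illustrative check (not part of the derivation), the closed form for $\alpha_t$ can be compared against a brute-force minimization of $\epsilon_t e^{\alpha/(K-1)}+(1-\epsilon_t)e^{-\alpha}$; the values of `K` and `eps` below are hypothetical.

```python
# Numeric check: the closed-form step size should minimize
# h(alpha) = eps * exp(alpha / (K - 1)) + (1 - eps) * exp(-alpha).
import numpy as np

K, eps = 5, 0.3  # hypothetical class count and weighted error
h = lambda a: eps * np.exp(a / (K - 1)) + (1 - eps) * np.exp(-a)

alpha_closed = (K - 1) / K * np.log((K - 1) * (1 - eps) / eps)
grid = np.linspace(-5.0, 5.0, 200001)
alpha_grid = grid[np.argmin(h(grid))]

print(alpha_closed, alpha_grid)  # agree up to the grid resolution (5e-5)
```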
## Problem 4

Follow the lecture notes (https://ntueemlta2022.github.io/slides/week9/W9_GMM_EM.pdf). In the M-step, calculate
$$
\begin{aligned}
\pi_k^{(t+1)} & =\frac{1}{N} \sum_{i=1}^N \delta_{i k}^{(t)} \\
\boldsymbol{\mu}_k^{(t+1)} & =\frac{\sum_{i=1}^N \delta_{i k}^{(t)} \boldsymbol{x}_i}{\sum_{i=1}^N \delta_{i k}^{(t)}} \\
\boldsymbol{\Sigma}_k^{(t+1)} & =\frac{\sum_{i=1}^N \delta_{i k}^{(t)}\left(\boldsymbol{x}_i-\boldsymbol{\mu}_k^{(t+1)}\right)\left(\boldsymbol{x}_i-\boldsymbol{\mu}_k^{(t+1)}\right)^T}{\sum_{i=1}^N \delta_{i k}^{(t)}}
\end{aligned}
$$
explicitly to receive full credit, where the responsibilities
$$
\delta_{i k}^{(t)}=\frac{\pi_k^{(t)} \mathcal{N}\left(\boldsymbol{x}_i \mid \boldsymbol{\mu}_k^{(t)}, \boldsymbol{\Sigma}_k^{(t)}\right)}{\sum_{j=1}^K \pi_j^{(t)} \mathcal{N}\left(\boldsymbol{x}_i \mid \boldsymbol{\mu}_j^{(t)}, \boldsymbol{\Sigma}_j^{(t)}\right)}
$$
come from the E-step. The calculation process follows the HW2 answer. (A full worked calculation may be added later.)
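For reference, the sketch below implements one EM iteration with these M-step updates; it assumes the standard E-step posterior for the responsibilities $\delta_{ik}$, and all function and argument names are placeholders rather than code from the lecture notes.

```python
# One EM iteration for a Gaussian mixture, following the M-step above.
import numpy as np
from scipy.stats import multivariate_normal

def em_step(X, pi, mu, Sigma):
    """X: (N, d) data; pi: (K,) weights; mu: (K, d) means; Sigma: (K, d, d)."""
    N, K = X.shape[0], pi.shape[0]
    # E-step: responsibilities delta[i, k] proportional to pi_k * N(x_i | mu_k, Sigma_k)
    delta = np.stack(
        [pi[k] * multivariate_normal.pdf(X, mu[k], Sigma[k]) for k in range(K)],
        axis=1,
    )
    delta /= delta.sum(axis=1, keepdims=True)
    # M-step: the three update equations above
    Nk = delta.sum(axis=0)            # effective count per component
    pi_new = Nk / N
    mu_new = (delta.T @ X) / Nk[:, None]
    Sigma_new = np.empty_like(Sigma)
    for k in range(K):
        d = X - mu_new[k]             # (N, d), centered at the new mean
        Sigma_new[k] = (delta[:, k, None] * d).T @ d / Nk[k]
    return pi_new, mu_new, Sigma_new
```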