---
title: 'ML2021FALL HW3'
disqus: hackmd
---

# HW3 - Handwritten Assignment

## Convolution (1%)

As we mentioned in class, the image size may change after convolution layers. Consider a batch of image data with shape $(B, W, H, input\_channels)$. How will the shape change after the following convolution layer?
<br>
$Conv2D\ ( input\_channels,\ output\_channels,\ kernel\_size=(k_1,\ k_2),\\ \qquad \qquad stride=(s_1,\ s_2),\ padding=(p_1,\ p_2))$
<br>
To simplify the answer: the padding tuple means that we pad $p_1$ pixels on both the left and right sides, and $p_2$ pixels on the top and bottom.

## Batch Normalization (1%)

Besides ***Dropout***, we usually use ***Batch Normalization*** in training nowadays [\[ref\]](https://arxiv.org/pdf/1502.03167.pdf). The trick is popular in deep networks because it is convenient to apply during training: it preserves the distribution within hidden layers and helps avoid gradient vanishing. The algorithm can be written as below:

$Input:\ values\ of\ x\ over\ a\ mini-batch:\ B=\{x_{1..m}\};$
$Output: {y_i = BN_{\gamma, \beta}(x_i)}$
$Parameters\ to\ be\ learned: \gamma ,\ \beta$

$\mu _{B} \leftarrow \ \frac{1}{m} \sum^{m}_{i=1}x_i \qquad \qquad \ \ \ // mini-batch\ mean$
$\sigma ^2_B \leftarrow \ \frac{1}{m} \sum^{m}_{i=1}(x_i-\mu _B)^2 \quad \ \ \ \ // mini-batch\ variance$
$\hat{x_i} \leftarrow \frac{x_i-\mu_B}{\sqrt{\sigma_{B}^{2}+\epsilon}} \qquad \qquad \qquad \ \ //normalize$
$y_i \leftarrow \gamma \hat{x_i}\ +\ \beta \equiv BN_{\gamma , \beta}(x_i) \quad //scale\ and\ shift$

How are $\gamma$ and $\beta$ updated during the optimization of the loss? Derive $\frac{\partial l}{\partial \hat{x_i}}$, $\frac{\partial l}{\partial \sigma^2_B}$, $\frac{\partial l}{\partial \mu_B}$, $\frac{\partial l}{\partial x_i}$, $\frac{\partial l}{\partial \gamma}$, and $\frac{\partial l}{\partial \beta}$.

## Softmax and Cross Entropy (1%)

In classification problems, we use softmax as the activation function and cross entropy as the loss function.

Cross entropy is defined as
$L(y, \hat{y}) = -\sum_{i}y_i\log\hat{y_i}$

The cross entropy of the $t^{th}$ step is defined as
$L_t(y_t, \hat{y_t}) = -y_t\log\hat{y_t}$

$softmax(z_t) = \frac{e^{z_t}}{\sum_{i}e^{z_i}}$

Let $\hat{y_t} = softmax(z_t)$. Derive that $\frac{\partial L_t}{\partial z_t} = \hat{y_t} - y_t$.

## Adaptive learning rate based optimization (1%)

The Adam optimizer is commonly used in deep learning applications. It combines two tricks: momentum and learning rate adaptation based on the gradients. Most of the time, Adam shows good results empirically. Here we show the update schemes of AdaGrad and Adam as follows:

### AdaGrad

$w^{t} = w^{t-1} - \frac{\eta}{\sqrt{\sum_{i=0}^{t}(g^i)^2}}g^{t}$,

where $w^{i}$ and $g^{i}$ represent the model weights and the gradients at the $i$-th step; $\eta$ represents the learning rate, which is usually assigned a constant value (termed $\eta_0$), that is, $\eta = \eta_0$.

### Adam

$m^t = \beta_1 \cdot m^{t-1} + (1 - \beta_1) \cdot g^t$
$v^t = \beta_2 \cdot v^{t-1} + (1 - \beta_2) \cdot (g^t)^2$
$\hat{m^t} = \frac{m^t}{1-\beta_1^t}$
$\hat{v^t} = \frac{v^t}{1-\beta_2^t}$
$w^{t} = w^{t-1} - \frac{\eta}{\sqrt{\hat{v^t}}}\hat{m^t}$,

where $w^{i}$ and $g^{i}$ represent the model weights and the gradients at the $i$-th step; $\beta_1$ and $\beta_2$ are constants standing for the exponential decay rates of the moment estimates. Before training, we initialize $m^0 = 0$ and $v^0 = 0$.
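For reference while working on (a) and (b) below, here is a minimal sketch of one AdaGrad step and one Adam step written directly from the equations above, assuming NumPy arrays for the weights and gradients. The function names, default hyperparameters, and variable names are illustrative and not part of the assignment.

```python
import numpy as np

def adagrad_step(w, g, sq_grad_sum, eta0=0.01):
    """One AdaGrad step: w^t = w^{t-1} - eta0 / sqrt(sum_i (g^i)^2) * g^t."""
    sq_grad_sum = sq_grad_sum + g ** 2        # running sum of squared gradients
    w = w - eta0 / np.sqrt(sq_grad_sum) * g
    return w, sq_grad_sum

def adam_step(w, g, m, v, t, eta=0.001, beta1=0.9, beta2=0.999):
    """One Adam step at iteration t (t starts from 1), following the equations above."""
    m = beta1 * m + (1 - beta1) * g           # first-moment (momentum) estimate
    v = beta2 * v + (1 - beta2) * g ** 2      # second-moment estimate
    m_hat = m / (1 - beta1 ** t)              # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - eta / np.sqrt(v_hat) * m_hat
    return w, m, v
```

Note that, to match the equations as given, the sketch omits the small $\epsilon$ term that practical implementations add inside the square root for numerical stability.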
(a) Please rewrite $m^t$, $v^t$ in the form $m^t = A \sum_{i=1}^{t} B_i\cdot g^i$, $v^t = C \sum_{i=1}^{t} D_i\cdot (g^i)^2$ and derive the corresponding values of $A$, $B_i$, $C$, and $D_i$ in terms of $\beta_1$ and $\beta_2$. Note that $A$, $B_i$, $C$, and $D_i$ are all scalars. As you derive this result, we expect you to learn how $\beta_1$ and $g^i$ affect the change of weights :P

(b) Suppose we set a scheduling scheme for the Adam optimizer such that $\eta = \eta_0\cdot t^{-\frac{1}{2}}$. Then Adam approaches a version of AdaGrad (with $\eta = \eta_0$) if $\beta_1 = 0$ and $\beta_2\rightarrow 1$. Please show the detailed derivation. (PS: If you are not able to prove it, you can still give an intuitive explanation. Based on your answer, we may provide partial credit.)
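If you want to sanity-check the claim in (b) numerically, the following is a minimal sketch assuming NumPy and an arbitrary toy gradient sequence; the names and values (`eta0`, `grads`, the choice of $\beta_2$) are illustrative assumptions, not part of the assignment or the required answer.

```python
import numpy as np

# Toy check for (b): with beta1 = 0, beta2 close to 1, and eta = eta0 * t^(-1/2),
# the Adam step should approach the AdaGrad step with a constant eta = eta0.
eta0 = 0.1
beta1, beta2 = 0.0, 1 - 1e-8                    # beta2 very close to 1
grads = np.array([0.5, -1.2, 0.8, 0.3, -0.7])   # arbitrary gradient sequence

m = v = 0.0
sq_sum = 0.0
for t, g in enumerate(grads, start=1):
    # Adam with the learning-rate schedule eta = eta0 * t^(-1/2)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    adam_step = (eta0 / np.sqrt(t)) / np.sqrt(v_hat) * m_hat

    # AdaGrad with a constant learning rate eta0
    sq_sum += g ** 2
    adagrad_step = eta0 / np.sqrt(sq_sum) * g

    print(f"t={t}: adam step = {adam_step:.6f}, adagrad step = {adagrad_step:.6f}")
```

With $\beta_2$ this close to 1 and the $t^{-\frac{1}{2}}$ schedule, the two printed step sizes should agree to several decimal places, which is consistent with the limit you are asked to derive.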