---
title: 'ML2019FALL HW3'
disqus: hackmd
---
## HW3 - Handwritten Assignment
### Convolution (1%)
As we mentioned in class, the image size may change after convolution layers. Consider a batch of image data with shape $(B, W, H, input\_channels)$. How will the shape change after the following convolution layer?
<br>
$Conv2D\ ( input\_channels,\ output\_channels,\ kernel\_size=(k_1,\ k_2),\\ \qquad \qquad stride=(s_1,\ s_2),\ padding=(p_1,\ p_2))$
<br>
To simplify the answer: the padding tuple means that we pad $p_1$ pixels on both the left and right sides, and $p_2$ pixels on the top and bottom.
<br>
Sol:
$(B, W', H', output\_channels)$
<br>
$W' = \lfloor\frac{W\ +\ 2*p_{1}\ -\ k_{1}}{s_{1}}+1\rfloor$
<br>
$H' = \lfloor\frac{H\ +\ 2*p_{2}\ -\ k_{2}}{s_{2}}+1\rfloor$
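As a quick sanity check of these formulas, here is a minimal sketch assuming PyTorch (note that `nn.Conv2d` expects the channels-first layout $(B, C, H, W)$, so below the first spatial dimension plays the role of $W$ and the second the role of $H$; all sizes are made-up examples):

```python
import torch
import torch.nn as nn

B, W, H, in_ch, out_ch = 4, 32, 28, 3, 16
k1, k2, s1, s2, p1, p2 = 5, 3, 2, 1, 2, 1

conv = nn.Conv2d(in_ch, out_ch, kernel_size=(k1, k2),
                 stride=(s1, s2), padding=(p1, p2))
x = torch.randn(B, in_ch, W, H)        # channels-first; spatial dims ordered (W, H)
out = conv(x)

W_out = (W + 2 * p1 - k1) // s1 + 1    # floor((W + 2*p1 - k1)/s1) + 1
H_out = (H + 2 * p2 - k2) // s2 + 1    # floor((H + 2*p2 - k2)/s2) + 1
assert out.shape == (B, out_ch, W_out, H_out)
print(out.shape)                       # torch.Size([4, 16, 16, 28])
```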
<br>
### Batch Normalization (1%)
Besides ***Dropout***, we usually use ***Batch Normalization*** in training nowadays [\[ref\]](https://arxiv.org/pdf/1502.03167.pdf). The trick is popular within deep networks because it is convenient during training: it preserves the distribution within hidden layers and mitigates vanishing gradients.
The algorithm can be written as below:
$Input:\ values\ of\ x\ over\ a\ mini-batch:\ B=\{x_{1..m}\};$
$Output: {y_i = BN_{\gamma, \beta}(x_i)}$
$Parameters\ to\ be\ learned: \gamma ,\ \beta$
$\mu _{B} \leftarrow \ \frac{1}{m} \sum^{m}_{i=1}x_i \qquad \qquad \ \ \ // mini-batch\ mean$
$\sigma ^2_B \leftarrow \ \frac{1}{m} \sum^{m}_{i=1}(x_i-\mu _B)^2 \quad \ \ \ \ // mini-batch\ variance$
$\hat{x_i} \leftarrow \frac{x_i-\mu_B}{\sqrt{\sigma_{B}^{2}+\epsilon}} \qquad \qquad \qquad \ \ //normalize$
$y_i \leftarrow \gamma \hat{x_i}\ +\ \beta \equiv BN_{\gamma , \beta}(x_i) \quad //scale\ and\ shift$
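For reference, a minimal NumPy sketch of the forward pass above (the function name `batchnorm_forward` and the default `eps` are illustrative choices, not part of the assignment):

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """x: (m, d) mini-batch; gamma, beta: (d,) learned parameters."""
    mu = x.mean(axis=0)                      # mini-batch mean
    var = x.var(axis=0)                      # mini-batch (biased) variance, divides by m
    x_hat = (x - mu) / np.sqrt(var + eps)    # normalize
    y = gamma * x_hat + beta                 # scale and shift
    cache = (x, x_hat, mu, var, gamma, eps)
    return y, cache
```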
How are $\gamma$ and $\beta$ updated during the optimization of the loss $l$?
Derive $\frac{\partial l}{\partial \hat{x_i}}$, $\frac{\partial l}{\partial \sigma^2_B}$, $\frac{\partial l}{\partial \mu_B}$, $\frac{\partial l}{\partial x_i}$, $\frac{\partial l}{\partial \gamma}$, and $\frac{\partial l}{\partial \beta}$.
<br>
Sol:
$\frac{\partial l}{\partial \hat{x_{i}}} = \frac{\partial l}{\partial y_i} \gamma$
<br>
$\frac{\partial l}{\partial \sigma _{B}^{2}} = \sum_{i=1}^{m}\frac{\partial l}{\partial \hat{x_{i}}}*(x_i - \mu_B)*\frac{-1}{2} (\sigma _{B}^{2}+\epsilon )^{-3/2}$
<br>
$\frac{\partial l}{\partial \mu_B} = (\sum_{i=1}^{m}\frac{\partial l}{\partial \hat{x_{i}}}*\frac{-1}{\sqrt{\sigma _{B}^{2}+\epsilon }}) + \frac{\partial l}{\partial \sigma _{B}^{2}} \frac{\sum_{i=1}^{m}-2(x_i - \mu_B)}{m}$
<br>
$\frac{\partial l}{\partial x_{i}} = \frac{\partial l}{\partial \hat{x_{i}}}*\frac{1}{\sqrt{\sigma _{B}^{2}+\epsilon }} + \frac{\partial l}{\partial \sigma _{B}^{2}}*\frac{2(x_i - \mu_B)}{m} + \frac{\partial l}{\partial \mu_B}*\frac{1}{m}$
<br>
$\frac{\partial l}{\partial \gamma} = \sum^{m}_{i=1} \frac{\partial l}{\partial y_i} *\hat{x_i}$
<br>
$\frac{\partial l}{\partial \beta} = \sum^{m}_{i=1} \frac{\partial l}{\partial y_i}$
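The derived gradients can be sanity-checked numerically. The sketch below (reusing `batchnorm_forward` from the block above; `batchnorm_backward` is a hypothetical helper name) implements the formulas and compares $\frac{\partial l}{\partial x_i}$ and $\frac{\partial l}{\partial \gamma}$ against PyTorch autograd on the same computation:

```python
import numpy as np
import torch

def batchnorm_backward(dy, cache):
    x, x_hat, mu, var, gamma, eps = cache
    m = x.shape[0]
    inv_std = 1.0 / np.sqrt(var + eps)

    dgamma = (dy * x_hat).sum(axis=0)                                    # dl/dgamma
    dbeta = dy.sum(axis=0)                                               # dl/dbeta
    dx_hat = dy * gamma                                                  # dl/dx_hat
    dvar = (dx_hat * (x - mu) * -0.5 * (var + eps) ** -1.5).sum(axis=0)  # dl/dvar
    dmu = (dx_hat * -inv_std).sum(axis=0) + dvar * (-2.0 * (x - mu)).sum(axis=0) / m
    dx = dx_hat * inv_std + dvar * 2.0 * (x - mu) / m + dmu / m          # dl/dx_i
    return dx, dgamma, dbeta

m, d = 8, 5
x_np = np.random.randn(m, d)
gamma_np, beta_np = np.random.randn(d), np.random.randn(d)

y, cache = batchnorm_forward(x_np, gamma_np, beta_np)
dy = np.ones_like(y)                                   # take l = sum(y), so dl/dy = 1
dx, dgamma, dbeta = batchnorm_backward(dy, cache)

# same computation in PyTorch, gradients via autograd
xt = torch.tensor(x_np, requires_grad=True)
gt = torch.tensor(gamma_np, requires_grad=True)
bt = torch.tensor(beta_np, requires_grad=True)
yt = gt * (xt - xt.mean(0)) / torch.sqrt(xt.var(0, unbiased=False) + 1e-5) + bt
yt.sum().backward()

print(np.allclose(dx, xt.grad.numpy()), np.allclose(dgamma, gt.grad.numpy()))  # True True
```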
<br>
### Softmax and Cross Entropy (1%)
In classification problems, we use softmax as the activation function and cross entropy as the loss function.
$softmax(z_t) = \frac{e^{z_t}}{\sum_{i}e^{z_i}}$
$cross\_entropy = L(y, \hat{y}) = -\sum_{i}y_i\log\hat{y_i}$
$cross\_entropy = L_t(y_t, \hat{y_t}) = -y_t\log\hat{y_t}$
$\hat{y_t} = softmax(z_t)$
Derive that $\frac{\partial L_t}{\partial z_t} = \hat{y_t} - y_t$
<br>
Sol:
In the binary case ($y_t = 1$):
<br>
$\frac{\partial L_t}{\partial z_t} = -\frac{\partial\ y_t\log\hat{y_t}}{\partial z_t} = -y_t\frac{\partial \log\hat{y_t}}{\partial z_t} = -y_t \frac{1}{\hat{y_t}} \frac{\partial \hat{y_t}}{\partial z_t} = -y_t\frac{1}{\hat{y_t}} (\hat{y_t} - \hat{y_t}^2) = y_t \hat{y_t} - y_t = \hat{y_t} - y_t \quad (since\ y_t = 1)$
<br>
and similarly for $y_t = 0$.
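As a quick numerical check (a sketch assuming a one-hot target $y$, so that the full cross-entropy gradient $\hat{y} - y$ can be compared against a finite-difference estimate):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())              # shift for numerical stability
    return e / e.sum()

z = np.random.randn(5)
y = np.zeros(5); y[2] = 1.0              # one-hot target, true class t = 2

y_hat = softmax(z)
analytic = y_hat - y                     # claimed gradient dL/dz

# finite-difference gradient of L(z) = -sum_i y_i * log(softmax(z)_i)
eps = 1e-6
numeric = np.zeros_like(z)
for j in range(len(z)):
    zp, zm = z.copy(), z.copy()
    zp[j] += eps; zm[j] -= eps
    Lp = -np.sum(y * np.log(softmax(zp)))
    Lm = -np.sum(y * np.log(softmax(zm)))
    numeric[j] = (Lp - Lm) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-5))   # True
```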