# HW3 Handwritten Assignment

---

## Convolution

As mentioned in class, image size may change after convolution layers. Consider a batch of image data with shape $(B, W, H, input\_channels)$. How will the shape change after the following convolution layer?

$$Conv2D(input\_channels,\ output\_channels,\ kernel\_size=(k_1, k_2),\ stride=(s_1, s_2),\ padding=(p_1, p_2))$$

For simplicity, the padding tuple means that $p_1$ pixels are padded on both the left and right sides, and $p_2$ pixels are padded on both the top and bottom sides.

## Batch Normalization

Besides ***Dropout***, we usually use ***Batch Normalization*** in training nowadays [\[ref\]](https://arxiv.org/pdf/1502.03167.pdf). The trick is popular in deep networks because it is convenient to apply during training: it preserves the distribution of activations within hidden layers and helps avoid vanishing gradients. The algorithm can be written as below:

$Input:\ values\ of\ x\ over\ a\ mini\text{-}batch:\ B=\{x_{1..m}\}$

$Output:\ y_i = BN_{\gamma, \beta}(x_i)$

$Parameters\ to\ be\ learned:\ \gamma,\ \beta$

$\mu_B \leftarrow \frac{1}{m} \sum^{m}_{i=1} x_i \qquad\qquad\qquad \text{// mini-batch mean}$

$\sigma^2_B \leftarrow \frac{1}{m} \sum^{m}_{i=1}(x_i-\mu_B)^2 \qquad\ \ \text{// mini-batch variance}$

$\hat{x}_i \leftarrow \frac{x_i-\mu_B}{\sqrt{\sigma_{B}^{2}+\epsilon}} \qquad\qquad\qquad\quad \text{// normalize}$

$y_i \leftarrow \gamma \hat{x}_i + \beta \equiv BN_{\gamma, \beta}(x_i) \qquad \text{// scale and shift}$

During training we need to backpropagate the gradient of the loss $\ell$ through this transformation, as well as compute the gradients with respect to the parameters $\gamma$ and $\beta$. To this end, please write down the closed-form expressions for $\frac{\partial \ell}{\partial x_i}$, $\frac{\partial \ell}{\partial \gamma}$, $\frac{\partial \ell}{\partial \beta}$ in terms of $x_i$, $\mu_B$, $\sigma_B^2$, ${\hat x}_i$, $y_i$ (given by the forward pass) and $\frac{\partial \ell}{\partial y_i}$ (given by the backward pass).

- Hint: You may first write down the closed-form expressions of $\frac{\partial \ell}{\partial {\hat x}_i}$, $\frac{\partial \ell}{\partial \sigma_B^2}$, $\frac{\partial \ell}{\partial \mu_B}$, and then use them to compute $\frac{\partial \ell}{\partial x_i}$, $\frac{\partial \ell}{\partial \gamma}$, $\frac{\partial \ell}{\partial \beta}$.

## Softmax Function and Cross Entropy

In multi-class classification problems, we usually use softmax as the activation function and cross entropy as the loss function. The softmax function takes an $N$-dimensional vector of real numbers and transforms it into a vector of real numbers in the range $(0,1)$ which add up to $1$.

$$
\text{Softmax}(\mathbf{z}):= S(\mathbf{z}):\left[\begin{array}{l}
z_1 \\
z_2 \\
\vdots \\
z_N
\end{array}\right] \mapsto\left[\begin{array}{c}
S_1 \\
S_2 \\
\vdots \\
S_N
\end{array}\right]
$$

$$
S_j=\frac{e^{z_j}}{\sum_{k=1}^N e^{z_k}} \quad \forall j \in 1..N
$$

The cross entropy function is defined as:

$$
L(\mathbf{y}, \mathbf{\hat{y}}) = -\sum_i y_i\log(\hat{y_i})
$$

where $\color{red}{\mathbf{y} = [y_1, \cdots, y_N]}$ is a ground-truth vector with $\color{red}{\sum_i y_i = 1}$, and $\color{red}{\mathbf{\hat{y}} = S(\mathbf{z})}$ is the prediction of the model, i.e. $\color{red}{\hat{y}_i = S(z_i)}$.

1. Calculate $\frac{\partial \color{red}{S_i}}{\partial z_j}$.
   - Hint 1: Consider two cases: $i=j$ and $i\neq j$.
   - Hint 2: You may use the Kronecker delta $\delta_{i j}=\left\{\begin{array}{lll} 1 & \text{if} & i=j \\ 0 & \text{if} & i \neq j \end{array}\right.$ in the answer.
2. Derive that $\frac{\partial L}{\partial z_i} = \hat{y_i} - y_i$. (A numerical sanity check of this identity is sketched right after this list.)
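Question 2 above states a closed form for $\frac{\partial L}{\partial z_i}$. As a sanity check (separate from the required derivation), here is a minimal NumPy sketch that compares that analytic expression against a finite-difference estimate of the gradient. The helper names `softmax` and `cross_entropy`, the one-hot example, and the step size are illustrative assumptions, not part of the assignment.

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability; this does not change the result.
    e = np.exp(z - np.max(z))
    return e / e.sum()

def cross_entropy(y, y_hat):
    # L(y, y_hat) = -sum_i y_i * log(y_hat_i)
    return -np.sum(y * np.log(y_hat))

rng = np.random.default_rng(0)
N = 5
z = rng.normal(size=N)
y = np.zeros(N)
y[2] = 1.0                              # one-hot ground-truth vector (sums to 1)

analytic = softmax(z) - y               # claimed gradient: y_hat - y

# Central finite-difference estimate of dL/dz_i
eps = 1e-6
numeric = np.zeros(N)
for i in range(N):
    z_plus, z_minus = z.copy(), z.copy()
    z_plus[i] += eps
    z_minus[i] -= eps
    numeric[i] = (cross_entropy(y, softmax(z_plus)) -
                  cross_entropy(y, softmax(z_minus))) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))
```

If the identity in question 2 holds, the printed maximum absolute difference should be close to zero (limited only by floating-point error, roughly $10^{-8}$ or smaller).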
## Constrained Mahalanobis Distance Minimization Problem

1. Let $\Sigma \in R^{m \times m}$ be a symmetric positive semi-definite matrix and let $\mu \in R^m$. Please construct a set of points $x_1,...,x_n \in R^m$ such that
   $$\frac{1}{n}\sum_{i=1}^n (x_i - \mu) (x_i - \mu)^T = \Sigma, \qquad \frac{1}{n}\sum_{i=1}^n x_i = \mu$$
   - <font color="#f00">Find the relation between the set of points and $(\mu, \Sigma)$, where $(\mu, \Sigma)$ is known.</font> (A numerical checker for a candidate construction is sketched at the end of this section.)
2. Let $1 \leq k \leq m$. Solve the following optimization problem (and justify your answer with a proof):
   $$
   \begin{aligned}
   \text{minimize} \quad & \operatorname{Trace}(\Phi^T \Sigma \Phi) \\
   \text{subject to} \quad & \Phi^T \Phi = I_k \\
   \text{variables} \quad & \Phi \in R^{m \times k}
   \end{aligned}
   $$
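As referenced in part 1, the sketch below is one way to numerically check whether a candidate set of points reproduces a target pair $(\mu, \Sigma)$. The function name `check_construction` and the example values of $\mu$ and $\Sigma$ are illustrative assumptions; the sketch does not suggest any particular construction.

```python
import numpy as np

def check_construction(points, mu, sigma, tol=1e-8):
    """Check (1/n) * sum_i x_i == mu and (1/n) * sum_i (x_i - mu)(x_i - mu)^T == Sigma."""
    X = np.asarray(points)                             # shape (n, m): one point per row
    emp_mean = X.mean(axis=0)
    centered = X - mu
    emp_cov = (centered.T @ centered) / X.shape[0]     # note: divides by n, not n - 1
    return (np.allclose(emp_mean, mu, atol=tol),
            np.allclose(emp_cov, sigma, atol=tol))

# Example target parameters (arbitrary values, only for exercising the checker).
mu = np.array([1.0, -2.0, 0.5])
sigma = np.array([[2.0, 0.3, 0.0],
                  [0.3, 1.0, 0.2],
                  [0.0, 0.2, 0.5]])

# `candidate_points` would be whatever construction you derive in part 1;
# check_construction(candidate_points, mu, sigma) should then return (True, True).
```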