# HW3 Answer
## Problem 1
The output shape is $(B, W^\prime, H^\prime, \text{output\_channels})$, where
$$
\begin{aligned}
&W^{\prime}=\left\lfloor\frac{W+2 * p_1-k_1}{s_1}+1\right\rfloor \\
&H^{\prime}=\left\lfloor\frac{H+2 * p_2-k_2}{s_2}+1\right\rfloor
\end{aligned}
$$
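As a quick sanity check, here is a minimal sketch (assuming PyTorch is available; the shapes and hyper-parameters are hypothetical) that compares the formula against `torch.nn.Conv2d`. PyTorch uses the $(B, C, H, W)$ layout, so only the axis ordering differs from the channel-last shape above:

```python
import torch

def out_dim(n, k, p, s):
    # floor((n + 2*p - k) / s) + 1, as derived above
    return (n + 2 * p - k) // s + 1

# Hypothetical values, not from the problem statement.
B, C_in, C_out = 2, 3, 8
W, H = 45, 32           # spatial dims, following the notation above
k1, p1, s1 = 5, 1, 2    # kernel / padding / stride along W
k2, p2, s2 = 3, 2, 3    # kernel / padding / stride along H

# PyTorch orders tensors as (B, C, H, W); the per-axis formula is identical.
conv = torch.nn.Conv2d(C_in, C_out, kernel_size=(k2, k1),
                       stride=(s2, s1), padding=(p2, p1))
y = conv(torch.randn(B, C_in, H, W))
assert y.shape == (B, C_out, out_dim(H, k2, p2, s2), out_dim(W, k1, p1, s1))
```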
Note that you should give some explanation to receive full credit.
## Problem 2
We use gradient descent to update $\gamma$ and $\beta$:
\begin{align*}
&\gamma \leftarrow \gamma - \eta\frac{\partial l}{\partial \gamma} \\
&\beta\leftarrow \beta - \eta\frac{\partial l}{\partial \beta}
\end{align*}
where $\eta$ is a learning rate and
\begin{align*}
&\frac{\partial l}{\partial \gamma} = \sum_{i=1}^m\left(\frac{\partial l}{\partial y_i}\frac{\partial y_i}{\partial \gamma}\right) = \sum_{i=1}^m\frac{\partial l}{\partial y_i}\hat{x_i}\\
&\frac{\partial l}{\partial \beta} = \sum_{i=1}^m\left(\frac{\partial l}{\partial y_i}\frac{\partial y_i}{\partial \beta}\right) = \sum_{i=1}^m\frac{\partial l}{\partial y_i}
\end{align*}
Note that we sum over $i = 1, \ldots, m$ because the statistics are computed over a mini-batch of size $m$, and every sample in the batch depends on $\gamma$ and $\beta$.
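For concreteness, here is a minimal numpy sketch (batch size and feature dimension are hypothetical) of these two reductions, given the upstream gradients $\frac{\partial l}{\partial y_i}$ and the normalized activations $\hat{x}_i$ from the forward pass:

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 32, 16                    # mini-batch size and feature dimension (hypothetical)
dy = rng.normal(size=(m, d))     # upstream gradients dl/dy_i, one row per sample
x_hat = rng.normal(size=(m, d))  # normalized activations \hat{x}_i from the forward pass

dgamma = np.sum(dy * x_hat, axis=0)  # sum_i dl/dy_i * \hat{x}_i
dbeta = np.sum(dy, axis=0)           # sum_i dl/dy_i
```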
Now, we derive some important intermediate terms by the chain rule:
\begin{align*}
&\frac{\partial l}{\partial \hat{x_i}} = \frac{\partial l}{\partial y_i}\frac{\partial y_i}{\partial \hat{x_i}} = \frac{\partial l}{\partial y_i}\gamma\\
&\frac{\partial l}{\partial \sigma^2_B} = \sum_{i=1}^m\frac{\partial l}{\partial \hat{x_i}}\frac{\partial \hat{x_i}}{\partial \sigma^2_B} = -\frac{1}{2}\sum_{i=1}^m\frac{\partial l}{\partial \hat{x_i}}(x_i-\mu_B)(\sigma_B^2+\epsilon)^{-\frac{3}{2}}
\end{align*}
\begin{align*}
\frac{\partial l}{\partial \mu_B} &= \sum_{i=1}^m\frac{\partial l}{\partial \hat{x_i}}\frac{\partial \hat{x_i}}{\partial \mu_B}\\
&= \sum_{i=1}^m \frac{\partial l}{\partial \hat{x_i}} \frac{\partial}{\partial \mu_B}\left[ (x_i-\mu_B)(\sigma^2_B+\epsilon)^{-\frac{1}{2}}\right]\\
&= \sum_{i=1}^{m}\frac{\partial l}{\partial \hat{x_i}}\left[ \frac{\partial (x_i - \mu_B)}{\partial \mu_B}(\sigma^2_B + \epsilon)^{-\frac{1}{2}} + (x_i - \mu_B)\frac{\partial (\sigma^2_B + \epsilon)^{-\frac{1}{2}}}{\partial \mu_B} \right]\\
&= \sum_{i=1}^{m}\frac{\partial l}{\partial \hat{x_i}}\left[\frac{-1}{\sqrt{\sigma^2_B + \epsilon}} + (x_i - \mu_B)\frac{\partial (\sigma^2_B + \epsilon)^{-\frac{1}{2}}}{\partial \mu_B} \right]
\end{align*}
We know $\sigma^2_B = \frac{1}{m} \sum^{m}_{j=1}(x_j-\mu_B)^2$, so the second term in the bracket is
\begin{align*}
(x_i - \mu_B)\frac{\partial (\sigma^2_B + \epsilon)^{-\frac{1}{2}}}{\partial \mu_B} &= (x_i-\mu_B)\frac{-1}{2}(\sigma^2_B + \epsilon)^{-\frac{3}{2}}\frac{\partial (\sigma^2_B + \epsilon)}{\partial\mu_B}\\
&=\frac{-1}{2}(x_i - \mu_B)(\sigma^2_B + \epsilon)^{-\frac{3}{2}}\frac{\partial\left(\frac{1}{m} \sum^{m}_{j=1}(x_j-\mu_B)^2 + \epsilon\right)}{\partial\mu_B}\\
&= \frac{-1}{2}(x_i - \mu_B)(\sigma^2_B + \epsilon)^{-\frac{3}{2}}\left(\frac{-2}{m}\sum_{j=1}^m(x_j-\mu_B)\right)
\end{align*}
Hence,
\begin{align*}
\frac{\partial l}{\partial \mu_B} &= \sum_{i=1}^{m}\frac{\partial l}{\partial \hat{x_i}}\frac{-1}{\sqrt{\sigma^2_B + \epsilon}} + \underbrace{\sum_{i=1}^{m}\frac{\partial l}{\partial \hat{x_i}}\frac{-1}{2}(x_i - \mu_B)(\sigma^2_B + \epsilon)^{-\frac{3}{2}}}_{\frac{\partial l}{\partial \sigma^2_B}}\left(\frac{-2}{m}\sum_{j=1}^m(x_j-\mu_B)\right)\\
&= \sum_{i=1}^{m}\frac{\partial l}{\partial \hat{x_i}}\frac{-1}{\sqrt{\sigma^2_B + \epsilon}} + \frac{\partial l}{\partial \sigma^2_B}\left(\frac{-2}{m}\sum_{j=1}^m(x_j-\mu_B)\right)
\end{align*}
To derive $\frac{\partial l}{\partial x_i}$, we use the chain rule $\frac{\partial l}{\partial x_i} = \frac{\partial l}{\partial \hat{x_i}}\frac{\partial \hat{x_i}}{\partial x_i} + \frac{\partial l}{\partial \sigma^2_B}\frac{\partial \sigma^2_B}{\partial x_i} + \frac{\partial l}{\partial \mu_B}\frac{\partial \mu_B}{\partial x_i}$. Now we compute the remaining terms:
\begin{align*}
&\frac{\partial \hat{x_i}}{\partial x_i} = \frac{1}{\sqrt{\sigma^2_B + \epsilon}}\\
&\frac{\partial \mu_B}{\partial x_i} = \frac{1}{m}\\
&\frac{\partial \sigma^2_B}{\partial x_i} = \frac{2(x_i - \mu_B)}{m}
\end{align*}
That is,
\begin{align*}
\frac{\partial l}{\partial x_i} &= \frac{\partial l}{\partial \hat{x_i}}\frac{1}{\sqrt{\sigma^2_B + \epsilon}} + \frac{\partial l}{\partial \sigma^2_B}\frac{2(x_i - \mu_B)}{m} + \frac{1}{m}\frac{\partial l}{\partial \mu_B}
\end{align*}
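The closed form can be sanity-checked numerically. Below is a minimal numpy sketch (hypothetical shapes, with an arbitrary linear loss $l(y)=\sum_{i} w_i^\top y_i$ so that $\frac{\partial l}{\partial y_i}=w_i$) that implements $\frac{\partial l}{\partial x_i}$ exactly as derived and compares it with a centered finite-difference gradient:

```python
import numpy as np

def bn_forward(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=0)
    var = x.var(axis=0)                       # biased variance (1/m), as above
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

def bn_backward_x(dy, x, gamma, eps=1e-5):
    # dl/dx_i assembled from the three chain-rule paths derived above
    m = x.shape[0]
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    inv_std = 1.0 / np.sqrt(var + eps)
    dx_hat = dy * gamma                                            # dl/dx_hat_i
    dvar = -0.5 * np.sum(dx_hat * (x - mu), axis=0) * inv_std**3   # dl/dsigma_B^2
    dmu = -np.sum(dx_hat, axis=0) * inv_std + dvar * (-2.0 / m) * np.sum(x - mu, axis=0)
    return dx_hat * inv_std + dvar * 2.0 * (x - mu) / m + dmu / m

rng = np.random.default_rng(0)
m, d, h = 8, 4, 1e-5
x = rng.normal(size=(m, d))
gamma, beta = rng.normal(size=d), rng.normal(size=d)
w = rng.normal(size=(m, d))       # weights of the linear loss l(y) = sum(w * y)
dx = bn_backward_x(w, x, gamma)   # for this loss, dl/dy_i is simply w

# Centered finite-difference gradient of l with respect to each entry of x.
num = np.zeros_like(x)
for idx in np.ndindex(*x.shape):
    xp, xm = x.copy(), x.copy()
    xp[idx] += h
    xm[idx] -= h
    num[idx] = (np.sum(w * bn_forward(xp, gamma, beta))
                - np.sum(w * bn_forward(xm, gamma, beta))) / (2 * h)

assert np.allclose(dx, num, atol=1e-6)
```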
## Problem 3
Note that we use the Kronecker delta $\delta_{ij}$ ($\delta_{ij}=1$ if $i=j$ and $0$ otherwise) in our answer.
(1)
$$
\begin{aligned}
\frac{\partial S_i}{\partial z_j} = \frac{\partial \frac{e^{z_i}}{\sum_{k=1}^N e^{z_k}}}{\partial z_j} &= \frac{\partial e^{z_i}}{\partial z_j}(\sum_{k=1}^N e^{z_k})^{-1} + e^{z_i}\frac{\partial}{\partial z_j}(\sum_{k=1}^N e^{z_k})^{-1}\\
&= e^{z_i}\delta_{ij}(\sum_{k=1}^N e^{z_k})^{-1} - e^{z_i}(\sum_{k=1}^N e^{z_k})^{-2}e^{z_j}\\
&= S_i\delta_{ij} - S_iS_j
\end{aligned}
$$
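As a sanity check, a small numpy sketch (with hypothetical logits) that compares the analytic Jacobian $S_i\delta_{ij} - S_iS_j$ with a finite-difference Jacobian:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())          # shift by max for numerical stability
    return e / e.sum()

z = np.array([0.5, -1.2, 2.0, 0.3])  # hypothetical logits
S = softmax(z)

# Analytic Jacobian from above: dS_i/dz_j = S_i * delta_ij - S_i * S_j
J = np.diag(S) - np.outer(S, S)

# Finite-difference Jacobian, column by column.
h = 1e-6
J_num = np.zeros((z.size, z.size))
for j in range(z.size):
    zp, zm = z.copy(), z.copy()
    zp[j] += h
    zm[j] -= h
    J_num[:, j] = (softmax(zp) - softmax(zm)) / (2 * h)

assert np.allclose(J, J_num, atol=1e-8)
```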
(2)
$$
\begin{aligned}
\frac{\partial L}{\partial z_i} &= -\sum_j y_j\frac{\partial}{\partial z_i}\log{\hat{y}_j}\\
&= -\sum_j y_j \frac{1}{\hat{y}_j}\frac{\partial \hat{y}_j}{\partial z_i}\\
&= -\sum_{j=i} \frac{y_j}{\hat{y}_j}\frac{\partial \hat{y}_j}{\partial z_i} - \sum_{j\neq i} \frac{y_j}{\hat{y}_j}\frac{\partial \hat{y}_j}{\partial z_i}\\
&= -\frac{y_i}{\hat{y}_i} (\hat{y_i} - \hat{y_i}\hat{y_i}) - \sum_{j\neq i}\frac{y_j}{\hat{y}_j}(-\hat{y_i} \hat{y_j})\\
&= -y_i + y_i\hat{y_i} +\sum_{j\neq i} y_j\hat{y}_i\\
&= -y_i + \hat{y_i} \sum_j y_j = \hat{y_i} - y_i
\end{aligned}
$$
where the last equality uses $\sum_j y_j = 1$.
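Likewise, a quick numpy check (hypothetical logits and a one-hot target) that $\hat{y}-y$ matches the finite-difference gradient of the cross-entropy loss:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def cross_entropy(z, y):
    # L = -sum_j y_j * log(softmax(z)_j)
    return -np.sum(y * np.log(softmax(z)))

z = np.array([0.5, -1.2, 2.0, 0.3])   # hypothetical logits
y = np.array([0.0, 0.0, 1.0, 0.0])    # one-hot target

grad = softmax(z) - y                 # dL/dz_i = yhat_i - y_i, as derived above

# Finite-difference check of dL/dz.
h = 1e-6
num = np.array([(cross_entropy(z + h * e, y) - cross_entropy(z - h * e, y)) / (2 * h)
                for e in np.eye(z.size)])
assert np.allclose(grad, num, atol=1e-8)
```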
## Problem 4
(1) WLOG (without loss of generality), assume $\mu = 0$. Since $\Sigma$ is a symmetric positive semi-definite matrix, we can perform an eigendecomposition as follows:
$$
\Sigma = UDU^T = \sum_{i=1}^m (d_iu_iu_i^T),$$
where $U$ is an orthogonal matrix whose columns $u_1, \cdots, u_m$ are orthonormal eigenvectors and $D = \operatorname{diag}(d_1, \cdots, d_m)$ with all $d_i\geq 0$.
Let $n=2m$ and construct a set of points $x_1, ..., x_m, ..., x_{2m}$
where $x_i = \sqrt{m d_i}\,u_i \quad$ and $\quad x_{m+i} = -\sqrt{m d_i}\,u_i \ \forall\ 1\leq i \leq m.$ Then,
$$
\begin{aligned}
\frac{1}{n}\sum_{i=1}^{n}x_i &= 0 = \mu \\
\frac{1}{n}\sum_{i=1}^{n}x_ix_i^T &= \frac{1}{2m}\cdot 2\sum_{i=1}^{m}(m\,d_iu_iu_i^T) = \sum_{i=1}^{m}(d_iu_iu_i^T) = UDU^T = \Sigma
\end{aligned}
$$
Note that many students only used the eigendecomposition of the covariance to construct $\{x_i\}_{i=1}^n$. However, the construction must also satisfy the condition $\frac{1}{n}\sum_{i=1}^n x_i = \mu$.
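A minimal numpy sketch (with a randomly generated $\Sigma$ and $\mu$, and the mean shift added back so that the general case is covered) verifying both conditions for the construction above:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 5
A = rng.normal(size=(m, m))
Sigma = A @ A.T                  # hypothetical symmetric PSD covariance
mu = rng.normal(size=m)          # hypothetical mean

d, U = np.linalg.eigh(Sigma)     # Sigma = U diag(d) U^T
d = np.clip(d, 0.0, None)        # guard against tiny negative round-off

# n = 2m points: mu +/- sqrt(m * d_i) * u_i
X = np.concatenate([mu + np.sqrt(m * d)[:, None] * U.T,
                    mu - np.sqrt(m * d)[:, None] * U.T])
n = X.shape[0]

assert np.allclose(X.mean(axis=0), mu)
assert np.allclose((X - mu).T @ (X - mu) / n, Sigma)   # 1/n-normalized covariance
```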
(2)
Let $\phi_1, \cdots, \phi_k$ be the columns of $\Phi$, which are orthonormal (i.e. $\Phi^T\Phi = I_k$). Then
$$
Trace(\Phi^T\Sigma\Phi) = \sum_{i=1}^k \phi_i^T\Sigma \phi_i = \sum_{i=1}^k \phi_i^T(\sum_{j=1}^m (d_ju_ju_j^T) )\phi_i = \sum_{j=1}^m d_j\sum_{i=1}^k\langle u_j, \phi_i\rangle^2 = \sum_{j=1}^m c_jd_j
$$
where $\langle\cdot, \cdot\rangle$ is the standard inner product on Euclidean space, $c_j:= \sum_{i=1}^k\langle u_j, \phi_i\rangle^2$ for each $j=1, \cdots, m$, and the eigenvalues are ordered so that $d_1\geq d_2\geq \cdots\geq d_m\geq 0$.
Claim: $0\leq c_j\leq 1$ and $\sum_{j=1}^m c_j = k$.
Clearly, $c_j\geq 0$. Extend $\phi_1, \cdots, \phi_k$ to an orthonormal basis $\phi_1, \cdots, \phi_k, \phi_{k+1}, \cdots, \phi_m$ of $\mathbb{R}^m$. Then, for each $j=1, \cdots, m$,
$$
c_j = \sum_{i=1}^k \langle u_j, \phi_i\rangle^2 \leq \sum_{i=1}^m \langle u_j, \phi_i\rangle^2 = 1
$$
Finally, since $u_1, \cdots, u_m$ is an orthonormal basis for $\mathbb{R}^m$,
\begin{align*}
\sum_{j=1}^{m} c_{j}=\sum_{j=1}^{m} \sum_{i=1}^{k}\left\langle{u}_{j}, {\phi}_{i}\right\rangle^{2}=\sum_{i=1}^{k} \sum_{j=1}^{m}\left\langle{u}_{j}, {\phi}_{i}\right\rangle^{2}=\sum_{i=1}^{k}\left\|{\phi}_{i}\right\|_{2}^{2}=k
\end{align*}
Hence, the minimum value of $\sum_{j=1}^m c_jd_j$ over all choices of $c_1, c_2, \cdots, c_m\in[0,1]$ with $\sum_{j=1}^m c_j=k$ is $d_{m-k+1}+ \cdots+ d_{m}$, the sum of the $k$ smallest eigenvalues. This is achieved when $c_1 = \cdots = c_{m-k} = 0$ and $c_{m-k+1} = \cdots = c_m = 1$, i.e. when the columns of $\Phi$ span the eigenvectors $u_{m-k+1}, \cdots, u_m$.
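Finally, a small numpy sketch (hypothetical $\Sigma$; it is assumed, as the proof does implicitly, that $\Phi$ has orthonormal columns) checking that the bottom-$k$ eigenvectors attain the minimum trace:

```python
import numpy as np

rng = np.random.default_rng(0)
m, k = 6, 2
A = rng.normal(size=(m, m))
Sigma = A @ A.T                          # hypothetical symmetric PSD matrix

d, U = np.linalg.eigh(Sigma)             # eigenvalues in ascending order here
best = d[:k].sum()                       # sum of the k smallest eigenvalues
Phi_opt = U[:, :k]                       # corresponding eigenvectors as columns
assert np.isclose(np.trace(Phi_opt.T @ Sigma @ Phi_opt), best)

# No Phi with orthonormal columns should attain a smaller trace.
for _ in range(1000):
    Q, _ = np.linalg.qr(rng.normal(size=(m, k)))   # random orthonormal columns
    assert np.trace(Q.T @ Sigma @ Q) >= best - 1e-9
```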
**In the proof, every symbol must be defined; try not to use the lecture notation directly, e.g. $\hat{x}^{\text{PCA}}$.**