# HW4 Handwritten Assignment
## LSTM Cell
In this exercise, we will simulate the forward pass of a simple LSTM cell. Figure 1 shows a single LSTM cell, where $z$ is the cell input, $z^i, z^f, z^o$ are the control inputs of the gates, $c$ is the cell memory, and $f, g, h$ are activation functions. Given an input $x$, the cell input and the control inputs can be calculated by the following equations:
- $z = w \cdot x + b$
- $z^i = w_i \cdot x + b_i$
- $z^f = w_f \cdot x + b_f$
- $z^o = w_o \cdot x + b_o$
where $w, w_i, w_f, w_o$ are weights and $b, b_i, b_f, b_o$ are biases. The final output can be calculated by
$y = f(z^o)\ h(c′)$
where the value stored in cell memory is updated by
$c′=f(z^i)g(z)+cf(z^f)$
Note that $f(z) = \frac{1}{1+e^{-z}}$, $g(z) = z$, and $h(z) = z$.
Given an input sequence $x^t\ (t = 1, 2, 3, 4)$, please derive the output sequence $y^t$. The input sequence, the weights, and the activation functions are provided below.
![](https://i.imgur.com/qGHaIwH.png =400x150)
- Input sequence: $x^1 = [0,1,0,3], x^2=[1,0,1,-2], x^3=[1,1,1,4], x^4=[0,1,1,0]$
The initial value in cell memory is 0. **Please note that your calculation process is required to receive full credit.**
![](https://i.imgur.com/hCEi9Ok.png =400x400)
- Hint: finish the table below.

t | 1 | 2 | 3 | 4 |
--- | --- | --- | --- | --- |
$z^i$ | | | | |
$f(z^i)$ | | | | |
$z^f$ | | | | |
$f(z^f)$ | | | | |
$z^o$ | | | | |
$f(z^o)$ | | | | |
$z$ | | | | |
$c′$ | | | | |
$y$ | | | | |
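The recursion above can be sketched in NumPy. The weight vectors and biases below are placeholders (the actual values are given in the figure), so take only the structure of the computation from this sketch, not the numbers:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Placeholder weights and biases -- substitute the values from the figure
# before filling in the table.
w,  b  = np.array([1.0, 0.0, 0.0, 0.0]), 0.0   # cell input z
wi, bi = np.array([0.0, 1.0, 0.0, 0.0]), 0.0   # input gate z^i
wf, bf = np.array([0.0, 0.0, 1.0, 0.0]), 0.0   # forget gate z^f
wo, bo = np.array([0.0, 0.0, 0.0, 1.0]), 0.0   # output gate z^o

def lstm_forward(xs):
    c, ys = 0.0, []                     # cell memory starts at 0
    for x in xs:
        z  = w  @ x + b
        zi = wi @ x + bi
        zf = wf @ x + bf
        zo = wo @ x + bo
        # c' = f(z^i) g(z) + c f(z^f), with g(z) = z
        c = sigmoid(zi) * z + c * sigmoid(zf)
        # y = f(z^o) h(c'), with h(c') = c'
        ys.append(sigmoid(zo) * c)
    return ys

xs = [np.array(v, dtype=float) for v in
      ([0, 1, 0, 3], [1, 0, 1, -2], [1, 1, 1, 4], [0, 1, 1, 0])]
print(lstm_forward(xs))
```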
## Backpropagation through time via Simple RNN (1%)
Backpropagation through time is a critical concept to know as we train a recurrent network. Here, we set up a toy binary classification problem.
![](https://i.imgur.com/jcTtrdv.png)
The Simple RNN module has two kinds of weights, $W_i$ and $W_h$, such that $h_t = \text{tanh}(W_ix_t + W_hh_{t-1})$, where $t$ represents the index of steps. The Classifier module has the weight $W_o$ such that $\hat{y} = \sigma(W_o h_2)$, where $\sigma(W_oh_2) = \frac{1}{1+\text{exp}(-W_o h_2)}$. The initial state $h_0$ is set to be $0$. The sequential input only contains $\{x_1, x_2\}$; the label is $y$; the objective function is binary cross entropy. Please derive $\frac{\partial L(y, \hat{y})}{\partial W_o}$, $\frac{\partial L(y, \hat{y})}{\partial W_h}$, $\frac{\partial L(y, \hat{y})}{\partial W_i}$ in terms of $x_1$, $x_2$, $h_0$, $h_1$, $h_2$, $W_i$, $W_o$, and $W_h$.
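As a sanity check on the derivation, the chain rule through the two steps can be written out in code. Treating all weights, inputs, and states as scalars is an assumption made for this sketch (it matches the shapes in the toy problem), and the gradients are verified against finite differences:

```python
import math

def forward(Wi, Wh, Wo, x1, x2, y, h0=0.0):
    h1 = math.tanh(Wi * x1 + Wh * h0)
    h2 = math.tanh(Wi * x2 + Wh * h1)
    yhat = 1.0 / (1.0 + math.exp(-Wo * h2))          # sigmoid classifier
    loss = -(y * math.log(yhat) + (1 - y) * math.log(1 - yhat))  # BCE
    return h1, h2, yhat, loss

def grads(Wi, Wh, Wo, x1, x2, y, h0=0.0):
    h1, h2, yhat, _ = forward(Wi, Wh, Wo, x1, x2, y, h0)
    d = yhat - y                        # dL/d(Wo h2) for BCE + sigmoid
    dWo = d * h2
    dpre2 = d * Wo * (1 - h2 ** 2)      # through tanh at step 2
    dpre1 = dpre2 * Wh * (1 - h1 ** 2)  # through tanh at step 1
    dWh = dpre2 * h1 + dpre1 * h0       # W_h is shared across both steps
    dWi = dpre2 * x2 + dpre1 * x1       # W_i is shared across both steps
    return dWo, dWh, dWi
```

Note how the gradients for the shared weights $W_i$ and $W_h$ accumulate one term per time step, which is exactly the structure the closed-form answer should have.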
## Multiclass AdaBoost
Let $\mathcal{X}$ be the input space, $\mathscr F$ be a collection of multiclass classifiers that map from $\mathcal{X}$ to $[ 1,K]$, where $K$ denotes the number of classes. Let $\{(x_i,{\hat y}_i)\}_{i=1}^m$ be the training data set, where $x_i \in \mathcal{X}$ and ${\hat y}_i \in [1,K]$. Given $T \in \mathbb{N}$, suppose we want to find functions
\begin{equation*}
g_{T+1}^k(x) = \sum_{t=1}^T \alpha_t f_t^k(x), ~~ k \in [ 1,K ]
\end{equation*}
where $f_t \in \mathscr F$ and $\alpha_t \in \mathbb{R}$ for all $t \in [1, T]$. Here for $f \in \mathscr F$, we denote $f^k(x) = \mathbf{1}\{f(x) = k\}$, where $\mathbf{1}(\cdot)$ is an indicator function, as the $k$'th element in the one-hot representation of $f(x) \in [ 1,K ]$. The aggregated classifier $h: \mathcal{X} \rightarrow [ 1,K ]$ is defined as
\begin{equation*}
x \mapsto \underset{1 \leq k \leq K}{\mbox{argmax}} ~ g_{T+1}^k(x)
\end{equation*}
Please apply gradient boosting to show how the functions $f_t$ and coefficients $\alpha_t$ are computed with an aim to minimize the following loss function
\begin{equation*}
L(g_{T+1}^1, \cdots, g_{T+1}^K) = \sum_{i=1}^m \exp\left(\frac{1}{K-1}\sum_{k \neq {\hat y}_i} g_{T+1}^{k}(x_i) - g_{T+1}^{{\hat y}_i}(x_i) \right)
\end{equation*}
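For reference while working through the derivation, the loss itself is straightforward to evaluate given a score matrix. A minimal sketch, using 0-based labels $\{0, \dots, K-1\}$ in code versus the 1-based $[1, K]$ of the problem statement:

```python
import numpy as np

def multiclass_exp_loss(G, y, K):
    """Multiclass exponential loss.

    G[i, k] holds the aggregated score g^k(x_i); y[i] is the true label
    of x_i (0-based here, unlike the 1-based labels in the statement).
    """
    total = 0.0
    for i in range(G.shape[0]):
        wrong = G[i].sum() - G[i, y[i]]      # sum of scores over k != y_i
        total += np.exp(wrong / (K - 1) - G[i, y[i]])
    return total
```

With all scores zero (the situation before the first boosting round), each example contributes $\exp(0) = 1$, so the loss equals the number of training examples $m$.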
## Expectation-Maximization for GMMs
Suppose the data are drawn from $K$ Gaussian distributions specified by $\theta=\{(\pi_k, \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)\}_{k=1}^K$, where $\pi_k\in\mathbb{R}, \boldsymbol{\mu}_k\in\mathbb{R}^m, \boldsymbol{\Sigma}_k\in\mathbb{R}^{m\times m}$ denote the prior probability, mean, and (non-singular) covariance matrix of the $k$'th Gaussian distribution.
The probability density function is
$$
p(\boldsymbol{x} ; \theta)=\sum_{k=1}^K \pi_k \mathcal{N}\left(\boldsymbol{x} ; \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k\right)
$$
where $\sum_{k=1}^K \pi_k = 1$
Given a data set $\mathcal{X} = \{\boldsymbol{x}_1, \cdots, \boldsymbol{x}_N\}$, where $\boldsymbol{x}_1, \cdots, \boldsymbol{x}_N\in \mathbb{R}^m$, denote by $z_i\in \{1, \cdots, K\}$ the latent variable indicating which Gaussian distribution $\boldsymbol{x}_i$ is drawn from:
$$
p(\boldsymbol{x}, z=k ; \theta)=\pi_k \frac{1}{\sqrt{(2 \pi)^m\left|\boldsymbol{\Sigma}_k\right|}} \exp \left(-\frac{1}{2}\left(\boldsymbol{x}-\boldsymbol{\mu}_k\right)^T \boldsymbol{\Sigma}_k^{-1}\left(\boldsymbol{x}-\boldsymbol{\mu}_k\right)\right)
$$
Denote $\mathcal{Z} = \{z_1, \cdots, z_N\}$ as the collection of all latent variables. However, for a GMM it is intractable to find the optimal $\theta^*$ analytically. Following the lecture slides, use the Expectation-Maximization (EM) algorithm to find the optimal $\theta^*$.
Consider current estimators $\theta^{(t)}=\left\{\left(\pi_k^{(t)}, \boldsymbol{\mu}_k^{(t)}, \boldsymbol{\Sigma}_k^{(t)}\right)\right\}_{k=1}^K$
1. Expectation Step (E-step): Find $\mathbb{P}\left[z_i=k \mid x_i ; \theta^{(t)}\right]$ and $Q(\theta|\theta^{(t)})$
2. Maximization Step (M-step): Find $\pi_k^{(t+1)}, \boldsymbol{\mu}_k^{(t+1)}$, and $\boldsymbol{\Sigma}_k^{(t+1)}$
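One full iteration can be sketched in NumPy, assuming the standard closed-form updates that the derivation should arrive at (responsibilities in the E-step, weighted moments in the M-step); use this only to check your derived formulas numerically:

```python
import numpy as np

def gaussian_pdf(X, mu, Sigma):
    """Multivariate normal density N(x; mu, Sigma), evaluated row-wise on X."""
    m = X.shape[1]
    diff = X - mu
    quad = np.einsum('ni,ij,nj->n', diff, np.linalg.inv(Sigma), diff)
    norm = np.sqrt((2 * np.pi) ** m * np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm

def em_step(X, pi, mu, Sigma):
    N, K = X.shape[0], len(pi)
    # E-step: responsibilities gamma[i, k] = P[z_i = k | x_i; theta^(t)]
    gamma = np.stack([pi[k] * gaussian_pdf(X, mu[k], Sigma[k])
                      for k in range(K)], axis=1)
    gamma /= gamma.sum(axis=1, keepdims=True)
    # M-step: closed-form updates for pi, mu, Sigma
    Nk = gamma.sum(axis=0)                     # effective counts per component
    pi_new = Nk / N
    mu_new = (gamma.T @ X) / Nk[:, None]       # responsibility-weighted means
    Sigma_new = []
    for k in range(K):
        diff = X - mu_new[k]
        Sigma_new.append((gamma[:, k, None] * diff).T @ diff / Nk[k])
    return pi_new, mu_new, np.array(Sigma_new)
```

Note that the updated $\pi_k^{(t+1)}$ always sum to 1, since the responsibilities for each $\boldsymbol{x}_i$ sum to 1 over $k$.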