# HW4 Handwritten Assignment

## LSTM Cell

In this exercise, we will simulate the forward pass of a simple LSTM cell. Figure 1 shows a single LSTM cell, where $z$ is the cell input, $z^i, z^f, z^o$ are the control inputs of the gates, $c$ is the cell memory, and $f, g, h$ are activation functions. Given an input $x$, the cell input and the control inputs can be calculated by the following equations:

- $z = w \cdot x + b$
- $z^i = w_i \cdot x + b_i$
- $z^f = w_f \cdot x + b_f$
- $z^o = w_o \cdot x + b_o$

where $w, w_i, w_f, w_o$ are weights and $b, b_i, b_f, b_o$ are biases. The final output can be calculated by

$$y = f(z^o)\, h(c')$$

where the value stored in the cell memory is updated by

$$c' = f(z^i)\, g(z) + c\, f(z^f)$$

Note that

$$f(z) = \frac{1}{1+e^{-z}}, \quad g(z) = z, \quad h(z) = z$$

Given an input sequence $x^t$ $(t = 1, 2, 3, 4)$, please derive the output sequence $y^t$. The input sequence, the weights, and the activation functions are provided below.

![](https://i.imgur.com/qGHaIwH.png =400x150)

- Input sequence: $x^1 = [0,1,0,3]$, $x^2 = [1,0,1,-2]$, $x^3 = [1,1,1,4]$, $x^4 = [0,1,1,0]$
- The initial value in the cell memory is 0.

**Please note that your calculation process is required to receive full credit.**

![](https://i.imgur.com/hCEi9Ok.png =400x400)

- Hint: Finish a table like the following.

| $t$ | 1 | 2 | 3 | 4 |
| --- | --- | --- | --- | --- |
| $z^i$ | | | | |
| $f(z^i)$ | | | | |
| $z^f$ | | | | |
| $f(z^f)$ | | | | |
| $z^o$ | | | | |
| $f(z^o)$ | | | | |
| $z$ | | | | |
| $c'$ | | | | |
| $y$ | | | | |

## Backpropagation through time via Simple RNN (1%)

Backpropagation through time is a critical concept to know as we train a recurrent network. Here, we set up a toy binary classification problem.

![](https://i.imgur.com/jcTtrdv.png)

The Simple RNN module has two kinds of weights, $W_i$ and $W_h$, such that $h_t = \text{tanh}(W_i x_t + W_h h_{t-1})$, where $t$ represents the index of steps. The Classifier module has the weight $W_o$ such that $\hat{y} = \sigma(W_o h_2)$, where $\sigma(W_o h_2) = \frac{1}{1+\exp(-W_o h_2)}$. The initial state $h_0$ is set to be $0$. The sequential input only contains $\{x_1, x_2\}$; the label is $y$; the objective function is binary cross entropy.

Please derive $\frac{\partial L(y, \hat{y})}{\partial W_o}$, $\frac{\partial L(y, \hat{y})}{\partial W_h}$, and $\frac{\partial L(y, \hat{y})}{\partial W_i}$ in terms of $x_1$, $x_2$, $h_0$, $h_1$, $h_2$, $W_i$, $W_o$, and $W_h$.

## Multiclass AdaBoost

Let $\mathcal{X}$ be the input space and $\mathscr{F}$ be a collection of multiclass classifiers that map from $\mathcal{X}$ to $[1, K]$, where $K$ denotes the number of classes. Let $\{(x_i, \hat{y}_i)\}_{i=1}^m$ be the training data set, where $x_i \in \mathcal{X}$ and $\hat{y}_i \in [1, K]$. Given $T \in \mathbb{N}$, suppose we want to find functions

\begin{equation*}
g_{T+1}^k(x) = \sum_{t=1}^T \alpha_t f_t^k(x), \quad k \in [1, K]
\end{equation*}

where $f_t \in \mathscr{F}$ and $\alpha_t \in \mathbb{R}$ for all $t \in [1, T]$. Here, for $f \in \mathscr{F}$, we denote $f^k(x) = \mathbf{1}\{f(x) = k\}$, where $\mathbf{1}(\cdot)$ is an indicator function, as the $k$'th element in the one-hot representation of $f(x) \in [1, K]$.
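For concreteness, here is a minimal NumPy sketch of the one-hot outputs $f^k(x)$ and the additive scores $g_{T+1}^k(x)$ defined above; the scalar weak learners and weights in it are hypothetical placeholders, not part of the assignment data.

```python
import numpy as np

def one_hot_output(f, x, K):
    """One-hot representation of a weak classifier's prediction:
    the k'th entry is 1{f(x) = k}, with classes indexed 1..K."""
    e = np.zeros(K)
    e[f(x) - 1] = 1.0
    return e

def aggregated_scores(weak_learners, alphas, x, K):
    """Additive scores g^k(x) = sum_t alpha_t * f_t^k(x), for k = 1..K."""
    return sum(a * one_hot_output(f, x, K) for f, a in zip(weak_learners, alphas))

# Toy usage with two hypothetical weak learners on scalar inputs and K = 3 classes.
K = 3
weak_learners = [lambda x: 1 if x < 0 else 2,
                 lambda x: 3 if x > 1 else 2]
alphas = [0.8, 0.5]
print(aggregated_scores(weak_learners, alphas, x=1.5, K=K))  # [0.  0.8 0.5]
```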
The aggregated classifier $h: \mathcal{X} \rightarrow [1, K]$ is defined as

\begin{equation*}
x \mapsto \underset{1 \leq k \leq K}{\mbox{argmax}} ~ g_{T+1}^k(x)
\end{equation*}

Please apply gradient boosting to show how the functions $f_t$ and coefficients $\alpha_t$ are computed with the aim of minimizing the following loss function:

\begin{equation*}
L(g_{T+1}^1, \cdots, g_{T+1}^K) = \sum_{i=1}^m \exp\left(\frac{1}{K-1}\sum_{k \neq \hat{y}_i} g_{T+1}^{k}(x_i) - g_{T+1}^{\hat{y}_i}(x_i) \right)
\end{equation*}

## Expectation-Maximization for GMMs

Suppose the data are drawn from $K$ Gaussian distributions specified by $\theta = \{(\pi_k, \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)\}_{k=1}^K$, where $\pi_k \in \mathbb{R}$, $\boldsymbol{\mu}_k \in \mathbb{R}^m$, and $\boldsymbol{\Sigma}_k \in \mathbb{R}^{m \times m}$ denote the prior probability, mean, and (non-singular) covariance matrix of the $k$'th Gaussian distribution. The probability density function is

$$
p(\boldsymbol{x} ; \theta) = \sum_{k=1}^K \pi_k \mathcal{N}\left(\boldsymbol{x} ; \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k\right),
$$

where $\sum_{k=1}^K \pi_k = 1$.

Given a data set $\mathcal{X} = \{\boldsymbol{x}_1, \cdots, \boldsymbol{x}_N\}$, where $\boldsymbol{x}_1, \cdots, \boldsymbol{x}_N \in \mathbb{R}^m$, denote by $z_i \in \{1, \cdots, K\}$ the latent variable indicating which Gaussian distribution $\boldsymbol{x}_i$ is drawn from:

$$
p(\boldsymbol{x}, z=k ; \theta) = \pi_k \frac{1}{\sqrt{(2 \pi)^m\left|\boldsymbol{\Sigma}_k\right|}} \exp \left(-\frac{1}{2}\left(\boldsymbol{x}-\boldsymbol{\mu}_k\right)^T \boldsymbol{\Sigma}_k^{-1}\left(\boldsymbol{x}-\boldsymbol{\mu}_k\right)\right)
$$

Denote $\mathcal{Z} = \{z_1, \cdots, z_N\}$ as the collection of all latent variables. However, it is intractable to find the optimal $\theta^*$ of the GMM analytically. Following the lecture slides, use the Expectation-Maximization (EM) algorithm to find the optimal $\theta^*$. Consider the current estimates $\theta^{(t)} = \left\{\left(\pi_k^{(t)}, \boldsymbol{\mu}_k^{(t)}, \boldsymbol{\Sigma}_k^{(t)}\right)\right\}_{k=1}^K$:

1. Expectation Step (E-step): Find $\mathbb{P}\left[z_i = k \mid x_i ; \theta^{(t)}\right]$ and $Q(\theta \mid \theta^{(t)})$.
2. Maximization Step (M-step): Find $\pi_k^{(t+1)}$, $\boldsymbol{\mu}_k^{(t+1)}$, and $\boldsymbol{\Sigma}_k^{(t+1)}$.
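For intuition, here is a minimal NumPy/SciPy sketch of one EM iteration using the standard closed-form GMM updates (responsibilities in the E-step, re-estimated parameters in the M-step); the function name `em_step` and the toy two-component initialization are illustrative assumptions.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_step(X, pi, mu, Sigma):
    """One EM iteration for a GMM with the standard closed-form updates.
    X: (N, m) data; pi: (K,) mixing weights; mu: (K, m) means; Sigma: (K, m, m) covariances."""
    N, _ = X.shape
    K = len(pi)

    # E-step: responsibilities gamma[i, k] = P[z_i = k | x_i; theta^(t)]
    gamma = np.stack([pi[k] * multivariate_normal.pdf(X, mean=mu[k], cov=Sigma[k])
                      for k in range(K)], axis=1)
    gamma /= gamma.sum(axis=1, keepdims=True)

    # M-step: re-estimate the parameters from the responsibilities
    Nk = gamma.sum(axis=0)                      # effective number of points per component
    pi_new = Nk / N
    mu_new = (gamma.T @ X) / Nk[:, None]
    Sigma_new = np.empty_like(Sigma)
    for k in range(K):
        d = X - mu_new[k]
        Sigma_new[k] = (gamma[:, k, None] * d).T @ d / Nk[k]
    return pi_new, mu_new, Sigma_new

# Toy usage: two 2-D components with hypothetical initial parameters.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
pi0 = np.array([0.5, 0.5])
mu0 = np.array([[0.0, 0.0], [1.0, 1.0]])
Sigma0 = np.stack([np.eye(2), np.eye(2)])
pi1, mu1, Sigma1 = em_step(X, pi0, mu0, Sigma0)
```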