## 1. Uni-gram language model and Maximum Likelihood Estimation (1%)

Language modeling is a classic application of machine learning. Among language models, the uni-gram model is a classical one: it ignores the correlation between words and treats every word as drawn independently. For a document $d$ consisting of $N$ word types $w_1, \ldots, w_N$, each word $w_i$ has count $c(w_i)$, so the document length is $|d| = \sum_{i=1}^N c(w_i)$. Under the uni-gram model, the probability of a sampled document $d$ is modeled as a multinomial distribution:

$$P(d \mid \theta_1, \ldots, \theta_N) = \binom{|d|}{c(w_1), \ldots, c(w_N)} \prod_{i=1}^N \theta_i^{c(w_i)}$$

where $\theta_i$ is the probability of $w_i$, that is, $\theta_i = P(w_i)$. In general, the true value of $P(w_i)$ is unknown, so we estimate it from the sampled document $d$ under the constraint $\sum_{i=1}^N \theta_i = 1$, since each $\theta_i$ is a probability. Maximum Likelihood Estimation (MLE) estimates $\theta_1, \ldots, \theta_N$ as $\operatorname{argmax}_{\theta_1, \ldots, \theta_N} \{P(d \mid \theta_1, \ldots, \theta_N)\}$.

Please show the detailed derivation of $\theta_i = \frac{c(w_i)}{|d|}$ from MLE. (Hint: you might need to use a Lagrange multiplier. That is, if the original objective whose critical points we need to find is $L(\theta_1, \ldots, \theta_N)$, you should set the new objective to the Lagrangian function $L(\theta_1, \ldots, \theta_N) + \lambda \cdot \left(\sum_{i=1}^N \theta_i - 1\right)$.)

## 2. LSTM Cell (1%)

In this exercise, we will simulate the forward pass of a simple LSTM cell. Figure 1 shows a single LSTM cell, where $z$ is the cell input, $z_i, z_f, z_o$ are the control inputs of the gates, $c$ is the cell memory, and $f, g, h$ are activation functions. Given an input $x$, the cell input and the control inputs are calculated by the following equations:

- $z = w \cdot x + b$
- $z_i = w_i \cdot x + b_i$
- $z_f = w_f \cdot x + b_f$
- $z_o = w_o \cdot x + b_o$

where $w, w_i, w_f, w_o$ are weights and $b, b_i, b_f, b_o$ are biases. The final output is calculated by

$$y = f(z_o)\, h(c')$$

where the value stored in cell memory is updated by

$$c' = f(z_i)\, g(z) + c\, f(z_f)$$

Given an input sequence $x^t$ $(t = 1, 2, \ldots, 8)$, please derive the output sequence $y^t$. The input sequence, the weights, and the activation functions are provided below. The initial value in cell memory is 0. **Please note that your calculation process is required to receive full credit.**

![](https://i.imgur.com/EbKUKt7.png)

## 3. Backpropagation through time via Simple RNN (1%)

Backpropagation through time is a critical concept when training a recurrent network. Here, we set up a toy binary classification problem.

![](https://i.imgur.com/jcTtrdv.png)

The Simple RNN module has two kinds of weights, $W_i$ and $W_h$, such that $h_t = \text{tanh}(W_i x_t + W_h h_{t-1})$, where $t$ is the step index. The Classifier module has the weight $W_o$ such that $\hat{y} = \sigma(W_o h_2)$, where $\sigma(W_o h_2) = \frac{1}{1 + \text{exp}(-W_o h_2)}$. The initial state $h_0$ is set to $0$. The sequential input only contains $\{x_1, x_2\}$; the label $y$ is a scalar; the objective function is binary cross entropy.

Please derive $\frac{\partial L(y, \hat{y})}{\partial W_i}$, $\frac{\partial L(y, \hat{y})}{\partial W_o}$, and $\frac{\partial L(y, \hat{y})}{\partial W_h}$ in terms of $y$, $x_1$, $x_2$, $h_1$, $h_2$, $W_i$, $W_o$, and $W_h$.
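As an optional aside (not part of the graded derivation), a finite-difference check is a quick way to validate a hand-derived BPTT gradient. The sketch below assumes scalar weights and scalar inputs, since the problem does not fix dimensions, and uses made-up example values; the helper names `loss` and `finite_diff` are illustrative, not from the assignment.

```python
import numpy as np

def loss(W_i, W_h, W_o, x1, x2, y):
    """Forward pass of the toy RNN + classifier, followed by binary cross entropy."""
    h0 = 0.0
    h1 = np.tanh(W_i * x1 + W_h * h0)
    h2 = np.tanh(W_i * x2 + W_h * h1)
    y_hat = 1.0 / (1.0 + np.exp(-W_o * h2))
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def finite_diff(f, params, idx, eps=1e-6):
    """Central-difference estimate of the derivative of f w.r.t. params[idx]."""
    p_plus, p_minus = list(params), list(params)
    p_plus[idx] += eps
    p_minus[idx] -= eps
    return (f(*p_plus) - f(*p_minus)) / (2 * eps)

# Made-up example values (not assignment data): W_i, W_h, W_o, x1, x2, y.
params = [0.5, -0.3, 0.8, 1.0, -2.0, 1.0]
for name, idx in [("dL/dW_i", 0), ("dL/dW_h", 1), ("dL/dW_o", 2)]:
    print(name, "=", finite_diff(loss, params, idx))
```

Each printed value should match the corresponding analytic expression evaluated at the same numbers.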
## 4. Multiclass AdaBoost (1%)

Let $X$ be the input space and $F$ a collection of multiclass classifiers mapping $X$ to $[1, K]$, where $K$ denotes the number of classes. Let $\{(x_i, {\hat y}_i)\}_{i=1}^n$ be the training data set, where $x_i \in \mathbb{R}^m$ and ${\hat y}_i \in [1, K]$. Given $T \in \mathbb{N}$, suppose we want to find functions

$$g_T^k(x) = \sum_{t=1}^T \alpha_t f^k_t(x), \quad k \in [1, K]$$

where $f_t \in F$ and $\alpha_t \in \mathbb{R}$ for all $t \in [1, T]$. Here, for $f \in F$, we denote by $f^k(x) = \mathbf{1}\{f(x) = k\}$ the $k$-th element of the one-hot representation of $f(x) \in [1, K]$. The aggregated classifier $h: X \rightarrow [1, K]$ is defined as

$$h(x) = \underset{1 \leq k \leq K}{\operatorname{argmax}} ~ g_T^k(x)$$

Please apply gradient boosting to show how the functions $f_t$ and coefficients $\alpha_t$ are computed so as to minimize the following loss function:

$$L(g_T^1, \ldots, g_T^K) = \sum_{i=1}^n \exp\left(\frac{1}{K-1}\sum_{k \neq {\hat y}_i} g_T^{k}(x_i) - g_T^{{\hat y}_i}(x_i) \right)$$
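For context (and not as a substitute for the requested derivation): forward-stagewise minimization of this multiclass exponential loss is known to lead to a SAMME-style update (Zhu et al., "Multi-class AdaBoost"), in which $f_t$ minimizes the weighted training error and $\alpha_t \propto \ln\frac{1-\text{err}_t}{\text{err}_t} + \ln(K-1)$; the exact constant in $\alpha_t$ depends on the scaling and constraints used in your derivation. The sketch below is illustrative only, assuming labels encoded as $0, \ldots, K-1$ and scikit-learn decision stumps as weak learners; the function names are hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def samme_fit(X, y, K, T):
    """Fit T weak learners f_t with coefficients alpha_t; y takes values in {0, ..., K-1}."""
    n = len(y)
    w = np.full(n, 1.0 / n)                       # sample weights D_t(i)
    learners, alphas = [], []
    for t in range(T):
        f_t = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = f_t.predict(X)
        err = np.sum(w * (pred != y)) / np.sum(w)  # weighted training error
        if err >= 1 - 1.0 / K:                     # weak learner no better than chance
            break
        alpha_t = np.log((1 - err) / max(err, 1e-12)) + np.log(K - 1)
        w *= np.exp(alpha_t * (pred != y))         # up-weight misclassified examples
        w /= w.sum()
        learners.append(f_t)
        alphas.append(alpha_t)
    return learners, alphas

def samme_predict(X, learners, alphas, K):
    """h(x) = argmax_k sum_t alpha_t * 1[f_t(x) = k]."""
    scores = np.zeros((len(X), K))
    for f_t, a_t in zip(learners, alphas):
        scores[np.arange(len(X)), f_t.predict(X)] += a_t
    return scores.argmax(axis=1)
```

The re-weighting step is where the exponential loss enters: examples misclassified by the current ensemble receive exponentially larger weight in the next round, which is the behavior your stagewise derivation should recover.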