# Sequence Models in Deep Learning

## Introduction

Deep learning has revolutionized many fields, but traditional models like Convolutional Neural Networks (CNNs) are primarily designed for data with a fixed structure, like images. However, a vast amount of real-world data is sequential, meaning the order of elements is inherently meaningful. Think of sentences, where the arrangement of words determines the message, or time series data, where values change over time. Sequence models are a specialized class of deep learning architectures tailored to handle such data.

### Why Sequence Models Matter

* **Capturing Temporal Dynamics:** Unlike models that treat inputs independently, sequence models have a built-in memory mechanism. This allows them to understand how elements in a sequence relate to each other over time.
* **Unlocking New Possibilities:** Sequence models open the door to a wide range of applications previously difficult to tackle with traditional machine learning.

### Key Sequence Modeling Tasks

Sequence models excel in a variety of tasks:

* **Prediction:** Forecasting future values in a sequence (e.g., stock prices, weather).
* **Classification:** Assigning labels to sequences (e.g., sentiment analysis of text, genre classification of music).
* **Generation:** Creating new sequences that mimic the patterns of training data (e.g., writing poems, composing music).
* **Sequence-to-Sequence (Seq2Seq):** Transforming one type of sequence into another (e.g., machine translation, speech recognition).

### Pervasive Applications

Sequence models have found widespread use across diverse domains:

* **Natural Language Processing (NLP):**
    * Machine translation
    * Text summarization
    * Sentiment analysis
    * Question answering
* **Computer Vision:**
    * Video captioning
    * Action recognition
    * Object tracking
* **Time Series Analysis:**
    * Financial forecasting
    * Demand prediction
    * Anomaly detection
* **And Beyond:** Sequence models are also used in healthcare (e.g., predicting disease progression), robotics (e.g., motion planning), and even music composition!

## Recurrent Neural Networks (RNNs)

### RNNs vs. Feedforward Networks

The fundamental difference between RNNs and feedforward neural networks (FNNs) lies in how they handle information flow.

* **Feedforward Neural Networks:** In FNNs, data flows strictly in one direction – from input to output – without any loops or cycles. This makes them well-suited for tasks where inputs are independent of each other.
* **Recurrent Neural Networks:** RNNs, on the other hand, introduce loops into the network architecture. This enables them to maintain an internal memory called a hidden state, which is updated at each time step. This hidden state allows RNNs to capture dependencies across time, making them ideal for processing sequential data.

![Screenshot 2024-07-04 at 5.07.06 PM](https://hackmd.io/_uploads/r1jL81NDR.png)

### Key Advantages of RNNs

* **Variable-Length Sequences:** RNNs can process input sequences of varying lengths, unlike FNNs, which require fixed-size inputs (see the short demonstration after this list).
* **Memory and Context:** The hidden state serves as a form of memory, retaining information from previous time steps and providing context for processing the current input.
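To make the first point concrete, here is a minimal sketch (the sizes are arbitrary and chosen only for illustration) showing that the same `nn.RNN` layer accepts sequences of different lengths, because its weights depend only on the input and hidden sizes, not on the sequence length:

```python=
import torch
import torch.nn as nn

rnn = nn.RNN(input_size=8, hidden_size=16, batch_first=True)

short_seq = torch.randn(1, 5, 8)   # 1 sequence, 5 time steps, 8 features per step
long_seq = torch.randn(1, 12, 8)   # 1 sequence, 12 time steps, 8 features per step

out_short, h_short = rnn(short_seq)
out_long, h_long = rnn(long_seq)

print(out_short.shape)  # torch.Size([1, 5, 16])  -- one hidden state per time step
print(out_long.shape)   # torch.Size([1, 12, 16])
print(h_long.shape)     # torch.Size([1, 1, 16])  -- final hidden state: (num_layers, batch, hidden_size)
```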
### The RNN Cell

The core building block of an RNN is the **RNN cell**. Let's break down its components:

* **Input $x_t$:** The data point at the current time step $t$.
* **Hidden State $h_t$:** The RNN's memory, which summarizes information from all previous time steps. It's calculated as follows:
  $$h_t = f(W_{hh}h_{t-1} + W_{xh}x_t + b_h)$$
    * $h_{t-1}$: The hidden state from the previous time step.
    * $f$: A non-linear activation function (e.g., tanh, ReLU) to introduce complexity.
    * $W_{hh}$, $W_{xh}$: Weight matrices that learn the relationships between the hidden states and inputs.
    * $b_h$: Bias term.
* **Output $o_t$:** The prediction or output generated by the RNN at time step $t$.
  $$o_t = g(W_{ho}h_t + b_o)$$
    * $g$: An output function, often chosen based on the task (e.g., softmax for classification).
    * $W_{ho}$: Weight matrix that transforms the hidden state into the output.
    * $b_o$: Bias term.

![rnn-done](https://hackmd.io/_uploads/rJ8rb67w0.png)

The weight matrices $W$ and bias terms $b$ are shared across all time steps, meaning the RNN learns a general way to update the hidden state and generate outputs regardless of the sequence's length.

#### Code Implementation

```python=
class SRNNModel(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        # Create the RNN layer (num_layers = 1 for single layer)
        self.rnn = nn.RNN(input_size, hidden_size, num_layers=1, batch_first=True)
        # Create the fully connected (linear) output layer
        self.fc = nn.Linear(hidden_size, output_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        Defines the forward pass of the model.

        Args:
            x: The input tensor of shape (batch_size, seq_len, input_size).

        Returns:
            out: The output tensor of shape (batch_size, output_size).
        """
        # Pass the input through the RNN layer
        # out: (batch_size, seq_len, hidden_size)
        # _  : (num_layers, batch_size, hidden_size) (discarding the final hidden state)
        out, _ = self.rnn(x)
        # Get the output from the last time step for each sequence in the batch
        out = out[:, -1, :]  # (batch_size, hidden_size)
        # Pass the last time step's output through the fully connected layer to get the final output
        out = self.fc(out)  # (batch_size, output_size)
        return out
```

### Mapping Types

RNNs are versatile and can be adapted to various input-output relationships (see the short sketch after this list):

* **One-to-many:** One input produces a sequence of outputs (e.g., image captioning, music generation).
* **Many-to-one:** A sequence of inputs produces a single output (e.g., sentiment analysis, video classification).
* **Many-to-many:** A sequence of inputs produces a sequence of outputs (e.g., machine translation, speech recognition).

![mapping-type-done](https://hackmd.io/_uploads/rkEBMc5UR.png)
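In PyTorch these mapping types mostly come down to which part of the RNN output you keep. The following sketch (with assumed, arbitrary sizes) shows the many-to-one and many-to-many cases; one-to-many is typically implemented by feeding the model's own prediction back in as the next input, step by step, as the forecasting code later in this post does:

```python=
import torch
import torch.nn as nn

rnn = nn.RNN(input_size=8, hidden_size=16, batch_first=True)
x = torch.randn(4, 10, 8)      # 4 sequences, 10 time steps each

out, h_n = rnn(x)              # out: (4, 10, 16), the hidden state at every time step

many_to_one = out[:, -1, :]    # (4, 16): keep only the last step, e.g. for sequence classification
many_to_many = out             # (4, 10, 16): keep every step, e.g. for per-step predictions
```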
### Backpropagation Through Time (BPTT) and its Challenges

To train RNNs effectively, we need to adjust their internal parameters (weights and biases) based on the errors they make. This is done through a process called Backpropagation Through Time (BPTT), which is an adaptation of the standard backpropagation algorithm used in feedforward networks.

#### How BPTT Works

Let's consider a simplified RNN with a single hidden layer:

* The hidden state $h_t$ is calculated as:
  $$h_t = f(x_t,h_{t-1},w_h)$$
    * $x_t$: The input at time step $t$
    * $h_{t-1}$: The hidden state from the previous time step
    * $w_h$: The weights associated with the hidden state
    * $f$: A non-linear activation function (e.g., tanh, ReLU)
* The output $o_t$ is calculated as:
  $$o_t = g(h_t,w_o)$$
    * $w_o$: The weights associated with the output
    * $g$: An activation function (often the identity function for regression or softmax for classification)

The BPTT process consists of the following steps (a short PyTorch sketch of the procedure follows the list):

1. **Forward Pass:** The RNN processes the input sequence, one time step at a time, updating its hidden state and generating outputs at each step.
2. **Loss Calculation:** The difference between the RNN's predicted outputs $o_t$ and the actual target values $y_t$ is computed using a loss function, often the Mean Squared Error (MSE) for regression tasks:
   $$L(x_1,...,x_T,y_1,...,y_T,w_h,w_o)=\frac{1}{T}\sum_{t=1}^{T}l(y_t,o_t)$$
   where $l$ represents the loss at each time step.
3. **Backward Pass:** BPTT propagates the error backwards through time. It calculates the gradients of the loss with respect to the model's parameters $(w_h, w_o)$. This involves the chain rule of calculus, as the loss depends on the outputs, which in turn depend on the hidden states, which depend on the weights. For instance, the gradient of the loss with respect to the hidden state weight $w_h$ at time step $t$ is:
   $$\begin{split} \frac{\partial L}{\partial w_h}&=\frac{1}{T}\sum_{t=1}^{T}\frac{\partial l(y_t,o_t)}{\partial w_h} \\&= \frac{1}{T}\sum_{t=1}^{T}\frac{\partial l(y_t,o_t)}{\partial o_t}\frac{\partial o_t}{\partial w_h}\\&= \frac{1}{T}\sum_{t=1}^{T}\frac{\partial l(y_t,o_t)}{\partial o_t}\frac{\partial g(h_t,w_o)}{\partial h_t}\frac{\partial h_t}{\partial w_h} \end{split}$$
   Note that $\frac{\partial h_t}{\partial w_h}$ is not straightforward, as it depends on $h_{t-1}$, which also depends on $w_h$. This leads to a chain of derivatives:
   $$\begin{split} \frac{\partial h_t}{\partial w_h}&=\frac{\partial f(x_t,h_{t-1},w_h)}{\partial w_h}+\frac{\partial f(x_t,h_{t-1},w_h)}{\partial h_{t-1}}\frac{\partial h_{t-1}}{\partial w_h}\\ &=\frac{\partial f(x_t,h_{t-1},w_h)}{\partial w_h}+\frac{\partial f(x_t,h_{t-1},w_h)}{\partial h_{t-1}}\left(\frac{\partial f(x_{t-1},h_{t-2},w_h)}{\partial w_h}+\frac{\partial f(x_{t-1},h_{t-2},w_h)}{\partial h_{t-2}}\frac{\partial h_{t-2}}{\partial w_h}\right) \\ &=\frac{\partial f(x_t,h_{t-1},w_h)}{\partial w_h}+\frac{\partial f(x_t,h_{t-1},w_h)}{\partial h_{t-1}}\left(\frac{\partial f(x_{t-1},h_{t-2},w_h)}{\partial w_h}+\frac{\partial f(x_{t-1},h_{t-2},w_h)}{\partial h_{t-2}}\left(...+\frac{\partial f(x_{2},h_{1},w_h)}{\partial h_{1}}\frac{\partial h_{1}}{\partial w_h}\right)...\right) \\ &=\frac{\partial f(x_t,h_{t-1},w_h)}{\partial w_h}+\frac{\partial f(x_t,h_{t-1},w_h)}{\partial h_{t-1}}\left(\frac{\partial f(x_{t-1},h_{t-2},w_h)}{\partial w_h}+\frac{\partial f(x_{t-1},h_{t-2},w_h)}{\partial h_{t-2}}\left(...+\frac{\partial f(x_2,h_{1},w_h)}{\partial h_{1}}\frac{\partial f(x_1,h_{0},w_h)}{\partial w_h}\right)...\right)\\ &=\frac{\partial f(x_t,h_{t-1},w_h)}{\partial w_h}+\sum_{i=1}^{t-1} \left(\prod_{j=i+1}^{t}\frac{\partial f(x_j,h_{j-1},w_h)}{\partial h_{j-1}}\right)\frac{\partial f(x_i,h_{i-1},w_h)}{\partial w_h} \end{split}$$
4. **Parameter Update:** An optimization algorithm like stochastic gradient descent uses these gradients to update the model's parameters, aiming to minimize the overall loss.
   $$\begin{split} w_h &\leftarrow w_h - \eta \ast \frac{\partial L}{\partial w_h} \\ w_o &\leftarrow w_o - \eta \ast \frac{\partial L}{\partial w_o}\end{split}$$
   where $\eta$ is the learning rate.
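In practice you rarely implement BPTT by hand: unrolling the recurrence in an ordinary Python loop and calling `loss.backward()` lets autograd replay the chain rule above through every time step. A minimal sketch, with assumed sizes chosen only for illustration:

```python=
import torch
import torch.nn as nn

cell = nn.RNNCell(input_size=1, hidden_size=16)   # one RNN cell, applied repeatedly
readout = nn.Linear(16, 1)

x = torch.randn(4, 10, 1)     # (batch, T, input_size)
y = torch.randn(4, 1)         # target for the final step

h = torch.zeros(4, 16)        # initial hidden state h_0
for t in range(x.size(1)):    # forward pass: unroll the recurrence over all T steps
    h = cell(x[:, t, :], h)

loss = nn.functional.mse_loss(readout(h), y)
loss.backward()               # autograd traverses the unrolled graph backwards through time (BPTT)
```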
#### The Vanishing and Exploding Gradient Problem

The chain rule in BPTT can cause gradients to either vanish or explode as they propagate back through time. This issue arises because gradients are repeatedly multiplied by the weight matrix $W_{hh}$ and the derivatives of activation functions as they propagate back through time.

* **Vanishing Gradients:** If the magnitudes of these values are consistently less than 1, gradients shrink exponentially, hindering the learning of long-term dependencies.
* **Exploding Gradients:** Conversely, if the magnitudes are consistently greater than 1, gradients grow exponentially, leading to numerical instability and making training difficult.

#### Truncated BPTT: A Practical Solution

**Truncated BPTT** addresses this problem by limiting the backpropagation to a fixed number of time steps $k$. This creates a shorter chain of derivatives, reducing the risk of vanishing or exploding gradients.

$$\begin{split} \frac{\partial h_t}{\partial w_h}&\approx\frac{\partial f(x_t,h_{t-1},w_h)}{\partial w_h}+\sum_{i=t-k}^{t-1} \left(\prod_{j=i+1}^{t}\frac{\partial f(x_j,h_{j-1},w_h)}{\partial h_{j-1}}\right)\frac{\partial f(x_i,h_{i-1},w_h)}{\partial w_h} \end{split}$$

Truncated BPTT is a trade-off between computational efficiency and the ability to capture long-term dependencies. It works well in practice because many tasks primarily rely on short-term context.
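In PyTorch, truncated BPTT is usually implemented by splitting a long sequence into chunks of $k$ steps and detaching the hidden state between chunks, so gradients stop flowing at the chunk boundary. A minimal sketch, with assumed sizes and a hypothetical truncation length `k`:

```python=
import torch
import torch.nn as nn

k = 35                                    # truncation length (assumed)
rnn = nn.RNN(input_size=1, hidden_size=16, batch_first=True)
readout = nn.Linear(16, 1)
optimizer = torch.optim.SGD(list(rnn.parameters()) + list(readout.parameters()), lr=0.01)

x = torch.randn(4, 350, 1)                # one long toy sequence batch
y = torch.randn(4, 350, 1)                # per-step targets

h = None
for start in range(0, x.size(1), k):
    chunk_x = x[:, start:start + k, :]
    chunk_y = y[:, start:start + k, :]
    out, h = rnn(chunk_x) if h is None else rnn(chunk_x, h)
    h = h.detach()                        # truncate: no backprop into earlier chunks
    loss = nn.functional.mse_loss(readout(out), chunk_y)
    optimizer.zero_grad()
    loss.backward()                       # gradients flow through at most k time steps
    optimizer.step()
```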
### Multi-layer RNNs (Deep RNNs)

Similar to deep feedforward networks, stacking multiple RNN layers allows us to create deep RNNs. This architecture enables the network to learn hierarchical representations of sequential data. In simpler terms, each layer can focus on different levels of abstraction, with lower layers capturing basic patterns and higher layers learning more complex, abstract features.

![drnn-done](https://hackmd.io/_uploads/BJReR5iUC.png)

#### How Deep RNNs Work

* **Layered Hidden States:** In a deep RNN, each layer $l$ has its own hidden state $h_t^{(l)}$ at time step $t$. By convention, $h_t^{(0)}$ denotes the input $x_t$, so the first recurrent layer reads directly from the input.
* **Information Flow:** The hidden state of each recurrent layer is calculated using both the previous hidden state from the same layer $h_{t-1}^{(l)}$ and the hidden state from the layer below $h_t^{(l-1)}$:
  $$h_t^{(l)} = f(W_{hh}^{(l)}h_{t-1}^{(l)} + W_{xh}^{(l)}h_{t}^{(l-1)} + b_h^{(l)})$$
    * $f$: Activation function
    * $W_{hh}^{(l)}$, $W_{xh}^{(l)}$: Weight matrices specific to layer $l$
    * $b_h^{(l)}$: Bias term for layer $l$
* **Output Generation:** The final output $o_t$ is typically generated from the hidden state of the last layer $h_t^{(L)}$ using a linear transformation and an output function:
  $$o_t = g(W_{ho}h_t^{(L)} + b_o)$$
    * $g$: Output function (e.g., softmax for classification)
    * $W_{ho}$: Weight matrix
    * $b_o$: Bias term

#### Code Implementation

```python=
class DRNNModel(nn.Module):
    def __init__(self, input_size, hidden_size, num_layers, output_size):
        super().__init__()
        self.rnn = nn.RNN(input_size, hidden_size, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        out, _ = self.rnn(x)  # Output 'out' has shape (batch_size, seq_len, hidden_size)
        out = self.fc(out[:, -1, :])  # Take the output from the last time step
        return out
```

## Gated Recurrent Networks (GRUs and LSTMs)

As we've seen, training standard RNNs with BPTT can be hampered by the notorious vanishing/exploding gradient problem.

#### Gradient Clipping: A Mitigation Strategy

One technique to mitigate exploding gradients is gradient clipping. This involves setting a maximum threshold $\tau$ for the norm of the gradient. If the gradient norm exceeds this threshold, the gradient $\mathbf{g}$ is scaled down proportionally so that its norm equals the threshold:

$$ \text{clip}(\mathbf{g})=\begin{cases}\mathbf{g}&\text{if } \|\mathbf{g}\|_2<\tau\\ \tau\dfrac{\mathbf{g}}{\|\mathbf{g}\|_2}&\text{otherwise}\end{cases} $$

#### Code Implementation

```python=
# Rescale gradients in place so that their global L2 norm is at most clip_value (the threshold τ)
torch.nn.utils.clip_grad_norm_(model.parameters(), clip_value)
```

While gradient clipping can help with exploding gradients, it doesn't fully address the vanishing gradient problem. This is where **gated recurrent networks** come into play.

### Gated Recurrent Networks: GRUs and LSTMs

Gated recurrent networks introduce additional mechanisms to control the flow of information through time. They address the vanishing gradient problem by selectively remembering or forgetting information from previous time steps. The two most prominent gated architectures are:

#### Gated Recurrent Unit (GRU)

The GRU simplifies the LSTM by combining the forget and input gates into a single **update gate** (denoted $g_t$ below). It also merges the cell state and hidden state, and uses a **reset gate** $r_t$ to control how much of the previous hidden state flows into the candidate state $h_t^{in}$.

$$\begin{split} r_t&=\sigma(W_{rh}h_{t-1}+W_{rx}x_t+b_r)\\ h_t^{in}&= \text{tanh}(W_{hh}(r_t \circ h_{t-1})+W_{hx}x_t+b_h)\\ g_t&=\sigma(W_{gh}h_{t-1}+W_{gx}x_t+b_g)\\ h_t&=g_t\circ h_{t-1}+(1-g_t)\circ h_t^{in} \end{split}$$

![GRU-done](https://hackmd.io/_uploads/S1ZYiW3UR.png)

#### Long Short-Term Memory (LSTM)

The LSTM introduces a separate **cell state** $c_t$ to store long-term information. It uses **input**, **forget**, and **output** gates to control the flow of information into and out of the cell state.

$$\begin{split} i_t&=\sigma(W_{ih}h_{t-1}+W_{ix}x_t+b_i)\\ f_t&=\sigma(W_{fh}h_{t-1}+W_{fx}x_t+b_f)\\ c_t^{in}&= \text{tanh}(W_{ch}h_{t-1}+W_{cx}x_t+b_c)\\ c_t&=f_t\circ c_{t-1}+i_t\circ c_t^{in}\\ o_t&=\sigma(W_{oh}h_{t-1}+W_{ox}x_t+b_o)\\ h_t&=o_t\circ \text{tanh}(c_t) \end{split}$$

![LSTM](https://hackmd.io/_uploads/HkxCy51PC.png)

#### Key Advantages of Gated RNNs

* **Effective Learning of Long-Term Dependencies:** The gating mechanisms help prevent the vanishing gradient problem, allowing GRUs and LSTMs to capture relationships across long time spans.
* **Improved Stability:** They are less prone to exploding gradients, leading to more stable training.

#### Code Implementation

```python=
class GRUModel(nn.Module):
    def __init__(self, input_size, hidden_size, output_size, num_layers=1):
        super().__init__()
        self.gru = nn.GRU(input_size, hidden_size, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        out, _ = self.gru(x)
        out = self.fc(out[:, -1, :])
        return out
```

```python=
class LSTMModel(nn.Module):
    def __init__(self, input_size, hidden_size, output_size, num_layers=1):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        out, _ = self.lstm(x)
        out = self.fc(out[:, -1, :])
        return out
```
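The gates come at a cost in parameters: for the same input and hidden sizes, a GRU layer holds roughly three times and an LSTM layer roughly four times the recurrent weights of a vanilla RNN layer (one weight block per gate). A quick check, using the same sizes as the models in this post:

```python=
import torch.nn as nn

for name, layer in [('RNN', nn.RNN(1, 64, batch_first=True)),
                    ('GRU', nn.GRU(1, 64, batch_first=True)),
                    ('LSTM', nn.LSTM(1, 64, batch_first=True))]:
    n_params = sum(p.numel() for p in layer.parameters())
    print(f'{name}: {n_params} parameters')
```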
## Autoregressive Models (AR)

Autoregressive (AR) models offer an alternative approach to modeling sequential data. Unlike RNNs, which rely on hidden states to capture dependencies, AR models directly utilize a fixed number of past observations to make predictions.

### How AR Models Work

An autoregressive model of order $k$ directly predicts the value $y_t$ at time $t$ as a function of the $k$ previous values $(x_t, x_{t-1}, \dots, x_{t-k+1})$. In its simplest linear form, this function is a weighted sum of the past values plus a bias term:

$$ y_t = f(x_t, x_{t-1}, \dots, x_{t-k+1}) = w_0 + w_1 x_t + w_2 x_{t-1} + ... + w_k x_{t-k+1} $$

where

* $w_0, w_1, \dots, w_k$: Learned weights that determine the influence of each past value on the prediction.
* $k$: The order of the AR model, representing the number of past values used for prediction.

![Screenshot 2024-07-04 at 5.10.01 PM](https://hackmd.io/_uploads/HkGfPy4wC.png)

### AR Models vs. RNNs

| Feature | Autoregressive (AR) Models | Recurrent Neural Networks (RNNs) |
|---------|----------------------------|----------------------------------|
| Memory | Limited to $k$ past values | Maintained in hidden state |
| Input Dependence | Explicit | Implicit |
| Prediction Mechanism | Direct linear combination | Non-linear transformation |
| Long-Term Dependencies | Limited | Can be captured (but challenging) |

#### Code Implementation

```python=
class ARModel(nn.Module):
    def __init__(self, p):  # p represents the order k
        super().__init__()
        self.linear = nn.Linear(p, 1)  # Linear layer to predict yₜ

    def forward(self, x):
        return self.linear(x)
```

### Multi-layer Autoregressive Models

To capture more intricate patterns in the data, we can extend the AR model by stacking multiple linear layers, with non-linear activation functions (e.g., ReLU) in between. This enables the model to learn hierarchical representations, where each layer extracts increasingly complex features from the past values. A simple two-layer AR model could be represented as:

$$\begin{split} h_t &= \text{ReLU}(W_{h1} x_t + W_{h2} x_{t-1} + \dots + W_{hk} x_{t-k+1} + b_h) \\ y_t &= W_0 h_t + b_0 \end{split}$$

Stacking further layers in this way yields an even deeper AR model.

![Screenshot 2024-07-04 at 5.10.21 PM](https://hackmd.io/_uploads/BJ-1d1VP0.png)

#### Code Implementation

```python=
class DARModel(nn.Module):
    def __init__(self, p, hidden_sizes=[64, 32]):
        super().__init__()
        self.p = p
        layers = []
        input_size = p
        for hidden_size in hidden_sizes:
            layers.append(nn.Linear(input_size, hidden_size))
            layers.append(nn.ReLU())
            input_size = hidden_size
        layers.append(nn.Linear(input_size, 1))
        self.network = nn.Sequential(*layers)

    def forward(self, x):
        return self.network(x)
```
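Note the input convention: the AR models above consume a flat window of the $k$ most recent values as ordinary features, whereas the RNN-style models expect a `(batch, seq_len, 1)` tensor. A small usage sketch (the window size of 24 is an assumed example, matching the `WINDOW_SIZE` used later):

```python=
import torch

k = 24                        # assumed window size / model order
ar = ARModel(p=k)
dar = DARModel(p=k, hidden_sizes=[64, 32])

window = torch.randn(8, k)    # batch of 8 windows, each holding the k most recent values
print(ar(window).shape)       # torch.Size([8, 1]) -- one prediction per window
print(dar(window).shape)      # torch.Size([8, 1])
```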
### Choosing Between AR and RNN Models

The choice between AR models and RNNs depends on the characteristics of your data and the problem you're trying to solve.

* If you primarily need to capture short-term dependencies and favor simpler, more interpretable models, AR models might be a good choice.
* If you suspect longer-term dependencies or complex non-linear patterns in your data, RNNs (especially LSTMs and GRUs) are likely a better fit.

## Example: Time Series Prediction with Sequence Models

### Problem Setup

**Dataset:** We'll use the [Time Series Toy Data Set](https://www.kaggle.com/datasets/yekahaaagayeham/time-series-toy-data-set?resource=download) from Kaggle. This dataset contains monthly toy sales data, providing a practical testbed for exploring time series prediction.

**Task:** We'll tackle a dual challenge:

1. **In-Sample Prediction:** Assess model performance by predicting sales within the dataset's timeframe, verifying their ability to learn historical patterns.
2. **Out-of-Sample Forecasting:** Predict sales for the three months immediately after the end of the dataset. This gauges a model's ability to generalize to unseen future data.

**Model Selection:** We'll compare six different sequence models to determine the most suitable architecture for this task:

* Simple RNN (SRNN)
* Deep RNN (DRNN)
* Long Short-Term Memory (LSTM)
* Gated Recurrent Unit (GRU)
* Autoregressive (AR)
* Deep Autoregressive (DAR)

### Implementation

#### Import Modules & Load Data

```python=
import pandas as pd
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
import numpy as np
from torch.utils.data import Dataset, DataLoader
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error

df = pd.read_csv('sales.csv')
df['date'] = pd.to_datetime(df['date'])
df.set_index('date', inplace=True)

plt.figure(figsize=(10, 6))
plt.plot(df.index, df['sales'])
plt.title('Monthly Toy Sales')
plt.xlabel('Date')
plt.ylabel('Sales (US$)')
plt.show()
```

![download](https://hackmd.io/_uploads/SyKdl9zD0.png)

#### Hyperparameters

```python=+
LEARNING_RATE = 0.0001
EPOCHS = 3000
TRAIN_END = -24        # Last 24 months for testing
WINDOW_SIZE = 24       # 2 years of monthly data
FORECAST_HORIZON = 3   # 3 months out-of-sample forecast
```

#### Define Models, Dataset & Prepare Data

```python=
models = {
    'SRNN': SRNNModel(input_size=1, hidden_size=64, output_size=1),
    'DRNN': DRNNModel(input_size=1, hidden_size=64, output_size=1, num_layers=3),
    'LSTM': LSTMModel(input_size=1, hidden_size=64, output_size=1),
    'GRU': GRUModel(input_size=1, hidden_size=64, output_size=1),
    'AR': ARModel(p=WINDOW_SIZE),
    'DAR': DARModel(p=WINDOW_SIZE, hidden_sizes=[64, 32])
}

Is_autoregressive = {'SRNN': False, 'DRNN': False, 'LSTM': False,
                     'GRU': False, 'AR': True, 'DAR': True}

# Data preparation function
def prepare_time_series_data(series, window_size, train_end):
    '''
    series: the full time series (one column) to predict
    window_size: sequence length of each training window
    train_end: offset of the last training day from the last day of the series
    Returns the windowed train/test frames, the full series, and its dates.
    '''
    train_series, test_series = series[:train_end], series[train_end - window_size:]

    train_data = pd.DataFrame()
    for i in range(window_size):
        # Every row is one sequence of data with sequence length window_size
        train_data['c%d' % i] = train_series.tolist()[i: -window_size + i]
    train_data['y'] = train_series.tolist()[window_size:]
    train_data.index = train_series.index[window_size:]
    train_data.index.name = 'Date of y'
    print(train_data.head(5))

    test_data = pd.DataFrame()
    for i in range(window_size):
        # Every row is one sequence of data with sequence length window_size
        test_data['c%d' % i] = test_series.tolist()[i: -window_size + i]
    test_data['y'] = test_series.tolist()[window_size:]
    test_data.index = test_series.index[window_size:]
    test_data.index.name = 'Date of y'
    print(test_data.head(5))

    return train_data, test_data, series, series.index.tolist()

class TrainSet(Dataset):
    def __init__(self, data):
        # data holds the lagged inputs x; label holds the target y used for the loss
        self.data, self.label = data[:, :-1].float(), data[:, -1].float()

    def __getitem__(self, index):
        return self.data[index], self.label[index]

    def __len__(self):
        return len(self.data)

# Prepare data
train_data, test_data, all_series, dates = prepare_time_series_data(df['sales'], WINDOW_SIZE, TRAIN_END)

# Standardize the data
scaler = StandardScaler()
train_data_normalized = scaler.fit_transform(train_data)
test_data_normalized = scaler.transform(test_data)

# Build dataloaders
train_set = TrainSet(torch.Tensor(train_data_normalized))
test_set = TrainSet(torch.Tensor(test_data_normalized))
train_loader = DataLoader(train_set, batch_size=10, shuffle=False)
test_loader = DataLoader(test_set, batch_size=10, shuffle=False)
```
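As a quick, optional sanity check (not required for training), you can inspect one batch from the loader to confirm the shapes the models will receive:

```python=
batch_x, batch_y = next(iter(train_loader))
print(batch_x.shape)  # e.g. torch.Size([10, 24]) -- (batch_size, WINDOW_SIZE) lagged inputs
print(batch_y.shape)  # e.g. torch.Size([10])     -- the next month's (standardized) sales for each window
# The training loop below adds a feature dimension, batch_x.unsqueeze(2) -> (10, 24, 1),
# for the RNN-style models, while the AR models consume the 24 lagged values as features.
```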
#### Training and Validation

```python=
PRINT_EPOCH = 500

def train_model(model, train_loader, test_loader, optimizer, criterion, epochs, is_autoregressive):
    for epoch in range(epochs):
        model.train()
        train_loss = 0
        for batch_x, batch_y in train_loader:
            optimizer.zero_grad()
            if is_autoregressive:
                outputs = model(batch_x.unsqueeze(1))
            else:
                outputs = model(batch_x.unsqueeze(2))
            loss = criterion(outputs.squeeze(), batch_y)
            loss.backward()
            optimizer.step()
            train_loss += loss.item()

        model.eval()
        test_loss = 0
        with torch.no_grad():
            for batch_x, batch_y in test_loader:
                if is_autoregressive:
                    outputs = model(batch_x.unsqueeze(1))
                else:
                    outputs = model(batch_x.unsqueeze(2))
                loss = criterion(outputs.squeeze(), batch_y)
                test_loss += loss.item()

        if epoch % PRINT_EPOCH == 0:
            print(f'Epoch [{epoch}/{epochs}], Train Loss: {train_loss/len(train_loader):.4f}, Test Loss: {test_loss/len(test_loader):.4f}')

    return model

# Train all models
for name, model in models.items():
    print(f"Training {name} model:")
    optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)
    criterion = nn.MSELoss()
    models[name] = train_model(model, train_loader, test_loader, optimizer, criterion, EPOCHS, Is_autoregressive[name])
```

```
Training SRNN model:
Epoch [0/3000], Train Loss: 1.0424, Test Loss: 2.1848
Epoch [500/3000], Train Loss: 0.0425, Test Loss: 0.0911
Epoch [1000/3000], Train Loss: 0.0368, Test Loss: 0.0705
Epoch [1500/3000], Train Loss: 0.0180, Test Loss: 0.0623
Epoch [2000/3000], Train Loss: 0.0108, Test Loss: 0.0533
Epoch [2500/3000], Train Loss: 0.0062, Test Loss: 0.0503
Training DRNN model:
Epoch [0/3000], Train Loss: 1.0525, Test Loss: 2.6385
Epoch [500/3000], Train Loss: 0.0062, Test Loss: 0.0404
Epoch [1000/3000], Train Loss: 0.0038, Test Loss: 0.0318
Epoch [1500/3000], Train Loss: 0.0021, Test Loss: 0.0259
Epoch [2000/3000], Train Loss: 0.0038, Test Loss: 0.0274
Epoch [2500/3000], Train Loss: 0.0005, Test Loss: 0.0257
Training LSTM model:
Epoch [0/3000], Train Loss: 1.0509, Test Loss: 2.5158
Epoch [500/3000], Train Loss: 0.0352, Test Loss: 0.1144
Epoch [1000/3000], Train Loss: 0.0328, Test Loss: 0.0933
Epoch [1500/3000], Train Loss: 0.0279, Test Loss: 0.0960
Epoch [2000/3000], Train Loss: 0.0247, Test Loss: 0.0699
Epoch [2500/3000], Train Loss: 0.0245, Test Loss: 0.0534
Training GRU model:
Epoch [0/3000], Train Loss: 1.0482, Test Loss: 2.1544
Epoch [500/3000], Train Loss: 0.0335, Test Loss: 0.0831
Epoch [1000/3000], Train Loss: 0.0265, Test Loss: 0.0971
Epoch [1500/3000], Train Loss: 0.0260, Test Loss: 0.0944
Epoch [2000/3000], Train Loss: 0.0228, Test Loss: 0.0909
Epoch [2500/3000], Train Loss: 0.0212, Test Loss: 0.1052
Training AR model:
Epoch [0/3000], Train Loss: 2.7177, Test Loss: 5.6123
Epoch [500/3000], Train Loss: 0.1110, Test Loss: 0.1414
Epoch [1000/3000], Train Loss: 0.0379, Test Loss: 0.0897
Epoch [1500/3000], Train Loss: 0.0260, Test Loss: 0.0840
Epoch [2000/3000], Train Loss: 0.0231, Test Loss: 0.0846
Epoch [2500/3000], Train Loss: 0.0218, Test Loss: 0.0856
Training DAR model:
Epoch [0/3000], Train Loss: 1.0525, Test Loss: 2.8006
Epoch [500/3000], Train Loss: 0.0053, Test Loss: 0.0507
Epoch [1000/3000], Train Loss: 0.0022, Test Loss: 0.0368
Epoch [1500/3000], Train Loss: 0.0014, Test Loss: 0.0350
Epoch [2000/3000], Train Loss: 0.0010, Test Loss: 0.0348
Epoch [2500/3000], Train Loss: 0.0009, Test Loss: 0.0344
```
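The problem setup also calls for an in-sample check. One way to do this (a minimal sketch, not part of the original training script) is to run each trained model over the held-out test windows and report the MSE in standardized units:

```python=
def in_sample_mse(model, loader, is_autoregressive):
    # Hypothetical helper: average squared error over the held-out test windows
    model.eval()
    preds, targets = [], []
    with torch.no_grad():
        for batch_x, batch_y in loader:
            if is_autoregressive:
                out = model(batch_x.unsqueeze(1))
            else:
                out = model(batch_x.unsqueeze(2))
            preds.append(out.reshape(-1))
            targets.append(batch_y)
    preds, targets = torch.cat(preds), torch.cat(targets)
    return torch.mean((preds - targets) ** 2).item()

for name, model in models.items():
    print(f'{name} in-sample (test-window) MSE: {in_sample_mse(model, test_loader, Is_autoregressive[name]):.4f}')
```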
### Evaluation and Prediction

```python=
def forecast(model, initial_sequence, steps, is_autoregressive):
    model.eval()
    with torch.no_grad():
        predictions = []
        input_seq = initial_sequence.clone().cuda()
        for _ in range(steps):
            if is_autoregressive:
                output = model(input_seq.unsqueeze(0).unsqueeze(1))
            else:
                output = model(input_seq.unsqueeze(0).unsqueeze(2))
            predictions.append(output.item())
            # Slide the window forward: drop the oldest value, append the new prediction
            input_seq = torch.cat([input_seq[1:], output.view(1)])
    return predictions

plt.figure(figsize=(12, 8))

# Fit a single-column scaler on the training portion of the series so that the full
# series (and the forecasts) can be standardized and de-standardized consistently.
series_scaler = StandardScaler()
series_scaler.fit(pd.DataFrame(all_series[:TRAIN_END]))

for model_name, model in models.items():
    model.cuda()
    # Standardize the full series and take the last window before the forecast horizon
    original_series = torch.Tensor(series_scaler.transform(pd.DataFrame(all_series))).squeeze(1)
    last_sequence = original_series[-FORECAST_HORIZON - WINDOW_SIZE:-FORECAST_HORIZON]
    forecast_values = forecast(model, last_sequence, FORECAST_HORIZON, Is_autoregressive[model_name])

    # Calculate and print the forecast MSE (in standardized units) for each model
    mse = mean_squared_error(series_scaler.transform(pd.DataFrame(all_series[-FORECAST_HORIZON:])), forecast_values)
    print(f'{model_name} Model MSE: {mse:.2f}')

    # Convert the forecasts back to the original scale for plotting
    forecast_values = series_scaler.inverse_transform(np.array(forecast_values).reshape(-1, 1))
    plt.plot(dates[-FORECAST_HORIZON:], forecast_values, label=f'{model_name}_pred')

plt.plot(dates[-FORECAST_HORIZON:], all_series[-FORECAST_HORIZON:], 'k', label='Actual Sales')
plt.title('Sales Prediction and 3-Month Forecast')
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
plt.xlabel('Date')
plt.ylabel('Sales (US$)')
plt.tight_layout()
plt.show()
```

```!
SRNN Model MSE: 0.09
DRNN Model MSE: 0.11
LSTM Model MSE: 0.13
GRU Model MSE: 0.03
AR Model MSE: 0.03
DAR Model MSE: 0.03
```

![test1-4](https://hackmd.io/_uploads/H1z3c7BwR.png)

### Conclusion and Discussion

The results reveal these key insights:

* **GRU as Top Performer:** Among the recurrent models, the GRU achieved the lowest forecast MSE (0.03, on par with the AR models). This suggests it's a good fit for this dataset, striking a balance between capturing the underlying patterns and avoiding overfitting.
* **LSTM Performance:** The LSTM captured the overall pattern, but its forecast was more conservative than the GRU's and had the highest forecast MSE (0.13). This is common, as LSTMs can be a bit more cautious in extrapolating trends.
* **RNNs:** Both the simple and deep RNNs captured the overall trend and seasonality, but their forecasts were less accurate than the GRU's, highlighting the value of gating mechanisms in handling long-term dependencies.
* **AR Models:** The AR models, especially the simple AR, performed surprisingly well on this dataset. Their linear nature might make them less prone to overfitting on this relatively smooth time series.

## Reference

* L08 – Sequence Models, University of Tübingen Deep Learning course: [link](https://drive.google.com/file/d/1-riYFFIfJCCP5K002n5LepGL-AEhPBwK/view)
* Chapters 9 & 10 of *Dive into Deep Learning*: [link](https://d2l.ai/)