# Module 4 (Time Series): LSTM, GRU, and Deep Learning
### 1. From ARIMA to Deep Learning
Traditional models like **ARIMA** work well for linear, stationary time series.
However, real-world data often shows **nonlinear** and **long-term** dependencies.
Deep learning models, such as **RNN**, **LSTM**, and **GRU**, are designed to handle these complex sequential patterns.
> “Unlike ARIMA, which looks only at the mathematical relationship between past and future values, LSTM learns temporal patterns directly from data.”
>
---
### 2. Sequence Problems
In **sequence problems**, the **order of datapoints matters**.
Example:
To predict next month’s sales, we must consider the previous months, not treat each as independent.
A standard neural network (MLP) cannot do this because:
- It assumes independence between inputs.
- It cannot retain information from past timesteps.
Hence, we need networks that **remember context** → RNNs and LSTMs.
---
### 3. Recurrent Neural Network (RNN)
#### Mechanism
RNNs pass information from one time step to the next:
$$
h_t = f(W_x x_t + W_h h_{t-1} + b)
$$
Where:
- $x_t$: input at time step *t*
- $h_{t-1}$: previous hidden state (memory)
- $h_t$: current hidden state
This allows RNNs to process **sequential data** like time series, text, or speech.
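#### Example: Simple RNN in TensorFlow
As a minimal sketch, the Keras `SimpleRNN` layer implements the update above (with $f = \tanh$ by default); the sequence length, feature count, and layer size below are illustrative placeholders.
```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import SimpleRNN, Dense

timesteps, features = 12, 1   # e.g., 12 past months of a single variable

# One recurrent layer computes h_t = tanh(W_x x_t + W_h h_{t-1} + b) at every step,
# and a Dense layer maps the final hidden state to a one-step-ahead forecast
model = Sequential([
    SimpleRNN(32, activation='tanh', input_shape=(timesteps, features)),
    Dense(1)
])
model.compile(optimizer='adam', loss='mse')
model.summary()
```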
---
### 4. Long Short-Term Memory (LSTM)
#### Motivation
LSTM (Long Short-Term Memory) networks were designed to **mitigate the vanishing gradient problem** that prevents plain RNNs from learning long-range dependencies. They introduce a special unit called the **cell state** that can carry information across many time steps.
This makes LSTMs ideal for long sequences where context from far back still matters.
> Imagine you’re baking a cake:
> you remember *“add eggs”* but forget *“what shirt you wore.”*
>
> LSTM works the same way — it decides what to **remember**, **forget**, and **output** based on context.
#### The Three Gates of LSTM
Each LSTM cell contains three “gates” that control information flow; the table below also lists the candidate and cell-state updates they feed into:
| Component | Function | Formula |
|------|-----------|----------|
| **Forget Gate** | Decides what old information to discard | $$ f_t = \sigma(W_f[h_{t-1}, x_t] + b_f) $$ |
| **Input Gate** | Decides which new information to add | $$ i_t = \sigma(W_i[h_{t-1}, x_t] + b_i) $$ |
| **Candidate State** | Generates potential new memory | $$ \tilde{C}_t = \tanh(W_c[h_{t-1}, x_t] + b_c) $$ |
| **Cell State Update** | Combines past and new information | $$ C_t = f_t * C_{t-1} + i_t * \tilde{C}_t $$ |
| **Output Gate** | Determines what to output next | $$ o_t = \sigma(W_o[h_{t-1}, x_t] + b_o), \quad h_t = o_t * \tanh(C_t) $$ |
#### Explanation of Symbols
| Symbol | Meaning |
|--------|----------|
| $x_t$ | Input at time step *t* |
| $h_{t-1}$ | Hidden state from previous step |
| $C_{t-1}$ | Previous cell state |
| $\sigma$ | Sigmoid activation (values between 0 and 1) |
| $\tanh$ | Hyperbolic tangent activation |
| $f_t, i_t, o_t$ | Forget, Input, and Output gate activations |
| $W$ | Weight matrices for each gate |
| $b$ | Bias terms |
---
#### Information Flow Summary
1. **Forget Gate:** Decides which information from the previous state should be forgotten.
2. **Input Gate:** Determines which new information to store in the cell state.
3. **Cell State Update:** Combines the remembered and new information.
4. **Output Gate:** Produces the output (hidden state) for the next step.
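#### Example: One LSTM Step in NumPy
To make the gate equations concrete, the sketch below computes a single LSTM step by hand for a toy cell; the weights are random placeholders rather than trained values.
```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n_hidden, n_input = 2, 3
x_t = rng.normal(size=n_input)       # input at time t
h_prev = np.zeros(n_hidden)          # previous hidden state h_{t-1}
C_prev = np.zeros(n_hidden)          # previous cell state C_{t-1}
z = np.concatenate([h_prev, x_t])    # the concatenation [h_{t-1}, x_t]

# Random placeholder weights and zero biases for the four transformations
W_f, W_i, W_c, W_o = (rng.normal(size=(n_hidden, n_hidden + n_input)) for _ in range(4))
b_f = b_i = b_c = b_o = np.zeros(n_hidden)

f_t = sigmoid(W_f @ z + b_f)              # forget gate
i_t = sigmoid(W_i @ z + b_i)              # input gate
C_tilde = np.tanh(W_c @ z + b_c)          # candidate state
C_t = f_t * C_prev + i_t * C_tilde        # cell state update
o_t = sigmoid(W_o @ z + b_o)              # output gate
h_t = o_t * np.tanh(C_t)                  # new hidden state

print("C_t:", C_t.round(3), "h_t:", h_t.round(3))
```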
---
#### Example: LSTM in TensorFlow
```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense
# Build a simple LSTM model
# (timesteps, features, X_train and y_train are assumed to be prepared beforehand;
#  see the sliding-window sketch below)
model = Sequential()
model.add(LSTM(50, activation='relu', input_shape=(timesteps, features)))
model.add(Dense(1))
# Compile and train
model.compile(optimizer='adam', loss='mean_squared_error')
model.fit(X_train, y_train, epochs=50, batch_size=16, verbose=1)
```
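The snippet above assumes `X_train` (shaped `(samples, timesteps, features)`) and `y_train` already exist. One common way to build them from a univariate series is a sliding "lookback" window, sketched below with a synthetic placeholder series.
```python
import numpy as np

def make_windows(values, lookback):
    """Turn a 1-D series into (samples, timesteps, 1) inputs and next-step targets."""
    X, y = [], []
    for i in range(len(values) - lookback):
        X.append(values[i:i + lookback])
        y.append(values[i + lookback])
    X = np.array(X)[..., np.newaxis]   # add a single feature dimension
    return X, np.array(y)

values = np.sin(np.linspace(0, 20, 200))      # placeholder series
timesteps, features = 12, 1
X_train, y_train = make_windows(values, timesteps)
print(X_train.shape, y_train.shape)           # (188, 12, 1) (188,)
```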
---
### 5. Gated Recurrent Unit (GRU)
The **Gated Recurrent Unit (GRU)** is a simplified version of the **LSTM** architecture.
It was introduced by Cho et al. (2014) to **reduce the computational cost** of LSTMs while keeping similar performance.
Unlike LSTM, GRU merges the **cell state** and **hidden state** into a single vector and uses only **two gates**:
- **Update Gate**
- **Reset Gate**
This makes GRUs:
- Faster to train
- Less prone to overfitting (fewer parameters)
- Often a better fit for smaller datasets
#### GRU Structure Overview
| Gate | Function | Formula |
|------|-----------|----------|
| **Update Gate** | Controls how much of the past information to keep | $$ z_t = \sigma(W_z \cdot [h_{t-1}, x_t]) $$ |
| **Reset Gate** | Decides how much past information to forget | $$ r_t = \sigma(W_r \cdot [h_{t-1}, x_t]) $$ |
| **Candidate Activation** | Generates a new memory content | $$ \tilde{h}_t = \tanh(W_h \cdot [r_t * h_{t-1}, x_t]) $$ |
| **Final Hidden State** | Blends old memory and new information | $$ h_t = (1 - z_t) * h_{t-1} + z_t * \tilde{h}_t $$ |
#### Explanation of Terms
| Symbol | Meaning |
|--------|----------|
| $x_t$ | Input at time step *t* |
| $h_{t-1}$ | Previous hidden state |
| $z_t$ | Update gate value |
| $r_t$ | Reset gate value |
| $\tilde{h}_t$ | Candidate hidden state |
| $\sigma$ | Sigmoid activation function |
| $\tanh$ | Hyperbolic tangent activation |
| $W_z, W_r, W_h$ | Weight matrices for each gate |
#### How GRU Works Step-by-Step
1. **Reset Gate ($r_t$):**
Controls how much of the previous memory to forget.
- If $r_t$ → 0 → the network forgets old information.
- If $r_t$ → 1 → the network remembers it.
2. **Update Gate ($z_t$):**
Determines how much of the new candidate $\tilde{h}_t$ replaces the old memory $h_{t-1}$.
- If $z_t$ → 0 → mostly old memory is kept.
- If $z_t$ → 1 → the new information dominates.
3. **Candidate State ($\tilde{h}_t$):**
Generates new information based on the current input and reset gate.
4. **Final Output ($h_t$):**
Blends the previous hidden state and the new candidate state.
---
#### Visual Comparison: LSTM vs GRU
| Feature | **LSTM** | **GRU** |
|----------|-----------|----------|
| Number of Gates | 3 (Forget, Input, Output) | 2 (Reset, Update) |
| Cell State | Yes (separate memory) | No (merged with hidden state) |
| Computation | Heavier | Lighter |
| Training Time | Slower | Faster |
| Memory Retention | Better for long sequences | Good for short to medium sequences |
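One way to see the "lighter computation" claim in practice is to compare parameter counts of equally sized layers; the sketch below uses arbitrary layer sizes.
```python
from tensorflow.keras import Input
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, GRU

timesteps, features = 10, 8

lstm_model = Sequential([Input(shape=(timesteps, features)), LSTM(64)])
gru_model = Sequential([Input(shape=(timesteps, features)), GRU(64)])

# An LSTM cell has four weight blocks (forget, input, candidate, output);
# a GRU cell has three (update, reset, candidate), so it is noticeably smaller.
print("LSTM parameters:", lstm_model.count_params())
print("GRU parameters :", gru_model.count_params())
```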
#### Example: GRU in TensorFlow
```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import GRU, Dense
# Build a simple GRU model
model = Sequential()
# Add GRU layer
model.add(GRU(50, activation='tanh', input_shape=(timesteps, features)))
# Add output layer
model.add(Dense(1))
# Compile the model
model.compile(optimizer='adam', loss='mean_squared_error')
# Train the model
history = model.fit(
    X_train,
    y_train,
    epochs=50,
    batch_size=16,
    validation_data=(X_val, y_val),
    verbose=1
)
```
---
### 6. Attention Mechanism
Even with powerful models like **LSTM** or **GRU**, there’s a limitation —
they must compress **all past information into a single hidden state vector**.
This makes it difficult for the model to:
- Remember long-term dependencies
- Focus on relevant parts of the input sequence
- Interpret which parts of the past data influence the prediction
The **Attention Mechanism** solves this by allowing the model to **“focus” on specific time steps or features** dynamically — much like how humans pay attention to certain words in a sentence or key points in a graph.
> Imagine you’re reading a paragraph to answer a question.
> You don’t reread every word — you focus only on **the relevant sentences**.
>
> Similarly, the attention mechanism lets a neural network assign **weights** to important time steps and ignore the rest.
#### Core Idea
At each prediction step, the model computes a **weighted average** of all previous hidden states —
the more relevant a hidden state is, the **higher its weight (attention score)**.
#### Mathematical Formulation
1. **Score Function** — measures how relevant each previous hidden state is to the current step:
$$
e_{t,i} = \text{score}(h_t, h_i)
$$
2. **Softmax Normalization** — converts scores into probabilities (attention weights):
$$
\alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_j \exp(e_{t,j})}
$$
3. **Context Vector** — combines all hidden states using the attention weights:
$$
c_t = \sum_i \alpha_{t,i} h_i
$$
4. **Final Output** — uses both context and current state to make prediction:
$$
\tilde{h}_t = \tanh(W_c [c_t ; h_t])
$$
#### Explanation of Symbols
| Symbol | Meaning |
|--------|----------|
| $h_t$ | Hidden state at time *t* |
| $e_{t,i}$ | Alignment score between current and past state |
| $\alpha_{t,i}$ | Attention weight (importance of each past step) |
| $c_t$ | Context vector (weighted sum of past hidden states) |
| $W_c$ | Weight matrix for combining context and hidden state |
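#### Example: Dot-Product Attention in NumPy
The four steps above can be traced with a tiny NumPy sketch using dot-product scores; random vectors stand in for real LSTM hidden states, and $W_c$ is a random placeholder.
```python
import numpy as np

rng = np.random.default_rng(42)

T, d = 5, 4                       # sequence length and hidden size
H = rng.normal(size=(T, d))       # past hidden states h_1 ... h_T
h_t = rng.normal(size=d)          # current hidden state

# 1. Dot-product scores e_{t,i} = h_t . h_i
scores = H @ h_t

# 2. Softmax normalization -> attention weights that sum to 1
weights = np.exp(scores) / np.exp(scores).sum()

# 3. Context vector: weighted sum of past hidden states
c_t = weights @ H

# 4. Combine context and current state
W_c = rng.normal(size=(d, 2 * d))
h_att = np.tanh(W_c @ np.concatenate([c_t, h_t]))

print("weights:", weights.round(3))
print("attentional state:", h_att.round(3))
```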
---
#### Common Attention Types
| Type | Description | Example Formula |
|------|--------------|----------------|
| **Dot Product (Luong)** | Computes similarity via dot product | $$ e_{t,i} = h_t^\top h_i $$ |
| **Additive (Bahdanau)** | Uses a small neural network to learn attention weights | $$ e_{t,i} = v_a^\top \tanh(W_1 h_t + W_2 h_i) $$ |
| **Scaled Dot Product** | Normalizes dot product by vector dimension | $$ e_{t,i} = \frac{h_t^\top h_i}{\sqrt{d_k}} $$ |
### Attention in Time Series Forecasting
For time series, attention allows the model to **focus on important time periods**,
for example recent spikes or seasonal peaks, rather than treating every point equally.
The model can learn:
- Which past observations are most influential
- Which features (variables) are relevant
- How to dynamically shift focus across time
---
#### Example: Add Attention Layer (Keras)
```python
import tensorflow as tf
from tensorflow.keras.layers import Layer
# Custom Attention Layer
class Attention(Layer):
    def __init__(self):
        super(Attention, self).__init__()

    def call(self, query, values):
        # Compute attention scores
        score = tf.matmul(query, values, transpose_b=True)
        weights = tf.nn.softmax(score, axis=-1)
        # Weighted sum of values
        context = tf.matmul(weights, values)
        return context, weights
# Example usage with LSTM
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, LSTM, Dense
timesteps = 10
features = 16
inputs = Input(shape=(timesteps, features))
lstm_out = LSTM(64, return_sequences=True)(inputs)
context, attention_weights = Attention()(lstm_out, lstm_out)
output = Dense(1)(context[:, -1, :])
model = Model(inputs, output)
model.compile(optimizer='adam', loss='mse')
model.summary()
```
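Note that in this sketch the query and the values are both the LSTM output sequence, so the layer performs **self-attention** over the encoded time steps; the final `Dense` layer then reads the context vector of the last time step to produce the forecast.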
---
### 7. Combining Attention with LSTM / GRU for Time Series Forecasting
While **LSTM** and **GRU** can capture temporal dependencies,
they compress the entire history into a single hidden state, so individual past time steps are not weighted explicitly.
However, in real-world **time series**, not every past event has the same importance.
Some periods or spikes (like holidays, sudden market shifts, or recent trends)
are far more relevant than others.
The solution?
**Combine LSTM/GRU with an Attention Mechanism**
to create a model that can both **learn long-term dependencies** and **focus on important time steps**.
> Imagine you’re predicting next month’s sales.
> You remember trends from past years (long-term memory, LSTM)
> but focus more on recent promotional months or holidays (attention).
>
> The hybrid Attention–LSTM model does exactly this.
---
#### Architecture Overview
The Attention–LSTM model works in three main steps:
1. **Sequence Encoding:**
LSTM (or GRU) encodes temporal dependencies from input sequences.
2. **Attention Layer:**
Assigns *importance weights* to each hidden state output of the LSTM.
3. **Weighted Context Vector:**
Combines the most relevant hidden states for the final prediction.
#### Mathematical Representation
Let’s formalize this process.
1. **LSTM Output Sequence**
$$
H = [h_1, h_2, ..., h_T]
$$
2. **Attention Weights**
$$
\alpha_t = \frac{\exp(\text{score}(h_t))}{\sum_{i=1}^{T} \exp(\text{score}(h_i))}
$$
3. **Context Vector (Weighted Average)**
$$
c = \sum_{t=1}^{T} \alpha_t h_t
$$
4. **Final Prediction**
$$
\hat{y} = W_y c + b_y
$$
The result $\hat{y}$ is your model’s time-series forecast (e.g., next month’s value). For a regression forecast the output layer is typically linear, as above; a sigmoid output only makes sense when the targets are scaled to $[0, 1]$.
---
#### Advantages of Attention–LSTM
| Feature | Description |
|----------|-------------|
| **Selective Focus** | The model learns which time steps are most important. |
| **Interpretability** | You can visualize attention weights to understand model reasoning. |
| **Improved Accuracy** | Helps capture both short- and long-term dependencies. |
| **Robustness** | Performs better on irregular or noisy time series. |
#### Implementation Example: Attention LSTM in TensorFlow
```python
import tensorflow as tf
from tensorflow.keras.layers import Input, LSTM, Dense, Layer
from tensorflow.keras.models import Model
# Custom Attention Layer
class Attention(Layer):
    def __init__(self):
        super(Attention, self).__init__()

    def call(self, inputs):
        query = inputs
        score = tf.matmul(query, query, transpose_b=True)
        weights = tf.nn.softmax(score, axis=-1)
        context = tf.matmul(weights, query)
        return context, weights
# Input and LSTM encoding
timesteps = 10
features = 8
inputs = Input(shape=(timesteps, features))
lstm_out = LSTM(64, return_sequences=True)(inputs)
# Attention mechanism applied on LSTM output
context, attention_weights = Attention()(lstm_out)
# Final dense layer for regression output
output = Dense(1)(context[:, -1, :])
model = Model(inputs=inputs, outputs=output)
model.compile(optimizer='adam', loss='mse')
model.summary()
```
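#### Example: Visualizing Attention Weights
Because the custom layer also returns its weights, interpretability can be explored with a second model that exposes them. This is one possible workflow rather than part of the original model: it reuses `inputs`, `attention_weights`, `timesteps`, and `features` from the block above, and `X_sample` is a random placeholder batch.
```python
import numpy as np
import matplotlib.pyplot as plt
from tensorflow.keras.models import Model

# Second model whose output is the attention weight matrix
attention_model = Model(inputs=inputs, outputs=attention_weights)

# Placeholder batch with a single sequence
X_sample = np.random.rand(1, timesteps, features).astype("float32")
weights = attention_model.predict(X_sample)[0]   # shape: (timesteps, timesteps)

# Average attention each past step receives across query positions
plt.bar(range(timesteps), weights.mean(axis=0))
plt.xlabel("Time step")
plt.ylabel("Average attention weight")
plt.title("Which past steps the model attends to")
plt.show()
```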
---
### 8. Hyperparameter Tuning & Regularization in Time Series Models
Deep learning models such as **LSTM**, **GRU**, and **Attention-LSTM** are powerful —
but also **sensitive to hyperparameters** like learning rate, number of layers, and batch size.
Poor tuning can cause:
- Overfitting: model memorizes training data but fails on unseen data
- Underfitting: model too simple to capture patterns
- Slow convergence: model trains too long or oscillates
To achieve reliable forecasting performance, we need **systematic tuning and regularization**.
---
#### What Are Hyperparameters?
Hyperparameters are **configuration variables** set before training begins.
| Category | Examples |
|-----------|-----------|
| **Model Architecture** | Number of layers, units per layer |
| **Training Process** | Learning rate, batch size, optimizer |
| **Regularization** | Dropout rate, early stopping, L2 penalty |
| **Sequence Settings** | Time steps, look-back window |
---
#### Key Hyperparameters in LSTM/GRU
| Hyperparameter | Description | Typical Range |
|----------------|-------------|----------------|
| **Learning Rate (η)** | Controls step size of updates | 1e-5 → 1e-2 |
| **Batch Size** | Number of samples per gradient update | 16 → 128 |
| **Hidden Units** | Neurons per LSTM/GRU layer | 32 → 256 |
| **Epochs** | Training cycles over dataset | 50 → 300 |
| **Dropout Rate** | Fraction of neurons to drop | 0.1 → 0.5 |
| **Optimizer** | Controls weight update method | Adam, RMSprop, SGD |
| **Lookback Window** | Number of time steps considered | 7 → 60 |
#### Tuning Strategy
A common process to tune models efficiently:
1. **Start Simple** — Train baseline LSTM with default parameters.
2. **Adjust Learning Rate** — Use learning rate scheduling or find the best η through experiments (see the sketch after this list).
3. **Tune Units & Layers** — Increase gradually until validation loss stops improving.
4. **Regularize** — Add dropout, L2 penalties, or early stopping.
5. **Optimize Batch Size** — Balance speed and stability.
6. **Monitor Validation Metrics** — Plot training vs validation loss over time.
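#### Example: Reducing the Learning Rate on a Plateau
For step 2, one common Keras option is to lower the learning rate automatically when the validation loss stops improving; a minimal sketch, assuming `model` and the train/validation arrays come from the earlier examples.
```python
from tensorflow.keras.callbacks import ReduceLROnPlateau

# Halve the learning rate after 5 epochs without validation improvement,
# but never go below 1e-5
lr_schedule = ReduceLROnPlateau(
    monitor='val_loss',
    factor=0.5,
    patience=5,
    min_lr=1e-5
)

history = model.fit(
    X_train, y_train,
    epochs=100,
    batch_size=32,
    validation_data=(X_val, y_val),
    callbacks=[lr_schedule],
    verbose=1
)
```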
#### Example: Learning Rate Update Formula
The parameter update for gradient descent:
$$
\theta_{t+1} = \theta_t - \eta \cdot \nabla_\theta J(\theta_t)
$$
where:
- $\theta_t$ = model parameters
- $\eta$ = learning rate
- $\nabla_\theta J(\theta_t)$ = gradient of loss function
If $\eta$ is too large → overshoot minimum
If $\eta$ is too small → slow convergence
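For instance, with $\eta = 0.1$, $\theta_t = 1.0$, and $\nabla_\theta J(\theta_t) = 2.0$, the update gives $\theta_{t+1} = 1.0 - 0.1 \times 2.0 = 0.8$; with $\eta = 1.0$ the same gradient would send the parameter all the way to $-1.0$, likely overshooting the minimum.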
---
#### Example: Implementing Dropout & Early Stopping
```python
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout
from tensorflow.keras.callbacks import EarlyStopping
# Build model with Dropout
model = Sequential()
model.add(LSTM(128, activation='tanh', return_sequences=True, input_shape=(timesteps, features)))
model.add(Dropout(0.3))
model.add(LSTM(64, activation='tanh'))
model.add(Dropout(0.3))
model.add(Dense(1))
# Compile model
model.compile(optimizer='adam', loss='mse')
# Early stopping callback
early_stop = EarlyStopping(
    monitor='val_loss',
    patience=10,                  # stop after 10 epochs of no improvement
    restore_best_weights=True
)
# Train model
history = model.fit(
    X_train, y_train,
    epochs=200,
    batch_size=32,
    validation_data=(X_val, y_val),
    callbacks=[early_stop],
    verbose=1
)
```
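#### Example: Simple Manual Grid Search
Beyond dropout and early stopping, a couple of hyperparameters can be tuned without extra libraries by looping over candidate values; a minimal sketch, assuming `timesteps`, `features`, and the train/validation arrays from the earlier examples (the `build_model` helper is defined here purely for illustration).
```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout
from tensorflow.keras.optimizers import Adam

def build_model(units, lr):
    model = Sequential()
    model.add(LSTM(units, activation='tanh', input_shape=(timesteps, features)))
    model.add(Dropout(0.3))
    model.add(Dense(1))
    model.compile(optimizer=Adam(learning_rate=lr), loss='mse')
    return model

best = None
for units in [32, 64, 128]:
    for lr in [1e-2, 1e-3, 1e-4]:
        model = build_model(units, lr)
        hist = model.fit(X_train, y_train, epochs=30, batch_size=32,
                         validation_data=(X_val, y_val), verbose=0)
        val_loss = min(hist.history['val_loss'])
        print(f"units={units}, lr={lr}: best val_loss={val_loss:.4f}")
        if best is None or val_loss < best[0]:
            best = (val_loss, units, lr)

print("Best (val_loss, units, lr):", best)
```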
---
### 9. Summary
In this module, we explored how **deep learning models** can be applied to **time series forecasting**, moving from simple recurrent structures to advanced attention-based architectures.
We covered:
1. The fundamentals of **sequence modeling**
2. Classic recurrent models — **RNN**, **LSTM**, **GRU**
3. Modern improvements — **Attention Mechanisms**
4. **Hybrid models** that combine memory and focus
5. Techniques for **tuning, regularization, and interpretation**
6. A **real-world case study** applying Attention-LSTM to airline data (see the notebook linked at the end of this module)
---
#### Evolution of Time Series Models
| Stage | Model | Key Capability | Limitation |
|--------|--------|----------------|-------------|
| **1️⃣** | RNN | Sequential modeling of data | Vanishing gradients, short-term memory |
| **2️⃣** | LSTM | Long-term dependency capture via cell states | Computationally expensive |
| **3️⃣** | GRU | Simplified memory gating (faster) | Slightly less expressive |
| **4️⃣** | Attention | Focus on relevant time steps dynamically | Heavier computation |
| **5️⃣** | Hybrid (Attention-LSTM/GRU) | Combines memory + interpretability | Requires careful tuning |
> “Each step in evolution wasn’t a replacement, it was an improvement.”
#### Comparative Summary
| Model | Memory | Interpretability | Computation | Use Case |
|--------|----------|------------------|--------------|-----------|
| **RNN** | Short-term | Low | Fast | Simple sequential data |
| **LSTM** | Long-term | Medium | Slow | Seasonal/long dependencies |
| **GRU** | Medium-long | Medium | Faster | Faster training with less data |
| **Attention** | Variable focus | High | Heavy | Long/complex dependencies |
| **Attention-LSTM** | Long + Focused | High | Moderate | Accurate, explainable forecasting |
#### Practical Guidelines for Future Projects
1. **Start simple, scale up**: always benchmark against a basic LSTM or ARIMA.
2. **Watch for overfitting**: use dropout, early stopping, and regularization.
3. **Visualize your model’s reasoning**: plot attention weights or feature importances.
4. **Tune systematically**: adjust learning rate, lookback window, and units gradually.
5. **Document experiments**: reproducibility is key in forecasting research.
6. **Interpret results**: always link predictions back to domain insights.
#### The Future of Time Series Forecasting
Modern forecasting is shifting towards:
- **Transformers for Time Series** (e.g., *Informer*, *Autoformer*)
- **Hybrid Neural–Statistical Models** (combining ARIMA + DL)
- **Multimodal Forecasting** (mixing images, text, and sensors)
- **Explainable AI (XAI)** for better decision trust
---
### Recommended Reading
- https://medium.com/data-science-collective/future-forecasting-of-time-series-using-lstm-a-quick-guide-for-business-leaders-370661c574c9
- https://medium.com/@myskill.id/time-series-prediction-lstm-759763728c48
### Notebook Example
This notebook provides a practical implementation of an Attention-based LSTM model for time series forecasting, demonstrating how attention mechanisms enhance interpretability and accuracy in sequential data prediction.
https://colab.research.google.com/drive/1mW44w-OBS_720puYvpx4oZ0i6ADrKbil?usp=sharing