# Module 4 (Time Series): LSTM, GRU, and Deep Learning
### 1. From ARIMA to Deep Learning
Traditional models like **ARIMA** work well for linear, stationary time series.
However, real-world data often shows **nonlinear** and **long-term** dependencies.
Deep learning models, such as **RNN**, **LSTM**, and **GRU**, are designed to handle these complex sequential patterns.
> “Unlike ARIMA, which looks only at the mathematical relationship between past and future values, LSTM learns temporal patterns directly from data.”
>
---
### 2. Sequence Problems
In **sequence problems**, the **order of datapoints matters**.
Example:
To predict next month’s sales, we must consider the previous months, not treat each as independent.
A standard neural network (MLP) cannot do this because:
- It assumes independence between inputs.
- It cannot retain information from past timesteps.
Hence, we need networks that **remember context** → RNNs and LSTMs.
---
### 3. Recurrent Neural Network (RNN)
#### Mechanism
RNNs pass information from one time step to the next:
$$
h_t = f(W_x x_t + W_h h_{t-1} + b)
$$
Where:
- $x_t$: input at time step *t*
- $h_{t-1}$: previous hidden state (memory)
- $h_t$: current hidden state
This allows RNNs to process **sequential data** like time series, text, or speech.
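#### Example: Simple RNN in TensorFlow
As a minimal sketch, the Keras `SimpleRNN` layer implements the update above (with $f = \tanh$ by default); the sequence length, feature count, and layer size below are illustrative placeholders.
```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import SimpleRNN, Dense

timesteps, features = 12, 1   # e.g., 12 past months of a single variable

# One recurrent layer computes h_t = tanh(W_x x_t + W_h h_{t-1} + b) at every step,
# and a Dense layer maps the final hidden state to a one-step-ahead forecast
model = Sequential([
    SimpleRNN(32, activation='tanh', input_shape=(timesteps, features)),
    Dense(1)
])
model.compile(optimizer='adam', loss='mse')
model.summary()
```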
---
### 4. Long Short-Term Memory (LSTM)
#### Motivation
LSTM (Long Short-Term Memory) networks were designed to **mitigate the vanishing gradient problem** that prevents plain RNNs from learning long-range dependencies. They introduce a special unit called the **cell state** that can carry information across many time steps.
This makes LSTMs ideal for long sequences where context from far back still matters.
> Imagine you’re baking a cake:
> you remember *“add eggs”* but forget *“what shirt you wore.”*
>
> LSTM works the same way — it decides what to **remember**, **forget**, and **output** based on context.
#### The Three Gates of LSTM
Each LSTM cell contains three “gates” that control information flow; the table below also lists the candidate and cell-state updates they feed into:
| Component | Function | Formula |
|------|-----------|----------|
| **Forget Gate** | Decides what old information to discard | $$ f_t = \sigma(W_f[h_{t-1}, x_t] + b_f) $$ |
| **Input Gate** | Decides which new information to add | $$ i_t = \sigma(W_i[h_{t-1}, x_t] + b_i) $$ |
| **Candidate State** | Generates potential new memory | $$ \tilde{C}_t = \tanh(W_c[h_{t-1}, x_t] + b_c) $$ |
| **Cell State Update** | Combines past and new information | $$ C_t = f_t * C_{t-1} + i_t * \tilde{C}_t $$ |
| **Output Gate** | Determines what to output next | $$ o_t = \sigma(W_o[h_{t-1}, x_t] + b_o), \quad h_t = o_t * \tanh(C_t) $$ |
#### Explanation of Symbols
| Symbol | Meaning |
|--------|----------|
| $x_t$ | Input at time step *t* |
| $h_{t-1}$ | Hidden state from previous step |
| $C_{t-1}$ | Previous cell state |
| $\sigma$ | Sigmoid activation (values between 0 and 1) |
| $\tanh$ | Hyperbolic tangent activation |
| $f_t, i_t, o_t$ | Forget, Input, and Output gate activations |
| $W$ | Weight matrices for each gate |
| $b$ | Bias terms |
---
#### Information Flow Summary
1. **Forget Gate:** Decides which information from the previous state should be forgotten.
2. **Input Gate:** Determines which new information to store in the cell state.
3. **Cell State Update:** Combines the remembered and new information.
4. **Output Gate:** Produces the output (hidden state) for the next step.
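#### Example: One LSTM Step in NumPy
To make the gate equations concrete, the sketch below computes a single LSTM step by hand for a toy cell; the weights are random placeholders rather than trained values.
```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n_hidden, n_input = 2, 3
x_t = rng.normal(size=n_input)       # input at time t
h_prev = np.zeros(n_hidden)          # previous hidden state h_{t-1}
C_prev = np.zeros(n_hidden)          # previous cell state C_{t-1}
z = np.concatenate([h_prev, x_t])    # the concatenation [h_{t-1}, x_t]

# Random placeholder weights and zero biases for the four transformations
W_f, W_i, W_c, W_o = (rng.normal(size=(n_hidden, n_hidden + n_input)) for _ in range(4))
b_f = b_i = b_c = b_o = np.zeros(n_hidden)

f_t = sigmoid(W_f @ z + b_f)              # forget gate
i_t = sigmoid(W_i @ z + b_i)              # input gate
C_tilde = np.tanh(W_c @ z + b_c)          # candidate state
C_t = f_t * C_prev + i_t * C_tilde        # cell state update
o_t = sigmoid(W_o @ z + b_o)              # output gate
h_t = o_t * np.tanh(C_t)                  # new hidden state

print("C_t:", C_t.round(3), "h_t:", h_t.round(3))
```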
---
#### Example: LSTM in TensorFlow
```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense
# Build a simple LSTM model
# (timesteps, features, X_train and y_train are assumed to be prepared beforehand;
#  see the sliding-window sketch below)
model = Sequential()
model.add(LSTM(50, activation='relu', input_shape=(timesteps, features)))
model.add(Dense(1))
# Compile and train
model.compile(optimizer='adam', loss='mean_squared_error')
model.fit(X_train, y_train, epochs=50, batch_size=16, verbose=1)
```
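The snippet above assumes `X_train` (shaped `(samples, timesteps, features)`) and `y_train` already exist. One common way to build them from a univariate series is a sliding "lookback" window, sketched below with a synthetic placeholder series.
```python
import numpy as np

def make_windows(values, lookback):
    """Turn a 1-D series into (samples, timesteps, 1) inputs and next-step targets."""
    X, y = [], []
    for i in range(len(values) - lookback):
        X.append(values[i:i + lookback])
        y.append(values[i + lookback])
    X = np.array(X)[..., np.newaxis]   # add a single feature dimension
    return X, np.array(y)

values = np.sin(np.linspace(0, 20, 200))      # placeholder series
timesteps, features = 12, 1
X_train, y_train = make_windows(values, timesteps)
print(X_train.shape, y_train.shape)           # (188, 12, 1) (188,)
```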
---
### 5. Gated Recurrent Unit (GRU)
The **Gated Recurrent Unit (GRU)** is a simplified version of the **LSTM** architecture.
It was introduced by Cho et al. (2014) to **reduce the computational cost** of LSTMs while keeping similar performance.
Unlike LSTM, GRU merges the **cell state** and **hidden state** into a single vector and uses only **two gates**:
- **Update Gate**
- **Reset Gate**
This makes GRUs:
- Faster to train
- Less prone to overfitting (fewer parameters)
- Often a better fit for smaller datasets
#### GRU Structure Overview
| Gate | Function | Formula |
|------|-----------|----------|
| **Update Gate** | Controls how much of the past information to keep | $$ z_t = \sigma(W_z \cdot [h_{t-1}, x_t]) $$ |
| **Reset Gate** | Decides how much past information to forget | $$ r_t = \sigma(W_r \cdot [h_{t-1}, x_t]) $$ |
| **Candidate Activation** | Generates a new memory content | $$ \tilde{h}_t = \tanh(W_h \cdot [r_t * h_{t-1}, x_t]) $$ |
| **Final Hidden State** | Blends old memory and new information | $$ h_t = (1 - z_t) * h_{t-1} + z_t * \tilde{h}_t $$ |
#### Explanation of Terms
| Symbol | Meaning |
|--------|----------|
| $x_t$ | Input at time step *t* |
| $h_{t-1}$ | Previous hidden state |
| $z_t$ | Update gate value |
| $r_t$ | Reset gate value |
| $\tilde{h}_t$ | Candidate hidden state |
| $\sigma$ | Sigmoid activation function |
| $\tanh$ | Hyperbolic tangent activation |
| $W_z, W_r, W_h$ | Weight matrices for each gate |
#### How GRU Works Step-by-Step
1. **Reset Gate ($r_t$):**
Controls how much of the previous memory to forget.
- If $r_t$ → 0 → the network forgets old information.
- If $r_t$ → 1 → the network remembers it.
2. **Update Gate ($z_t$):**
Determines how much of the new candidate $\tilde{h}_t$ replaces the old memory $h_{t-1}$.
- If $z_t$ → 0 → mostly old memory is kept.
- If $z_t$ → 1 → the new information dominates.
3. **Candidate State ($\tilde{h}_t$):**
Generates new information based on the current input and reset gate.
4. **Final Output ($h_t$):**
Blends the previous hidden state and the new candidate state.
---
#### Visual Comparison: LSTM vs GRU
| Feature | **LSTM** | **GRU** |
|----------|-----------|----------|
| Number of Gates | 3 (Forget, Input, Output) | 2 (Reset, Update) |
| Cell State | Yes (separate memory) | No (merged with hidden state) |
| Computation | Heavier | Lighter |
| Training Time | Slower | Faster |
| Memory Retention | Better for long sequences | Good for short to medium sequences |
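One way to see the "lighter computation" claim in practice is to compare parameter counts of equally sized layers; the sketch below uses arbitrary layer sizes.
```python
from tensorflow.keras import Input
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, GRU

timesteps, features = 10, 8

lstm_model = Sequential([Input(shape=(timesteps, features)), LSTM(64)])
gru_model = Sequential([Input(shape=(timesteps, features)), GRU(64)])

# An LSTM cell has four weight blocks (forget, input, candidate, output);
# a GRU cell has three (update, reset, candidate), so it is noticeably smaller.
print("LSTM parameters:", lstm_model.count_params())
print("GRU parameters :", gru_model.count_params())
```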
#### Example: GRU in TensorFlow
```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import GRU, Dense
# Build a simple GRU model
model = Sequential()
# Add GRU layer
model.add(GRU(50, activation='tanh', input_shape=(timesteps, features)))
# Add output layer
model.add(Dense(1))
# Compile the model
model.compile(optimizer='adam', loss='mean_squared_error')
# Train the model
history = model.fit(
    X_train,
    y_train,
    epochs=50,
    batch_size=16,
    validation_data=(X_val, y_val),
    verbose=1
)
```
---
### 6. Attention Mechanism
Even with powerful models like **LSTM** or **GRU**, there’s a limitation —
they must compress **all past information into a single hidden state vector**.
This makes it difficult for the model to:
- Remember long-term dependencies
- Focus on relevant parts of the input sequence
- Interpret which parts of the past data influence the prediction
The **Attention Mechanism** solves this by allowing the model to **“focus” on specific time steps or features** dynamically — much like how humans pay attention to certain words in a sentence or key points in a graph.
> Imagine you’re reading a paragraph to answer a question.
> You don’t reread every word — you focus only on **the relevant sentences**.
>
> Similarly, the attention mechanism lets a neural network assign **weights** to important time steps and ignore the rest.
#### Core Idea
At each prediction step, the model computes a **weighted average** of all previous hidden states —
the more relevant a hidden state is, the **higher its weight (attention score)**.
#### Mathematical Formulation
1. **Score Function** — measures how relevant each previous hidden state is to the current step:
$$
e_{t,i} = \text{score}(h_t, h_i)
$$
2. **Softmax Normalization** — converts scores into probabilities (attention weights):
$$
\alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_j \exp(e_{t,j})}
$$
3. **Context Vector** — combines all hidden states using the attention weights:
$$
c_t = \sum_i \alpha_{t,i} h_i
$$
4. **Final Output** — uses both context and current state to make prediction:
$$
\tilde{h}_t = \tanh(W_c [c_t ; h_t])
$$
#### Explanation of Symbols
| Symbol | Meaning |
|--------|----------|
| $h_t$ | Hidden state at time *t* |
| $e_{t,i}$ | Alignment score between current and past state |
| $\alpha_{t,i}$ | Attention weight (importance of each past step) |
| $c_t$ | Context vector (weighted sum of past hidden states) |
| $W_c$ | Weight matrix for combining context and hidden state |
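#### Example: Dot-Product Attention in NumPy
The four steps above can be traced with a tiny NumPy sketch using dot-product scores; random vectors stand in for real LSTM hidden states, and $W_c$ is a random placeholder.
```python
import numpy as np

rng = np.random.default_rng(42)

T, d = 5, 4                       # sequence length and hidden size
H = rng.normal(size=(T, d))       # past hidden states h_1 ... h_T
h_t = rng.normal(size=d)          # current hidden state

# 1. Dot-product scores e_{t,i} = h_t . h_i
scores = H @ h_t

# 2. Softmax normalization -> attention weights that sum to 1
weights = np.exp(scores) / np.exp(scores).sum()

# 3. Context vector: weighted sum of past hidden states
c_t = weights @ H

# 4. Combine context and current state
W_c = rng.normal(size=(d, 2 * d))
h_att = np.tanh(W_c @ np.concatenate([c_t, h_t]))

print("weights:", weights.round(3))
print("attentional state:", h_att.round(3))
```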
---
#### Common Attention Types
| Type | Description | Example Formula |
|------|--------------|----------------|
| **Dot Product (Luong)** | Computes similarity via dot product | $$ e_{t,i} = h_t^\top h_i $$ |
| **Additive (Bahdanau)** | Uses a small neural network to learn attention weights | $$ e_{t,i} = v_a^\top \tanh(W_1 h_t + W_2 h_i) $$ |
| **Scaled Dot Product** | Normalizes dot product by vector dimension | $$ e_{t,i} = \frac{h_t^\top h_i}{\sqrt{d_k}} $$ |
### Attention in Time Series Forecasting
For time series, attention allows the model to **focus on important time periods**,
for example recent spikes or seasonal peaks, rather than treating every point equally.
The model can learn:
- Which past observations are most influential
- Which features (variables) are relevant
- How to dynamically shift focus across time
---
#### Example: Add Attention Layer (Keras)
```python
import tensorflow as tf
from tensorflow.keras.layers import Layer
# Custom Attention Layer
class Attention(Layer):
    def __init__(self):
        super(Attention, self).__init__()

    def call(self, query, values):
        # Compute attention scores
        score = tf.matmul(query, values, transpose_b=True)
        weights = tf.nn.softmax(score, axis=-1)
        # Weighted sum of values
        context = tf.matmul(weights, values)
        return context, weights
# Example usage with LSTM
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, LSTM, Dense
timesteps = 10
features = 16
inputs = Input(shape=(timesteps, features))
lstm_out = LSTM(64, return_sequences=True)(inputs)
context, attention_weights = Attention()(lstm_out, lstm_out)
output = Dense(1)(context[:, -1, :])
model = Model(inputs, output)
model.compile(optimizer='adam', loss='mse')
model.summary()
```
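Note that in this sketch the query and the values are both the LSTM output sequence, so the layer performs **self-attention** over the encoded time steps; the final `Dense` layer then reads the context vector of the last time step to produce the forecast.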
---
### 7. Combining Attention with LSTM / GRU for Time Series Forecasting
While **LSTM** and **GRU** can capture temporal dependencies,
they compress the entire history into a single hidden state, so individual past time steps are not weighted explicitly.
However, in real-world **time series**, not every past event has the same importance.
Some periods or spikes (like holidays, sudden market shifts, or recent trends)
are far more relevant than others.
The solution?
**Combine LSTM/GRU with an Attention Mechanism**
to create a model that can both **learn long-term dependencies** and **focus on important time steps**.
> Imagine you’re predicting next month’s sales.
> You remember trends from past years (long-term memory, LSTM)
> but focus more on recent promotional months or holidays (attention).
>
> The hybrid Attention–LSTM model does exactly this.
---
#### Architecture Overview
The Attention–LSTM model works in three main steps:
1. **Sequence Encoding:**
LSTM (or GRU) encodes temporal dependencies from input sequences.
2. **Attention Layer:**
Assigns *importance weights* to each hidden state output of the LSTM.
3. **Weighted Context Vector:**
Combines the most relevant hidden states for the final prediction.
#### Mathematical Representation
Let’s formalize this process.
1. **LSTM Output Sequence**
$$
H = [h_1, h_2, ..., h_T]
$$
2. **Attention Weights**
$$
\alpha_t = \frac{\exp(\text{score}(h_t))}{\sum_{i=1}^{T} \exp(\text{score}(h_i))}
$$
3. **Context Vector (Weighted Average)**
$$
c = \sum_{t=1}^{T} \alpha_t h_t
$$
4. **Final Prediction**
$$
\hat{y} = W_y c + b_y
$$
The result $\hat{y}$ is your model’s time-series forecast (e.g., next month’s value). For a regression forecast the output layer is typically linear, as above; a sigmoid output only makes sense when the targets are scaled to $[0, 1]$.
---
#### Advantages of Attention–LSTM
| Feature | Description |
|----------|-------------|
| **Selective Focus** | The model learns which time steps are most important. |
| **Interpretability** | You can visualize attention weights to understand model reasoning. |
| **Improved Accuracy** | Helps capture both short- and long-term dependencies. |
| **Robustness** | Performs better on irregular or noisy time series. |
#### Implementation Example: Attention LSTM in TensorFlow
```python
import tensorflow as tf
from tensorflow.keras.layers import Input, LSTM, Dense, Layer
from tensorflow.keras.models import Model
# Custom Attention Layer
class Attention(Layer):
    def __init__(self):
        super(Attention, self).__init__()

    def call(self, inputs):
        query = inputs
        score = tf.matmul(query, query, transpose_b=True)
        weights = tf.nn.softmax(score, axis=-1)
        context = tf.matmul(weights, query)
        return context, weights
# Input and LSTM encoding
timesteps = 10
features = 8
inputs = Input(shape=(timesteps, features))
lstm_out = LSTM(64, return_sequences=True)(inputs)
# Attention mechanism applied on LSTM output
context, attention_weights = Attention()(lstm_out)
# Final dense layer for regression output
output = Dense(1)(context[:, -1, :])
model = Model(inputs=inputs, outputs=output)
model.compile(optimizer='adam', loss='mse')
model.summary()
```
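#### Example: Visualizing Attention Weights
Because the custom layer also returns its weights, interpretability can be explored with a second model that exposes them. This is one possible workflow rather than part of the original model: it reuses `inputs`, `attention_weights`, `timesteps`, and `features` from the block above, and `X_sample` is a random placeholder batch.
```python
import numpy as np
import matplotlib.pyplot as plt
from tensorflow.keras.models import Model

# Second model whose output is the attention weight matrix
attention_model = Model(inputs=inputs, outputs=attention_weights)

# Placeholder batch with a single sequence
X_sample = np.random.rand(1, timesteps, features).astype("float32")
weights = attention_model.predict(X_sample)[0]   # shape: (timesteps, timesteps)

# Average attention each past step receives across query positions
plt.bar(range(timesteps), weights.mean(axis=0))
plt.xlabel("Time step")
plt.ylabel("Average attention weight")
plt.title("Which past steps the model attends to")
plt.show()
```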
---
### 8. Hyperparameter Tuning & Regularization in Time Series Models
Deep learning models such as **LSTM**, **GRU**, and **Attention-LSTM** are powerful —
but also **sensitive to hyperparameters** like learning rate, number of layers, and batch size.
Poor tuning can cause:
- Overfitting: model memorizes training data but fails on unseen data
- Underfitting: model too simple to capture patterns
- Slow convergence: model trains too long or oscillates
To achieve reliable forecasting performance, we need **systematic tuning and regularization**.
---
#### What Are Hyperparameters?
Hyperparameters are **configuration variables** set before training begins.
| Category | Examples |
|-----------|-----------|
| **Model Architecture** | Number of layers, units per layer |
| **Training Process** | Learning rate, batch size, optimizer |
| **Regularization** | Dropout rate, early stopping, L2 penalty |
| **Sequence Settings** | Time steps, look-back window |
---
#### Key Hyperparameters in LSTM/GRU
| Hyperparameter | Description | Typical Range |
|----------------|-------------|----------------|
| **Learning Rate (η)** | Controls step size of updates | 1e-5 → 1e-2 |
| **Batch Size** | Number of samples per gradient update | 16 → 128 |
| **Hidden Units** | Neurons per LSTM/GRU layer | 32 → 256 |
| **Epochs** | Training cycles over dataset | 50 → 300 |
| **Dropout Rate** | Fraction of neurons to drop | 0.1 → 0.5 |
| **Optimizer** | Controls weight update method | Adam, RMSprop, SGD |
| **Lookback Window** | Number of time steps considered | 7 → 60 |
#### Tuning Strategy
A common process to tune models efficiently:
1. **Start Simple** — Train baseline LSTM with default parameters.
2. **Adjust Learning Rate** — Use learning rate scheduling or find the best η through experiments (see the sketch after this list).
3. **Tune Units & Layers** — Increase gradually until validation loss stops improving.
4. **Regularize** — Add dropout, L2 penalties, or early stopping.
5. **Optimize Batch Size** — Balance speed and stability.
6. **Monitor Validation Metrics** — Plot training vs validation loss over time.
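#### Example: Reducing the Learning Rate on a Plateau
For step 2, one common Keras option is to lower the learning rate automatically when the validation loss stops improving; a minimal sketch, assuming `model` and the train/validation arrays come from the earlier examples.
```python
from tensorflow.keras.callbacks import ReduceLROnPlateau

# Halve the learning rate after 5 epochs without validation improvement,
# but never go below 1e-5
lr_schedule = ReduceLROnPlateau(
    monitor='val_loss',
    factor=0.5,
    patience=5,
    min_lr=1e-5
)

history = model.fit(
    X_train, y_train,
    epochs=100,
    batch_size=32,
    validation_data=(X_val, y_val),
    callbacks=[lr_schedule],
    verbose=1
)
```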
#### Example: Learning Rate Update Formula
The parameter update for gradient descent:
$$
\theta_{t+1} = \theta_t - \eta \cdot \nabla_\theta J(\theta_t)
$$
where:
- $\theta_t$ = model parameters
- $\eta$ = learning rate
- $\nabla_\theta J(\theta_t)$ = gradient of loss function
If $\eta$ is too large → overshoot minimum
If $\eta$ is too small → slow convergence
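For instance, with $\eta = 0.1$, $\theta_t = 1.0$, and $\nabla_\theta J(\theta_t) = 2.0$, the update gives $\theta_{t+1} = 1.0 - 0.1 \times 2.0 = 0.8$; with $\eta = 1.0$ the same gradient would send the parameter all the way to $-1.0$, likely overshooting the minimum.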
---
#### Example: Implementing Dropout & Early Stopping
```python
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout
from tensorflow.keras.callbacks import EarlyStopping
# Build model with Dropout
model = Sequential()
model.add(LSTM(128, activation='tanh', return_sequences=True, input_shape=(timesteps, features)))
model.add(Dropout(0.3))
model.add(LSTM(64, activation='tanh'))
model.add(Dropout(0.3))
model.add(Dense(1))
# Compile model
model.compile(optimizer='adam', loss='mse')
# Early stopping callback
early_stop = EarlyStopping(
    monitor='val_loss',
    patience=10,                  # stop after 10 epochs of no improvement
    restore_best_weights=True
)
# Train model
history = model.fit(
    X_train, y_train,
    epochs=200,
    batch_size=32,
    validation_data=(X_val, y_val),
    callbacks=[early_stop],
    verbose=1
)
```
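#### Example: Simple Manual Grid Search
Beyond dropout and early stopping, a couple of hyperparameters can be tuned without extra libraries by looping over candidate values; a minimal sketch, assuming `timesteps`, `features`, and the train/validation arrays from the earlier examples (the `build_model` helper is defined here purely for illustration).
```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout
from tensorflow.keras.optimizers import Adam

def build_model(units, lr):
    model = Sequential()
    model.add(LSTM(units, activation='tanh', input_shape=(timesteps, features)))
    model.add(Dropout(0.3))
    model.add(Dense(1))
    model.compile(optimizer=Adam(learning_rate=lr), loss='mse')
    return model

best = None
for units in [32, 64, 128]:
    for lr in [1e-2, 1e-3, 1e-4]:
        model = build_model(units, lr)
        hist = model.fit(X_train, y_train, epochs=30, batch_size=32,
                         validation_data=(X_val, y_val), verbose=0)
        val_loss = min(hist.history['val_loss'])
        print(f"units={units}, lr={lr}: best val_loss={val_loss:.4f}")
        if best is None or val_loss < best[0]:
            best = (val_loss, units, lr)

print("Best (val_loss, units, lr):", best)
```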
---
### 9. Summary
In this module, we explored how **deep learning models** can be applied to **time series forecasting**, moving from simple recurrent structures to advanced attention-based architectures.
We covered:
1. The fundamentals of **sequence modeling**
2. Classic recurrent models — **RNN**, **LSTM**, **GRU**
3. Modern improvements — **Attention Mechanisms**
4. **Hybrid models** that combine memory and focus
5. Techniques for **tuning, regularization, and interpretation**
6. A **real-world case study** applying Attention-LSTM to airline data (see the notebook linked at the end of this module)
---
#### Evolution of Time Series Models
| Stage | Model | Key Capability | Limitation |
|--------|--------|----------------|-------------|
| **1️⃣** | RNN | Sequential modeling of data | Vanishing gradients, short-term memory |
| **2️⃣** | LSTM | Long-term dependency capture via cell states | Computationally expensive |
| **3️⃣** | GRU | Simplified memory gating (faster) | Slightly less expressive |
| **4️⃣** | Attention | Focus on relevant time steps dynamically | Heavier computation |
| **5️⃣** | Hybrid (Attention-LSTM/GRU) | Combines memory + interpretability | Requires careful tuning |
> “Each step in evolution wasn’t a replacement, it was an improvement.”
#### Comparative Summary
| Model | Memory | Interpretability | Computation | Use Case |
|--------|----------|------------------|--------------|-----------|
| **RNN** | Short-term | Low | Fast | Simple sequential data |
| **LSTM** | Long-term | Medium | Slow | Seasonal/long dependencies |
| **GRU** | Medium-long | Medium | Faster | Faster training with less data |
| **Attention** | Variable focus | High | Heavy | Long/complex dependencies |
| **Attention-LSTM** | Long + Focused | High | Moderate | Accurate, explainable forecasting |
#### Practical Guidelines for Future Projects
1. **Start simple, scale up**: always benchmark against a basic LSTM or ARIMA.
2. **Watch for overfitting**: use dropout, early stopping, and regularization.
3. **Visualize your model’s reasoning**: plot attention weights or feature importances.
4. **Tune systematically**: adjust learning rate, lookback window, and units gradually.
5. **Document experiments**: reproducibility is key in forecasting research.
6. **Interpret results**: always link predictions back to domain insights.
#### The Future of Time Series Forecasting
Modern forecasting is shifting towards:
- **Transformers for Time Series** (e.g., *Informer*, *Autoformer*)
- **Hybrid Neural–Statistical Models** (combining ARIMA + DL)
- **Multimodal Forecasting** (mixing images, text, and sensors)
- **Explainable AI (XAI)** for better decision trust
---
### Recommended Reading
- https://medium.com/data-science-collective/future-forecasting-of-time-series-using-lstm-a-quick-guide-for-business-leaders-370661c574c9
- https://medium.com/@myskill.id/time-series-prediction-lstm-759763728c48
### Notebook Example
This notebook provides a practical implementation of an Attention-based LSTM model for time series forecasting, demonstrating how attention mechanisms enhance interpretability and accuracy in sequential data prediction.
https://colab.research.google.com/drive/1mW44w-OBS_720puYvpx4oZ0i6ADrKbil?usp=sharing