# Pair Trading: Cointegration & Trading Algorithms
## Introduction
Pair trading is a market-neutral strategy that exploits **cointegration**, a statistical relationship between two assets. Cointegrated assets share a long-run equilibrium: a linear combination of their prices is stationary, so even though each price wanders, the spread between them tends to revert to its historical mean. This creates trading opportunities when the prices temporarily diverge.
In this note, we'll:
* Identify cointegrated asset pairs using statistical tests.
* Backtest three trading models:
* Least Squares (LS) regression (fixed parameters)
* Rolling LS regression (time-varying parameters)
* Kalman Filter (time-varying parameters)
## Cointegration
### Cointegration Test
**Goal:** To determine if two non-stationary time series, $x_t$ and $y_t$, exhibit a long-term equilibrium relationship. This means that despite short-term fluctuations, they tend to move together over time. This equilibrium relationship is known as cointegration.
### Statistical Model (Engle-Granger Two-Step Method)
1. **Estimate the Long-Term Relationship:** We employ linear regression to model the long-term relationship between the two time series: $$y_t = \alpha + \beta x_t + r_t$$ where $r_t$ represents the residuals or the deviations from the estimated long-term relationship.
2. **Test for Stationarity of Residuals:** We apply a unit root test, such as the Augmented Dickey-Fuller (ADF) test, to the residuals $r_t$ to assess if they are stationary.
#### Interpretation
* **Cointegration:** If the residuals are stationary (ADF test p-value is less than a chosen significance level, typically 0.05), we conclude that the two time series are cointegrated.
* **No Cointegration:** If the residuals are non-stationary (ADF test p-value is greater than or equal to the significance level), we cannot conclude that the series are cointegrated.
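For concreteness, the two steps can be written out explicitly. The sketch below is illustrative only (the helper name `engle_granger` and the `0.05` cutoff are assumptions); note that a plain ADF p-value on estimated residuals is only an approximation, and the examples that follow rely on `statsmodels`' built-in `coint`, which applies the appropriate residual-based critical values.
```python=
import statsmodels.api as sm
from statsmodels.tsa.stattools import adfuller

def engle_granger(y, x, alpha=0.05):
    """Two-step Engle-Granger check for pandas Series y (dependent) and x (independent)."""
    # Step 1: estimate the long-run relationship y_t = a + b * x_t + r_t
    ols = sm.OLS(y, sm.add_constant(x)).fit()
    residuals = ols.resid
    # Step 2: unit-root (ADF) test on the residuals
    adf_stat, p_value = adfuller(residuals)[:2]
    return {"params": ols.params, "adf_pvalue": p_value, "cointegrated": p_value < alpha}
```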
### Examples: Cointegration in Practice
#### SPY vs. IVV
These two ETFs track the S&P 500 index. Due to their identical underlying assets, we anticipate a cointegrated relationship between them.
```python=
import yfinance as yf
import matplotlib.pyplot as plt
import statsmodels.api as sm
# Fetch adjusted closing prices
tickers = ["SPY", "IVV"]
prices = yf.download(tickers, start="2022-01-01", end="2022-12-31")['Adj Close']
# Visualize prices
plt.figure(figsize=(12, 5))
plt.plot(prices)
plt.title("SPY vs. IVV Prices (2022)")
plt.xlabel("Date")
plt.ylabel("Adjusted Closing Price")
plt.legend(tickers)
plt.grid(axis='y')
plt.show()
```

##### Cointegration Test and Long-Run (Cointegrating) Regression
```python=+
def cointegration_test(data, ticker1, ticker2):
    # Cointegration test (Engle-Granger two-step method)
    result = sm.tsa.stattools.coint(data[ticker1], data[ticker2])
    p_value = result[1]
    print(f"Cointegration Test p-value ({ticker1} vs. {ticker2}): {p_value:.4f}")
    # Step 1 of Engle-Granger: fit the long-run (cointegrating) regression with OLS
    X = sm.add_constant(data[ticker1])
    model = sm.OLS(data[ticker2], X).fit()
    residuals = model.resid
    # Display results
    print(f"Estimated model: {ticker2} = {model.params.iloc[0]:.4f} + {model.params.iloc[1]:.4f} * {ticker1} + Residuals")
    # Plot residuals
    plt.figure(figsize=(12, 5))
    plt.plot(residuals)
    plt.title(f"Residuals from Long-Run Regression ({ticker1} vs. {ticker2})")
    plt.xlabel("Date")
    plt.ylabel("Residuals")
    plt.axhline(0, color='red', linestyle='--')  # Add a horizontal line at 0
    plt.grid(axis='y')
    plt.show()

cointegration_test(prices, "SPY", "IVV")
```
```
Cointegration Test p-value (SPY vs. IVV): 0.0025
Estimated model: IVV = -0.4236 + 1.0049 * SPY + Residuals
```

The very low p-value (0.0025) strongly suggests cointegration between SPY and IVV. The residuals plot supports this: the residuals fluctuate around zero with no discernible trend, consistent with stationarity.
#### GLD vs. IAU
These ETFs both track gold prices, leading us to expect cointegration between them as well.
```python=
# Fetch adjusted closing prices
tickers = ["GLD", "IAU"]
prices = yf.download(tickers, start="2020-01-01", end="2020-12-31")['Adj Close']
# Visualize original and normalized prices
fig, axes = plt.subplots(1, 2, figsize=(15, 5))
# Original Prices
axes[0].plot(prices)
axes[0].set_title("GLD vs. IAU Prices (2020)")
axes[0].set_xlabel("Date")
axes[0].set_ylabel("Adjusted Closing Price")
axes[0].legend(tickers)
axes[0].grid(axis='y')
# Normalized Prices (for clearer comparison)
normalized_prices = prices / prices.iloc[0]
axes[1].plot(normalized_prices)
axes[1].set_title("GLD vs. IAU Normalized Prices (2020)")
axes[1].set_xlabel("Date")
axes[1].set_ylabel("Normalized Price")
axes[1].legend(tickers)
axes[1].grid(axis='y')
plt.show()
```

##### Cointegration Test and Long-Run Regression
```python=
cointegration_test(prices, "GLD", "IAU")
```
```
Cointegration Test p-value (GLD vs. IAU): 0.0000
Estimated model: IAU = -0.1377 + 0.2040 * GLD + Residuals
```

Similar to the previous example, the p-value close to zero offers compelling evidence for cointegration between GLD and IAU. The residuals plot, exhibiting mean-reverting behavior around zero, further solidifies this conclusion.
## Trading Algorithms
### Least Squares Regression (Fixed Parameters)
When two assets are cointegrated, their log prices exhibit a linear relationship that tends to persist over time. We can express this relationship mathematically as:
$$
y_{1t} = \mu + \gamma y_{2t} + \epsilon_t
$$
where:
* $y_{1t}$ and $y_{2t}$ represent the log prices of the two assets at time $t$.
* $\mu$ (Intercept): The constant offset in the relationship; the expected value of $y_{1t}$ when the log price of the second asset $y_{2t}$ is zero.
* $\gamma$ (Hedge Ratio): Quantifies the change in the log price of the first asset $y_{1t}$ for a one-unit change in the log price of the second asset $y_{2t}$.
* $\epsilon_t$ (Residual Error): Accounts for deviations from the linear relationship caused by random market noise or temporary inefficiencies.
To obtain the best estimates for $\mu$ and $\gamma$, we employ Ordinary Least Squares (OLS) regression, which aims to:
$$
\mathop{\text{min}}_{\mu,\gamma} \ \sum^T_{t=1} \left(y_{1t} - (\mu + \gamma y_{2t}) \right)^2
$$
OLS identifies the line that minimizes the cumulative squared distances between the actual data points and the values predicted by the model.
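For this simple two-variable case, the minimization has the standard closed-form solution
$$
\hat{\gamma} = \frac{\sum_{t=1}^{T}\left(y_{2t}-\bar{y}_2\right)\left(y_{1t}-\bar{y}_1\right)}{\sum_{t=1}^{T}\left(y_{2t}-\bar{y}_2\right)^2}, \qquad \hat{\mu} = \bar{y}_1 - \hat{\gamma}\,\bar{y}_2
$$
where $\bar{y}_1$ and $\bar{y}_2$ denote the sample means of the log prices; in practice we simply let a regression routine compute these estimates.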
#### Implementation
1. **Data Preparation:**
* We begin by loading historical adjusted closing prices for a pair of potentially cointegrated assets. For this illustration, we'll use China Mobile Limited (`0941.HK`) and HSBC Holdings plc (`0005.HK`).
* Next, we convert these prices to log prices. This transformation helps stabilize variance and enhances the suitability of the data for linear regression analysis.
```python=
import yfinance as yf
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
# Load price data
tickers = ["0941.HK", "0005.HK"]
prices = yf.download(tickers, start="2015-01-01", end="2023-12-31")["Adj Close"]
logprices = np.log(prices)
# Plot log prices
plt.figure(figsize=(12, 6))
plt.plot(logprices)
plt.title("Log Prices of 0941.HK and 0005.HK")
plt.xlabel("Date")
plt.ylabel("Log Price")
plt.legend(tickers)
plt.grid(axis='y')
plt.show()
```

2. **Model Estimation:**
* The data is partitioned into training and testing sets, typically using a 70/30 split.
* OLS regression is performed on the training data to estimate the intercept $\mu$ and the hedge ratio $\gamma$.
```python=+
# Split data into training and testing sets
split_ratio = 0.7
split_index = int(split_ratio * len(logprices))
train_data, test_data = logprices[:split_index], logprices[split_index:]
# Train linear regression model
X = np.array(train_data["0005.HK"]).reshape(-1, 1)
y = np.array(train_data["0941.HK"])
model = LinearRegression().fit(X, y)
# Extract coefficients
mu, gamma = model.intercept_, model.coef_[0]
print(f'Estimated parameters: mu = {mu:.4f}, gamma = {gamma:.4f}')
```
```
Estimated parameters: mu = 3.0142, gamma = 0.2429
```
3. **Spread Calculation and Visualization:**
* The estimated model is applied to the entire dataset to compute the spread.
* The spread is then visualized to assess its mean-reverting behavior, a key characteristic of cointegrated pairs.
```python=+
# Calculate spread using the model
spread = logprices["0941.HK"] - (mu + gamma * logprices["0005.HK"])
# Plot the spread
plt.figure(figsize=(12, 6))
plt.plot(spread)
plt.title("Spread between 0941.HK and 0005.HK")
plt.xlabel("Date")
plt.ylabel("Spread")
plt.axvline(x = logprices.index[split_index], color='r', linestyle='--')
plt.show()
```

#### Limitations of Fixed Parameters
* **Changing Market Dynamics:** The linear relationship between assets may not remain constant. A fixed-parameter model might struggle to capture shifts in market conditions or evolving cointegration relationships.
* **Sensitivity to Outliers:** OLS can be susceptible to the influence of outliers, potentially leading to biased parameter estimates.
* **Limited Adaptability:** The model's inability to adapt to changing market conditions can result in missed trading opportunities or the generation of inaccurate signals.
### Adapting to Dynamic Relationships: Time-Varying Parameters
In the real world, the relationship between asset prices within a cointegrated pair is seldom static. The fixed parameters derived from least squares regression can become obsolete as market conditions fluctuate. To counter this, we introduce models that accommodate time-varying parameters.
#### Model Enhancement
We enhance our pair trading model to allow the parameters to evolve over time:
$$
\begin{split}
y_{1t} &= \mu_t + \gamma_t x_t + \epsilon_{1t}\\
y_{2t} &= x_t + \epsilon_{2t}
\end{split}
$$
where:
* $\mu_t$: Time-varying intercept
* $\gamma_t$: Time-varying hedge ratio
* $x_t$: A stochastic trend component, often represented as a random walk, capturing the non-stationary common trend influencing both asset prices.
* $\epsilon_{1t}, \epsilon_{2t}$: Residual error terms
The random walk component, $x_t$, is defined as:
$$
x_t = x_{t-1} + w_t
$$
where $w_t$ is the random shock or innovation at time $t$.
#### Methods for Estimating Time-Varying Parameters
1. **Rolling Least Squares (LS):**
* Re-estimates $\mu$ and $\gamma$ periodically over a fixed-length lookback window.
* As the window advances, the estimates adjust to recent price movements.
* Proves effective in capturing gradual shifts in cointegration relationships.
$$
\mathop{\text{min}}_{\mu,\gamma}\ \sum^{t_0}_{t=1+t_0 - T_{\text{lookback}}} \left(y_{1t} - (\mu + \gamma y_{2t}) \right)^2
$$
2. **Kalman Filter:**
* A sophisticated recursive algorithm for estimating the hidden state of a system (in our context, $\mu_t$ and $\gamma_t$) based on noisy observations (asset prices).
* Models the evolution of parameters as a state-space process.
* While computationally more intensive than rolling LS, it excels at handling abrupt changes in parameters.
$$
\begin{split}
\alpha_{t+1} &= T_t\alpha_t + R_t\eta_t \quad \text{(State Equation)}\\
y_{1t} &= Z_t\alpha_t + \epsilon_t \quad \text{(Observation Equation)}
\end{split}
$$
where:
* $\alpha_t = \begin{bmatrix}\mu_t \\ \gamma_t \end{bmatrix}$: The state vector representing the parameters at time $t$.
* $T_t = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$: The state transition matrix describing how the state evolves from one time step to the next.
* $R_t = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$: The state noise matrix modeling the uncertainty in the state evolution.
* $\eta_t \sim \mathcal{N}(0, Q_t)$: The state noise, assumed to be normally distributed with zero mean and covariance matrix $Q_t = \begin{bmatrix} \sigma_1^2 & 0 \\ 0 & \sigma_2^2 \end{bmatrix}$
* $Z_t = \begin{bmatrix} 1 & y_{2t} \end{bmatrix}$: The observation matrix linking the state to the observed log price $y_{1t}$
* $\epsilon_t \sim \mathcal{N}(0, h)$: The observation noise, also assumed to be normally distributed with zero mean and variance $h$.
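To make the recursion concrete, here is a minimal hand-rolled filter for exactly this state-space model. It is a sketch only: the noise settings `q` and `h` and the function name are illustrative assumptions, and the examples below use the `pykalman` library rather than this helper.
```python=
import numpy as np

def kalman_filter_pair(y1, y2, q=(1e-5, 1e-4), h=1e-3):
    """Filter alpha_t = [mu_t, gamma_t] for y1_t = [1, y2_t] @ alpha_t + eps_t (random-walk state)."""
    T = len(y1)
    Q = np.diag(q)      # state noise covariance Q_t (held constant)
    a = np.zeros(2)     # state mean estimate
    P = np.eye(2)       # state covariance estimate
    states = np.zeros((T, 2))
    for t in range(T):
        # Prediction step: T_t = R_t = I, so the state prediction equals the previous estimate
        a_pred, P_pred = a, P + Q
        # Update step with observation y1_t and Z_t = [1, y2_t]
        Z = np.array([1.0, y2[t]])
        innovation = y1[t] - Z @ a_pred      # prediction error
        S = Z @ P_pred @ Z + h               # innovation variance
        K = P_pred @ Z / S                   # Kalman gain
        a = a_pred + K * innovation
        P = P_pred - np.outer(K, Z) @ P_pred
        states[t] = a
    return states  # column 0: mu_t, column 1: gamma_t
```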
#### Evaluating Estimation Methods with Synthetic Data
Before deploying these methods on real market data, we will gauge their performance using synthetic data. This controlled environment allows us to define the true values of $\mu_t$ and $\gamma_t$ and evaluate how effectively each method tracks these changing parameters.
```python=
from scipy.stats import norm
# Set seed for reproducibility
np.random.seed(42)
# Generate time-varying parameters
T_syn = 1000
date_range = pd.date_range("2010-01-01", periods=T_syn)
mu_true = pd.Series(np.concatenate([
np.repeat(0.6, T_syn // 4),
np.repeat(0.4, T_syn // 2),
np.repeat(0.6, T_syn // 4)
]), index=date_range, name="mu-true")
gamma_true = pd.Series(np.concatenate([
np.repeat(0.2, T_syn // 2),
np.repeat(0.8, T_syn // 2)
]), index=date_range, name="gamma-true")
# Smooth parameters for realistic simulation
window_size = 50
mu_true = mu_true.rolling(window_size, min_periods=1, center=True).mean()
gamma_true = gamma_true.rolling(window_size, min_periods=1, center=True).mean()
# Visualize true parameters
plt.figure(figsize=(12, 6))
plt.plot(mu_true, color="blue", label="μ_true")
plt.plot(gamma_true, color="green", label="γ_true")
plt.title("True Time-Varying Parameters")
plt.xlabel("Date")
plt.ylabel("Value")
plt.legend()
plt.grid(axis='y')
plt.show()
```

```python=+
# Simulate synthetic price data
trend = pd.Series(np.cumsum(norm.rvs(size=T_syn, scale=0.01)), index=date_range)
residuals = pd.DataFrame(norm.rvs(size=(T_syn, 2), scale=0.05), index=date_range, columns=["ε1", "ε2"])
residuals["ε1"] = 0.9 * residuals["ε1"].shift(1) + residuals["ε1"] # Add some autocorrelation
residuals["ε2"] = 0.9 * residuals["ε2"].shift(1) + residuals["ε2"]
Y_syn = pd.DataFrame({
"Y1": mu_true + gamma_true * trend + residuals["ε1"],
"Y2": trend + residuals["ε2"]
}).ffill().bfill() # Forward and backward fill any missing values
# Visualize synthetic price data
plt.figure(figsize=(12, 6))
plt.plot(Y_syn)
plt.title("Synthetic Log Price Data")
plt.xlabel("Date")
plt.ylabel("Log Price")
plt.legend(["Y1", "Y2"])
plt.grid(axis='y')
plt.show()
```

#### Implementation: Rolling LS and Kalman Filter
##### Rolling Least Squares
```python=
# Rolling LS estimation
T_lookback = 400 # Lookback window length
T_shift = 10 # Frequency of updates
mu_rolling_LS = pd.Series(np.nan, index=date_range, name="mu-rolling-LS")
gamma_rolling_LS = pd.Series(np.nan, index=date_range, name="gamma-rolling-LS")
# Loop through the time periods
for t0 in np.arange(T_lookback, T_syn - T_shift + 1, T_shift):
    window = slice(t0 - T_lookback, t0)
    X = Y_syn.iloc[window, 1].replace([np.inf, -np.inf], np.nan).dropna()  # Handle potential infinite values
    X = sm.add_constant(X)
    y = Y_syn.iloc[window, 0][X.index]  # Align y to match X's index
    # Estimate parameters using OLS
    ls_coeffs = np.linalg.lstsq(X, y, rcond=None)[0]
    # Store estimated parameters
    mu_rolling_LS.iloc[t0] = ls_coeffs[0]
    gamma_rolling_LS.iloc[t0] = ls_coeffs[1]
# Forward fill initial NaN values
mu_rolling_LS = mu_rolling_LS.ffill()
gamma_rolling_LS = gamma_rolling_LS.ffill()
# Optional: Smoothing (e.g., moving average) for less noisy estimates
from scipy.signal import lfilter
smoothing_window = T_lookback // 4
filter_coeffs = np.repeat(1.0 / smoothing_window, smoothing_window)
mu_rolling_LS_smoothed = lfilter(filter_coeffs, 1, mu_rolling_LS)
mu_rolling_LS_smoothed = pd.Series(mu_rolling_LS_smoothed, index=date_range).bfill()
gamma_rolling_LS_smoothed = lfilter(filter_coeffs, 1, gamma_rolling_LS)
gamma_rolling_LS_smoothed = pd.Series(gamma_rolling_LS_smoothed, index=date_range).bfill()
# Visualize results
fig, axes = plt.subplots(nrows=2, ncols=1, figsize=(12, 8))
axes[0].plot(mu_true, label="True μ", linestyle="--")
axes[0].plot(mu_rolling_LS_smoothed, label="Rolling LS μ")
axes[0].set_title("Rolling LS Estimation of μ")
axes[0].legend()
axes[1].plot(gamma_true, label="True γ", linestyle="--")
axes[1].plot(gamma_rolling_LS_smoothed, label="Rolling LS γ")
axes[1].set_title("Rolling LS Estimation of γ")
axes[1].legend()
plt.tight_layout()
plt.show()
```

##### Kalman Filter
```python=
!pip install pykalman  # install pykalman (run once in a notebook or shell)
```
```python=
from pykalman import KalmanFilter
# Kalman Filter estimation
kf = KalmanFilter(
transition_matrices=np.eye(2),
observation_matrices=np.column_stack([np.ones(T_syn), Y_syn["Y2"]]).reshape(T_syn, 1, 2),
observation_covariance=1e-4 * np.eye(1),
transition_covariance=np.diag([1e-3, 1e-2]),
initial_state_mean=np.array([0.6, 0.2]),
initial_state_covariance=np.diag([1e-3, 1e-2])
)
# Run Kalman filtering and smoothing
filtered_state_means, _ = kf.filter(Y_syn["Y1"])
mu_kalman, gamma_kalman = pd.Series(filtered_state_means[:, 0], index= date_range), pd.Series(filtered_state_means[:, 1], index= date_range)
# smoothing will give better results but we can skip that in this example.
# Visualize results
plt.figure(figsize=(12, 8))
plt.subplot(2, 1, 1)
plt.plot(mu_true, label="True μ", linestyle="--")
plt.plot(mu_rolling_LS_smoothed, label="Rolling LS μ")
plt.plot(mu_kalman, label="Kalman Filter μ")
plt.title("Comparison of μ Estimates")
plt.legend()
plt.subplot(2, 1, 2)
plt.plot(gamma_true, label="True γ", linestyle="--")
plt.plot(gamma_rolling_LS_smoothed, label="Rolling LS γ")
plt.plot(gamma_kalman, label="Kalman Filter γ")
plt.title("Comparison of γ Estimates")
plt.legend()
plt.tight_layout()
plt.show()
```

## Real-World Performance: EWH vs. EWZ Pair Trading
To showcase the efficacy of time-varying parameter models in a real-world scenario, we will backtest our trading strategies using a pair of ETFs representing distinct emerging markets:
* `EWH`: iShares MSCI Hong Kong ETF
* `EWZ`: iShares MSCI Brazil ETF
This pair is well-suited for evaluating our models' adaptability to dynamic cointegration relationships, given the distinct economic and geopolitical factors influencing these markets.
### Data Preparation and Visualization
```python=
# Load price data
stock_pair = ["EWH", "EWZ"]
data = yf.download(stock_pair, start="2014-07-01", end="2024-07-01")['Adj Close'] # Using Adjusted Close for accuracy
# Plot prices
plt.figure(figsize=(12, 6))
plt.plot(data)
plt.title("EWH vs. EWZ Adjusted Closing Prices")
plt.xlabel("Date")
plt.ylabel("Price")
plt.legend(stock_pair)
plt.grid(axis='y')
plt.show()
```

### Estimating Parameters and Spread Calculation
We apply the `estimate_mu_gamma_LS`, `estimate_mu_gamma_rolling_LS`, and `estimate_mu_gamma_kalman` functions to estimate the model parameters (`mu` and `gamma`) for each of our three models: Least Squares (LS), Rolling Least Squares, and Kalman Filter. Then, we utilize the `compute_spread` function to calculate the spread for each model based on its estimated parameters.
```python=+
def estimate_mu_gamma_LS(Y, pct_training=0.3):
    """
    Estimates mu and gamma parameters using linear regression.

    Args:
        Y: A pandas DataFrame with two columns, where the first column is the dependent variable
            and the second column is the independent variable.
        pct_training: The percentage of data to use for training the linear regression model (default is 0.3).

    Returns:
        A dictionary containing two pandas Series:
            - mu: Estimated mu values
            - gamma: Estimated gamma values
    """
    T = len(Y)
    T_trn = round(pct_training * T)
    # Prepare training data
    X_train = Y.iloc[:T_trn, 1].values.reshape(-1, 1)
    y_train = Y.iloc[:T_trn, 0].values
    # Linear regression
    model = LinearRegression()
    model.fit(X_train, y_train)
    # Extract coefficients
    mu_value = model.intercept_
    gamma_value = model.coef_[0]
    # Create constant mu and gamma series over the full sample
    mu = pd.Series([mu_value] * T, index=Y.index, name='mu-LS')
    gamma = pd.Series([gamma_value] * T, index=Y.index, name='gamma-LS')
    return {'mu': mu, 'gamma': gamma}


def estimate_mu_gamma_rolling_LS(Y, pct_training=0.3):
    """
    Estimates rolling mu and gamma using linear regression with a lookback window.

    Args:
        Y: A pandas DataFrame with two columns, where the first column is the dependent variable
            and the second column is the independent variable.
        pct_training: Percentage of data to use for initial training (default is 0.3).

    Returns:
        A dictionary containing:
            - mu: A pandas Series of estimated mu values.
            - gamma: A pandas Series of estimated gamma values.
    """
    T = Y.shape[0]
    T_start = round(pct_training * T)
    T_lookback = 300  # Lookback window length
    T_shift = 10      # How often to update the estimates
    # Initialize empty series
    mu_rolling_LS = pd.Series(np.nan, index=Y.index)
    gamma_rolling_LS = pd.Series(np.nan, index=Y.index)
    # Loop over time periods to update estimates
    t0_update = range(max(T_start, T_lookback), T - T_shift + 1, T_shift)
    for t0 in t0_update:
        T_lookback_ = min(t0 - 1, T_lookback - 1)  # Adjust lookback for early periods
        Y_window = Y.iloc[(t0 - T_lookback_):t0]
        # Linear regression (using NumPy for better performance than statsmodels)
        X = Y_window.iloc[:, 1].values.reshape(-1, 1)  # Independent variable
        X = np.hstack([np.ones_like(X), X])            # Add intercept term
        y = Y_window.iloc[:, 0].values                 # Dependent variable
        # Weighted least squares with linear weights across the window
        weights = np.arange(1, T_lookback_ + 1)[::-1]
        coeffs = np.linalg.lstsq(X * weights[:, np.newaxis], y * weights, rcond=None)[0]
        mu_rolling_LS[Y.index[t0]] = coeffs[0]
        gamma_rolling_LS[Y.index[t0]] = coeffs[1]
    # Forward-fill between update points, back-fill the initial training stretch
    mu_rolling_LS = mu_rolling_LS.ffill().bfill()
    gamma_rolling_LS = gamma_rolling_LS.ffill().bfill()
    return {"mu": mu_rolling_LS, "gamma": gamma_rolling_LS}


def estimate_mu_gamma_kalman(Y):
    """Estimates time-varying mu and gamma with a Kalman filter (random-walk state)."""
    Y_syn = np.array(Y)
    T_syn = len(Y_syn)  # Number of time steps
    # Kalman filter parameters
    transition_matrix = np.eye(2)
    observation_matrix = np.zeros((T_syn, 1, 2))  # Time-varying observation matrix Z_t = [1, y2_t]
    for t in range(T_syn):
        observation_matrix[t, 0, :] = [1, Y_syn[t, 1]]
    observation_covariance = 1e-3 * np.ones((T_syn, 1, 1))
    transition_covariance = np.diag([1e-8, 1e-8])
    # Initialize the state with the static LS estimates
    init = estimate_mu_gamma_LS(Y)
    initial_state_mean = np.array([init['mu'].iloc[0], init['gamma'].iloc[0]])
    initial_state_covariance = 1e-5 * np.eye(2)
    # Create the Kalman filter object
    kf = KalmanFilter(
        transition_matrices=transition_matrix,
        observation_matrices=observation_matrix,
        observation_covariance=observation_covariance,
        transition_covariance=transition_covariance,
        initial_state_mean=initial_state_mean,
        initial_state_covariance=initial_state_covariance,
    )
    # Run Kalman filtering and smoothing
    filtered_state_means, filtered_state_covariances = kf.filter(Y_syn[:, 0])
    smoothed_state_means, smoothed_state_covariances = kf.smooth(Y_syn[:, 0])
    # Extract mu and gamma values (smoothed values computed for reference; only filtered are returned)
    mu_Kalman_filtering = filtered_state_means[:, 0]
    gamma_Kalman_filtering = filtered_state_means[:, 1]
    mu_Kalman_smoothing = smoothed_state_means[:, 0]
    gamma_Kalman_smoothing = smoothed_state_means[:, 1]
    return {'mu': mu_Kalman_filtering, 'gamma': gamma_Kalman_filtering}
```
```python=
def compute_spread(Y, gamma, mu, name=None):
    """
    Calculates the spread based on input data.

    Args:
        Y (pandas.DataFrame): Price data with two columns (the first column is the dependent asset).
        gamma (float or pandas.Series): Gamma (hedge ratio) parameter.
        mu (float or pandas.Series): Mu (intercept) parameter.
        name (str, optional): Name to assign to the resulting spread Series.

    Returns:
        pandas.Series: Calculated spread.
    """
    spread = Y[Y.columns[0]] - mu - gamma * Y[Y.columns[1]]
    if name is not None:
        spread.name = name
    return spread
```
```python=
# Combine the data for plotting
mu_data = pd.DataFrame({
'LS': estimate_mu_gamma_LS(data)['mu'],
'Rolling LS': estimate_mu_gamma_rolling_LS(data)['mu'],
'Kalman': estimate_mu_gamma_kalman(data)['mu']
})
gamma_data = pd.DataFrame({
'LS': estimate_mu_gamma_LS(data)['gamma'],
'Rolling LS': estimate_mu_gamma_rolling_LS(data)['gamma'],
'Kalman': estimate_mu_gamma_kalman(data)['gamma']
})
# Calculate spreads
spread_LS = compute_spread(data, gamma_data['LS'], mu_data['LS'], "LS")
spread_rolling_LS = compute_spread(data, gamma_data['Rolling LS'], mu_data['Rolling LS'], "rolling-LS")
spread_Kalman = compute_spread(data, gamma_data['Kalman'], mu_data['Kalman'], "Kalman")
# Combine spreads into a DataFrame
spreads = pd.DataFrame({
'LS': spread_LS,
'Rolling LS': spread_rolling_LS,
'Kalman': spread_Kalman,
})
```
### Visualizing Parameter Estimates and Spreads
```python=+
# Plotting parameter estimates
fig, axes = plt.subplots(2, 1, figsize=(12, 8))
mu_data.plot(ax=axes[0])
axes[0].set_title('Tracking of μ (Intercept)')
axes[0].axvline(x=data.index[round(0.3 * len(data))], color='red', linestyle='dashed', linewidth=2, label='Training End')
axes[0].legend(loc='upper left')
gamma_data.plot(ax=axes[1])
axes[1].set_title('Tracking of γ (Hedge Ratio)')
axes[1].axvline(x=data.index[round(0.3 * len(data))], color='red', linestyle='dashed', linewidth=2, label='Training End')
axes[1].legend(loc='upper left')
plt.tight_layout()
plt.show()
```

```python=+
# Plotting spreads
plt.figure(figsize=(12, 6))
plt.plot(spreads)
plt.title("Spreads: EWH vs. EWZ")
plt.xlabel("Date")
plt.ylabel("Spread Value")
plt.legend(spreads.columns, loc='upper left')
plt.grid(axis='y', linestyle='--')
plt.axvline(x=data.index[round(0.3 * len(data))], color='red', linestyle='dashed', linewidth=2, label='Training End')
plt.show()
```

### Trading Strategy
We'll implement a straightforward threshold-based trading strategy across all models. This strategy involves:
* **Buying (going long)** the spread when it falls below a predetermined negative threshold.
* **Selling (going short)** the spread when it rises above a positive threshold.
* **Closing** any open position once the spread reverts back through zero.
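The backtest below applies the fixed thresholds of ±0.5 directly to each spread, using the raw spread in place of a normalized Z-score. If a proper Z-score is preferred, it can be computed from a rolling mean and standard deviation before being passed to the signal generator; a minimal sketch (the 252-day window and `rolling_zscore` name are illustrative assumptions):
```python=
# Hypothetical helper: normalize a spread into a rolling Z-score
def rolling_zscore(spread, window=252):
    rolling_mean = spread.rolling(window, min_periods=20).mean()
    rolling_std = spread.rolling(window, min_periods=20).std()
    return ((spread - rolling_mean) / rolling_std).fillna(0)
```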
```python=
def generate_signal(Z_score, threshold_long, threshold_short):
    signal = Z_score.copy()
    signal.iloc[:] = np.nan  # Initialize signal to NaN
    signal.name = "signal"   # Set name of Series
    # Initial position
    signal.iloc[0] = 0
    if Z_score.iloc[0] <= threshold_long.iloc[0]:
        signal.iloc[0] = 1
    elif Z_score.iloc[0] >= threshold_short.iloc[0]:
        signal.iloc[0] = -1
    # Loop over time
    for t in range(1, len(Z_score)):
        if signal.iloc[t-1] == 0:  # No position
            if Z_score.iloc[t] <= threshold_long.iloc[t]:
                signal.iloc[t] = 1
            elif Z_score.iloc[t] >= threshold_short.iloc[t]:
                signal.iloc[t] = -1
            else:
                signal.iloc[t] = 0
        elif signal.iloc[t-1] == 1:  # Long position
            if Z_score.iloc[t] >= 0:
                signal.iloc[t] = 0
            else:
                signal.iloc[t] = 1
        else:  # Short position
            if Z_score.iloc[t] <= 0:
                signal.iloc[t] = 0
            else:
                signal.iloc[t] = -1
    return signal
```
```python=
THRESHOLD = 0.5
pnls = {}
def signal_pnl(spread, strategy):
    # Build constant threshold series (the raw spread is used here in place of a normalized Z-score)
    threshold_short = pd.Series(np.full(len(spread), THRESHOLD), index=spread.index)
    threshold_long = pd.Series(np.full(len(spread), -THRESHOLD), index=spread.index)
    signal = generate_signal(spread, threshold_long, threshold_short)
    # Plotting
    plt.figure(figsize=(15, 6))
    plt.plot(spread, label="Spread")
    plt.plot(signal, label="Signal", linestyle='-', drawstyle='steps-post')
    plt.title("Spread and Trading Signal")
    plt.legend(loc="upper left")
    plt.show()
    # Calculate spread returns
    spread_return = spread.diff()
    # Calculate traded returns (signal lagged by one day to avoid look-ahead)
    traded_return = spread_return * signal.shift(1)
    traded_return.fillna(0, inplace=True)  # Fill NA values with 0
    traded_return.name = "traded spread"
    # Cumulative P&L (no reinvestment)
    cum_pnl_no_reinvest = traded_return.cumsum()
    pnls[strategy] = cum_pnl_no_reinvest
```
```python=
signal_pnl(spread_LS, "LS")
```

```python=
signal_pnl(spread_rolling_LS, "Rolling LS")
```

```python=
signal_pnl(spread_Kalman, "Kalman")
```

### Performance Comparison
```python=
import statistics

def analyze_pnl_series(pnl_dict, risk_free_rate=0):
    """Plots cumulative P&L and computes max drawdown and a simple Sharpe ratio for a dictionary of P&L Series.

    Args:
        pnl_dict: Dictionary where keys are names and values are Pandas Series with cumulative P&L data.
        risk_free_rate: Annual risk-free rate (default is 0, currently unused).

    Returns:
        drawdowns: Dictionary of max drawdowns for each P&L series.
        sharpe_ratios: Dictionary of Sharpe ratios for each P&L series.
    """
    plt.figure(figsize=(15, 8))
    drawdowns = {}
    sharpe_ratios = {}
    for name, pnl_series in pnl_dict.items():
        pnl_series = pnl_series.bfill().ffill()  # Fill any missing values
        # Plotting
        plt.plot(pnl_series, label=name)
        # Drawdown calculation (relative to the final P&L value)
        drawdown = [(pnl_series.iloc[i] - pnl_series.iloc[:i].max()) / pnl_series.iloc[-1]
                    for i in range(1, len(pnl_series))]
        drawdowns[name] = -min(drawdown)
        # Ratio of consecutive cumulative P&L values, used as a simple return proxy
        daily_returns = [pnl_series.iloc[i] / pnl_series.iloc[i - 1] if pnl_series.iloc[i - 1] != 0 else 1
                         for i in range(1, len(pnl_series))]
        sharpe_ratios[name] = statistics.mean(daily_returns) / statistics.stdev(daily_returns)
    plt.xlabel('Date')
    plt.ylabel('Cumulative P&L')
    plt.title('Cumulative P&L and Drawdowns')
    plt.legend()
    plt.grid(axis='y')
    plt.tight_layout()
    plt.show()
    return drawdowns, sharpe_ratios

# Calculate and display performance metrics (Sharpe Ratio and Max Drawdown)
drawdowns, sharpe_ratios = analyze_pnl_series(pnls)
print("\nPerformance Comparison:")
print("-" * 25)
print(f"{'Model':<15} {'Sharpe Ratio':<15} {'Max Drawdown':<15}")
print("-" * 25)
for model in ["LS", "Rolling LS", "Kalman"]:
print(f"{model:<15} {sharpe_ratios[model]:<15.2f} {drawdowns[model]:<15.2%}")
print("-" * 25)
```

```
Performance Comparison:
-------------------------
Model Sharpe Ratio Max Drawdown
-------------------------
LS 0.21 73.28%
Rolling LS 4.73 21.76%
Kalman 7.62 3.05%
-------------------------
```
#### Key Insights
* **Dynamic Models Excel:** Time-varying parameter models (Rolling LS and Kalman Filter) significantly outperform the fixed LS model, emphasizing the importance of adapting to changing relationships in real-world markets.
* **Kalman Filter's Robustness:** The Kalman Filter again demonstrates superior performance. Its ability to efficiently incorporate new information allows it to respond to market shifts, resulting in a higher Sharpe ratio and a remarkably lower max drawdown.
* **Real-World Applicability:** The results underscore the practical value of these techniques for pair trading in actual markets. The Kalman filter, in particular, offers a promising approach for capturing and profiting from dynamic cointegration relationships.
## Reference
* Palomar. (n.d.). Pairs Trading. [Lecture slides](https://palomar.home.ece.ust.hk/MAFS5310_lectures/slides_pairs_trading.pdf). Department of Electronic and Computer Engineering, The Hong Kong University of Science and Technology.
* Palomar. Pairs Trading with R. [link](https://palomar.home.ece.ust.hk/MAFS5310_lectures/Rsession_pairs_trading_with_R.html)