# Pair Trading: Cointegration & Trading Algorithms

## Introduction

Pair trading is a market-neutral strategy that exploits **cointegration**, a statistical relationship between two assets. Cointegrated assets maintain a stable long-run relationship: some linear combination of their prices is stationary, even though each price series wanders on its own. This creates profitable opportunities when the prices diverge, because the combination is expected to revert to its historical mean.

In this note, we'll:

* Identify cointegrated asset pairs using statistical tests.
* Backtest three trading models:
    * Least Squares (LS) regression (fixed parameters)
    * Rolling LS regression (time-varying parameters)
    * Kalman Filter (time-varying parameters)

## Cointegration

### Cointegration Test

**Goal:** To determine whether two non-stationary time series, $x_t$ and $y_t$, exhibit a long-term equilibrium relationship. This means that despite short-term fluctuations, they tend to move together over time. This equilibrium relationship is known as cointegration.

### Statistical Model (Engle-Granger Two-Step Method)

1. **Estimate the Long-Term Relationship:** We employ linear regression to model the long-term relationship between the two time series:
$$y_t = \alpha + \beta x_t + r_t$$
where $r_t$ represents the residuals, i.e., the deviations from the estimated long-term relationship.
2. **Test for Stationarity of Residuals:** We apply a unit root test, such as the Augmented Dickey-Fuller (ADF) test, to the residuals $r_t$ to assess whether they are stationary.

#### Interpretation

* **Cointegration:** If the residuals are stationary (ADF test p-value is less than a chosen significance level, typically 0.05), we conclude that the two time series are cointegrated.
* **No Cointegration:** If the residuals are non-stationary (ADF test p-value is greater than or equal to the significance level), we cannot conclude that the series are cointegrated.

### Examples: Cointegration in Practice

#### SPY vs. IVV

These two ETFs track the S&P 500 index. Because their underlying holdings are essentially identical, we expect a cointegrated relationship between them.

```python=
import yfinance as yf
import matplotlib.pyplot as plt
import statsmodels.api as sm

# Fetch adjusted closing prices
tickers = ["SPY", "IVV"]
prices = yf.download(tickers, start="2022-01-01", end="2022-12-31")['Adj Close']

# Visualize prices
plt.figure(figsize=(12, 5))
plt.plot(prices)
plt.title("SPY vs. IVV Prices (2022)")
plt.xlabel("Date")
plt.ylabel("Adjusted Closing Price")
plt.legend(tickers)
plt.grid(axis='y')
plt.show()
```

![image](https://hackmd.io/_uploads/rkOe3qUiA.png)

##### Cointegration Test and Error Correction Model (ECM)

```python=+
def cointegration_test(data, ticker1, ticker2):
    # Cointegration test (Engle-Granger two-step method)
    result = sm.tsa.stattools.coint(data[ticker1], data[ticker2])
    p_value = result[1]
    print(f"Cointegration Test p-value ({ticker1} vs. {ticker2}): {p_value:.4f}")

    # Fit the long-run relationship using OLS
    X = sm.add_constant(data[ticker1])
    model = sm.OLS(data[ticker2], X).fit()
    residuals = model.resid

    # Display results
    print(f"Estimated model: {ticker2} = {model.params[0]:.4f} + {model.params[1]:.4f} * {ticker1} + Residuals")

    # Plot residuals
    plt.figure(figsize=(12, 5))
    plt.plot(residuals)
    plt.title(f"Residuals from ECM ({ticker1} vs. {ticker2})")
    plt.xlabel("Date")
    plt.ylabel("Residuals")
    plt.axhline(0, color='red', linestyle='--')  # Add a horizontal line at 0
    plt.grid(axis='y')
    plt.show()

cointegration_test(prices, "SPY", "IVV")
```

```
Cointegration Test p-value (SPY vs. IVV): 0.0025
Estimated model: IVV = -0.4236 + 1.0049 * SPY + Residuals
```

![image](https://hackmd.io/_uploads/BJ0ZnqUiR.png)

The very low p-value (0.0025) strongly suggests the presence of cointegration between SPY and IVV. This is further supported by the residuals plot, which demonstrates stationarity: the residuals fluctuate around zero without any discernible trend.

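For reference, the two-step procedure can also be run explicitly. The sketch below is an illustrative addition that reuses the `prices` DataFrame from above: it regresses IVV on SPY and applies the ADF test to the residuals. Note that standard ADF p-values on fitted residuals are only approximate; `coint` above uses the appropriate Engle-Granger critical values, so its p-value is the one to rely on.

```python=
from statsmodels.tsa.stattools import adfuller

# Step 1: long-run regression of IVV on SPY (reuses `prices` from above)
X = sm.add_constant(prices["SPY"])
residuals = sm.OLS(prices["IVV"], X).fit().resid

# Step 2: unit-root test on the residuals
adf_stat, adf_pvalue, *_ = adfuller(residuals)
print(f"ADF statistic: {adf_stat:.3f}, ADF p-value: {adf_pvalue:.4f}")
```
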
#### GLD vs. IAU

These ETFs both track gold prices, leading us to expect cointegration between them as well.

```python=
# Fetch adjusted closing prices
tickers = ["GLD", "IAU"]
prices = yf.download(tickers, start="2020-01-01", end="2020-12-31")['Adj Close']

# Visualize original and normalized prices
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

# Original prices
axes[0].plot(prices)
axes[0].set_title("GLD vs. IAU Prices (2020)")
axes[0].set_xlabel("Date")
axes[0].set_ylabel("Adjusted Closing Price")
axes[0].legend(tickers)
axes[0].grid(axis='y')

# Normalized prices (for clearer comparison)
normalized_prices = prices / prices.iloc[0]
axes[1].plot(normalized_prices)
axes[1].set_title("GLD vs. IAU Normalized Prices (2020)")
axes[1].set_xlabel("Date")
axes[1].set_ylabel("Normalized Price")
axes[1].legend(tickers)
axes[1].grid(axis='y')

plt.show()
```

![image](https://hackmd.io/_uploads/r11Xnc8sA.png)

##### Cointegration Test and ECM

```python=
cointegration_test(prices, "GLD", "IAU")
```

```
Cointegration Test p-value (GLD vs. IAU): 0.0000
Estimated model: IAU = -0.1377 + 0.2040 * GLD + Residuals
```

![image](https://hackmd.io/_uploads/SyXPhqIoR.png)

Similar to the previous example, the p-value close to zero offers compelling evidence for cointegration between GLD and IAU. The residuals plot, exhibiting mean-reverting behavior around zero, further supports this conclusion.

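Before moving on to trading, note that in practice one screens a whole universe of candidates rather than testing pairs one at a time. Below is a minimal sketch of such a screen; the ticker list and date range are illustrative, and an in-sample screen like this should be confirmed out of sample, since testing many pairs inflates the chance of false positives.

```python=
from itertools import combinations

# Hypothetical candidate universe
candidates = ["SPY", "IVV", "GLD", "IAU", "EWH", "EWZ"]
px = yf.download(candidates, start="2020-01-01", end="2020-12-31")['Adj Close'].dropna()

# Flag pairs whose Engle-Granger p-value falls below 5%
for a, b in combinations(candidates, 2):
    p_value = sm.tsa.stattools.coint(px[a], px[b])[1]
    if p_value < 0.05:
        print(f"{a} / {b}: candidate pair (p-value = {p_value:.4f})")
```
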
## Trading Algorithms

### Least Squares Regression (Fixed Parameters)

When two assets are cointegrated, their log prices exhibit a linear relationship that tends to persist over time. We can express this relationship mathematically as:

$$ y_{1t} = \mu + \gamma y_{2t} + \epsilon_t $$

where:

* $y_{1t}$ and $y_{2t}$ represent the log prices of the two assets at time $t$.
* $\mu$ (Intercept): The expected log price of the first asset when the log price of the second asset $y_{2t}$ is zero.
* $\gamma$ (Hedge Ratio): Quantifies the change in the log price of the first asset $y_{1t}$ for a one-unit change in the log price of the second asset $y_{2t}$.
* $\epsilon_t$ (Residual Error): Accounts for deviations from the linear relationship caused by random market noise or temporary inefficiencies.

To obtain the best estimates for $\mu$ and $\gamma$, we employ Ordinary Least Squares (OLS) regression, which solves:

$$ \mathop{\text{min}}_{\mu,\gamma} \ \sum^T_{t=1} \left(y_{1t} - (\mu + \gamma y_{2t}) \right)^2 $$

OLS identifies the line that minimizes the cumulative squared distance between the actual data points and the values predicted by the model.

#### Implementation

1. **Data Preparation:**
    * We begin by loading historical adjusted closing prices for a pair of potentially cointegrated assets. For this illustration, we'll use China Mobile Limited (`0941.HK`) and HSBC Holdings plc (`0005.HK`).
    * Next, we convert these prices to log prices. This transformation helps stabilize variance and makes the data better suited to linear regression analysis.

```python=
import yfinance as yf
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Load price data
tickers = ["0941.HK", "0005.HK"]
prices = yf.download(tickers, start="2015-01-01", end="2023-12-31")["Adj Close"]
logprices = np.log(prices)

# Plot log prices
plt.figure(figsize=(12, 6))
plt.plot(logprices)
plt.title("Log Prices of 0941.HK and 0005.HK")
plt.xlabel("Date")
plt.ylabel("Log Price")
plt.legend(tickers)
plt.grid(axis='y')
plt.show()
```

![image](https://hackmd.io/_uploads/BJvRhqUjR.png)

2. **Model Estimation:**
    * The data is partitioned into training and testing sets, here using a 70/30 split.
    * OLS regression is performed on the training data to estimate the intercept $\mu$ and the hedge ratio $\gamma$.

```python=+
# Split data into training and testing sets
split_ratio = 0.7
split_index = int(split_ratio * len(logprices))
train_data, test_data = logprices[:split_index], logprices[split_index:]

# Train linear regression model
X = np.array(train_data["0005.HK"]).reshape(-1, 1)
y = np.array(train_data["0941.HK"])
model = LinearRegression().fit(X, y)

# Extract coefficients
mu, gamma = model.intercept_, model.coef_[0]
print(f'Estimated parameters: mu = {mu:.4f}, gamma = {gamma:.4f}')
```

```
Estimated parameters: mu = 3.0142, gamma = 0.2429
```

3. **Spread Calculation and Visualization:**
    * The estimated model is applied to the entire dataset to compute the spread.
    * The spread is then visualized to assess its mean-reverting behavior, a key characteristic of cointegrated pairs.

```python=+
# Calculate spread using the model
spread = logprices["0941.HK"] - (mu + gamma * logprices["0005.HK"])

# Plot the spread
plt.figure(figsize=(12, 6))
plt.plot(spread)
plt.title("Spread between 0941.HK and 0005.HK")
plt.xlabel("Date")
plt.ylabel("Spread")
plt.axvline(x=logprices.index[split_index], color='r', linestyle='--')
plt.show()
```

![image](https://hackmd.io/_uploads/r1yb65UsR.png)

#### Limitations of Fixed Parameters

* **Changing Market Dynamics:** The linear relationship between assets may not remain constant. A fixed-parameter model can fail to capture shifts in market conditions or an evolving cointegration relationship (a quick check of this on our data follows below).
* **Sensitivity to Outliers:** OLS can be heavily influenced by outliers, potentially leading to biased parameter estimates.
* **Limited Adaptability:** The model's inability to adapt to changing market conditions can result in missed trading opportunities or inaccurate signals.

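To see the first limitation concretely in the example above, we can compare the spread's behavior inside and outside the estimation window. This is a rough, illustrative check (reusing `spread` and `split_index` from the code above); a noticeably different mean or dispersion out of sample suggests the fixed parameters have gone stale.

```python=
# In-sample vs. out-of-sample behavior of the fixed-parameter spread
train_spread, test_spread = spread.iloc[:split_index], spread.iloc[split_index:]
print(f"In-sample:     mean = {train_spread.mean():.4f}, std = {train_spread.std():.4f}")
print(f"Out-of-sample: mean = {test_spread.mean():.4f}, std = {test_spread.std():.4f}")
```
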
### Adapting to Dynamic Relationships: Time-Varying Parameters

In the real world, the relationship between asset prices within a cointegrated pair is seldom static. The fixed parameters derived from least squares regression can become obsolete as market conditions fluctuate. To counter this, we introduce models that accommodate time-varying parameters.

#### Model Enhancement

We enhance our pair trading model to allow the parameters to evolve over time:

$$
\begin{split}
y_{1t} &= \mu_t + \gamma_t x_t + \epsilon_{1t}\\
y_{2t} &= x_t + \epsilon_{2t}
\end{split}
$$

where:

* $\mu_t$: Time-varying intercept
* $\gamma_t$: Time-varying hedge ratio
* $x_t$: A stochastic trend component, often represented as a random walk, capturing the non-stationary common trend influencing both asset prices.
* $\epsilon_{1t}, \epsilon_{2t}$: Residual error terms

The random walk component $x_t$ is defined as:

$$ x_t = x_{t-1} + w_t $$

where $w_t$ is the random shock or innovation at time $t$.

#### Methods for Estimating Time-Varying Parameters

1. **Rolling Least Squares (LS):**
    * Re-estimates $\mu$ and $\gamma$ periodically over a fixed-length lookback window.
    * As the window advances, the estimates adjust to recent price movements.
    * Effective at capturing gradual shifts in cointegration relationships.

$$ \mathop{\text{min}}_{\mu,\gamma}\ \sum^{t_0}_{t=1+t_0 - T_{\text{lookback}}} \left(y_{1t} - (\mu + \gamma y_{2t}) \right)^2 $$

2. **Kalman Filter:**
    * A recursive algorithm for estimating the hidden state of a system (in our context, $\mu_t$ and $\gamma_t$) from noisy observations (asset prices).
    * Models the evolution of the parameters as a state-space process.
    * While computationally more intensive than rolling LS, it excels at handling abrupt changes in the parameters.

$$
\begin{aligned}
\alpha_{t+1} &= T_t\alpha_t+R_t\eta_t \quad&& \text{(State Equation)}\\
y_{1t} &= Z_t\alpha_t+\epsilon_t \quad&& \text{(Observation Equation)}
\end{aligned}
$$

where:

* $\alpha_t = \begin{bmatrix}\mu_t \\ \gamma_t \end{bmatrix}$: The state vector representing the parameters at time $t$.
* $T_t = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$: The state transition matrix describing how the state evolves from one time step to the next.
* $R_t = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$: The state noise matrix modeling the uncertainty in the state evolution.
* $\eta_t \sim \mathcal{N}(0, Q_t)$: The state noise, assumed to be normally distributed with zero mean and covariance matrix $Q_t = \begin{bmatrix} \sigma_1^2 & 0 \\ 0 & \sigma_2^2 \end{bmatrix}$.
* $Z_t = \begin{bmatrix} 1 & y_{2t} \end{bmatrix}$: The observation matrix linking the state to the observed log price $y_{1t}$.
* $\epsilon_t \sim \mathcal{N}(0, h)$: The observation noise, also assumed to be normally distributed with zero mean and variance $h$.

#### Evaluating Estimation Methods with Synthetic Data

Before deploying these methods on real market data, we will gauge their performance using synthetic data. This controlled environment allows us to define the true values of $\mu_t$ and $\gamma_t$ and evaluate how effectively each method tracks these changing parameters.

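Before generating the data, we define a small scoring helper (an addition for this note, not part of the original workflow): the root-mean-square error between an estimated parameter path and the true one, which we will use after running each estimator.

```python=
def tracking_rmse(estimate, truth):
    """RMSE between an estimated parameter path and the true path (aligned pandas Series)."""
    err = (estimate - truth).dropna()
    return float(np.sqrt((err ** 2).mean()))
```
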
```python=
from scipy.stats import norm

# Set seed for reproducibility
np.random.seed(42)

# Generate time-varying parameters
T_syn = 1000
date_range = pd.date_range("2010-01-01", periods=T_syn)

mu_true = pd.Series(np.concatenate([
    np.repeat(0.6, T_syn // 4),
    np.repeat(0.4, T_syn // 2),
    np.repeat(0.6, T_syn // 4)
]), index=date_range, name="mu-true")

gamma_true = pd.Series(np.concatenate([
    np.repeat(0.2, T_syn // 2),
    np.repeat(0.8, T_syn // 2)
]), index=date_range, name="gamma-true")

# Smooth parameters for realistic simulation
window_size = 50
mu_true = mu_true.rolling(window_size, min_periods=1, center=True).mean()
gamma_true = gamma_true.rolling(window_size, min_periods=1, center=True).mean()

# Visualize true parameters
plt.figure(figsize=(12, 6))
plt.plot(mu_true, color="blue", label="μ_true")
plt.plot(gamma_true, color="green", label="γ_true")
plt.title("True Time-Varying Parameters")
plt.xlabel("Date")
plt.ylabel("Value")
plt.legend()
plt.grid(axis='y')
plt.show()
```

![image](https://hackmd.io/_uploads/r1kQp5IsR.png)

```python=+
# Simulate synthetic price data
trend = pd.Series(np.cumsum(norm.rvs(size=T_syn, scale=0.01)), index=date_range)
residuals = pd.DataFrame(norm.rvs(size=(T_syn, 2), scale=0.05), index=date_range, columns=["ε1", "ε2"])
residuals["ε1"] = 0.9 * residuals["ε1"].shift(1) + residuals["ε1"]  # Add some autocorrelation
residuals["ε2"] = 0.9 * residuals["ε2"].shift(1) + residuals["ε2"]

Y_syn = pd.DataFrame({
    "Y1": mu_true + gamma_true * trend + residuals["ε1"],
    "Y2": trend + residuals["ε2"]
}).ffill().bfill()  # Forward and backward fill any missing values

# Visualize synthetic price data
plt.figure(figsize=(12, 6))
plt.plot(Y_syn)
plt.title("Synthetic Log Price Data")
plt.xlabel("Date")
plt.ylabel("Log Price")
plt.legend(["Y1", "Y2"])
plt.grid(axis='y')
plt.show()
```

![image](https://hackmd.io/_uploads/BJUE6q8iC.png)

#### Implementation: Rolling LS and Kalman Filter

##### Rolling Least Squares

```python=
# Rolling LS estimation
T_lookback = 400  # Lookback window length
T_shift = 10      # Frequency of updates

mu_rolling_LS = pd.Series(np.nan, index=date_range, name="mu-rolling-LS")
gamma_rolling_LS = pd.Series(np.nan, index=date_range, name="gamma-rolling-LS")

# Loop through the time periods
for t0 in np.arange(T_lookback, T_syn - T_shift + 1, T_shift):
    window = slice(t0 - T_lookback, t0)
    X = Y_syn.iloc[window, 1].replace([np.inf, -np.inf], np.nan).dropna()  # Handle potential infinite values
    X = sm.add_constant(X)
    y = Y_syn.iloc[window, 0][X.index]  # Align y to match X's index

    # Estimate parameters using OLS
    ls_coeffs = np.linalg.lstsq(X, y, rcond=None)[0]

    # Store estimated parameters
    mu_rolling_LS.iloc[t0] = ls_coeffs[0]
    gamma_rolling_LS.iloc[t0] = ls_coeffs[1]

# Forward fill so the latest estimate persists between update points
mu_rolling_LS = mu_rolling_LS.ffill()
gamma_rolling_LS = gamma_rolling_LS.ffill()

# Optional: Smoothing (e.g., moving average) for less noisy estimates
from scipy.signal import lfilter

smoothing_window = T_lookback // 4
filter_coeffs = np.repeat(1.0 / smoothing_window, smoothing_window)

mu_rolling_LS_smoothed = lfilter(filter_coeffs, 1, mu_rolling_LS)
mu_rolling_LS_smoothed = pd.Series(mu_rolling_LS_smoothed, index=date_range).bfill()
gamma_rolling_LS_smoothed = lfilter(filter_coeffs, 1, gamma_rolling_LS)
gamma_rolling_LS_smoothed = pd.Series(gamma_rolling_LS_smoothed, index=date_range).bfill()

# Visualize results
fig, axes = plt.subplots(nrows=2, ncols=1, figsize=(12, 8))

axes[0].plot(mu_true, label="True μ", linestyle="--")
axes[0].plot(mu_rolling_LS_smoothed, label="Rolling LS μ")
axes[0].set_title("Rolling LS Estimation of μ")
axes[0].legend()

axes[1].plot(gamma_true, label="True γ", linestyle="--")
axes[1].plot(gamma_rolling_LS_smoothed, label="Rolling LS γ")
axes[1].set_title("Rolling LS Estimation of γ")
axes[1].legend()

plt.tight_layout()
plt.show()
```

![image](https://hackmd.io/_uploads/HJXy1sIjR.png)

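Using the `tracking_rmse` helper defined earlier, we can attach a number to the visual impression above (an illustrative check; output not shown here):

```python=
print("Rolling LS  RMSE(μ):", tracking_rmse(mu_rolling_LS_smoothed, mu_true))
print("Rolling LS  RMSE(γ):", tracking_rmse(gamma_rolling_LS_smoothed, gamma_true))
```
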
μ") axes[0].set_title("Rolling LS Estimation of μ") axes[0].legend() axes[1].plot(gamma_true, label="True γ", linestyle="--") axes[1].plot(gamma_rolling_LS_smoothed, label="Rolling LS γ") axes[1].set_title("Rolling LS Estimation of γ") axes[1].legend() plt.tight_layout() plt.show() ``` ![image](https://hackmd.io/_uploads/HJXy1sIjR.png) ##### Kalman Filter ```python= pip install pykalman ``` ```python= from pykalman import KalmanFilter # Kalman Filter estimation kf = KalmanFilter( transition_matrices=np.eye(2), observation_matrices=np.column_stack([np.ones(T_syn), Y_syn["Y2"]]).reshape(T_syn, 1, 2), observation_covariance=1e-4 * np.eye(1), transition_covariance=np.diag([1e-3, 1e-2]), initial_state_mean=np.array([0.6, 0.2]), initial_state_covariance=np.diag([1e-3, 1e-2]) ) # Run Kalman filtering and smoothing filtered_state_means, _ = kf.filter(Y_syn["Y1"]) mu_kalman, gamma_kalman = pd.Series(filtered_state_means[:, 0], index= date_range), pd.Series(filtered_state_means[:, 1], index= date_range) # smoothing will give better results but we can skip that in this example. # Visualize results plt.figure(figsize=(12, 8)) plt.subplot(2, 1, 1) plt.plot(mu_true, label="True μ", linestyle="--") plt.plot(mu_rolling_LS_smoothed, label="Rolling LS μ") plt.plot(mu_kalman, label="Kalman Filter μ") plt.title("Comparison of μ Estimates") plt.legend() plt.subplot(2, 1, 2) plt.plot(gamma_true, label="True γ", linestyle="--") plt.plot(gamma_rolling_LS_smoothed, label="Rolling LS γ") plt.plot(gamma_kalman, label="Kalman Filter γ") plt.title("Comparison of γ Estimates") plt.legend() plt.tight_layout() plt.show() ``` ![image](https://hackmd.io/_uploads/B1ihks8oC.png) ## Real-World Performance: EWH vs. EWZ Pair Trading To showcase the efficacy of time-varying parameter models in a real-world scenario, we will backtest our trading strategies using a pair of ETFs representing distinct emerging markets: * `EWH`: iShares MSCI Hong Kong ETF * `EWZ`: iShares MSCI Brazil ETF This pair is well-suited for evaluating our models' adaptability to dynamic cointegration relationships, given the distinct economic and geopolitical factors influencing these markets. ### Data Preparation and Visualization ```python= # Load price data stock_pair = ["EWH", "EWZ"] data = yf.download(stock_pair, start="2014-07-01", end="2024-07-01")['Adj Close'] # Using Adjusted Close for accuracy # Plot prices plt.figure(figsize=(12, 6)) plt.plot(data) plt.title("EWH vs. EWZ Adjusted Closing Prices") plt.xlabel("Date") plt.ylabel("Price") plt.legend(stock_pair) plt.grid(axis='y') plt.show() ``` ![image](https://hackmd.io/_uploads/B11AJoLi0.png) ### Estimating Parameters and Spread Calculation We apply the `estimate_mu_gamma_LS`, `estimate_mu_gamma_rolling_LS`, and `estimate_mu_gamma_kalman` functions to estimate the model parameters (`mu` and `gamma`) for each of our three models: Least Squares (LS), Rolling Least Squares, and Kalman Filter. Then, we utilize the `compute_spread` function to calculate the spread for each model based on its estimated parameters. ```python=+ def estimate_mu_gamma_LS(Y, pct_training=0.3): """ Estimates mu and gamma parameters using linear regression. Args: Y: A pandas DataFrame with two columns, where the first column is the dependent variable and the second column is the independent variable. pct_training: The percentage of data to use for training the linear regression model (default is 0.3). 
## Real-World Performance: EWH vs. EWZ Pair Trading

To showcase the efficacy of time-varying parameter models in a real-world scenario, we will backtest our trading strategies on a pair of ETFs tracking two distinct equity markets:

* `EWH`: iShares MSCI Hong Kong ETF
* `EWZ`: iShares MSCI Brazil ETF

This pair is well-suited for evaluating our models' adaptability to dynamic cointegration relationships, given the distinct economic and geopolitical factors influencing these markets.

### Data Preparation and Visualization

```python=
# Load price data
stock_pair = ["EWH", "EWZ"]
data = yf.download(stock_pair, start="2014-07-01", end="2024-07-01")['Adj Close']  # Using Adjusted Close for accuracy

# Plot prices
plt.figure(figsize=(12, 6))
plt.plot(data)
plt.title("EWH vs. EWZ Adjusted Closing Prices")
plt.xlabel("Date")
plt.ylabel("Price")
plt.legend(stock_pair)
plt.grid(axis='y')
plt.show()
```

![image](https://hackmd.io/_uploads/B11AJoLi0.png)

### Estimating Parameters and Spread Calculation

We define the `estimate_mu_gamma_LS`, `estimate_mu_gamma_rolling_LS`, and `estimate_mu_gamma_kalman` functions to estimate the model parameters (`mu` and `gamma`) for each of our three models: Least Squares (LS), Rolling Least Squares, and Kalman Filter. Then, we use the `compute_spread` function to calculate the spread for each model from its estimated parameters.

```python=+
def estimate_mu_gamma_LS(Y, pct_training=0.3):
    """
    Estimates mu and gamma parameters using linear regression.

    Args:
        Y: A pandas DataFrame with two columns, where the first column is the
           dependent variable and the second column is the independent variable.
        pct_training: The percentage of data to use for training the linear
                      regression model (default is 0.3).

    Returns:
        A dictionary containing two pandas Series:
            - mu: Estimated mu values
            - gamma: Estimated gamma values
    """
    T = len(Y)
    T_trn = round(pct_training * T)

    # Prepare training data
    X_train = Y.iloc[:T_trn, 1].values.reshape(-1, 1)
    y_train = Y.iloc[:T_trn, 0].values

    # Linear regression
    model = LinearRegression()
    model.fit(X_train, y_train)

    # Extract coefficients
    mu_value = model.intercept_
    gamma_value = model.coef_[0]

    # Create mu and gamma series (constant over the whole sample)
    mu = pd.Series([mu_value] * T, index=Y.index, name='mu-LS')
    gamma = pd.Series([gamma_value] * T, index=Y.index, name='gamma-LS')

    return {'mu': mu, 'gamma': gamma}


def estimate_mu_gamma_rolling_LS(Y, pct_training=0.3):
    """
    Estimates rolling mu and gamma using linear regression with a lookback window.

    Args:
        Y: A pandas DataFrame with two columns, where the first column is the
           dependent variable and the second column is the independent variable.
        pct_training: Percentage of data to use for initial training (default is 0.3).

    Returns:
        A dictionary containing:
            - mu: A pandas Series of estimated mu values.
            - gamma: A pandas Series of estimated gamma values.
    """
    T = Y.shape[0]
    T_start = round(pct_training * T)
    T_lookback = 300  # Lookback window length
    T_shift = 10      # How often to update the estimates

    # Initialize estimate series
    mu_rolling_LS = pd.Series(np.nan, index=Y.index)
    gamma_rolling_LS = pd.Series(np.nan, index=Y.index)

    # Loop over time periods to update estimates
    t0_update = range(max(T_start, T_lookback), T - T_shift + 1, T_shift)
    for t0 in t0_update:
        T_lookback_ = min(t0 - 1, T_lookback - 1)  # Adjust lookback for early periods
        Y_window = Y.iloc[(t0 - T_lookback_):t0]

        # Linear regression (using NumPy for better performance than statsmodels)
        X = Y_window.iloc[:, 1].values.reshape(-1, 1)  # Independent variable
        X = np.hstack([np.ones_like(X), X])            # Add intercept term
        y = Y_window.iloc[:, 0].values                 # Dependent variable

        # Apply linear weights across the lookback window
        weights = np.arange(1, T_lookback_ + 1)[::-1]
        coeffs = np.linalg.lstsq(X * weights[:, np.newaxis], y * weights, rcond=None)[0]

        mu_rolling_LS[Y.index[t0]] = coeffs[0]
        gamma_rolling_LS[Y.index[t0]] = coeffs[1]

    # Backward fill missing values (before the first estimate and between update points)
    mu_rolling_LS = mu_rolling_LS.bfill()
    gamma_rolling_LS = gamma_rolling_LS.bfill()

    return {"mu": mu_rolling_LS, "gamma": gamma_rolling_LS}


def estimate_mu_gamma_kalman(Y):
    """Estimates time-varying mu and gamma with a Kalman filter (pykalman)."""
    Y_syn = np.array(Y)
    T_syn = len(Y_syn)  # Number of time steps

    # Kalman filter parameters
    transition_matrix = np.eye(2)
    observation_matrix = np.zeros((T_syn, 1, 2))  # Time-varying observation matrix
    for t in range(T_syn):
        observation_matrix[t, 0, :] = [1, Y_syn[t, 1]]
    observation_covariance = 1e-3 * np.ones((T_syn, 1, 1))
    transition_covariance = np.diag([1e-8, 1e-8])

    # Initialize the state with the static LS estimates
    init = estimate_mu_gamma_LS(Y)
    initial_state_mean = np.array([init['mu'].iloc[0], init['gamma'].iloc[0]])
    initial_state_covariance = 1e-5 * np.eye(2)

    # Create the Kalman filter object
    kf = KalmanFilter(
        transition_matrices=transition_matrix,
        observation_matrices=observation_matrix,
        observation_covariance=observation_covariance,
        transition_covariance=transition_covariance,
        initial_state_mean=initial_state_mean,
        initial_state_covariance=initial_state_covariance,
    )

    # Run Kalman filtering and smoothing
    filtered_state_means, filtered_state_covariances = kf.filter(Y_syn[:, 0])
    smoothed_state_means, smoothed_state_covariances = kf.smooth(Y_syn[:, 0])

    # Extract mu and gamma values
    mu_Kalman_filtering = filtered_state_means[:, 0]
    gamma_Kalman_filtering = filtered_state_means[:, 1]
    mu_Kalman_smoothing = smoothed_state_means[:, 0]
    gamma_Kalman_smoothing = smoothed_state_means[:, 1]

    # Return the filtered (causal) estimates; the smoothed estimates use future data
    # and would introduce look-ahead bias into the backtest.
    return {'mu': pd.Series(mu_Kalman_filtering, index=Y.index),
            'gamma': pd.Series(gamma_Kalman_filtering, index=Y.index)}
```

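A design note on `estimate_mu_gamma_kalman`: the `transition_covariance` argument acts as a smoothness knob, setting how quickly the filter lets $\mu_t$ and $\gamma_t$ drift. The very small value used above (`1e-8`) produces slowly varying estimates; if the hedge ratio is expected to move faster, a larger state-noise covariance can be passed instead (the values below are purely illustrative):

```python=
# Illustrative alternatives for the state-noise covariance Q_t
slow_Q = np.diag([1e-8, 1e-8])  # setting used above: smooth, slowly adapting estimates
fast_Q = np.diag([1e-5, 1e-5])  # hypothetical: faster adaptation, noisier estimates
# e.g. pass transition_covariance=fast_Q when constructing the KalmanFilter
```
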
```python=
def compute_spread(Y, gamma, mu, name=None):
    """
    Calculates the spread based on input data.

    Args:
        Y (pandas.DataFrame): Price data with two columns (first column is the dependent asset).
        gamma (float or pandas.Series): Hedge ratio (constant or time-varying).
        mu (float or pandas.Series): Intercept (constant or time-varying).
        name (str, optional): Name assigned to the resulting spread series.

    Returns:
        pandas.Series: Calculated spread.
    """
    spread = Y[Y.columns[0]] - mu - gamma * Y[Y.columns[1]]
    if name is not None:
        spread.name = name
    return spread
```

```python=
# Estimate parameters with each model (each estimator is run once)
est_LS = estimate_mu_gamma_LS(data)
est_rolling_LS = estimate_mu_gamma_rolling_LS(data)
est_kalman = estimate_mu_gamma_kalman(data)

# Combine the estimates for plotting
mu_data = pd.DataFrame({
    'LS': est_LS['mu'],
    'Rolling LS': est_rolling_LS['mu'],
    'Kalman': est_kalman['mu']
})
gamma_data = pd.DataFrame({
    'LS': est_LS['gamma'],
    'Rolling LS': est_rolling_LS['gamma'],
    'Kalman': est_kalman['gamma']
})

# Calculate spreads
spread_LS = compute_spread(data, gamma_data['LS'], mu_data['LS'], "LS")
spread_rolling_LS = compute_spread(data, gamma_data['Rolling LS'], mu_data['Rolling LS'], "rolling-LS")
spread_Kalman = compute_spread(data, gamma_data['Kalman'], mu_data['Kalman'], "Kalman")

# Combine spreads into a DataFrame
spreads = pd.DataFrame({
    'LS': spread_LS,
    'Rolling LS': spread_rolling_LS,
    'Kalman': spread_Kalman,
})
```

### Visualizing Parameter Estimates and Spreads

```python=+
# Plotting parameter estimates
fig, axes = plt.subplots(2, 1, figsize=(12, 8))

mu_data.plot(ax=axes[0])
axes[0].set_title('Tracking of μ (Intercept)')
axes[0].axvline(x=data.index[round(0.3 * len(data))], color='red', linestyle='dashed', linewidth=2, label='Training End')
axes[0].legend(loc='upper left')

gamma_data.plot(ax=axes[1])
axes[1].set_title('Tracking of γ (Hedge Ratio)')
axes[1].axvline(x=data.index[round(0.3 * len(data))], color='red', linestyle='dashed', linewidth=2, label='Training End')
axes[1].legend(loc='upper left')

plt.tight_layout()
plt.show()
```

![image](https://hackmd.io/_uploads/rku6xi8j0.png)

```python=+
# Plotting spreads
plt.figure(figsize=(12, 6))
plt.plot(spreads)
plt.title("Spreads: EWH vs. EWZ")
plt.xlabel("Date")
plt.ylabel("Spread Value")
plt.legend(spreads.columns, loc='upper left')
plt.grid(axis='y', linestyle='--')
plt.axvline(x=data.index[round(0.3 * len(data))], color='red', linestyle='dashed', linewidth=2, label='Training End')
plt.show()
```

![image](https://hackmd.io/_uploads/SyuRxjLoA.png)

### Trading Strategy

We'll implement a straightforward threshold-based strategy across all models, applying the entry thresholds directly to the spread (which, by construction, fluctuates around zero). The rules are:

* **Buy (go long the spread)** when it falls below a predetermined negative threshold.
* **Sell (go short the spread)** when it rises above a positive threshold.
* **Close the position** once the spread crosses back through zero.

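In the backtest below, the thresholds are applied to the raw spread. A common refinement, not used here, is to first standardize the spread into a rolling Z-score so that the thresholds adapt to its changing volatility; a minimal sketch (the window length is illustrative):

```python=
def rolling_zscore(spread, window=252):
    """Standardize a spread by its rolling mean and standard deviation (illustrative sketch)."""
    rolling_mean = spread.rolling(window, min_periods=window // 2).mean()
    rolling_std = spread.rolling(window, min_periods=window // 2).std()
    return (spread - rolling_mean) / rolling_std
```
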
```python=
def generate_signal(Z_score, threshold_long, threshold_short):
    signal = Z_score.copy()
    signal.iloc[:] = np.nan  # Initialize signal to NaN
    signal.name = "signal"   # Set name of Series

    # Initial position
    signal.iloc[0] = 0
    if Z_score.iloc[0] <= threshold_long.iloc[0]:
        signal.iloc[0] = 1
    elif Z_score.iloc[0] >= threshold_short.iloc[0]:
        signal.iloc[0] = -1

    # Loop over time
    for t in range(1, len(Z_score)):
        if signal.iloc[t-1] == 0:  # No position
            if Z_score.iloc[t] <= threshold_long.iloc[t]:
                signal.iloc[t] = 1
            elif Z_score.iloc[t] >= threshold_short.iloc[t]:
                signal.iloc[t] = -1
            else:
                signal.iloc[t] = 0
        elif signal.iloc[t-1] == 1:  # Long position
            if Z_score.iloc[t] >= 0:
                signal.iloc[t] = 0
            else:
                signal.iloc[t] = 1
        else:  # Short position
            if Z_score.iloc[t] <= 0:
                signal.iloc[t] = 0
            else:
                signal.iloc[t] = -1

    return signal
```

```python=
THRESHOLD = 0.5
pnls = {}

def signal_pnl(spread, strategy):
    # Generate signal using the function
    threshold_short = pd.Series(np.full(len(spread), THRESHOLD), index=spread.index)
    threshold_long = pd.Series(np.full(len(spread), -THRESHOLD), index=spread.index)
    signal = generate_signal(spread, threshold_long, threshold_short)

    # Plotting
    plt.figure(figsize=(15, 6))
    plt.plot(spread, label="Z-score")
    plt.plot(signal, label="Signal", linestyle='-', drawstyle='steps-post')
    plt.title("Z-score and Trading Signal")
    plt.legend(loc="upper left")
    plt.show()

    # Calculate spread returns
    spread_return = spread.diff()

    # Calculate traded returns (with lag)
    traded_return = spread_return * signal.shift(1)  # Shift signal by 1 for the lag
    traded_return.fillna(0, inplace=True)            # Fill NA values with 0
    traded_return.name = "traded spread"

    # Cumulative P&L
    cum_pnl_no_reinvest = traded_return.cumsum()
    pnls[strategy] = cum_pnl_no_reinvest
```

```python=
signal_pnl(spread_LS, "LS")
```

![image](https://hackmd.io/_uploads/ryT4QiLjR.png)

```python=
signal_pnl(spread_rolling_LS, "Rolling LS")
```

![image](https://hackmd.io/_uploads/Hy6wQjIiC.png)

```python=
signal_pnl(spread_Kalman, 'Kalman')
```

![image](https://hackmd.io/_uploads/S15Fms8iR.png)

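The cumulative P&L above ignores transaction costs. A simple way to account for them, sketched below rather than applied in the results, is to charge a proportional cost whenever the position changes; the helper could be spliced into `signal_pnl` after `traded_return` is computed (the cost level is purely illustrative):

```python=
def net_of_costs(traded_return, signal, cost_per_unit=0.001):
    """Subtract a proportional cost each time the position changes (illustrative sketch)."""
    turnover = signal.diff().abs().fillna(abs(signal.iloc[0]))  # include the initial entry
    return traded_return - cost_per_unit * turnover
```
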
""" plt.figure(figsize=(15, 8)) drawdowns = {} sharpe_ratios = {} for name, pnl_series in pnl_dict.items(): pnl_series.bfill().ffill() # Plotting plt.plot(pnl_series, label=name) # Drawdown Calculation drawdown = [(pnl_series[i] - pnl_series[:i].max()) / pnl_series[-1]for i in range(1, len(pnl_series))] drawdowns[name] = -min(drawdown) daily_returns = [pnl_series[i] / pnl_series[i - 1] if pnl_series[i - 1] != 0 else 1 for i in range(1, len(pnl_series))] sharpe_ratios[name] = statistics.mean(daily_returns) / statistics.stdev(daily_returns) plt.xlabel('Date') plt.ylabel('Cumulative P&L') plt.title('Cumulative P&L and Drawdowns') plt.legend() plt.grid(axis='y') plt.tight_layout() plt.show() return drawdowns, sharpe_ratios # Calculate and display performance metrics (Sharpe Ratio and Max Drawdown) drawdowns, sharpe_ratios = analyze_pnl_series(cumulative_pnls) print("\nPerformance Comparison:") print("-" * 25) print(f"{'Model':<15} {'Sharpe Ratio':<15} {'Max Drawdown':<15}") print("-" * 25) for model in ["LS", "Rolling LS", "Kalman"]: print(f"{model:<15} {sharpe_ratios[model]:<15.2f} {drawdowns[model]:<15.2%}") print("-" * 25) ``` ![image](https://hackmd.io/_uploads/S1q8Ei8jR.png) ``` Performance Comparison: ------------------------- Model Sharpe Ratio Max Drawdown ------------------------- LS 0.21 73.28% Rolling LS 4.73 21.76% Kalman 7.62 3.05% ------------------------- ``` #### Key Insights * **Dynamic Models Excel:** Time-varying parameter models (Rolling LS and Kalman Filter) significantly outperform the fixed LS model, emphasizing the importance of adapting to changing relationships in real-world markets. * **Kalman Filter's Robustness:** The Kalman Filter again demonstrates superior performance. Its ability to efficiently incorporate new information allows it to respond to market shifts, resulting in a higher Sharpe ratio and a remarkably lower max drawdown. * **Real-World Applicability:** The results underscore the practical value of these techniques for pair trading in actual markets. The Kalman filter, in particular, offers a promising approach for capturing and profiting from dynamic cointegration relationships. ## Reference * Palomar. (n.d.). Pairs Trading. [Lecture slides](https://palomar.home.ece.ust.hk/MAFS5310_lectures/slides_pairs_trading.pdf). Department of Electronic and Computer Engineering, The Hong Kong University of Science and Technology. * Palomar. Pairs Trading with R. [link](https://palomar.home.ece.ust.hk/MAFS5310_lectures/Rsession_pairs_trading_with_R.html)