# Identifying Patterns in Financial Markets with Regime-Switching Models
## Introduction
Financial markets are inherently dynamic and prone to volatility. Understanding and anticipating changes in market regimes (distinct periods with different statistical characteristics) is central to developing successful investment strategies. This document covers a set of statistical tools designed to address this challenge:
* **Hidden Markov Models (HMMs):** These models excel at capturing the underlying, unobservable states (regimes) that drive market behavior. They model transitions between these states, providing valuable insights into crucial market phases like bull and bear markets, high and low volatility regimes, and more.
* **Clustering Algorithms (K-means, K-means++, Sparse K-means):** These algorithms are essential tools for grouping similar data points together, aiding in both the initialization and interpretation of regime-switching models.
* **Jump Models:** These models complement HMMs by fitting the regime sequence directly while penalizing "jumps" (switches between regimes). The penalty yields persistent regime estimates and makes these models well suited to detecting abrupt shifts in market conditions.
By combining these techniques, we can construct a comprehensive framework for analyzing and forecasting market regimes, giving investors a firmer basis for decisions as market conditions change.
## Hidden Markov Models (HMMs)
Hidden Markov Models (HMMs) serve as a potent tool for unraveling the complex dynamics of financial markets. They are particularly effective in discerning shifts between distinct market regimes, such as bull vs. bear markets or periods of high vs. low volatility. HMMs are adept at capturing both abrupt market transitions and phases of persistent behavior, making them a versatile choice for financial analysis.
### Core Principles
HMMs are statistical models grounded in the assumption that the system under study possesses underlying, unobservable "hidden" states that evolve over time based on probabilistic transitions. In the context of financial markets, these components translate to:
* **Hidden States:** These represent the different market conditions or regimes, such as bull, bear, or neutral markets.
* **Observed Data:** This encompasses market indicators like asset prices, returns, or trading volumes.
* **Transitions:** These govern the probability of switching between hidden states, capturing the dynamics of regime changes.
* **Emissions:** These describe the probability of observing specific data values given the current hidden state, linking the unobservable states to measurable market data.
The key assumptions underpinning HMMs are:
1. **Markov Property:** The future state depends solely on the present state, not on the entire history of states.
2. **Stationarity:** The transition probabilities between states remain constant over time.
### Mathematical Formulation
To illustrate, let's consider a simplified HMM for financial markets:
* **Hidden States:** $S_t \in \{1,2\}$ where 1 represents a bull market and 2 a bear market.
* **Observed Data:** $Y_t$, representing the daily log-return of an asset.
* **Transition Probabilities:**
* $\gamma_{11}$: Probability of staying in a bull market.
* $\gamma_{12}$: Probability of transitioning from a bull to bear market.
* $\gamma_{21}$: Probability of transitioning from a bear to bull market.
* $\gamma_{22}$: Probability of staying in a bear market.
* **Emission Probabilities:** Assumed to follow Gaussian distributions:
* Bull market: $Y_t \mid S_t = 1 \sim N(\mu_1, \sigma_1^2)$
* Bear market: $Y_t \mid S_t = 2 \sim N(\mu_2, \sigma_2^2)$
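For a two-state chain, these transition probabilities can be collected into a matrix (written here as $\Gamma$) whose rows sum to one, which is often the most convenient form for estimation and simulation:
$$
\Gamma = \begin{pmatrix} \gamma_{11} & \gamma_{12} \\ \gamma_{21} & \gamma_{22} \end{pmatrix}, \qquad \gamma_{11} + \gamma_{12} = 1, \quad \gamma_{21} + \gamma_{22} = 1.
$$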
#### Parameter Estimation
* **Traditional (Batch) Methods:** The Expectation-Maximization (EM) algorithm is commonly employed to estimate HMM parameters by iteratively maximizing the likelihood of the observed data.
* **Adaptive (Online) Methods:** Often based on gradient ascent optimization, these methods enable real-time parameter updates as new data arrives, making them well-suited for the dynamic nature of financial markets.
### Key Considerations
* **Flexibility:** HMMs can be tailored to accommodate diverse emission distributions (beyond Gaussian) and various assumptions about state durations.
* **Model Selection:** The choice of the number of hidden states $K$ is crucial and often necessitates careful analysis based on data characteristics and domain expertise.
### Code Implementation
The [`hmmlearn`](https://hmmlearn.readthedocs.io/en) library provides a convenient way to implement HMMs in Python; a minimal sketch of fitting a two-state Gaussian HMM follows.
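The example below generates a synthetic return series purely for illustration (the data-generation step and variable names are assumptions, not part of any published workflow) and fits a two-state Gaussian HMM with `hmmlearn`:

```python
import numpy as np
from hmmlearn.hmm import GaussianHMM

# Synthetic daily log-returns: a calm segment followed by a volatile one (illustrative only)
rng = np.random.default_rng(0)
returns = np.concatenate([rng.normal(0.0005, 0.01, 500),
                          rng.normal(-0.001, 0.03, 250)])
X = returns.reshape(-1, 1)          # hmmlearn expects shape (n_samples, n_features)

# Two-state Gaussian HMM estimated by EM (Baum-Welch)
model = GaussianHMM(n_components=2, covariance_type="diag", n_iter=200, random_state=0)
model.fit(X)

states = model.predict(X)           # most likely hidden-state sequence (Viterbi)
print("Transition matrix:\n", model.transmat_)
print("State means:", model.means_.ravel())
print("Estimated regime of last observation:", states[-1])
```

The estimated transition matrix and state means can then be inspected to label the states (e.g., the low-mean, high-variance state as the "bear" regime).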
## K-means Clustering and its Extensions
K-means clustering serves as a foundational tool in unsupervised machine learning for partitioning datasets into distinct groups or clusters. While primarily a clustering technique, it also holds significance in initializing other models, such as jump models, due to its capacity to swiftly identify potential regime states within financial data. Let's explore K-means along with two of its notable extensions: K-means++ and Sparse K-means.
### K-means Clustering
#### Mathematical Formulation
K-means operates by minimizing the within-cluster sum of squares (WCSS), a metric that quantifies how tightly data points are clustered around their respective centers. Mathematically, this objective is expressed as:
$$
J = \sum_{i=1}^{N} \sum_{k=1}^{K} r_{ik} || \textbf{y}_i - \pmb{\mu}_k ||^2
$$
where:
* $N$ is the total number of data points.
* $K$ is the predefined number of clusters.
* $\textbf{y}_i$ is the i-th data point.
* $\pmb{\mu}_k$ is the center (centroid) of the $k$-th cluster.
* $r_{ik}$ is an indicator variable: 1 if $\textbf{y}_i$ belongs to cluster $k$, and 0 otherwise.
#### Algorithm
:::success
**K-means Clustering for Time Series Data**
**Input:**
* `Y = {y[0], y[1], ..., y[T-1]}`: Time series data
* `K`: Number of clusters
**Output:**
* `mu = {mu[0], mu[1], ..., mu[K-1]}`: Set of K cluster centroids
**Algorithm**
* **Initialization**
1. For `k = 0 to K-1`:
* `mu[k] = random_select(Y)` // Randomly select initial centroids from `Y`
2. Set `max_iterations = 100`
3. Set `iteration = 0`
4. Set `converged = false`
* **Main Loop**
* While `not converged` and `iteration < max_iterations`:
1. For `n = 0 to T-1`:
* Set `min_distance = INFINITY`
* Set `assigned_cluster = -1`
* For `k = 0 to K-1`:
* `distance = calculate_distance(y[n], mu[k])`
* If `distance < min_distance`:
* Set `min_distance = distance`
* Set `assigned_cluster = k`
* Set `r[n][assigned_cluster] = 1` // Assign `y[n]` to the closest cluster
* For `k = 0 to K-1`:
* If `k != assigned_cluster`:
* Set `r[n][k] = 0`
* **Update Step**
1. Initialize `new_mu` as a list with K placeholder entries
2. For `k = 0 to K-1`:
* Set `sum = 0`
* Set `count = 0`
* For `n = 0 to T-1`:
* If `r[n][k] == 1`:
* `sum += y[n]`
* `count += 1`
* If `count > 0`:
* `new_mu[k] = sum / count`
* Else:
* `new_mu[k] = mu[k]` // Keep the old centroid if the cluster is empty
* **Convergence Check**
1. If `new_mu == mu`:
* Set `converged = true`
2. Else:
* Set `mu = new_mu`
3. Increment `iteration`
* **Return**
* Return `mu`
:::
#### Considerations
* The optimal number of clusters, $K$, often needs to be determined using techniques like the Elbow method or Silhouette analysis.
* K-means can be sensitive to the initial centroid placement.
* The algorithm might converge to a local minimum; multiple runs with different initializations can help.
* K-means assumes spherical and roughly equal-sized clusters, which might not always be realistic.
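To make these considerations concrete, the sketch below runs scikit-learn's `KMeans` on synthetic two-regime data (the feature construction and parameter choices are illustrative assumptions), tracking the inertia (WCSS) across candidate values of $K$ and using multiple random restarts via `n_init`:

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic two-regime feature set: (return, volatility proxy) pairs
rng = np.random.default_rng(42)
calm = rng.normal([0.0005, 0.01], [0.002, 0.002], size=(500, 2))
turbulent = rng.normal([-0.001, 0.03], [0.005, 0.005], size=(250, 2))
Y = np.vstack([calm, turbulent])

# Elbow method: track the within-cluster sum of squares (inertia) as K grows
for k in range(1, 6):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(Y)
    print(f"K={k}: inertia={km.inertia_:.6f}")

# Final fit; labels_ gives the cluster (candidate regime) of each observation
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(Y)
print("Centroids:\n", km.cluster_centers_)
```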
### K-means++
K-means++ enhances the standard K-means algorithm by employing a smarter initialization strategy. It aims to select initial centroids that are well-spread across the dataset, leading to improved clustering results.
#### Mathematical Formulation
The core distinction lies in the probability distribution used to select new centers:
$$
\Pr(\text{choosing $\mathbf{x}$ as a new center}) = \frac{D(\mathbf{x})^2}{\sum_{\mathbf{x}' \in X} D(\mathbf{x}')^2}
$$
where $D(\mathbf{x})$ is the distance from data point $\mathbf{x}$ to the nearest existing center, and $X$ is the set of all data points.
#### Advantages
* Often leads to better clustering quality (lower WCSS) than random initialization.
* Offers theoretical guarantees on its performance relative to the optimal clustering.
#### Limitation
* Incurs a slightly higher computational cost during initialization compared to standard K-means.
#### Algorithm
:::success
**K-means++ Initialization Algorithm**
**Input**
* `Y = {y[0], y[1], ..., y[T-1]}`: Time series data
* `K`: Number of clusters
**Output**
* `C = {c[0], c[1], ..., c[K-1]}`: Set of K initial cluster centroids
**Algorithm**
* **Initialization**
1. Set `C` as an empty list
2. Randomly select `c[0]` from `Y`
3. Add `c[0]` to `C`
* **Choose Remaining Centers**
* Repeat `K - 1` times:
1. Initialize `D` as an empty list of length `T`
2. Set `total_distance = 0`
* **Calculate Distances**
* For `n = 0 to T-1`:
* If `y[n]` not in `C`:
* `D[n] = min_squared_distance(y[n], C)`
* `total_distance += D[n]`
* Else:
* `D[n] = 0`
* **Select Next Centroid**
1. Generate `r = random(0, total_distance)`
2. Set `accumulated_distance = 0`
3. For `n = 0 to T-1`:
* `accumulated_distance += D[n]`
* If `accumulated_distance > r`:
* Set `c[i] = y[n]` // `i` is the index of the centroid currently being chosen
* Add `c[i]` to `C`
* Break the loop
* **Return**
* Return `C`
**Helper Functions**
* `min_squared_distance(y, C)`
1. Set `min_dist = INFINITY`
2. For each centroid `c` in `C`:
* `dist = calculate_distance(y, c)`
* If `dist < min_dist`:
* Set `min_dist = dist`
3. Return `min_dist * min_dist`
:::
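To complement the pseudocode, the $D(\mathbf{x})^2$-weighted sampling step can be written compactly in NumPy. The sketch below is illustrative only; the function name `kmeanspp_init` is not from any library:

```python
import numpy as np

def kmeanspp_init(Y, K, rng=None):
    """Select K initial centroids from Y (shape (T, p)) via D^2-weighted sampling."""
    rng = np.random.default_rng(rng)
    T = Y.shape[0]
    centers = [Y[rng.integers(T)]]            # first center chosen uniformly at random
    for _ in range(K - 1):
        # squared distance of every point to its nearest existing center
        d2 = np.min(((Y[:, None, :] - np.array(centers)[None, :, :]) ** 2).sum(-1), axis=1)
        probs = d2 / d2.sum()                 # Pr(x) proportional to D(x)^2
        centers.append(Y[rng.choice(T, p=probs)])
    return np.array(centers)

# Example usage on random data
Y = np.random.default_rng(0).normal(size=(200, 2))
print(kmeanspp_init(Y, K=3))
```

Points already chosen as centers have zero distance and therefore zero probability of being selected again, matching the pseudocode above.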
### Sparse K-means Clustering
Sparse K-means extends traditional K-means by incorporating feature selection, enhancing both model interpretability and performance. It achieves this by assigning weights to features, determining their importance in the clustering process.
#### Objective Function
Sparse K-means assigns a weight to each feature and maximizes the weighted between-cluster sum of squares, subject to constraints on the feature weights that enforce sparsity (Witten & Tibshirani, 2010):
\begin{split}
\max_{\mathbf{w}, \mu, s}& \quad \mathbf{w} \odot \sum_{k=1}^K n_k (\pmb{\mu}_k - \bar{\pmb{\mu}})^2 \\
\text{subject to}& \quad ||\mathbf{w}||_2^2 \leq 1, \quad ||\mathbf{w}||_1 \leq \kappa, \quad w_i \geq 0 \quad \forall i
\end{split}
where:
* $\mathbf{w} = (w_1, \dots, w_p)$ is the vector of feature weights.
* $n_k$ is the number of data points in cluster $k$.
* $\pmb{\mu}_k$ is the center of cluster $k$.
* $\bar{\pmb{\mu}}$ is the mean of the cluster centers.
* $\kappa$ is the tuning parameter that controls the degree of sparsity.
#### Key Points
* Performs simultaneous clustering and feature selection.
* Provides insights into feature importance through the assigned weights.
* Allows control over the degree of sparsity (number of features used) through a tuning parameter.
#### Algorithm
:::success
**Sparse K-means Algorithm**
**Input**
* `Y = {y[0], y[1], ..., y[T-1]}`: Time series data, where each `y[i]` is a P-dimensional vector
* `K`: Number of clusters
* `kappa`: Sparsity tuning parameter (L1-norm constraint)
**Output**
* Cluster assignments for each data point
* `C = {c[0], c[1], ..., c[K-1]}`: Set of K cluster centroids
* `w = {w[0], w[1], ..., w[P-1]}`: Weights assigned to each of the P features
**Algorithm**
* **Initialization**
1. Initialize feature weights: `w[j] = 1 / sqrt(P)` for `j = 0 to P-1`
2. Initialize centroids `C` (e.g., using K-means++ initialization)
* **Main Loop**
* Repeat until convergence or maximum iterations reached:
1. Cluster Assignment
* For each data point `y[i]` in `Y`:
* For each centroid `c[k]` in `C`:
* Calculate weighted distance:
* `d = sqrt(sum(w[j]^2 * (y[i][j] - c[k][j])^2 for j = 0 to P-1))`
* Assign `y[i]` to the cluster with the minimum weighted distance
2. Update Centroids
* For each cluster `k`:
* Update centroid: `c[k][j] = sum(y[i][j] for y[i] in cluster k) / size(cluster k)` for `j = 0 to P-1` // the feature weights cancel, so this is the ordinary per-feature mean
3. Update Feature Weights
* Solve the Sparse K-means optimization problem:
* Maximize the weighted between-cluster sum of squares (BCSS):
* `BCSS = sum(w[j]^2 * BCSS[j] for j = 0 to P-1)` where `BCSS[j]` is the between-cluster sum of squares for feature j
* Subject to constraints:
* `||w||_1 ≤ kappa` (L1-norm constraint)
* `w[j] ≥ 0` for all j
* Update `w` with the solution of the optimization problem
* **Return**
* Return cluster assignments, centroids `C`, and feature weights `w`
**Helper Functions**
* `calculate_BCSS(Y, C, cluster_assignments)`
1. For each feature `j`:
* Calculate the total mean of the feature
* Calculate the between-cluster sum of squares:
* `BCSS[j] = sum(size(cluster k) * (c[k][j] - total_mean[j])^2 for k = 0 to K-1)`
2. Return `BCSS`
* `solve_optimization_problem(BCSS, kappa)`
* Implement a method to solve the constrained optimization problem
* This can be done using techniques like coordinate descent or proximal gradient methods
* Return the optimal feature weights `w`
:::
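In the Witten–Tibshirani framework, the feature-weight update described above has an essentially closed form: soft-threshold the per-feature BCSS vector, renormalize to unit $\ell_2$-norm, and choose the threshold by binary search so that $\|\mathbf{w}\|_1 \leq \kappa$ holds. The NumPy sketch below is illustrative only; the helper names are hypothetical, not from `pysparcl` or scikit-learn:

```python
import numpy as np

def soft_threshold(a, delta):
    return np.maximum(a - delta, 0.0)

def update_weights(bcss, kappa, tol=1e-8, max_iter=200):
    """Maximize w . bcss  s.t.  ||w||_2 <= 1, ||w||_1 <= kappa, w >= 0."""
    w = soft_threshold(bcss, 0.0)
    w = w / np.linalg.norm(w)
    if np.sum(w) <= kappa:                 # L1 constraint already satisfied with delta = 0
        return w
    lo, hi = 0.0, np.max(bcss)             # binary search on the soft-threshold level
    for _ in range(max_iter):
        delta = 0.5 * (lo + hi)
        w = soft_threshold(bcss, delta)
        w = w / np.linalg.norm(w)
        if np.sum(w) > kappa:
            lo = delta
        else:
            hi = delta
        if hi - lo < tol:
            break
    return w

# Example: a per-feature BCSS vector; features with larger BCSS receive larger weights
print(update_weights(np.array([5.0, 3.0, 0.5, 0.1]), kappa=1.2))
```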
### Code Implementation
Python libraries like [`scikit-learn`](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html) and [`pysparcl`](https://github.com/tsurumeso/pysparcl/tree/master) provide implementations for K-means, K-means++, and Sparse K-means.
## Jump Models and Their Extensions
Jump models serve as a specialized class of statistical models tailored to capture abrupt changes or "jumps" in time-series data. They achieve this by assuming the data is generated by a sequence of distinct states or regimes, with transitions between these states occurring at discrete points in time. This makes jump models exceptionally well-suited for analyzing financial data, where sudden shifts in market conditions are commonplace. Let's explore the core concepts of jump models along with their continuous and sparse extensions.
### Jump Models
#### Objective Function
A common approach to fitting jump models involves minimizing the following objective function:
$$
\min_{\mu, s} \sum_{t=1}^{T-1} \left[ \ell(\mathbf{y}_t, \pmb{\mu}_{s_t}) + \lambda I_{s_t \neq s_{t+1}} \right] + \ell(\mathbf{y}_T, \pmb{\mu}_{s_T})
$$
where:
* $\mathbf{y}_1, \dots, \mathbf{y}_T \in \mathbb{R}^p$ are the observed data points with p features (e.g., asset prices or returns).
* $\pmb{\mu}_1, \dots, \pmb{\mu}_K \in \mathbb{R}^p$ are the model parameters associated with each of the $K$ possible states (regimes).
* $s_1, \dots, s_T$ are the state assignments for each observation, indicating which state generated that particular data point.
* $\ell(\mathbf{y},\pmb{\mu}) = ||\mathbf{y} - \pmb{\mu}||^2$ is a loss function measuring the fit between the observation $\mathbf{y}$ and the state parameter $\pmb{\mu}$.
* $\lambda \geq 0$ is a regularization parameter that controls the "jump penalty" – the cost of transitioning between states.
* $I_{s_t \ne s_{t+1}}$ is an indicator function that equals 1 if the state changes between time $t$ and $t+1$, and 0 otherwise.
#### Key Points
* **Connection to K-means:** When the jump penalty $\lambda$ is zero, the objective function becomes equivalent to that of K-means clustering, allowing K-means to serve as an efficient initialization method for jump models.
* **Controlling Model Persistence:** The regularization parameter $\lambda$ governs how long the model tends to stay in a given state. Larger $\lambda$ favors fewer jumps and longer state durations.
* **Solution Approach:** Jump models are typically fit using a coordinate descent algorithm, iteratively optimizing model parameters, the state sequence, and potentially feature weights (in sparse jump models).
* **Dynamic Programming:** Dynamic programming is employed to efficiently estimate the most likely sequence of states.
#### Algorithm
:::success
**Jump Model Algorithm**
**Input**
* `X`: Time series data, shape (n_samples, n_features)
* `n_components`: Number of clusters
* `jump_penalty`: Penalty for state transitions
* `max_iter`: Maximum number of iterations
* `tol`: Convergence tolerance
* `n_init`: Number of initializations
**Output**
* Optimized model parameters including cluster centers, state probabilities, and labels
**Algorithm**
* **Initialization**
1. Set `best_val = infinity`
2. Compute `jump_penalty_mx = jump_penalty * (ones_matrix(n_components, n_components) - identity_matrix(n_components))`
* **Main Loop (repeat `n_init` times)**
1. Initialize `centers` using k-means++ algorithm
2. Set `labels_pre = None`, `val_pre = infinity`
3. Perform E-step:
* Compute loss matrix: `loss_mx = 0.5 * squared_euclidean_distance(X, centers)`
* `labels_, val_ = dp_viterbi(loss_mx, jump_penalty_mx)`
* `proba_ = raise_labels_into_proba(labels_, n_components)`
4. While `num_iter < max_iter` and not converged:
* M-step: Update centers
* `centers = weighted_mean_cluster(X, proba_)`
* E-step:
* Recompute `loss_mx`
* `labels_, val_ = dp_viterbi(loss_mx, jump_penalty_mx)`
* `proba_ = raise_labels_into_proba(labels_, n_components)`
* Check convergence:
* If `labels_ == labels_pre` or `val_pre - val_ <= tol`, break
* Update `labels_pre, val_pre = labels_, val_ `
5. If `val_ < best_val`:
* Update `best_val, best_centers, best_labels, best_proba = val_, centers, labels_, proba_`
* **Finalization**
1. Sort states based on cluster frequencies or provided return series
2. Compute transition matrix from `best_labels`
3. Return `best_centers`, `best_proba`, `best_labels`, `transition_matrix`
**Helper Functions**
* `dp_viterbi(loss_mx, penalty_mx)`
1. Solve the dynamic programming problem:
* `min sum(L(t, s_t) for t = 0 to T-1) + sum(Λ(s_(t-1), s_t) for t = 1 to T-1)`
2. Return optimal state sequence and its value
* `weighted_mean_cluster(X, proba)`
1. Compute weighted mean of `X` for each cluster using `proba` as weights
2. Return cluster centers
:::
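The `dp_viterbi` helper above solves a shortest-path problem over states. A minimal NumPy sketch of that dynamic program follows (illustrative, not the library's implementation):

```python
import numpy as np

def dp_viterbi(loss_mx, penalty_mx):
    """Minimize sum_t L[t, s_t] + sum_t Lambda[s_{t-1}, s_t] by dynamic programming.

    loss_mx: (T, K) per-time, per-state losses; penalty_mx: (K, K) jump penalties.
    Returns the optimal state sequence and its objective value.
    """
    T, K = loss_mx.shape
    value = np.empty((T, K))
    backptr = np.zeros((T, K), dtype=int)
    value[0] = loss_mx[0]
    for t in range(1, T):
        # candidate cost of arriving in state k at time t from state j at time t-1
        trans = value[t - 1][:, None] + penalty_mx          # shape (K_prev, K_curr)
        backptr[t] = np.argmin(trans, axis=0)
        value[t] = loss_mx[t] + np.min(trans, axis=0)
    # backtrack the optimal path
    states = np.empty(T, dtype=int)
    states[-1] = int(np.argmin(value[-1]))
    for t in range(T - 2, -1, -1):
        states[t] = backptr[t + 1, states[t + 1]]
    return states, float(np.min(value[-1]))

# Toy example: 2 states, constant jump penalty of 0.5 -> optimal path [0, 0, 1, 1]
loss = np.array([[0.1, 1.0], [0.2, 0.9], [1.1, 0.1], [1.0, 0.2]])
penalty = 0.5 * (np.ones((2, 2)) - np.eye(2))
print(dp_viterbi(loss, penalty))
```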
### Sparse Jump Models
Sparse jump models further extend jump models by incorporating feature selection. They simultaneously identify the timing and magnitude of jumps and the features most relevant to these shifts.
#### Objective Function
Sparse jump models maximize an objective function that balances maximizing the weighted between-cluster sum of squares and penalizing state transitions, subject to constraints on feature weights to enforce sparsity.
\begin{split}
\max_{\mathbf{w},\mu,s}& \quad \mathbf{w} \odot \sum_{k=1}^K n_k (\pmb{\mu}_k - \bar{\pmb{\mu}})^2 - \lambda \sum_{t=1}^{T-1} I_{s_t \neq s_{t+1}} \\
\text{subject to}& \quad ||\mathbf{w}||_2^2 \leq 1, \quad ||\mathbf{w}||_1 \leq \kappa, \quad w_i \geq 0 \quad \forall i
\end{split}
#### Key Points
* **Simultaneous Jump Detection and Feature Selection:** Sparse jump models perform both tasks concurrently.
* **Enhanced Interpretability:** Feature weights provide insights into the relative importance of each feature in driving regime shifts.
* **Mitigating Overfitting:** Feature selection helps prevent overfitting, especially when dealing with high-dimensional financial data.
#### Algorithm
:::success
**Sparse Jump Model Algorithm**
**Input**
* `X`: Time series data, shape (n_samples, n_features)
* `n_components`: Number of clusters
* `n_feats`: Target number of features (used to calculate sparsity constraint)
* `jump_penalty`: Penalty for state transitions
* `max_iter`: Maximum number of iterations
* `w_tol`: Tolerance for feature weight convergence
* `jm_max_iter`: Maximum number of iterations for inner Jump Model
* `jm_tol`: Convergence tolerance for inner Jump Model
* `jm_n_init`: Number of initializations for inner Jump Model
**Output**
* Optimized model parameters including cluster centers, state probabilities, labels, and feature weights
**Algorithm**
* **Initialization**
1. Set `s = sqrt(n_feats)` (sparsity constraint)
2. Initialize feature weights `w = ones(n_features) / sqrt(n_features)`
3. Initialize Jump Model instance `jm` with given parameters
* **Main Loop**
* While `n_iter < max_iter` and not converged:
1. Set `w_old = w`
2. Compute feature weights for Jump Model: `feat_weights = sqrt(w)`
3. Fit Jump Model:
* If `n_iter > 1`: Update `jm.centers_ = centers_ * feat_weights`
* Call `jm.fit(X, feat_weights=feat_weights)`
* Update `centers_ = weighted_mean_cluster(X, jm.proba_)`
4. Update feature weights:
* Compute Between-Cluster Sum of Squares (BCSS):
* `BCSS = compute_BCSS(X, jm.proba_, centers_)`
* If `BCSS` is all close to 0, break the loop (all in one cluster)
* Solve Lasso problem to update `w`:
* `w = solve_lasso(BCSS / 1e3, s)`
5. Check convergence:
* If `norm(w - w_old, 1) / norm(w_old, 1) <= w_tol`, break the loop
6. Increment `n_iter`
* **Finalization**
1. Set final feature weights `self.w = w`
2. If `refit`:
* Set `feat_weights[w > 0] = 1`, `feat_weights[w == 0] = 0`
* Refit Jump Model with binary feature weights
3. Else:
* Set `self.feat_weights = sqrt(w)`
4. Compute final centers, labels, and probabilities
5. Return optimized model
**Helper Functions**
* `compute_BCSS(X, proba, centers)`
1. If `centers` is None, compute `centers = weighted_mean_cluster(X, proba)`
2. Compute `BCSS`:
* `BCSS = sum(proba, axis=0) @ ((centers - mean(X, axis=0))^2)`
3. Return `BCSS`
* `solve_lasso(a, s)`
1. Solve the Lasso problem:
* `min 0.5 * ||a - w||_2^2 + lambda * ||w||_1 s.t. ||w||_2 = 1, w >= 0` where `lambda` is chosen to satisfy `sum(w) = s`
2. Use soft-thresholding and binary search to find the solution
3. Return the optimal `w`
* `weighted_mean_cluster(X, proba)`
1. Compute weighted mean of `X` for each cluster using `proba` as weights
2. Return cluster centers
:::
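For concreteness, the `compute_BCSS` and `weighted_mean_cluster` helpers described above can be sketched in NumPy as follows (illustrative only; array shapes follow the pseudocode):

```python
import numpy as np

def weighted_mean_cluster(X, proba):
    """Probability-weighted cluster centers: (K, P) from X (T, P) and proba (T, K)."""
    counts = proba.sum(axis=0)                      # effective cluster sizes, shape (K,)
    return (proba.T @ X) / counts[:, None]

def compute_bcss(X, proba, centers=None):
    """Weighted between-cluster sum of squares per feature, shape (P,)."""
    if centers is None:
        centers = weighted_mean_cluster(X, proba)
    sizes = proba.sum(axis=0)                       # shape (K,)
    return sizes @ (centers - X.mean(axis=0)) ** 2  # (K,) @ (K, P) -> (P,)

# Toy check: two well-separated clusters in feature 0, pure noise in feature 1
rng = np.random.default_rng(1)
X = np.vstack([rng.normal([0, 0], 0.1, (50, 2)), rng.normal([3, 0], 0.1, (50, 2))])
proba = np.vstack([np.tile([1.0, 0.0], (50, 1)), np.tile([0.0, 1.0], (50, 1))])
print(compute_bcss(X, proba))   # feature 0 dominates, so it would receive most of the weight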
### Continuous Jump Models
In financial applications, it's often valuable to estimate the probability of a time period belonging to a specific regime, rather than just assigning a single label. Continuous jump models address this by extending hidden states from discrete labels to probability vectors over all possible states.
#### Model Formulation
The hidden state at time $t$ is represented as a probability vector $\mathbf{s}_t$ on the probability simplex. The model aims to minimize a combination of a weighted loss function and a penalty for abrupt changes in state probabilities.
$$
\min_{S} \left[ \sum_{t=1}^{T} L(y_t, \Theta, \mathbf{s}_t) + \frac{\lambda}{4} \sum_{t=2}^{T} \|\mathbf{s}_{t-1} - \mathbf{s}_t\|_1^2 \right]
$$
where:
* $\mathbf{s}_t \in \Delta^{K-1} := \left\{ \mathbf{s} = (s_0, \ldots, s_{K-1})^T \in \mathbb{R}^K : \sum_{k=0}^{K-1} s_k = 1, \mathbf{s} \geq 0 \right\}$
* $S = (\mathbf{s}_1, \dots, \mathbf{s}_{T})$ is the sequence of state vectors.
* $L(y_t, \Theta, \mathbf{s}_t) = \sum^{K-1}_{k=0} s_{t,k} \ell(y_t, \theta_k)$ is a weighted loss function combining the losses of each state.
* $\Theta = (\theta_0, \dots, \theta_{K-1})$ are the model parameters for each state.
* $\ell(y_t, \theta_k)$ is a loss function, often the scaled squared Euclidean distance.
* $\lambda$ is a regularization parameter controlling the penalty for transitions (jumps) between states.
#### Key Points
* **Probabilistic Interpretation:** Continuous jump models can be viewed as generalized Hidden Markov Models (GHMMs).
* **Mode Loss:** An additional term is often added to the objective function to encourage sparsity in state vectors, favoring higher probabilities for fewer regimes at each time.
* **Optimization:** A coordinate descent approach is used to iteratively optimize model parameters and the hidden state sequence.
#### Algorithm
:::success
**Continuous Jump Model Algorithm**
**Input**
* `X`: Time series data, shape (n_samples, n_features)
* `n_components`: Number of clusters
* `jump_penalty`: Penalty for state transitions
* `grid_size`: Size of grid for discretizing probability simplex
* `max_iter`: Maximum number of iterations
* `tol`: Convergence tolerance
* `n_init`: Number of initializations
* `alpha`: Power parameter for jump penalty (default: 2)
**Output**
* Optimized model parameters including cluster centers, state probabilities, and labels
**Algorithm**
* **Initialization**
1. Set `best_val = infinity`
2. Compute `prob_vecs = discretize_prob_simplex(n_components, grid_size)`
3. Compute pairwise L1 distances: `pairwise_l1_dist = cityblock_distance(prob_vecs, prob_vecs) / 2`
4. Compute `jump_penalty_mx = jump_penalty * (pairwise_l1_dist ^ alpha)`
5. Add mode loss to `jump_penalty_mx` if required
* **Main Loop (repeat `n_init` times)**
1. Initialize `centers` using k-means++ algorithm
2. Set `labels_pre = None`, `val_pre = infinity`
3. Perform E-step:
* Compute loss matrix: `loss_mx = 0.5 * squared_euclidean_distance(X, centers)`
* Compute `loss_mx = loss_mx @ prob_vecs.T`
* `labels_, val_ = dp_viterbi(loss_mx, jump_penalty_mx)`
* `proba_ = prob_vecs[labels_]`
4. While `num_iter < max_iter` and not converged:
* M-step: Update centers
* `centers = weighted_mean_cluster(X, proba_)`
* E-step:
* Recompute `loss_mx`
* Compute `loss_mx = loss_mx @ prob_vecs.T`
* `labels_, val_ = dp_viterbi(loss_mx, jump_penalty_mx)`
* `proba_ = prob_vecs[labels_]`
* Check convergence:
* If `labels_ == labels_pre` or `val_pre - val_ <= tol`, break
* Update `labels_pre, val_pre = labels_, val_`
5. If `val_ < best_val`:
* Update `best_val, best_centers, best_labels, best_proba = val_, centers, labels_, proba_`
* **Finalization**
1. Sort states based on cluster frequencies or provided return series
2. Compute transition matrix from `best_labels`
3. Return `best_centers`, `best_proba`, `best_labels`, `transition_matrix`
**Helper Functions**
* `discretize_prob_simplex(n_components, grid_size)`
1. Generate all grid points on the probability simplex
2. Return array of probability vectors
* `dp_viterbi(loss_mx, penalty_mx)`
1. Solve the dynamic programming problem:
* `min sum(L(t, s_t) for t = 0 to T-1) + sum(Λ(s_(t-1), s_t) for t = 1 to T-1)`
2. Return optimal state sequence and its value
* `weighted_mean_cluster(X, proba)`
1. Compute weighted mean of `X` for each cluster using `proba` as weights
2. Return cluster centers
:::
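One way to realize `discretize_prob_simplex` is to enumerate integer compositions of `grid_size` via a stars-and-bars construction and divide by the grid size. The sketch below is illustrative and not taken from the library:

```python
import numpy as np
from itertools import combinations

def discretize_prob_simplex(n_components, grid_size):
    """Grid points on the (n_components - 1)-simplex with coordinates k / grid_size."""
    points = []
    # "Stars and bars": each choice of bar positions defines one composition of grid_size
    for bars in combinations(range(grid_size + n_components - 1), n_components - 1):
        counts, prev = [], -1
        for b in list(bars) + [grid_size + n_components - 1]:
            counts.append(b - prev - 1)
            prev = b
        points.append(np.array(counts) / grid_size)
    return np.array(points)

# Example: 3 states with grid spacing 1/2 -> 6 probability vectors
print(discretize_prob_simplex(3, 2))
```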
### Code Implementation
The [`jump-models`](https://github.com/Yizhan-Oliver-Shu/jump-models) library in Python offers implementations for jump models, continuous jump models, and sparse jump models.
## References
* MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. In *Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability* (Vol. 1, pp. 281-297). University of California Press.
* Arthur, D., & Vassilvitskii, S. (2007). k-means++: The advantages of careful seeding. In *Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms* (pp. 1027-1035). Society for Industrial and Applied Mathematics.
* Witten, D. M., & Tibshirani, R. (2010). A framework for feature selection in clustering. *Journal of the American Statistical Association*, 105(490), 713–726. https://doi.org/10.1198/jasa.2010.tm09415
* Bemporad, A., Breschi, V., Piga, D., & Boyd, S. (2018). Fitting jump models. *Automatica*, 96, 11–21. https://doi.org/10.1016/j.automatica.2018.06.022
* Nystrup, P., Madsen, H., & Lindström, E. (2018). Dynamic portfolio optimization across hidden market regimes. *Quantitative Finance*, 18(1), 83-95. https://doi.org/10.1080/14697688.2017.1342857
* Nystrup, P., Lindström, E., & Madsen, H. (2020). Learning hidden Markov models with persistent states by penalizing jumps. *Expert Systems with Applications*, 150, 113307. https://doi.org/10.1016/j.eswa.2020.113307
* Nystrup, P., Kolm, P. N., & Lindström, E. (2021). Feature selection in jump models. *Expert Systems with Applications*, 184, 115558. https://doi.org/10.1016/j.eswa.2021.115558
* Cortese, F. P., Kolm, P. N., & Lindström, E. (2023). What drives cryptocurrency returns? A sparse statistical jump model approach. *Digital Finance*, 5, 483–518. https://doi.org/10.1007/s42521-023-00085-x
* Aydınhan, A. O., Kolm, P. N., Mulvey, J. M., & Shu, Y. (2024). Identifying patterns in financial markets: Extending the statistical jump model for regime identification. http://dx.doi.org/10.2139/ssrn.4556048
* Statistical Jump Models (JMs): https://github.com/Yizhan-Oliver-Shu/jump-models