# [Deep Reinforcement Learning for Optimal Portfolio Allocation: A Comparative Study with Mean-Variance Optimization](https://icaps23.icaps-conference.org/papers/finplan/FinPlan23_paper_4.pdf)

J.P. Morgan AI Research; J.P. Morgan Global Equities; Oxford-Man Institute of Quantitative Finance
2023, Association for the Advancement of Artificial Intelligence (www.aaai.org)

## Introduction
- Compares a simple and robust DRL framework, designed around risk-adjusted returns, with a traditional finance method for portfolio optimization: Mean-Variance Portfolio Optimization (MVO).
- Across previous studies, there is usually a discrepancy between the reward function used to train the RL agent and the objective function used for MVO (e.g., maximizing daily returns vs. minimizing risk).
- To make a fair comparison, it is crucial that both approaches optimize for the same goal.
- Additionally, none of these works provide implementation details for the MVO frameworks used in their comparisons.

## Background
- Modern portfolio theory (MPT):
    - A framework that allows an investor to mathematically balance risk tolerance and return expectations to obtain efficiently diversified portfolios.
    - A rational investor will prefer a portfolio with less risk for a specified level of return; it follows that risk can be reduced by diversifying a portfolio.
- Mean-Variance Portfolio Optimization (MVO) is a technique of MPT.
    - The goals are usually to:
        1. Maximize returns for a given level of risk
        2. Achieve a desired rate of return while minimizing risk
        3. Maximize returns generated per unit of risk
    - Risk is usually measured by the volatility of a portfolio (or asset), i.e., the variance of its rate of return.
    - For a given set of assets, this process requires as inputs the rates of return for each asset, along with their covariances.
    - As the true asset returns are unknown, in practice these are estimated or forecasted using various techniques that leverage historical data.
- Objectives:
    - Let $w$ be the weight vector for a set of assets, $\mu$ the expected returns, and $\Sigma$ the covariance matrix.
    - $w^T\Sigma w$ is the portfolio risk (variance).
    - Given a desired rate of return $\mu^{*}$, minimize risk: $\min_{w} w^T\Sigma w$ subject to $w^T\mu \geq \mu^{*}$, $w_i \geq 0$, $\sum_i w_i = 1$.
    - Maximize the Sharpe ratio: $\max_w \frac{\mu^T w - R_f}{(w^T\Sigma w)^{1/2}}$, where $R_f$ is the risk-free rate (this is the objective chosen in this study).

## Problem Setup
- Action
    - Portfolio weights $w = [w_1, \dots, w_N]$, $\sum^N_{i=1}w_i = 1$, $0 \leq w_i \leq 1$
- States (a sketch of assembling the observation appears after this list)
    - ![image](https://hackmd.io/_uploads/Hkm8_IH0a.png)
    - $w_i$: current portfolio weights. These may differ slightly from the weights chosen at the previous timestep, because the continuous weights are converted into an actual allocation (whole shares only) and rebalanced so that they sum to 1.
    - $r_{i,t} = \log\left(\frac{P_{i,t}}{P_{i,t-1}}\right)$: daily log returns computed from end-of-day close prices of asset $i$
    - $T = 60$ days: length of the log-return lookback window included in the state
    - $c$: cash
    - $VIX$ (the CBOE Volatility Index): ![image](https://hackmd.io/_uploads/rkcsTFL0T.png)
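A minimal sketch of how an observation with these components might be assembled. The function name `build_observation`, the argument layout, and the concatenation order are illustrative assumptions, not taken from the paper; the market-regime volatility features described under Experiments could be appended in the same way.

```python
import numpy as np

def build_observation(weights, cash_frac, log_returns_window, vix_level):
    """Flatten the state components listed above into a single vector.

    weights            : current (post-rebalancing) portfolio weights, shape (N,)
    cash_frac          : fraction of portfolio value held in cash
    log_returns_window : trailing T = 60 daily log returns per asset, shape (60, N)
    vix_level          : latest VIX value (e.g., standardized)
    """
    return np.concatenate([
        np.asarray(weights, dtype=np.float32).ravel(),
        np.array([cash_frac], dtype=np.float32),
        np.asarray(log_returns_window, dtype=np.float32).ravel(),
        np.array([vix_level], dtype=np.float32),
    ])
```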
- Reward:
    - The Sharpe ratio is the most widely used measure of risk-adjusted return; however, it is ill-suited to online learning settings because it is defined over a period of time $T$.
    - Instead, we use the differential Sharpe ratio $D_t$, which represents the risk-adjusted return at each timestep $t$ and has been found to yield more consistent returns than maximizing profit (a code sketch of this recursion appears after the Training Process description below).
    - The Sharpe ratio over a period of $t$ returns $R_1, \dots, R_t$ is $S_t = \frac{A_t}{K_t(B_t - A^2_t)^{1/2}}$, with $A_t=\frac{1}{t}\sum_{i=1}^{t}R_i$, $B_t=\frac{1}{t}\sum_{i=1}^{t}R_i^2$, and normalizing factor $K_t = \left(\frac{t}{t-1}\right)^{1/2}$.
    - $A$ and $B$ can be estimated recursively as exponential moving averages of the returns and squared returns on time scale $\eta^{-1}$ (for an annual Sharpe ratio on daily returns, $\eta^{-1} = 252$).
    - Expanding $S_t$ to first order in $\eta$ gives $S_t \approx S_{t-1} + \eta D_t + O(\eta^2)$, where the differential Sharpe ratio is $D_t = \frac{\partial S_t}{\partial \eta} = \frac{B_{t-1}\Delta A_t - \frac{1}{2}A_{t-1}\Delta B_t}{(B_{t-1} - A^2_{t-1})^{3/2}}$
        - $A_t = A_{t-1} + \eta\Delta A_t$, $\Delta A_t = R_t - A_{t-1}$
        - $B_t = B_{t-1} + \eta\Delta B_t$, $\Delta B_t = R_t^2 - B_{t-1}$
        - $A_0 = B_0 = 0$
- Algorithm: PPO
- Environment specifics:
    - We assume there are no transaction costs and allow immediate rebalancing of the portfolio.
    - At the beginning of each timestep $t$, the environment calculates the current portfolio value: $PV_t = \sum_i P_{i,t} \cdot \text{shares}_{i,t-1} + c_{t-1}$
    - The environment allocates $PV_t$ to the assets and cash according to the new weights $w_i$.
    - It rebalances $w_i$ to $w_{i,\text{reb}}$ by multiplying $w_i$ by $PV_t$, rounding down to a whole number of shares, and converting the remaining value into cash.
    - After rebalancing, the environment creates the next state $S_{t+1}$ and proceeds to timestep $t+1$.

## Experiments
- Data and features
    - Adjusted close prices of the S&P 500 sector indices and the VIX index between 2006 and 2021, extracted from Yahoo Finance.
    - ![image](https://hackmd.io/_uploads/SyYODaLR6.png)
    - To capture the market regime, we compute three volatility metrics from the S&P 500 index:
        - $vol20$: the 20-day rolling standard deviation of daily S&P 500 index returns
        - $vol60$: the 60-day rolling standard deviation of daily S&P 500 index returns
        - $\frac{vol20}{vol60}$: the short-term versus long-term volatility trend. A value > 1 indicates that the past 20 days of S&P 500 returns have been more volatile than the past 60 days, which might signal a move from a lower-volatility to a higher-volatility regime.
    - These values are standardized by subtracting the mean and dividing by the standard deviation, where both are estimated over an expanding lookback window to prevent information leakage.
- Training Process
    - The data is split into 10 sliding-window groups (shifted by 1 year).
    - Each group contains 7 years of data: the first 5 years are used for training, the next year is a burn year used for validation, and the last year is kept out-of-sample for backtesting.
    - During the first round of training (2006-2010), we initialize 5 agents with different seeds.
    - At the end of the first round of training, we save the best-performing agent (based on the highest mean episode validation reward in 2011).
    - This agent is used as the seed policy for the next group of 5 agents in the following training window (2007-2011).
    - This process continues until we reach the final validation period of 2020, generating a total of 50 agents (10 periods × 5 agents) and 10 corresponding backtests.
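Returning to the reward defined in the Problem Setup above, here is a minimal sketch of the differential Sharpe ratio recursion with $A_0 = B_0 = 0$. The class name, the default $\eta = 1/252$, and the small-variance guard are illustrative choices rather than details specified in these notes.

```python
class DifferentialSharpeRatio:
    """Differential Sharpe ratio reward D_t, following the recursion above."""

    def __init__(self, eta: float = 1.0 / 252):
        self.eta = eta  # adaptation rate; eta^{-1} = 252 gives an annual time scale on daily returns
        self.A = 0.0    # exponentially weighted first moment of returns, A_0 = 0
        self.B = 0.0    # exponentially weighted second moment of returns, B_0 = 0

    def step(self, R_t: float) -> float:
        delta_A = R_t - self.A
        delta_B = R_t ** 2 - self.B
        variance = self.B - self.A ** 2
        # Guard the start-up case where the variance estimate is still ~0.
        if variance < 1e-12:
            D_t = 0.0
        else:
            D_t = (self.B * delta_A - 0.5 * self.A * delta_B) / variance ** 1.5
        # Update the moment estimates only after computing the reward.
        self.A += self.eta * delta_A
        self.B += self.eta * delta_B
        return D_t
```

In an environment like the one described above, a single `DifferentialSharpeRatio` instance would be kept per episode and `step(R_t)` called with the latest portfolio return at each timestep.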
- Implementation details of PPO (code sketches of this setup and the MVO baseline appear after the Result section)
    - StableBaselines3 implementation of PPO.
    - Vectorized SubprocVecEnv environment wrappers:
        - Collect experience rollouts through multiprocessing across independent instances of our environment.
        - Instead of training the DRL agent on one environment per step, we train on `n_envs = 10` environments per step to gather more diverse experience and speed up training.
    - Model structure
        - [64, 64] fully-connected architecture with tanh activations.
        - Initialize the policy with log standard deviation `log_std_init = -1`.
    - Hyperparameters:
        - ![image](https://hackmd.io/_uploads/ryy2tpICT.png)
        - These parameters are selected by a coarse grid search over held-out validation data.
    - Use deterministic mode for evaluation and backtesting.
- Implementation details of MVO
    - Use the implementation in PyPortfolioOpt.
    - As we wish to compare the model-free DRL approach with MVO, we equalize the training and operational conditions.
    - The MVO approach uses a 60-day lookback window (the same as DRL) to estimate the means and covariances of the assets.
    - Optimize the Sharpe-maximization problem and obtain the weights at every timestep.

## Result
- Overall performance throughout the backtesting period
    - For DRL, we average the performance across the 5 agents (each trained with a different seed) for each year, and then average across all backtest periods.
    - ![image](https://hackmd.io/_uploads/Hk52eRUCa.png)
    - ![image](https://hackmd.io/_uploads/HJT1bR8A6.png)
    - DRL delivers steadier month-to-month returns than MVO.
    - For DRL we observe positive returns in almost all backtest years, which is far more consistent than the behavior of MVO's annual returns.
    - The DRL monthly returns distribution has a lower standard deviation and spread than MVO, and a positive mean.
    - ![image](https://hackmd.io/_uploads/H1cfl08Ra.png)
    - $\Delta p_w$: the sum of the absolute element-wise differences between two consecutive allocations (a turnover measure).
    - $\Delta p_w$ lies between 0 and 2. For example, if the portfolio at time $t-1$ is entirely concentrated in non-cash asset A and at time $t$ entirely in non-cash asset B, all holdings of A must be sold and the equivalent shares of B acquired, giving $\Delta p_w = 2.0$.
    - DRL makes less frequent changes to its portfolio; in practice, this would result in lower average transaction costs.
    - The DRL strategy's performance is derived from the average of five individual agents initialized with different seeds, which provides additional regularization and is likely to result in a more stable out-of-sample strategy compared to the MVO strategy.
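For concreteness, here are minimal sketches of the two setups described in the implementation details above. In the PPO sketch, only the settings stated in these notes (SubprocVecEnv with `n_envs = 10`, a [64, 64] tanh network, `log_std_init = -1`, deterministic evaluation) come from the paper; `PortfolioEnv` and the training budget are placeholders.

```python
import torch.nn as nn
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import SubprocVecEnv

def make_env():
    # PortfolioEnv stands in for the custom market-replay environment
    # described in the Problem Setup; it is not defined here.
    return PortfolioEnv()

if __name__ == "__main__":
    # 10 independent environment instances collect rollouts via multiprocessing.
    vec_env = SubprocVecEnv([make_env for _ in range(10)])
    model = PPO(
        "MlpPolicy",
        vec_env,
        policy_kwargs=dict(
            net_arch=[64, 64],       # fully-connected [64, 64]
            activation_fn=nn.Tanh,   # tanh activations
            log_std_init=-1.0,       # initial log std of the Gaussian policy
        ),
        verbose=1,
        # Remaining hyperparameters come from the paper's grid search (table not reproduced here).
    )
    model.learn(total_timesteps=1_000_000)  # placeholder training budget
    # Backtesting uses deterministic actions:
    # action, _ = model.predict(obs, deterministic=True)
```

The MVO baseline can be sketched with PyPortfolioOpt, assuming a rolling 60-day window of adjusted close prices and a long-only max-Sharpe objective as described above; the function name and the zero risk-free rate are assumptions.

```python
import pandas as pd
from pypfopt import expected_returns, risk_models
from pypfopt.efficient_frontier import EfficientFrontier

def mvo_weights(price_window: pd.DataFrame, risk_free_rate: float = 0.0) -> dict:
    """Max-Sharpe weights from a 60-day window of adjusted close prices (one column per asset)."""
    mu = expected_returns.mean_historical_return(price_window)  # estimated expected returns
    cov = risk_models.sample_cov(price_window)                  # estimated covariance matrix
    ef = EfficientFrontier(mu, cov, weight_bounds=(0, 1))       # long-only weights in [0, 1]
    ef.max_sharpe(risk_free_rate=risk_free_rate)
    return ef.clean_weights()

# At each timestep t (illustrative): weights_t = mvo_weights(prices.iloc[t - 60:t])
```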
## Conclusion
- We have designed a simple environment that serves as a wrapper for the market, sliding over historical data using market replay. The environment can allocate multiple assets and can easily be modified to reflect transaction costs.
- Our experiments demonstrate the improved performance of the DRL framework over Mean-Variance portfolio optimization.
- Future work
    - We would like to model transaction costs and slippage, either by explicitly calculating them during asset reallocation or by adding a penalty term to the reward (a minimal sketch of the penalty idea follows below).
    - Another area of exploration is training a regime-switching model that balances its funds between two agents depending on market volatility (low vs. high).
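As a small illustration of the penalty-term idea above, one could subtract a turnover-proportional cost from the differential Sharpe ratio reward. The function name and the 10 bps cost rate are assumptions for illustration, not values from the paper.

```python
import numpy as np

def reward_with_costs(dsr_reward, prev_weights, new_weights, cost_rate=0.001):
    """Penalize the per-step reward by an assumed proportional transaction cost."""
    turnover = float(np.abs(np.asarray(new_weights) - np.asarray(prev_weights)).sum())
    return dsr_reward - cost_rate * turnover
```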