# Cost-Sensitive Portfolio Selection via Deep Reinforcement Learning
link: https://arxiv.org/abs/2003.03051
## Introduction
- Main challenges from existing portfolio selection methods
- How to represent the non-stationary price series
- Traditional methods: use handcrafted features and perform unsatisfactorily because of their poor representation ability
- DNNs: directly extract price sequential patterns and asset correlations simultaneously, but the dynamic nature of portfolio selection and the **lack of well-labeled data** make DNNs hard to train.
- How to control costs in decision-making
- Transaction cost: ignoring it may lead to aggressive trading
- Risk cost: incurred by the fluctuation of returns
- Most existing methods consider only one of these costs and do not constrain both simultaneously, which may limit their practical performance.
- Main contribution
- A novel two-stream network architecture to capture both price sequential information and asset correlation information.
- A cost-sensitive reward function enables the proposed method to maximize the accumulated return while controlling both costs
- The wealth growth rate can be proven to be close to the theoretical optimum
- Extensive experiments on real-world datasets
## Problem settings
- Notation:
- $n$: number of financial periods
- $m$: number of risky assets ($m+1$ assets when cash is included)
- $d$: number of price features; $d=4$ in this study (open, high, low, close)
- $p_t \in \mathbb{R}_{+}^{(m+1)\times d}$: prices of all assets at timestep $t$
- $p_{t,i} \in \mathbb{R}_{+}^{d}$: feature of asset $i$ at timestep $t$
- $P_t = \{p_{t-k}, ..., p_{t-1}\}$: price series ($k$ is the series length)
- $x_t = \frac{p^c_t}{p^c_{t-1}}$: price relative vector (price change) at timestep $t$; $x_{t,0}$ is the price change of cash, and assuming no inflation or deflation, $x_{t,0} = 1$ for all $t$
- $a_t = [a_{t,0}, a_{t,1}, ...,a_{t,m}] \in \mathbb{R}^{m+1}$: portfolio vector (output by the model), where $a_{t, i}$ is the proportion of asset $i$. Initialized as $a_0 = [1, 0, ..., 0]$ (all wealth in cash)
- $\hat{a}_{t-1} = \frac{a_{t-1}\odot x_{t-1}}{a_{t-1}^{\intercal}x_{t-1}}$: the current portfolio at timestep $t$, i.e., $a_{t-1}$ after drifting with the price change $x_{t-1}$
- $c_t$: proportion of transaction cost at timestep $t$ (0.25% in this case)
- $\omega_t = 1 - c_t$: the proportion of net wealth remaining after transaction costs
- $S$: gross wealth, with $S_0 = 1$; without costs, $S_n = S_0\prod^n_{t=1} a_t^\intercal x_t$; with transaction costs, $S_n = S_0\prod^n_{t=1} a_t^\intercal x_t(1-c_t)$ (a worked example follows)
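A tiny NumPy sanity check of these dynamics (all numbers are made up for illustration):

```python
import numpy as np

# One period with m = 2 risky assets plus cash (index 0).
a_prev = np.array([0.2, 0.5, 0.3])    # portfolio a_{t-1} chosen last period
x_prev = np.array([1.0, 1.04, 0.98])  # price relatives x_{t-1}; cash stays at 1

# The portfolio drifts with prices before rebalancing: \hat{a}_{t-1}
a_hat = a_prev * x_prev / (a_prev @ x_prev)

a_t = np.array([0.1, 0.6, 0.3])       # new portfolio chosen by the model
x_t = np.array([1.0, 1.01, 1.02])     # price relatives of the next period
c_t = 0.0025                          # 0.25% transaction cost rate

S = 1.0                               # S_0 = 1
S *= (a_t @ x_t) * (1.0 - c_t)        # S_1 = S_0 * a_t^T x_t * (1 - c_t)
print(a_hat, S)
```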
- Assumption:
1. Perfect liquidity: each investment can be carried out immediately
2. Zero market impact: the agent's investments have no influence on the financial market **(actions do not affect the environment)**
- Markov Decision Process for Portfolio Selection
- Roughly: the state is the price series $P_t$ together with the current portfolio $\hat{a}_{t-1}$, the action is the portfolio vector $a_t$, and the reward is the (cost-adjusted) log-return of the period.
## Portfolio Policy Network
- General Architecture
- PPN is a two-stream network: the sequential information net and the correlation information net extract features in parallel, and a decision-making module combines them with the previous portfolio vector to output $a_t$.
- Sequential Information Net
- Price sequential pattern reflects the price changes of each asset
- Use an LSTM to extract the non-stationary sequential patterns of the price series
- **The sequential information net processes each asset separately**, and concatenates the features of all assets along the height dimension into a whole feature map (a sketch follows).
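A minimal PyTorch sketch of this stream; shapes follow the notation above, but the hidden size and exact layout are my assumptions, not the paper's configuration:

```python
import torch
import torch.nn as nn

class SequentialInfoNet(nn.Module):
    """One shared LSTM applied to each asset's price window independently."""
    def __init__(self, d: int = 4, hidden: int = 32):
        super().__init__()
        self.lstm = nn.LSTM(input_size=d, hidden_size=hidden, batch_first=True)

    def forward(self, prices: torch.Tensor) -> torch.Tensor:
        # prices: (batch, m, k, d) -- m assets, k periods, d price features
        b, m, k, d = prices.shape
        flat = prices.reshape(b * m, k, d)   # treat every asset as its own sequence
        out, _ = self.lstm(flat)             # (b*m, k, hidden)
        feat = out[:, -1, :]                 # keep the last hidden state per asset
        # Stack per-asset features along the height (asset) dimension.
        return feat.reshape(b, m, -1)        # (batch, m, hidden)
```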
- Correlation Information Net
- Use temporal correlation convolution block (TCCB) to extract the asset correlation and model the price series
- Dilated causal convolution
- Causal convolution can **keep the sequence order invariant and guarantee no information leakage from the future** to the past by using padding and filter shifting
- Plain causal convolution requires very large kernels or many layers to enlarge the receptive field, leading to a large number of parameters.
- To overcome this, we use the dilated operation since it can guarantee exponentially large receptive fields
- Correlational convolution
- Dilated causal convolutions can hardly extract asset correlations, since they process the price of each asset separately with 1D convolutions.
- To combine price information from different assets, we use 1×1 convolutions to **fuse the features of all assets at every time step** (sketched below).
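A hedged PyTorch sketch of a TCCB-style block; kernel size, channel counts, and the exact cross-asset fusion layout are my guesses:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TCCB(nn.Module):
    """Dilated causal conv per asset, then a 1x1 conv mixing assets per step."""
    def __init__(self, m: int, ch: int = 16, kernel: int = 3, dilation: int = 2):
        super().__init__()
        self.pad = (kernel - 1) * dilation            # left-pad length
        self.causal = nn.Conv1d(ch, ch, kernel, dilation=dilation)
        self.corr = nn.Conv1d(m, m, kernel_size=1)    # fuses assets at each step

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, m, ch, k = x.shape                         # per-asset feature maps
        h = x.reshape(b * m, ch, k)
        h = F.pad(h, (self.pad, 0))                   # pad only the past side:
        h = torch.relu(self.causal(h))                # no leakage from the future
        # Move assets onto the channel axis and mix them with the 1x1 conv.
        h = h.reshape(b, m, ch, k).permute(0, 2, 1, 3).reshape(b * ch, m, k)
        h = self.corr(h)
        return h.reshape(b, ch, m, k).permute(0, 2, 1, 3)  # back to (b, m, ch, k)
```

The dilation is what buys the exponentially large receptive field: stacking blocks with dilations 1, 2, 4, ... covers $2^L$ steps with only $L$ layers.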
- Decision Making Module
- Decision-making takes the action from the last period into account, which helps discourage huge changes between consecutive portfolios and thus constrains aggressive trading
- We directly concatenate the portfolio vector from last period into feature maps.
- The recursive portfolio vector $a_{t-1} \in \mathbb{R}^m$ also excludes the cash term
- We then add a fixed cash bias into all feature maps in order to construct complete portfolios, and decide the final portfolio $a_{t} \in \mathbb{R}^{m+1}$ (a sketch follows)
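A hypothetical sketch of the decision-making module; the scoring layer and its size are mine, not the paper's:

```python
import torch
import torch.nn as nn

class DecisionModule(nn.Module):
    """Score each asset from its features plus a_{t-1}, add a fixed cash bias,
    and normalize with softmax so the portfolio sums to one."""
    def __init__(self, feat_dim: int):
        super().__init__()
        self.score = nn.Linear(feat_dim + 1, 1)  # +1 for the concatenated a_{t-1}
        self.cash_bias = nn.Parameter(torch.zeros(1), requires_grad=False)

    def forward(self, feat: torch.Tensor, a_prev: torch.Tensor) -> torch.Tensor:
        # feat: (batch, m, feat_dim); a_prev: (batch, m), cash term excluded
        h = torch.cat([feat, a_prev.unsqueeze(-1)], dim=-1)
        scores = self.score(h).squeeze(-1)                 # (batch, m)
        cash = self.cash_bias.expand(scores.size(0), 1)    # fixed cash score
        return torch.softmax(torch.cat([cash, scores], 1), dim=1)  # (batch, m+1)
```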
## Reinforcement Learning
- (The derivation is math-heavy; only the key components are noted below.)
- Cost-sensitive reward

- Risk sensitive reward:
- Log-return: $\hat{r}^c_t = \log(r^c_t)$, where $r^c_t = a_t^\intercal x_t(1-c_t)$ is the net period return
- Empirical variance of log-return on sampled portfolio data: $\sigma^2(\hat{r}^c_t)$
- Transaction cost constraint:
- It can be proven that $||a_t - \hat{a}_{t-1}||_1 \in (0, \frac{2(1-\psi)}{1+\psi}]$, where $\psi$ is the transaction cost rate (a combined objective is sketched below)
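Putting the pieces together, a hedged reconstruction of the cost-sensitive objective (the paper's exact form and weighting may differ; $\lambda$ and $\gamma$ are the risk and transaction cost trade-off weights from the experiments below):

$$\max_\theta\ \frac{1}{n}\sum_{t=1}^{n}\Big[\hat{r}^c_t \;-\; \lambda\,\sigma^2(\hat{r}^c_t)\;-\;\gamma\,||a_t-\hat{a}_{t-1}||_1\Big]$$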
## Experiment Settings
- Metrics:
- APV (accumulated portfolio value) = $S_n = S_0 \prod^n_{t=1} a_t^\intercal x_t(1-c_t)$, with $S_0=1$
- SR (Sharpe Ratio) = $\frac{Avg(r_t^c)}{Std(r^c_t)}$
- CR (Calmar Ratio) = $\frac{S_n}{\mathrm{MDD}}$, where $\mathrm{MDD} = \max_{\tau > t}\frac{S_t - S_{\tau}}{S_t}$ is the maximum drawdown
- TO (turnover): $TO = \frac{1}{2n}\sum^{n}_{t=1}||\hat{a}_{t-1} - a_t\omega_t||_1$
- It is used to examine the influence of transaction costs, since it estimates the average trading volume (all four metrics are sketched below)
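A NumPy sketch computing all four metrics from per-period data (shapes and argument names are my assumptions):

```python
import numpy as np

def evaluate(r, a, a_hat, omega):
    r"""r: net returns a_t^T x_t (1-c_t), shape (n,); a: portfolios (n, m+1);
    a_hat: drifted portfolios \hat{a}_{t-1}, (n, m+1); omega: 1 - c_t, (n,)."""
    S = np.cumprod(r)                              # wealth curve with S_0 = 1
    apv = S[-1]                                    # APV = S_n
    sr = r.mean() / r.std()                        # SR, per the definition above
    peak = np.maximum.accumulate(S)                # running maximum of wealth
    mdd = ((peak - S) / peak).max()                # maximum drawdown
    cr = apv / mdd                                 # Calmar ratio
    to = np.abs(a_hat - a * omega[:, None]).sum(1).mean() / 2  # turnover
    return apv, sr, cr, to
```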
- Datasets and preprocessing
- Datasets are cryptocurrency data collected from the Poloniex exchange
- We set Bitcoin as the risk-free cash asset and select risky assets with the top monthly trading volumes on Poloniex.
- The price window of each asset spans 30 trading periods, and each period is 30 minutes long.
- Normalization: since decision-making in portfolio selection relies on relative rather than absolute price changes, we normalize the price series by the prices of the last period ($P_t = \frac{P_t}{P_{t, 30}}\in \mathbb{R}^{m\times30\times4}$); a sketch follows
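A one-line NumPy version of this normalization (window shape per the note above; whether every feature divides by its own last value or by the last close is my assumption):

```python
import numpy as np

def normalize_window(P):
    """P: one price window of shape (m, 30, 4) = (assets, periods, OHLC)."""
    return P / P[:, -1:, :]   # divide by the last period's prices, broadcast over time
```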
- Evaluation on Profitability

- EIIE and the PPN-based methods perform better than all other baselines in terms of APV, since these methods adopt neural networks to learn policies via reinforcement learning
- PPN-based methods performing better than EIIE implies that they extract better sequential feature representations.
- PPN outperforming PPN-I in terms of APV confirms the effectiveness and significance of **asset correlations** in portfolio selection.
- Variants of proposed method
- PPN-I: the full PPN but without the correlation convolution (assets processed independently)
- PPN-LSTM: only LSTM as backbone
- PPN-TCB: TCCB without correlation convolution as backbone
- PPN-TCCB: only TCCB as backbone
- PPN-TCB-LSTM: cascading TCB and LSTM as backbone
- PPN-TCCB-LSTM: cascading TCCB and LSTM as backbone
- All variants that consider asset correlations outperform their independent variants
- All combined variants (PPN, PPN-I, and the cascaded modules) outperform the variants that adopt only LSTM, TCB, or TCCB.
- Backtesting:
- PPN is not always the best throughout the backtest
- Since the correlation between two price events decays exponentially with their temporal distance, later backtest periods resemble the training data less and less; that PPN still performs well there demonstrates its better generalization ability.
- There are periods in which all methods suffer significant drawdowns; this likely results from market factors rather than the models themselves
- Evaluation on Cost-sensitivity
- Influences of transaction costs

- PPN achieves the best APV performance across a wide range of transaction cost rates. This observation further confirms the profitability of PPN.
- Compared to EIIE, PPN-based methods obtain relatively low TO
- When the transaction cost rate is very large (c = 5%), PPN-based algorithms tend to **stop trading and make nearly no gains or losses**, while EIIE loses most of its wealth with a relatively high TO.
- Cost-sensitivity to transaction costs


- With the increase of $\gamma$, TO values of PPN decrease
- Introducing $||a_t - \hat{a}_{t-1}||_1$ prevents meaningless trades whose transaction cost outweighs their benefit
- At **$\gamma = 10^{-3}$**, PPN achieves the best APV performance
- If $\gamma$ is too small, PPN tends to trade aggressively, incurring large transaction costs
- If $\gamma$ is too large, PPN tends to trade passively, which limits its profitability
- Cost-sensitivity to risk cost

- With the increase of $\lambda$, the STD and MDD values of PPN asymptotically decrease on all datasets.
- This result implies that constraining the volatility of returns is helpful to control the downward risk.
## Discussion
- Reinforcement Learning Algorithm Selection: why not use A2C
- Actor-critic (AC) methods require learning a "critic" network to approximate the value function, which then generates the policy gradient to update the "actor" network
- Three kinds of value functions
- **State value**: measures the performance of the current state
- **State-Action value** (Q value): measures the performance of the chosen action in the current state
- **Advantage value**: measures the advantage of the chosen action over the average performance in the current state.
- The state value is unsuitable for our case, since the action of PPN does not affect the environment state (Assumption 2)
- The Q value is also unsuitable, since a Q network is often hard to train under a non-stationary decision process
- The advantage value is likewise inappropriate, since its optimization relies on accurate estimates of both the state and Q values (standard definitions below).
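For reference, the three value functions in standard notation (here $\gamma$ is the usual RL discount factor, not the paper's transaction cost weight):

$$V^{\pi}(s)=\mathbb{E}_{\pi}\Big[\textstyle\sum_{k\ge0}\gamma^{k}r_{t+k}\,\Big|\,s_t=s\Big],\qquad Q^{\pi}(s,a)=\mathbb{E}_{\pi}\Big[\textstyle\sum_{k\ge0}\gamma^{k}r_{t+k}\,\Big|\,s_t=s,a_t=a\Big],\qquad A^{\pi}(s,a)=Q^{\pi}(s,a)-V^{\pi}(s)$$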

- Application to Stock Portfolio Selection


## Conclusion
- By devising a new two-stream architecture, the proposed network is able to extract both price sequential patterns and asset correlations
- In addition, to maximize the accumulated return while controlling both transaction and risk costs, we develop a new cost-sensitive reward function and adopt the direct policy gradient algorithm to optimize it.
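Because the reward is differentiable in the network's output, the direct policy gradient reduces to ordinary gradient ascent on the reward. A minimal sketch, assuming the PPN module and the reconstructed reward from earlier (hyperparameter values are illustrative only):

```python
import torch

def train_step(ppn, optimizer, P, a_hat_prev, x, gamma_c=1e-3, lam=1e-2, c=0.0025):
    r"""One direct policy-gradient step on a sampled batch of periods.
    P: price windows; a_hat_prev: drifted portfolios \hat{a}_{t-1}, (batch, m+1);
    x: price relatives x_t, (batch, m+1)."""
    a = ppn(P, a_hat_prev)                             # (batch, m+1) portfolios
    log_r = torch.log((a * x).sum(dim=1) * (1 - c))    # log net returns
    reward = (log_r.mean()
              - lam * log_r.var()                      # risk (variance) penalty
              - gamma_c * (a - a_hat_prev).abs().sum(1).mean())  # cost penalty
    optimizer.zero_grad()
    (-reward).backward()                               # ascend the reward directly
    optimizer.step()
    return reward.item()
```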