# Cost-Sensitive Portfolio Selection via Deep Reinforcement Learning
link: https://arxiv.org/abs/2003.03051
## Introduction
- Main challenges from existing portfolio selection methods
- How to represent the non-stationary price series
- Traditional methods: use handcrafted features and perform unsatisfactorily because of their poor representation ability
- DNNs: directly extract price sequential patterns and asset correlations simultaneously, but the dynamic nature of portfolio selection and the **lack of well-labeled data** make DNNs hard to train.
- How to control costs in decision-making
- Transaction cost: ignoring it may lead to aggressive trading
- Risk cost: incurred by the fluctuation of returns
- Most existing methods consider only one of these costs and do not constrain both simultaneously, which may limit their practical performance.
- Main contribution
- A novel two-stream network architecture to capture both price sequential information and asset correlation information.
- A cost-sensitive reward function enables the proposed method to maximize the accumulated return while controlling both costs
- The wealth growth rate can be proven to be close to the theoretical optimum
- Extensive experiments on real-world datasets
## Problem settings
- Notation:
- $n$: number of financial periods
- $m$: number of risky assets ($m+1$ assets when cash is included)
- $d$: number of price features; $d=4$ in this study (open, high, low, close)
- $p_t \in \mathbb{R}_{+}^{(m+1)\times d}$: prices of all assets at timestep $t$
- $p_{t,i} \in \mathbb{R}_{+}^{d}$: feature of asset $i$ at timestep $t$
- $P_t = \{p_{t-k}, ..., p_{t-1}\}$: price series ($k$ is the series length)
- $x_t = \frac{p^c_t}{p^c_{t-1}}$: price relative vector (price change) at timestep $t$; $x_{t,0}$ is the price change of cash, and assuming no inflation or deflation, $x_{t,0} = 1$ for all $t$
- $a_t = [a_{t,0}, a_{t,1}, ...,a_{t,m}] \in \mathbb{R}^{m+1}$: portfolio vector (output by the model), where $a_{t, i}$ is the proportion of asset $i$. Initialized as $a_0 = [1, 0, ..., 0]$ (all wealth in cash)
- $\hat{a}_{t-1} = \frac{a_{t-1}\odot x_{t-1}}{a_{t-1}^{\intercal}x_{t-1}}$: the current portfolio at timestep $t$, i.e., $a_{t-1}$ after drifting with the price change $x_{t-1}$
- $c_t$: proportion of transaction cost at timestep $t$ (0.25% in this case)
- $\omega_t = 1 - c_t$: the proportion of net wealth remaining after transaction costs
- $S$: gross wealth, with $S_0 = 1$; without costs, $S_n = S_0\prod^n_{t=1} a_t^\intercal x_t$; with transaction costs, $S_n = S_0\prod^n_{t=1} a_t^\intercal x_t(1-c_t)$ (a worked example follows)
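A tiny NumPy sanity check of these dynamics (all numbers are made up for illustration):

```python
import numpy as np

# One period with m = 2 risky assets plus cash (index 0).
a_prev = np.array([0.2, 0.5, 0.3])    # portfolio a_{t-1} chosen last period
x_prev = np.array([1.0, 1.04, 0.98])  # price relatives x_{t-1}; cash stays at 1

# The portfolio drifts with prices before rebalancing: \hat{a}_{t-1}
a_hat = a_prev * x_prev / (a_prev @ x_prev)

a_t = np.array([0.1, 0.6, 0.3])       # new portfolio chosen by the model
x_t = np.array([1.0, 1.01, 1.02])     # price relatives of the next period
c_t = 0.0025                          # 0.25% transaction cost rate

S = 1.0                               # S_0 = 1
S *= (a_t @ x_t) * (1.0 - c_t)        # S_1 = S_0 * a_t^T x_t * (1 - c_t)
print(a_hat, S)
```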
- Assumption:
1. Perfect liquidity: each investment can be carried out immediately
2. Zero market impact: the agent's investments have no influence on the financial market **(actions do not affect the environment)**
- Markov Decision Process for Portfolio Selection
- Roughly: the state is the price series $P_t$ together with the current portfolio $\hat{a}_{t-1}$, the action is the portfolio vector $a_t$, and the reward is the (cost-adjusted) log-return of the period.
## Portfolio Policy Network
- General Architecture
- PPN is a two-stream network: the sequential information net and the correlation information net extract features in parallel, and a decision-making module combines them with the previous portfolio vector to output $a_t$.
- Sequential Information Net
- Price sequential pattern reflects the price changes of each asset
- Use an LSTM to extract the non-stationary sequential patterns of the price series
- **The sequential information net processes each asset separately**, and concatenates the features of all assets along the height dimension into a whole feature map (a sketch follows).
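A minimal PyTorch sketch of this stream; shapes follow the notation above, but the hidden size and exact layout are my assumptions, not the paper's configuration:

```python
import torch
import torch.nn as nn

class SequentialInfoNet(nn.Module):
    """One shared LSTM applied to each asset's price window independently."""
    def __init__(self, d: int = 4, hidden: int = 32):
        super().__init__()
        self.lstm = nn.LSTM(input_size=d, hidden_size=hidden, batch_first=True)

    def forward(self, prices: torch.Tensor) -> torch.Tensor:
        # prices: (batch, m, k, d) -- m assets, k periods, d price features
        b, m, k, d = prices.shape
        flat = prices.reshape(b * m, k, d)   # treat every asset as its own sequence
        out, _ = self.lstm(flat)             # (b*m, k, hidden)
        feat = out[:, -1, :]                 # keep the last hidden state per asset
        # Stack per-asset features along the height (asset) dimension.
        return feat.reshape(b, m, -1)        # (batch, m, hidden)
```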
- Correlation Information Net
- Use temporal correlation convolution block (TCCB) to extract the asset correlation and model the price series
- Dilated causal convolution
- Causal convolution can **keep the sequence order invariant and guarantee no information leakage from the future** to the past by using padding and filter shifting
- Plain causal convolution requires very large kernels or many layers to enlarge the receptive field, leading to a large number of parameters.
- To overcome this, we use the dilated operation since it can guarantee exponentially large receptive fields
- Correlational convolution
- Dilated causal convolutions can hardly extract asset correlations, since they process the price of each asset separately with 1D convolutions.
- To combine price information from different assets, we use 1×1 convolutions to **fuse the features of all assets at every time step** (sketched below).
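A hedged PyTorch sketch of a TCCB-style block; kernel size, channel counts, and the exact cross-asset fusion layout are my guesses:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TCCB(nn.Module):
    """Dilated causal conv per asset, then a 1x1 conv mixing assets per step."""
    def __init__(self, m: int, ch: int = 16, kernel: int = 3, dilation: int = 2):
        super().__init__()
        self.pad = (kernel - 1) * dilation            # left-pad length
        self.causal = nn.Conv1d(ch, ch, kernel, dilation=dilation)
        self.corr = nn.Conv1d(m, m, kernel_size=1)    # fuses assets at each step

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, m, ch, k = x.shape                         # per-asset feature maps
        h = x.reshape(b * m, ch, k)
        h = F.pad(h, (self.pad, 0))                   # pad only the past side:
        h = torch.relu(self.causal(h))                # no leakage from the future
        # Move assets onto the channel axis and mix them with the 1x1 conv.
        h = h.reshape(b, m, ch, k).permute(0, 2, 1, 3).reshape(b * ch, m, k)
        h = self.corr(h)
        return h.reshape(b, ch, m, k).permute(0, 2, 1, 3)  # back to (b, m, ch, k)
```

The dilation is what buys the exponentially large receptive field: stacking blocks with dilations 1, 2, 4, ... covers $2^L$ steps with only $L$ layers.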
- Decision Making Module
- Decision-making takes the action from the last period into account, which helps discourage huge changes between consecutive portfolios and thus constrains aggressive trading
- We directly concatenate the portfolio vector from last period into feature maps.
- The recursive portfolio vector $a_{t-1} \in \mathbb{R}^m$ also excludes the cash term
- We then add a fixed cash bias into all feature maps in order to construct complete portfolios, and decide the final portfolio $a_{t} \in \mathbb{R}^{m+1}$ (a sketch follows)
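A hypothetical sketch of the decision-making module; the scoring layer and its size are mine, not the paper's:

```python
import torch
import torch.nn as nn

class DecisionModule(nn.Module):
    """Score each asset from its features plus a_{t-1}, add a fixed cash bias,
    and normalize with softmax so the portfolio sums to one."""
    def __init__(self, feat_dim: int):
        super().__init__()
        self.score = nn.Linear(feat_dim + 1, 1)  # +1 for the concatenated a_{t-1}
        self.cash_bias = nn.Parameter(torch.zeros(1), requires_grad=False)

    def forward(self, feat: torch.Tensor, a_prev: torch.Tensor) -> torch.Tensor:
        # feat: (batch, m, feat_dim); a_prev: (batch, m), cash term excluded
        h = torch.cat([feat, a_prev.unsqueeze(-1)], dim=-1)
        scores = self.score(h).squeeze(-1)                 # (batch, m)
        cash = self.cash_bias.expand(scores.size(0), 1)    # fixed cash score
        return torch.softmax(torch.cat([cash, scores], 1), dim=1)  # (batch, m+1)
```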
## Reinforcement Learning
- (The derivation is math-heavy; only the key components are noted below.)
- Cost-sensitive reward

- Risk sensitive reward:
- Log-return: $\hat{r}^c_t = \log(r^c_t)$, where $r^c_t = a_t^\intercal x_t(1-c_t)$ is the net period return
- Empirical variance of log-return on sampled portfolio data: $\sigma^2(\hat{r}^c_t)$
- Transaction cost constraint:
- It can be proven that $||a_t - \hat{a}_{t-1}||_1 \in (0, \frac{2(1-\psi)}{1+\psi}]$, where $\psi$ is the transaction cost rate (a combined objective is sketched below)
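Putting the pieces together, a hedged reconstruction of the cost-sensitive objective (the paper's exact form and weighting may differ; $\lambda$ and $\gamma$ are the risk and transaction cost trade-off weights from the experiments below):

$$\max_\theta\ \frac{1}{n}\sum_{t=1}^{n}\Big[\hat{r}^c_t \;-\; \lambda\,\sigma^2(\hat{r}^c_t)\;-\;\gamma\,||a_t-\hat{a}_{t-1}||_1\Big]$$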
## Experiment Settings
- Metrics:
- APV (accumulated portfolio value) = $S_n = S_0 \prod^n_{t=1} a_t^\intercal x_t(1-c_t)$, with $S_0=1$
- SR (Sharpe Ratio) = $\frac{Avg(r_t^c)}{Std(r^c_t)}$
- CR (Calmar Ratio) = $\frac{S_n}{\mathrm{MDD}}$, where $\mathrm{MDD} = \max_{\tau > t}\frac{S_t - S_{\tau}}{S_t}$ is the maximum drawdown
- TO (turnover): $TO = \frac{1}{2n}\sum^{n}_{t=1}||\hat{a}_{t-1} - a_t\omega_t||_1$
- It is used to examine the influence of transaction costs, since it estimates the average trading volume (all four metrics are sketched below)
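A NumPy sketch computing all four metrics from per-period data (shapes and argument names are my assumptions):

```python
import numpy as np

def evaluate(r, a, a_hat, omega):
    r"""r: net returns a_t^T x_t (1-c_t), shape (n,); a: portfolios (n, m+1);
    a_hat: drifted portfolios \hat{a}_{t-1}, (n, m+1); omega: 1 - c_t, (n,)."""
    S = np.cumprod(r)                              # wealth curve with S_0 = 1
    apv = S[-1]                                    # APV = S_n
    sr = r.mean() / r.std()                        # SR, per the definition above
    peak = np.maximum.accumulate(S)                # running maximum of wealth
    mdd = ((peak - S) / peak).max()                # maximum drawdown
    cr = apv / mdd                                 # Calmar ratio
    to = np.abs(a_hat - a * omega[:, None]).sum(1).mean() / 2  # turnover
    return apv, sr, cr, to
```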
- Datasets and preprocessing
- Datasets are cryptocurrency data collected from the Poloniex exchange
- We set Bitcoin as the risk-free cash asset and select risky assets with the top monthly trading volumes on Poloniex.
- The price window of each asset spans 30 trading periods, and each period is 30 minutes long.
- Normalization: since decision-making in portfolio selection relies on relative rather than absolute price changes, we normalize the price series by the prices of the last period ($P_t = \frac{P_t}{P_{t, 30}}\in \mathbb{R}^{m\times30\times4}$); a sketch follows
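A one-line NumPy version of this normalization (window shape per the note above; whether every feature divides by its own last value or by the last close is my assumption):

```python
import numpy as np

def normalize_window(P):
    """P: one price window of shape (m, 30, 4) = (assets, periods, OHLC)."""
    return P / P[:, -1:, :]   # divide by the last period's prices, broadcast over time
```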
- Evaluation on Profitability

- EIIE and the PPN-based methods perform better than all other baselines in terms of APV, since these methods adopt neural networks to learn policies via reinforcement learning
- PPN-based methods performing better than EIIE implies that they extract better sequential feature representations.
- PPN outperforming PPN-I in terms of APV confirms the effectiveness and significance of **asset correlations** in portfolio selection.
- Variants of proposed method
- PPN-I: the full PPN but without the correlation convolution (assets processed independently)
- PPN-LSTM: only LSTM as backbone
- PPN-TCB: TCCB without correlation convolution as backbone
- PPN-TCCB: only TCCB as backbone
- PPN-TCB-LSTM: cascading TCB and LSTM as backbone
- PPN-TCCB-LSTM: cascading TCCB and LSTM as backbone
- All variants that consider asset correlations outperform their independent variants
- All combined variants (PPN, PPN-I, and the cascaded modules) outperform the variants that adopt only LSTM, TCB, or TCCB.
- Backtesting:
- PPN is not always the best throughout the backtest
- Since the correlation between two price events decays exponentially with their temporal distance, later backtest periods resemble the training data less and less; that PPN still performs well there demonstrates its better generalization ability.
- There are periods in which all methods suffer significant drawdowns; this likely results from market factors rather than the models themselves
- Evaluation on Cost-sensitivity
- Influences of transaction costs

- PPN achieves the best APV performance across a wide range of transaction cost rates. This observation further confirms the profitability of PPN.
- Compared to EIIE, PPN-based methods obtain relatively low TO
- When the transaction cost rate is very large (c = 5%), PPN-based algorithms tend to **stop trading and make nearly no gains or losses**, while EIIE loses most of its wealth with a relatively high TO.
- Cost-sensitivity to transaction costs


- With the increase of $\gamma$, TO values of PPN decrease
- Introducing $||a_t - \hat{a}_{t-1}||_1$ prevents meaningless trades whose transaction cost outweighs their benefit
- At **$\gamma = 10^{-3}$**, PPN achieves the best APV performance
- If $\gamma$ is too small, PPN tends to trade aggressively, incurring large transaction costs
- If $\gamma$ is too large, PPN tends to trade passively, which limits its profitability
- Cost-sensitivity to risk cost

- With the increase of $\lambda$, the STD and MDD values of PPN asymptotically decrease on all datasets.
- This result implies that constraining the volatility of returns is helpful to control the downward risk.
## Discussion
- Reinforcement Learning Algorithm Selection: why not use A2C
- Actor-critic (AC) methods require learning a "critic" network to approximate the value function, which then generates the policy gradient to update the "actor" network
- Three kinds of value functions
- **State value**: measures the performance of the current state
- **State-Action value** (Q value): measures the performance of the chosen action in the current state
- **Advantage value**: measures the advantage of the chosen action over the average performance in the current state.
- The state value is unsuitable for our case, since the action of PPN does not affect the environment state (Assumption 2)
- The Q value is also unsuitable, since a Q network is often hard to train under a non-stationary decision process
- The advantage value is likewise inappropriate, since its optimization relies on accurate estimates of both the state and Q values (standard definitions below).
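For reference, the three value functions in standard notation (here $\gamma$ is the usual RL discount factor, not the paper's transaction cost weight):

$$V^{\pi}(s)=\mathbb{E}_{\pi}\Big[\textstyle\sum_{k\ge0}\gamma^{k}r_{t+k}\,\Big|\,s_t=s\Big],\qquad Q^{\pi}(s,a)=\mathbb{E}_{\pi}\Big[\textstyle\sum_{k\ge0}\gamma^{k}r_{t+k}\,\Big|\,s_t=s,a_t=a\Big],\qquad A^{\pi}(s,a)=Q^{\pi}(s,a)-V^{\pi}(s)$$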

- Application to Stock Portfolio Selection


## Conclusion
- By devising a new two-stream architecture, the proposed network is able to extract both price sequential patterns and asset correlations
- In addition, to maximize the accumulated return while controlling both transaction and risk costs, we develop a new cost-sensitive reward function and adopt the direct policy gradient algorithm to optimize it.
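Because the reward is differentiable in the network's output, the direct policy gradient reduces to ordinary gradient ascent on the reward. A minimal sketch, assuming the PPN module and the reconstructed reward from earlier (hyperparameter values are illustrative only):

```python
import torch

def train_step(ppn, optimizer, P, a_hat_prev, x, gamma_c=1e-3, lam=1e-2, c=0.0025):
    r"""One direct policy-gradient step on a sampled batch of periods.
    P: price windows; a_hat_prev: drifted portfolios \hat{a}_{t-1}, (batch, m+1);
    x: price relatives x_t, (batch, m+1)."""
    a = ppn(P, a_hat_prev)                             # (batch, m+1) portfolios
    log_r = torch.log((a * x).sum(dim=1) * (1 - c))    # log net returns
    reward = (log_r.mean()
              - lam * log_r.var()                      # risk (variance) penalty
              - gamma_c * (a - a_hat_prev).abs().sum(1).mean())  # cost penalty
    optimizer.zero_grad()
    (-reward).backward()                               # ascend the reward directly
    optimizer.step()
    return reward.item()
```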