# Explainable Deep Reinforcement Learning for Portfolio Management: An Empirical Approach
link: https://arxiv.org/abs/2111.03995
ACM 2021

## Introduction
- Explaining a portfolio management strategy is important to investment banks, asset management companies and hedge funds.
- It is challenging to explain a DRL-based portfolio management strategy due to the black-box nature of deep neural networks.
- Existing methods such as saliency maps are used for CV and NLP, but have not yet been applied to financial applications.
- Some researchers explain DRL-based portfolio management strategies using an attention model. However, this does not explain the decision-making process of a DRL agent in a proper financial context.
- Contribution
    - Empirical approach to understand the strategies of DRL: use the coefficients of a linear model in hindsight as the reference feature weights.
    - For a DRL strategy, we use integrated gradients to define the feature weights.
    - We quantify the prediction power by calculating the linear correlations between the feature weights of a DRL agent and the reference feature weights.
    - We consider both the single-step case and the multi-step case.

## Integrated gradients
- It integrates the gradient of the output with respect to the input features. For an input $\textbf{x} \in \mathbb{R}^n$, the $i$-th entry of the integrated gradient is defined as
    - ![image](https://hackmd.io/_uploads/BysV5BB8p.png), where $F$ denotes a DRL model and $\textbf{x}'$ is a perturbed version of $\textbf{x}$, for example with all entries replaced by zeros. It explains a model's predictions in terms of its input features.
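
The definition above can be approximated numerically. The sketch below is not the paper's implementation: it assumes the standard form $IG(\textbf{x})_i = (x_i - x'_i)\int_0^1 \frac{\partial F(\textbf{x}' + z(\textbf{x} - \textbf{x}'))}{\partial x_i}\,dz$, an all-zero baseline $\textbf{x}'$, and a toy function `F` with an analytic gradient standing in for the DRL model. The names `integrated_gradients`, `F`, `grad_F` and `a` are hypothetical.

```python
import numpy as np

def integrated_gradients(grad_fn, x, baseline=None, steps=200):
    """Approximate IG_i = (x_i - x'_i) * integral_0^1 dF/dx_i(x' + z (x - x')) dz."""
    x = np.asarray(x, dtype=float)
    if baseline is None:
        baseline = np.zeros_like(x)  # assumed all-zero baseline x'
    # Midpoint rule along the straight path from the baseline to x.
    zs = (np.arange(steps) + 0.5) / steps
    grads = np.stack([grad_fn(baseline + z * (x - baseline)) for z in zs])
    return (x - baseline) * grads.mean(axis=0)

# Toy stand-in for the model F (an assumption, not a DRL agent): F(x) = tanh(a . x)
a = np.array([0.5, -1.0, 2.0])

def F(x):
    return np.tanh(a @ x)

def grad_F(x):
    # Analytic gradient of tanh(a . x) with respect to x.
    return (1.0 - np.tanh(a @ x) ** 2) * a

x = np.array([1.0, 0.5, 0.25])
ig = integrated_gradients(grad_F, x)
print("feature attributions:", ig)
print("sum of attributions:", ig.sum(), "vs F(x) - F(x'):", F(x) - F(np.zeros(3)))
```

The final print is a sanity check: the attributions should sum to approximately $F(\textbf{x}) - F(\textbf{x}')$, which is the completeness property of integrated gradients.
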
- reference: https://medium.com/ai-academy-taiwan/%E5%8F%AF%E8%A7%A3%E9%87%8B-ai-xai-%E7%B3%BB%E5%88%97-02-%E5%9F%BA%E6%96%BC%E6%A2%AF%E5%BA%A6%E7%9A%84%E6%96%B9%E6%B3%95-gradient-based-b639932c1620

## Portfolio Management Task
- Notations:
    - $N$: number of risky assets
    - $T$: number of time slots
    - $\textbf{p}(t)\in\mathbb{R}^N$: closing prices of all assets at time $t$
        - $\textbf{p}(0)$ is the opening price
    - $\textbf{y}(t)\in\mathbb{R}^N$: price relative vector ![image](https://hackmd.io/_uploads/SyA_NLBU6.png)
    - $\textbf{w}(t)\in\mathbb{R}^N$: portfolio weights, updated at the beginning of time slot $t$
    - $v(t) \in \mathbb{R}$: portfolio value at the beginning of time slot $t+1$
        - $v(0)$ is the initial capital
    - $\rho(t)$: rate of portfolio return ![image](https://hackmd.io/_uploads/SyDV88BLp.png)
    - $r(t)$: logarithmic rate of portfolio return ![image](https://hackmd.io/_uploads/rJfkwLSI6.png)
- The risk of a portfolio is defined as the variance of the rate of portfolio return
    - ![image](https://hackmd.io/_uploads/H1P8v8SUa.png)
    - $\Sigma(t) = \mathrm{Cov}(\textbf{y}(t)) \in \mathbb{R}^{N\times N}$
- If there is no transaction cost, the final portfolio value is $v(T) = v(0)\prod_{t=1}^{T}\textbf{w}(t)^\intercal \textbf{y}(t)$
- The portfolio management task aims to find a portfolio weight vector such that ![image](https://hackmd.io/_uploads/BydbqIBL6.png)
- Since $\textbf{y}(t)$ and $\Sigma(t)$ are revealed at the end of time slot $t$, we estimate them by CAPM (Capital Asset Pricing Model).
- (8) can be reformulated as ![image](https://hackmd.io/_uploads/BJX4sLBLa.png)

## Deep Reinforcement Learning for Portfolio Management
![image](https://hackmd.io/_uploads/ryH1Q_HIa.png)
- State space: ![image](https://hackmd.io/_uploads/B1IlXdrL6.png)
- Action space: portfolio weight vector $\textbf{w}(t) \in \mathbb{R}^N$ that satisfies (9)
- Reward function: ![image](https://hackmd.io/_uploads/BkJ2Q_HIa.png)
- Feature Weights Using Integrated Gradients
    - We use integrated gradients to measure the feature weights ($\gamma$ is the discount factor of RL).
    - ![image](https://hackmd.io/_uploads/Hko5Kqr8T.png)
    - ![image](https://hackmd.io/_uploads/H1PnF5H8a.png)

## Explanation Method
- Reference feature weights
    - We use a linear model in hindsight as a reference model
        - The weights are optimized using the actual stock returns and the actual sample covariance matrix
        - It is the upper bound on the performance that any linear predictive model would have been able to achieve.
    - Get the portfolio value relative vector by $\textbf{q}(t) = \textbf{w}(t) \odot \textbf{y}(t) \in \mathbb{R}^N$
    - Build the linear model ![image](https://hackmd.io/_uploads/HJh_i5r8p.png)
    - ![image](https://hackmd.io/_uploads/B1qNh5HLa.png) characterizes the total contribution of the $k$-th feature to the portfolio value at time $t$
- Feature weight for DRL trading agent
    - At the beginning of a trading slot $t$, the agent takes the feature vectors and covariance matrix as input and outputs an action vector $\textbf{w}(t)$
    - Get the portfolio value relative vector by $\textbf{q}(t) = \textbf{w}(t) \odot \textbf{y}(t) \in \mathbb{R}^N$
    - We can represent it as a linear regression model ![image](https://hackmd.io/_uploads/rkRibsS86.png)
    - By (15), we can define the feature weights for the $k$-th feature as ![image](https://hackmd.io/_uploads/rytLGsSI6.png) (equality holds since $\textbf{w}^{\intercal}(t+l)\cdot\textbf{y}(t+l) = \textbf{1}^{\intercal}\textbf{q}(t+l)$ and $\frac{\partial\textbf{q}(t+l)}{\partial f^k(t+l)} = c_k(t+l)$)
    - Assume the time dependency of features on stocks follows the power law, $\frac{f^k(t+l)_i}{f^k(t)_i} = l^{-\alpha}$ for $l \ge 1$; then the feature weights are ![image](https://hackmd.io/_uploads/H1hswjrLp.png)
    - Note that $\textbf{M}^{\pi}(t)_k$ has a form similar to (18)
    - DRL agents find portfolio weights with a long-term goal.
- Conventional Machine Learning Methods with Forward-Pass
    - Predict stock returns with machine learning methods using features: $\hat{\textbf{y}}(t) = g(f^1(t), ..., f^K(t))$, where $g$ is the machine learning model
        - These methods rely on single-step prediction
    - Find optimal portfolio weights under predicted stock returns: $\textbf{q}^*(t) = \textbf{w}^*(t)\odot\textbf{y}(t)$
    - Build a regression model between portfolio return and features: $\textbf{q}^*(t)=b_0(t)\cdot \textbf{1} + b_1(t) \cdot f^1(t) +...+b_K(t)\cdot f^K(t)+\epsilon(t)$
    - We define feature weights $\textbf{b}(t)_k$ by ![image](https://hackmd.io/_uploads/SJpPTjr8a.png)
- Quantitative Comparison
    - We quantify the prediction power by calculating the linear correlations $\rho(\cdot)$ between the feature weights of a DRL agent and the reference feature weights, and similarly for machine learning methods.
    - For machine learning models, the single-step and multi-step prediction power is measured by ![image](https://hackmd.io/_uploads/HJWqPCBLT.png) where ![image](https://hackmd.io/_uploads/Hk5TvRBLp.png)
    - For DRL models ![image](https://hackmd.io/_uploads/Bk6jwArU6.png)
    - Both the single-step and the multi-step prediction power are expected to be positively correlated with the portfolio's performance.
    - The DRL agents make decisions with a long-term goal. Therefore, the multi-step prediction power of DRL agents is expected to exceed their single-step prediction power.
    - The portfolio management strategy with machine learning methods relies on single-step prediction. Therefore, the single-step prediction power of machine learning methods is expected to exceed their multi-step prediction power.
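
The quantitative comparison above can be illustrated with a small sketch. It assumes that the "linear correlation" is the Pearson correlation coefficient taken over the $K$ entries of two feature-weight vectors; the exact definitions used in the paper are in the equations above. The function name `prediction_power` and all numbers are hypothetical and purely illustrative.

```python
import numpy as np

def prediction_power(agent_weights, reference_weights):
    """Pearson correlation between two feature-weight vectors of length K."""
    a = np.asarray(agent_weights, dtype=float)
    r = np.asarray(reference_weights, dtype=float)
    return float(np.corrcoef(a, r)[0, 1])

# Hypothetical feature weights for K = 4 features (illustrative numbers only).
reference = np.array([0.8, -0.2, 0.1, 0.5])
aligned = 2.0 * reference + 0.3       # exact positive linear relation to the reference
other = np.array([0.1, 0.5, -0.8, 0.2])

print("aligned weights:", prediction_power(aligned, reference))
print("other weights:  ", prediction_power(other, reference))
```

Weights that are an exact positive linear function of the reference weights give a correlation of 1 up to rounding, and any pair of vectors gives a value in $[-1, 1]$. Repeating the computation for every sample and averaging would give a mean correlation coefficient of the kind compared in the experiments.
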
## Experiment Results
- Data
    - We use the FinRL library and the stock data of the Dow Jones 30 constituent stocks
    - ![image](https://hackmd.io/_uploads/Bkva_CS8a.png)
    - 4 technical indicators
        - MACD: Moving Average Convergence Divergence
        - RSI: Relative Strength Index
        - CCI: Commodity Channel Index
        - ADX: Average Directional Index
    - All data and features are measured at a daily time granularity.
- Performance Comparison
    ![image](https://hackmd.io/_uploads/By2RqABIa.png)
    ![image](https://hackmd.io/_uploads/rJzWs0SLp.png)
- Explanation Analysis
    - We calculate the histogram of correlation coefficients with 1770 samples for 295 trading days
    - ![image](https://hackmd.io/_uploads/ByFYi0SLa.png)
    - ![image](https://hackmd.io/_uploads/rJqO2ArIp.png)
    - In single-step prediction, the machine learning methods (DT) show greater significance in the mean correlation coefficient than the DRL agents.
    - In multi-step prediction, the DRL agents show stronger significance in the mean correlation coefficient than the machine learning methods.