# Explainable Deep Reinforcement Learning for Portfolio Management: An Empirical Approach
link: https://arxiv.org/abs/2111.03995
ACM 2021
## Introduction
- The explanation of a portfolio management strategy is important to investment banks, asset management companies and hedge funds.
- It is challenging to explain a DRL-based portfolio management strategy due to the black-box nature of deep neural networks.
- Existing methods such as saliency maps are used for CV and NLP, but have not been applied to financial applications yet.
- Some researchers explain DRL-based portfolio management strategies using an attention model. However, this does not explain the decision-making process of a DRL agent in a proper financial context.
- Contribution
- Empirical approach to understand the strategies of DRL: use the coefficients of a linear model in hindsight as the reference feature weights.
- For a DRL strategy, we use integrated gradients to define the feature weights.
- We quantify the prediction power by calculating the linear correlations between the feature weights of a DRL agent and the reference feature weights.
- We consider both the single-step case and multiple-step case.
## Integrated gradients
- It integrates the gradient of the output with respect to the input features along a straight path from a baseline to the input. For an input $x \in \mathbb{R}^n$, the $i$-th entry of the integrated gradient is defined as
- $IG(x)_i = (x_i - x'_i) \int_{0}^{1} \frac{\partial F(x' + z\,(x - x'))}{\partial x_i}\, dz$, where $F$ denotes a DRL model and $x'$ is a perturbed version of $x$, e.g. with all entries replaced by zeros. It explains a model's predictions in terms of its input features.
- reference: https://medium.com/ai-academy-taiwan/%E5%8F%AF%E8%A7%A3%E9%87%8B-ai-xai-%E7%B3%BB%E5%88%97-02-%E5%9F%BA%E6%96%BC%E6%A2%AF%E5%BA%A6%E7%9A%84%E6%96%B9%E6%B3%95-gradient-based-b639932c1620
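
The path integral in the definition of integrated gradients is usually approximated by a finite sum. The following is a minimal sketch under stated assumptions: the function names, the toy model, and the number of steps are illustrative and do not come from the paper, and the gradient is supplied analytically rather than by a deep-learning framework.

```python
def integrated_gradients(grad_f, x, baseline=None, steps=200):
    """Approximate IG(x)_i = (x_i - x'_i) * int_0^1 dF(x' + z (x - x')) / dx_i dz
    with a midpoint Riemann sum. `grad_f(point)` returns the gradient of F as a list."""
    n = len(x)
    if baseline is None:
        baseline = [0.0] * n  # all-zeros baseline x' (an assumption of this sketch)
    avg_grad = [0.0] * n
    for s in range(steps):
        z = (s + 0.5) / steps  # midpoint of the s-th sub-interval of [0, 1]
        point = [b + z * (xi - b) for xi, b in zip(x, baseline)]
        g = grad_f(point)
        for i in range(n):
            avg_grad[i] += g[i] / steps
    # Scale the path-averaged gradient by the distance from the baseline.
    return [(xi - b) * gi for xi, b, gi in zip(x, baseline, avg_grad)]


# Hypothetical toy model: F(x) = x0^2 + 3 * x1, so grad F = [2 * x0, 3].
def toy_f(x):
    return x[0] ** 2 + 3.0 * x[1]


def toy_grad(x):
    return [2.0 * x[0], 3.0]


x = [2.0, 1.0]
ig = integrated_gradients(toy_grad, x)
```

For this toy model the attributions sum to $F(x) - F(x')$, which is the completeness property that makes integrated gradients convenient as feature weights.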
## Portfolio Management Task
- Notations:
- $N$: number of risky assets
- $T$: number of time slots
- $\textbf{p}(t)\in\mathbb{R}^N$: closing prices of all assets at time $t$
- $\textbf{p}(0)$ is the opening price
- $\textbf{y}(t)\in\mathbb{R}^N$: price relative vector, $\textbf{y}(t)_i = \textbf{p}(t)_i / \textbf{p}(t-1)_i$
- $\textbf{w}(t)\in\mathbb{R}^N$: portfolio weights, updated at the beginning of time slot $t$
- $v(t) \in \mathbb{R}$: portfolio value at the beginning of time slot $t+1$
- $v(0)$ is the initial capital
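- $\rho(t)$: rate of portfolio return, $\rho(t) = \frac{v(t)}{v(t-1)} - 1 = \textbf{w}(t)^\intercal \textbf{y}(t) - 1$
- $r(t)$: the logarithmic rate of portfolio return, $r(t) = \ln\frac{v(t)}{v(t-1)} = \ln\left(\textbf{w}(t)^\intercal \textbf{y}(t)\right)$
- The risk of a portfolio is defined as the variance of the rate of portfolio return
- $\text{Risk}(t) = \text{Var}(\rho(t)) = \textbf{w}(t)^\intercal\, \Sigma(t)\, \textbf{w}(t)$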
- $\Sigma(t) = Cov(\textbf{y}(t)) \in \mathbb{R}^{N\times N}$
- If there is no transaction cost, the final portfolio value is $v(T) = v(0)\prod_{t=1}^{T}\textbf{w}(t)^\intercal \textbf{y}(t)$
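- The portfolio management task aims to find a portfolio weight vector $\textbf{w}^*(t)$ such that $\textbf{w}^*(t) = \arg\max_{\textbf{w}(t)} \; \textbf{w}(t)^\intercal \textbf{y}(t) - \lambda\, \textbf{w}(t)^\intercal \Sigma(t)\, \textbf{w}(t)$, subject to $\sum_{i=1}^{N} \textbf{w}(t)_i = 1$ and $\textbf{w}(t)_i \in [0, 1]$ (8), where $\lambda > 0$ is the risk aversion parameter.
- Since $\textbf{y}(t)$ and $\Sigma(t)$ are revealed only at the end of time slot $t$, we estimate them by CAPM (Capital Asset Pricing Model). (8) can be reformulated as $\textbf{w}^*(t) = \arg\max_{\textbf{w}(t)} \; \textbf{w}(t)^\intercal \hat{\textbf{y}}(t) - \lambda\, \textbf{w}(t)^\intercal \hat{\Sigma}(t)\, \textbf{w}(t)$, subject to the same constraints (9).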
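
The bookkeeping behind the notation above can be sketched in a few lines. This is an illustrative example only: the prices and weights are made up, the helper names are hypothetical, and it assumes the usual definition of the price relative vector as the ratio of consecutive closing prices, with no transaction cost.

```python
import math


def price_relatives(prices):
    """prices[t][i] is the closing price of asset i at time t, for t = 0..T.
    Returns y(1), ..., y(T), assuming y(t)_i = p(t)_i / p(t-1)_i."""
    return [
        [cur / prev for cur, prev in zip(prices[t], prices[t - 1])]
        for t in range(1, len(prices))
    ]


def portfolio_values(v0, weights, relatives):
    """Without transaction cost, v(t) = v(t-1) * w(t)^T y(t)."""
    values = [v0]
    for w, y in zip(weights, relatives):
        growth = sum(wi * yi for wi, yi in zip(w, y))  # w(t)^T y(t)
        values.append(values[-1] * growth)
    return values


# Hypothetical data: N = 2 assets, T = 2 time slots, equal weights.
prices = [[100.0, 50.0], [110.0, 50.0], [110.0, 55.0]]
weights = [[0.5, 0.5], [0.5, 0.5]]

y = price_relatives(prices)
v = portfolio_values(1000.0, weights, y)
# Logarithmic rate of return per slot: r(t) = ln(v(t) / v(t-1)).
log_returns = [math.log(b / a) for a, b in zip(v, v[1:])]
```

Because the logarithmic returns add up over time, their sum equals $\ln(v(T)/v(0))$, which is why a log-return objective aligns with maximizing the final portfolio value.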
## Deep Reinforcement Learning for Portfolio Management

- State space: the feature vectors $f^1(t), \dots, f^K(t)$ and the covariance matrix
- Action space: portfolio weight vector $\textbf{w}(t) \in \mathbb{R}^N$ that can satisfy (9)
- Reward function: 
- Feature Weights Using Integrated Gradients
- We use integrated gradients to measure the feature weights. ($\gamma$ is the discount factor of RL)
- 
- 
## Explanation Method
- Reference feature weights
- We use a linear model in hindsight as a reference model
- The weights are optimized by actual stock returns and the actual sample covariance matrix
- It is the upper bound performance that any linear predictive model would have been able to achieve.
- Get the portfolio value relative vector by $\textbf{q}(t) = \textbf{w}(t) \odot \textbf{y}(t) \in \mathbb{R}^N$
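- Build the linear model $\textbf{q}(t) = \beta_0(t)\cdot \textbf{1} + \beta_1(t) \cdot f^1(t) + ... + \beta_K(t)\cdot f^K(t) + \epsilon(t)$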
- 
characterizes the total contribution of $k$-th feature to the portfolio value at time $t$
- Feature weight for DRL trading agent
- At the beginning of a trading slot $t$, it takes the feature vectors and covariance matrix as input and outputs an action vector $\textbf{w}(t)$
- Get the portfolio value relative vector by $\textbf{q}(t) = \textbf{w}(t) \odot \textbf{y}(t) \in \mathbb{R}^N$
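- We can represent it as a linear regression model $\textbf{q}(t) = c_0(t)\cdot \textbf{1} + c_1(t) \cdot f^1(t) + ... + c_K(t)\cdot f^K(t) + \epsilon(t)$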
- By (15), we can define the feature weights for the $k$-th feature as 
(equality holds since $\textbf{w}^{\intercal}(t+l)\cdot\textbf{y}(t+l) = \textbf{1}^{\intercal}\textbf{q}(t+l)$ and $\frac{\partial\textbf{q}(t+l)}{\partial f^k(t+l)} = c_k(t+l)$)
- Assume the time dependency of features on stocks follows the power law, $\frac{f^k(t+l)_i}{f^k(t)_i} = l^{-\alpha}$ for $l \ge 1$; then the feature weights are 
- Note that $\textbf{M}^{\pi}(t)_k$ has a form similar to (18)
- DRL agents find portfolio weights with a long-term goal.
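
The regression step shared by the reference model and the DRL agent, regressing the per-stock vector $\textbf{q}(t)$ on the $K$ feature vectors, can be sketched with ordinary least squares. The function name, the synthetic data, and the choice of `numpy.linalg.lstsq` are assumptions made for illustration; the notes do not state which solver the authors use.

```python
import numpy as np


def fit_feature_coefficients(q, features):
    """Least-squares fit of q = c_0 * 1 + sum_k c_k * f^k + eps.

    q:        shape (N,), per-stock portfolio value relatives w(t) * y(t).
    features: shape (N, K), column k holds feature f^k across the N stocks.
    Returns (c_0, c) where c has shape (K,).
    """
    q = np.asarray(q, dtype=float)
    features = np.asarray(features, dtype=float)
    # Prepend a column of ones so the first coefficient is the intercept c_0.
    design = np.column_stack([np.ones(len(q)), features])
    coef, *_ = np.linalg.lstsq(design, q, rcond=None)
    return coef[0], coef[1:]


# Synthetic data (hypothetical): N = 30 stocks, K = 4 features.
rng = np.random.default_rng(0)
features = rng.normal(size=(30, 4))
true_c = np.array([0.5, -0.2, 0.1, 0.3])
q = 0.03 + features @ true_c  # noiseless, so the fit should recover true_c

c0, c = fit_feature_coefficients(q, features)
```

With one regression per trading day, the fitted coefficients form a time series of feature weights that can then be compared across models.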
- Conventional Machine Learning Methods with Forward-Pass
- Predict stock returns with machine learning methods using features: $\hat{\textbf{y}}(t) = g(f^1(t), ..., f^K(t))$ where $g$ is the machine learning model
- These methods rely on single-step prediction
- Find optimal portfolio weights under predicted stock returns: $\textbf{q}^*(t) = \textbf{w}^*(t)\odot\textbf{y}(t)$
- Build a regression model between portfolio return and features: $\textbf{q}^*(t)=b_0(t)\cdot \textbf{1} + b_1(t) \cdot f^1(t) +...+b_K(t)\cdot f^K(t)+\epsilon(t)$
- We define feature weights $\textbf{b}(t)_k$ by 
- Quantitative Comparison
- We quantify the prediction power by calculating the linear correlations $\rho(\cdot)$ between the feature weights of a DRL agent and the reference feature weights, and similarly for machine learning methods.
- For machine learning models, the single-step and multi-step prediction power is measured by

where 
- for DRL models

- Both the single-step prediction and multi-step prediction power are expected to be positively correlated to the portfolio’s performance.
- The DRL agents make decisions with a long-term goal. Therefore, the multi-step prediction power of DRL agents is expected to outperform their single-step prediction power.
- The portfolio management strategy with machine learning methods relies on single-step prediction power. Therefore, the single-step prediction power of machine learning methods is expected to outperform their multi-step prediction power.
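
The linear correlation used in the comparison can be illustrated with Pearson's coefficient on two weight vectors for a single time step. The numbers below are made up for the example and the function name is hypothetical; how the weights are aggregated over multiple steps is not shown here.

```python
import math


def pearson(a, b):
    """Pearson linear correlation coefficient between two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


# Hypothetical feature weights for K = 4 features at one time step.
reference_weights = [0.40, -0.10, 0.25, 0.05]
agent_weights = [0.35, -0.05, 0.30, 0.00]

rho = pearson(reference_weights, agent_weights)
```

A value of `rho` close to 1 means the model ranks and scales the features much like the hindsight reference does, which is what the prediction power is meant to capture.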
## Experiment Results
- Data
- We use the FinRL library and the stock data of Dow Jones 30 constituent stocks
- 4 technical indicators
- MACD: Moving Average Convergence Divergence
- RSI: Relative Strength Index
- CCI: Commodity Channel Index
- ADX: Average Directional Index
- All data and features are measured in a daily time granularity.
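
To make the feature inputs concrete, here is a minimal sketch of one of the listed indicators, MACD, using the common textbook definition (12-day EMA minus 26-day EMA, with a 9-day EMA as the signal line). This is an assumption for illustration: it is not taken from FinRL, whose implementation and parameters may differ, and the price series is synthetic.

```python
def ema(values, span):
    """Exponential moving average with smoothing factor 2 / (span + 1),
    seeded with the first observation."""
    alpha = 2.0 / (span + 1)
    out = [values[0]]
    for v in values[1:]:
        out.append(alpha * v + (1 - alpha) * out[-1])
    return out


def macd(close, fast=12, slow=26, signal=9):
    """Return (macd_line, signal_line) for a list of daily closing prices."""
    fast_ema = ema(close, fast)
    slow_ema = ema(close, slow)
    macd_line = [f - s for f, s in zip(fast_ema, slow_ema)]
    signal_line = ema(macd_line, signal)
    return macd_line, signal_line


# Hypothetical daily closes: a steadily rising price series.
close = [100.0 + i for i in range(60)]
macd_line, signal_line = macd(close)
```

In an uptrend the fast average sits above the slow one, so the MACD line is positive; a flat series gives a MACD of zero.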
- Performance Comparison


- Explanation Analysis
- We calculate the histogram of correlation coefficients with 1770 samples for 295 trading days
- In single-step prediction, the machine learning methods (DT) show greater significance in the mean correlation coefficient than DRL agents.
- In multi-step prediction, the DRL agents show stronger significance in the mean correlation coefficient than machine learning methods.