---
title: Reinforcement Learning in Finance
tags: Quant
description: Well.
slideOptions:
theme: moon
---
# Reinforcement Learning in Finance
---
## Let's make it clear:
In reinforcement learning we tweak the agent's parameters to maximize expected return by interacting with an environment.
- The agent generates its own data by influencing the environment.
- The reward function is not given to the agent.
- We don't know what the optimal action was (unlike supervised learning, where the correct class is known during training).
- The data coming from the environment is not IID.
---
## Outline
- Portfolio Allocation
- Model-free vs. model-based
- Objective function
- The runner-up: Contextual Bandits
- The :elephant: in the :house: : Generalization
- Market Making
---
## General Online Algorithm
1. The agent computes a portfolio vector $b_t$ with $\sum_i|b_{t,i}|=1$
2. The market reveals prices $x_t$
3. The agent receives the reward $R_t = r(b_t, x_t)$
4. The agent improves its strategy
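
A minimal sketch of this loop, assuming price relatives as market observations; `UniformAgent`, `run_online`, and the random data below are illustrative placeholders, not part of any referenced method:

```python
import numpy as np

class UniformAgent:
    """Trivial baseline: constant equal-weight, long-only allocation."""
    def __init__(self, k):
        self.k = k
    def act(self):
        return np.ones(self.k) / self.k            # 1. portfolio vector, sum(|b|) == 1
    def update(self, b, x, r):
        pass                                       # 4. a learning agent would update here

def run_online(agent, prices, T):
    """Generic online allocation loop (illustrative, long-only sketch)."""
    log_wealth = 0.0
    for t in range(T):
        b_t = agent.act()
        x_t = prices[t]                            # 2. market reveals price relatives
        R_t = b_t @ x_t - 1.0                      # 3. reward R_t = r(b_t, x_t)
        log_wealth += np.log(1.0 + R_t)
        agent.update(b_t, x_t, R_t)
    return np.exp(log_wealth)                      # terminal wealth W_T / W_0

rng = np.random.default_rng(0)
T, K = 250, 3
prices = 1.0 + 0.01 * rng.standard_normal((T, K))  # made-up price relatives
print(run_online(UniformAgent(K), prices, T))
```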
---
### Model-based
Model-based methods try to predict some parts of the environment (next states, rewards, or samples thereof). This can be achieved with a real simulator (AlphaGo) or a learned one (MuZero); a finance example is Markowitz mean-variance optimization.
---
### Model-free
Predicting the next state can be as hard as finding the optimal action, so model-free methods directly predict the best decision.
Both approaches have trade-offs, and we do not yet know the limits of model-free algorithms in finance.
---
### Risk-insensitive
Utility $U(\theta)$ is a function indicating the scalar utility of a given parameter set $\theta$; examples are profit, wealth, or risk-sensitive ratios such as the Sharpe ratio.
We assume no transaction or holding costs, no explicit risk aversion, and many other simplifications.
\begin{equation}
W_T=W_0 \prod_{t=0}^T (1+R_t)
\end{equation}
* $R_t$: the return of the strategy parametrized by $\theta$ at time $t$.
* $W_0$: the initial wealth.
---
## Utility and loss function
We cannot have $\prod$s in our objective function (stochastic gradient descent algorithms work on sums of losses), so we turn the product into a sum of logs.
\begin{equation}
W_T=\prod_{t=0}^T (1+R_t)=\exp(\sum_{t=0}^T(\log(1+R_t)))
\end{equation}
----
The $\exp$ is still in the way, but we can drop it since $\exp$ is monotonic: the optimal solution is the same (the optimization path is not).
\begin{equation}
\tilde{W_T}=\sum_{t=0}^T \log(1+R_t)
\end{equation}
Does it consider risk? Partially: as $R_t \to -1$ the loss goes to infinity, since $\log(1+R_t)\to-\infty$, so near-total losses are heavily penalized.
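
A quick numerical check of the identity above, using made-up returns (an illustrative sketch only):

```python
import numpy as np

# Made-up per-period returns of a strategy (illustrative only).
R = np.array([0.02, -0.01, 0.03, -0.005, 0.01])

W_T = np.prod(1.0 + R)              # compounded terminal wealth (with W_0 = 1)
log_W_T = np.sum(np.log(1.0 + R))   # additive surrogate \tilde{W}_T

# exp is monotonic, so ranking strategies by either quantity is identical.
assert np.isclose(W_T, np.exp(log_W_T))
print(W_T, np.exp(log_W_T))
```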
----
Nonetheless, there are many other methods to account for risk-averse utilities in the loss function.
About linear loss functions:
$\sum_t (1+R_t)$
They rely on the assumption that $\log(1+x)\approx x$, which might hold for slow-moving, high-volume markets but not for unstable markets (crypto can have 13x returns in 5 minutes, where the price relative is far from 1).
---
## Portfolio allocation
---
## Risk management
Unlike in robust optimization, upside risk is fine, so we look at the Sterling ratio, $$\text{SterlingRatio}(\theta) = \frac{\text{Annualized mean return}(\theta)}{\text{MDD}(\theta)}$$ where MDD is the maximum drawdown.
An example of such methods:
> J. Moody and M. Saffell, "Learning to trade via direct reinforcement," in IEEE Transactions on Neural Networks, vol. 12, no. 4, pp. 875-889, July 2001, doi: 10.1109/72.935097.
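
A minimal sketch of how a Sterling-style ratio could be computed from a realized wealth curve; the daily frequency and the exact drawdown convention are assumptions (published definitions of the Sterling ratio vary):

```python
import numpy as np

def sterling_ratio(wealth, periods_per_year=252):
    """Sterling-style ratio from a wealth curve (illustrative sketch).

    wealth: array of portfolio values W_0, ..., W_T (assumed daily here).
    """
    returns = np.diff(wealth) / wealth[:-1]
    ann_mean_return = np.mean(returns) * periods_per_year

    running_max = np.maximum.accumulate(wealth)
    drawdowns = 1.0 - wealth / running_max
    mdd = np.max(drawdowns)                        # maximum drawdown

    return ann_mean_return / max(mdd, 1e-12)

# Made-up wealth curve for illustration.
w = np.array([1.00, 1.02, 1.01, 1.05, 0.98, 1.03])
print(sterling_ratio(w))
```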
---
These approaches, usually at a lower trading frequency, rely on the no-market-impact assumption; in this case we can differentiate the simulator!
Optimal control is the right framework for such optimization problems.
---
> D. Bertsekas. Dynamic Programming and Optimal Control. Athena Scientific, 1995.
> Boyd, S., Busseti, E., Diamond, S., Kahn, R., Koh, K., Nystrup, P., & Speth, J. (2017). Multi-Period Trading via Convex Optimization. Foundations and Trends in Optimization, 3(1), 1–76. http://stanford.edu/~boyd/papers/cvx_portfolio.html
---
### The reward function for RL
Remember that in RL we optimize a discounted sum of rewards, $\sum_t R(s_t, a_t)$, yet wealth compounds, $\prod_t (1+R_t)$; even when we account for the exponential growth we introduce a minor bias. We are not sure what the right reward function is, but we know it is not the current one.
----
### Self-financing constraints
We don't have successful algorithms to explicitly handle constraints such as the self-financing one, so we have to impose architectural constraints (e.g. a softmax output), but softmax is long-only.
$\sum_i |b_{t,i}|=1, \forall t$
----
How do we encode long-short self-financing constraints without losing the smoothness needed for deep learning? (One possible encoding is sketched after the list below.)
Remember that:
- The direct loss lives in log space; what is well behaved in linear space is not necessarily well behaved in log space.
- Softmax itself is meant for one-vs-all optimization: intuitively, allocating 100% of your portfolio to a single asset is a really bad idea (though it is exactly what we want in classification and discrete-action RL).
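
One hedged possibility, not taken from the source: keep a smooth head, split sign and magnitude, and normalize in $L_1$ so that $\sum_i |b_i| = 1$ holds by construction. The functions below are an illustrative sketch only:

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def long_only_allocation(logits):
    """Softmax head: smooth and self-financing, but long-only."""
    return softmax(logits)

def long_short_allocation(scores):
    """Illustrative smooth long-short head with sum(|b|) = 1.

    tanh gives a signed weight in (-1, 1); L1 normalization enforces the
    self-financing constraint while staying differentiable
    (away from the all-zero input).
    """
    signed = np.tanh(scores)
    return signed / (np.abs(signed).sum() + 1e-12)

scores = np.array([1.0, -0.5, 0.2])
print(long_only_allocation(scores))   # all weights >= 0
print(long_short_allocation(scores))  # can go short, sum of |w| == 1
```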
---
### Allocation via Contextual Bandits
If the transaction fees are reasonably small, then we can treat the multi-period allocation problem as "multiple" single-period maximizations:
\begin{equation}
\max_{b} \tilde{W}_T \approx \sum_{t=0}^T \max_{b} \tilde{W}_t
\end{equation}
We call $x_t$ the context of the bandit for $K$ assets.
---
If one chooses to parametrize the control as a differentiable function, then we can use gradient feedback to update the allocator via SGD:
* $x_t \sim X_t, \quad b_t \sim \pi_\theta(x_t)$
* $\theta \gets \theta + \eta\, \nabla_\theta \tilde{W}_t$
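
A minimal sketch of such an update, assuming a long-only softmax allocator that is linear in the context, a single-period log-wealth objective, and made-up data; `theta`, the sampler, and the learning rate are illustrative (in practice an autodiff framework would compute the gradient):

```python
import numpy as np

rng = np.random.default_rng(0)
K, d = 3, 5                          # assets, context features (illustrative)
theta = np.zeros((d, K))
lr = 1e-2

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

for t in range(1000):
    x = rng.standard_normal(d)                  # context x_t (placeholder data)
    p = 1.0 + 0.01 * rng.standard_normal(K)     # price relatives 1 + R_t (placeholder)

    b = softmax(x @ theta)                      # long-only allocation b_t = pi_theta(x_t)
    growth = b @ p                              # single-period wealth growth
    # Gradient of log(growth) w.r.t. the softmax logits s = x @ theta:
    grad_s = b * (p / growth - 1.0)
    theta += lr * np.outer(x, grad_s)           # gradient ascent on log-wealth
```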
----
Note that:
* we are fully model-free
* we use the generalization power of neural networks and SGD
---
## Generalization
In practice we would like to generalize to unseen transitions $x_t$. But how to measure generalization?
- Out of distribution
- New distribution
- New assets
---
## Higher frequency
As the traded volume increases, e.g. by trading at a very high frequency, we can no longer assume zero market impact.
The agent now modifies the shape of the order book; this is where RL is necessary.
Given a LOB dynamics simulator, exploit the dynamics of market and limit orders:
> Spooner, T., Fearnley, J., Savani, R., & Koukorinis, A. (2018). Market Making via Reinforcement Learning. Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, 434–442.
---
## Market Making via Optimal Control
In Market Making we care about two things:
* make profit on the bid/ask spread
* minimize inventory risk
---
We can write the Market Making problem as an OC problem, as follows:
\begin{equation}
arg\max_{\pi} E_{\mu} \left[ \sum_{t=0}^T \Delta_{\text{pnl}_t} - \alpha \Delta_{p_t}N_t\right]
\end{equation}
* $N_t$ is the inventory at time $t$
* $\Delta_{\text{pnl}_t}$ is the change in PnL at time $t$
* $\Delta_{p_t}$ is the change in mid-price from $t\to t+1$
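
A minimal sketch of how this objective could be evaluated along one simulated trajectory; the arrays, the choice of $\alpha$, and the convention of using the inventory held over each step are all illustrative assumptions:

```python
import numpy as np

def mm_objective(pnl, mid_price, inventory, alpha=0.1):
    """Sum of per-step rewards Delta_pnl_t - alpha * Delta_p_t * N_t (sketch).

    pnl, mid_price, inventory: arrays of length T+1 from a LOB simulator.
    """
    d_pnl = np.diff(pnl)            # Delta_pnl_t
    d_mid = np.diff(mid_price)      # Delta_p_t (change in mid-price)
    N = inventory[:-1]              # inventory held over each step (assumption)
    return np.sum(d_pnl - alpha * d_mid * N)

# Made-up trajectory for illustration.
pnl = np.array([0.0, 0.5, 0.4, 1.1])
mid = np.array([100.0, 100.2, 99.9, 100.1])
inv = np.array([0, 2, -1, 0])
print(mm_objective(pnl, mid, inv))
```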
----
\begin{equation}
arg\max_{\pi} E_{\mu} \left[ \sum_{t=0}^T \Delta_{\text{pnl}_t} - \alpha \Delta_{p_t}N_t\right]
\end{equation}
* The controls of the agent are $(a_{\text{bid}}, a_{\text{ask}}) \in \mathbb{R}^2$
* The state space can be defined in multiple ways, but it at least contains $N$
---
## Why is it a control problem?
The agent has an impact on the market: it can influence liquidity takers and thereby change the state of its own inventory.
In portfolio allocation, we don't have such luxury. The risk of the portfolio is **exogenous** to the agent, as it fluctuates regardless of the agent's actions.
---
## Why Reinforcement Learning?
---
* It's unclear how to simulate market impact
* Constructing a LOB from real data is hard
* Generally speaking, the LOB is not differentiable (non-smooth)
---
### Other applications of RL
- Optimal Execution
- Option Pricing and Hedging
- Robo-advising
- Smart Order Routing
- Arbitrage
---
### Compounding vs Additive Rewards
“Compound interest is the eighth wonder of the world. He who understands it earns it… he who doesn’t… pays it.” - Einstein
We optimize the expected growth rate $$\frac{1}{T}\sum_{t=0}^T \log (b_t \cdot x_t)$$ rather than the final cumulative wealth $$\prod_{t=0}^T \sum_i b_{t,i}x_{t,i}.$$
---
We are off by an $\exp$. This might not be a problem when we talk about optimal solutions, but it matters when we stop the optimization early (e.g. to avoid overfitting), as is common in deep learning.
This might not be relevant yet, but let's not end up paying for the compounding.