Manuel Del Verme
---
title: Reinforcement Learning in Finance
tags: Quant
description: Well.
slideOptions:
  theme: moon
---

# Reinforcement Learning in Finance

---

## Let's make it clear

In reinforcement learning we tweak the agent's parameters to maximize expected returns by interacting with an environment.

- The agent influences and generates its own data.
- The reward function is not given to the agent.
- We don't know what the optimal action was (unlike SL, where the right class is known during training).
- No i.i.d. data from the environment.

---

## Outline

- Portfolio Allocation
- Model-free vs Model-based
- Objective function
- The runner-up: Contextual Bandit
- The :elephant: in the :house: : Generalization
- Market Making

---

## General Online Algorithm

1. The agent computes a portfolio vector $b_t,\quad \sum|b_t|=1$
2. The market reveals prices $x_t$
3. The agent receives a reward $R_t = r(b_t, x_t)$
4. The agent improves its strategy

---

### Model-based

These methods try to predict some parts of the environment (next states, rewards, or samples thereof). This can be done with a real simulator (AlphaGo) or a learned one (MuZero). E.g. Markowitz mean-variance optimization.

---

### Model-free

Predicting the next state can be as hard as finding the optimal action, so model-free methods predict the best decision directly.

Both families have their trade-offs; we do not yet know the limits of model-free algorithms in finance.

---

### Risk-insensitive Utility

$U(\theta)$ is a function, such as profit, wealth, or a risk-sensitive ratio such as Sharpe, giving the scalar utility of a given parameter set $\theta$.

We assume no transaction or holding costs, no explicit risk aversion, and many other simplifications.

\begin{equation}
W_T = W_0 \prod_{t=0}^T (1+R_t)
\end{equation}

* ${R_t}$: the return of the $\theta$ strategy at time $t$.
* ${W_0}$: the initial wealth.

---

## Utility and loss function

We cannot have $\prod$s in our objective function (stochastic gradient descent works on sums of losses), so we turn the product into a sum of logs.

\begin{equation}
W_T = \prod_{t=0}^T (1+R_t) = \exp\left(\sum_{t=0}^T \log(1+R_t)\right)
\end{equation}

----

The $\exp$ is still in the way, but we can remove it: the solution will be the same (the solution path will not).

\begin{equation}
\tilde{W}_T = \sum_{t=0}^T \log(1+R_t)
\end{equation}

Does it consider risk? Maybe: for returns close to $-1$ the loss diverges, since $\log(1+R_t) \to -\infty$.

----

Nonetheless, there are many other ways to encode risk-averse utilities in the loss function.

About linear loss functions, $\sum (1+R_t)$: they rely on the assumption that $\log(1+x) \approx x$, which might hold for slow-moving, high-volume markets but not for unstable ones (crypto can do 13x in 5 minutes, far from $\approx 1$).

---

## Portfolio allocation

---

## Risk management

Unlike in robust optimization, upside risk is fine, so we look at the Sterling ratio:

$$\text{SterlingRatio}(\theta) = \frac{\text{Annualized mean return}(\theta)}{\text{MDD}(\theta)}$$

An example of such methods:

> J. Moody and M. Saffell, "Learning to trade via direct reinforcement," IEEE Transactions on Neural Networks, vol. 12, no. 4, pp. 875-889, July 2001, doi: 10.1109/72.935097.

---

These approaches, usually at a lower trading frequency, rely on the no-market-impact assumption; in this case we can differentiate through the simulator! Optimal control is the right framework for such optimization problems.

---

> D. Bertsekas. Dynamic Programming and Optimal Control. Athena Scientific, 1995.

> Boyd, S., Busseti, E., Diamond, S., Kahn, R., Koh, K., Nystrup, P., & Speth, J. (2017). Multi-Period Trading via Convex Optimization. Foundations and Trends in Optimization, 3(1), 1–76. http://stanford.edu/~boyd/papers/cvx_portfolio.html

---

### The reward function for RL

Remember that in RL we have a discounted sum of rewards, $\sum R(s, a)$, yet wealth compounds, $\prod_t (1+R_t)$. Even accounting for exponential growth we introduce a minor bias: we are not sure what the right reward function is, but we know it's not the current one.

----

### Self-financing constraints

We don't have successful algorithms to explicitly handle constraints such as the self-financing one, $\sum|b_t| = 1,\ \forall t$, so we impose architectural constraints (e.g. softmax), but softmax is long-only.

----

How do we encode long-short self-financing constraints without losing the smoothness needed for deep learning? Remember that:

- The direct loss is in log space; what is well behaved in linear space is not necessarily so in log space.
- Softmax is meant for one-vs-all optimization: intuitively, allocating 100% of your portfolio to one asset is a really bad idea (but exactly what is desired in classification and discrete-action RL).

---

### Allocation via Contextual Bandits

If the transaction fees are reasonably small, then we can look at the multi-period allocation problem as "multiple" single-period maximizations:

\begin{equation}
\max_{b} \tilde{W}_T \approx \sum_{t=0}^T \max_{b_t} \tilde{W}_t
\end{equation}

We call $x_t$ the context of the bandit over $K$ assets.

---

If one chooses to parametrize the control as a differentiable function, then we can use the gradient feedback to update the allocator via SGD, as sketched below:

* $x_t \sim X_t,\quad b_t \sim \pi(x_t)$
* $b_{t+1} \gets b_t + \nabla \tilde{W}_t$

----

Note that:

* we are fully model-free
* we use the generalization power of neural networks and SGD
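----

A minimal sketch of this bandit update, assuming a linear softmax (hence long-only) policy and the per-period log-wealth reward $\log(1 + b_t \cdot r_t)$ from the loss-function slides. The context features and returns below are synthetic placeholders, not a trading signal.

```python
# Sketch: contextual-bandit allocator trained by SGD on synthetic data.
# A linear policy maps the context x_t to softmax weights b_t, and we
# ascend the per-period log-wealth reward log(1 + b_t . r_t).
import torch

n_assets, n_features = 5, 10
policy = torch.nn.Linear(n_features, n_assets)   # logits -> softmax weights
opt = torch.optim.SGD(policy.parameters(), lr=1e-2)

log_wealth = 0.0
for t in range(1000):
    x_t = torch.randn(n_features)                # bandit context (placeholder features)
    b_t = torch.softmax(policy(x_t), dim=-1)     # allocation: b_t >= 0, sum(b_t) = 1
    r_t = 0.001 * torch.randn(n_assets)          # next-period simple returns (synthetic)
    reward = torch.log1p(b_t @ r_t)              # per-period log growth, log(1 + R_t)
    opt.zero_grad()
    (-reward).backward()                         # SGD on -reward = gradient ascent on reward
    opt.step()
    log_wealth += reward.item()

print("training log-wealth:", log_wealth)
```

The softmax is exactly the architectural constraint mentioned above: it enforces $\sum b_t = 1$ and $b_t \ge 0$ for free, at the price of being long-only.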
---

## Generalization

In practice we would like to generalize to unseen transitions $x_t$. But how do we measure generalization?

- Out of distribution
- New distribution
- New assets

---

## Higher frequency

As the traded volume increases, by trading at a very high frequency, we can no longer assume no impact. The agent now modifies the shape of the order book; this is where RL is necessary.

Given a LOB dynamics simulator, exploit the dynamics of market and limit orders.

> Spooner, T., Fearnley, J., Savani, R., & Koukorinis, A. (2018). Market Making via Reinforcement Learning. Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, 434–442.

---

## Market Making via Optimal Control

In market making we care about two things:

* making a profit on the bid/ask spread
* minimizing inventory risk

---

We can write the market making problem as an OC problem, as follows (a toy numeric sketch of this objective is given after these slides):

\begin{equation}
\arg\max_{\pi} E_{\mu} \left[ \sum_{t=0}^T \Delta_{\text{pnl}_t} - \alpha \Delta_{p_t} N_t \right]
\end{equation}

* $N_t$ is the inventory at time $t$
* $\Delta_{\text{pnl}_t}$ is the PnL increment at time $t$
* $\Delta_{p_t}$ is the change in mid-price from $t \to t+1$

----

\begin{equation}
\arg\max_{\pi} E_{\mu} \left[ \sum_{t=0}^T \Delta_{\text{pnl}_t} - \alpha \Delta_{p_t} N_t \right]
\end{equation}

* The controls of the agent are $(a_{\text{bid}}, a_{\text{ask}}) \in \mathbb{R}^2$
* The state space can be defined in multiple ways but at least contains $N$
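----

To make the objective concrete, here is a toy evaluation on synthetic logs, assuming we can observe the mid-price path, the per-step spread capture, and the inventory $N_t$. The random fills and the decomposition of $\Delta_{\text{pnl}_t}$ below are placeholders, not the cited authors' setup or a LOB simulator.

```python
# Sketch: evaluate the market-making objective
#   sum_t [ d_pnl_t - alpha * d_price_t * N_t ]
# on a synthetic episode (random mid-price, fills and inventory).
import numpy as np

rng = np.random.default_rng(0)
T, alpha = 1_000, 0.5

mid = 100 + np.cumsum(0.01 * rng.standard_normal(T + 1))  # mid-price path
d_price = np.diff(mid)                                     # Δp_t, change from t to t+1
inventory = rng.integers(-5, 6, size=T)                    # N_t (placeholder fills)
spread_pnl = 0.002 * np.abs(rng.standard_normal(T))        # captured half-spreads
d_pnl = spread_pnl + inventory * d_price                   # Δpnl_t: spread + mark-to-market

objective = np.sum(d_pnl - alpha * d_price * inventory)    # risk-adjusted episode objective
print("episode objective:", objective)
```

With $\alpha > 0$ the term $\alpha \Delta_{p_t} N_t$ damps the mark-to-market part of the PnL, so the agent is rewarded mostly for the spread it captures rather than for speculating with its inventory.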
---

## Why is it a control problem?

The agent has an impact on the market and can influence the liquidity taker to change the state of its inventory.

In portfolio allocation we don't have such luxury: the risk of the portfolio is **exogenous** to the agent, as it fluctuates regardless of the agent's actions.

---

## Why Reinforcement Learning?

---

* It's unclear how to simulate market impact
* Constructing a LOB from real data is hard
* Generally speaking, the LOB is not differentiable (non-smooth)

---

### Other applications of RL

- Optimal Execution
- Option Pricing and Hedging
- Robo-advising
- Smart Order Routing
- Arbitrage

---

### Compounding vs Additive Rewards

"Compound interest is the eighth wonder of the world. He who understands it earns it… he who doesn't… pays it." - Einstein

We optimize the expected growth rate

$$\frac{1}{T}\sum_{t} \log (b_t \cdot x_t)$$

rather than the final cumulative wealth

$$\prod_{t} \sum_{i} b_{t,i} x_{t,i}.$$

---

We are off by an $\exp$. This might not be a problem when we talk about optimal solutions, but it matters when we stop the optimization early (e.g. to avoid overfitting), as is common in deep learning (a short numeric check follows).

This might not be relevant yet, but let's not end up paying for the compound.
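----

As a sanity check on the "off by an $\exp$" remark, a few lines with synthetic price relatives and a fixed equal-weight allocation (both placeholders): the time-averaged log growth rate and the final cumulative wealth are linked by $W_T = \exp(T \cdot \text{growth rate})$, so they share the same optimum even though they behave differently along the optimization path.

```python
# Sketch: average log growth rate vs. final cumulative wealth
# on synthetic price relatives x_t with a constant allocation b.
import numpy as np

rng = np.random.default_rng(1)
T, n_assets = 250, 4
x = 1 + 0.01 * rng.standard_normal((T, n_assets))  # price relatives p_{t+1} / p_t
b = np.full(n_assets, 1 / n_assets)                 # equal-weight allocation

period_growth = x @ b                                # b_t · x_t for each period
growth_rate = np.mean(np.log(period_growth))         # (1/T) Σ_t log(b_t · x_t)
final_wealth = np.prod(period_growth)                # Π_t Σ_i b_{t,i} x_{t,i}

print("growth rate:          ", growth_rate)
print("final wealth:         ", final_wealth)
print("exp(T * growth rate): ", np.exp(T * growth_rate))  # matches final wealth
```

The two objectives agree on the best allocation, but their gradients scale very differently, which is why early stopping (or discounting) can leave you optimizing a subtly different quantity.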
