# Rebuttal for ICML 4224: Information-Directed Pessimism for Offline Reinforcement Learning

Summary of changes to the paper:

* In response to reviewers G419, Rj1u, and NzhM, we clarified throughout the manuscript that our intended interpretation of sub-Gaussian was meant to be unimodal, not a specific characterization of tail probability decay rates. We agree with the reviewers that in finite spaces all distributions are sub-Gaussian, but standard pessimism operates upon a unimodality representation, whereas our approach permits a mixture family representation.
* As requested by reviewer Gfwv, we added algorithm complexity analysis to Table 3 in Appendix A.
* Other issues related to typos, narrative exposition, and relationship to prior art are addressed in the reviewer-specific rebuttals.

Please see https://www.dropbox.com/scl/fi/cdp17x8fpqe1sx99bfwp0/ICML2024__Information_Directed_Pessimism__8_pages_without_refs_-2.pdf?rlkey=jgdvruinz4c3syih3ifsot1cd&dl=0 for the revision.

## Response to Reviewer G419

### Weaknesses:

> * I find the paper to lacking in terms of presentation. Overall, I believe it's not immediate to grasp immediately what is the main result/message and its application. Several sections, esp. sec 3, can be shortened/improved in clarity to help with readability.

**Response to Weakness 1:** Our main goal is to point out that when conducting policy search from offline data, the choice of penalty matters experimentally, even if theoretically it is optimal to use LCB and its variants. In particular, LCB treats the deviation of the transition model from its expected value under the behavioral policy as a good representation of mismatch in theory. However, in practice, we find that there is a large gap between this choice of representation and one that contains fuller distributional information about the mismatch. To enable this finding, we adapt machinery from kernelized Stein points in Markov chain Monte Carlo to the offline RL setting for the first time, and establish, both in theory and in practice, that there are situations where it can generalize the regret of prior art, or otherwise match it, depending on the underlying distributional properties of the MDP transition dynamics.

> * Several times the authors mention non-Gaussianity, but I believe they mean non-sub-Gaussianity, although this does not seem to be well-defined in the paper.

**Response to Weakness 2:** We will clarify throughout the paper that we mean multimodal distributions. Sub-Gaussianity refers to the decay rate of the tails, which in tabular settings is always satisfied; however, pessimism penalties built from the associated concentration inequalities also presume a central tendency. To be more precise, concentration inequalities are well-suited to unimodal distributions and form a part of most offline RL algorithms that use pessimism. The main contribution of the paper is to show that there are situations where the transition distribution is multimodal and a pessimistic penalty based on Stein discrepancy does better theoretically (Theorem 4.3) and empirically. Throughout the paper, we have clarified that the attribute of sub-Gaussianity we focus on is unimodality, so as to avoid this unintended alternate interpretation.
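
As a minimal illustration of this point (our example here, not one taken from the paper), consider a symmetric two-point mismatch distribution:

$$
\Pr(\Delta = +c) = \Pr(\Delta = -c) = \tfrac{1}{2}, \qquad \mathbb{E}[\Delta] = 0, \qquad \Pr\big(|\Delta - \mathbb{E}[\Delta]| \ge t\big) \le 2e^{-t^2/(2c^2)}.
$$

The tail bound (Hoeffding) certifies that $\Delta$ is sub-Gaussian with parameter $c$, yet $\Pr(|\Delta - \mathbb{E}[\Delta]| < c) = 0$: all of the mass sits at the two modes and none near the mean, which is precisely the regime in which a unimodal, deviation-from-the-mean penalty misrepresents the mismatch.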
> * Related work section/discussion is missing or mixed with the introduction.

**Response to Weakness 3:** An expanded discussion of related work is in Appendix A, which is referred to at the end of the introduction, following the itemized list.

#### Other:

> The authors mention the deepsea environment in the results, but it seems more like the riverswim environment.

**Response:** Our instantiation of the DeepSea environment is a variant of Osband [46] taken from https://github.com/stratisMarkou/sample-efficient-bayesian-rl. We agree this variant is similar to RiverSwim, and will add a citation of the following reference: A. L. Strehl and M. L. Littman. An analysis of model-based interval estimation for Markov decision processes. Journal of Computer and System Sciences, 74(8):1309–1331, 2008.

## Response to Reviewer Rj1u

### Weaknesses:

> * The writing can be improved.
> * The description of discrete Stein discrepancy is not clear. I get lost following the usage of subscript $i$.

**Response to Weakness 1:** We agree that the concept of discrete Stein discrepancy is intricate, and hinges upon first understanding the discrete score function as the analogue of the gradient of the log-likelihood of a probability distribution. These concepts require a fuller explanation, but due to space constraints, this connective tissue was deferred to Appendix B.2. We will revise the explanation of discrete Stein discrepancy to better refer to these explanations. Regarding the subscript $i$, we note the state $s\in\mathcal{S}$ is represented as a vector of length $|\mathcal{S}|$; e.g., in the case of a random walk over $|\mathcal{S}|$ integers, $s$ may be represented through its one-hot encoding, a vector in $\mathbb{R}^{|\mathcal{S}|}$ with $1$ in position $i$ and $0$ elsewhere. The rest of the definitions are computable with this encoding. We will add a footnote detailing this point, as it clarifies the subsequent evaluations.
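
To make the indexing concrete, the following is a minimal sketch (our illustration only: the probabilities are hypothetical, and the cyclic forward-difference convention is one common choice from the discrete Stein literature, not necessarily the exact operator defined in Appendix B.2):

```python
import numpy as np

S = 5                                          # |S| states, e.g., a random walk over 5 integers
p = np.array([0.05, 0.10, 0.40, 0.35, 0.10])   # a hypothetical transition row P_{s,a}(.)

def one_hot(i, n=S):
    """One-hot encoding of state i: a vector in R^{|S|} with 1 in position i, 0 elsewhere."""
    e = np.zeros(n)
    e[i] = 1.0
    return e

def score(p, i):
    """Cyclic forward-difference score at state i, the discrete analogue of the
    gradient of the log-likelihood: 1 - p(i+1 mod |S|) / p(i)."""
    return 1.0 - p[(i + 1) % len(p)] / p[i]

print(one_hot(2))    # [0. 0. 1. 0. 0.] -- the subscript i indexes these entries
print(score(p, 2))   # 1 - p[3]/p[2] = 0.125
```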
> * Some statements are unclear. For example, "algorithms that achieve this lower bound by dividing by mutual information, i.e., IDS and its variants" in line 212 is an oversimplified summary of information-directed sampling. In line 272, it is unclear what "high-fidelity model estimators are tuned to have a model" means.

**Response to Weakness 2:** We agree that this description of IDS is a simplification of the methodology. We have caveated the explanation as follows:

> line 212: "algorithms that achieve this lower bound, i.e., IDS and its variants, employ intrinsic exploration incentives in the form of mutual information, which augment sampling by uncertainty estimates with respect to the optimal policy."

**Response to Weakness 3 (line 272):** We agree that this phrasing was ambiguous, and have refined it as follows: "The knowledge of the 'true' score function is feasible in certain financial applications [58], where access to a generative model is provided by a physics or market simulation engine."

> * There are some minor typos, like in line 123, "the state $\mathcal{S}$ space" should be "the state space $\mathcal{S}$."

**Response to Weakness 4:** This has been corrected.

> * The proposed method does not improve the regret bound compared to previous methods; it is even worse. While it does not require the sub-Gaussian assumption, the regret bound is only available in tabular settings, where the distribution is always sub-Gaussian due to a finite (therefore bounded) support.

**Response to Weakness 5:** While the distribution of bounded rewards is always sub-Gaussian, it can still be multimodal. Concentration bounds are well-suited to unimodal distributions, and under unimodal assumptions VI-LCB is optimal, as shown in [17]. However, as we point out, there are examples in the tabular setting (Fig. 2) that necessitate characterizing distributional distances instead of concentration bounds due to multimodality. This work seeks to address these scenarios by incorporating a penalty fashioned using Stein discrepancy. While the algorithmic elements remain close to VI-LCB, we obtain a better rate in terms of $(1-\gamma)$, coming close to the lower bound.

> * The proposed method requires an additional assumption about uniform ergodicity, while previous methods do not require such an assumption.

**Response to Weakness 6:** It is important to note that prior works, especially [15], [17], [18], suppose exponentially fast mixing of the transition model under a fixed policy to its induced occupancy measure, without making explicit that ergodicity is required for this to hold; see [18], Assumption 1. We note that ergodicity is only required for the convergence of IDP-Q, not IDP-V, whose sufficient conditions are more comparable to [15], [17], which are value-iteration-type methods.

### Questions:

> * What do $\mathbb{S}$ and $\Delta^*$ in Equation 3.5 mean without subscript?

**Response to Question 1:** $\mathbb{S}$ is notation used to define the score function of a distribution $\mathbb{P}$, which we represent equivalently here through its transition probability matrix $P_{s,a}$ due to the tabular MDP setting. $\mathbb{S}$ may be defined with respect to continuous distributions as well; see Appendix B.2, as well as [32].
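
For concreteness (our illustrative example, not taken from the paper): in the continuous case the score is the familiar gradient of the log-density,

$$
(\mathbb{S}p)(x) = \nabla_x \log p(x), \qquad \text{e.g., for } p = \mathcal{N}(\mu, \sigma^2): \quad (\mathbb{S}p)(x) = -\frac{x-\mu}{\sigma^2},
$$

and the discrete score of Appendix B.2 plays the same role with a suitably defined difference operator in place of the gradient.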
> * Why is a uniform ergodicity assumption required for the proposed method? How much does the proposed method rely on this assumption compared to previous methods like Li et al. [1]?

**Response to Question 2:** Uniform ergodicity is only required for the convergence analysis of IDP-Q, not IDP-V. This condition also appears in [18]. We will better clarify how the necessity of the assumptions maps to our distinct theorem statements in Section 4, as currently this has not been delineated in sufficient detail.

> * Can the regret bound be proved in settings with an infinite state space like linear MDPs or kernelized MDPs?

**Response to Question 3:** The KSD machinery is much more natural in continuous spaces, and hence the pessimistic penalty based upon it would be easily definable in that case. We believe linear and kernelized MDPs are meaningful extensions of this work for which our proposed machinery may be similarly able to handle multimodal mismatch, and hence are important avenues for future work.

> * When is the proposed method provably better than previous methods?
>
> [1] Li, Gen, et al. "Settling the sample complexity of model-based offline reinforcement learning." The Annals of Statistics 52.1 (2024): 233-260.

**Response to Question 4:** As seen from the extensive numerical studies, there are multiple scenarios where multimodality in the distribution mismatch degrades the return obtained by traditional concentration-based approaches. While the distributional-distance-based penalty is applicable in general, the performance improvement is conspicuous in the multimodal setting (see Fig. 3, Fig. 5). Concentration inequalities only loosely capture the discrepancy due to multimodality, and this looseness translates to poor empirical performance. Theoretically, using the same algorithmic elements and under the same concentrability assumptions, pessimism based on Stein discrepancy in Theorem 4.3 improves the order in terms of $(1-\gamma)$.

## Response to Reviewer Gfwv

### Questions

> 1) In the discussion of your method's computational cost, could you possibly include a comparative analysis within the existing literature in Table 3?

**Response to Question 1:** We agree this would help clarify how the additional computational complexity we require yields the ability to model multimodal distributions, i.e., what the cost of our improved empirical performance is. We have added this to Table 3; however, the table is large and clumsy to depict in the OpenReview format. Therefore, we have provided the updated version of Table 3 in a revision of the paper available at the anonymous link below:

https://www.dropbox.com/scl/fi/cdp17x8fpqe1sx99bfwp0/ICML2024__Information_Directed_Pessimism__8_pages_without_refs_-2.pdf?rlkey=jgdvruinz4c3syih3ifsot1cd&dl=0

> 2) Would it be possible to incorporate the standard deviation of your experimental results?

**Response to Question 2:** $\pm$ one standard deviation is presented as shaded regions in the plots.

> 3) I find myself puzzled by the presentation in Table 6, specifically regarding the labeling of the first row as "Hard." Additionally, could you elucidate the criteria or rationale behind the selection of the specific five rows in Table 2?

**Response to Question 3:** This terminology comes from Aviral Kumar, Aurick Zhou, George Tucker, and Sergey Levine. Conservative Q-learning for offline reinforcement learning. Advances in Neural Information Processing Systems, 33:1179–1191, 2020. It is meant to indicate that datasets containing more trajectories from the optimal policy and from random policies exhibit (i) a smaller concentrability coefficient (a ratio, closer to 1, of the data's occupancy measure to the optimal policy's occupancy measure); and (ii) better coverage of the state and action space (in the sense of a persistent exploration condition), which is known to be related to ensuring that the mixing time to the stationary distribution induced by the current policy is not too large. We will add the interpretation of these aspects in the supplementary material, where we describe the experiments in an expanded format.
The specific rows in Table 2 were chosen to depict the minimal settings (namely, the environment and dataset) in which the multimodal nature of our problems arises.

---

## Response to Reviewer NzhM

### Weakness:

> * The presentation of the paper could be improved.

**Response to Weakness:**

### Questions:

> * The authors mentioned that Theorem 4.4 allows the mismatch to be in a mixture family, which is strictly more general than sub-Gaussian conditions in prior art. Could the authors provide a concrete theoretical example where prior work such as VI-LCB (which is shown to be optimal in theory) performs worse than IDP-VI and IDP-Q?

**Response to Question:** Mismatch is inherently a function of the MDP transition model. While all distributions over a finite set are sub-Gaussian in theory, in practice they must also be unimodal for the concentration-bound-based penalty to encapsulate the mismatch well. As a concrete example, consider the true transition matrix to be drawn from a multimodal distribution of the type used in our PriorMDP benchmark environment. In this case, using concentration inequalities to model the mismatch when defining a penalty leads to a miscalibrated bound, as it captures the deviation from the unimodal beta distribution whose mean is the average of the individual beta distributions. This looseness translates to poor performance, as seen in Fig. 1 (left) and Fig. 5.
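
To illustrate the kind of miscalibration we mean, here is a minimal numerical sketch (our own construction, not the paper's PriorMDP specification): a discretized two-component Beta mixture whose mean sits in a low-probability valley between the modes, so a mean-plus-deviation summary covers far less mass than it would for a unimodal distribution of the same variance.

```python
import numpy as np

# Illustration only (not the paper's PriorMDP construction): a discretized
# two-component Beta mixture whose mean lies in a low-probability valley,
# so a mean-plus-deviation (concentration-style) summary is miscalibrated.
x = np.linspace(0.005, 0.995, 199)          # discretization grid on (0, 1)

def beta_pmf(a, b):
    w = x ** (a - 1) * (1 - x) ** (b - 1)   # unnormalized Beta(a, b) density on the grid
    return w / w.sum()                      # normalize to a pmf

p = 0.5 * beta_pmf(2, 20) + 0.5 * beta_pmf(20, 2)   # bimodal mixture; mean = 0.5 by symmetry
mean = (x * p).sum()
std = np.sqrt(((x - mean) ** 2 * p).sum())

# Roughly 0.45 of the mass lies within one std of the mean, versus ~0.68 for a
# unimodal (Gaussian-like) distribution with the same mean and variance.
print("mass within one std of the mean:", round(p[np.abs(x - mean) <= std].sum(), 2))

# The pmf at the mean is orders of magnitude below the pmf at the modes.
print("pmf at the mean vs. at a mode:",
      round(float(p[np.argmin(np.abs(x - mean))]), 6), round(float(p.max()), 4))
```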
