# [PreDeToR] NeurIPS 2024 #

## Reviewer ukA6 ##

We thank the reviewer for their feedback and are happy to hear that they found the work well-motivated and intuitive, and our evaluation extensive. We think the concerns raised can be addressed with minor revisions, and we respond in detail below.

Cons:

- Regarding real-world data, our main focus was to understand whether a transformer could automatically discover and exploit structure in a wide variety of bandit structures. For this reason, we focused on synthetic experiments covering a wide variety of bandit problem types that have been previously studied in the literature. We do agree that an experiment based on real-world data would add to the paper, and so we added the following experiment (a minimal data-generation sketch is also included below).

  *Experiment on the MovieLens dataset:* In this experiment, we first construct a rank-4 approximation of the dataset over 1000 users and 200 movies such that each user prefers one of movies $7, 13, 16$, or $20$. We use a set of $10$ movies that have been rated by all the users as our set of "actions". The users are the "tasks". The features of the movies form the action feature set $\mathcal{X}$ shared by all the tasks (users). We learn $\theta_{m, \star}$ from the actual data in the dataset for each task $m$. The goal is to identify the most preferred movie (which should be one of $7, 13, 16$, or $20$) in each task as quickly as possible. We then generate the rewards from the learned $\theta_{m, \star}$ for each task $m$ at round $t$ using a linear structure such that $r_{m,t}= \theta_{m, \star}^\top \mathbf{x}_{m,t} + \eta_{m,t}$. We first train on $M=1000$ users/tasks, and then test on $20$ held-out users/tasks. We show the cumulative regret with an increasing number of rounds in Table 1.

  | Algorithm | Regret (n=25) | Regret (n=50) | Regret (n=75) | Regret (n=100) |
  | ----------------- | ----------- | ------------ | ------------ | ------------ |
  | PreDeToR (Ours) | 5.43 | 9.9 | 13.93 | 17.75 |
  | PreDeToR-$\tau$ (Ours) | 5.4 | 9.1 | 12.9 | 15.7 |
  | DPT (Greedy) | 7.29 | 10.91 | 14.88 | 19.37 |
  | LinUCB | 5.08 | 7.39 | 9.42 | 11.98 |
  | Thomp | 9.56 | 15.30 | 22.5 | 29.92 |
  | AD | 9.52 | 15.88 | 21.6 | 29.12 |

  Table 1: Cumulative regret of the algorithms at different horizons in the linear MTRL setting on the MovieLens dataset, with $n = 100$, $M=1000$, and $A=10$.

  From Table 1 we observe that PreDeToR($-\tau$) has lower cumulative regret than DPT-greedy and AD. Note that PreDeToR($-\tau$) matches the regret of LinUCB, as this is a linear setting. Finally, PreDeToR-$\tau$ performs slightly better than PreDeToR as it conducts more exploration. The data is collected using Thompson sampling.

- We think there is a slight misunderstanding leading to the concern about a linearity assumption. Please note that we do not make any assumption about an underlying linear structure in the way that an algorithm like LinUCB does. Rather, our key finding is that PreDeToR can learn to discover such structure automatically when pre-trained to predict rewards on a family of tasks with common structure. Importantly, the common structure need not be linear, as demonstrated by our experiments in the non-linear setting. Our theoretical results are relevant to PreDeToR because PreDeToR is trained to minimize the MSE of reward prediction.
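To make the MovieLens data-generation protocol above concrete, here is a minimal sketch. It assumes a placeholder ratings matrix, an illustrative movie subset, user factors as a simple stand-in for the learned $\theta_{m,\star}$, and Gaussian noise; it is not our exact preprocessing pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder ratings matrix (users x movies); in the actual experiment this
# comes from MovieLens (1000 users, 200 movies).
R = rng.uniform(0.0, 5.0, size=(1000, 200))

# Rank-4 approximation via truncated SVD.
U, s, Vt = np.linalg.svd(R, full_matrices=False)
k = 4
R4 = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]            # rank-4 ratings matrix

# Action features: rank-4 movie factors for 10 movies rated by every user
# (these indices are illustrative, not the ones used in the paper).
movie_idx = np.array([3, 7, 13, 16, 20, 41, 55, 88, 120, 177])
X = (np.diag(s[:k]) @ Vt[:k, movie_idx]).T            # shape (A=10, d=4)

# Per-task (per-user) parameters theta_{m,*}: here taken as the rank-4 user
# factors, one simple way to "learn" theta from the data.
Theta = U[:, :k]                                      # shape (M=1000, d=4)

def reward(m, a, noise_sd=0.1):
    """Noisy linear reward r_{m,t} = theta_{m,*}^T x_a + eta_{m,t}."""
    return Theta[m] @ X[a] + noise_sd * rng.standard_normal()

print(reward(m=0, a=2))   # one sampled reward for user 0 pulling arm 2
```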
- We would like to clarify one misunderstanding of the paper by the reviewer. Note that PreDeToR does not have access to the features of the actions. It only observes the action indices and rewards of the weak demonstrator across the source tasks and learns the reward of each action. Note that the actions are shared by all the tasks. Furthermore, PreDeToR minimizes the MSE loss during training, and at inference time it selects an action based on the estimated rewards it calculated from the data of the source tasks (transfer learning). We want to point out that we do not assume that the reward structure is linear for the theoretical analysis of our method in the transfer learning setting. **We do not use any linear relationship between the arm contexts and the hidden parameter $\theta_{\star}$.** In line 109 we clearly state that $r_{m, t}=f\left(\mathbf{x}\left(I_{m, t}\right), \boldsymbol{\theta}_{m, *}\right)+\eta_{m, t}$, where the function $f(\cdot)$ can be linear, non-linear, bilinear, latent, etc.

- We apologize for putting the related works in the appendix. We chose to do so since the most important related works were already discussed in the introduction, and this enabled us to include more empirical analysis in the main paper. We agree that an expanded related-works section is a good addition to the main paper. We will add it, together with the above MovieLens experiment, using the additional page available in the camera-ready version.

## Reviewer djXe ##

We thank the reviewer for their positive feedback on our work, particularly the extensive empirical evaluation, use of theory, and positive results. We answer the questions below:

1. We agree with the reviewer that PreDeToR cannot be readily extended to the RL setting, but we respectfully submit that bandit problems themselves are still an important class of ML problems with many real-world applications. Our aim was to develop a method that could improve over demonstrations without knowledge of the optimal action (as DPT requires), and be effective in a diverse variety of bandit problems with unknown structure. Finding a way to improve upon DPT with sub-optimal MDP demonstrations is a very interesting direction for future work.

2. The reviewer asked about SOTA in multi-task bandit learning. There are prior works that minimize regret under a linearity assumption. A key part of our contribution is that we do not require the linearity assumption but instead discover the underlying structure from the pre-training data. We illustrate this point with our non-linear setting experiments, where PreDeToR outperforms methods derived under the linear assumption. To the best of our knowledge, there are no baselines that have been derived without the linearity assumption.

3. We agree that there are previous works that address regret minimization in the linear MTRL setting. These are [Yang et al. (2020)](https://arxiv.org/pdf/2010.06531), [Cella et al. (2023)](https://arxiv.org/pdf/2202.10066), and [Hu et al. (2021)](https://proceedings.mlr.press/v139/hu21a/hu21a.pdf). However, these works use a strong linear structure assumption to devise an algorithm, whereas PreDeToR learns the reward structure without knowledge of the underlying explicit structure. We compare against the MLin-Greedy algorithm of [Yang et al. (2022)](https://arxiv.org/pdf/2202.10066) in the non-linear setting. Our results in Table 2 show that PreDeToR consistently outperforms it.
| Algorithm | Regret (n=50) | Regret (n=100) | Regret (n=150) | Regret (n=200) | Regret (n=250) |
| ----------------- | ----------- | ------------ | ------------ | ------------ | ------------- |
| PreDeToR (Ours) | 3.7 | 5.92 | 7.97 | 9.98 | 11.91 |
| PreDeToR-$\tau$ (Ours) | 3.5 | 5.1 | 6.9 | 9.7 | 10.94 |
| DPT (Greedy) | 6.86 | 11.49 | 15.69 | 19.81 | 23.94 |
| Thomp | 9.56 | 16.33 | 21.56 | 26.09 | 29.93 |
| LinUCB | 4.95 | 7.35 | 9.39 | 10.99 | 16.32 |
| MLin-Greedy | 3.95 | 6.5 | 8.4 | 10.1 | 14.2 |

Table 2: Cumulative regret of the algorithms at different horizons in the non-linear MTRL setting.

From Table 2 we observe that PreDeToR($-\tau$) has lower cumulative regret than DPT-greedy. Note that PreDeToR($-\tau$) has lower regret than LinUCB, which fails to perform well in this non-linear setting due to its algorithmic design. Finally, PreDeToR-$\tau$ performs slightly better than PreDeToR in both settings as it conducts more exploration. We also show that MLin-Greedy performs badly in this non-linear setting as it is designed for the linear feedback setting. This shows that previous linear MTRL algorithms do not generalize as well as PreDeToR to the non-linear and other multi-task structured bandit settings.

## Reviewer 5jdk ##

We thank the reviewer for their feedback. We answer below:

1. One concrete use case is making recommendations, where we want to make the best recommendation to the current user. This can be posed as a bandit problem where we make recommendations and the user provides evaluative feedback on how good they are. In real applications, we don't have just one user but might have had many past interactions with other users. While users have different evaluations of different items, it is reasonable to assume that there is some shared structure in their item evaluations. Our aim is to automatically discover and exploit this structure for more efficient bandit learning. In the following, we evaluate our approach against other methods for movie rating with a real-world dataset.

   *Experiment on the MovieLens dataset:* In this experiment, we first obtain a rank-4 approximation of the dataset over 1000 users and 200 movies such that all users prefer one of movies $7, 13, 16$, or $20$. The movies are the actions, and we choose 10 movies that have been rated by all the users. The features of the movies form the action set $\mathcal{X}$ shared by all the tasks (users). We learn $\theta_{m, \star}$ from the actual data in the dataset for each task $m$. The goal is to identify the most preferred movie (which should be one of $7, 13, 16$, or $20$) in each task as quickly as possible. We then generate the rewards from the learned $\theta_{m, \star}$ for each task $m$ at round $t$ using a linear structure such that $r_{m,t}= \theta_{m, \star}^\top \mathbf{x}_{m,t} + \eta_{m,t}$. We first train on $M=1000$ tasks, and then test on $20$ test environments. We show the cumulative regret with an increasing number of rounds in Table 1.
   | Algorithm | Regret (n=25) | Regret (n=50) | Regret (n=75) | Regret (n=100) |
   | ----------------- | ----------- | ------------ | ------------ | ------------ |
   | PreDeToR (Ours) | 5.43 | 9.9 | 13.93 | 17.75 |
   | PreDeToR-$\tau$ (Ours) | 5.4 | 9.1 | 12.9 | 15.7 |
   | DPT (Greedy) | 7.29 | 10.91 | 14.88 | 19.37 |
   | LinUCB | 5.08 | 7.39 | 9.42 | 11.98 |
   | Thomp | 9.56 | 15.30 | 22.5 | 29.92 |
   | AD | 9.52 | 15.88 | 21.6 | 29.12 |

   Table 1: Cumulative regret of the algorithms at different horizons in the linear MTRL setting on the MovieLens dataset, with $n = 100$, $M=1000$, and $A=10$.

2. We thank the reviewer for bringing up the adversarial setting. We implemented the adversarial setting from Laskin et al. and also found that PreDeToR outperforms AD in this setting. Note that Algorithm Distillation (AD) cannot outperform the demonstrator, whereas if sufficient data is collected for the new optimal arm when the task switches, then PreDeToR can outperform AD.

   <!-- It's just the distribution of tasks where certain arms are optimal changes. I think PreDeToR would still work but it would have a higher prior on sub-optimal arms but would then switch to other arms with more pulls. -->

Questions:

1. This is an excellent suggestion, and we will add a discussion of practical scenarios in our revision. We will also add the movie recommendation experiment to include an experiment based on real-world data.

2. Algorithm Distillation is indeed more robust in the adversarial setting described by the reviewer. We implement the adversarial setting of Laskin et al. (2019) such that certain arms are optimal 95% of the time during training but only 5% of the time during evaluation. We take half of the arms and only allow them to be optimal 5% of the time during training, and then make them optimal 95% of the time during testing. We show the result in Table 2 below (a small data-generation sketch for this setup is given at the end of this response).

   <!-- If the weak demonstrator (Thompson sampling) algorithm collects sufficient samples of the optimal action between two changepoints then PreDeToR will outperform AD. Such a changepoint adversarial setting, where the rewards of all actions change at changepoints, has been previously studied in [Besson et al. (2022)](https://www.jmlr.org/papers/volume23/20-1384/20-1384.pdf), [Maillard et al. (2019)](https://proceedings.mlr.press/v98/maillard19a/maillard19a.pdf), [Mukherjee et al. (2020)](https://arxiv.org/pdf/1905.13159). If the rewards change too fast, and the changepoint assumption is violated, then it is not clear if PreDeToR will outperform AD. We leave this line of work to future works. -->

   <!-- Pred [3.13, 3.9, 4.64, 5.44] Pred-tau [3.78, 4.89, 5.8, 6.76] DPT [7.46, 10.19, 12.83, 15.47] AD [6.12, 7.59, 8.64, 9.59] Thompson [8.07, 10.74, 11.95, 12.68] LinUCB [2.42, 2.84, 3.12, 3.3] -->

   | Algorithm | Regret (n=13) | Regret (n=25) | Regret (n=38) | Regret (n=50) |
   | ----------------- | ----------- | ------------ | ------------ | ------------ |
   | PreDeToR (Ours) | 3.13 | 3.9 | 4.64 | 5.44 |
   | PreDeToR-$\tau$ (Ours) | 3.78 | 4.89 | 5.8 | 6.76 |
   | DPT (Greedy) | 7.46 | 10.19 | 12.83 | 15.47 |
   | LinUCB | 2.42 | 2.84 | 3.12 | 3.3 |
   | Thomp | 8.07 | 10.74 | 11.95 | 12.68 |
   | AD | 6.12 | 7.59 | 8.64 | 9.59 |

   Table 2: Cumulative regret of the algorithms at different horizons in the linear MTRL setting under adversarial rewards.

   From Table 2 we observe that PreDeToR has lower cumulative regret than DPT-greedy. Note that PreDeToR has higher regret than LinUCB, as this is a linear feedback setting. Finally, note that PreDeToR outperforms AD because, even though different arms are optimal during training and testing, the features of the arms do not change. PreDeToR exploits the latent representation of these arms and the reward correlation to predict the mean rewards.
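To make the train/test flip described in Question 2 concrete, here is a minimal sketch of how such tasks could be sampled. The arm count, feature dimension, and unit-norm construction (which guarantees the designated arm is optimal) are illustrative assumptions, not necessarily the exact construction used in our experiments; the 5%/95% flip follows the description above.

```python
import numpy as np

rng = np.random.default_rng(1)
A, d = 10, 4                 # illustrative arm count and feature dimension
half = A // 2                # the half of the arms whose optimality probability is flipped

def sample_task(p_first_half):
    """Sample one linear-bandit task in which, with probability p_first_half,
    the optimal arm lies in the first half of the arms."""
    X = rng.standard_normal((A, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)   # unit-norm arm features
    if rng.random() < p_first_half:
        opt = rng.integers(0, half)
    else:
        opt = rng.integers(half, A)
    theta = X[opt]            # with unit-norm arms, arm `opt` maximizes x^T theta
    return X, theta, opt

# Pre-training tasks: the flipped half of the arms is optimal only 5% of the time ...
train_tasks = [sample_task(p_first_half=0.05) for _ in range(1000)]
# ... while at test time the same arms are optimal 95% of the time.
test_tasks = [sample_task(p_first_half=0.95) for _ in range(20)]
```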
## Reviewer uY6b ##

We thank the reviewer for their positive feedback.

Weaknesses:

1. We show experiments with an increasing number of attention heads in Appendix A.9.

Questions:

1. We thank the reviewer for bringing this up. We compare against LinUCB and Thompson sampling for two reasons. First, we show that even when PreDeToR is trained on data collected by Thomp, which is a weak demonstrator and has no access to the feature information, our method is able to outperform the weak demonstrator. Second, we show LinUCB's performance to demonstrate that PreDeToR exploits the latent linear structure (for the linear bandit experiment) and matches its performance. In Figure 5 we try to understand the behavior of PreDeToR and its ability to exploit the reward correlation across tasks under a linear multivariate Gaussian model, thereby comparing against BayesPred. We also compare against DPT and AD, which leverage information across the tasks and are outperformed by PreDeToR.

2. Yes, the transformers for the linear and non-linear bandits are trained separately.

3. That's a great point. In the multi-task setting, we generally assume $M_{\text {pre }} \gg|\mathcal{A}|$ (cite works). When the number of tasks is small, multi-task learning algorithms fail to exploit the latent structure of the problem. This is also evident from our experiments in Appendix A.10. The experiments show that as the number of tasks $M$ increases, PreDeToR has lower cumulative regret.

---

# ICML 2024 #

## Reviewer Qjdp

We thank the reviewer for their positive feedback on our work. We provide answers to the main points below.

Weaknesses:

1. We will take into consideration your suggestion of not using a different color style for algorithm names.

2. Yes, we have tried an explorative version of PreDeToR as described by the reviewer. We show the result in Table 1, where PreDeToR (Explore) is the variant the reviewer asked about. Note that the greedy algorithm PreDeToR (Ours) outperforms PreDeToR (Explore), as it has a good estimate of the reward of the arms given a task and doesn't need to explore. However, there could be tasks where such exploration might yield better results, and we leave this to future work. We will discuss this in the camera-ready version. We also highlight the main difference between DPT and our approach. DPT requires the optimal action for each task $m$. Then, during training of the transformer $\mathbf{T}_{\mathbf{w}}$, DPT minimizes the cross-entropy loss $\mathcal{L}_t^{\mathrm{DPT}}=\operatorname{cross-entropy}\left(\widehat{a}_{m, t, *}, a_{m, *}\right)$, where $\widehat{a}_{m, t, *}$ is the optimal action predicted at round $t$ by $\mathbf{T}_{\mathbf{w}}\left(\cdot \mid \mathcal{H}_m^t\right)$ (see eq. (1)). In contrast, PreDeToR does not require the optimal action for each task and just uses the reward of the next action to minimize the loss (see eq. (3)); a schematic comparison of the two objectives is sketched after Table 1.

   | Algorithm | Regret (H=50) | Regret (H=100) | Regret (H=150) | Regret (H=200) |
   | ----------------- | ----------- | ------------ | ------------ | ------------ |
   | PreDeToR (Ours) | 5.43 | 9.9 | 13.93 | 17.75 |
   | PreDeToR (Explore) | 7.25 | 11.83 | 15.63 | 19.13 |
   | DPT (Greedy) | 7.29 | 12.54 | 17.48 | 22.37 |
   | Thomp | 9.56 | 16.30 | 21.59 | 25.92 |
   | LinUCB | 5.08 | 7.59 | 9.54 | 11.9 |

   Table 1: Cumulative regret of the algorithms at different horizons in the linear setting, denoted by $H = \{50, 100, 150, 200\}$. For each horizon we first train on $M=100000$ tasks, and then test in $200$ test environments.
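As a schematic illustration of the difference between the two objectives discussed above: the stand-in linear heads, tensor shapes, and random data below are assumptions for exposition only, not our actual implementation.

```python
import torch
import torch.nn.functional as F

# Schematic setup (all shapes and values are illustrative).
B, A, D = 32, 10, 64            # batch of histories, number of arms, embedding size
history = torch.randn(B, D)     # stand-in for the transformer's encoding of H_m^t

# Stand-in "heads" on top of the shared transformer encoding.
dpt_head = torch.nn.Linear(D, A)        # DPT: logits over arms
predetor_head = torch.nn.Linear(D, A)   # PreDeToR: one predicted reward per arm

# DPT objective: cross-entropy against the optimal action a_{m,*} of each task
# (requires optimal-action labels at pre-training time).
optimal_action = torch.randint(0, A, (B,))
loss_dpt = F.cross_entropy(dpt_head(history), optimal_action)

# PreDeToR objective: MSE between the predicted reward of the action actually
# taken at round t and the observed reward -- no optimal-action labels needed.
next_action = torch.randint(0, A, (B,))
next_reward = torch.randn(B)
pred = predetor_head(history).gather(1, next_action[:, None]).squeeze(1)
loss_predetor = F.mse_loss(pred, next_reward)
```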
3. Thanks for the suggestion. We will clarify which baselines are strong learners and which are weak learners in the camera-ready version. We will state that LinUCB uses oracle knowledge of the structure of the problem and therefore performs optimally in the linear bandit setting.

4. We clarify the new-action setting here (a construction sketch is given after this list).
   - First, fix the total number of actions across all the environments (tasks); say this is $A=10$. Fix the number of invariant arms shared across the tasks; say this is $|\mathcal{A}^{inv}| = 8$. These arms are shared by all tasks.
   - Each task $m\in [M]$ then has two new arms drawn randomly from $\mathcal{N}(0, \mathbf{I}_d)$. Note that these arms are different for each task. This also implies that each of the test environments will have $2$ completely new arms unseen during training time ($\mathcal{D}_{\text {test }} \neq \mathcal{D}_{\text {pre }}$).
   - Then create a head on the transformer model that has 10 outputs, one reward for each action.
   - The learning proceeds as before: we use a weak learner to collect data without observing the features, and the goal is that PreDeToR learns to exploit the underlying latent structure and reward correlation between the different tasks even when $\mathcal{D}_{\text {test }} \neq \mathcal{D}_{\text {pre }}$.

   The main goal of this section is to demonstrate that the transformer is robust to unseen data at test time and subsequently learns to adapt to it. We see that as long as the number of invariant arms is substantially higher than the number of new arms, the transformer is still able to learn a near-optimal policy. We also do not permute the actions during training time.

5. Reservation: As discussed in the Conclusion/Future Work, we indeed plan to extend this line of research to MDPs. There are several open questions. As you correctly stated, we would then need a value-function estimate for each state. Also, exploration in MDPs needs to be factored in, and it is not clear whether being greedy with the value function is sufficient to obtain a near-optimal policy in MDPs. We leave this to future works.
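As referenced in point 4, here is a minimal sketch of the action-set construction. The feature dimension and task counts are illustrative assumptions; the arm counts ($A=10$ total, $8$ invariant) follow the description above.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 4                                    # illustrative feature dimension
A_total, A_inv = 10, 8                   # total arms per task, shared (invariant) arms

# Invariant arms: drawn once and shared by every task.
invariant_arms = rng.standard_normal((A_inv, d))

def make_task_actions():
    """Each task gets the 8 shared arms plus 2 task-specific arms ~ N(0, I_d),
    so every test environment contains arms never seen during pre-training."""
    new_arms = rng.standard_normal((A_total - A_inv, d))
    return np.vstack([invariant_arms, new_arms])      # shape (A_total, d)

pretrain_action_sets = [make_task_actions() for _ in range(1000)]   # D_pre
test_action_sets = [make_task_actions() for _ in range(200)]        # D_test != D_pre
```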
## Reviewer puCK

We thank the reviewer for their helpful comments. We address the concerns below:

Weaknesses:

1. We focused on the short-horizon setting for two reasons. First (motivation), this setting shows up in a lot of real-world applications, such as recommender systems with LLMs ([Liu et al. (2023)](https://arxiv.org/pdf/2309.17382.pdf)) and latent multi-armed bandits (MABs) (see [Kwon et al. (2022)](https://proceedings.neurips.cc/paper_files/paper/2022/file/95a6fcdc0c8458baa9c6e14736a644f8-Paper-Conference.pdf), [Kwon et al. (2021)](https://proceedings.neurips.cc/paper_files/paper/2021/file/cd755a6c6b699f3262bcc2aa46ab507e-Paper.pdf)). In all of these settings, the user interacts with the task only for a short horizon, as the incentive/utility for a long interaction is very low. Second, as shown by [Kwon et al. (2022)](https://proceedings.neurips.cc/paper_files/paper/2022/file/95a6fcdc0c8458baa9c6e14736a644f8-Paper-Conference.pdf) in the latent MAB setting, short-horizon learning is a more challenging setting, as less information is available for each task. Therefore, for multi-task learning, any good algorithm has to leverage the information across tasks to perform well on any individual new task shown at test time. Similar settings are studied from a theoretical perspective in multi-task representation learning in [Yang et al. (2021)](https://arxiv.org/pdf/2010.06531.pdf) and [Cella et al. (2022)](https://arxiv.org/pdf/2205.15100.pdf). We show the scalability of PreDeToR to larger horizons in the linear setting in Appendix A.6. We further show the performance of PreDeToR in the non-linear setting in Question 1.

2. Predicting the reward (in contrast to providing the optimal arm for each task, as in the DPT paper [Lee et al. (2023)](https://proceedings.neurips.cc/paper_files/paper/2023/file/8644b61a9bc87bf7844750a015feb600-Paper-Conference.pdf)) is a natural choice. Note that all the tasks share the same action set $\mathcal{X}$. This naturally imparts a shared correlation between the tasks, even though the reward distribution for each task changes due to a different $\theta_{m,*}$. However, by observing the reward pattern across many such tasks, the transformer can learn the shared reward correlation and perform well on a new task (with the same arm set but a different $\theta_{m,*}$) shown at inference time. This happens in two steps: first, after observing a few in-context examples, the transformer is able to identify the task (or a closely related task); second, it uses the reward prediction model to predict the expected reward of the optimal arm of the identified task and selects it.

Questions:

1. We show the performance of PreDeToR in the non-linear setting in Table 2. In this setting we set $f\left(\mathbf{x}, \boldsymbol{\theta}_*\right)=1 /\left(1+0.5 \cdot \exp \left(2 \cdot \exp \left(-\mathbf{x}^{\top} \boldsymbol{\theta}_*\right)\right)\right)$, and this is discussed in detail in Appendix A.2 (a small sketch of this reward function is given at the end of this response). Note that the DPT paper [Lee et al. (2023)](https://proceedings.neurips.cc/paper_files/paper/2023/file/8644b61a9bc87bf7844750a015feb600-Paper-Conference.pdf) considers a horizon of $H=200$ for the linear bandit setting. We show an even larger horizon, up to $H=250$, to illustrate that the main goal is to learn the correlation across the tasks, which our approach is able to do. From the results, we observe that PreDeToR has lower cumulative regret than LinUCB, as this is a non-linear setting.

   | Algorithm | Regret (H=50) | Regret (H=100) | Regret (H=150) | Regret (H=200) | Regret (H=250) |
   | ----------------- | ----------- | ------------ | ------------ | ------------ | ------------- |
   | PreDeToR (Ours) | 3.7 | 5.92 | 7.97 | 9.98 | 11.91 |
   | DPT (Greedy) | 6.86 | 11.49 | 15.69 | 19.81 | 23.94 |
   | Thomp | 9.56 | 16.33 | 21.56 | 26.09 | 29.93 |
   | LinUCB | 4.95 | 7.35 | 9.39 | 10.99 | 12.32 |

   Table 2: Cumulative regret of the algorithms at different horizons in the non-linear setting, denoted by $H = \{50, 100, 150, 200, 250\}$. For each horizon we first train on $M=150000$ tasks, and then test in $200$ test environments.

2. Our model complexity and training convergence are similar to those of the DPT algorithm. Note that DPT also has a linear head on top of the transformer that predicts the optimal action (instead of the reward). We will state this in our camera-ready version. We discuss the motivation for reward prediction in Weakness 2.

3. In our Figure 1, we provide a test-arm distribution and prediction result. In the linear setting experiment, $31.2\%$ of test environments have a difference in expected rewards between the optimal arm and the second-best arm of less than $0.05$, and $52\%$ of test environments have a difference of less than $1.0$. In all these cases, we see from Figure 1 that PreDeToR is able to predict the mean reward of the optimal arm accurately, which leads to a near-optimal policy.
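As referenced in Question 1, here is a small sketch of the non-linear reward map used to generate rewards in this setting; the dimensions and noise scale are illustrative assumptions.

```python
import numpy as np

def f_nonlinear(x, theta):
    """Non-linear mean reward from Appendix A.2:
    f(x, theta) = 1 / (1 + 0.5 * exp(2 * exp(-x^T theta)))."""
    return 1.0 / (1.0 + 0.5 * np.exp(2.0 * np.exp(-x @ theta)))

rng = np.random.default_rng(3)
d, A = 4, 10                         # illustrative dimensions
X = rng.standard_normal((A, d))      # shared action features
theta_m = rng.standard_normal(d)     # hidden parameter of one task m

# Noisy reward of arm a at round t: r_{m,t} = f(x_a, theta_{m,*}) + eta_{m,t}.
a = 2
r = f_nonlinear(X[a], theta_m) + 0.1 * rng.standard_normal()
print(r)
```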
## Reviewer Pc2o

Weaknesses:

1. [Ask this]
2. See Question 3.

Questions:

1. We show the performance of PreDeToR in the long-horizon linear setting in Appendix A.6. We show further results for the non-linear setting in Table 2, extending to horizon $H=250$. This is more than the horizon studied in the DPT paper. Following their work, we focus on how to leverage data across tasks to perform better at inference time on a new test task.

2. [Ask this] [QX: I am not quite sure why this reviewer finds it confusing. Some thoughts: We can potentially provide intuition on our algorithmic design; we can also consider moving Section 6 before Section 4?]

3. We present a ~~theoretical~~ understanding of how PreDeToR behaves in Section 6. In this work, we mainly focus on the empirical study of pretraining for decision transformers. In Section 6, we aim to understand the behavior of PreDeToR through a linear multivariate Gaussian model. We hypothesize that PreDeToR is exploiting the reward correlation (covariance) matrix from the training data and acting greedily on it. To test this hypothesis, we consider the greedy BayesPred algorithm (sketched below), which first estimates the correlation matrix from the pretraining data, then uses the LMMSE estimator in Lemma 1 to calculate the posterior mean vector, and then acts greedily on it by selecting the most rewarding action. The hypothesis that BayesPred is a lower bound to PreDeToR is supported by Figure 3. Unfortunately, so far we are not able to provide a regret analysis as in the DPT paper, since their proof technique crucially relies on observing the optimal action during training. We leave the theoretical regret analysis of PreDeToR to future works.
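For concreteness, here is a minimal sketch of one decision step of the greedy BayesPred baseline described in answer 3. The prior mean and covariance of the arm means are assumed to be already estimated from the pre-training data, and the noise variance is an illustrative placeholder; the exact estimator is the one given in Lemma 1 of the paper.

```python
import numpy as np

def bayespred_action(mu, Sigma, pulled, rewards, noise_var=0.01):
    """One greedy BayesPred step (sketch).

    mu, Sigma : prior mean vector and covariance of the A arm means, estimated
                from the pre-training data (passed in here as given inputs).
    pulled    : arm indices pulled so far in the current task.
    rewards   : the corresponding observed rewards.
    Returns the greedy action under the LMMSE posterior mean."""
    A = len(mu)
    t = len(pulled)
    if t == 0:
        return int(np.argmax(mu))
    H = np.zeros((t, A))
    H[np.arange(t), pulled] = 1.0                      # selection matrix of pulled arms
    S = H @ Sigma @ H.T + noise_var * np.eye(t)        # covariance of the observations
    gain = Sigma @ H.T @ np.linalg.solve(S, np.asarray(rewards) - H @ mu)
    posterior_mean = mu + gain                         # LMMSE estimate of all arm means
    return int(np.argmax(posterior_mean))              # act greedily on it
```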
## Reviewer senv

We thank the reviewer for their feedback. We address the concerns below:

Cons:

1. We restate the main difference of PreDeToR from AD and DPT. In AD, the learner aims to predict the next action of the demonstrator and is trained to minimize the cross-entropy loss $\mathcal{L}_t^{\mathrm{AD}}=\operatorname{cross-entropy}\left(\widehat{I}_{m, t}, I_{m, t}\right)$, where $\widehat{I}_{m, t}$ is the next action predicted by the model and $I_{m, t}$ is the true action taken by the demonstrator. Therefore, AD incurs a cumulative regret similar to that of the demonstrator. This observation is illustrated in Table 3. As observed by the reviewer, the performance of DPT (original, with optimal-action information) is better than PreDeToR, as it has access to the optimal actions in each task. We also show these results in Table 3.

   | Algorithm | Regret (H=50) | Regret (H=100) | Regret (H=150) | Regret (H=200) |
   | ----------------- | ----------- | ------------ | ------------ | ------------ |
   | PreDeToR (Ours) | 5.43 | 9.9 | 13.93 | 17.75 |
   | DPT (Original) | 5.29 | 7.91 | 9.88 | 12.37 |
   | LinUCB | 5.08 | 7.59 | 9.54 | 11.9 |
   | Thomp | 9.56 | 16.30 | 21.59 | 25.92 |
   | AD | 9.42 | 15.88 | 20.6 | 25.12 |

   Table 3: Cumulative regret of the algorithms at different horizons in the linear setting, denoted by $H = \{50, 100, 150, 200\}$. For each horizon we first train on $M=100000$ tasks, and then test in $200$ test environments.

2. We mention in our future work that we intend to extend this line of research to MDPs. There are several open questions. In the MDP setting we need a value-function estimate for each state. Also, exploration in MDPs needs to be factored in, and it is not clear whether being greedy with the value function is sufficient to obtain a near-optimal policy in MDPs. We leave this to future works.

3. Thank you for your comment; excellent suggestion. See Con 1. We report the performance of DPT (original) and AD in Table 3. To determine the optimal action for each task for DPT (greedy), we follow the steps in the DPT paper. We will also include the comparison with AD and DPT (original) in the other experiments as well.

Questions:

1. See Con 3.
2. See Con 2.
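For reference, the regret numbers in the tables throughout this rebuttal report cumulative regret at different horizons. A minimal sketch of how such a number could be computed for one test task under the standard definition, with a placeholder `select_action` policy interface and an assumed noise scale, is:

```python
import numpy as np

def cumulative_regret(select_action, mean_reward, horizon, rng):
    """Cumulative regret over one test task: the sum over rounds of the gap
    between the best arm's expected reward and the chosen arm's expected reward.

    select_action(history) -> arm index   (placeholder policy interface)
    mean_reward            : length-A array of expected rewards for this task."""
    best = np.max(mean_reward)
    history, regret = [], 0.0
    for _ in range(horizon):
        a = select_action(history)
        regret += best - mean_reward[a]
        r = mean_reward[a] + 0.1 * rng.standard_normal()   # noisy bandit feedback
        history.append((a, r))
    return regret

# A table entry then aggregates cumulative_regret over the set of test tasks.
```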
