We thank Reviewer Rdii for the detailed comments and insightful feedback. We are encouraged that the reviewer finds that "the question that the paper set out to study was interesting". We address Reviewer Rdii's concerns and questions below:
---
> Q1. The paper is poorly structured and is packed with too many details which makes it hard to follow. Figure 1 in the main text is referenced only in appendix F.
> Experimental results are presented in the main paper without any details of what the experimental setup is in the main paper (section 3.2). This use of the appendix in general as a 'cheat code' to get around the page limit is highly disappointing and makes for a poor reading experience.
We first apologize for the shortcomings in our presentation. As a first-of-its-kind work, our paper intended to provide results (both empirical and theoretical) that are as extensive as possible. As the area is interdisciplinary (at the intersection of LLMs, online learning, game theory, statistics, etc.), we had to use some pages to introduce the background/basics, so that readers from different backgrounds would be on the same page. As a result, in the submitted version we had to trade off the presentation of some "setup" details against that of "results", since we wanted to present as many exciting results as possible. We apologize again for any uncomfortable reading experience this has caused.
Since we will have one more page in the camera-ready version, we will make sure to add back these details regarding the setup. In case the reviewer would like to read a more detailed and reader-friendly version, we kindly refer to the anonymized version at https://anonymous.4open.science/r/no-regret-llm-8442/arxiv_llm_noregret-3.pdf. Again, we will make sure to address the presentation issues above with the extra page we will have.
---
> Q2. The experiments are done with GPT-4 only.
Note that the submission already included experimental results with GPT-4 Turbo for the counter-examples in Figure 3. Moreover, we have additionally evaluated both GPT-4 Turbo and GPT-3.5 in Figures 1, 2, 4, and 14 of https://anonymous.4open.science/r/no-regret-llm-8442/arxiv_llm_noregret-3.pdf.
Our experiments show that even GPT-3.5 can reliably display no-regret behaviors. There are increasingly many LLMs these days, and it is generally believed that most of the recent ones, e.g., Claude 3 and Mixtral, are more powerful than GPT-3.5. Hence, it is reasonable to believe that our empirical findings are generalizable. In particular, we would like to emphasize that the main goal/contribution of our paper is *not* to test whether a *particular LLM* exhibits no-regret behavior (as there are increasingly many to keep track of!), but to provide a lens/metric (together with the associated validation framework, new training loss, etc.) for studying LLM-for-decision-making more rigorously, which is to our knowledge novel in the literature and can readily be applied to other LLMs.
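For concreteness, below is a minimal sketch (in Python/NumPy) of the external-regret metric our framework is built around; the uniform-random loss sequence and the random agent are illustrative placeholders, not our experimental protocol:

```python
import numpy as np

def cumulative_regret(losses, actions):
    """External regret of a sequence of action choices.

    losses:  (T, d) array, losses[t, a] = loss of action a at round t.
    actions: (T,) integer array of the actions actually played.
    Regret_T = sum_t losses[t, actions[t]] - min_a sum_t losses[t, a].
    """
    T = losses.shape[0]
    incurred = losses[np.arange(T), actions].sum()
    best_fixed_action = losses.sum(axis=0).min()
    return incurred - best_fixed_action

# Illustrative usage: random losses and a uniformly random agent
# (which we would expect to incur regret growing linearly in T).
rng = np.random.default_rng(0)
losses = rng.uniform(size=(100, 3))
actions = rng.integers(0, 3, size=100)
print(cumulative_regret(losses, actions))
```

An LLM agent is evaluated through the same interface: the actions it chooses over the interaction rounds are recorded and scored with exactly this quantity.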
Meanwhile, we have been evaluating some other LLMs since the start of the rebuttal period, and will include them in our final version. Thanks again for the great suggestion.
---
> Q3. In section 4.1, an assertion is made that regret of LLMs is similar to that of humans. This assertion is not supported through any references nor through any original experiments (this would require a human study in my opinion). This 'explanation' also overlooks the fact that internet also contains non-human generated data; and the trends used by authors are quite popular ones and many datasets on internet contain these trends. The fact that LLM does well on these is NOT surprising.
Firstly, note that our assertion is *not* precisely that the LLM's regret should be similar to humans' regret. Instead, it is that the regret of the *data generator* whose data is used for **training the LLM** should be similar to the trained LLM's regret. The epsilon-decision error has already been validated by earlier work, e.g., Zhang et al., 2023c; Lin et al., 2023, as also mentioned in our paper. This was also exactly the reason why we wrote the pre-training distribution as $\mathbb{P}_{data}$, instead of $\mathbb{P}_{human}$.
Moreover, we never emphasized or relied on this for any conclusions later in the paper. Note that we state it merely as an "observation" (not a "theorem/lemma"), in order to echo the common intuition nowadays that "LLM agents' behaviors mimic those of the human beings who generate the data", and we *did* have references for such claims in our paper; see (Park et al., 2022; Argyle et al., 2023; Horton, 2023) at the beginning of Sec. 4.1. Our more rigorous analysis and results can be found in Sec. 4.2.
Secondly, we understand but respectfully disagree with the reviewer that "*the trends used by authors are quite popular ones and many datasets on internet contain these trends. The fact that LLM does well on these is NOT surprising*". This is because, first, our evaluation does include *randomly generated* loss sequences and games, which are *not* "popular ones" that can be found online. Second, we have presented evidence on *how* LLMs solve the online learning problem: at a high level, they first estimate the mean/cumulative rewards of each action, and then carefully introduce "randomness" into their decisions (see the empirical evidence in Appendix B.4 of https://anonymous.4open.science/r/no-regret-llm-8442/arxiv_llm_noregret-3.pdf), which is known to be key to achieving no-regret behavior. This behavioral pattern is observed consistently across all problems, no matter whether the losses exhibit a predictable trend or not. In other words, even if the LLMs may have *memorized* some trends from their training data, the way they solve online learning is NOT by recalling those trends or how humans solve problems with those particular trends.
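To make the pattern above concrete, below is a minimal sketch of "aggregate the observed losses of each action, then randomize around the leader". This is our own illustration of the observed behavioral pattern (not a claim about GPT-4's internal computation); with a learning rate $\eta_t \propto \sqrt{\log d / t}$, it coincides with the exponential-weights (Hedge) update, which is provably no-regret:

```python
import numpy as np

def softmax(x):
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

def choose_action(loss_history, eta, rng):
    """Step (i): aggregate the observed losses of each action.
    Step (ii): inject randomness via a softmax over the (negated) aggregates,
    so better-performing actions are more likely, but not chosen deterministically."""
    t, d = loss_history.shape
    if t == 0:
        return int(rng.integers(d))
    cumulative_loss = loss_history.sum(axis=0)
    probs = softmax(-eta * cumulative_loss)
    return int(rng.choice(d, p=probs))

# Example: after 10 rounds of observed losses over d = 3 actions.
rng = np.random.default_rng(0)
history = rng.uniform(size=(10, 3))
print(choose_action(history, eta=np.sqrt(np.log(3) / 10), rng=rng))
```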
---
> Q4. The practical value of the new training loss proposed by the authors is not very clear to me, and is not demonstrated in their experiments (on toy setups).
We have compared our training loss against GPT-4 in Figure 11, showing its ability to outperform black-box, off-the-shelf LLMs in solving online learning problems. Therefore, in light of our empirical findings that LLMs cannot reliably solve even those relatively simple examples with strong adversaries (see Sec. 3.4), we believe our approach can be of high practical value in making LLMs ready for deployment in complex, dynamic, and adversarial environments. Note that we have used consistent setups from the very beginning, for the positive, the negative, and the improved results (the latter due to the new training loss).
Moreover, note that we also provided *ablation studies* in Figures 12-15, showing the scalability and robustness of our new training loss.
---
> Q5. The paper should have an explicit limitations section.
We thank the reviewer for the kind reminder and will definitely include a limitations section in the final version.
---
We greatly appreciate Reviewer Rdii's valuable feedback and constructive suggestions. We are more than happy to answer any further questions.
Paper2528 Authors
---
> Q1. If the page limit is a big issue in the presentation, authors should consider submitting to a journal that allows longer length submissions (e.g. TMLR). In its current state, the paper is very hard to read, and in my opinion, should not be accepted.
We appreciate the reviewer's suggestion to consider submitting to journals with more generous page limits, such as TMLR. We acknowledge the challenge of balancing detailed exposition with readability within the constrained page limit of prestigious ML conferences like ICML and NeurIPS. However, we would also like to point out that it is indeed common for significant contributions in ML to require extensive introductions, thorough methodology descriptions, and comprehensive results sections to substantiate their claims fully. Such depth often extends beyond the standard 8- or 9-page limit for the main content, highlighting the importance of supplementary materials. We believe the 8 or 9 pages of the main paper are often used to convey only the main and most exciting messages.
That being said, we do agree that clarity and accessibility are paramount, and we recognize that our paper's current format may pose readability challenges. To address this, we will endeavor to streamline our presentation and enhance the organization of the appendix, ensuring key information is more accessible and easier for readers to refer to.
Moreover, we believe that the ultimate measure of a paper's merit, especially within the context (and page limits) of top-tier ML conferences, should be its *intellectual contributions*, rather than how much of the material fits within the main pages. We hope our work meets this criterion by offering substantive advancements in the field.
---
> Q2. I think doing experiments with different scale open-weight models (e.g. from llama family), different from GPTs, would be insightful here.
We thank the reviewer for the suggestion again, but we would like to point out that our experiments with GPT-3.5, GPT-4, and GPT-4 Turbo already cover models of different scales. We also kindly remind the reviewer that the paper [1] the reviewer brought to our attention only used GPT-3.5 and GPT-4. Meanwhile, we are also actively conducting experiments on a more diverse set of models. Should we obtain new results, we will share them with the reviewer as soon as possible.
Again, as we emphasized in the earlier response, the main goal of the work is to *introduce this rigorous framework of "regret" into measuring the decision-making capability of LLM agents*, rather than *keeping up with* the fast-growing zoo of LLMs. We believe the principles, algorithms, theoretical analysis (and insights), and the new training loss we developed, as the main contributions, are valuable for understanding other LLMs as well.
---
> Q3.1. Memorization of training data does not need to be exact for LLMs -- but can be 'task' based. See https://arxiv.org/abs/2309.13638.
- First, we thank the reviewer for bringing up this insightful paper and will add a detailed discussion of it in our revision. However, just as the reviewer and paper [1] suggest, we have exactly discussed *how the online learning tasks in the pre-training data affect the LLM's performance in online learning* in **Sec. 4.1**, where we noted that LLMs could exhibit regret similar to that of the online learning trajectories in the *pre-training data*, since they may have memorized the patterns of how such tasks are solved there. In other words, our arguments there are consistent with the paper the reviewer brought up.
- More importantly, we would like to point out that online learning is in general a very challenging problem, and **simply mimicking the behavior patterns of how Internet text solves online learning problems is unlikely to make an LLM no-regret, for various reasons such as the bounded rationality of humans and noise in the data collection process**. In addition, even if those diverse behavior patterns for online learning problems in the pre-training data include a certain one that *can achieve no-regret*, it is unclear and indeed surprising (to us) *why and how the LLMs happen to memorize exactly the one that achieves no-regret properties* in our evaluations. **In contrast, in our explanation in Sec. 4.2, we do not assume that the pre-training data already contain the *smart and principled* no-regret behavior pattern for the LLM to mimic.**
> Q3.2. I am still unconvinced that the results in these papers are robust and reflective of online decision making by LLMs in more complex scenarios.
- Firstly, we would like to point out that the online learning problem we consider in our paper is the most canonical one that the online learning community has been, and still is, actively studying [2], and it spans a large set of empirical applications. It is not an easy problem when one really wants to achieve no-regret. Therefore, our results are already interesting in the sense that they show that LLMs can solve these challenging problems. As a first attempt in this direction, we had to start with something classical and fundamental.
- Secondly, there are indeed extensions of our standard setup that people have been studying, e.g., those with continuous action spaces, high-dimensional contexts, etc. **However, the algorithmic principles (to achieve no-regret) for solving those more complex problems are the same as for the standard ones**, namely regularization or randomization (see the illustrative sketch below). What we show is that the LLM already adopts such algorithmic principles without any explicit instruction to do so. Therefore, to enable LLMs to work in those extended settings, we believe it is more of an engineering effort, and the empirical insights drawn from the standard settings in our paper remain of great merit.
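As an illustration of the "randomization" principle mentioned above, below is a minimal, textbook-style Follow-the-Perturbed-Leader (FTPL) sketch on an alternating, adversarial-looking loss sequence. The exponential perturbations and the two-action sequence are our own illustrative assumptions, not code from our paper:

```python
import numpy as np

def ftpl_action(cumulative_loss, rng, scale):
    """Follow-the-Perturbed-Leader: perturb the cumulative losses with random
    noise and play the (perturbed) leader. With perturbation scale ~ sqrt(t),
    FTPL is known to achieve O(sqrt(T)) regret in this experts setting."""
    noise = rng.exponential(scale, size=cumulative_loss.shape)
    return int(np.argmin(cumulative_loss - noise))

rng = np.random.default_rng(1)
T, d = 500, 2
losses = np.zeros((T, d))
losses[::2, 0], losses[1::2, 1] = 1.0, 1.0   # losses alternate between the two actions

cumulative, incurred, regret_curve = np.zeros(d), 0.0, []
for t in range(T):
    a = ftpl_action(cumulative, rng, scale=np.sqrt(t + 1))
    incurred += losses[t, a]
    cumulative += losses[t]
    regret_curve.append(incurred - cumulative.min())

print(regret_curve[-1] / T)   # average regret; shrinks toward 0 as T grows
```

For contrast, a deterministic follow-the-leader rule on the same sequence incurs linear regret, which is exactly why the injected randomness (or, alternatively, regularization) matters.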
[1] McCoy, R. Thomas, et al. "Embers of autoregression: Understanding large language models through the problem they are trained to solve." arXiv preprint arXiv:2309.13638 (2023).
[2] Hazan, Elad. "Introduction to online convex optimization." Foundations and Trends® in Optimization 2.3-4 (2016): 157-325.
---
Dear Reviewer Rdii,
We are aware of the additional comment you left under Reviewer tmJS's thread, and we thank you for it. We would like to make further clarifications as follows.
> Q1. My main complaint about this paper is that it is not doing a good enough job of establishing the fact that LLMs (and not just GPTs) show no-regret behavior. It is entirely possible that someone in trying to replicate this paper might discover that only GPTs show this behavior, or that only the models that have been fine-tuned using RLHF (like the GPTs are) show this phenomenon. Where will this paper stand?
- **Firstly, we have *never* aimed to claim that *all LLMs must be no-regret*, which is exactly why we carefully chose our title to be "Do LLM agents have regret?"** Rather, our contributions lie in the **framework** for understanding LLMs' behavior in online learning and multi-agent/game-theoretic problems, possible explanations for the behaviors we observed, and new training paradigms, **which go far beyond a simple binary conclusion on whether all possible LLMs are no-regret.** For example, **the fact that the paper https://arxiv.org/abs/2309.13638 brought up by the reviewer only evaluates GPT-3.5 and GPT-4 does not diminish its contributions to understanding how LLMs can potentially solve tasks by memorization,** the very point the reviewer used in the comment. Also, we honestly reported the cases where GPT-4 can exhibit "regrettable" behaviors in Sec. 4, and how to *further address* them in Sec. 5. In other words, we aimed to provide a **systematic study** of this topic, instead of simply drawing a binary conclusion (which we **never did**).
- **Secondly, that being said, we still thank the reviewer for the suggestion of instantiating our framework with more LLMs.** Therefore, we **have additionally experimented** with Llama-2-70b and Mixtral-8x7b, representing two LLMs trained with (potentially) different corpus sources and model architectures. We refer to the updated results (for the original Figures 1 and 16) for online learning with full-information feedback here (https://drive.google.com/file/d/1YvlVnife4JKkhgovo7Ot1koQ67w5-Tyi/view?usp=sharing) and with bandit feedback here (https://drive.google.com/file/d/18BUYTxlpBWjkrlygLMBtdaZL2ELgDfWk/view?usp=sharing). **We can see that both new models are still reliably no-regret, as validated by both of our frameworks for checking "no-regret" behavior.**
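For reference, below is a rough sketch in the spirit of a regression-style no-regret check: fit $\log \mathrm{Regret}_t$ against $\log t$ and test whether the estimated growth exponent is below 1 (i.e., sublinear growth). This simplified illustration is under our own assumptions; the exact statistical procedures of our two validation frameworks are specified in the paper:

```python
import numpy as np

def sublinear_regret_check(regret_curve):
    """Fit log(Regret_t) ~ alpha * log(t) + c by least squares and flag
    sublinear growth (alpha < 1), i.e., Regret_t / t -> 0."""
    t = np.arange(1, len(regret_curve) + 1)
    r = np.maximum(np.asarray(regret_curve, dtype=float), 1e-8)  # guard against log of non-positive values
    alpha, _ = np.polyfit(np.log(t), np.log(r), deg=1)
    return alpha, alpha < 1.0

# Example: a curve growing like sqrt(t) should be flagged as sublinear.
print(sublinear_regret_check(np.sqrt(np.arange(1, 501))))
```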
> Q2. Similarly, the authors themselves comment (in rebuttal to my response to their initial rebuttal) that the tasks they are studying are canonical and most well-studied tasks in online learning literature (which is one of the main complaints in my original review -- see point 3 in the weaknesses) and I highly suspect that some kind of memorization is going on here.
- **Firstly, we would like to clarify that what we argued in the rebuttal was that we are dealing with a canonical setting; we never said it is already *well-studied*.** Instead, we have emphasized that it is still an active area, already with many applications. As an important first attempt along this direction (to rigorously understand it), we had to start from something fundamental. We also remind the reviewer that even the most famous follow-the-regularized-leader (FTRL) and follow-the-perturbed-leader (FTPL) algorithms were designed **for this setting**. Generalizing them to other settings is non-trivial and requires ad-hoc modifications [1, 2], while some settings, such as adversarial MDPs and Markov games, are even fundamentally hard [3, 4]. These challenges do **NOT** prevent FTRL and FTPL from being important algorithms. Therefore, we believe it is not reasonable to require LLMs to straightforwardly generalize to other settings.
- **Being canonical might mean the training data contains such problems/tasks, but it does not mean the data has to contain the correct principles (or imitation examples) for the LLMs to memorize, due to various reasons such as the bounded rationality of humans and noise in data collection.** These are exactly the points of our hypothetical explanation in Sec. 4. In other words, we do admit that memorization is a ***likely*** explanation (as also stated in our first explanation in Sec. 4), but **it does NOT contradict** our later explanations, since we hope to **further** consider scenarios where good no-regret examples **may not** exist in the training data.
> Q3. I agree with the reviewer's point that there is a need for new kind of theory for RL -- but this theory needs to be grounded in empirics. The onus is on the authors to demonstrate convincingly that the phenomenon they are studying exists (and is thus worth explaining) -- they have failed to do so in my opinion (so far). Given this, accepting this paper would constitute bad science in my opinion and I would strongly argue against acceptance.
- We apologize if we have given the reviewer the false impression that we claimed *all LLMs are no-regret* (which was not our intention; in fact, we never did so, see the responses above). **In other words, the phenomenon we are studying is NOT "all LLMs are no-regret".** Meanwhile, according to our new experiments on Llama-2-70b and Mixtral-8x7b, no-regret behaviors are indeed exhibited by other families of LLMs.
- **Since online learning is a challenging problem even in our standard and canonical setting, the fact that there exist LLMs that exhibit the no-regret property is already worth understanding** (and is of great interest to us). We believe such understanding can further advance the study of how to enable no-regret behavior in LLMs that are not no-regret, if any (e.g., using the novel loss we designed in Sec. 5).
[1] Jin, Chi, et al. "Learning adversarial Markov decision processes with bandit feedback and unknown transition." International Conference on Machine Learning. PMLR, 2020.
[2] Cheng, Duo, Xingyu Zhou, and Bo Ji. "Follow-the-Perturbed-Leader for Adversarial Bandits: Heavy Tails, Robustness, and Privacy." The Twelfth International Conference on Learning Representations. 2023.
[3] Abbasi-Yadkori, Yasin, et al. "Online learning in Markov decision processes with adversarially chosen transition probability distributions." Advances in Neural Information Processing Systems 26 (2013).
[4] Liu, Qinghua, Yuanhao Wang, and Chi Jin. "Learning Markov games with adversarial opponents: Efficient algorithms and fundamental limits." International Conference on Machine Learning. PMLR, 2022.
Dear all reviewers,
Following Reviewer Rdii's suggestion to evaluate various LLMs in our Section 3, we **have additionally experimented** with Llama-2-70b and Mixtral-8x7b, representing two LLMs trained with (potentially) different corpus sources and model architectures. We refer to the updated results (for the original Figures 1 and 16) for online learning with full-information feedback [(link)](https://drive.google.com/file/d/1YvlVnife4JKkhgovo7Ot1koQ67w5-Tyi/view?usp=sharing) and with bandit feedback [(link)](https://drive.google.com/file/d/18BUYTxlpBWjkrlygLMBtdaZL2ELgDfWk/view?usp=sharing). **We can see that both new models are still reliably no-regret, as validated by both of our frameworks for checking "no-regret" behavior.** Our evaluation now covers GPT-3.5-Turbo, GPT-4, GPT-4-Turbo, Llama-2-70b, and Mixtral-8x7b, and we will add several more LLMs in the revised version.