**Dear reviewer gakY,**

We appreciate your insightful questions and positive support. We have provided a detailed response to the comments, which includes **ablation studies and ideas for future research directions**. Please let us know if you have any further comments or feedback. We will do our best to address them.

Best, Paper 12428 authors.

**Question 1.1**

> How robust is the proposed URL framework to hyperparameter variations, such as the number of dimensions in the latent manifold or the batch size used for training?

Initially, to ensure a fair comparison with baseline methods, we set the batch size and projection dimension identical to those in previous work [1] and did not fine-tune these hyperparameters. To further investigate the robustness of our framework, we conducted ablation studies on batch size and latent dimension. In these experiments, we kept the remaining hyperparameters identical to CLT's optimal configuration, used the state dataset for pretraining, and ran five random seeds per game during finetuning. The results are as follows:

**Table 1:** An ablation study of batch size.

| Batch Size | Act F1 | Rew F1 | IQM | Mean |
|:----------------- |:------:|:------:|:-----:|:-----:|
| 320 | 26.68 | 69.48 | 0.468 | 0.881 |
| 640 (``default``) | 25.90 | 67.70 | 0.451 | 0.773 |

**Table 2:** An ablation study of latent dimension.

| Latent Dim | Act F1 | Rew F1 | IQM | Mean |
|:----------------- |:------:|:------:|:-----:|:-----:|
| 256 | 24.17 | 64.70 | 0.321 | 0.700 |
| 512 (``default``) | 25.90 | 67.70 | 0.451 | 0.773 |
| 1024 | 26.49 | 68.89 | 0.459 | 0.730 |

From these experiments we observed the following. First, reducing the batch size led to an increase in performance. Second, increasing the projection dimension resulted in a general improvement in downstream performance. Interestingly, these results are highly consistent with the experimental findings reported for Barlow Twins [2]. Unlike conventional unsupervised representation learning methods ([3], [4], [5]), which exhibit performance improvements as batch size increases, Barlow Twins achieved its highest performance at a moderate batch size and showed a performance increase as the latent dimension grew. Thanks to the reviewer's suggestion, these experiments gave us a better understanding of how to set hyperparameters for our proposed framework, CLT. We will conduct further experiments with a wider range of batch sizes and latent dimensions and incorporate the results into the appendix of our paper.

**Question 1.2**

> Can the proposed URL framework be extended to other data types beyond high-dimensional images, such as text or audio data?

Although our expertise does not primarily lie in natural language or speech processing, we believe that our proposed URL framework has the potential to broaden its applicability to representation learning on audio datasets. First, it is important to note that the design of our proposed framework, the Causal Latent Transformer, shares similarities with causal language modeling (e.g., GPT) in NLP, where predicting a discrete word sequence is the primary objective. However, directly applying causal modeling to video (i.e., sequences of images) and audio (i.e., sequences of audio fragments) datasets is challenging due to the high-dimensional and continuous nature of image and audio fragments. Applying causal modeling directly would require an overly complex model to predict the raw, high-dimensional image or audio fragments. To address this limitation, the high-dimensional data must first be encoded into a latent representation before the sequence is modeled causally. We believe that a siamese model can serve as the encoder that maps the high-dimensional target to a latent representation, offering an end-to-end framework for causal modeling. However, siamese models are prone to representational collapse. The feature decorrelation objective in our framework effectively prevents this collapse, which enables causal modeling of video and audio datasets.
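For illustration, a Barlow Twins-style decorrelation term between predicted and target latents can be sketched as below. This is a minimal sketch rather than our exact implementation; the tensor shapes, the off-diagonal weight `lambda_offdiag`, and the standardization details are illustrative assumptions.

```python
import torch

def feature_decorrelation_loss(pred, target, lambda_offdiag=0.005, eps=1e-6):
    """Barlow Twins-style redundancy-reduction loss between two latent batches.

    pred, target: (batch, dim) tensors, e.g. causally predicted latents and the
    encoder outputs for the corresponding next observations (illustrative shapes).
    """
    # Standardize each feature dimension over the batch.
    pred = (pred - pred.mean(0)) / (pred.std(0) + eps)
    target = (target - target.mean(0)) / (target.std(0) + eps)

    # Cross-correlation matrix between the two branches, shape (dim, dim).
    c = (pred.T @ target) / pred.shape[0]

    # Pull diagonal entries toward 1 (invariance) and push off-diagonals toward 0
    # (decorrelation), which is what prevents representational collapse.
    diag = torch.diagonal(c)
    on_diag = (diag - 1.0).pow(2).sum()
    off_diag = (c - torch.diag(diag)).pow(2).sum()
    return on_diag + lambda_offdiag * off_diag
```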
**Question 1.3**

> How does the proposed URL framework contribute to understanding the relationship between unsupervised representation learning and reinforcement learning? How could this relationship be further explored in future research? It would be helpful if the authors could share some insight on the technical challenges in answering these questions rigorously.

Through comprehensive experiments, we made the following observations. (1) While both contrastive learning ([3], [4]) and feature decorrelation approaches ([2], [6]) have achieved competitive results in the computer vision literature, our empirical findings suggest that feature decorrelation-based methods may be more appropriate candidates than contrastive methods in reinforcement learning (RL). (2) The feature rank of the pretrained model's representation plays an important role in downstream performance. While a high feature rank in a pretrained model did not necessarily result in superior downstream performance, a sufficiently high feature rank was essential for achieving good downstream performance.

Despite the absence of theoretical validation for the importance of feature rank in RL, we conjecture that its role in RL is linked to model plasticity. Several recent studies have explored the significance of model plasticity in RL ([7], [8]), where plasticity refers to a model's capacity to adapt to new objectives. According to Lyle et al. [7], feature rank can serve as one important metric for measuring a model's plasticity. In supervised learning, a model is generally trained on a fixed data distribution; in RL, however, a model is trained on a continuous data stream from a non-stationary distribution. As a result, maintaining high plasticity is vital for the model to consistently adapt to newly collected data. Consequently, we believe that preserving an appropriately high level of plasticity in a pretrained model is crucial for attaining good downstream performance in RL.

From a plasticity standpoint, we believe there are several distinct technical challenges in applying unsupervised representation learning to reinforcement learning, including:

1. Identifying methods to assess model plasticity beyond feature rank.
2. Designing a novel unsupervised representation learning objective that directly maximizes model plasticity alongside the causal prediction.

We hope that sharing our perspective on these technical obstacles will be beneficial for future work applying URL to RL. We will incorporate these insights into the conclusion and limitations sections of our manuscript.
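As one concrete way to estimate feature rank, the sketch below counts how many singular values of a feature matrix are needed to capture most of the spectrum. The threshold `delta` and the way the feature matrix is collected are illustrative assumptions, not necessarily the exact definition used in our figures.

```python
import torch

def effective_feature_rank(features, delta=0.01):
    """Effective rank of a (n_samples, dim) feature matrix (sketch).

    Returns the smallest k such that the top-k singular values account for
    (1 - delta) of the total singular-value mass.
    """
    s = torch.linalg.svdvals(features.float())       # singular values, descending
    cumulative = torch.cumsum(s, dim=0) / s.sum()
    return int((cumulative < 1.0 - delta).sum().item()) + 1

# Usage sketch: stack latents from the frozen pretrained encoder over sampled
# states, then track this quantity across pretraining checkpoints.
# feats = torch.cat([encoder(obs) for obs in obs_batches], dim=0)
# print(effective_feature_rank(feats))
```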
[1] Light-weight Probing of Unsupervised Representations for Reinforcement Learning, Wancong Zhang et al., arXiv 2022.
[2] Barlow Twins: Self-Supervised Learning via Redundancy Reduction, Jure Zbontar et al., ICML 2021.
[3] SimCLR: A Simple Framework for Contrastive Learning of Visual Representations, Ting Chen et al., ICML 2020.
[4] MoCo: Momentum Contrast for Unsupervised Visual Representation Learning, Kaiming He et al., CVPR 2020.
[5] BYOL: Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning, Jean-Bastien Grill et al., NeurIPS 2020.
[6] On Feature Decorrelation in Self-Supervised Learning, Tianyu Hua et al., ICCV 2021.
[7] Understanding and Preventing Capacity Loss in Reinforcement Learning, Clare Lyle et al., ICLR 2022.
[8] Understanding Plasticity in Neural Networks, Clare Lyle et al., arXiv 2023.

<br> <br> <br> <br> <br>

**Dear reviewer U7y4,**

We appreciate your valuable feedback and constructive criticism. We have provided a detailed response to address your concerns, including the **originality** of our work. Please let us know if you have any further comments or feedback. We will do our best to address them.

Best, Paper 6305 authors.

**Question 2.1**

> The authors need to be clearer about the contributions of this work, especially in terms of originality. From my understanding, feature decorrelation for representation learning in unsupervised pre-training for RL was already examined by Zhang et al. (2022) (without modeling using transformers), which conflicts with the statement at L095.
> L095: "In response, we introduce a new unsupervised representation learning objective for RL"

We appreciate the reviewer's incisive remarks on the originality and contributions of our work. Indeed, the feature decorrelation objective has already been utilized in unsupervised representation learning for computer vision ([1], [2]) and has recently been employed in reinforcement learning [3]. However, **the core originality and contribution of our work lie in the integration of causal modeling in the latent space with feature decorrelation, proposing an effective unsupervised representation learning framework for reinforcement learning.**

The main assertion of this paper is that causal modeling in the latent space is an essential objective for unsupervised representation learning in RL. Within causal modeling, feature decorrelation can be employed as a regularization objective to prevent representation collapse. Furthermore, we provide an analysis of the effectiveness of feature decorrelation as a regularization objective by comparing it with batch normalization and contrastive learning, as outlined in the Empirical Study of Section 4.2. In our main experiment, we compared the performance of our approach with Barlow Twins [2] and BarlowBalance [3], which lack an explicit causal modeling component and include only a feature decorrelation objective. Empirically, we found that CLT substantially outperforms these baselines, which highlights the importance of causal modeling.

To clearly convey our paper's originality and contributions, we will emphasize throughout the paper that our contribution is the proposal of an unsupervised representation learning framework that integrates latent causal modeling and feature decorrelation. Additionally, we will revise the statement at L095 to clarify our contribution.

[1] On Feature Decorrelation in Self-Supervised Learning, Tianyu Hua et al., ICCV 2021.
[2] Barlow Twins: Self-Supervised Learning via Redundancy Reduction, Jure Zbontar et al., ICML 2021.
[3] Light-weight Probing of Unsupervised Representations for Reinforcement Learning, Wancong Zhang et al., arXiv 2022.

**Question 2.2**

> For empirical evaluation of unsupervised pre-training methods, I think any additional assumptions need to be clarified. Especially in this work, Causal Latent Transformer (CLT) predicts future states without conditioning on actions or both future states and actions. This is essentially sequence (trajectory) modeling, with the potential underlying assumption that the pre-training dataset is generated by actors with consistency to some degree in their action selection.

We appreciate your observation that our method fundamentally focuses on sequence (trajectory) modeling, and we agree that CLT may rely on the underlying assumption that the pre-training dataset is generated by actors exhibiting a certain level of consistency in their action selection. However, if the dataset's behavior policy is not sufficiently consistent, which makes sequence modeling challenging, our framework can easily be extended to learn transition dynamics in a non-causal manner. As demonstrated in Section 5.1, CLT can be trained similarly to BERT by masking latents within the sequence and restoring them non-causally, as sketched below. We believe this approach can be effective when the dataset's behavior policy is not consistent.

The limitation highlighted by the reviewer will certainly help readers understand how to use CLT depending on the composition of the given pretraining data. Consequently, we will incorporate this limitation and our suggestions into the manuscript.
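Below is a minimal sketch of such a non-causal, BERT-style variant: a fraction of latent tokens is replaced by a learned mask embedding and reconstructed with a bidirectional transformer. The mask ratio, module sizes, and reconstruction target are illustrative assumptions, not the exact configuration used in Section 5.1.

```python
import torch
import torch.nn as nn

class MaskedLatentModel(nn.Module):
    """Non-causal (BERT-style) latent modeling sketch with illustrative hyperparameters."""

    def __init__(self, dim=512, n_layers=4, n_heads=8, mask_ratio=0.15):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.mask_token = nn.Parameter(torch.zeros(dim))
        layer = nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, n_layers)  # no causal mask

    def forward(self, latents):                      # latents: (batch, T, dim)
        # Randomly mask a fraction of the latent tokens.
        mask = torch.rand(latents.shape[:2], device=latents.device) < self.mask_ratio
        corrupted = torch.where(mask.unsqueeze(-1), self.mask_token, latents)
        restored = self.transformer(corrupted)
        # Reconstruct only the masked positions; a feature decorrelation term
        # could be added on top to prevent representational collapse.
        return ((restored - latents.detach())[mask] ** 2).mean()
```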
**Question 2.3**

> While there are minor grammar issues or typos with the writing, there exist more critical typos including:
> 1. At L171, it says the predictor is $q$, but I think it should be $p$.
> 2. At L210, $s_{1:T}$ is used as the output of the action predictor, but it already denotes states.

We apologize for any confusion that these errors may have caused and appreciate your diligence in identifying them. Regarding the specific errors you mentioned:

1. At L171, you are correct that the predictor should be denoted as $p$ instead of $q$.
2. At L210, we acknowledge the mistake of using $s_{1:T}$ as the output of the action predictor while it already denotes states. We will rectify this by introducing a more appropriate notation for the output of the action predictor, ensuring that it does not conflict with the existing notation for states.

In addition to addressing these specific issues, we will also carefully review the entire manuscript to fix any other grammar issues or typos that may have been missed.

**Question 2.4**

> I know Table.2 includes the results with behavioral cloning, but could you also state the performance of the demonstrations for a comparison?

We employed the publicly available DQN replay dataset provided by Dopamine [1], consisting of transition tuples (observation, action, reward, next observation) acquired from training agents. As the actual game score is not available in the dataset, we offer an upper-bound estimate by averaging the logged evaluations in Dopamine. We report the mean and median of Human Normalized Scores (HNS) across the 26 Atari-100K games, as well as the number of games in which the agent achieves super-human performance (>H) and surpasses the random policy (>0).

**Table 1:** A summary of the pretraining dataset.

| HNS Median | HNS Mean | > H | > 0 | Dataset size |
|:----------:|:--------:|:---:|:---:|:------------:|
| 0.463 | 1.064 | 7 | 26 | 1.5M |

To help readers better understand the pretraining dataset, we will include this table in the experimental setup section of our manuscript.

[1] Dopamine: A Research Framework for Deep Reinforcement Learning, Pablo Samuel Castro et al., arXiv 2019.
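For reference, the Human Normalized Score in the table above follows the standard Atari normalization; a minimal sketch of the metric and the reported aggregates is given below (the per-game random and human reference scores are the standard published values and are not reproduced here).

```python
def human_normalized_score(agent_score, random_score, human_score):
    """Standard Human Normalized Score: 0 = random policy, 1 = human level."""
    return (agent_score - random_score) / (human_score - random_score)

# Aggregates over the 26 Atari-100K games, assuming `hns` maps game -> HNS:
# hns_mean       = sum(hns.values()) / len(hns)
# hns_median     = sorted(hns.values())[len(hns) // 2]
# n_superhuman   = sum(v > 1.0 for v in hns.values())   # the "> H" column
# n_above_random = sum(v > 0.0 for v in hns.values())   # the "> 0" column
```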
**Question 2.5**

> Could you compare CLT and the baseline methods in terms of the pre-training cost (required numbers of parameters, etc.)?

In the following tables, we present a comparison of the pre-training costs of CLT and the baseline methods. For models that employ a momentum encoder (CURL, ATC, SGI), the parameter count of the target network is omitted. Training time is measured on a single A100 GPU.

**Table 1:** A comparison of representation learning methods on the state dataset.

| Methods | params (M) | training time (hr) | IQM |
|:------------ |:----------:|:------------------:|:-----:|
| VAE | 1.48 | 27.5 | 0.266 |
| BarlowTwins | 2.01 | 23.3 | 0.224 |
| CURL | 2.26 | 31.3 | 0.247 |
| RSSM | 6.32 | 28.8 | 0.302 |
| ATC | 2.53 | 21.2 | 0.353 |
| CLT (`ours`) | 8.84 | 11.3 | 0.451 |

**Table 2:** A comparison of representation learning methods on the demonstration dataset.

| Methods | params (M) | training time (hr) | IQM |
|:------------- |:----------:|:------------------:|:-----:|
| BC | 1.75 | 9.3 | 0.413 |
| IDM | 3.35 | 21.2 | 0.343 |
| SGI | 8.86 | N/A | 0.380 |
| BarlowBalance | 13.72 | N/A | 0.338 |
| CLT (`ours`) | 9.12 | 11.7 | 0.500 |

We used the official code to train both SGI and BarlowBalance, and observed training times of 29.5 hours for SGI and 70.8 hours for BarlowBalance. However, the official code for both methods records various additional metrics during training, which inflates the training time. Therefore, for these two methods, we believe it is difficult to provide accurate measurements of training time and report these values as N/A. We will include the pre-training costs in the main results (Table 2) of our manuscript.

<br> <br> <br> <br> <br>

**Dear reviewer JRh4,**

We appreciate your valuable feedback and constructive criticism. We have provided a detailed response to address your concerns, including the **missing references and limitations** of our work. Please let us know if you have any further comments or feedback. We will do our best to address them.

Best, Paper 6305 authors.

**Question 3.1**

> The use of 'finetuning' term is confusing. Throughout the paper, it refers to frozen encoder evaluation on downstream RL tasks, while in Table 4 it denotes actual fine-tuning of the encoder. I would suggest to use e.g. 'downstream evaluation' for frozen encoder evaluation instead.

We agree that the usage of the term 'finetuning' can indeed lead to ambiguity within the context of this paper. To alleviate any potential confusion, we will use the term 'downstream evaluation' to describe the procedure of training the policy on top of the frozen encoder.

**Question 3.2**

> How does this method compare to EfficientZero? EfficientZero doesn't use offline pre-training at all and performs better than the proposed method.
> Granted, EfficientZero is a fundamentally different method because it utilizes an MCTS policy improvement operator and has a different architecture, but I believe the comparison should still be made for completeness, which shouldn't require running more experiments.

The reviewer highlights the importance of comparing our proposed method, CLT, with state-of-the-art methods such as EfficientZero, which do not employ pretraining and have demonstrated superior performance on the Atari-100K benchmark. Below, we provide a comparison to state-of-the-art reinforcement learning algorithms, with results taken from the respective papers.

**Table:** A comparison to reinforcement learning algorithms without pretraining.

| Methods | IQM | Median | Mean | OG |
|:---------------- |:-----:|:------:|:-----:|:-----:|
| IRIS [1] | 0.501 | 0.289 | 1.046 | 0.512 |
| SR-SPR [2] | 0.632 | 0.685 | 1.272 | 0.433 |
| EfficientZero [3] | N/A | 1.090 | 1.943 | N/A |
| CLT (frz) | 0.451 | 0.434 | 0.773 | 0.522 |
| CLT (ft + reset) | 0.601 | 0.493 | 1.124 | 0.467 |

Examining the results, we can see that reinforcement learning algorithms ([1], [2], [3]), including EfficientZero, outperform our proposed CLT. Nevertheless, it is crucial to note that our research is more complementary to than competitive with these existing approaches. Our primary objective is not to attain state-of-the-art performance by integrating various pretraining and finetuning strategies, but rather to identify which pretraining techniques yield effective representations for finetuning. Consequently, we believe that all of the aforementioned methods, including EfficientZero, can potentially gain additional performance improvements by employing a pretrained representation from our CLT. In fact, an experiment in SR-SPR [2] demonstrated that a pretrained representation consistently improves its performance on the Atari benchmark.

Of course, we agree that comparing the performance of CLT to state-of-the-art methods provides valuable insight into the benefits and limitations of pretraining. Therefore, we will include this comparison in our manuscript for the benefit of our readers.

[1] IRIS: Transformers are Sample-Efficient World Models, Vincent Micheli et al., ICLR 2023.
[2] SR-SPR: Sample-Efficient Reinforcement Learning by Breaking the Replay Ratio Barrier, Pierluca D'oro et al., ICLR 2023.
[3] EfficientZero: Mastering Atari Games with Limited Data, Weirui Ye et al., NeurIPS 2021.

**Question 3.3**

> Connected to 2, it's not clear what the difference is between Figure 9 in the appendix and Figure 5 in [4]. Is the point here that the authors do not see the correlation claimed in [4], or that the correlation doesn't hold when the encoder is not frozen?

In accordance with the experimental protocol outlined in [4], we evaluated the quality of the pretrained representation using both linear probing and downstream evaluation. Owing to the computationally intensive nature and high-variance outcomes of downstream evaluation, the authors of [4] proposed using a computationally efficient linear probing method as a proxy for downstream evaluation performance. They show that the correlation between linear probing scores (act F1, rew F1) and downstream evaluation performance (IQM) is notably high (0.808 and 0.967, respectively). However, our experiments yielded a lower correlation between linear evaluation performance (act F1, rew F1) and downstream evaluation performance (IQM), with values of 0.73 and 0.56, respectively. Consequently, we argue in Section E of the appendix that a higher linear probing score may not necessarily guarantee improved downstream performance in reinforcement learning. We would like to highlight that downstream evaluation remains an essential step for properly measuring a pretrained model's performance.

[4] Light-weight Probing of Unsupervised Representations for Reinforcement Learning, Wancong Zhang et al., arXiv 2022.
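As a rough illustration of the probing protocol, a linear probe can be fit on frozen encoder features to predict, for example, the action taken, with macro F1 as the score. The scikit-learn call below is a sketch under that assumption, not the exact probing setup of [4].

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

def linear_probe_f1(train_feats, train_labels, test_feats, test_labels):
    """Fit a linear probe on frozen features and report macro F1 (sketch).

    *_feats: (n, dim) arrays of frozen encoder outputs;
    *_labels: discrete targets such as the action taken or a reward bin.
    """
    probe = LogisticRegression(max_iter=1000)
    probe.fit(train_feats, train_labels)
    return f1_score(test_labels, probe.predict(test_feats), average="macro")
```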
**Question 3.4**

> The use of feature rank is not clearly motivated. The authors describe it as a tool to guide method design, but Figure 4 shows only moderate correlation with HNS IQM, and it's not clear if any design decision was made using that metric. The only thing that is clear to me is that increasing the coefficient increases feature rank (figure 3), but it's hard to tell from figure 4 if there's a single best value. Is Figure 3 averaging feature rank over all the games? If so, why does Figure 4 not do the same for consistency? Possibly, the correlation to IQM HNS would be more significant if feature rank is averaged over all the games.

In this response, we aim to clarify the methodology and intent behind the metrics depicted in Figures 3 and 4. First, Figure 3 reports metrics averaged across all games. Its primary objective is to discern the relationship between feature rank and downstream evaluation performance, specifically IQM. From this figure, it is evident that increasing the feature rank up to a certain threshold results in improved downstream performance. Figure 4 was then designed to assess whether the relationship established in Figure 3 also holds in individual games. Consequently, we presented the feature rank and downstream performance (HNS) for each game, using the best model identified in Figure 3 ($\lambda_d = 0.01$) as a basis. Although Figure 4 shows that a high feature rank does not ensure high downstream performance, it does reveal that all instances of low feature rank exhibited suboptimal downstream performance.

In summary, the core message of Figures 3 and 4 is that while a high feature rank in a pretrained model does not necessarily result in superior downstream performance, a sufficiently high feature rank is essential for achieving good downstream performance. We will make this message clearer in our revised manuscript.

**Question 3.5**

> BYOL uses momentum encoder. Is there a reason why this method doesn't use momentum encoder for the target branch?

Indeed, a more accurate interpretation of our proposed framework, CLT, is as an integration of SimSiam [1] and Barlow Twins [2], rather than as a combination of BYOL [3] and Barlow Twins. SimSiam, a model in which the momentum encoder is removed from BYOL, demonstrated that although the momentum encoder contributes to the stability of the learning process, it is not an essential component for preventing representation collapse. Furthermore, Barlow Twins showed that the momentum encoder becomes unnecessary when a feature decorrelation loss is employed. During the hypothesis validation phase, we experimented with momentum encoders on four games (Assault, Breakout, Pong, and Qbert) and observed an overall decline in performance, accompanied by a 1.2x increase in training time. As a result, we decided not to use the momentum encoder in our default setting.
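For clarity, the two target-branch designs discussed above differ as follows: BYOL maintains a separate target network updated by an exponential moving average, while the SimSiam-style alternative simply stops gradients through the target branch of the same encoder. The sketch below illustrates the difference; the momentum value is an illustrative assumption.

```python
import torch

@torch.no_grad()
def ema_update(online_encoder, target_encoder, momentum=0.99):
    """BYOL-style momentum target: slowly track the online encoder's weights.

    target_encoder is typically initialized as a copy of online_encoder.
    """
    for p_online, p_target in zip(online_encoder.parameters(),
                                  target_encoder.parameters()):
        p_target.mul_(momentum).add_(p_online, alpha=1.0 - momentum)

# SimSiam-style alternative (our default): no separate target network;
# the target branch simply detaches gradients from the same online encoder.
# target_latent = online_encoder(next_obs).detach()
```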
[1] SimSiam: Exploring Simple Siamese Representation Learning, Xinlei Chen et al., CVPR 2021.
[2] Barlow Twins: Self-Supervised Learning via Redundancy Reduction, Jure Zbontar et al., ICML 2021.
[3] BYOL: Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning, Jean-Bastien Grill et al., NeurIPS 2020.

**Question 3.6**

> Limitations outlined in section F seem tangential to the main point of the paper. An important limitation is the environment. The method is only tested on Atari, and further experiments would be needed to check if it applies to other environments.

We appreciate the reviewer's feedback regarding the limitations of our study. We acknowledge that our method has been tested primarily on Atari environments, and we understand that this may limit the generalizability of our findings. In response to your concerns, we will revise the limitations section of our paper to address this issue more directly, emphasizing the limitation of testing only on Atari and the need for further experiments in other domains to better understand the method's generalizability and applicability.

**We appreciate all three reviewers for their constructive feedback and valuable comments.**

**The strengths of our paper, as recognized by the reviewers, are:**

- Proposing a novel URL method for RL that causally models the sequence with a feature decorrelation objective.
- A comprehensive analysis of the model's components in designing the framework.
- Conducting extensive experiments and ablation studies to verify the effectiveness of the proposed method.

**We have addressed the reviewers' concerns in the following manner:**

- Performed ablation studies on batch size and latent dimensions (Question 1.1).
- Discussed the technical challenges and future research directions in applying URL to RL (Question 1.3).
- Clarified the contribution and originality of our paper (Question 2.1).
- Explored the potential limitation of our method with regard to the data collection policy (Question 2.2).
- Provided a comparison to EfficientZero and discussed the potential benefits of pretraining (Question 3.2).
- Clarified the relationship between feature rank and downstream performance (Question 3.4).

**We hope our responses address all reviewers' concerns, and we welcome any additional comments and clarifications.**
