# ICLR2022 Rebuttal
## General Response to Reviewers
We appreciate the valuable comments from all reviewers. Following their suggestions, we have substantially revised the paper; the modified parts are marked in red in the revision. To provide a quick overview of the revised paper, we summarize its main content as follows.
- Introduction: illustrate the challenges of extending RLSVI to the deep RL scenario: feature learning and computational burden.
- Methodology: extend the hypermodel from bandits to deep RL.
- (**New 1**) a suitable architecture to make the objective trainable.
- a theoretical result to explain why the (linear) hypermodel can effectively approximate the posterior distribution.
- an objective function to address issues of extending RLSVI.
- Experiment: numerical results to validate the proposed method.
- (**New 2**) HyperDQN performs well on both hard-exploration and easy-exploration tasks from Atari and SuperMarioBros.
- (**New 3**) Validate that HyperDQN is competitive with another strong baseline, NoisyNet.
- Explain and validate why commitment is important for HyperDQN.
- Visualize the multi-step uncertainty and deep exploration of HyperDQN to provide insights.
- Conclusion: it's possible to extend HyperDQN for other domains.
- (**New 4**) The extension HyperActorCritic (HAC) outperforms SAC on the hard exploration task Cart Pole.
- Leverage the informative prior to accelerate exploration.
New 1: We explain the architectural difference from the original design in (Dwaracherla et al., 2020). Specifically, we illustrate why the direct extension of (Dwaracherla et al., 2020) fails in the deep RL case. This helps clarify our novelty and contribution compared with (Dwaracherla et al., 2020).
New 2: We visualize the relative improvements on both "hard exploration" and "easy exploration" problems to better understand the advances.
New 3: We include the discussion and empirical comparison with another baseline, NoisyNet, in Appendix F.4.
New 4: We add the preliminary result on the extension to continuous control tasks.
---
Dwaracherla, Vikranth, et al. "Hypermodels for exploration." ICLR 2020.
## Response to Reviewer 1
Thanks for your insightful review.
**Question 1**: The results if the meta-model is a non-linear function.
**Answer 1**: First, when the meta-model is non-linear and the latent variable $z$ is a standard Gaussian random variable, the posterior sample $\theta_{\text{predict}} := f_{\nu}(z)$ no longer follows a unimodal Gaussian distribution. As a result, the representation power of the posterior distribution could be enhanced. The downside, however, is that the training problem becomes harder.
Second, we provide an empirical investigation of this direction in Appendix D.5. From Figure 7 in Appendix D.5, we observe that a non-linear hypermodel performs similarly to the linear hypermodel; in particular, it does not bring obvious gains. We believe the underlying reason is that a non-linear hypermodel is slightly harder to train.
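To make the comparison concrete, here is a minimal PyTorch sketch (with hypothetical module names and dimensions, not our exact implementation) contrasting a linear hypermodel with a non-linear one; both map an index $z \sim \mathcal{N}(0, I)$ to the parameters of a prediction head.

```python
import torch
import torch.nn as nn

index_dim, param_dim = 16, 513  # hypothetical sizes (e.g., last-layer weights + bias)

# Linear hypermodel: theta = A z + b, so theta is Gaussian whenever z ~ N(0, I).
linear_hyper = nn.Linear(index_dim, param_dim)

# Non-linear hypermodel: the MLP pushes z through non-linearities, so the induced
# distribution over theta is no longer a unimodal Gaussian (more expressive,
# but empirically harder to train, as discussed above).
nonlinear_hyper = nn.Sequential(
    nn.Linear(index_dim, 64), nn.ReLU(),
    nn.Linear(64, param_dim),
)

z = torch.randn(index_dim)             # one sampled index
theta_linear = linear_hyper(z)         # Gaussian sample of the head parameters
theta_nonlinear = nonlinear_hyper(z)   # non-Gaussian sample of the head parameters
```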
---
**Question 2(a)**: The choice of $\sigma_{\omega}$ and $\sigma_{p}$ in the objective function.
**Answer 2(a)**: First, the values of these parameters are listed in Table 2 in Appendix D. Second, as stated in Remark 6 in Appendix D.3, we choose these values based on parameter-initialization considerations. Third, to further address your concern, we provide ablation studies of these parameters in Appendix D.5. The results in Figure 6 indicate that our method is not sensitive to the noise scale $\sigma_\omega$, but a large prior scale $\sigma_p$ leads to poor performance because the posterior update becomes slow relative to the strong prior term.
**Question 2(b)**: The baseline where an agent uses an identical noise term for modifying the $Q_{\text{target}}$.
**Answer 2(b)**: We are sorry that we do not fully understand this question. In our formulation, there is a noise term $\sigma_\omega z^{\top} \xi$ in Equation (4.2) for HyperDQN, where $\xi$ is associated with each sample $(s, a, r, s^\prime)$. Do you mean using an identical $\xi$ for all $(s, a, r, s^\prime)$ tuples?
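To make sure we interpret the question correctly, here is a minimal sketch (with hypothetical tensor names and an illustrative noise scale, not our exact code) contrasting a per-sample $\xi$ with a single $\xi$ shared across a mini-batch:

```python
import torch

batch_size, index_dim = 32, 16
sigma_w = 0.01                 # illustrative value of the noise scale sigma_omega
z = torch.randn(index_dim)     # index sampled for this update

# Our formulation: each transition (s, a, r, s') carries its own xi.
xi_per_sample = torch.randn(batch_size, index_dim)
noise_per_sample = sigma_w * (xi_per_sample @ z)                   # shape (batch_size,)

# Our reading of the suggested baseline: one xi shared by all tuples in the batch.
xi_shared = torch.randn(index_dim)
noise_shared = sigma_w * torch.dot(xi_shared, z) * torch.ones(batch_size)

# Either noise term would then perturb the TD target of Equation (4.2),
# schematically: target = r + noise + gamma * max_a' Q_target(s', a'; z).
```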
---
**Question 3**: The performance on "hard exploration" and "easy exploration" tasks.
**Answer 3**: Thanks for this suggestion. In the revision, we show the relative improvements of HyperDQN on "hard-exploration" and "easy-exploration" problems in Figures 11 and 12 in Appendix E. In particular, we observe that HyperDQN has clear improvements in both "easy exploration" environments (e.g., Battle Zone, Jamesbond, and Pong) and "hard exploration" environments (e.g., Frostbite, Gravitar, and Zaxxon). Unfortunately, HyperDQN does not work on Montezuma’s Revenge. In fact, almost all randomized exploration methods (including BootDQN and NoisyNet) cannot perform well on this task. One reason is that the extremely sparse reward provides limited feedback for feature selection, which is crucial for randomized exploration methods, as argued in the introduction.
In addition to the Atari suite, SuperMarioBros-1-3-v1 and SuperMarioBros-2-2-v1 are two hard exploration problems due to sparse reward and long horizon. HyperDQN has clear improvements on these two tasks. We have clarified this point in the revision.
Following this discussion, we would like to further explain why the metric in Figure 3 (i.e., results averaged over 56 tasks) is a good measure of exploration efficiency. First, in practice, **before solving a task, we do not know whether it is a "hard exploration" or "easy exploration" problem**. As a result, we hope an intelligent agent can perform well on both types of problems. Second, we want to highlight that **even for the "easy exploration" tasks, we still need more efficient strategies than epsilon-greedy**. Note that we are by no means calling for less emphasis on specific hard exploration problems; rather, we should pay attention to the improvement across all tasks.
---
**Question 4**: Why are the agents only trained for 20M frames?
**Answer 4**: The main reason is that training for 20M frames is sufficient to demonstrate exploration efficiency, while training for 200M frames incurs an unacceptable cost for us (and likely for many RL researchers). We elaborate on this claim as follows.
First, we would like to point out that the 20M-frame training budget is commonly used (Lee et al., 2019; Rashid et al., 2020; Bai et al., 2021). The main reason is that training with 20M frames is sufficient to solve many tasks in Atari (e.g., Battle Zone and Pong) and SuperMarioBros (e.g., 1-1 and 1-2). As a result, we can compare different algorithms in the 20M-frame regime.
Second, training with 200M frames is expensive for us (and likely for many RL researchers) in both time and money. Specifically, running DQN for 200M frames on a single environment takes about **30 days**. Unfortunately, we have a limited number of servers, so we cannot finish all experiments with 200M frames in an acceptable time. Furthermore, the fee for one such experiment is about 30 * 24 * 0.5 = 360\$, since the unit cost (server and power) is 0.5\$ per hour. Hence, the cost for 56 environments is about **20,160\$** for a single algorithm, which we cannot afford at the current stage.
---
**Question 5**: Why does DoubleDQN in Fig 3 start at a value lower than 0?
**Answer 5**: This is due to the implementation in DQN Zoo. In particular, the evaluation policy of DoubleDQN at the initial stage is not a random policy, so its performance does not match the random-policy score of 0. In contrast, the other baselines match a random policy at initialization due to their randomization.
---
**Question 6**: Random seeds and error bars.
**Answer 6**: We use 3 random seeds for Atari and SuperMarioBros (due to limited computation resources) and 5 random seeds for Deep Sea. We have added these details to the main text in the revision. We show error bars in almost all figures. In particular, we do not show error bars in Figure 3 because the error bars over 56 environments are large, which makes the curves unclear. Note that in (O’Donoghue et al., 2018; Taiga et al., 2020), error bars are also not shown for this type of figure. Please refer to Figure 16 and Figure 17 for the learning curves of each environment, in which we clearly show the error bars.
---
Lee, Su Young, et al. "Sample-efficient deep reinforcement learning via episodic backward update." NeurIPS 2019.
Bai, Chenjia, et al. "Principled exploration via optimistic bootstrapping and backward induction." ICML 2021.
Rashid, Tabish, et al. "Optimistic exploration even with a pessimistic initialisation." ICLR 2020.
Taiga, Adrien Ali, et al. "On Bonus-Based Exploration Methods in the Arcade Learning Environment." ICLR 2021.
O’Donoghue, Brendan, et al. "The uncertainty bellman equation and exploration." ICML 2018.
## Response to Reviewer 2
Thanks for your valuable comments.
**Comment 1**: novelty and contribution of this work compared with (Dwaracherla et al., 2020).
**Answer 1**: First, we apologize for not clarifying the architectural difference from (Dwaracherla et al., 2020) in the submission, which led to your claim that "the method is exactly the same as the one proposed in a prior work". In fact, even though it is inherited from (Dwaracherla et al., 2020), **our method is different from the original one**. More precisely, Dwaracherla et al. (2020) apply the hypermodel to **all** layers of the base model to solve simple bandit tasks (see (Dwaracherla et al., 2020, Figure 1)). In contrast, we extend the hypermodel to the deep RL case by applying it to the **last** layer of the value function. We elaborate on this point below; please also refer to Remark 1 in Section 4.1 and Appendix F.2 for a detailed discussion.
Second, the modification (all layers vs. the last layer) is essential and is motivated by a **trainability** issue. In particular, **a direct extension of (Dwaracherla et al., 2020) would fail in the deep RL scenario**. The underlying reason is **parameter initialization**. Concretely, with the architecture of (Dwaracherla et al., 2020), the output of the initialized hypermodel is not a good initialization for the parameters of the base model. Modern initialization techniques (e.g., LeCun's initialization) suggest initializing the parameters of the $i$-th layer by sampling from the Gaussian distribution $\mathcal{N}(0, \sigma^2)$ with $\sigma=1/\sqrt{d_{i-1}}$, where $d_{i-1}$ is the width of the $(i-1)$-th layer. The architecture in (Dwaracherla et al., 2020) cannot achieve this; instead, it implies that the magnitude of the base-model parameters is about $1$ (up to constants). As a result, the input signal amplifies over layers and the gradient explodes.
In fact, **Dwaracherla et al. (2020) only use a two-layer base model with a width of 10 for bandit tasks**, so the trainability issue is not severe in their applications. But **for deep RL, we use deeper and wider neural networks**, so the trainability issue is important, which motivates the simple architecture in our paper. In our model, the hidden layers of the base model are initialized with standard techniques, and the last layer can be properly initialized by normalizing the output of the hypermodel. This resolves the parameter initialization issue. Fortunately, this simple architecture still retains the main ingredient of RLSVI (i.e., capturing the posterior distribution over a linear prediction function).
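As a rough illustration of this architectural choice, below is a minimal PyTorch sketch (hypothetical class and dimension names; the exact normalization in the paper may differ). The feature extractor keeps its standard initialization, and only the last linear layer is generated by a linear hypermodel whose output is rescaled toward the LeCun scale $1/\sqrt{d}$.

```python
import math
import torch
import torch.nn as nn

class LastLayerHyperQ(nn.Module):
    """Sketch: a hypermodel generates only the last layer of the Q-network."""

    def __init__(self, obs_dim, num_actions, feature_dim=256, index_dim=16):
        super().__init__()
        # Hidden layers of the base model keep standard initialization.
        self.features = nn.Sequential(
            nn.Linear(obs_dim, feature_dim), nn.ReLU(),
            nn.Linear(feature_dim, feature_dim), nn.ReLU(),
        )
        # Linear hypermodel: maps the index z to last-layer weights and biases.
        self.hyper = nn.Linear(index_dim, num_actions * (feature_dim + 1))
        self.feature_dim, self.num_actions = feature_dim, num_actions

    def forward(self, obs, z):
        phi = self.features(obs)            # (batch, feature_dim)
        params = self.hyper(z)              # generated last-layer parameters
        # Rescale so the generated weights start near the LeCun scale 1/sqrt(feature_dim).
        params = params / math.sqrt(self.feature_dim)
        w = params[: self.num_actions * self.feature_dim].view(self.num_actions, self.feature_dim)
        b = params[self.num_actions * self.feature_dim:]
        return phi @ w.t() + b              # (batch, num_actions), i.e., Q(s, ., z)
```

By contrast, generating every layer of a deep Q-network in this way would leave the hidden-layer weights at an $O(1)$ scale, which is exactly the amplification problem described above.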
Third, we also want to mention that we provide a **theoretical insight** into **why the hypermodel can approximate the posterior distribution**, a fundamental question along this research direction. This theoretical guarantee is missing in (Dwaracherla et al., 2020). In particular, Dwaracherla et al. (2020) only prove that a linear hypermodel has sufficient representation power to approximate any distribution (over functions), so it is unnecessary to use a non-linear one; they do not elaborate on *whether the linear hypermodel can approximate the posterior distribution after optimization*.
We are sorry for the misunderstanding caused by these missing messages in the submission. We hope the above points clarify our novelty and contribution compared with (Dwaracherla et al., 2020).
---
**Comment 2(a)**: OB2I is not a SOTA exploration method.
**Answer 2(a)**: We understand your concern and apologize for the imprecise claim in the submission. To address it, we have revised the claim: OB2I is a SOTA UCB-type exploration method under a 20M-frame training budget, and it is unclear whether it remains SOTA under a 200M-frame training budget.
**Comment 2(b)**: "UCB-type algorithms, such as noisy net might be able to outperform O2BI or HyperDQN greatly with 200M training budget".
**Answer 2(b)**: First, we believe there is a typo/mistake here: NoisyNet is in fact not a UCB-type algorithm.
Second, it is unclear whether your conjecture holds in the 200M-frame setting, and we currently cannot conduct such experiments to verify it. The reason is that training with 200M frames is expensive for us (and likely for many RL researchers) in both time and money. Specifically, running DQN for 200M frames on a single environment takes about 30 days. Unfortunately, we have a limited number of servers, so we cannot finish all experiments with 200M frames in an acceptable time. Furthermore, the fee for one such experiment is about 30 * 24 * 0.5 = 360\$, since the unit cost (server and power) is 0.5\$ per hour. Hence, the cost for 56 environments is about 20,160\$ for a single algorithm, which we cannot afford at the current stage.
Third, we want to point out that the 20M-frame training budget is commonly used (Lee et al., 2019; Rashid et al., 2020; Bai et al., 2021). The main reason is that **training with 20M frames is sufficient to obtain near-optimal policies for many tasks** in Atari (e.g., Battle Zone and Pong) and SuperMarioBros (e.g., 1-1 and 1-2). As a result, we can compare different algorithms in the 20M-frame regime.
We will address your concern about the comparison with NoisyNet in **Answer 4** below.
**Comment 2(c\)**: Prediction-error-based exploration methods can make progress on Montezuma’s Revenge, while HyperDQN makes no progress on it.
**Answer 2(c\)**: First, we admit that prediction-error-based methods outperform HyperDQN on Montezuma’s Revenge. HyperDQN fails because the extremely sparse reward provides limited feedback for feature selection, which is an important factor as discussed in the introduction. In fact, *almost all randomized exploration methods, such as BootDQN and NoisyNet, cannot perform well on this task for the same reason.* The reason why prediction-error-based methods succeed on this task is explained in (Taiga et al., 2021): these methods can leverage the auxiliary reward for feature selection and exploration. However, such specific designs for hard exploration tasks are *unable* to generalize to other tasks in Atari, as verified in (Taiga et al., 2021).
Second, following this discussion, we would like to further point out that evaluating exploration efficiency requires considering **both** "hard" and "easy" problems, for three reasons. First, we do not want algorithms to "overfit" to specific problem instances, as argued in (Taiga et al., 2021). Second, before solving a task, we do not know whether it is a "hard exploration" or "easy exploration" problem; hence we hope the algorithm performs well on both types. Third, even for the "easy exploration" tasks, we still need more efficient strategies than epsilon-greedy. Note that we are by no means calling for less emphasis on specific hard exploration problems; rather, we should pay attention to the improvement across all tasks as well.
---
**Comment 3**: training curves in Fig 12 in the submission.
**Answer 3**: We have added error bars in the revision. Note that Figures 12 and 13 in the submission correspond to Figures 16 and 17 in the revision, respectively. OB2I is not included because we only have its evaluation logs and do not have its training logs on Atari.
---
**Comment 4**: Comparison with NoisyNet.
**Answer 4**: First, we apologize for missing this method in the submission. We have included NoisyNet in the related work.
Second, **your main claim for rejection** that "after 20M frames... (inferred from the learning curves in Fig 12)..., HyperDQN is much inferior than noisy net" is **not correct**. In particular, Figure 12 in the submission shows the training curves rather than the evaluation curves (which are in Figure 13). As a result, this comparison between our method and NoisyNet is unfair.
Third, we have tried our best to reproduce NoisyNet and provide the numerical results in Figure 22 and Figure 23 in Appendix F.4. In particular, 6 example tasks (beam rider, montezuma's revenge, pong, seaquest, venture, and zaxxon) from the Atari suite and 3 example tasks (1-1, 1-2, and 1-3) from the SuperMarioBros suite are considered with a 20M-frame training budget. The reproduced results basically match the results reported in (Fortunato et al., 2018, Figure 6). The results are summarized in the following tables, which show that HyperDQN outperforms NoisyNet on 5 out of 8 games (there is a tie on Montezuma's Revenge). On Deep Sea, it has been shown that NoisyNet cannot solve problems when the size is larger than 20 (Osband et al., 2018, Figure 9); in contrast, Table 4 in Appendix E.3 shows that BootDQN and HyperDQN can. In addition, we also discuss the algorithmic differences to provide more insights into *what HyperDQN can achieve while NoisyNet cannot* in Appendix F.4. Since we are still running experiments, we will provide the full NoisyNet results in a later revision.
| | beam rider | pong | venture | zaxxon | seaquest |
| -------- | -------- | -------- | -------- | -------- | -------- |
| NoisyNet | $\mathbf{\approx 1,700}$ | $\approx -10$ | $\approx 10$ | $\approx 1,000$ | $\mathbf{\approx 1,600}$ |
| HyperDQN | $\approx 1,500$ | $\mathbf{\approx 21}$ | $\mathbf{\approx 300}$ | $\mathbf{\approx 4,000}$ | $\approx 600$ |

| | montezuma's revenge | Mario-1-1 | Mario-1-2 | Mario-1-3 |
| -------- | -------- | -------- | -------- | -------- |
| NoisyNet | $\mathbf{\approx 0}$ | $\mathbf{\approx 12,000}$ | $\approx 6,000$ | $\approx 1,000$ |
| HyperDQN | $\mathbf{\approx 0}$ | $\approx 9,000$ | $\mathbf{\approx 8,000}$ | $\mathbf{\approx 5,500}$ |
Finally, we hope the above answers address your concerns, and we would greatly appreciate it if you could re-evaluate our paper.
---
Lee, Su Young, et al. "Sample-efficient deep reinforcement learning via episodic backward update." NeurIPS 2019.
Bai, Chenjia, et al. "Principled exploration via optimistic bootstrapping and backward induction." ICML 2021.
Rashid, Tabish, et al. "Optimistic exploration even with a pessimistic initialisation." ICLR 2020.
Fortunato, Meire, et al. "Noisy networks for exploration." ICLR 2018.
Dwaracherla, Vikranth, et al. "Hypermodels for exploration." ICLR 2020.
Taiga, Adrien Ali, et al. "On Bonus-Based Exploration Methods in the Arcade Learning Environment." ICLR 2021.
## Response to Reviewer 3
We highly appreciate your positive assessment of the theoretical results and experimental discussions.
**Comment 1**: Generalizability to other domains.
**Answer 1**: Thanks for this suggestion. We briefly discuss the extensions in the following response.
For continuous control tasks, we can leverage the actor-critic method to design an efficient exploration strategy. A natural extension is to replace the last layer of the critic network with the hypermodel, which allows us to approximate the posterior distribution of the $Q$-value function; see the architecture design in Figure 24 in Appendix F.5. A preliminary result on the hard exploration task Cart Pole is shown in Figure 25 in Appendix F.5. In particular, SAC cannot solve this toy problem while our extension succeeds, because the epistemic uncertainty captured by the hypermodel leads to efficient exploration.
For offline control, a direct extension is to replace the finite ensembles or dropout in existing methods (Yu et al., 2020; Wu et al., 2021) with the hypermodel, since the hypermodel is effective at capturing epistemic uncertainty. We are sorry that we do not have the capacity to investigate this direction at the current stage.
For leveraging human demonstrations to accelerate exploration, we have presented an artificial experiment to illustrate this idea. In particular, Figure 26 in Appendix F.6 shows that an informative prior value function from a pre-trained model improves efficiency considerably. We will consider how to automatically acquire such an informative prior from human demonstrations in future work.
---
**Question 2**: Potential challenges of jointly optimizing the hypermodel and the feature extractor.
**Answer 2**: Thanks for pointing this out. As you note, jointly optimizing the hypermodel and the feature extractor is indeed challenging. In fact, **a direct application of the original hypermodel in (Dwaracherla et al., 2020) to RL tasks is not successful**; see the evidence in Appendix F.2. Importantly, **our architecture is different from the one used in (Dwaracherla et al., 2020)**. Specifically, Dwaracherla et al. (2020) apply the hypermodel to **all** layers of the base model to solve simple bandit tasks (see (Dwaracherla et al., 2020, Figure 1)). In contrast, we extend the hypermodel to the deep RL case by applying it to the **last** layer of the value function.
We comment that the modification (all layers vs. the last layer) is motivated by a **trainability** issue. In particular, **a direct extension of (Dwaracherla et al., 2020) would fail in the deep RL scenario**. The underlying reason is **parameter initialization**. Concretely, with the architecture of (Dwaracherla et al., 2020), the output of the initialized hypermodel is not a good initialization for the parameters of the base model. Modern initialization techniques (such as LeCun's initialization) suggest initializing the parameters of the $i$-th layer by sampling from the Gaussian distribution $\mathcal{N}(0, \sigma^2)$ with $\sigma=1/\sqrt{d_{i-1}}$, where $d_{i-1}$ is the width of the $(i-1)$-th layer. The architecture in (Dwaracherla et al., 2020) cannot achieve this; instead, it implies that the magnitude of the base-model parameters is about $1$ (up to constants). As a result, the input signal amplifies over layers and the gradient explodes.
In fact, **Dwaracherla et al. (2020) only use a two-layer base model with a width of 10 for bandit tasks**, so the trainability issue is not severe in their applications. But **for deep RL, we use deeper and wider neural networks**, so this trainability issue is important, which motivates the simple architecture in our paper. In our model, the hidden layers of the base model are initialized with standard techniques, and the last layer can be properly initialized by normalizing the output of the hypermodel. This addresses the parameter initialization issue. Fortunately, this simple architecture still retains the main ingredient of RLSVI (i.e., capturing the posterior distribution over a linear prediction function).
We apologize that this important message is missing from the submission. We have clarified this point in Remark 1 and Appendix F.2 in the revision. We hope this answer addresses your concern.
---
Wu, Yue, et al. "Uncertainty Weighted Actor-Critic for Offline Reinforcement Learning." ICML 2021.
Yu, Tianhe, et al. "Mopo: Model-based offline policy optimization." NeurIPS 2020.
Dwaracherla, Vikranth, et al. "Hypermodels for exploration." ICLR 2020.
## Response to Reviewer 4
We thank you for your valuable review.
**Question 1**: Insights are not presented in [1].
**Answer 1**: There are two insights in our paper that are not presented in [1].
The first insight is the **theoretical result on posterior approximation** (see Theorem 1). In particular, Dwaracherla et al. (2020) only prove that a linear hypermodel has sufficient representation power to approximate any distribution (over functions), so it is unnecessary to use a non-linear one; this guarantee is unrelated to any optimization process. They do not elaborate on *whether the linear hypermodel can approximate the posterior distribution after optimization*, which is what our Theorem 1 addresses.
The second insight is the challenge of **jointly optimizing the feature extractor and posterior samples**. We are sorry that this point is not carefully discussed in the submission. In fact, a direct application of the original hypermodel in (Dwaracherla et al., 2020) to RL tasks is not successful; see the evidence in Appendix F.2. Importantly, our architecture is different from the one used in (Dwaracherla et al., 2020). Specifically, Dwaracherla et al. (2020) apply the hypermodel to **all** layers of the base model to solve simple bandit tasks (see (Dwaracherla et al., 2020, Figure 1)). In contrast, we extend the hypermodel to the deep RL case by applying it to the **last** layer of the value function.
The modification (all layers vs. the last layer) is motivated by a **trainability** issue. In particular, **a direct extension of (Dwaracherla et al., 2020) would fail in the deep RL scenario**. The underlying reason is **parameter initialization**. Concretely, with the architecture of (Dwaracherla et al., 2020), the output of the initialized hypermodel is not a good initialization for the parameters of the base model. Modern initialization techniques (e.g., LeCun's initialization) suggest initializing the parameters of the $i$-th layer by sampling from the Gaussian distribution $\mathcal{N}(0, \sigma^2)$ with $\sigma=1/\sqrt{d_{i-1}}$, where $d_{i-1}$ is the width of the $(i-1)$-th layer. The architecture in (Dwaracherla et al., 2020) cannot achieve this; instead, it implies that the magnitude of the base-model parameters is about $1$ (up to constants). As a result, the input signal amplifies over layers and the gradient explodes.
---
**Question 2**: BootDQN in [2] or [3]?
**Answer 2**: We use the version in [3], i.e., the version with prior value functions. It is not clear whether [3] uses epsilon-greedy or not. Since the main ingredient of [3] is the prior value function, we keep the epsilon-greedy design. This configuration is also used in [5] (please also refer to the popular implementation on GitHub: https://github.com/johannah/bootstrap_dqn ). In addition, we have provided the results of BootDQN without epsilon-greedy in the submission, which correspond to Figure 14 in the revision. We see that without epsilon-greedy, BootDQN becomes better on SuperMarioBros-1-2 and SuperMarioBros-1-3 but worse on SuperMarioBros-1-1. The improved results are still inferior to HyperDQN.
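For clarity, the prior-value-function ingredient of [3] that we keep can be sketched as follows (a minimal illustration with a hypothetical factory function `make_q` and an illustrative prior scale, not the exact implementation we use): each ensemble member adds a fixed, randomly initialized, and untrained prior network to its trainable Q-network.

```python
import torch
import torch.nn as nn

class PriorQMember(nn.Module):
    """One BootDQN ensemble member with an additive, fixed random prior network."""

    def __init__(self, make_q, prior_scale=3.0):   # prior_scale is illustrative
        super().__init__()
        self.q = make_q()        # trainable Q-network
        self.prior = make_q()    # randomly initialized, never updated
        for p in self.prior.parameters():
            p.requires_grad_(False)
        self.prior_scale = prior_scale

    def forward(self, obs):
        with torch.no_grad():
            prior_q = self.prior(obs)
        return self.q(obs) + self.prior_scale * prior_q

# Example: an ensemble of 10 members sharing the same network factory.
make_q = lambda: nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
ensemble = nn.ModuleList([PriorQMember(make_q) for _ in range(10)])
```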
---
**Question 3**: Comparison of HyperDQN and BootDQN on the deep sea.
**Answer 3**: In Table 4 in Appendix E.3, we have shown that HyperDQN is much more efficient than BootDQN.
---
**Question 4**: Number of ensembles in BootDQN.
**Answer 4**: In this paper, we implement BootDQN with 10 ensemble members, following the configuration in Section 6.1 of [2]. However, in [3], BootDQN is implemented with 20 ensemble members. We remark that the choice of 10 members is commonly used in the previous literature [4, 5] since it is computationally cheaper.
To address your concern, we provide an ablation study of this choice in Figure 8 in the appendix. We do not observe meaningful gains when using 20 ensemble members.
---
**Comment 5**: Writing suggestions.
**Answer 5**: Thanks for your suggestions. We have revised the related parts to make them clear.
---
[4] Rashid, Tabish, et al. "Optimistic exploration even with a pessimistic initialisation." ICLR 2020.
[5] Bai, Chenjia, et al. "Principled exploration via optimistic bootstrapping and backward induction." ICML 2021.