# ICLR2022 Rebuttal

## General Response to Reviewers

We appreciate the valuable comments from all reviewers. Following their suggestions, we have revised our paper substantially; the modified parts are marked in red in the revision. To give a quick overview of the revised paper, we summarize its main content as follows.

- Introduction: illustrate the challenges of extending RLSVI to the deep RL scenario: feature learning and computation burden.
- Methodology: extend the hypermodel from bandits to deep RL.
  - (**New 1**) a suitable architecture that makes the objective trainable.
  - a theoretical result explaining why the (linear) hypermodel is effective at approximating the posterior distribution.
  - an objective function that addresses the issues of extending RLSVI.
- Experiment: numerical results to validate the proposed method.
  - (**New 2**) HyperDQN performs well on both hard exploration and easy exploration tasks on Atari and SuperMarioBros.
  - (**New 3**) Validate that HyperDQN is competitive with another strong baseline, NoisyNet.
  - Explain and validate why commitment is important for HyperDQN.
  - Visualize the multi-step uncertainty and deep exploration of HyperDQN to provide insights.
- Conclusion: it is possible to extend HyperDQN to other domains.
  - (**New 4**) The extension HyperActorCritic (HAC) outperforms SAC on the hard exploration task Cart Pole.
  - Leverage an informative prior to accelerate exploration.

New 1: We explain the architecture difference with the original one in (Dwaracherla et al., 2020). Specifically, we illustrate why the direct extension of (Dwaracherla et al., 2020) fails in the deep RL case. This helps clarify our novelty and contribution compared with (Dwaracherla et al., 2020).

New 2: We visualize the relative improvements on both "hard exploration" and "easy exploration" problems to better understand the advances.

New 3: We add a discussion and empirical comparison with another baseline, NoisyNet, in Appendix F.4.

New 4: We add a preliminary result on the extension to continuous control tasks.

---

Dwaracherla, Vikranth, et al. "Hypermodels for exploration." ICLR 2020.

## Response to Reviewer 1

Thanks for your insightful review.

**Question 1**: The results if the meta-model is a non-linear function.

**Answer 1**: First, when the meta-model is non-linear and the latent variable $z$ is a standard Gaussian random variable, the posterior sample $\theta_{\text{predict}} := f_{\nu}(z)$ does not follow a unimodal Gaussian distribution. As a result, the representation power of the posterior distribution could be enhanced. However, the negative effect is that the training problem becomes harder.

Second, we provide an empirical investigation of this direction in Appendix D.5. From Figure 7 in Appendix D.5, we observe that a non-linear hypermodel has performance similar to the linear hypermodel. In particular, it does not bring obvious gains. We believe the underlying reason is that a non-linear hypermodel is slightly harder to train.
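For concreteness, below is a minimal schematic sketch of how a linear hypermodel $f_{\nu}$ maps a latent $z \sim \mathcal{N}(0, I)$ to last-layer weights $\theta_{\text{predict}} = f_{\nu}(z)$, which define one sampled $Q$-function. This is an illustration rather than the paper's exact implementation, and the names (`HyperLinear`, `feature_dim`, `num_actions`) are placeholders. Replacing the linear map with a small MLP gives the non-linear variant discussed above, at the cost of a harder training problem.

```python
import torch
import torch.nn as nn

class HyperLinear(nn.Module):
    """Illustrative linear hypermodel over the last layer of a Q-network (not the paper's code)."""

    def __init__(self, latent_dim: int, feature_dim: int, num_actions: int):
        super().__init__()
        self.feature_dim = feature_dim
        self.num_actions = num_actions
        # Linear map f_nu: theta = A z + b, so theta stays Gaussian whenever z is.
        self.f_nu = nn.Linear(latent_dim, feature_dim * num_actions)

    def forward(self, features: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
        # features: (batch, feature_dim), i.e. phi(s) from the shared feature extractor.
        # z: (latent_dim,), one latent sample indexing one posterior Q-function.
        theta = self.f_nu(z).view(self.num_actions, self.feature_dim)
        return features @ theta.t()  # (batch, num_actions) Q-values for this sample

head = HyperLinear(latent_dim=8, feature_dim=64, num_actions=6)
phi_s = torch.randn(32, 64)   # stand-in for learned features phi(s)
z = torch.randn(8)            # z ~ N(0, I); resampling z gives a new posterior sample
q_sample = head(phi_s, z)
```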
---

**Question 2(a)**: The choice of $\sigma_{\omega}$ and $\sigma_{p}$ in the objective function.

**Answer 2(a)**: First, the values of these parameters are listed in Table 2 in Appendix D. Second, as stated in Remark 6 in Appendix D.3, we choose these values based on the concern of parameter initialization. Third, to further address your concern, we provide ablation studies of these parameters in Appendix D.5. Results in Figure 6 indicate that our method is not sensitive to the noise scale $\sigma_\omega$, but a large prior scale $\sigma_p$ leads to poor performance since the posterior update is slow compared with the strong prior term.

**Question 2(b)**: The baseline where an agent uses an identical noise term for modifying the $Q_{\text{target}}$.

**Answer 2(b)**: We are sorry that we do not fully understand your question. In our formulation, there is a noise term $\sigma_\omega z^{\top} \xi$ in Equation (4.2) for HyperDQN, where $\xi$ is associated with each sample $(s, a, r, s^\prime)$. Do you mean to use an identical $\xi$ for all $(s, a, r, s^\prime)$ tuples?
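To make the distinction in Answer 2(b) concrete, the snippet below contrasts the two readings: a per-sample $\xi$ (one noise vector per transition, as in the description of Equation (4.2) above) versus a single $\xi$ shared by all transitions. The target computation is only a schematic stand-in for the actual objective, and `noise_dim` and `sigma_w` are illustrative names.

```python
import torch

batch_size, noise_dim, sigma_w = 32, 8, 0.1
z = torch.randn(noise_dim)                   # latent index of the sampled Q-function

# Reading used in the paper's formulation: xi drawn independently for each (s, a, r, s') tuple.
xi_per_sample = torch.randn(batch_size, noise_dim)
noise_per_sample = sigma_w * (xi_per_sample @ z)          # shape (batch_size,)

# Alternative reading: one xi reused for every tuple in the batch.
xi_shared = torch.randn(noise_dim)
noise_shared = sigma_w * (xi_shared @ z) * torch.ones(batch_size)

# Schematic perturbed targets (rewards and bootstrap values are placeholders).
rewards = torch.randn(batch_size)
bootstrap = torch.randn(batch_size)          # e.g. gamma * max_a' Q_target(s', a')
target_per_sample = rewards + bootstrap + noise_per_sample
target_shared = rewards + bootstrap + noise_shared
```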
---

**Question 3**: The performance on "hard exploration" and "easy exploration" tasks.

**Answer 3**: Thanks for this suggestion. In the revision, we show the relative improvements of HyperDQN on "hard exploration" and "easy exploration" problems in Figure 11 and Figure 12 in Appendix E. In particular, we observe that HyperDQN has clear improvements in both "easy exploration" environments (e.g., Battle Zone, Jamesbond, and Pong) and "hard exploration" environments (e.g., Frostbite, Gravitar, and Zaxxon). Unfortunately, HyperDQN does not work on Montezuma's Revenge. In fact, almost all randomized exploration methods (including BootDQN and NoisyNet) cannot perform well on this task. One reason is that the extremely sparse reward provides limited feedback for feature selection, which is crucial for randomized exploration methods, as argued in the introduction. In addition to the Atari suite, SuperMarioBros-1-3-v1 and SuperMarioBros-2-2-v1 are two hard exploration problems due to sparse rewards and long horizons. HyperDQN has clear improvements on these two tasks. We have clarified this point in the revision.

Following this discussion, we would like to further explain why the metric in Figure 3 (i.e., results averaged over 56 tasks) is a good measure of exploration efficiency. First, in practice, **before solving a task, we do not know whether it is a "hard exploration" or "easy exploration" problem**. As a result, we hope an intelligent agent can perform well on both types of problems. Second, we want to highlight that **even for "easy exploration" tasks, we still need more efficient strategies than epsilon-greedy**. Note that we are by no means calling for less emphasis to be placed on specific hard exploration problems. Instead, we need to pay attention to the improvement across all tasks.

---

**Question 4**: Why are the agents only trained for 20M frames?

**Answer 4**: The main reason is that training for 20M frames is sufficient to show exploration efficiency, while training for 200M frames leads to an unacceptable cost for us (and likely for many RL researchers). We elaborate on this claim as follows.

First, we would like to point out that the 20M frames training budget is commonly used in (Lee et al., 2019; Rashid et al., 2020; Bai et al., 2021). The main reason is that training with 20M frames is sufficient to solve many tasks in Atari (e.g., Battle Zone and Pong) and SuperMarioBros (e.g., 1-1 and 1-2). As a result, we can test different algorithms in the regime of 20M frames.

Second, training for 200M frames is expensive for us (and likely for many RL researchers) in time and money. Specifically, running DQN for 200M frames on a single environment takes about **30 days**. Unfortunately, we have limited servers, so we cannot run all experiments with 200M frames in an acceptable time. Furthermore, the experiment fee for one such experiment is about 30 * 24 * 0.5 = 360\$, as the unit cost (of the server and power) for one hour is 0.5\$. Hence, the cost for 56 environments is about **20,160\$** for a single algorithm, which we cannot afford at the current stage.

---

**Question 5**: Why does DoubleDQN in Fig 3 start at a value lower than 0?

**Answer 5**: This is because of the implementation in DQN Zoo. In particular, the evaluation policy of DoubleDQN at the initial stage is not a random policy, so its performance does not match 0. In contrast, other baselines match a random policy due to randomization.

---

**Question 6**: Random seeds and error bars.

**Answer 6**: We use 3 random seeds for Atari and SuperMarioBros (due to limited computation resources) and 5 random seeds for Deep Sea. We have added these details to the main text in the revision.

We have shown error bars for almost all figures. In particular, we do not show the error bar in Figure 3 since the error bar over 56 environments is large, which makes the curves unclear. Note that in (O'Donoghue et al., 2018; Taiga et al., 2021), the error bar is also not shown for this type of figure. Please refer to Figure 16 and Figure 17 for learning curves of each environment, in which we clearly show the error bars.

---

Lee, Su Young, et al. "Sample-efficient deep reinforcement learning via episodic backward update." NeurIPS 2019.

Bai, Chenjia, et al. "Principled exploration via optimistic bootstrapping and backward induction." ICML 2021.

Rashid, Tabish, et al. "Optimistic exploration even with a pessimistic initialisation." ICLR 2020.

Taiga, Adrien Ali, et al. "On bonus-based exploration methods in the Arcade Learning Environment." ICLR 2021.

O'Donoghue, Brendan, et al. "The uncertainty Bellman equation and exploration." ICML 2018.

## Response to Reviewer 2

Thanks for your valuable comments.

**Comment 1**: Novelty and contribution of this work compared with (Dwaracherla et al., 2020).

**Answer**: First, we have to apologize that we did not clarify the architecture difference with (Dwaracherla et al., 2020) in the submission. This leads to your claim that "the method is exactly the same as the one proposed in a prior work". In fact, even though it is inherited from (Dwaracherla et al., 2020), **our method is different from the original one (Dwaracherla et al., 2020)**. More precisely, Dwaracherla et al. (2020) apply the hypermodel to **all** layers of the base model to solve simple bandit tasks (refer to (Dwaracherla et al., 2020, Figure 1)). On the other hand, we extend the hypermodel to the deep RL case by applying the hypermodel to the **last** layer of the value function. We elaborate on this point below; please also refer to Remark 1 in Section 4.1 and Appendix F.2 for a detailed discussion.

Second, the modification (all layers v.s. the last layer) is essential and is motivated by the **trainability** issue. In particular, **the direct extension of (Dwaracherla et al., 2020) would fail under the deep RL scenario**. The underlying reason is the **parameter initialization** issue. Concretely, the output of the initialized hypermodel is not a good initialization for the parameters of the base model when we use the architecture in (Dwaracherla et al., 2020).
In particular, modern initialization techniques (e.g., LeCun's initialization) suggest that we should initialize the parameters of the $i$-th layer by sampling from the Gaussian distribution $\mathcal{N}(0, \sigma^2)$ with $\sigma = 1/\sqrt{d_{i-1}}$, where $d_{i-1}$ is the width of the $(i-1)$-th layer. However, the architecture in (Dwaracherla et al., 2020) cannot achieve this. Instead, the architecture in (Dwaracherla et al., 2020) implies that the magnitude of the parameters of the base model is about $1$ (up to constants). As a result, the input signal amplifies over layers and the gradient explodes when we use the architecture in (Dwaracherla et al., 2020). In fact, **Dwaracherla et al. (2020) only use a two-layer base model with a width of 10 for bandit tasks**, so the trainability issue is not severe in their applications. But **for deep RL, we use deeper and wider neural networks**. Hence, the trainability issue is important, which motivates us to use the simple architecture in our paper. In our model, the hidden layers of the base model are initialized with common techniques, and the last layer can be properly initialized by normalizing the output of the hypermodel. This resolves the parameter initialization issue. Fortunately, this simple architecture still retains the main ingredient of RLSVI (i.e., capturing the posterior distribution over a linear prediction function).

Third, we also want to mention that we have provided **theoretical insight** on **why the hypermodel can approximate the posterior distribution**, a fundamental question along this research direction. Note that this theoretical guarantee is missing in (Dwaracherla et al., 2020). In particular, Dwaracherla et al. (2020) prove that a linear hypermodel has sufficient representation power to approximate any distribution (over functions), so it is unnecessary to use a non-linear one. However, Dwaracherla et al. (2020) do not elaborate on *whether the linear hypermodel can approximate the posterior distribution after optimization*.

We are sorry for the misunderstanding caused by these important messages missing from the submission. We hope the above points clarify our novelty and contribution compared with (Dwaracherla et al., 2020).
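To illustrate the scale argument above, the toy snippet below (a sketch for intuition, not the paper's code) compares the LeCun scale $\sigma = 1/\sqrt{d_{i-1}}$ with weights of magnitude about $1$, and shows how rescaling the hypermodel output could bring the sampled last-layer weights back to the expected scale. The exact normalization we use is described in Remark 1 and Appendix F.2; the names below (`d_prev`, `width`, `A`) are placeholders.

```python
import torch

torch.manual_seed(0)
d_prev, width = 512, 512                 # widths of two consecutive layers
x = torch.randn(d_prev)                  # activations of the previous layer, O(1) entries

# LeCun-style initialization: std = 1/sqrt(d_prev) keeps the pre-activation O(1).
w_lecun = torch.randn(width, d_prev) / d_prev ** 0.5
print((w_lecun @ x).std())               # roughly 1

# Weights of magnitude ~1 (what an unnormalized hypermodel output resembles):
# the pre-activation grows like sqrt(d_prev), and this compounds layer after layer.
w_unit = torch.randn(width, d_prev)
print((w_unit @ x).std())                # roughly sqrt(d_prev) ~ 22.6

# A simple remedy for the last layer: rescale the hypermodel output so that the
# sampled weights have std ~ 1/sqrt(d_prev), matching a freshly initialized layer.
latent_dim = 8
z = torch.randn(latent_dim)
A = torch.randn(width * d_prev, latent_dim)   # stand-in for the linear hypermodel f_nu
theta = (A @ z).view(width, d_prev) / (latent_dim * d_prev) ** 0.5
print((theta @ x).std())                      # roughly 1 again
```

This is only meant to visualize why magnitude-$1$ weights amplify the signal by roughly $\sqrt{d_{i-1}}$ per layer, which is the trainability problem described above.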
---

**Comment 2(a)**: OB2I is not a SOTA exploration method.

**Answer 2(a)**: We understand your concern, and we apologize for such an imprecise argument in the submission. To address your concern, we have revised our claim: OB2I is a SOTA UCB-type exploration method under a 20M frames training budget, and it is not clear whether this method is SOTA with a 200M frames training budget.

**Comment 2(b)**: "UCB-type algorithms, such as noisy net might be able to outperform O2BI or HyperDQN greatly with 200M training budget".

**Answer 2(b)**: First, we think there is a typo/mistake here: NoisyNet is in fact not a UCB-type algorithm. Second, it is unclear whether your conjecture is true or false in the context of 200M frames. Currently, we cannot conduct such experiments to verify it. The reason is that training with 200M frames is expensive for us (and likely for many RL researchers) in time and money. Specifically, running DQN for 200M frames on a single environment takes about 30 days. Unfortunately, we have limited servers, so we cannot run all experiments with 200M frames in an acceptable time. Furthermore, the experiment fee for one such experiment is about 30 * 24 * 0.5 = 360\$, as the unit cost (of the server and power) for one hour is 0.5\$. Hence, the cost for 56 environments is about 20,160\$ for a single algorithm, which we cannot afford at the current stage. Third, we want to argue that the 20M frames training budget is commonly used in (Lee et al., 2019; Rashid et al., 2020; Bai et al., 2021). The main reason is that **training with 20M frames is sufficient to obtain near-optimal policies for many tasks** in Atari (e.g., Battle Zone and Pong) and SuperMarioBros (e.g., 1-1 and 1-2). As a result, we can test different algorithms in the regime of 20M frames. We will address your concern about the comparison with NoisyNet in **Answer 4**.

**Comment 2(c)**: Prediction-error based exploration methods make progress on Montezuma's Revenge, while HyperDQN makes no progress on it.

**Answer 2(c)**: First, we admit that prediction-error based methods outperform HyperDQN on Montezuma's Revenge. The reason HyperDQN fails is that the extremely sparse reward provides limited feedback for feature selection, which is an important factor as discussed in the introduction. In fact, *almost all randomized exploration methods, such as BootDQN and NoisyNet, cannot perform well on this task for the same reason.* The reason why prediction-error based methods succeed on this task is explained in (Taiga et al., 2021): these methods can leverage the auxiliary reward for feature selection and exploration. However, such specific architecture designs for hard exploration tasks are *unable* to generalize to other tasks in Atari, as verified in (Taiga et al., 2021).

Second, following this discussion, we would like to further point out that when evaluating exploration efficiency we need to consider **both** "hard" and "easy" problems, for three reasons. First, we do not want algorithms to "overfit" on specific problem instances, as argued in (Taiga et al., 2021). Second, before solving a task, we do not know whether it is a "hard exploration" or "easy exploration" problem. As a result, we hope the algorithm can perform well on both types of problems. Third, even for "easy exploration" tasks, we still need more efficient strategies than epsilon-greedy. Note that we are by no means calling for less emphasis to be placed on specific hard exploration problems. Instead, we need to pay attention to the improvement across all tasks, too.

---

**Comment 3**: Training curves in Fig 12 in the submission.

**Answer 3**: We have added error bars in the revision. Note that Fig 12 and Fig 13 in the submission correspond to Fig 16 and Fig 17 in the revision, respectively. OB2I is not included because we only have evaluation logs of OB2I and do not have training logs of OB2I on Atari.

---

**Comment 4**: Comparison with NoisyNet.

**Answer 4**: First, we apologize that we missed this method in the submission. We have now included NoisyNet in the related work. Second, **your main rejection claim** that "after 20M frames... (inferred from the learning curves in Fig 12)..., HyperDQN is much inferior than noisy net" is **not correct**. In particular, Figure 12 in the submission shows the training curves rather than the evaluation curves (which are in Figure 13). As a result, this comparison between our method and NoisyNet is unfair. Third, we have tried our best to reproduce NoisyNet and provide the numerical results in Figure 22 and Figure 23 in Appendix F.4.
In particular, 6 example tasks (Beam Rider, Montezuma's Revenge, Pong, Seaquest, Venture, and Zaxxon) from the Atari suite and 3 example tasks (1-1, 1-2, and 1-3) from the SuperMarioBros suite are considered with a 20M frames training budget. The reproduced results basically match the results reported in (Fortunato et al., 2018, Figure 6). The results are summarized in the following tables, which show that HyperDQN outperforms NoisyNet on 5 out of 8 games (there is a tie on Montezuma's Revenge). On Deep Sea, it has been shown that NoisyNet cannot solve problems when the size is larger than 20 (Osband et al., 2018, Figure 9). In contrast, we have shown in Table 4 in Appendix E.3 that BootDQN and HyperDQN can solve problems when the size is larger than 20. In addition, we also include a discussion of the algorithmic differences to provide more insight on *what HyperDQN can achieve while NoisyNet cannot* in Appendix F.4. Since we are still running experiments, we will provide the full results for NoisyNet in a later revision.

|          | Beam Rider               | Pong                  | Venture                | Zaxxon                   | Seaquest                 |
| -------- | ------------------------ | --------------------- | ---------------------- | ------------------------ | ------------------------ |
| NoisyNet | $\mathbf{\approx 1,700}$ | $\approx -10$         | $\approx 10$           | $\approx 1,000$          | $\mathbf{\approx 1,600}$ |
| HyperDQN | $\approx 1,500$          | $\mathbf{\approx 21}$ | $\mathbf{\approx 300}$ | $\mathbf{\approx 4,000}$ | $\approx 600$            |

|          | Montezuma's Revenge  | Mario-1-1                 | Mario-1-2                | Mario-1-3                |
| -------- | -------------------- | ------------------------- | ------------------------ | ------------------------ |
| NoisyNet | $\mathbf{\approx 0}$ | $\mathbf{\approx 12,000}$ | $\approx 6,000$          | $\approx 1,000$          |
| HyperDQN | $\mathbf{\approx 0}$ | $\approx 9,000$           | $\mathbf{\approx 8,000}$ | $\mathbf{\approx 5,500}$ |

Finally, we hope the above answers address your concerns, and we would appreciate it a lot if you could re-evaluate our paper.

---

Lee, Su Young, et al. "Sample-efficient deep reinforcement learning via episodic backward update." NeurIPS 2019.

Bai, Chenjia, et al. "Principled exploration via optimistic bootstrapping and backward induction." ICML 2021.

Rashid, Tabish, et al. "Optimistic exploration even with a pessimistic initialisation." ICLR 2020.

Fortunato, Meire, et al. "Noisy networks for exploration." ICLR 2018.

Dwaracherla, Vikranth, et al. "Hypermodels for exploration." ICLR 2020.

Taiga, Adrien Ali, et al. "On bonus-based exploration methods in the Arcade Learning Environment." ICLR 2021.

## Response to Reviewer 3

We highly appreciate your positive opinion about the valuable theoretical results and experimental discussions.

**Comment 1**: Generalizability to other domains.

**Answer 1**: Thanks for this suggestion. We briefly discuss the extensions in the following response.

For continuous control tasks, we could leverage the actor-critic method to design an efficient exploration strategy. A natural extension is to replace the last layer of a critic network with the hypermodel, which allows us to approximate the posterior distribution of the $Q$-value function; see the architecture design in Figure 24 in Appendix F.5. The preliminary result on the hard exploration task Cart Pole is shown in Figure 25 in Appendix F.5. In particular, SAC cannot solve this toy problem while our extension succeeds, because the epistemic uncertainty measure in the hypermodel leads to efficient exploration.
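As a rough sketch of this extension (for illustration only; not the exact architecture in Figure 24), the critic below takes a state-action pair and uses a linear hypermodel only for its scalar output layer; the class and dimension names are placeholders.

```python
import torch
import torch.nn as nn

class HyperCritic(nn.Module):
    """Illustrative SAC-style critic whose last layer is generated by a linear hypermodel."""

    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 256, latent_dim: int = 8):
        super().__init__()
        # Shared feature trunk over (s, a); initialized with standard techniques.
        self.trunk = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # Linear hypermodel producing the scalar output layer's weights (plus bias).
        self.f_nu = nn.Linear(latent_dim, hidden + 1)

    def forward(self, obs: torch.Tensor, act: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
        feat = self.trunk(torch.cat([obs, act], dim=-1))   # (batch, hidden)
        theta = self.f_nu(z)                               # (hidden + 1,)
        w, b = theta[:-1], theta[-1]
        return feat @ w + b                                # (batch,) sampled Q-values

critic = HyperCritic(obs_dim=4, act_dim=1)
q = critic(torch.randn(16, 4), torch.randn(16, 1), torch.randn(8))
```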
For offline control, a direct extension is to replace the finite ensembles or dropout in existing methods (Yu et al., 2020; Wu et al., 2021) with the hypermodel, since the hypermodel is effective at capturing epistemic uncertainty. We are sorry that we have not been able to investigate this direction at the current stage.

For leveraging human demonstrations to accelerate exploration, we have presented an artificial experiment to illustrate this. In particular, we show in Figure 26 in Appendix F.6 that an informative prior value function from a pre-trained model improves the efficiency a lot. We will consider how to automatically acquire such an informative prior from human demonstrations in the future.

---

**Question 2**: Potential challenges of jointly optimizing the hypermodel and the feature extractor.

**Answer 2**: Thanks for pointing this out. As you can see, it is hard to jointly optimize the hypermodel and the feature extractor. In fact, **a direct application of the original hypermodel in (Dwaracherla et al., 2020) to RL tasks is not successful**; see the evidence in Appendix F.2. Importantly, **our architecture is different from the one used in (Dwaracherla et al., 2020)**. Specifically, Dwaracherla et al. (2020) apply the hypermodel to **all** layers of the base model to solve simple bandit tasks (refer to (Dwaracherla et al., 2020, Figure 1)). On the other hand, we extend the hypermodel to the deep RL case by applying the hypermodel to the **last** layer of the value function.

We comment that the modification (all layers v.s. the last layer) is motivated by the **trainability** issue. In particular, **the direct extension of (Dwaracherla et al., 2020) would fail under the deep RL scenario**. The underlying reason is the **parameter initialization** issue. Concretely, the output of the initialized hypermodel is not a good initialization for the parameters of the base model when we use the architecture in (Dwaracherla et al., 2020). In particular, modern initialization techniques (such as LeCun's initialization) suggest that we should initialize the parameters of the $i$-th layer by sampling from the Gaussian distribution $\mathcal{N}(0, \sigma^2)$ with $\sigma = 1/\sqrt{d_{i-1}}$, where $d_{i-1}$ is the width of the $(i-1)$-th layer. However, the architecture in (Dwaracherla et al., 2020) cannot achieve this. Instead, the architecture in (Dwaracherla et al., 2020) implies that the magnitude of the parameters of the base model is about $1$ (up to constants). As a result, the input signal amplifies over layers and the gradient explodes. In fact, **Dwaracherla et al. (2020) only use a two-layer base model with a width of 10 for bandit tasks**, so the trainability issue is not severe in their applications. But **for deep RL, we use deeper and wider neural networks.** Thus, this trainability issue is important, which motivates us to use the simple architecture in our paper. In our model, the hidden layers of the base model are initialized with common techniques, and the last layer can be properly initialized by normalizing the output of the hypermodel. This addresses the parameter initialization issue. Fortunately, this simple architecture still retains the main ingredient of RLSVI (i.e., capturing the posterior distribution over a linear prediction function).

We apologize that such an important message is missing from the submission. We have clarified this point in Remark 1 and Appendix F.2 in the revision. We hope this answer can address your concern.

---

Wu, Yue, et al. "Uncertainty weighted actor-critic for offline reinforcement learning." ICML 2021.

Yu, Tianhe, et al. "MOPO: Model-based offline policy optimization." NeurIPS 2020.

Dwaracherla, Vikranth, et al. "Hypermodels for exploration." ICLR 2020.

## Response to Reviewer 4

We thank you for your valuable review.

**Question 1**: Insights are not presented in [1].

**Answer 1**: There are two insights in our paper that are not presented in [1].

The first insight is the **theoretical result on the posterior approximation** (see Theorem 1). In particular, Dwaracherla et al. (2020) only prove that a linear hypermodel has sufficient representation power to approximate any distribution (over functions), so it is unnecessary to use a non-linear one. This guarantee is unrelated to any optimization process. However, Dwaracherla et al. (2020) do not elaborate on *whether the linear hypermodel can approximate the posterior distribution after optimization*.

The second insight is the challenge of **jointly optimizing the feature extractor and posterior samples**. We are sorry that this point is not carefully discussed in the submission. In fact, a direct application of the original hypermodel in (Dwaracherla et al., 2020) to RL tasks is not successful; see the evidence in Appendix F.2. Importantly, our architecture is different from the one used in (Dwaracherla et al., 2020). Specifically, Dwaracherla et al. (2020) apply the hypermodel to **all** layers of the base model to solve simple bandit tasks (refer to (Dwaracherla et al., 2020, Figure 1)). On the other hand, we extend the hypermodel to the deep RL case by applying the hypermodel to the **last** layer of the value function. The modification (all layers v.s. the last layer) is motivated by the **trainability** issue. In particular, **the direct extension of (Dwaracherla et al., 2020) would fail under the deep RL scenario**. The underlying reason is the **parameter initialization** issue. Concretely, the output of the initialized hypermodel is not a good initialization for the parameters of the base model when we use the architecture in (Dwaracherla et al., 2020). In particular, modern initialization techniques (e.g., LeCun's initialization) suggest that we should initialize the parameters of the $i$-th layer by sampling from the Gaussian distribution $\mathcal{N}(0, \sigma^2)$ with $\sigma = 1/\sqrt{d_{i-1}}$, where $d_{i-1}$ is the width of the $(i-1)$-th layer. However, the architecture in (Dwaracherla et al., 2020) cannot achieve this. Instead, the architecture in (Dwaracherla et al., 2020) implies that the magnitude of the parameters of the base model is about $1$ (up to constants). As a result, the input signal amplifies over layers and the gradient explodes when we use the architecture in (Dwaracherla et al., 2020).

---

**Question 2**: BootDQN in [2] or [3]?

**Answer 2**: We use the version in [3], that is, the version with prior value functions. It is not clear whether [3] uses epsilon-greedy or not. Since the main ingredient of [3] is the prior value function, we keep the epsilon-greedy design. This configuration is also used in [5] (please also refer to the well-known implementation at GitHub: https://github.com/johannah/bootstrap_dqn). Indeed, we have provided the results of BootDQN without epsilon-greedy in the submission, which corresponds to Figure 14 in the revision. We see that without epsilon-greedy, BootDQN becomes better on SuperMarioBros-1-2 and SuperMarioBros-1-3 but becomes worse on SuperMarioBros-1-1. The improved results are still inferior to HyperDQN.
---

**Question 3**: Comparison of HyperDQN and BootDQN on Deep Sea.

**Answer 3**: In Table 4 in Appendix E.3, we have shown that HyperDQN is much more efficient than BootDQN.

---

**Question 4**: Number of ensembles in BootDQN.

**Answer 4**: In this paper, we implement BootDQN with 10 ensembles, which follows the configuration in Section 6.1 of [2]. However, in [3], BootDQN is implemented with 20 ensembles. We remark that the choice of 10 ensembles is commonly used in the previous literature [4, 5] since it is computationally cheaper. To address your concern, we provide an ablation study of this choice in Figure 8 in Appendix 8. Unfortunately, we do not see meaningful gains when using 20 ensembles.

---

**Comment 5**: Writing suggestions.

**Answer 5**: Thanks for your suggestions. We have revised the related parts to make them clearer.

---

[4] Rashid, Tabish, et al. "Optimistic exploration even with a pessimistic initialisation." ICLR 2020.

[5] Bai, Chenjia, et al. "Principled exploration via optimistic bootstrapping and backward induction." ICML 2021.
