# Agent57 Discussion Questions

## Background (NGU and R2D2)

* Intrinsic reward: what does it do, and how is it calculated for each state? Is it:
  * fixed for all states and then scaled to obtain different policies, or
  * manually calculated, or
  * UCB exploration, or
  * none of the above?
* (Sec. 2, para 2) Intrinsic reward: what do they mean by "the per-episode novelty rapidly vanishes over the course of an episode"? (The episodic-novelty sketch in the appendix below tries to illustrate this.)
* Discount factor: what is the effect of changing it? Isn't it usually considered part of the environment?
* (pg. 3) What does this mean: "NGU can be unstable and fail to learn... when the scale and sparseness of [the extrinsic and intrinsic rewards] are both different..."?
* Training process: which $j$ values do the actors use to collect data? How does the learner update the different Q-functions?
* Value function: what does the notation $Q^*_{r_j}$ mean? Why is it indexed by $r_j$?
* How does the agent choose an action at each time step using the $N$ policies?

## Agent57 part 1 (State-Action Value Function Parameterization)

* Where do the $\epsilon_l$ come from?
* Transformed Retrace loss function (Munos et al., 2016)
* Transformed Bellman operator (Pohlen et al., 2018); see the sketch of the value transform in the appendix below
* Are both of these ultimately not used anyway?

## Agent57 part 2 (Adaptive Exploration over a Family of Policies)

* What is the meta-controller, and how does it work? (See the bandit sketch in the appendix below.)
* Are the different arm settings $(\beta_j, \gamma_j)$ static or dynamic?
* Review: what is the difference between a "non-stationary MAB (running independently on each actor)" and a "global meta-controller"?
* Sliding-window UCB?

## Experiments

* Do we really need 32 different exploration schemes? Fig. 8 seems to indicate the chosen arm is usually the highest, the lowest, or a middle one. As an ablation study, how does performance look with $N=3$ or $N=4$? What is the overhead, in memory and computation, of increasing $N$?

## Overview (broad questions)

* How important is it to switch between different exploration schemes in the middle of learning, versus "guessing the single best scheme" before the start and sticking with it? The paper discusses switching from short-horizon exploration early on to long-horizon exploration later, and that makes intuitive sense, but can we see some experimental data on this?
* Would a single game (e.g., Go) still profit from multi-scale exploration?
* Try Agent57 on classic board games? Supposedly Agent57 does a better job with long-term credit assignment, so could it play classic board games as well as MuZero plays Atari?
* How could we incorporate search into Agent57? What are the implications?
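
## Appendix: reference sketches (not from the paper's code)

For the intrinsic-reward questions above: below is a minimal sketch of how I read NGU's per-episode novelty term. The constants, class/variable names, and the running-average update are my own assumptions, and I omit details such as distance clipping and the life-long (RND) modulator. The point is that the episodic memory fills up with similar embeddings as an episode proceeds, so revisiting a state yields a smaller and smaller reward, which is presumably what "per-episode novelty rapidly vanishes" refers to. Each policy $j$ then optimizes the mixed reward $r^j_t = r^e_t + \beta_j r^i_t$.

```python
import numpy as np

# All constants and names here are illustrative assumptions, not the paper's values.
K_NEIGHBOURS = 10   # number of nearest neighbours used for the pseudo-count
KERNEL_EPS = 1e-3   # inverse-kernel constant
C = 1e-3            # small constant in the denominator


class EpisodicNovelty:
    """Per-episode novelty reward ~ 1 / sqrt(pseudo-count of the current embedding)."""

    def __init__(self):
        self.memory = []          # controllable-state embeddings seen this episode
        self.mean_sq_dist = 1.0   # running mean of k-NN squared distances

    def reward(self, embedding: np.ndarray) -> float:
        if not self.memory:
            self.memory.append(embedding)
            return 1.0
        # squared distances to the k nearest embeddings stored this episode
        d2 = np.sort([float(np.sum((embedding - m) ** 2)) for m in self.memory])
        d2 = d2[:K_NEIGHBOURS]
        self.mean_sq_dist = 0.99 * self.mean_sq_dist + 0.01 * float(np.mean(d2))
        d2_norm = d2 / max(self.mean_sq_dist, 1e-8)
        # inverse kernel: near-duplicate states contribute a large pseudo-count
        kernel = KERNEL_EPS / (d2_norm + KERNEL_EPS)
        self.memory.append(embedding)
        # as the episode memory fills with similar embeddings, kernel.sum() grows
        # and the reward shrinks -- the "vanishing" per-episode novelty
        return 1.0 / (np.sqrt(kernel.sum()) + C)
```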
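
For the "transformed Bellman operator (Pohlen et al., 2018)" bullet: the transform itself is just a squashing function $h$ applied to value targets; the Q-network is learned in $h$-space and $h^{-1}$ is applied before bootstrapping. A sketch, where the value of $\epsilon$ is a hyperparameter and should be treated as an assumption:

```python
import numpy as np

EPS = 1e-2  # small regulariser; treat the exact value as an assumption


def h(z):
    """Value transform: h(z) = sign(z) * (sqrt(|z| + 1) - 1) + eps * z."""
    return np.sign(z) * (np.sqrt(np.abs(z) + 1.0) - 1.0) + EPS * z


def h_inv(z):
    """Closed-form inverse of h; sanity check: h_inv(h(x)) ~= x."""
    return np.sign(z) * (
        ((np.sqrt(1.0 + 4.0 * EPS * (np.abs(z) + 1.0 + EPS)) - 1.0) / (2.0 * EPS)) ** 2
        - 1.0
    )


# A transformed one-step target would then look like
#   target = h(r + gamma * h_inv(Q_target(x_next, argmax_a Q(x_next, a))))
```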
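
For the meta-controller / sliding-window UCB questions: as I understand it, each actor runs a small non-stationary bandit that picks an arm $j$ (i.e. a $(\beta_j, \gamma_j)$ pair) at the start of every episode, using the episode's undiscounted extrinsic return as the bandit reward. Below is a minimal sketch; the window size, bonus coefficient, and the $\epsilon$-greedy wrapper are assumptions on my part, not the paper's exact hyperparameters.

```python
import random
from collections import deque
from math import log, sqrt


class SlidingWindowUCB:
    """Non-stationary bandit: UCB over only the most recent episodes."""

    def __init__(self, num_arms: int, window: int = 160,
                 bonus: float = 1.0, eps: float = 0.5):
        self.num_arms = num_arms
        self.bonus = bonus                    # exploration-bonus coefficient (assumed)
        self.eps = eps                        # chance of a uniformly random arm (assumed)
        self.history = deque(maxlen=window)   # (arm, episode_return), oldest dropped first

    def select_arm(self) -> int:
        counts = [0] * self.num_arms
        sums = [0.0] * self.num_arms
        for arm, ret in self.history:
            counts[arm] += 1
            sums[arm] += ret
        # play every arm once first, then be epsilon-greedy over the UCB scores
        for arm in range(self.num_arms):
            if counts[arm] == 0:
                return arm
        if random.random() < self.eps:
            return random.randrange(self.num_arms)
        total = len(self.history)
        ucb = [sums[a] / counts[a] + self.bonus * sqrt(log(total) / counts[a])
               for a in range(self.num_arms)]
        return max(range(self.num_arms), key=lambda a: ucb[a])

    def update(self, arm: int, episode_return: float) -> None:
        # bandit reward = undiscounted extrinsic return of the finished episode
        self.history.append((arm, episode_return))
```

The "non-stationary MAB per actor" vs. "global meta-controller" question then comes down to whether `history` is kept local to each actor or shared across all of them.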