# ICLR Rebuttals

## Rev 3

### Part 1

Thank you for the time you've invested in reviewing our work -- we are glad that you find our approach novel, and are very grateful for your suggestions! The following is a first response to your comments; our manuscript will be updated in the coming days to reflect your suggestions, and we will notify you once the changes are in place.

> the paper is hard to understand without a background in physics [...] There is a lot of jargon from physics that will not be familiar to the typical audience at ICLR

In hindsight, we agree with your assessment that some aspects of the paper might be difficult to understand without a background in physics, and part of this indeed has to do with the jargon. We will try to address this where possible. That said, we view our work as an attempt to approach problems in RL through the lens of ideas and concepts that are already well understood in statistical physics. For instance, the connection between the variational Fokker-Planck formalism and reinforcement learning is largely unexplored (the only relevant reference we found is [1]). Likewise, the notion of dissipativity has received little attention in modern reinforcement learning, although it has been well studied in the context of dynamical systems. We hope that our empirical results (especially "Comparison with the Free-Energy Functional", p. 8, and "[...] the Importance of Dissipativity", p. 7) inspire more theoretical research in this direction. Indeed, unsupervised learning and deep learning theory have greatly benefited from such interdisciplinary endeavours (e.g. [2], energy-based models, [3], and more), and we wish the same for reinforcement learning.

> In your objective, you use an expectation over the timestep in the trajectory. Why not instead take an average over all the timesteps in the trajectory? This should be equivalent.

The two are indeed equivalent -- sampling the timestep uniformly at random and taking the expectation yields the same quantity as averaging over all timesteps -- so this is purely a matter of notation.

> Similarly, in the algorithm, why do you sample from the dataset? It would likely be better to randomly shuffle the dataset (at the timestep level) once, and then iterate through the dataset computing gradient updates.

It is certainly possible to partition the training into epochs, as you suggest. We will investigate whether this leads to gains in performance.

> When scaling the algorithm up to larger environments, you will likely run into the problem that uniform policies are often very bad at exploring the state space.

The issue you mention is exceedingly important, and doing it justice would require us to deviate significantly from the primary objective of this work. We have tried to be upfront about it (cf. top of p. 4: "The price we pay is the lack of adequate exploration in complex enough environments [...]" and footnote 3) and intend to pursue it in future work. One promising approach could be based on off-policy methods [10], which apply to cases where the behaviour policy differs from the (evaluation) policy of interest (importance sampling is one simple example). In our case, the evaluation policy can remain random, whereas the behaviour policy is exploratory. Moreover, it is worth noting that the choice of an offline "base policy" is a recurring theme in model-based (and related) methods [9], and a random policy is a widespread choice [4-6, 11-13]. Some works use a mixture of a pre-trained and a random policy [7, 8], while others (e.g. [5]) expose and tackle the issue explicitly.
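To make the importance-sampling variant of this idea concrete, here is a minimal sketch (the function names and the toy set-up below are purely illustrative, and a practical implementation would likely prefer a more stable off-policy corrector such as Retrace [10]). Transitions gathered under an exploratory behaviour policy are re-weighted so that the loss remains an estimate of its expectation under the uniform evaluation policy:

```python
import numpy as np

rng = np.random.default_rng(0)
num_actions = 4

def pi_eval(action, state):
    # Uniform evaluation policy: every action equally likely, independent of the state.
    return 1.0 / num_actions

def pi_behaviour(action, state):
    # Hypothetical exploratory behaviour policy (here a fixed, skewed
    # distribution; in practice this could be a learning agent).
    probs = np.array([0.55, 0.15, 0.15, 0.15])
    return probs[action]

def per_transition_loss(h, s, s_next):
    # Placeholder for the h-potential training loss on one transition.
    return -(h(s_next) - h(s))

def off_policy_objective(h, transitions):
    """Importance-weighted estimate of the uniform-policy objective from
    transitions (s, a, s_next) collected by the behaviour policy."""
    weights = np.array([pi_eval(a, s) / pi_behaviour(a, s) for s, a, _ in transitions])
    losses = np.array([per_transition_loss(h, s, s_next) for s, _, s_next in transitions])
    # Self-normalised importance sampling (slightly biased, lower variance).
    return np.sum(weights * losses) / np.sum(weights)

# Toy usage: scalar "states", a linear candidate h, random transitions.
h = lambda s: 0.5 * s
transitions = [(s, int(rng.integers(num_actions)), s + rng.normal())
               for s in rng.normal(size=32)]
print(off_policy_objective(h, transitions))
```

Self-normalising the weights trades a small bias for lower variance; clipping the weights is another common choice.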
(Continued in next post.)

[1] Richemond and Maginnis 2017, "On Wasserstein Reinforcement Learning and the Fokker Planck equation." https://www.semanticscholar.org/paper/On-Wasserstein-Reinforcement-Learning-and-the-Richemond-Maginnis/a6c3763895b6c0172c6a9d8375bdf452ef22e0a0
[2] Sohl-Dickstein et al. 2015, "Deep Unsupervised Learning using Nonequilibrium Thermodynamics." https://arxiv.org/abs/1503.03585
[3] Poole et al. 2016, "Exponential expressivity in deep neural networks through transient chaos." https://papers.nips.cc/paper/6322-exponential-expressivity-in-deep-neural-networks-through-transient-chaos
[4] Savinov et al. 2018, "Semi-parametric Topological Memory for Navigation." https://arxiv.org/abs/1803.00653
[5] Ha & Schmidhuber 2018, "World Models." https://arxiv.org/abs/1803.10122
[6] Savinov et al. 2018, "Episodic Curiosity Through Reachability." https://arxiv.org/abs/1810.02274
[7] Oh et al. 2015, "Action-Conditional Video Prediction using Deep Networks in Atari Games." https://arxiv.org/pdf/1507.08750.pdf
[8] Chiappa et al. 2017, "Recurrent Environment Simulators." https://arxiv.org/abs/1704.02254
[9] Lecture notes on model-based RL: http://rail.eecs.berkeley.edu/deeprlcourse-fa17/f17docs/lecture_9_model_based_rl.pdf
[10] Munos et al. 2016, "Safe and Efficient Off-Policy Reinforcement Learning." https://arxiv.org/abs/1606.02647
[11] Nagabandi et al. 2017, "Learning Image-Conditioned Dynamics Models for Control of Under-actuated Legged Millirobots." https://arxiv.org/abs/1711.05253
[12] Kulkarni et al. 2019, "Unsupervised Learning of Object Keypoints for Perception and Control." https://arxiv.org/abs/1906.11883
[13] Anand et al. 2019, "Unsupervised State Representation Learning in Atari." https://arxiv.org/abs/1906.08226

### Part 2

(Continuation of the previous post.)

> For example, can you use the learned arrow of time to solve the environments in (Krakovna et al), and how does it compare to relative reachability and its many variants?

Thank you for the suggestion -- we will investigate this and get back to you. In the meantime, we remark that Krakovna et al. 2018 [11] propose to compare the reachability of all other states from a given state to that of a counterfactual "baseline state" that the system would be in had the agent been inactive. However, finding this baseline state requires counterfactual reasoning, which in turn assumes that a forward model of the environment dynamics is available. This is very unlike our method, which makes no such assumption (cf. p. 5 of our manuscript: "In contrast, Krakovna et al. (2018) propose [...]").

> Similarly, how does your intrinsic reward compare to existing exploration methods (of which there are many, but consider count-based methods, curiosity (Pathak et al), random network distillation (Burda et al))

By our assessment, the strengths and shortcomings of our intrinsic reward differ from those of existing techniques. For instance, curiosity methods that rely on the forward prediction error of a model are known to be susceptible to the so-called noisy-TV problem [12], where a local source of (unpredictable) stochasticity leads to a large prediction error and therefore a large reward. On the other hand, these methods are robust against temporally correlated sources of noise, because such noise can be predicted away by a powerful enough model. Meanwhile, our results in Fig 9 (App. C.1.1) show that the h-potential has the opposite tendency -- it is rather robust to uncorrelated sources of noise, but might get distracted by temporally correlated noise.
This hints at potential synergies between our method and the various others in the literature!

> Consider for example the Overcooked environment [...]

Thanks for the pointer -- this is fascinating! We will respond to your suggestion in an upcoming post.

[11] Krakovna et al. 2018, "Penalizing side effects using stepwise relative reachability." https://arxiv.org/abs/1806.01186
[12] Burda et al. 2018, "Exploration by Random Network Distillation." https://arxiv.org/abs/1810.12894

## Rev 2

### Part 1

Thank you for your review -- we are glad that you enjoyed reading our paper and found many of our ideas thought-provoking.

> I would expect to see comparison to a simple model-based method [...]

To recapitulate, we propose a method to quantify irreversibility in MDPs without having to learn the full dynamics model of the MDP in question. There is indeed no question that a well-trained dynamics model will contain much more information about the environment than the h-potential, including (potentially implicit) information about the reversibility of state transitions. However, if one is only interested in quantifying reversibility, our experiments show that using our method is sufficient (though not necessary).

> [...] for 7x7 2D world [...] I would expect a simple statistical estimator over state transitions (using the same samples as the h-potential method) to be able to capture this as well.

It would certainly be possible to hand-craft an estimator that captures the irreversibility of state transitions; depending on the environment, this might or might not be a challenging endeavour. However, our goal is to _learn_ a model that captures irreversibility. The experiment on the 2D World with Vases (Fig 2 and App C.1.1) therefore serves the following purposes:

1. It reassures us and our audience that the h-potential indeed learns to count the number of broken vases (which is precisely the quantity one would track when designing a statistical estimator by hand).
2. It serves as a sandbox for understanding the strengths and limitations of our method -- for instance, we find that our method is fairly robust against random uncorrelated noise, but can be distracted by time-correlated noise.
3. It exposes the expected trade-off between safety and efficiency (Fig 12): while the baseline DDQN agent trained without the h-potential outperforms the safe agent trained with the h-potential in terms of the probability of reaching the goal, the baseline agent breaks a larger number of vases (i.e. is less safe) in achieving its goal.

> Sokoban seems to be that the h-potential has detected side-effects of actions; again, why can’t a model estimator learn this?

We would appreciate it if you could clarify what "learning a model estimator" means in this context. Please correct us if we are wrong, but we interpret your question as either "Why not learn a model to directly predict side-effects?" or "Why not directly predict whether a transition is irreversible?". For the former: we often do not have access to ground-truth labels for when a transition induces side-effects; but if we did, it might indeed make sense to use them directly. For the latter: our method is indeed based on predicting whether a transition is irreversible, but with a crucial inductive bias: the corresponding predictor must be a difference of two functions (akin to a siamese neural network). Please refer to p. 4 ("Instead, our proposal is to derive [...]" et seq.) for a discussion.
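To illustrate what we mean by this inductive bias, here is a minimal sketch (the tiny network, the readout, and all names below are placeholders chosen for brevity, not the exact architecture or loss used in the paper). A single function $h$ is applied to both endpoints of a transition, and only the difference $h(s') - h(s)$ enters the prediction:

```python
import numpy as np

rng = np.random.default_rng(0)
state_dim, hidden = 8, 32

# A stand-in network h_theta: R^d -> R (one hidden layer).
params = {
    "W1": rng.normal(scale=0.1, size=(state_dim, hidden)),
    "b1": np.zeros(hidden),
    "w2": rng.normal(scale=0.1, size=hidden),
}

def h(state, p=params):
    # The learned potential: a single scalar per state.
    return np.tanh(state @ p["W1"] + p["b1"]) @ p["w2"]

def transition_score(s, s_next):
    # Siamese-style scoring: the SAME h is applied to both states, and only
    # the difference enters the prediction. Whatever training signal is used
    # on this score, the model is forced to factor it through a state
    # potential h, which can then be read out on its own.
    return h(s_next) - h(s)

def irreversibility_readout(s, s_next):
    # One possible (placeholder) readout: squash the difference into (0, 1).
    return 1.0 / (1.0 + np.exp(-transition_score(s, s_next)))

s, s_next = rng.normal(size=state_dim), rng.normal(size=state_dim)
print(transition_score(s, s_next), irreversibility_readout(s, s_next))
```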
> As a more minor concern, the fact that the method uses a batch of uniformly random state transitions (as per Sec.4), rather than randomly sampled trajectories is a definite concerned with respect to real-world application.

This might be a misunderstanding -- by a uniformly random policy, we mean that the actions are sampled uniformly at random, with no action preferred over another. This is straightforward to implement for the environments we consider. We will clarify this in the next update.

### Part 2

This post addresses your minor comments.

> Top of p.3: Can you give some intuition for h(), e.g. relation to entropy over trajectory.

Certainly. Consider the random variable $S_t$ corresponding to the state of the system after $t$ time-steps (more precisely, $S_t$ is a stochastic process). The function $h$ is such that the expectation of $h(S_t)$ over $S_t$ increases with increasing $t$, making it an "arrow of time" (or, in technical terms, a Lyapunov functional of the dynamics). Note that this does not mean that $h(s_t)$ must always increase along individual samples $s_t$ of the random variable $S_t$. For instance, observe in Fig 3 that $h$ can decrease with time (around $t = 75$); but $h(S_t)$ must increase in expectation over all trajectories, as seen in Fig 16 (Appendix). (A small numerical sketch of this distinction is included at the end of this post.) The empirical analogy with thermodynamic entropy arises via the explicit comparison with the so-called free-energy functional of a random walk (cf. "Comparison with the Free-Energy Functional"): a well-known result in the variational Fokker-Planck literature is that, under the right assumptions, the free energy (which comprises an energy term and a negative entropy term) must monotonically decrease with time.

> Bottom of p.3: you use a random policy to sample trajectories. Is this simple to implement? Can you just sample random actions at each state, or do you need to sample over the space of all trajectories?

We sample trajectories with a policy that selects an action at random (without consulting the state). Sampling uniformly from the set of all trajectories would indeed be highly non-trivial, but it is not what we do (we justify this on p. 3, "As a compromise, we use [...]"). We will clarify this in the update.

> Footnote 2, p.4. It would be interesting to expand on this point.

In inverse reinforcement learning (IRL), the goal is to learn the reward function that is optimized by an expert agent, in the hope that a policy trained on this learned reward will mimic the expert and thereby solve the task. Our proposal can be thought of as replacing the expert agent with a "dumb" one: accordingly, we ask "what reward function does a random policy maximize?". In doing so, we capture not the peculiarities of the expert agent, but only those of the underlying environment. In informal terms, we ask: "what does the environment want to do if left to its own devices?" The resulting function is analogous to the h-potential (up to minor technicalities).
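As promised above, here is a small numerical sketch of the distinction between pointwise and statistical monotonicity (the random walk and the choice of $h$ are toy stand-ins, not the learned h-potential): along a single trajectory $h(s_t)$ fluctuates, but the Monte-Carlo estimate of $\mathbb{E}[h(S_t)]$ increases with $t$.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_trajectory(T):
    # Toy dynamics: a symmetric random walk started at 0.
    steps = rng.choice([-1, +1], size=T)
    return np.concatenate([[0], np.cumsum(steps)])

def h(s):
    # A valid "arrow of time" for this toy process: E[h(S_t)] = t, even
    # though h along any single trajectory goes both up and down.
    return s ** 2

T, n_traj = 100, 5000
h_vals = np.stack([h(sample_trajectory(T)) for _ in range(n_traj)])  # (n_traj, T+1)

mean_h = h_vals.mean(axis=0)  # Monte-Carlo estimate of E[h(S_t)]
print("first increments of E[h(S_t)] (~ +1 each):", np.round(np.diff(mean_h)[:5], 2))
print("h decreases somewhere along trajectory 0:", bool((np.diff(h_vals[0]) < 0).any()))
```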
We hope to have addressed your concerns -- please let us know if not.

## Rev 1

Thank you for your review -- we are elated that you enjoyed reading our work!

Regarding questions (1) and (2): the use of a non-random policy for gathering trajectories is perhaps the most important issue we do not address in this work, since doing it full justice would detract significantly from the original goal of the paper. Nevertheless, one approach that we see as fruitful is off-policy learning [1], which applies to cases where the behaviour policy used to gather trajectories differs from the evaluation policy of interest. In our case, the behaviour policy could be any combination of an exploratory policy and a learning agent, whereas the evaluation policy remains random (in order not to bias the h-potential). Importance sampling is a simple example of this scheme, but more sophisticated methods exist. Moreover, it is worth remarking that the adoption of random rollouts is rather widespread in the model-based (and related) literature [3-8].

Regarding question (3): consider a stochastic process $X_t$, i.e. a (time-indexed) sequence of random variables. We say that a (deterministic) function $h$ is statistically monotonically increasing if the function $H(t) = \mathbb{E}\left[h(X_t) \mid h(X_{t-1}), \ldots, h(X_1)\right]$ is (deterministically) monotonically increasing in $t$. In other words, only the expectation of $h$ (with respect to its random argument) must increase with time, not $h$ itself. In technical jargon, $h(X_t)$ is sometimes called a submartingale [2].

All that said, we agree with your interpretation of our work as "learning the state transition reversibility", with one crucial detail -- namely, that the learner has a specific architecture resembling a siamese network, i.e. its output is the difference of a single function applied to its two inputs. This function turns out to be the h-potential; cf. p. 4 of our manuscript ("Instead, our proposal is to derive [...]" et seq.) for a discussion of what we gain by using this specific architecture.

We hope to have answered your questions -- please let us know if not!

[1] Munos et al. 2016, "Safe and Efficient Off-Policy Reinforcement Learning." https://arxiv.org/abs/1606.02647
[2] https://en.wikipedia.org/wiki/Martingale_(probability_theory)#Submartingales,_supermartingales,_and_relationship_to_harmonic_functions
[3] Savinov et al. 2018, "Semi-parametric Topological Memory for Navigation." https://arxiv.org/abs/1803.00653
[4] Ha & Schmidhuber 2018, "World Models." https://arxiv.org/abs/1803.10122
[5] Savinov et al. 2018, "Episodic Curiosity Through Reachability." https://arxiv.org/abs/1810.02274
[6] Nagabandi et al. 2017, "Learning Image-Conditioned Dynamics Models for Control of Under-actuated Legged Millirobots." https://arxiv.org/abs/1711.05253
[7] Kulkarni et al. 2019, "Unsupervised Learning of Object Keypoints for Perception and Control." https://arxiv.org/abs/1906.11883
[8] Anand et al. 2019, "Unsupervised State Representation Learning in Atari." https://arxiv.org/abs/1906.08226