---
title: RL Bible Part III
tags: Templates, Talk
description: View the slide with "Slide Mode".
---

These notes provide an overview of the Sutton and Barto book and are not a substitute for reading it. Moreover, some passages of the book are deliberately covered in more detail than others.

# Overview of Sutton & Barto

![](https://i.imgur.com/K6tVXYw.jpg)

# Part III

### Approximate Solution Methods: Chapters 9, 10, 11, 12, 13

## 9. On-Policy Prediction with Approximation

The novelty is that the approximate value function is represented not as a table but as a parameterized functional form with a weight vector $\mathbf{w}$. The approximate function might, for example, be a linear function of features of the state. The weight vector has far fewer components than there are states, so when a single state is updated, the change generalizes from that state to affect the values of many other states. Such generalization makes the learning potentially more powerful but also more difficult to manage and understand. Extending RL to function approximation also makes it applicable to partially observable problems, and the theoretical results carry over to them.

#### 9.1 Value-function Approximation

The updates seen so far can all be written as $s \mapsto u$, where $s$ is the state updated and $u$ is the update target:

- Monte Carlo update: $S_t \mapsto G_t$
- TD(0) update: $S_t \mapsto R_{t+1} + \gamma\,\hat{v}(S_{t+1}, \mathbf{w}_t)$
- n-step TD update: $S_t \mapsto G_{t:t+n}$
- DP policy-evaluation update on an arbitrary state $s$ (not necessarily a state encountered in actual experience): $s \mapsto \mathbb{E}_\pi\big[R_{t+1} + \gamma\,\hat{v}(S_{t+1}, \mathbf{w}_t) \mid S_t = s\big]$

So far, updates have been trivial: the table entry for the state's estimated value is changed while the others are left unchanged. Now, updating at one state generalizes, so that the estimated values of many other states are changed as well.

In Part II of the book, performing an update is equivalent to doing supervised learning (i.e., function approximation): each update is viewed as a conventional training example, and the resulting approximate function is interpreted as an estimated value function. The supervised learning techniques have to be usable online, from incrementally acquired data, while the agent interacts with its environment or with a model of it. In addition, target functions can change over time in RL. For example, in control methods based on GPI we often seek to learn $q_\pi$ while $\pi$ changes; and even if the policy remains the same, the target values of the training examples are nonstationary if they are generated by bootstrapping.

#### 9.2 The Prediction Objective

The error in each state is weighted by a state distribution $\mu$:

$$\overline{VE}(\mathbf{w}) \doteq \sum_{s} \mu(s)\,\big[v_\pi(s) - \hat{v}(s, \mathbf{w})\big]^2$$

$\mu(s)$ is often chosen to be the fraction of time spent in $s$; under on-policy training it is called the on-policy distribution. Our goal is to find $\mathbf{w}^*$ such that $\overline{VE}(\mathbf{w}^*) \le \overline{VE}(\mathbf{w})$ for all possible $\mathbf{w}$. With complex function approximators we can only hope for this locally: $\overline{VE}(\mathbf{w}^*) \le \overline{VE}(\mathbf{w})$ for all $\mathbf{w}$ in some neighborhood of $\mathbf{w}^*$ (we seek convergence to a local optimum).

#### 9.3 Stochastic-gradient and Semi-gradient Methods

SGD methods are among the most widely used function approximation methods and are particularly well suited to online RL. The SGD update at step $t$ is

$$\mathbf{w}_{t+1} \doteq \mathbf{w}_t + \alpha\big[v_\pi(S_t) - \hat{v}(S_t, \mathbf{w}_t)\big]\nabla\hat{v}(S_t, \mathbf{w}_t).$$

In practice, $v_\pi(S_t)$ is not available and is replaced by a target $U_t$:

$$\mathbf{w}_{t+1} \doteq \mathbf{w}_t + \alpha\big[U_t - \hat{v}(S_t, \mathbf{w}_t)\big]\nabla\hat{v}(S_t, \mathbf{w}_t),$$

where $U_t$ is an estimate of the true value to update towards. If $U_t$ is an unbiased estimate, i.e. $\mathbb{E}[U_t \mid S_t = s] = v_\pi(s)$, then $\mathbf{w}_t$ is guaranteed to converge to a local optimum under the usual stochastic approximation conditions (2.7) on the decreasing step size $\alpha$. For example, the Monte Carlo target $U_t = G_t$ is an unbiased estimate by definition. In contrast, the (n-step) TD target and the DP target are biased, because the target itself depends on the current value of the weight vector $\mathbf{w}_t$, whereas the MC target is independent of it. Hence the bootstrapping methods are not instances of true gradient descent and are called semi-gradient methods. State aggregation is a simple form of generalizing function approximation in which a single weight (one component of $\mathbf{w}$) is shared by a whole group of states.
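
To make Sections 9.1–9.3 concrete, here is a minimal sketch (not the book's code) of semi-gradient TD(0) with state aggregation. The random-walk chain, group size, and step size below are assumptions chosen only for illustration.

```python
import numpy as np

# Minimal sketch (not the book's code): semi-gradient TD(0) with state aggregation.
# Assumptions for illustration: a 50-state random-walk chain with terminal rewards
# -1 (falling off the left end) and +1 (right end), 10 states per group.

N_STATES = 50            # non-terminal states 0..49
GROUP_SIZE = 10          # state aggregation: one weight per group of 10 states
N_GROUPS = N_STATES // GROUP_SIZE
ALPHA = 0.05             # step size
GAMMA = 1.0              # undiscounted episodic task

w = np.zeros(N_GROUPS)   # weight vector: far fewer components than states

def v_hat(s, w):
    """Approximate value of s: the single weight shared by s's group."""
    return w[s // GROUP_SIZE]

def grad_v_hat(s):
    """Gradient of v_hat w.r.t. w: a one-hot vector selecting s's group."""
    g = np.zeros(N_GROUPS)
    g[s // GROUP_SIZE] = 1.0
    return g

def env_step(s, rng):
    """Random walk: move one state left or right; episode ends off either edge."""
    s2 = s + rng.choice([-1, 1])
    if s2 < 0:
        return -1.0, None, True
    if s2 >= N_STATES:
        return +1.0, None, True
    return 0.0, s2, False

rng = np.random.default_rng(0)
for episode in range(1000):
    s = N_STATES // 2                        # start in the middle of the chain
    done = False
    while not done:
        r, s2, done = env_step(s, rng)
        target = r if done else r + GAMMA * v_hat(s2, w)  # bootstrapped TD(0) target
        # Semi-gradient update: the target is treated as a constant, so only
        # the gradient of v_hat(S_t, w) appears -- not a true gradient step.
        w += ALPHA * (target - v_hat(s, w)) * grad_v_hat(s)
        if not done:
            s = s2

print("learned group values:", np.round(w, 2))
```

Note how the bootstrapped target is treated as a constant and only $\nabla\hat{v}(S_t,\mathbf{w})$ is taken; this is exactly what makes the method a semi-gradient rather than a true gradient method.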
#### 9.4 Linear Methods
#### 9.5 Feature Construction for Linear Methods
#### 9.6 Selecting Step-Size Parameters Manually
#### 9.7 Nonlinear Function Approximation: Artificial Neural Networks
#### 9.8 Least-Squares TD
#### 9.9 Memory-based Function Approximation
#### 9.10 Kernel-based Function Approximation
#### 9.11 Looking Deeper at On-Policy Learning: Interest and Emphasis
#### 9.12 Summary

## 10. On-Policy Control with Approximation

#### 10.1 Episodic Semi-gradient Control

To form control methods, we need to couple action-value prediction methods with policy-improvement and action-selection techniques. With a small discrete set of actions, we can reuse the control techniques already seen in Part I of the book. The general gradient-descent update for action-value prediction is

$$\mathbf{w}_{t+1} \doteq \mathbf{w}_t + \alpha\big[U_t - \hat{q}(S_t, A_t, \mathbf{w}_t)\big]\nabla\hat{q}(S_t, A_t, \mathbf{w}_t).$$

With the one-step Sarsa target, this looks a lot like Section 6.4:

$$\mathbf{w}_{t+1} \doteq \mathbf{w}_t + \alpha\big[R_{t+1} + \gamma\,\hat{q}(S_{t+1}, A_{t+1}, \mathbf{w}_t) - \hat{q}(S_t, A_t, \mathbf{w}_t)\big]\nabla\hat{q}(S_t, A_t, \mathbf{w}_t)$$

(a minimal code sketch of this update appears at the end of these notes).

Note: optimistic initialization of the weights is possible in the linear case and produces extensive exploration.

#### 10.2 Semi-gradient n-step Sarsa

The agent has to wait n steps before it can start learning; if n is large, the results can be poor on the Mountain Car example. In particular, Monte Carlo methods might never get the car to climb the mountain at all, because they learn only at the end of each episode.

#### 10.3 Average Reward: A New Problem Setting for Continuing Tasks

Alongside the episodic and discounted settings, the average-reward setting is used to replace the discounted setting in continuing tasks. The quality of a policy under this setting is defined as its average reward rate:

$$r(\pi) \doteq \lim_{h \to \infty} \frac{1}{h} \sum_{t=1}^{h} \mathbb{E}\big[R_t \mid S_0, A_{0:t-1} \sim \pi\big] = \sum_s \mu_\pi(s) \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\, r.$$

This relies on the ergodicity hypothesis: the steady-state distribution $\mu_\pi$ exists and does not depend on the starting state. Modified (differential) Bellman equations:

$$v_\pi(s) = \sum_a \pi(a \mid s) \sum_{r, s'} p(s', r \mid s, a)\big[r - r(\pi) + v_\pi(s')\big],$$

$$q_\pi(s, a) = \sum_{r, s'} p(s', r \mid s, a)\Big[r - r(\pi) + \sum_{a'} \pi(a' \mid s')\, q_\pi(s', a')\Big].$$

Differential version of the TD error, with $\bar{R}_t$ an estimate of $r(\pi)$:

$$\delta_t \doteq R_{t+1} - \bar{R}_t + \hat{q}(S_{t+1}, A_{t+1}, \mathbf{w}_t) - \hat{q}(S_t, A_t, \mathbf{w}_t).$$

#### 10.4 Deprecating the Discounted Setting

In the continuing, function-approximation setting, discounting has no effect on the ordering of policies (the discounted objective is proportional to the average reward), so the discount rate becomes a parameter of the algorithm rather than of the problem. More fundamentally, we lose the Section 4.2 policy improvement theorem that was key to our RL control methods.

#### 10.5 Differential Semi-gradient n-step Sarsa
#### 10.6 Summary

## 13. Policy Gradient Methods

#### 13.1 Policy Approximation and its Advantages
#### 13.2 The Policy Gradient Theorem
#### 13.3 REINFORCE: Monte Carlo Policy Gradient
#### 13.4 REINFORCE with Baseline
#### 13.5 Actor-Critic Methods
#### 13.6 Policy Gradient for Continuing Problems
#### 13.7 Policy Parameterization for Continuous Actions
#### 13.8 Summary
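
Sketch referenced from Section 10.1: episodic semi-gradient one-step Sarsa with a linear action-value function and epsilon-greedy action selection. The corridor environment, one-hot features, and hyper-parameters are assumptions made for the example; the book's Section 10.1 uses Mountain Car with tile coding instead.

```python
import numpy as np

# Minimal sketch of episodic semi-gradient one-step Sarsa with linear q_hat
# and epsilon-greedy action selection. The corridor task below is an assumption
# made for the example, not the book's Mountain Car environment.

N_STATES = 10            # corridor 0..9; episode ends on reaching state 9
ACTIONS = [-1, +1]       # move left / move right
ALPHA, GAMMA, EPSILON = 0.1, 1.0, 0.1

def features(s):
    """One-hot state features: the simplest linear parameterization."""
    x = np.zeros(N_STATES)
    x[s] = 1.0
    return x

# One weight vector per action: q_hat(s, a, w) = w[a] . x(s)
w = np.zeros((len(ACTIONS), N_STATES))

def q_hat(s, a):
    return w[a] @ features(s)

def epsilon_greedy(s, rng):
    if rng.random() < EPSILON:
        return int(rng.integers(len(ACTIONS)))
    return int(np.argmax([q_hat(s, a) for a in range(len(ACTIONS))]))

def env_step(s, a):
    """Deterministic corridor: -1 reward per step until the right end is reached."""
    s2 = min(max(s + ACTIONS[a], 0), N_STATES - 1)
    done = (s2 == N_STATES - 1)
    return -1.0, s2, done

rng = np.random.default_rng(0)
for episode in range(500):
    s = 0
    a = epsilon_greedy(s, rng)
    done = False
    while not done:
        r, s2, done = env_step(s, a)
        if done:
            target = r                                    # no bootstrap at the end
        else:
            a2 = epsilon_greedy(s2, rng)
            target = r + GAMMA * q_hat(s2, a2)            # one-step Sarsa target
        # Semi-gradient update: the target is treated as fixed, so only the
        # gradient of q_hat(S_t, A_t, w) (= features(s) for action a) is used.
        w[a] += ALPHA * (target - q_hat(s, a)) * features(s)
        if not done:
            s, a = s2, a2

print("greedy action values along the corridor:")
print(np.round(w.max(axis=0), 1))
```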