---
title: RL Bible Part I
tags: Templates, Talk
description: View the slide with "Slide Mode".
---
These notes provide an overview of the Barto and Sutton book and are not a substitute for reading it. Some passages of the book are deliberately summarized in more detail than others.
# Overview of Barto & Sutton:

# Part I
### Tabular Solution Methods: Chapters 1, 2, 3, 4
## 1. Introduction
Comparison of Reinforcement Learning (RL) with other computational approaches. The use of value functions distinguishes RL from evolutionary methods. RL is the process of optimizing a reward signal in a sequential decision-making process under uncertainty.
Tabular value function -> exact optimal solution; approximate value function -> approximate solution.
## 2. Multi-armed bandits
Evaluative feedback (RL): depends on the action taken.
Instructive feedback (supervised learning): does not.
This case = a single situation (= 1 state, nonassociative) = simple.
#### 2.1 k-armed Bandit Problem
Value of an arm = expected reward given that its action is selected.
Strong assumptions: stationarity, prior knowledge.
#### 2.2 Action-value Methods
Greedy/Epsilon greedy.
#### 2.3 10-armed testbed
Comparison of greedy vs. ε = 0.1 vs. ε = 0.01.
#### 2.4 Incremental Implementation
How to compute the sample averages of observed rewards incrementally?
New = Old + Step-Size (Target – Old).
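A minimal sketch of this update rule inside an ε-greedy bandit loop (covering 2.2 and 2.4), assuming a hypothetical `pull(a)` function that returns a sampled reward for arm `a`:

```python
import random

def run_bandit(pull, k=10, eps=0.1, steps=1000):
    """Sample-average ε-greedy agent using the incremental update rule."""
    Q = [0.0] * k   # action-value estimates
    N = [0] * k     # number of times each action was selected
    for _ in range(steps):
        if random.random() < eps:
            a = random.randrange(k)                  # explore
        else:
            a = max(range(k), key=lambda i: Q[i])    # exploit (greedy)
        r = pull(a)
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]   # New = Old + StepSize * (Target - Old)
        # For nonstationary problems (2.5), use a constant step size instead:
        # Q[a] += alpha * (r - Q[a])
    return Q
```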
#### 2.5 Tracking a Nonstationary Problem
Nonstationary -> use a constant step-size parameter -> the estimates never fully converge, which is desirable because they keep tracking the latest rewards.
Convergence requires two criteria on the step-size sequence.
The weight given to a reward decays exponentially with its age, so the most recent rewards dominate (exponential recency-weighted average).
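The two step-size conditions and the constant-step-size weighting, as given in the book:

$$\sum_{n=1}^{\infty} \alpha_n(a) = \infty \qquad \text{and} \qquad \sum_{n=1}^{\infty} \alpha_n^2(a) < \infty$$

$$Q_{n+1} = (1-\alpha)^n Q_1 + \sum_{i=1}^{n} \alpha (1-\alpha)^{n-i} R_i$$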
#### 2.6 Optimistic Initial Values
Instead of initializing action values to 0, use optimistic initial action values (e.g., +5 on the 10-armed testbed).
This creates a temporary drive for exploration while the agent stays greedy all along.
#### 2.7 UCB Action Selection
Forces exploration, but directed toward the actions whose value estimates are most uncertain; the selection rule is deterministic.
The formula has a logarithm of t in the numerator, so the exploration bonus grows ever more slowly over time but remains unbounded.
Once again, it deals badly with nonstationarity and is not practical beyond bandits.
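The UCB selection rule:

$$A_t \doteq \arg\max_a \left[ Q_t(a) + c \sqrt{\frac{\ln t}{N_t(a)}} \right]$$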
#### 2.8 Gradient Bandit Algorithms
#### 2.9 Associative Search
Contextual bandits: in-between bandits and full RL problem
#### 2.10 Summary
Balancing exploration & exploitation. ε-greedy -> random, UCB -> deterministic.
See the parameter-study graph in the book.
The Gittins index can do better but does not generalize to the full RL problem. Thompson sampling works well.
Treating the probability distribution as an information state with a horizon of x steps turns the problem into an instance of the full RL problem.
## 3. Finite Markov Decision Processes
MDP: Evaluative feedback + associative aspect (choosing different actions in different situations).
Need to assign credit for long-term consequences to individual action selections.
#### 3.1 The Agent-Environment Interface
MDP and agent interaction -> trajectory S A R S A R …
Two conventions: S(t) A(t) R(t+1) S(t+1) … (used in the book) and S(t) A(t) R(t) S(t+1) …
The four-argument probability p(s', r | s, a) defines the dynamics of the MDP: a probability distribution over the next state and reward for each choice of previous s and a:
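$$p(s', r \mid s, a) \doteq \Pr\{S_t = s', R_t = r \mid S_{t-1} = s, A_{t-1} = a\}$$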

Markov property: prob of each value of s and r only depends on the immediately preceding s and a.
Implies a restriction on the state.
From the dynamics, one can compute (formulas follow the list):
- state-transition probs

- expected rewards for state-action pairs

- expected rewards for state–action–next-state triples
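$$p(s' \mid s, a) = \sum_{r} p(s', r \mid s, a)$$

$$r(s, a) = \sum_{r} r \sum_{s'} p(s', r \mid s, a)$$

$$r(s, a, s') = \sum_{r} r \, \frac{p(s', r \mid s, a)}{p(s' \mid s, a)}$$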

#### 3.2 Goals and Rewards
RL: use of reward signal to formalize the idea of goal.
Rewards = communicate to the agent what you want to achieve, not how.
#### 3.3 Returns and Episodes
Continuing case:
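$$G_t \doteq \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1}, \qquad 0 \le \gamma < 1$$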

Episodic tasks:
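$$G_t \doteq R_{t+1} + R_{t+2} + \cdots + R_T$$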

#### 3.4 Unified Notation for Episodic and Continuing Tasks
With the convention that the episode ends in an absorbing state giving R(t) = 0 for t > T:
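$$G_t \doteq \sum_{k=t+1}^{T} \gamma^{\,k-t-1} R_k, \qquad \text{with } T = \infty \text{ or } \gamma = 1 \text{ allowed (but not both)}$$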

#### 3.5 Policies and Value Functions
The value function compactly summarizes all the long-term possibilities, yet locally at each state!
It enables evaluating a policy.
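The state-value and action-value functions of a policy π:

$$v_\pi(s) \doteq \mathbb{E}_\pi[G_t \mid S_t = s] \qquad\qquad q_\pi(s, a) \doteq \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$$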


Relations between v and q (each is written out after the list):
- V from Q:


- Q from V:


- V from V successors (Bellman equation)


- Q from Q successors (Bellman equation)
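In the same order as above:

$$v_\pi(s) = \sum_a \pi(a \mid s)\, q_\pi(s, a)$$

$$q_\pi(s, a) = \sum_{s', r} p(s', r \mid s, a)\,\big[ r + \gamma v_\pi(s') \big]$$

$$v_\pi(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\,\big[ r + \gamma v_\pi(s') \big]$$

$$q_\pi(s, a) = \sum_{s', r} p(s', r \mid s, a)\,\Big[ r + \gamma \sum_{a'} \pi(a' \mid s')\, q_\pi(s', a') \Big]$$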


Given a policy, the Bellman equation yields a system of linear equations that can be solved for v (or q), with as many equations as there are states.
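A minimal sketch of that direct solve with NumPy, assuming hypothetical tabular arrays `P[s, a, s2]` (transition probabilities), `R[s, a]` (expected rewards) and `pi[s, a]` (policy probabilities); the names are chosen here for illustration, not taken from the book:

```python
import numpy as np

def evaluate_policy_exact(P, R, pi, gamma=0.9):
    """Solve the linear system v = r_pi + gamma * P_pi v for v_pi."""
    S = P.shape[0]
    P_pi = np.einsum("sa,sax->sx", pi, P)   # policy-averaged transitions, (S, S)
    r_pi = np.einsum("sa,sa->s", pi, R)     # policy-averaged rewards, (S,)
    # One equation per state: (I - gamma * P_pi) v = r_pi.
    return np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
```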
#### 3.6 Optimal Policies and Optimal Value Functions
A certain policy defines a certain value function.
Doing well in one state never requires sacrificing value in another -> an optimal policy always exists.
Optimal policy is not unique, but optimal value function is unique.
Brute-force search of optimal policy:
- Calculate the value functions for all possible deterministic policies and take the argmax.




- Rather use the Bellman Optimality equation:
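$$v_*(s) = \max_a \sum_{s', r} p(s', r \mid s, a)\,\big[ r + \gamma v_*(s') \big]$$

$$q_*(s, a) = \sum_{s', r} p(s', r \mid s, a)\,\big[ r + \gamma \max_{a'} q_*(s', a') \big]$$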


The optimal action-value of a state–action pair is the expected return for taking that action and then following the optimal policy from that point onward:
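$$q_*(s, a) = \mathbb{E}\big[ R_{t+1} + \gamma v_*(S_{t+1}) \mid S_t = s, A_t = a \big]$$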

The Bellman optimality equation is a system of non-linear equations, with as many equations as there are states.
Given v*: one-step search using one-step dynamics.
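$$\pi_*(s) = \arg\max_a \sum_{s', r} p(s', r \mid s, a)\,\big[ r + \gamma v_*(s') \big]$$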

Advantages of q*: no one-step search, and no need to know the environment's dynamics:
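$$\pi_*(s) = \arg\max_a q_*(s, a)$$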

#### 3.7 Optimality and Approximation
Tabular methods and approximation methods.
Key property of RL: approximation effort can focus on frequently encountered states, so rarely visited states may end up with bad actions.
#### 3.8 Summary
Complete knowledge vs incomplete knowledge.
## 4. Dynamic Programming
Limited utility in RL because of:
- Need for perfect model of the environment
- Great computational expense
Methods presented later in the book attempt to achieve the same effect but without the limitations above.
Finite MDP assumption.
With continuous states and actions, quantize the state and action spaces and apply finite-state DP (the methods of Chapter 9 extend that approach).
#### 4.1 Policy Evaluation (Prediction)
Compute state-value function for an arbitrary policy.
From the Bellman equation, an update rule applied to all states s:
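$$v_{k+1}(s) \doteq \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\,\big[ r + \gamma v_k(s') \big]$$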

Conditions for convergence to v[pi] are the same as the conditions for the existence and uniqueness of v[pi]:
- Discount rate < 1
- Or termination is guaranteed from all states under pi
Iterative policy evaluation:
- Uses expected updates instead of sample updates
- Two-array version (separate old and new values for each s) vs. in-place version (the order of the value updates matters)
- In-place pseudocode (a sketch follows)
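A minimal in-place sketch in Python, assuming the same hypothetical `P`, `R`, `pi` arrays as in the Chapter 3 example:

```python
import numpy as np

def iterative_policy_evaluation(P, R, pi, gamma=0.9, theta=1e-8):
    """In-place iterative policy evaluation with expected updates."""
    S, A = R.shape
    V = np.zeros(S)
    while True:
        delta = 0.0
        for s in range(S):   # single array: updated values are reused immediately
            v_old = V[s]
            V[s] = sum(pi[s, a] * (R[s, a] + gamma * P[s, a] @ V) for a in range(A))
            delta = max(delta, abs(v_old - V[s]))
        if delta < theta:    # stop when the largest change in a sweep is tiny
            return V
```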

#### 4.2 Policy Improvement

Policy Improvement Theorem:
If a new policy is, in every state, at least as good as the old one according to q_pi, then the new policy is at least as good as the old one overall.
If:
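$$q_\pi(s, \pi'(s)) \ge v_\pi(s) \quad \text{for all } s$$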

Then:
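$$v_{\pi'}(s) \ge v_\pi(s) \quad \text{for all } s$$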

Proof:

Intuitive proof:

Q.E.D
For all s, the new greedy policy is:
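$$\pi'(s) \doteq \arg\max_a q_\pi(s, a) = \arg\max_a \sum_{s', r} p(s', r \mid s, a)\,\big[ r + \gamma v_\pi(s') \big]$$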


If the greedy policy is as good as, but not better than, the old policy, the condition is the same as the Bellman optimality equation, so v_pi must be v*.
Easily extends to stochastic policies (in case of ties, any apportionment of probabilities among the tied actions is permitted).
#### 4.3 Policy Iteration
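Policy iteration alternates evaluation (E) and improvement (I):

$$\pi_0 \xrightarrow{E} v_{\pi_0} \xrightarrow{I} \pi_1 \xrightarrow{E} v_{\pi_1} \xrightarrow{I} \pi_2 \xrightarrow{E} \cdots \xrightarrow{I} \pi_* \xrightarrow{E} v_*$$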

The policy-evaluation step uses a slightly different update formula because the policy is now deterministic: it plugs in the greedily chosen action pi(s) instead of summing over actions.
Moreover, as written it may never terminate if the policy continually switches between two or more equally good policies (argmax ties are broken arbitrarily, so the same value function can give rise to different greedy policies).
#### 4.4 Value Iteration
Combination of policy improvement and truncated policy evaluation.
Do not run policy evaluation to completion at each iteration: just do one sweep through the state set.
For all states s:
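$$v_{k+1}(s) \doteq \max_a \sum_{s', r} p(s', r \mid s, a)\,\big[ r + \gamma v_k(s') \big]$$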

= turning the Bellman optimality equation into an update rule.

#### 4.5 Asynchronous Dynamic Programming
Instead of systematic sweeps through the state set, update whichever states are selected, in any order (discussed further in Chapter 8).
Example: apply updates to states as the agent visits them, which allows focusing on the relevant parts of the state set.
#### 4.6 Generalized Policy Iteration
GPI = all the ways we can interleave policy evaluation and policy improvement.
To sum up: both processes drive the value function and the policy toward the optimal pair v*, pi* (see the diagram in the book).

#### 4.7 Efficiency of Dynamic Programming
Worst-case time for DP methods to find an optimal policy is polynomial in the number of states and actions.
More efficient than direct search, which is exponential in the number of states since the number of deterministic policies is (number of actions)^(number of states).
Moreover, linear programming methods become impractical at a much smaller number of states.
DP can be used today to solve MDPs with millions of states.
Both policy iteration and value iteration are widely used.
In practice they converge much faster than their theoretical worst-case run times, particularly if started with good initial value functions or policies.
Asynchronous DP methods are preferred for problems with large state spaces.
#### 4.8 Summary
Almost all reinforcement learning methods can be viewed as generalized policy iteration (GPI).
GPI is the general idea of 2 interacting processes revolving around an approximate policy and an approximate value function.
Convergence for classical DP methods has been proved but not for all GPI.
The idea of updating estimates of state values based on estimates of successor-state values is called bootstrapping; in contrast, sample-based policy evaluation treats the value of each state as an independent estimation problem (= Monte Carlo).
Chapter 5: do not require a model and do not bootstrap.
Chapter 6: do not require a model but do bootstrap.
DP algorithms are considered PLANNING methods because they use a model to improve the policy instead of trial and error.