---
tags: Master
---

# Reinforcement Learning

## Introductory Concepts

"The idea that we learn by interacting with our environment is probably the first to occur to us when we think about the nature of learning." Interaction is a major source of knowledge about our environment and about ourselves. Reinforcement Learning (RL) is a computational approach to learning from interaction; it is goal-directed learning (unlike other machine learning paradigms such as supervised learning).

The loop in RL:

![](https://i.imgur.com/kKDKALU.png)
![](https://i.imgur.com/NUwRkWh.png)

Some RL problems can be structured simply as a grid world. RL learns by mapping situations (states) to actions and by maximizing a numerical reward signal given by the environment.

There are three central characteristics:
1. Closed-loop problems: the learner's actions influence its later observations.
2. No direct instructions: the learner is not told which actions to take; it must discover them by trying them.
3. No initial knowledge: consequences of actions play out over an extended period of time.

The agent must be able to learn from its own experience! There is no $y$ or labels.

#### The Exploration-Exploitation Dilemma

This concerns the trade-off between exploration and exploitation. Both cannot happen at the same time, and the dilemma is that neither exploration nor exploitation can be pursued exclusively without failing at the task. You explore a little now, then you exploit a little later. There are various methods that tackle this dilemma.

Exploitation: favour actions that have been highly rewarding in the past.
Exploration: to discover such actions in the first place, the agent has to explore the environment and look for new actions with potentially higher reward.

![](https://i.imgur.com/NVLV8QH.png)

**Formal definition of agent and environment**: the agent is the entity that learns and makes decisions, while the environment is what the agent interacts with - everything outside the agent. The agent decides on actions; the environment responds to these actions and presents new situations to the agent. The environment also gives rewards, numerical values that the agent tries to maximize over time. The agent might end up in a different position than the one it intended to go to: the environment is stochastic to a certain degree.

All RL problems have in common: they involve interaction between an agent and its environment; the agent seeks to achieve a goal despite uncertainty about its environment; the agent's actions affect the future state of the environment, in turn affecting future options and opportunities; correct choices require taking into account indirect, delayed consequences of actions (foresight and planning); the effects of actions cannot be fully predicted, so the agent has to monitor its environment frequently; the agent can use its experience to improve its performance; interaction with the environment is essential for adjusting behavior.

They all consist of four elements:
* **Policy**
  <details close> <summary> <mark>The policy is the core of an RL agent: it is sufficient to determine behavior. Deterministic policy $a = \mu(s)$ vs. stochastic policy $a \sim \pi(a|s)$.</mark></summary>
  <p>Defines the learning agent's behavior at any given time - it is what dictates the agent's next action. It maps perceived states of the environment to actions. A policy can be a lookup table, a simple function, or an extensive search process such as a deep neural network.</p>
  </details>
* **Reward Signal**
  <details> <summary> <mark>Defines the goal in an RL problem: where am I going?</mark> </summary>
  <p>The reward signal answers that question. On each time step, the environment sends a single number, the reward, to the agent. The agent's sole objective is to maximize total reward over time. The reward defines what is good and what is bad for the agent; it is immediate and defines features of the problem; the agent can alter the reward signal only through its actions, directly or indirectly by changing the environment's state; it is the primary basis for changing the policy.</p>
  <p><img src="https://i.imgur.com/0l9lQm5.png" width="250" height="200"></p>
  </details>
* **Value Function**
  <details close> <summary> <mark>Defines what is good in the long run; it is a prediction of future reward.</mark> </summary>
  <p><b>The value of a state is the total amount of reward an agent can expect to accumulate in the future starting from that state.</b> A state might yield a low immediate reward but still have a high value because it is usually followed by beneficial states. Rewards (immediate) are primary; values (predictions of reward) are secondary. The only purpose of estimating values is to get more reward; action choices are based on value judgments. The most important component of most RL algorithms is efficient value estimation.</p>
  <p><img src="https://i.imgur.com/39Dfdph.png" width="250" height="300"> <img src="https://i.imgur.com/KVTx7Yy.png" width="250" height="200"></p>
  <p>The above problem can be approached simply by using a value function stored as a lookup table (see the sketch after this list): <br />
  1. Set up a table of numbers, one row for each possible world state. <br />
  2. Estimate the state value V (use the table as a value function). <br />
  3. Initialize the value function (eat: 1, get eaten: -1, else: 0). <br />
  4. Live, eat or die, repeat. Collect experience. Learn. <br />
  5. Select an action based on the value of the next state: <br />
  * Most of the time we select greedily, choosing the move that leads to the state with the greatest value - we are exploiting our collected knowledge.<br />
  * Occasionally we select randomly among actions - we are exploring and collecting new knowledge. <br />
  <b>Learning in a grid world</b>: as we navigate the grid world over and over again, we change the values of the states visited during the game. These <b>value estimates</b> should become more accurate estimates of how much cumulative reward we expect in the future. Correct behavior requires planning, and taking delayed effects of choices into account, through backups of the value function estimates. Procedure: "back up" the value of the state after a greedy action to the value before the action. The current value $V(s)$ of the earlier state $s$ is adjusted to be closer to the value $V(s')$ of the later state $s'$:
  $$V(s) \leftarrow V(s) + \alpha \left[V(s') - V(s)\right]$$
  This update rule (using step-size / learning rate $\alpha$) is an example of temporal-difference learning.</p>
  </details>
* **Model of the Environment**
  <details> <summary> <mark>Mimics the behavior of the environment.</mark> </summary>
  <p>Some RL methods are <b>model-based</b>, others are simpler <b>model-free</b> methods (trial-and-error learners). Models can be used for planning: given a state and an action, the model predicts the next state and the next reward.</p>
  </details>
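
As a toy illustration of the value-function backup $V(s) \leftarrow V(s) + \alpha[V(s') - V(s)]$ described in the Value Function bullet above, here is a minimal sketch on a made-up one-dimensional grid world; all states, rewards and parameter values are illustrative, not from the lecture:

```python
import random

# Hypothetical 1-D grid world: states 0..5, state 5 is the "eat" terminal
# state (value 1), state 0 is the "get eaten" terminal state (value -1).
TERMINAL_VALUES = {0: -1.0, 5: 1.0}

def init_values():
    V = {s: 0.0 for s in range(6)}
    V.update(TERMINAL_VALUES)          # eat: 1, get eaten: -1, else: 0
    return V

def step(state, action):
    return state + action              # action is -1 (left) or +1 (right)

def run_episode(V, alpha=0.1, epsilon=0.1):
    s = 3                              # start somewhere in the middle
    while s not in TERMINAL_VALUES:
        if random.random() < epsilon:  # occasionally explore
            a = random.choice([-1, +1])
        else:                          # otherwise act greedily on V(s')
            a = max([-1, +1], key=lambda act: V[step(s, act)])
        s_next = step(s, a)
        # back up the value of the later state to the earlier state:
        # V(s) <- V(s) + alpha * (V(s') - V(s))
        V[s] += alpha * (V[s_next] - V[s])
        s = s_next
    return V

V = init_values()
for _ in range(1000):
    run_episode(V)
print({s: round(v, 2) for s, v in V.items()})
```
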
### From Silver's Lecture 1

Partially observable environments: the agent can see only a portion of the environment, e.g. a robot with a camera sees only a portion of the room. These cases need a different formalism, a partially observable Markov decision process (POMDP). One choice is to maintain a belief, a probability distribution over states; another choice is to use RNNs.

A value-based agent is one with a value function and no explicit policy (the policy is implicit): it behaves greedily with respect to the value function. A policy-based agent does not store the value function explicitly but adjusts the policy directly.

## Multi-arm Bandits

#### A k-Armed Bandit Problem

The k-armed bandit method is inspired by slot machines: one slot machine is one lever, one arm, one of the k. The method has to choose the best action among k levers. It needs active exploration (explicit trial-and-error search). There are two feedback paradigms: **evaluative feedback** depends entirely on the action taken; **instructive feedback** is independent of the action taken. We will see how to learn to act in a non-associative setting (no more than one situation).

A k-armed bandit problem consists of:
1. repeated choice among k different actions (options)
2. after each action you receive a numeric reward (drawn from a stationary probability distribution)
3. objective: maximize expected total reward over time

For recommender systems, exploration is about gathering information about the user, to then use this information to get the highest reward (e.g. keeping the user hooked to the advertisement). k-armed bandits are also used by Spotify to make music recommendations. Another application is hyperparameter search.

Random variables and expected values: $E[X] = \sum^k_{i=1} x_i \cdot Pr \{X = x_i \}$. In the example of an unbiased coin toss, the random variable $X$ takes on values from the set {heads, tails}, with heads = +1 and tails = −1. The expected value is then $E[X] = (+1) \cdot 0.5 + (−1) \cdot 0.5 = 0$.

Some useful notation. Each of the k actions has an expected reward given that the action is selected: this is the value of that action.
* $A_t$: action selected at time step $t$
* $R_t$: corresponding reward
* $q_*(a)$, the value of an arbitrary action $a$, is the expected reward given that $a$ is selected: $q_*(a) = E[R_t|A_t = a]$
* ... unfortunately we do not know the actions' values
* $Q_t(a) \approx q_*(a)$: estimated value of action $a$, which is approximately equal to the real value

**k-armed methods apply greedy or non-greedy action selection.** A greedy action is an action whose estimated value is largest. Exploitation means choosing those greedy actions based on our current knowledge. However, non-greedy actions may produce higher reward in the long run, so it is good to explore unknown actions, or actions we know little about. There are many methods that deal with the exploration-exploitation dilemma; deciding whether to explore or exploit depends on:
1. precision of the value estimates
2. uncertainties of the value estimates
3. number of remaining steps
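
The following subsections compare methods on a k-armed testbed. Below is a minimal sketch of such a testbed, under the usual assumption (stated later in these notes) that $q_*(a) \sim \mathcal{N}(0,1)$ and rewards are drawn from $\mathcal{N}(q_*(a), 1)$; class and method names are made up for illustration:

```python
import numpy as np

# A minimal sketch of a stationary k-armed bandit testbed: the true action
# values q*(a) are drawn from N(0, 1) and each reward from N(q*(a), 1).
class BanditTestbed:
    def __init__(self, k=10, rng=None):
        self.rng = rng or np.random.default_rng()
        self.q_star = self.rng.normal(0.0, 1.0, size=k)   # true action values

    def pull(self, action):
        # reward for the chosen lever, noisy around its true value
        return self.rng.normal(self.q_star[action], 1.0)

    def optimal_action(self):
        return int(np.argmax(self.q_star))

bandit = BanditTestbed(k=10)
print(bandit.pull(3), bandit.optimal_action())
```
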
#### Action-Value Methods

**Sample-average method.** Value estimation has the goal of estimating the values of actions; those value estimates are then used to select actions. The true value of an action is its mean reward when the action is selected. The simplest way of estimating it is to average the rewards actually received:
$$Q_t(a) = \frac{\text{sum of rewards when } a \text{ taken prior to } t}{\text{number of times } a \text{ taken prior to } t} = \frac{\sum^{t−1}_{i=1} R_i \cdot 1_{A_i=a}}{\sum^{t−1}_{i=1} 1_{A_i=a}}$$
$1_{predicate}$ is 1 if the predicate is true and 0 otherwise. If the denominator is zero we define the value explicitly (e.g. set $Q_1(a) = 0$). By the law of large numbers $Q_t(a)$ converges to $q_*(a)$.

The simplest rule is to select actions with the highest estimated action value: select at each step $t$ one of the greedy actions $A^*_t$ for which $Q_t(A^*_t) = \max_a Q_t(a)$. We call this **the greedy action selection method**, written as:
$$A_t = \operatorname*{arg\,max}_a Q_t(a)$$
Greedy action selection always exploits current knowledge to maximize immediate reward and spends no time sampling apparently inferior actions to see if they might actually be better.

**$\epsilon$-greedy selection methods.** What if we would like to enable exploration? A simple fix is to behave greedily most of the time, but with a small probability $\epsilon$ select uniformly at random from among all the actions. In the limit where all actions are selected infinitely many times, the $Q_t(a)$ converge to $q_*(a)$. **$\epsilon$-greedy implies that the probability of selecting the optimal action converges to greater than $1 − \epsilon$.**

![](https://i.imgur.com/rKGQYCI.png)

The greedy method performs significantly worse in the long run; $\epsilon$-greedy methods perform better because they continue to explore. $\epsilon = 0.1$ finds the optimal action earlier but never selects it more than 91% of the time; $\epsilon = 0.01$ improves more slowly but eventually performs better than $\epsilon = 0.1$. $\epsilon$-greedy methods have problems too: they force non-greedy actions to be tried, but indiscriminately, so **they have no preference for actions that are nearly greedy or particularly uncertain.**

#### Incremental Implementation

**Iterative sample-average method.** Start from the sample average for a single action after $n − 1$ selections: $Q_n = \frac{R_1 + R_2 + \dots + R_{n−1}}{n − 1}$. The sample-average method needs to maintain a record of all rewards for all actions taken in order to estimate the action values; what we want instead is constant memory and constant per-time-step computation. **Given $Q_n$ and the $n$th reward $R_n$, the new average of all $n$ rewards can be computed by**:
$$Q_{n+1} = \frac{1}{n} \sum^n_{i=1} R_i$$
$$= \frac{1}{n} \left(R_n + \sum^{n−1}_{i=1} R_i \right)$$
$$= \frac{1}{n} \left(R_n + (n − 1) \frac{1}{n − 1} \sum^{n−1}_{i=1} R_i \right)$$
$$= \frac{1}{n} \left(R_n + (n − 1)Q_n\right)$$
$$= \frac{1}{n} \left(R_n + nQ_n − Q_n\right)$$
$$= Q_n + \frac{1}{n} \left[R_n − Q_n\right]$$
The general form of the update rule is:
$$NewEstimate \leftarrow OldEstimate + StepSize\,[Target − OldEstimate]$$
where $[Target − OldEstimate]$ is the error in the estimate. The error is reduced by taking a step towards the "Target". For the iterative sample-average method the "StepSize" parameter changes over time ($\frac{1}{n}$ for the $n$th reward).

**Iterative sample-average algorithm.** Given the update rule $Q_{n+1} = Q_n + \frac{1}{n} [R_n − Q_n]$, we end up with the pseudo-code for a complete bandit algorithm using incrementally computed sample averages and $\epsilon$-greedy action selection.

![](https://i.imgur.com/GVbnIa3.png)

But remember that most of these algorithms use a general step size $\alpha_t(a)$.
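
A minimal sketch of the complete bandit algorithm just described - incrementally computed sample averages with $\epsilon$-greedy action selection - assuming the same Gaussian testbed as above; function and variable names are illustrative:

```python
import numpy as np

# A minimal sketch of the incremental sample-average bandit with epsilon-greedy
# action selection: Q(A) <- Q(A) + (1/N(A)) * (R - Q(A)).
# Testbed assumption: q*(a) ~ N(0,1), rewards ~ N(q*(a), 1).
def run_bandit(k=10, steps=1000, epsilon=0.1, rng=None):
    rng = rng or np.random.default_rng()
    q_star = rng.normal(0.0, 1.0, size=k)    # true (unknown) action values
    Q = np.zeros(k)                          # value estimates
    N = np.zeros(k)                          # selection counts
    rewards = []
    for _ in range(steps):
        if rng.random() < epsilon:           # explore
            a = rng.integers(k)
        else:                                # exploit (break ties randomly)
            a = rng.choice(np.flatnonzero(Q == Q.max()))
        r = rng.normal(q_star[a], 1.0)
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]            # incremental sample average
        rewards.append(r)
    return Q, q_star, np.mean(rewards)

Q, q_star, avg_reward = run_bandit()
print(avg_reward)
```

For a non-stationary problem (next subsection), the line `Q[a] += (r - Q[a]) / N[a]` would be replaced by a constant step size, `Q[a] += alpha * (r - Q[a])`.
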
#### Tracking a Non-stationary Problem

The averaging methods above are appropriate for stationary environments. However, RL problems are often effectively non-stationary. In such cases it makes sense to weight recent rewards more heavily than long-past ones. One way to do this is to use a constant step size $\alpha$, changing the update rule of the bandit algorithm to:
$$Q_{n+1} = Q_n + \alpha [R_n − Q_n] = (1 − \alpha)^n Q_1 + \sum^n_{i=1} \alpha(1 − \alpha)^{n−i}R_i$$
where the step-size parameter $\alpha \in (0, 1]$ is constant. $Q_{n+1}$ is a weighted average of past rewards and the initial estimate $Q_1$:
* the sum of all weights is $(1 − \alpha)^n + \sum^n_{i=1} \alpha(1 − \alpha)^{n−i} = 1$
* the weight $\alpha(1 − \alpha)^{n−i}$ given to reward $R_i$ depends on how many time steps ago it was observed
* the weight decays exponentially (if $1 − \alpha = 0$, all weight goes to the last reward); this is an exponential, recency-weighted average
* $\alpha_n(a) = 1/n$ gives the sample-average method, which is guaranteed to converge by the law of large numbers. For constant step size $\alpha_n(a) = \alpha$ convergence is not guaranteed - but this is exactly what we want for non-stationary problems: the estimates continue to vary in response to the most recently received rewards. (The convergence condition $\sum^∞_{n=1} \alpha^2_n (a) < ∞$ has to be violated.)

#### Optimistic Initial Values

**The methods above depend on the initial action-value estimates $Q_1(a)$: they are biased by their initial estimates. For sample-average methods the bias disappears once all actions have been selected at least once. For constant step size $\alpha$ the bias is permanent (though it decreases over time).**
* the initial estimates are a set of parameters picked by the user
* an easy way to supply prior knowledge about the expected reward
* an easy way to encourage exploration

**Instead of setting the initial action values for our 10-armed testbed to zero, we set them all to a more optimistic value (e.g. +5).** Recall that the $q_*(a)$ are sampled from a normal distribution with mean 0 and variance 1, so +5 is wildly optimistic.
* No matter which action is selected, its reward is less than the optimistic starting estimates.
* The learner switches to other actions, being disappointed with the reward received.
* **As a result all actions are tried several times before the value estimates converge.**
* **Initially the optimistic method performs worse, because it explores more. Eventually it performs better as exploration decreases over time.**
* **Effective on stationary problems (but far from being a general approach to encouraging exploration).**

Comparison of optimistic initial values ($Q_1 = 5$, $\epsilon = 0$) with $\epsilon$-greedy action selection ($Q_1 = 0$, $\epsilon = 0.1$):

![](https://i.imgur.com/zi3gzgX.png)

Initially OIV does worse but recovers.

#### Upper-Confidence-Bound Action Selection

UCB often performs well, but like the other methods above it is difficult to extend to the general RL problem, where non-stationarity and large state spaces become an issue. UCB addresses a weakness of the $\epsilon$-greedy methods, where non-greedy actions are chosen indiscriminately with no preference for nearly-greedy actions. UCB selects among non-greedy actions according to their potential for actually being optimal (a bit of discrimination), taking into account both how close their estimates are to being maximal and the uncertainty in those estimates.

![](https://i.imgur.com/nZnILjt.png)

**UCB formalism**:
$$A_t = \operatorname*{arg\,max}_a \left[ Q_t(a) + c\sqrt{\frac{\log t}{N_t(a)}} \right]$$
$N_t(a)$: number of times action $a$ has been selected prior to $t$; $c > 0$ controls the degree of exploration. If $N_t(a) = 0$, $a$ is considered to be a maximizing action. The square-root term is the novelty introduced by UCB: it is a measure of the uncertainty (variance) in the estimate of the action's value. **Each time action $a$ is selected, $N_t(a)$ is increased and the uncertainty is reduced. Each time another action is selected only $t$ is increased, so the uncertainty grows. The log means that the increase gets smaller over time, but is unbounded. The whole expression is a sort of upper bound on the possible true value of action $a$, with $c$ determining the confidence level.**
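
A minimal sketch of UCB action selection on top of incremental sample averages, following the formula above (testbed assumptions as before; names illustrative):

```python
import numpy as np

# A minimal sketch of UCB action selection:
# A_t = argmax_a [ Q(a) + c * sqrt(log t / N(a)) ], with N(a)=0 treated as
# maximizing. Testbed assumption: q*(a) ~ N(0,1), rewards ~ N(q*(a), 1).
def run_ucb(k=10, steps=1000, c=2.0, rng=None):
    rng = rng or np.random.default_rng()
    q_star = rng.normal(0.0, 1.0, size=k)
    Q = np.zeros(k)
    N = np.zeros(k)
    for t in range(1, steps + 1):
        untried = np.flatnonzero(N == 0)
        if untried.size > 0:
            a = untried[0]                   # N_t(a) = 0 counts as maximizing
        else:
            ucb = Q + c * np.sqrt(np.log(t) / N)
            a = int(np.argmax(ucb))
        r = rng.normal(q_star[a], 1.0)
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]            # incremental sample average
    return Q, q_star

Q, q_star = run_ucb()
print(np.argmax(Q), np.argmax(q_star))
```
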
#### Gradient Bandit Algorithms

Initially the gradient bandit performs worse; eventually it performs better as the action-selection preferences converge over time.

![](https://i.imgur.com/65zdnJK.png)

The methods above are value-based. Gradient bandit algorithms instead learn a numerical **preference** $H_t(a)$ for each action $a$; they are policy-based. The gradient bandit algorithm is a stochastic approximation to gradient ascent - it is based on stochastic gradient ascent. A larger preference for an action $a$ means the action is taken more often. The preference has no interpretation in terms of reward; only the relative preference of actions matters. Action probabilities:
$$Pr\{A_t = a\} = \frac{e^{H_t(a)}}{\sum^k_{b=1} e^{H_t(b)}} = \pi_t(a)$$
* $\pi_t(a)$ is the probability of taking action $a$ at time $t$
* initially all preferences are the same: $H_1(a) = 0, \forall a$

Every time we select an action $A_t$ and receive a reward $R_t$, we update the action-selection preferences:
$$H_{t+1}(A_t) = H_t(A_t) + \alpha(R_t − \bar{R}_t)(1 − \pi_t(A_t)) \text{, and}$$
$$H_{t+1}(a) = H_t(a) − \alpha(R_t − \bar{R}_t)\pi_t(a), \quad \forall a \neq A_t$$
* $\alpha > 0$ is a step-size parameter
* $\bar{R}_t$ is the average of all rewards up through and including time $t$
* $\bar{R}_t$ serves as a baseline with which the reward is compared
* if $R_t > \bar{R}_t$, the preference for $A_t$ is increased

The two formulas above can be written in a single line,
$$H_{t+1}(a) = H_t(a) + \alpha(R_t − \bar{R}_t)(1_{a=A_t} − \pi_t(a)), \quad \forall a$$
using the predicate $1_{a=A_t}$ that is 1 for the selected action and 0 for all other actions (see the sketch below). This is equal to the following formula, which is what makes the gradient bandit algorithm an instance of stochastic gradient ascent:
$$H_{t+1}(a) = H_t(a) + \alpha\frac{\partial E[R_t]}{\partial H_t(a)}$$
**Proof of GBA**: TODO
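
A minimal sketch of the gradient bandit update using the one-line form above, with the incremental reward average as baseline (testbed assumptions as before; names illustrative):

```python
import numpy as np

# A minimal sketch of the gradient bandit update: softmax over preferences
# H(a), baseline R_bar, and H(a) <- H(a) + alpha*(R - R_bar)*(1{a=A} - pi(a)).
# Testbed assumption: q*(a) ~ N(0,1), rewards ~ N(q*(a), 1).
def run_gradient_bandit(k=10, steps=1000, alpha=0.1, rng=None):
    rng = rng or np.random.default_rng()
    q_star = rng.normal(0.0, 1.0, size=k)
    H = np.zeros(k)                           # preferences, all equal at start
    r_bar, t = 0.0, 0
    for _ in range(steps):
        pi = np.exp(H - H.max())
        pi /= pi.sum()                        # softmax action probabilities
        a = rng.choice(k, p=pi)
        r = rng.normal(q_star[a], 1.0)
        t += 1
        r_bar += (r - r_bar) / t              # incremental average as baseline
        one_hot = np.zeros(k)
        one_hot[a] = 1.0
        H += alpha * (r - r_bar) * (one_hot - pi)
    return H, q_star

H, q_star = run_gradient_bandit()
print(np.argmax(H), np.argmax(q_star))
```
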
#### Associative Search (Contextual Bandits)

The methods above address **non-associative tasks: tasks where there is no need to associate different actions with different situations.** In a general RL task there is more than one situation, and the goal is to learn a policy: a mapping from situations to the actions that are best in those situations. Let's look at the setting of contextual bandits before we move on to the full RL problem:
* there are several different k-armed bandit tasks
* at each step you confront one of them, chosen at random
* it looks like a single bandit problem whose action values change randomly from step to step
* in addition you are given a clue about which bandit you are facing (e.g. a color)
* you can now learn a policy associating each task, signaled by the color you see, with the best action for it

Contextual bandits are an example of an associative search task: they involve both trial-and-error learning in search of the best actions and association of these actions with the situations in which they work best. **If actions are allowed to affect the next situation as well as the reward $\rightarrow$ full RL problem.** <mark>Contextual bandits are MDPs with a fixed set of states and a fixed set of actions, but they have no control over the next state!</mark>

## Markov Decision Processes

***The future is independent of the past given the present.***

| k-armed Bandits | Contextual k-armed Bandits | Markov Decision Processes |
| :--------: | :--------: | :--------: |
| there is only one state or problem | fixed set of states or problems, known in advance, but we don't know the order | like a sequence of bandit problems, but with an explicit notion of a state $S_t$ |
| fixed set of actions | fixed set of actions | $A(S_t)$: **possible actions depend on the state; the decision on an action influences the next state** |
| goal is to maximize reward over time | goal is to maximize reward over time | goal is to maximize reward over time |
| **MDPs with exactly one state and a fixed set of actions** | **MDPs with a fixed set of states and actions, but no control over the next state** | <mark>:) the full RL problem</mark> |

**The decision on an action influences the next state, subject to the stochasticity of the environment.**

> **A Markov decision process is defined by:**
> $S$: state space
> $T$: index set, denoting discrete time steps
> $A$: set of actions that the agent can take
> $R$: reward function that specifies the reward
> $P$: transition probabilities with the Markov property
> $\gamma$: a discount factor, weighting near vs. far future reward
> $agent$: navigates through the state space by deciding on actions using its policy $\pi(a|s)$

**Stochastic processes** are collections of random variables; the variables take values from a state space $S$ and are indexed by some ordered index set $T$, for example:
> $S_t \in S = \{a, b, c\}, \quad T \subseteq \mathbb{N}_0$
> process $\{S_t \mid t \in T\} = \{S_0, S_1, S_2, \dots, S_t\}$

There are many types of stochastic processes. The type is usually determined by the cardinalities of the state space and index set and by the dependencies among the random variables. The index set $T$ models discrete or continuous time; the state space $S$ models discrete or continuous space; the conditional probabilities $p(S_{t+1}|S_t, S_{t−1}, \dots)$ model the probabilistic dependencies among states. RL deals with processes that have the Markov property. A stochastic process is a first-order Markov process if
$$p(S_{t+1}|S_t) = p(\underbrace{S_{t+1}}_{\text{future}} \mid \underbrace{S_t}_{\text{present}}, \underbrace{S_{t−1}, S_{t−2},\dots,S_0}_{\text{past}})$$
The conditional distribution over the successor states of a Markov process depends only on the current state. Such processes are also called **memoryless**: "the future is independent of the past, given the present."

Example of a Markov process, where each $p \in P$ is an entry of the transition probability table:

![](https://i.imgur.com/aNxcrbw.png)

<mark>The agent-environment interaction in a Markov process</mark>:
1. at **every timestep** $t$, the agent gets the current state **$S_t$** of the environment and a reward **$R_t$**
2. based on the current state $S_t$, the agent decides on an action $A_t$ according to its policy: $\pi(a|s) = Pr\{A_t = a|S_t = s\}$
3. the effect of the action is determined by the environment: it transitions to a successor state $S_{t+1}$ and gives a reward $R_{t+1}$

**The boundary between agent and environment** does not necessarily correspond to physical boundaries. In general, the environment is everything the agent has only indirect control over, via the actions it chooses and the resulting reactions to them; the reward signal must be considered part of the environment. The agent decides on an action and the environment reacts to the agent's decision, either deterministically (a single outcome) or stochastically (a probability distribution over outcomes). The environment's reaction is governed by a probability distribution describing which states the environment may transition to and what the reward is.

**Sequences and episodes**
1. The agent starts in some state $S_0 = s$.
2. The agent decides on an action according to its policy $\pi(a|s)$, observes a new state, gets a reward, decides, observes a new state, gets a reward, decides...
3. **The agent travels along a path through state space** $S_0, A_0, R_1, S_1, A_1, R_2, S_2, \dots$
4. Paths might be infinitely long, or have natural ends.
5. **Paths with natural ends are also called episodes.**

**Multiple notations** for rewards and dynamics are used:
* $r(s), r(s, a), r(s, a, s')$ all denote reward, depending on different inputs
* $p(s'|s, a)$ denotes the conditional probability of transitioning to state $s'$, given that the agent is in $s$ and chooses action $a$
* in the book, $R$ and $P$ are also described jointly by the distribution $p(s', r|s, a)$
* $p(s', r|s, a)$ is also referred to as the dynamics
* all other quantities can be derived from $p(s', r|s, a)$

**The dynamics of an MDP** are completely specified by
$$p(s', r|s, a) = Pr\{S_{t+1} = s', R_{t+1}=r \mid S_t =s, A_t = a\}$$
Given any state $s$ and action $a$, the dynamics give us the probability of each possible pair of next state $s'$ and next reward $r$; you can interpret the dynamics as a model of the laws of nature in a particular environment. <mark>From this compact dynamics notation, we can derive all necessary quantities of interest.</mark>

From the dynamics **we can derive the expected reward for state-action pairs**, where $r(s,a)$ is the expected reward if we are in state $s$ and choose action $a$:
![](https://i.imgur.com/6PB2Zf7.png)
From the dynamics we can **derive the expected rewards for (state, action, next state) triples** $(s, a, s')$, where $r(s, a, s')$ is the expected reward if we are in state $s$, choose action $a$ and end up in state $s'$:
![](https://i.imgur.com/af2utpg.png)
From the dynamics we can **derive the state-transition probabilities**, where $p(s'|s, a)$ is the probability of ending up in state $s'$ if we are in state $s$ and choose action $a$:
![](https://i.imgur.com/XYChZYs.png)
The dynamics $p(s', r|s, a)$ are not always available to us. If they are available we use them for planning. Otherwise, we **can either learn a model of the dynamics or use model-free methods**.
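
A minimal sketch of deriving $r(s,a)$ and $p(s'|s,a)$ from the joint dynamics $p(s',r|s,a)$, stored here as a plain dictionary; the tiny two-state MDP is made up purely for illustration:

```python
# p(s', r | s, a) stored as {(s, a): {(s', r): prob}} -- illustrative numbers.
dynamics = {
    ("s0", "go"):   {("s1",  1.0): 0.8, ("s0", 0.0): 0.2},
    ("s0", "stay"): {("s0",  0.0): 1.0},
    ("s1", "go"):   {("s0", -1.0): 1.0},
    ("s1", "stay"): {("s1",  0.0): 1.0},
}

def expected_reward(s, a):
    # r(s, a) = sum over (s', r) of r * p(s', r | s, a)
    return sum(r * prob for (s_next, r), prob in dynamics[(s, a)].items())

def transition_prob(s_next, s, a):
    # p(s' | s, a) = sum over r of p(s', r | s, a)
    return sum(prob for (nxt, r), prob in dynamics[(s, a)].items() if nxt == s_next)

print(expected_reward("s0", "go"))        # 0.8
print(transition_prob("s1", "s0", "go"))  # 0.8
```
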
<details close> <summary> <mark>Explicit notion of states (part of the Markov process definition)...</mark></summary>
<p>States are derived from the agent's perceptions of the environment; e.g. for board games the state would be the board. States can include immediate sensor values (audio, video, ...) and can be heavily preprocessed or augmented. They can contain multiple "snapshots" of sensor values (sampled at a higher framerate, squeezing in short-term temporal context, i.e. information about dynamics). States summarize past sensations compactly, without too much information loss. The more information about the past we manage to squeeze into the state signal, the more we are justified in modelling the dynamics $p(s', r|s, a)$ as Markovian.</p>
</details>
<details close><summary><mark>Rewards and goals</mark></summary><p>
At each timestep, the agent receives a reward $R_t$. Rewards are determined by the environment, and rewards specify what is desirable, not how to achieve it. Over the lifetime of the agent, reward accumulates, and the goal of the agent is to maximize reward in the long run. Parts of the environment are already specified (the universe with its physical laws, the rules of games). Sometimes the reward function is predefined (most games have a well-defined reward already); other times we have to define our own reward function (to specify the desired behavior!).
> An example of a practical reward function in use: imagine we have a battery-driven mobile robot with an arm. We want it to learn to collect old cans. By specifying the reward signal with a function $r : S \rightarrow \mathbb{R}$ from states to real numbers, we formalize what kind of behaviour is desirable and undesirable. For example, $r(\text{state}) = \text{reward}$:
```
r(moving around) = −0.1 (uses energy!)
r(picking up can) = 1
r(bumping into obstacle) = −1
r(running out of charge) = −2
. . .
```
</p></details>
<details close> <summary> <mark>Tasks...</mark></summary>
<p>A task is a complete specification of an environment and is defined by how rewards are given. A task is one particular instance of the reinforcement learning problem, and a solution is always with respect to a task! Reinforcement learning techniques will give you ways to obtain (approximately) optimal solutions, but might not help you if the task is misspecified or modelled improperly.</p>
</details>

**Temporal credit assignment problem**: in general, there is no direct feedback in response to an action; it might take a while to get reward. The agent does not know immediately whether the decision it made in a state was good or bad. It only knows once it can tell which of its actions were responsible for large reward in the long run. We need to connect expected future reward to states.

**Expected returns**: consider the sequence of future rewards
$$R_{t+1}, R_{t+2}, R_{t+3}, \dots$$
The return $G_t$ is defined as the sum of future rewards
$$G_t \doteq R_{t+1} + R_{t+2} + R_{t+3} + \dots$$
In the simplest case, we have a final time step and therefore a finite sum $G_t \doteq R_{t+1} + R_{t+2} + R_{t+3} + \dots + R_T$. This makes most sense in environments where the agent-environment interaction decomposes naturally into subsequences (e.g. games); these subsequences are often called episodes. What if there is no natural notion of a final step ($T = \infty$)? Simply summing up rewards would lead to infinite returns! The fix is a discount factor $\gamma \in [0, 1)$ which makes the sum finite:
$$G_t \doteq R_{t+1} + \gamma R_{t+2} + \underbrace{\gamma^2}_{\text{far rewards weigh less than near rewards}} R_{t+3} + \dots = \sum^{\infty}_{k=0}\gamma^k R_{t+k+1}$$
**If $\gamma$ is close to 0, only immediate rewards are considered; if $\gamma$ is close to 1, rewards farther in the future are taken into account more strongly.**
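
A minimal sketch of computing the discounted return $G_t = \sum_k \gamma^k R_{t+k+1}$ from a recorded reward sequence, using the recursion $G_t = R_{t+1} + \gamma G_{t+1}$ (the numbers are illustrative):

```python
# Discounted return from a list of rewards received after time t.
def discounted_return(rewards, gamma=0.9):
    g = 0.0
    # iterate backwards: G_t = R_{t+1} + gamma * G_{t+1}
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([0.0, 0.0, 1.0], gamma=0.9))   # 0.81
print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))   # 1.75
```
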
### Value Functions and Returns

Value functions map states to reals, $f : S \rightarrow \mathbb{R}$ (or state-action pairs to reals, $f : S \times A \rightarrow \mathbb{R}$). They estimate how good it is to be in a particular state (or how good an action is if we are in a particular state). "Good" is defined in terms of expected return, i.e. the cumulative reward one expects in the future. Future rewards depend on the current policy, therefore value functions are always defined with respect to a particular policy $\pi$.

<img src="https://i.imgur.com/udQTJRR.png" width="350"> <img src="https://i.imgur.com/YInR4Hn.png" width="350">
<img src="https://i.imgur.com/CNlMbb4.png">

All value function definitions in reinforcement learning have a particular recursive relationship; this is the key to finding optimal policies (there can be more than one optimal policy). These recurrence relations hold for any policy $\pi$ and any state $s$. The recursive relationship of the value function leads to the formalism of the Bellman equations:

![](https://i.imgur.com/MorjMuw.png)
![](https://i.imgur.com/6PbKWLq.png)
<img src="https://i.imgur.com/LCUZUDY.png">

**Optimal policies**: solving an RL problem means finding a policy that achieves the largest reward over time. Policies can be compared via their value functions: $\pi \geq \pi'$ if $v_\pi(s) \geq v_{\pi'}(s), \forall s \in S$. So an optimal policy is precisely defined for MDPs. An optimal policy is better than or equal to all other policies; even though there might be more than one such policy, we denote all optimal policies by $\pi_*$. All optimal policies share the same state-value function, called the optimal value function
$$v_*(s) = \operatorname*{max}_\pi v_\pi(s), \quad \forall s \in S$$
All optimal policies also share the same action-value function, called the optimal action-value function
$$q_*(s, a) = \operatorname*{max}_\pi q_\pi(s, a), \quad \forall s \in S, \forall a \in A$$

<details><summary><mark>Bellman optimality equation</mark></summary>
<p>The Bellman equation for $v_*(s)$, also called the Bellman optimality equation, expresses the fact that the value of a state under an optimal policy must equal the expected return for the best action possible in this state:</p>
<p><img src="https://i.imgur.com/TOp8Fcf.png">
To use the Bellman optimality equation system, the following assumptions must hold: we have enough computing power to tackle the state space; we have access to the dynamics; the Markov property holds.</p>
</details>

**Benefit of using $q_*(s, a)$ instead of $v_*(s)$**: in order to act using $v_*(s)$, we need a model of the dynamics to conduct a one-step search over possible $(s, a, s')$ triples. Using $q_*(s, a)$ bypasses this issue, so we can directly use $\pi_*(s) = \operatorname*{arg\,max}_a q_*(s, a)$.
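
A minimal sketch contrasting greedy action selection from $q_*(s,a)$ (no model needed) with greedy action selection from $v_*(s)$ (which needs the dynamics for a one-step lookahead); all tables below are tiny illustrative placeholders:

```python
q_star = {("s0", "go"): 1.2, ("s0", "stay"): 0.4,
          ("s1", "go"): 0.1, ("s1", "stay"): 0.8}
v_star = {"s0": 1.2, "s1": 0.8}
# p(s', r | s, a) as {(s, a): [(prob, s', r), ...]}
dynamics = {("s0", "go"):   [(1.0, "s1", 1.0)],
            ("s0", "stay"): [(1.0, "s0", 0.0)],
            ("s1", "go"):   [(1.0, "s0", -1.0)],
            ("s1", "stay"): [(1.0, "s1", 0.0)]}
ACTIONS, GAMMA = ["go", "stay"], 0.9

def greedy_from_q(s):
    # pi*(s) = argmax_a q*(s, a): no model of the environment required
    return max(ACTIONS, key=lambda a: q_star[(s, a)])

def greedy_from_v(s):
    # needs the dynamics for a one-step search over (s, a, s') triples
    def backup(a):
        return sum(p * (r + GAMMA * v_star[s2]) for p, s2, r in dynamics[(s, a)])
    return max(ACTIONS, key=backup)

print(greedy_from_q("s0"), greedy_from_v("s0"))
```
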
## Function Approximation Tutorial

## Practical Applications of Policy Gradients

## Technical Details (Function Approximation)

## Last Lecture AlphaZero

Readings:
* http://johnsalvatier.org/blog/2017/reality-has-a-surprising-amount-of-detail
* https://ieeexplore.ieee.org/document/6145622
* Li, Shuai; Karatzoglou, Alexandros; Gentile, Claudio, "Collaborative Filtering Bandits"

#### Monte Carlo Tree Search

MCTS is a mix of Monte Carlo methods and traditional tree search. It builds a local approximation of $Q(s,a)$ that takes the form of a partial tree.

![](https://i.imgur.com/mvrJ1aG.png)

We begin with a node (at the beginning it is both root and leaf) and then we do a rollout: using some other policy, most of the time a uniformly random policy, we play the game until it ends and store the outcome.

**MCTS steps**:
- choose a node, starting from the root
- rollout
- store the outcome
- back up the outcome, letting it trickle back towards the root node

![](https://i.imgur.com/lidIGZZ.png)

$[7,0,1]$ is the best move in this state. With each rollout the tree grows a little bit. When the computation budget is exhausted, we return the best move found after that many samples, play it, and the game state is updated. The new experience is propagated back up to the root node.

![](https://i.imgur.com/P4flyIU.png)

#### Policies

The tree policy improves on the random policy. It uses an upper confidence bound over the children of the current node $n$:
$$\operatorname*{arg\,max}_{n' \in CHILDREN(n)} \frac{Q(n')}{N(n')} + C \sqrt{\frac{\log N(n)}{N(n')}}$$
The first term is the average outcome of child $n'$ (exploitation); the square-root term is the exploration term, which grows the more often the parent has been visited relative to the child; $C$ controls how strongly we prefer children that have not been chosen much so far.

#### Go - the game

#### AlphaZero

AlphaGo (DeepMind, 2016) has two functions, the policy network and the value network. These two were pretrained on a large number of games and combined with MCTS. AlphaZero builds on AlphaGo and is a much more general game player: Go, chess and other games, achieving state of the art on all of them. There are no rollouts in AlphaZero anymore and there is only one policy; the network $f_\theta$ is used to build the tree partially.

## Dynamic Programming

Markov decision processes satisfy both properties required for dynamic programming:
* the Bellman equation gives a recursive decomposition
* the value function stores and reuses solutions

In RL, DP can be used to compute optimal policies, given a perfect model of the dynamics of an MDP. **Reminder**: the dynamics of an MDP are completely specified by $p(s', r|s, a)$; given any state and action $s$ and $a$, the dynamics give the probability of each possible pair of next state $s'$ and next reward $r$. The dynamics $p(s', r|s, a)$ are not always available; if they are available, we use planning, i.e. DP. DP applies only if the MDP is finite, i.e. $S$, $A$ and $R$ are finite and the dynamics $p(s', r|s, a)$ are known.

**Iterative policy evaluation** uses the Bellman equation to update the value function iteratively.

<img src="https://i.imgur.com/9vJ7or6.png" width="380">

Remember that an undiscounted episode has $\gamma = 1$.

**An example of iterative policy evaluation:**

<img src="https://i.imgur.com/UInuo6m.png" width="360"> <img src="https://i.imgur.com/dVOghsU.png" width="360">

**Termination of iterative policy evaluation**: formally, iterative policy evaluation only converges in the limit $k \rightarrow \infty$, where $k$ is the iteration index. A typical stopping condition uses $\Delta$ (stop when it gets smaller than some $\theta$ chosen a priori).

<details close> <summary> <b>Iterative Policy Evaluation Algorithm</b></summary>
<p><img src="https://i.imgur.com/YY5AMUB.png" width="450"></p>
</details>
</br>

**Improving the policy once the value function has been computed** for some $k$ iterations: the policy is improved by acting greedily on $v_{\pi}(s)$.

<img src="https://i.imgur.com/9gv70xH.png" width="340"> <img src="https://i.imgur.com/DlOz8uN.png" width="340">

**Example policy improvement**

<img src="https://i.imgur.com/MQDFnjA.png" width="400">

**Evaluate the value function and improve the policy in a loop** (sketched in code below):
1. start from $v_{(0)}$ and evaluate the policy by updating the value function
2. act greedily on $v_1$ to get an improved policy
3. evaluate the improved policy to get a better value function $v_2$
4. act greedily again to improve the policy
5. and so forth...
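
A minimal sketch of this evaluate/improve loop - iterative policy evaluation followed by greedy improvement - on a made-up finite MDP whose dynamics $p(s',r|s,a)$ are given as a dictionary (all names and numbers are illustrative):

```python
GAMMA, THETA = 0.9, 1e-6
STATES, ACTIONS = ["s0", "s1"], ["go", "stay"]
# p(s', r | s, a) as {(s, a): [(prob, s', r), ...]}
dynamics = {("s0", "go"):   [(1.0, "s1", 1.0)],
            ("s0", "stay"): [(1.0, "s0", 0.0)],
            ("s1", "go"):   [(1.0, "s0", -1.0)],
            ("s1", "stay"): [(1.0, "s1", 0.5)]}

def evaluate(policy):
    # iterative policy evaluation with the Bellman expectation backup
    V = {s: 0.0 for s in STATES}
    while True:
        delta = 0.0
        for s in STATES:
            v_new = sum(p * (r + GAMMA * V[s2])
                        for p, s2, r in dynamics[(s, policy[s])])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < THETA:          # typical stopping condition on Delta
            return V

def improve(V):
    # act greedily on the current value function
    def backup(s, a):
        return sum(p * (r + GAMMA * V[s2]) for p, s2, r in dynamics[(s, a)])
    return {s: max(ACTIONS, key=lambda a: backup(s, a)) for s in STATES}

policy = {s: "stay" for s in STATES}
while True:                        # alternate evaluation (E) and improvement (I)
    V = evaluate(policy)
    new_policy = improve(V)
    if new_policy == policy:
        break
    policy = new_policy
print(policy, {s: round(v, 2) for s, v in V.items()})
```
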
Improving policies and value functions by alternating between evaluation (E) and improvement (I) steps gives a sequence:

<img src="https://i.imgur.com/v7Q7zlx.png">
<img src="https://i.imgur.com/OKyfpRU.png" width="350">

**Principle of Optimality**: any optimal policy can be subdivided into two components: an optimal first action, followed by an optimal policy from the successor state.

**Generalized Policy Iteration (GPI)** describes the general idea of letting policy evaluation and policy improvement processes interact, independently of the granularity and other details of the two processes. Almost all reinforcement learning methods are well described as GPI: they have identifiable policies and value functions, the policy is always improved with respect to the value function, and the value function is driven towards the value function of the policy.

**Summary of iterative policy evaluation, policy improvement and value iteration:**

<img src="https://i.imgur.com/5ZbNqUj.png">

## Monte Carlo Methods

Monte Carlo methods are used when the MDP model is unknown but experience can be sampled, or when the MDP model is known but too big to use except by sampling. MC relies on repeated random sampling to obtain numerical results. MC is model-free: no knowledge of the MDP transitions/rewards is needed.
* MC methods learn directly from episodes of experience, real or simulated
* MC learns only from complete episodes (terminating environment): no bootstrapping
* MC uses the simplest possible idea: value = mean return
* caveat: MC can only be applied to episodic MDPs; all episodes must terminate
* MC must wait until the end of an episode before the return is known, so it can only learn from complete sequences
* MC samples
* good convergence properties
* not very sensitive to initial values

Monte Carlo methods sample and average returns for each state-action pair, much the same way bandits do, with the main difference that in bandits the action has no influence on the next state. Monte Carlo methods follow generalized policy iteration (GPI) too: where in DP we computed value functions from knowledge of the MDP, Monte Carlo learns value functions from sample returns.

**Monte Carlo policy evaluation** has the goal of learning the state-value function for a given policy. Recall that the value of a state is the expected return (expected cumulative future discounted reward) starting from that state. All Monte Carlo methods estimate it from experience: average the returns observed after visits to that state; as more returns are observed, the average should converge to the expected value. MC learns $v_\pi$ from episodes of experience under policy $\pi$:
$$S_1, A_1, R_2, \dots, S_k \sim \pi$$
<mark>Monte Carlo policy evaluation uses the empirical mean return instead of the expected return that DP uses.</mark>
* recall that the return is the total discounted reward: $G_t = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{T−1}R_T$
* recall that the value function is the expected return: $v_\pi(s) = E_\pi[G_t \mid S_t = s]$

**MC prediction**: run some trajectories under the current policy and take the mean return to estimate its value - prediction is about how much reward is gained by following the current policy. The policy is given; the task is to figure out its value function. MC is suited to episodic tasks, where there is a start and an end. Empirical mean: collect as many sample returns as possible from each state onwards. What we care about in prediction is how good the states are under policy $\pi$; the expected return is the value function.

The incremental mean is used in online learning, where the mean is updated at each step. Using a constant $\alpha$ in the incremental update is a way to forget old estimates; it is ideal for non-stationary problems, where the world keeps changing and keeping track of old estimates is not useful for the current ones, so a hyper-parameter $\alpha$ is used instead. MC has zero bias. MC does not exploit the Markov property, and it has to go through the whole episode, from start to end, before it can make a value-function update.

**First-visit Monte Carlo policy evaluation**

<details close> <summary> Incremental first-visit MC policy evaluation algorithm</summary>
<p><img src="https://i.imgur.com/kRcQgun.png"></p>
</details>
</br>

The idea of first-visit is to evaluate state $s$ using the first time step $t$ at which $s$ is visited in an episode, and then (sketched in code below):
* increment the counter of the visited state: <mark>$N(s) \leftarrow N(s) + 1$</mark>
* increment the total return: <mark>$S(s) \leftarrow S(s) + G_t$</mark>
* estimate the value by the empirical mean return: <mark>$V(s) = \frac{S(s)}{N(s)}$</mark>

<mark>$V(S_t) \leftarrow V(S_t) + \frac{1}{N(S_t)} [G_t − V(S_t)]$</mark> is the incremental Monte Carlo update, where $\frac{1}{N(S_t)}$ is the step size and $[G_t − V(S_t)]$ is the error between the newest return and the current estimate. Remember that $V$ is simply a mean of returns! Remember also that in non-stationary problems it can be useful to track a running mean, i.e. to forget old episodes: <mark>$V(S_t) \leftarrow V(S_t) + \alpha[G_t − V(S_t)]$</mark>, using a constant $\alpha$ instead of the step size $1/N$.

**Every-visit MC policy evaluation** increments the counter and the return and updates the value estimate every time a state is encountered.

<details close><summary>Every-visit MC policy evaluation algorithm</summary>
<p><img src="https://i.imgur.com/hPU6SId.png"></p>
</details>

The difference between first-visit and every-visit: the first-visit MC estimate is said to be unbiased. The following applies to both: because of the law of large numbers, the MC value function always converges, $V(s) \rightarrow v_\pi(s)$ as $N(s) \rightarrow \infty$, i.e. as each state is visited infinitely often.
</br>
Remember that MC value estimates for each state are independent, i.e. MC does not bootstrap (in contrast e.g. to DP); bootstrapping = updating estimates on the basis of other estimates.
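
A minimal sketch of incremental first-visit MC policy evaluation under a uniformly random policy, on a made-up random-walk environment (everything here is illustrative):

```python
import random

# First-visit MC with the incremental update
# V(S_t) <- V(S_t) + (1/N(S_t)) * (G_t - V(S_t)) on a random walk over
# states 0..4 that terminates at either end (reward +1 on the right).
GAMMA = 1.0

def sample_episode(rng):
    s, episode = 2, []                     # start in the middle
    while 0 < s < 4:
        a = rng.choice([-1, +1])           # uniformly random policy
        s_next = s + a
        r = 1.0 if s_next == 4 else 0.0
        episode.append((s, r))
        s = s_next
    return episode                         # list of (S_t, R_{t+1})

def first_visit_mc(num_episodes=5000, seed=0):
    rng = random.Random(seed)
    V = {s: 0.0 for s in range(5)}
    N = {s: 0 for s in range(5)}
    for _ in range(num_episodes):
        episode = sample_episode(rng)
        G, returns = 0.0, {}
        for s, r in reversed(episode):     # G_t = R_{t+1} + gamma * G_{t+1}
            G = r + GAMMA * G
            returns[s] = G                 # keeps the FIRST visit's return
        for s, G_t in returns.items():
            N[s] += 1
            V[s] += (G_t - V[s]) / N[s]    # incremental mean
    return V

print({s: round(v, 2) for s, v in first_visit_mc().items()})
```
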
**MC control**: not just estimating the value function, but now also optimizing it. Generally, greedy improvement is used for improving the policy, but it gives little exploration. Use $Q(s,a)$ as the action-value function for the model-free case: using $Q$ removes the need to roll the model ahead. For $\epsilon$-greedy, with probability $1-\epsilon$ act greedily and with probability $\epsilon$ act randomly - either pick the max action or pick a random one. $\epsilon$-greedy is a policy improvement. Greedy policy improvement over $v_\pi$ requires a model of the MDP (remember the Bellman equation?):
$$\pi'(s) = \operatorname*{arg\,max}_a \sum_{s',r} p(s', r|s, a) \left[ r + \gamma v_\pi(s') \right]$$
So for the model-free case, the greedy policy improves over $q_\pi$ instead of $v_\pi$; $q_\pi$ is model-free:
<mark>$\pi'(s) = \operatorname*{arg\,max}_a q_\pi(s, a)$</mark>

**Monte Carlo control with exploring starts**

<details><summary>MC control with Exploring Starts Algorithm</summary><p>
<img src="https://i.imgur.com/JRKzc2a.png">
</p></details>
</br>

Each episode starts in a state-action pair, and every pair has a **non-zero probability** of being selected as the start. This guarantees that each pair will be visited an infinite number of times in the limit of an infinite number of episodes. (Recall the optimistic initial values (OIV) idea in bandits? This follows the same logic of initializing things so that everything gets tried.) Exploring starts are a fix for a general MC problem: given a deterministic policy, many state-action pairs might never be visited.

**$\epsilon$-Greedy Exploration**

<details><summary>On-policy first-visit MC control for eps-soft policies Algorithm</summary><p>
<img src="https://i.imgur.com/1lkN8Oo.png" width="400">
</p></details>
</br>

The idea of $\epsilon$-greedy is to ensure continual exploration:
* all actions are tried with non-zero probability
* with probability $1 − \epsilon$ choose the greedy action
* with probability $\epsilon$ choose an action at random
* soft policy: $\pi(a|s) > 0$ for all $s \in S$ and all $a \in A$
* $\epsilon$-soft policy: $\pi(a|s) \geq \frac{\epsilon}{|A(s)|}$ for all $s \in S$ and all $a \in A$, with $\epsilon > 0$
* $\epsilon$-greedy: all actions are given the minimal probability $\frac{\epsilon}{|A(s)|}$, and the remaining bulk $1 − \epsilon + \frac{\epsilon}{|A(s)|}$ is given to the greedy action

**GLIE on-policy first-visit MC control** explores all state-action pairs infinitely often and converges to the optimal action-value function.

**On-policy vs. off-policy methods**: on-policy learns on the job - it learns about policy $\pi$ from experience sampled from $\pi$. Off-policy looks over someone else's shoulder - it learns about policy $\pi$ from experience sampled from some other policy $\mu$. On-policy $\epsilon$-greedy control is a compromise: it learns action values not for the optimal policy but for a near-optimal policy that still explores (not the true argmax policy), and with a GLIE schedule $\epsilon$ might get too small too soon to learn anything useful within a reasonable number of episodes. Off-policy methods use two policies: the target policy, the policy we want to learn about (and improve), and the behaviour policy, the policy used to generate the episodes. We compute $v_\pi(s)$ or $q_\pi(s, a)$ to evaluate the target policy $\pi(a|s)$ while following the behaviour policy $\mu(a|s)$. On-policy methods can be seen as a special case of off-policy methods where target and behaviour policy are the same. For off-policy MC, importance sampling is mostly used. According to David Silver: "off-policy MC is extremely high variance" (he was talking about the ordinary importance sampling method), "in practice it is just useless", "MC learning is a really bad idea off-policy". <mark>So if the problem is off-policy, use temporal-difference learning.</mark>

## Temporal Difference

TD methods learn directly from episodes of experience:
* TD is model-free: no knowledge of the dynamics is needed
* TD tries to improve a guess with a guess (bootstrapping)
* TD learns from samples
* TD can learn from incomplete episodes

<mark>TD(0) updates at each step.</mark>

<img src="https://i.imgur.com/wfvysMn.png">

**Prediction with TD** learns from the environment like MC and is model-free. The difference from MC is that it can learn from incomplete episodes: it combines bootstrapping and sampling. Bootstrapping: use existing value estimates (guesses about the remaining return) to estimate new ones. TD learns a value function under policy $\pi$, making an estimate of the value function at every step. In MC the update happens after the episode ends, but TD updates at each step; it does not wait until the end of the episode to update the value-function estimates. TD shines in situations where you never reach the end of an episode, like continuing environments that go on forever.
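
A minimal sketch of tabular TD(0) prediction, updating $V(S_t)$ after every single step, on the same made-up random walk as the MC sketch above:

```python
import random

# Tabular TD(0):
# V(S_t) <- V(S_t) + alpha * (R_{t+1} + gamma * V(S_{t+1}) - V(S_t))
# on a random walk over states 0..4, terminal at both ends, reward +1 right.
GAMMA, ALPHA = 1.0, 0.05

def td0_prediction(num_episodes=5000, seed=0):
    rng = random.Random(seed)
    V = {s: 0.0 for s in range(5)}         # terminal states keep value 0
    for _ in range(num_episodes):
        s = 2
        while 0 < s < 4:
            s_next = s + rng.choice([-1, +1])   # uniformly random policy
            r = 1.0 if s_next == 4 else 0.0
            # bootstrapped update after every single step
            V[s] += ALPHA * (r + GAMMA * V[s_next] - V[s])
            s = s_next
    return V

print({s: round(v, 2) for s, v in td0_prediction().items()})
```
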
Because it computes the value-function update from only a fraction of the trajectory, TD is biased, although it carries less variance than MC methods. TD exploits the MDP structure and the Markov property: there is no need to look at the whole trajectory before making a value-function update. TD(0) uses one-step backups, i.e. it makes a value-function update at every step; these are called shallow-backup methods. TD methods with a higher $n$ have deeper backups, looking more than one step ahead before updating. It is tricky to find the best $n$ for the update, so TD($\lambda$) is used to bypass the choice: it combines all $n$-step returns, weighting the $n$-step return by $(1−\lambda)\lambda^{n−1}$, so that longer lookaheads get exponentially smaller weight. <mark>$\lambda = 1$ is Monte Carlo, and $\lambda = 0$ is TD(0).</mark> $\lambda$ tells you what the weight decay looks like: a high $\lambda$ spreads the weight over many steps ahead, while a low $\lambda$, like 0, concentrates the weight on the immediate next step. That is why $\lambda = 1$ corresponds to MC: it effectively looks forward until the episode ends.

**Learning-rate schedule**: TD(0) is sensitive to the learning rate. TD(0) converges to the true value function $v_\pi$ if $\alpha$ is sufficiently small and the sequence of learning rates $\alpha_t$ adheres to the Robbins-Monro conditions:
$$\underbrace{\sum^{\infty}_{t=1} \alpha_t = \infty}_{\text{(the learning rate never gets too small)}} \qquad \underbrace{\sum^{\infty}_{t=1} \alpha^2_t < \infty}_{\text{(the learning rate decays over time)}}$$

<img src="https://i.imgur.com/YJ4xHAo.png" width="400">
</br>

**SARSA** has a particular update pattern and works online; it is TD(0) applied to action values.

<details open> <summary>SARSA Algorithm</summary>
<p></p>
</details>

The SARSA update uses the 5-tuple $(S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1})$, and the update happens after every transition from a state $S_t$:
$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\left[R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) − Q(S_t, A_t)\right]$$

**Expected SARSA** generalizes Q-learning while improving over SARSA. If we use a different policy $\mu$ for behavior, Expected SARSA becomes an off-policy method; and if we use a purely greedy policy $\pi$ for the update target and a different policy $\mu$ to explore (for example $\epsilon$-greedy), Expected SARSA becomes exactly Q-learning. Q-learning is off-policy learning of action values $Q(s,a)$ with no need for importance sampling.

**Deep Q-learning**

<img src="https://i.imgur.com/IbZugXx.png">

Deep Q-learning is the Q-learning algorithm with a deep convolutional neural network $\hat{q}(s, a, w)$ used to approximate $q_*(s, a)$. This essentially replaces the table $Q(s, a)$ we have used so far with a parametric function approximator $\hat{q}(s, a, w)$, where $w$ are the parameters that will be learned and $L(w)$ is the loss function we need to minimize.

## Policy Gradient Methods

<details open> <summary> <mark>Policy Approximations</mark></summary> <p></p> </details>
<details close> <summary> <mark>The Policy Gradient Theorem</mark></summary> <p></p> </details>
<details close> <summary> <mark>REINFORCE: Monte Carlo Policy Gradient</mark></summary> <p></p> </details>
<details close> <summary> <mark>REINFORCE with Baseline</mark></summary> <p></p> </details>
<details close> <summary> <mark>Actor-Critic Methods</mark></summary> <p></p> </details>
<details close> <summary> <mark>Policy Parametrization for Continuous Actions</mark></summary> <p></p> </details>
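
To close the temporal-difference material above, a minimal sketch of the tabular SARSA (on-policy) and Q-learning (off-policy) updates on a made-up corridor environment; all names and numbers are illustrative:

```python
import random

# Corridor: states 0..5, actions -1/+1, reward +1 for reaching state 5,
# episode ends at either end. Tabular SARSA and Q-learning side by side.
GAMMA, ALPHA, EPSILON = 0.95, 0.1, 0.1
ACTIONS = [-1, +1]

def eps_greedy(Q, s, rng):
    if rng.random() < EPSILON:
        return rng.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(s, a)])

def step(s, a):
    s2 = s + a
    r = 1.0 if s2 == 5 else 0.0
    return s2, r, s2 in (0, 5)             # next state, reward, terminal?

def sarsa(episodes=2000, seed=0):
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in range(6) for a in ACTIONS}
    for _ in range(episodes):
        s = 2
        a = eps_greedy(Q, s, rng)
        done = False
        while not done:
            s2, r, done = step(s, a)
            a2 = eps_greedy(Q, s2, rng) if not done else ACTIONS[0]
            target = r if done else r + GAMMA * Q[(s2, a2)]
            # on-policy: bootstrap with the action actually taken next
            Q[(s, a)] += ALPHA * (target - Q[(s, a)])
            s, a = s2, a2
    return Q

def q_learning(episodes=2000, seed=0):
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in range(6) for a in ACTIONS}
    for _ in range(episodes):
        s, done = 2, False
        while not done:
            a = eps_greedy(Q, s, rng)
            s2, r, done = step(s, a)
            best_next = 0.0 if done else max(Q[(s2, b)] for b in ACTIONS)
            # off-policy: bootstrap with the greedy action in the next state
            Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])
            s = s2
    return Q

Q_s, Q_q = sarsa(), q_learning()
print(max(ACTIONS, key=lambda a: Q_s[(2, a)]),
      max(ACTIONS, key=lambda a: Q_q[(2, a)]))
```
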
