---
title: model-free prediction
tags: rl
---

# model-free prediction

estimate the value function of an *unknown* MDP

## Monte-Carlo reinforcement learning

- Monte-Carlo methods learn directly from episodes of experience
- MC is model-free: no knowledge of MDP transitions / rewards
- MC learns from *complete* episodes: no bootstrapping
- MC uses the simplest possible idea: value = mean return
- caveat: can only apply MC to *episodic* MDPs
    - all episodes must terminate

### Monte-Carlo policy evaluation

- goal: learn $v_\pi$ from episodes of experience under policy $\pi$: $S_1, A_1, R_2, \dots, S_k \sim \pi$
- recall that the *return* is the total discounted reward: $G_t = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{T-1} R_T$
- recall that the value function is the expected return: $v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$
- Monte-Carlo policy evaluation uses the *empirical mean* return instead of the *expected* return

### first-visit Monte-Carlo policy evaluation

- to evaluate state s,
- the **first** time-step t that state s is visited in an episode,
    - increment counter N(s) <- N(s) + 1
    - increment total return S(s) <- S(s) + $G_t$
- value is estimated by the mean return V(s) = S(s)/N(s)
- by the law of large numbers, V(s) -> $v_\pi(s)$ as N(s) -> $\infty$

### every-visit Monte-Carlo policy evaluation

- to evaluate state s,
- **every** time-step t that state s is visited in an episode,
    - increment counter N(s) <- N(s) + 1
    - increment total return S(s) <- S(s) + $G_t$
- value is estimated by the mean return V(s) = S(s)/N(s)
- by the law of large numbers, V(s) -> $v_\pi(s)$ as N(s) -> $\infty$

### incremental mean

- the mean $\mu_1, \mu_2, ...$ of a sequence $x_1, x_2, ...$ can be computed incrementally,

![](https://i.imgur.com/g5DGfQ6.png)

### incremental Monte-Carlo updates

- update V(s) incrementally after episode $S_1, A_1, R_2, \dots, S_T$
- for each state $S_t$ with return $G_t$:
    - $N(S_t)$ <- $N(S_t) + 1$
    - $V(S_t)$ <- $V(S_t) + \frac{1}{N(S_t)} (G_t - V(S_t))$
- in non-stationary problems, it can be useful to track a running mean, i.e. forget old episodes (see the sketch below):
    - $V(S_t)$ <- $V(S_t) + \alpha (G_t - V(S_t))$
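As a concrete illustration of the updates above, here is a minimal sketch of incremental every-visit Monte-Carlo prediction, assuming a hypothetical `env` with `reset()` / `step(action)` and a `policy(state)` function (none of these names come from the source):

```python
# minimal sketch of incremental every-visit Monte-Carlo prediction
# assumes: env.reset() -> state, env.step(action) -> (next_state, reward, done),
# policy(state) -> action  (all hypothetical, for illustration only)
from collections import defaultdict

def mc_prediction(env, policy, num_episodes, gamma=1.0):
    N = defaultdict(int)      # visit counts N(s)
    V = defaultdict(float)    # value estimates V(s)
    for _ in range(num_episodes):
        # generate one complete episode S_1, A_1, R_2, ..., S_T under the policy
        episode = []                                  # list of (state, reward) pairs
        state, done = env.reset(), False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            episode.append((state, reward))           # reward received on leaving `state`
            state = next_state
        # walk backwards so the return G_t accumulates in a single pass
        G = 0.0
        for state, reward in reversed(episode):
            G = reward + gamma * G
            N[state] += 1
            V[state] += (G - V[state]) / N[state]     # incremental mean update
            # for non-stationary problems, replace 1/N(state) with a constant alpha
    return V
```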
## temporal-difference learning

- TD methods learn directly from episodes of experience
- TD is model-free: no knowledge of MDP transitions / rewards
- TD learns from *incomplete* episodes, by bootstrapping
- TD updates a guess towards a guess

## MC and TD

- goal: learn $v_\pi$ online from experience under policy $\pi$
- incremental every-visit Monte-Carlo
    - update value $V(S_t)$ toward the actual return $G_t$:
      $V(S_t)$ <- $V(S_t) + \alpha (G_t - V(S_t))$
- simplest temporal-difference learning algorithm: TD(0)
    - update value $V(S_t)$ toward the estimated return $R_{t+1} + \gamma V(S_{t+1})$:
      $V(S_t)$ <- $V(S_t) + \alpha (R_{t+1} + \gamma V(S_{t+1}) - V(S_t))$
    - $R_{t+1} + \gamma V(S_{t+1})$ is called the TD target
    - $\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$ is called the TD error

### advantages and disadvantages of MC vs TD

- TD can learn *before* knowing the final outcome
    - TD can learn online after every step
    - MC must wait until the end of the episode before the return is known
- TD can learn *without* the final outcome
    - TD can learn from incomplete sequences
    - MC can only learn from complete sequences
    - TD works in continuing (non-terminating) environments
    - MC only works for episodic (terminating) environments

### bias/variance trade-off

- the return $G_t = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{T-1} R_T$ is an *unbiased* estimate of $v_\pi(S_t)$
- the true TD target $R_{t+1} + \gamma v_\pi(S_{t+1})$ is an *unbiased* estimate of $v_\pi(S_t)$
- the TD target $R_{t+1} + \gamma V(S_{t+1})$ is a *biased* estimate of $v_\pi(S_t)$
- the TD target has much lower variance than the return
    - the return depends on *many* random actions, transitions, rewards
    - the TD target depends on *one* random action, transition, reward

### advantages and disadvantages (cont.)

- MC has high variance, zero bias
    - good convergence properties (even with function approximation)
    - not very sensitive to initial value
    - very simple to understand and use
- TD has low variance, some bias
    - usually more efficient than MC
    - TD(0) converges to $v_\pi(s)$ (but not always with function approximation)
    - more sensitive to initial value

### batch MC and TD

- MC and TD converge: V(s) $\longrightarrow$ $v_\pi(s)$ as experience $\longrightarrow \infty$
- but what about the batch solution for finite experience?

![](https://i.imgur.com/noYEzsL.png)

- e.g. repeatedly sample episode k $\in$ [1, K]
- apply MC or TD(0) to episode k

**example**: two states A, B; no discounting; 8 episodes of experience

- A, 0, B, 0
- B, 1
- B, 1
- B, 1
- B, 1
- B, 1
- B, 1
- B, 0

what is the value function V(A), V(B)?

V(B) = 6/8 = 3/4; V(A) depends on whether MC or TD is used

![](https://i.imgur.com/gAri0BE.png)

MDP ^

### certainty equivalence

- MC converges to the solution with minimum mean-squared error
    - best fit to the observed returns
    - in the AB example, V(A) = 0

![](https://i.imgur.com/8yzqkG3.png)

- TD(0) converges to the solution of the maximum likelihood Markov model
    - solution to the MDP $(S, A, \hat{P}, \hat{R}, \gamma)$ that best fits the data

![](https://i.imgur.com/RTdRo9Y.png)

- in the AB example, V(A) = 0.75 (see the sketch below)
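To make the AB example concrete, here is a minimal sketch (hypothetical helper code, not from the source) that recovers both answers: batch every-visit MC from the empirical returns, and the certainty-equivalence value from the maximum-likelihood Markov model:

```python
from collections import defaultdict

# the 8 episodes of the AB example as (state, reward-on-leaving-state) pairs, gamma = 1
episodes = [[("A", 0), ("B", 0)]] + [[("B", 1)]] * 6 + [[("B", 0)]]

# batch every-visit Monte-Carlo: empirical mean of the observed returns
returns = defaultdict(list)
for ep in episodes:
    rewards = [r for _, r in ep]
    for t, (s, _) in enumerate(ep):
        returns[s].append(sum(rewards[t:]))            # return G_t from this visit onward
V_mc = {s: sum(g) / len(g) for s, g in returns.items()}
print(V_mc)                                            # {'A': 0.0, 'B': 0.75}

# certainty equivalence: fit the maximum-likelihood Markov model, then solve it;
# in the data, A always transitions to B with reward 0, and from B the episode
# terminates with mean reward 6/8, so V(B) = 0.75 and V(A) = 0 + V(B) = 0.75
V_ce = {"B": 6 / 8}
V_ce["A"] = 0 + V_ce["B"]
print(V_ce)                                            # {'B': 0.75, 'A': 0.75}
```

This matches the values quoted above: batch MC gives V(A) = 0, while TD(0) run to convergence on the batch gives V(A) = 0.75.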
### advantages and disadvantages of MC vs TD (cont.)

- TD exploits the Markov property (the future is independent of the past given the present)
    - usually more efficient in Markov environments
- MC does not exploit the Markov property
    - usually more effective in non-Markov environments

### Monte-Carlo backup

![](https://i.imgur.com/0vAaXjv.png)

### temporal-difference backup

![](https://i.imgur.com/p6nX6Fq.png)

### dynamic programming backup

![](https://i.imgur.com/6QxrArc.png)

### bootstrapping and sampling

- bootstrapping: the update involves an estimate
    - MC does not bootstrap
    - DP bootstraps
    - TD bootstraps
- sampling: the update samples an expectation
    - MC samples
    - DP does not sample
    - TD samples

![](https://i.imgur.com/bGr3qyO.png)

## TD($\lambda$)

### n-step prediction

- let the TD target look n steps into the future

![](https://i.imgur.com/e7H5a29.png)

### n-step return

- consider the following n-step returns for n = 1, 2, ..., $\infty$

![](https://i.imgur.com/DanOROU.png)

- define the n-step return

![](https://i.imgur.com/PA5MBWc.png)

- n-step temporal-difference learning:
  $V(S_t) \longleftarrow V(S_t) + \alpha (G^{(n)}_t - V(S_t))$

### average n-step returns

- we can average n-step returns over different n
    - e.g. average the 2-step and 4-step returns
    - combines information from two different time-steps

### $\lambda$-return

- the $\lambda$-return $G^\lambda_t$ combines all n-step returns $G^{(n)}_t$
- using weight $(1 - \lambda) \lambda^{n-1}$:
  $G^\lambda_t = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G^{(n)}_t$ (see the sketch below)
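Below is a minimal sketch of the n-step return and the $\lambda$-return for one finite episode, assuming `rewards[t]` holds $R_{t+1}$ and `values[i]` holds the current estimate $V(S_i)$ (these inputs and function names are illustrative assumptions, not from the source); for an episodic task the leftover weight $\lambda^{T-t-1}$ is placed on the full Monte-Carlo return:

```python
# minimal sketch: n-step return and lambda-return for one finite episode
# rewards[t] = R_{t+1} for t = 0..T-1, values[i] = current estimate V(S_i) for i = 0..T-1
# (hypothetical inputs, for illustration only)

def n_step_return(rewards, values, t, n, gamma=1.0):
    """G_t^(n) = R_{t+1} + ... + gamma^{n-1} R_{t+n} + gamma^n V(S_{t+n})."""
    T = len(rewards)
    n = min(n, T - t)                                   # truncate at the end of the episode
    G = sum(gamma ** k * rewards[t + k] for k in range(n))
    if t + n < T:                                       # bootstrap only if S_{t+n} is non-terminal
        G += gamma ** n * values[t + n]
    return G

def lambda_return(rewards, values, t, lam, gamma=1.0):
    """G_t^lambda: weight (1 - lambda) lambda^(n-1) on G_t^(n); the remaining
    weight lambda^(T-t-1) goes on the full Monte-Carlo return (episodic case)."""
    T = len(rewards)
    G = (1 - lam) * sum(lam ** (n - 1) * n_step_return(rewards, values, t, n, gamma)
                        for n in range(1, T - t))
    G += lam ** (T - t - 1) * n_step_return(rewards, values, t, T - t, gamma)
    return G
```

Setting `lam = 0` recovers the TD(0) target $R_{t+1} + \gamma V(S_{t+1})$, while `lam = 1` recovers the full Monte-Carlo return $G_t$.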