---
title: model-free prediction
tags: rl
---
# model-free prediction
estimate the value function of an *unknown* MDP
## Monte-Carlo reinforcement learning
- Monte-Carlo methods learn directly from episodes of experience
- MC is model-free: no knowledge of MDP transitions / rewards
- MC learns from *complete* episodes: no bootstrapping
- MC uses the simplest possible idea: value = mean return
- caveat: can only apply MC to *episodic* MDPs
- all episodes must terminate
### Monte-Carlo policy evaluation
- goal: learn $v_\pi$ from episodes of experience under policy $\pi$
$S_1, A_1, R_2, \ldots, S_k \sim \pi$
- recall that the *return* is the total discounted reward:
$G_t = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{T-1} R_T$
- recall that the value function is the expected return:
$v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$
- Monte-Carlo policy evaluation uses *empirical mean* return instead of *expected return*
### first-visit Monte-Carlo policy evaluation
- to evaluate state s
- the **first** time-step t that state s is visited in an episode,
- increment counter N(s) <- N(s) + 1
- increment total return S(s) <- S(s) + $G_t$
- value is estimated by mean return V(s) = S(s)/N(s)
- by law of large numbers, V(s) -> $v_\pi$(s) as N(s) -> $\infty$
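a minimal sketch of first-visit MC policy evaluation in Python, assuming a hypothetical episodic environment with `env.reset() -> state` and `env.step(action) -> (next_state, reward, done)`, and a `policy(state) -> action` function (none of these names come from the notes):

```python
from collections import defaultdict

def first_visit_mc(env, policy, num_episodes, gamma=1.0):
    """First-visit Monte-Carlo policy evaluation (sketch)."""
    N = defaultdict(int)      # visit counts N(s)
    S = defaultdict(float)    # total returns S(s)
    V = defaultdict(float)    # value estimates V(s)

    for _ in range(num_episodes):
        # generate one complete episode under the policy
        episode = []
        state, done = env.reset(), False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            episode.append((state, reward))   # reward here is R_{t+1}
            state = next_state

        # first time-step at which each state is visited in this episode
        first_visit = {}
        for t, (s, _) in enumerate(episode):
            if s not in first_visit:
                first_visit[s] = t

        # walk backwards to accumulate returns G_t, update only first visits
        G = 0.0
        for t in reversed(range(len(episode))):
            s, r = episode[t]
            G = r + gamma * G
            if first_visit[s] == t:
                N[s] += 1
                S[s] += G
                V[s] = S[s] / N[s]
    return V
```

walking the episode backwards lets each return $G_t$ be computed in constant time per step instead of re-summing the tail of the episode.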
### every-visit Monte-Carlo policy evaluation
- to evaluate state s
- **every** time-step t that state s is visited in an episode,
- increment counter N(s) <- N(s) + 1
- increment total return S(s) <- S(s) + $G_t$
- value is estimated by mean return V(s) = S(s)/N(s)
- by law of large numbers, V(s) -> $v_\pi$(s) as N(s) -> $\infty$
### incremental mean
- the mean $\mu_1, \mu_2, ...$ of a sequence $x_1, x_2, ...$ can be computed incrementally:
$\mu_k = \frac{1}{k}\sum_{j=1}^{k} x_j = \mu_{k-1} + \frac{1}{k}(x_k - \mu_{k-1})$
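a quick numerical check of this identity (the numbers are arbitrary):

```python
xs = [2.0, 4.0, 9.0, 1.0]

mu = 0.0
for k, x in enumerate(xs, start=1):
    mu += (x - mu) / k        # mu_k = mu_{k-1} + (x_k - mu_{k-1}) / k

assert abs(mu - sum(xs) / len(xs)) < 1e-12   # matches the batch mean
```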
### incremental Monte-Carlo updates
- update V(s) incrementally after episode $S_1, A_1, R_2,...,S_T$
- for each state $S_t$ with return $G_t$
$N(S_t)$ <- $N(S_t)$ + 1
$V(S_t)$ <- $V(S_t) + \frac{1}{N(S_t)} (G_t-V(S_t))$
- in non-stationary problems, it can be useful to track a running mean, i.e. forget old episodes
$V(S_t)$ <- $V(S_t) + \alpha (G_t-V(S_t))$
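as a sketch, an every-visit incremental MC update for one finished episode; the `(state, reward)` episode format and all names are illustrative, and the `alpha` switch covers both the exact running mean and the constant-step-size (non-stationary) case:

```python
from collections import defaultdict

def mc_update(V, N, episode, gamma=1.0, alpha=None):
    """Every-visit incremental MC update for one finished episode.

    `episode` is a list of (state, reward) pairs, where reward is R_{t+1}.
    With alpha=None the step size is the exact running mean 1/N(s);
    with a constant alpha old episodes are gradually forgotten.
    """
    G = 0.0
    for state, reward in reversed(episode):
        G = reward + gamma * G              # return G_t from this time-step
        N[state] += 1
        step = alpha if alpha is not None else 1.0 / N[state]
        V[state] += step * (G - V[state])   # V(S_t) <- V(S_t) + step * (G_t - V(S_t))

# usage: V, N = defaultdict(float), defaultdict(int); mc_update(V, N, episode)
```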
## temporal-difference learning
- TD methods learn directly from episodes of experience
- TD is model-free: no knowledge of MDP transitions / rewards
- TD learns from incomplete episodes, by bootstrapping
- TD updates a guess towards a guess
## MC and TD
- goal: learn $v_\pi$ online from experience under policy $\pi$
- incremental every-visit Monte-Carlo
- update value $V(S_t)$ toward the actual return $G_t$
$V(S_t)$ <- $V(S_t) + \alpha (G_t-V(S_t))$
- simplest temporal-difference learning algorithm: TD(0)
- update value $V(S_t)$ with estimated return $R_{t+1}+ \gamma V(S_{t+1})$
$V(S_t)$ <- $V(S_t)$ + $\alpha (R_{t+1}+ \gamma V(S_{t+1}) -V(S_t))$
- $R_{t+1}+ \gamma V(S_{t+1})$ is called the TD target
- $\delta_t$ = $R_{t+1}+ \gamma V(S_{t+1}) - V(S_t)$ is called the TD error
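a TD(0) sketch under the same hypothetical `env`/`policy` interface as the MC sketch above:

```python
from collections import defaultdict

def td0(env, policy, num_episodes, alpha=0.1, gamma=1.0):
    """Tabular TD(0) prediction (sketch)."""
    V = defaultdict(float)
    for _ in range(num_episodes):
        state, done = env.reset(), False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            # TD target bootstraps from the current estimate; terminal value is 0
            td_target = reward + (0.0 if done else gamma * V[next_state])
            td_error = td_target - V[state]
            V[state] += alpha * td_error     # update online, after every step
            state = next_state
    return V
```

unlike the MC sketch, the update sits inside the episode loop: the value estimate changes after every transition, not only at the end of the episode.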
### advantages and disadvantages of MC vs TD
- TD can learn *before* knowing the final outcome
- TD can learn online after every step
- MC must wait until end of episode before return is known
- TD can learn *without* the final outcome
- TD can learn from incomplete sequences
- MC can only learn from complete sequences
- TD works in continuing (non-terminating) environments
- MC only works for episodic (terminating) environments
### bias/variance trade-off
- return $G_t = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{T-1} R_T$ is an *unbiased* estimate of $v_\pi(S_t)$
- true TD target $R_{t+1}+ \gamma v_{\pi}(S_{t+1})$ is *unbiased* estimate of $v_{\pi}(S_{t})$
- TD target $R_{t+1}+ \gamma V(S_{t+1})$ is a *biased* estimate of $v_{\pi}(S_{t})$, because it uses the current estimate $V$ rather than $v_\pi$
- TD target is much lower variance than the return
- return depends on *many* random actions, transitions, rewards
- TD target depends on *one* random action, transition, reward
### advantages and disadvantages cont.
- MC has high variance, zero bias
- good convergence properties (even with function approximation)
- not very sensitive to initial value
- very simple to understand and use
- TD has low variance, some bias
- usually more efficient than MC
- TD(0) converges to $v_{\pi}(s)$ (but not always with function approximation)
- more sensitive to initial value
### batch MC and TD
- MC and TD converge: V(s) $\longrightarrow$ $v_{\pi}(s)$ as experience $\longrightarrow \infty$
- but what about batch solution for finite experience?

- e.g. repeatedly sample episode $k \in [1, K]$
- apply MC or TD(0) to episode k
**example**: two states A, B; no discounting; 8 episodes of experience
- A, 0, B, 0
- B, 1
- B, 1
- B, 1
- B, 1
- B, 1
- B, 1
- B, 0

what are V(A) and V(B)? V(B) = 6/8 = 3/4; V(A) depends on whether we use MC or TD

(figure: the maximum-likelihood Markov model fitted to this data: A transitions to B with reward 0; from B the episode terminates with reward 1 w.p. 0.75, reward 0 w.p. 0.25)
### certainty equivalence
- MC converges to solution with minimum mean-squared error
- best fit to the observed returns
- in the AB example, V(A) = 0

- TD(0) converges to solution of max likelihood Markov model
- solution to the MDP $(S, A, \hat{P}, \hat{R}, \gamma)$ that best fits the data

- in the AB example, V(A) = 0.75
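a small script that reproduces both numbers from the AB example (a sketch; the batch-TD part simply resweeps the fixed experience with a small constant step size until it settles):

```python
import random

# the eight episodes, as (state, reward, next_state) steps; None = terminal
episodes = (
    [[("A", 0, "B"), ("B", 0, None)]]
    + [[("B", 1, None)]] * 6
    + [[("B", 0, None)]]
)
steps = [step for ep in episodes for step in ep]

# batch MC: V(s) = mean of the returns observed from s (gamma = 1)
returns = {"A": [], "B": []}
for ep in episodes:
    G = 0
    for state, reward, _ in reversed(ep):   # accumulate the return backwards
        G += reward
        returns[state].append(G)
print({s: sum(g) / len(g) for s, g in returns.items()})   # {'A': 0.0, 'B': 0.75}

# batch TD(0): replay the same transitions over and over
V = {"A": 0.0, "B": 0.0}
random.seed(0)
for _ in range(200_000):
    state, reward, next_state = random.choice(steps)
    target = reward + (V[next_state] if next_state is not None else 0.0)
    V[state] += 0.01 * (target - V[state])
print(V)   # roughly {'A': 0.75, 'B': 0.75}
```

MC fits the observed returns, so V(A) stays at 0; TD effectively fits the maximum-likelihood Markov model, in which A always leads to B and therefore inherits V(B) = 0.75.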
### advantages and disadvantages of MC vs TD cont.
- TD exploits Markov property (future is independent of the past given the present)
- usually more efficient in Markov environments
- MC does not exploit Markov property
- usually more effective in non-Markov environments
### Monte-Carlo backup
- update toward the complete sampled return (one full trajectory to termination)
$V(S_t)$ <- $V(S_t) + \alpha (G_t-V(S_t))$
### temporal-difference backup
- update toward the one-step TD target (sample one transition, then bootstrap)
$V(S_t)$ <- $V(S_t) + \alpha (R_{t+1}+ \gamma V(S_{t+1}) - V(S_t))$
### dynamic programming backup
- update toward the full one-step expectation over all successors (no sampling, bootstrap)
$V(S_t)$ <- $\mathbb{E}_\pi [R_{t+1}+ \gamma V(S_{t+1})]$
### bootstrapping and sampling
- bootstrapping: update involves an estimate
- MC does not bootstrap
- DP bootstraps
- TD bootstraps
- sampling: update samples an expectation
- MC samples
- DP does not sample
- TD samples

## TD($\lambda$)
### n-step prediction
- let TD target look n steps into the future

### n-step return
- consider the following n-step returns for n = 1, 2, $\infty$:
$G^{(1)}_t = R_{t+1} + \gamma V(S_{t+1})$ (TD)
$G^{(2)}_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 V(S_{t+2})$
$G^{(\infty)}_t = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{T-1} R_T$ (MC)
- define the n-step return
$G^{(n)}_t = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^n V(S_{t+n})$
- n-step temporal-difference learning
$V(S_t) \longleftarrow V(S_t) + \alpha (G^{(n)}_t - V(S_t))$
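a sketch of the n-step return; `rewards[k]` holds $R_{k+1}$ and `values[k]` holds the current $V(S_k)$ (both names are illustrative), and the return falls back to the full MC return when $t+n$ runs past the end of the episode:

```python
def n_step_return(rewards, values, t, n, gamma=1.0):
    """n-step return G_t^(n) for an episode of length T = len(rewards)."""
    T = len(rewards)
    steps = min(n, T - t)                    # truncate at the end of the episode
    G = sum(gamma ** k * rewards[t + k] for k in range(steps))
    if t + n < T:
        G += gamma ** n * values[t + n]      # bootstrap from V(S_{t+n})
    return G

# n-step TD update: V[s_t] += alpha * (n_step_return(rewards, values, t, n) - V[s_t])
```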
### average n-step returns
- we can average n-step returns over different n
- e.g. average the 2-step and 4-step returns: $\frac{1}{2} G^{(2)}_t + \frac{1}{2} G^{(4)}_t$
- combines information from two different time-steps
### $\lambda$-return
- the $\lambda$-return $G^\lambda_t$ combines all n-step returns $G^{(n)}_t$
- using weight $(1 - \lambda) \lambda^{n-1}$
$G^\lambda_t = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G^{(n)}_t$
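a forward-view sketch that builds $G^\lambda_t$ from the `n_step_return` helper above; in an episode of length $T$, all n-step returns with $n \ge T - t$ equal the MC return, so their combined weight $\lambda^{T-t-1}$ is assigned to it in a single term (keeping the weights summing to 1):

```python
def lambda_return(rewards, values, t, lam, gamma=1.0):
    """Forward-view lambda-return G_t^lambda (sketch)."""
    T = len(rewards)
    G_lam = 0.0
    for n in range(1, T - t):                # n-step returns that still bootstrap
        G_lam += (1 - lam) * lam ** (n - 1) * n_step_return(rewards, values, t, n, gamma)
    G_lam += lam ** (T - t - 1) * n_step_return(rewards, values, t, T - t, gamma)  # MC tail
    return G_lam
```

with $\lambda = 0$ this reduces to the one-step TD target, and with $\lambda = 1$ all weight goes to the MC return.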