# Model-Free Control
###### tags: `RL`
1. On-policy learning: learn about policy $\pi$ from experience sampled from $\pi$
2. Off-policy learning: learn about policy $\pi$ from experience sampled from a different policy $\mu$
- Main idea: GPI (Generalised Policy Iteration), so we have to choose a policy evaluation method and a policy improvement method.
## 1. On-Policy :
**CASE I:**
POLICY EVALUATION: MC evaluation?
POLICY IMPROVEMENT: Greedy policy improvement?
PROBLEMS:
1. Needs a lot of episodes.
2. Greedy improvement over $V(s)$ still requires a model of the MDP (to look one step ahead), so use $Q(s,a)$ instead.
3. No exploration with a purely greedy policy.
**CASE II:**
POLICY EVALUATION: MC policy evaluation using $Q$
POLICY IMPROVEMENT: Greedy?
PROBLEM:
1. Still no exploration.
Idea for ensuring continual exploration:
$\epsilon$-Greedy Exploration
- with probability $1-\epsilon$ choose the greedy action
- with probability $\epsilon$ choose an action at random
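A minimal sketch of $\epsilon$-greedy action selection over a tabular $Q$ (the `epsilon_greedy` name, the `Q[state, action]` array layout, and the `rng` argument are illustrative assumptions, not from the notes):

```python
import numpy as np

def epsilon_greedy(Q, state, epsilon, rng):
    """Pick an action from a tabular Q[state, action] array.

    With probability 1 - epsilon: the greedy action.
    With probability epsilon:     a uniformly random action.
    """
    n_actions = Q.shape[1]
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))   # explore
    return int(np.argmax(Q[state]))           # exploit
```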
Now,
**CASE III:**
POLICY EVALUATION: MC policy evaluation using $Q$
POLICY IMPROVEMENT: $\epsilon$-greedy policy improvement
PROBLEM:
1. Fully evaluating the policy before each improvement might take a long time.
Making the process efficient:
POLICY EVALUATION: MC policy evaluation, $Q \approx q_{\pi}$
i.e., update $Q$ and improve the policy ($\epsilon$-greedily) after every single episode, rather than after many.
### GREEDY IN THE LIMIT WITH INFINITE EXPLORATION (GLIE):
Conditions for GLIE:
1. All state-action pairs are explored infinitely many times.
2. The policy converges to a greedy policy.
One way to achieve this: decay $\epsilon$ towards 0, e.g.
$\epsilon_k = \frac{1}{k}$
### GLIE MC Control :
For every state $S_t$ and action $A_t$ visited in the episode:
$$N(S_t, A_t) \leftarrow N(S_t, A_t) + 1$$
$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \frac{1}{N(S_t, A_t)} (G_t - Q(S_t, A_t))$$
then improve the policy $\epsilon$-greedily with $\epsilon_k = \frac{1}{k}$.
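A rough sketch of the GLIE MC control loop, assuming a toy tabular environment where `env.reset()` returns a state index and `env.step(a)` returns `(next_state, reward, done)`; all names here are placeholders, not from the notes:

```python
import numpy as np

def glie_mc_control(env, n_states, n_actions, n_episodes=10_000, gamma=1.0):
    Q = np.zeros((n_states, n_actions))
    N = np.zeros((n_states, n_actions))              # visit counts
    rng = np.random.default_rng()

    for k in range(1, n_episodes + 1):
        eps = 1.0 / k                                # GLIE schedule: eps_k = 1/k
        # generate one episode with the current eps-greedy policy
        episode, state, done = [], env.reset(), False
        while not done:
            if rng.random() < eps:
                action = int(rng.integers(n_actions))
            else:
                action = int(np.argmax(Q[state]))
            next_state, reward, done = env.step(action)
            episode.append((state, action, reward))
            state = next_state
        # every-visit MC update, working backwards through the episode
        G = 0.0
        for state, action, reward in reversed(episode):
            G = reward + gamma * G
            N[state, action] += 1
            Q[state, action] += (G - Q[state, action]) / N[state, action]
    return Q
```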
Now,
Use TD instead of MC:
- apply TD to $Q(S,A)$
- use an $\epsilon$-greedy policy
- update at every time-step
### SARSA :

$$Q(S, A) \leftarrow Q(S,A) + \alpha (R + \gamma Q(S', A') - Q(S, A))$$
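A SARSA sketch under the same assumed toy-environment interface as the GLIE MC sketch above; everything other than the update rule itself is an illustrative assumption:

```python
import numpy as np

def sarsa(env, n_states, n_actions, n_episodes=5000,
          alpha=0.1, gamma=1.0, epsilon=0.1):
    Q = np.zeros((n_states, n_actions))
    rng = np.random.default_rng()

    def eps_greedy(s):
        if rng.random() < epsilon:
            return int(rng.integers(n_actions))
        return int(np.argmax(Q[s]))

    for _ in range(n_episodes):
        S, done = env.reset(), False
        A = eps_greedy(S)
        while not done:
            S2, R, done = env.step(A)
            A2 = eps_greedy(S2)
            # SARSA: bootstrap from the action actually taken next (A')
            Q[S, A] += alpha * (R + gamma * Q[S2, A2] * (not done) - Q[S, A])
            S, A = S2, A2
    return Q
```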
Convergence of SARSA:
1. GLIE sequence of policies
2. Robbins-Monro sequence of step-sizes $\alpha_t$:
$$\sum_{t=1}^{\infty} \alpha_t= \infty$$
$$\sum_{t=1}^{\infty} \alpha_t^2 < \infty$$
However, in practice the step-size condition is usually ignored, and SARSA often works reasonably well even when the GLIE condition doesn't strictly hold.
### n-step SARSA:
Define the $n$-step Q-return
$$q_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^{n} Q(S_{t+n}, A_{t+n})$$
and update towards it: $Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha (q_t^{(n)} - Q(S_t, A_t))$.
#### Forward view of SARSA($\lambda$):

$$ q_t^{\lambda} = (1-\lambda)\sum_{n=1}^{\infty} \lambda^{n-1}q_t^{(n)}$$
$$ Q(S_t, A_t) \leftarrow Q(S_t,A_t) + \alpha (q_t^{\lambda} - Q(S_t, A_t))$$
However, this isn't online: like MC, we have to wait until the end of the episode before updating.
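To make the "wait until the end of the episode" point concrete, here is a sketch of computing $q_t^{\lambda}$ offline from a recorded episode; the list layout (`rewards[k]` holding $R_{k+1}$) and function name are assumptions for illustration:

```python
import numpy as np

def lambda_return(t, states, actions, rewards, Q, gamma, lam):
    """q_t^lambda for one time step of a *finished* episode.

    The episode terminates after T = len(rewards) steps, so n-step returns
    with t + n >= T collapse to the full MC return G_t (no bootstrap).
    """
    T = len(rewards)
    q_n, ret = [], 0.0
    for n in range(1, T - t + 1):                 # n-step returns q_t^(n)
        ret += gamma ** (n - 1) * rewards[t + n - 1]
        if t + n < T:
            q_n.append(ret + gamma ** n * Q[states[t + n], actions[t + n]])
        else:
            q_n.append(ret)                       # full return G_t
    # geometric weighting (1 - lam) * lam^(n-1); remaining tail weight on G_t
    out = sum((1 - lam) * lam ** (n - 1) * q for n, q in enumerate(q_n[:-1], 1))
    return out + lam ** (len(q_n) - 1) * q_n[-1]
```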
### Backward View SARSA($\lambda$):
- use eligibility traces
At the beginning of each episode:
$$E_0(s,a) = 0$$
$$E_t(s, a) = \gamma E_{t-1}(s,a) + \mathbb{1}(S_t = s, A_t = a)$$
Q(s, a) is updated for every state-action pair in proportion to TD-error $\delta_t$ and eligibility traces:
$$ \delta_t = R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) $$
$$Q(s, a) \leftarrow Q(s,a) + \alpha \delta_t E_t(s,a)$$
$\lambda$ determines how far the reward information propagates back along the trajectory.
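A sketch of one episode of backward-view SARSA($\lambda$) with accumulating eligibility traces, again under the assumed toy-environment interface used in the sketches above:

```python
import numpy as np

def sarsa_lambda_episode(env, Q, alpha, gamma, lam, epsilon, rng):
    """Run one episode of backward-view SARSA(lambda), updating Q in place."""
    E = np.zeros_like(Q)                          # eligibility traces, E_0 = 0

    def eps_greedy(s):
        if rng.random() < epsilon:
            return int(rng.integers(Q.shape[1]))
        return int(np.argmax(Q[s]))

    S, done = env.reset(), False
    A = eps_greedy(S)
    while not done:
        S2, R, done = env.step(A)
        A2 = eps_greedy(S2)
        delta = R + gamma * Q[S2, A2] * (not done) - Q[S, A]   # TD error
        E[S, A] += 1.0                            # accumulate trace for (S, A)
        Q += alpha * delta * E                    # update *all* (s, a) pairs
        E *= gamma * lam                          # decay every trace
        S, A = S2, A2
    return Q
```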
## 2. Off-Policy:
### Importance Sampling :
$$ \mathbb{E}_{X \sim P}[f(X)] = \mathbb{E}_{X \sim Q}[\frac{P(X)}{Q(X)}f(X)] $$
However, applying importance sampling corrections to full MC returns is a very bad idea in practice: the product of ratios over a whole episode has extremely high variance.
Thus use TD learning, where only a single-step importance-sampling correction is needed.
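A tiny numerical check of the importance-sampling identity; the two Gaussians chosen for $P$ and $Q$ are arbitrary illustrative choices:

```python
import numpy as np

def gauss_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

rng = np.random.default_rng(0)

# Target P = N(0, 1); sampling distribution Q = N(1, 2).
# Estimate E_{X~P}[X^2] = 1 using only samples from Q, reweighted by P(x)/Q(x).
x = rng.normal(loc=1.0, scale=2.0, size=100_000)
w = gauss_pdf(x, 0.0, 1.0) / gauss_pdf(x, 1.0, 2.0)
print(np.mean(w * x ** 2))                        # close to 1.0
```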
### Q-Learning:

Q-learning needs no importance sampling at all, which makes it better suited than either of these approaches for off-policy learning of action-values.
- The action $A$ actually taken comes from the behaviour policy $\mu$, while the successor action $A'$ used in the target comes from the target policy $\pi$ that we want to optimise.
- Special case: the target policy $\pi$ is greedy w.r.t. $Q(s,a)$
- and the behaviour policy is $\epsilon$-greedy, giving the update:
$$Q(S, A) \leftarrow Q(S, A) + \alpha ( R + \gamma \max_{a'} Q(S' , a') - Q(S, A))$$
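A Q-learning sketch under the same assumed toy-environment interface; only the $\max_{a'}$ backup differs from the SARSA sketch earlier:

```python
import numpy as np

def q_learning(env, n_states, n_actions, n_episodes=5000,
               alpha=0.1, gamma=1.0, epsilon=0.1):
    Q = np.zeros((n_states, n_actions))
    rng = np.random.default_rng()

    for _ in range(n_episodes):
        S, done = env.reset(), False
        while not done:
            # behaviour policy mu: eps-greedy w.r.t. Q
            if rng.random() < epsilon:
                A = int(rng.integers(n_actions))
            else:
                A = int(np.argmax(Q[S]))
            S2, R, done = env.step(A)
            # target policy pi: greedy w.r.t. Q -> max over a' in the backup
            Q[S, A] += alpha * (R + gamma * np.max(Q[S2]) * (not done) - Q[S, A])
            S = S2
    return Q
```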

