# Model-Free Control
###### tags: `RL`
1. On-policy learning: learn about policy $\pi$ from experience sampled from $\pi$
2. Off-policy learning: learn about policy $\pi$ from experience sampled from a different policy $\mu$
- Main idea: GPI (Generalised Policy Iteration), so we have to choose a policy evaluation method and a policy improvement method.
## 1. On-Policy :
**CASE I:**
POLICY EVALUATION: MC evaluation?
POLICY IMPROVEMENT: Greedy policy improvement?
PROBLEMS:
1. Needs a lot of episodes.
2. Greedy improvement over $V(s)$ still requires a model of the MDP (to look one step ahead), so use $Q(s,a)$ instead.
3. No exploration with a purely greedy policy.
**CASE II:**
POLICY EVALUATION: MC policy evaluation using $Q$
POLICY IMPROVEMENT: Greedy?
PROBLEM:
1. Still no exploration.
Idea for ensuring continual exploration:
$\epsilon$-Greedy Exploration
- with probability $1-\epsilon$ choose the greedy action
- with probability $\epsilon$ choose an action at random
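A minimal sketch of $\epsilon$-greedy action selection over a tabular $Q$ (the `epsilon_greedy` name, the `Q[state, action]` array layout, and the `rng` argument are illustrative assumptions, not from the notes):

```python
import numpy as np

def epsilon_greedy(Q, state, epsilon, rng):
    """Pick an action from a tabular Q[state, action] array.

    With probability 1 - epsilon: the greedy action.
    With probability epsilon:     a uniformly random action.
    """
    n_actions = Q.shape[1]
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))   # explore
    return int(np.argmax(Q[state]))           # exploit
```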
Now,
**CASE III:**
POLICY EVALUATION: MC policy evaluation using $Q$
POLICY IMPROVEMENT: $\epsilon$-greedy policy improvement
PROBLEM:
1. Fully evaluating the policy before each improvement might take a long time.
Making the process efficient:
POLICY EVALUATION: MC policy evaluation, $Q \approx q_{\pi}$
i.e., update $Q$ and improve the policy ($\epsilon$-greedily) after every single episode, rather than after many.
### GREEDY IN THE LIMIT WITH INFINITE EXPLORATION (GLIE):
Conditions for GLIE:
1. All state-action pairs are explored infinitely many times.
2. The policy converges to a greedy policy.
One way to achieve this: decay $\epsilon$ towards 0, e.g.
$\epsilon_k = \frac{1}{k}$
### GLIE MC Control :
For every state $S_t$ and action $A_t$ visited in the episode:
$$N(S_t, A_t) \leftarrow N(S_t, A_t) + 1$$
$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \frac{1}{N(S_t, A_t)} (G_t - Q(S_t, A_t))$$
then improve the policy $\epsilon$-greedily with $\epsilon_k = \frac{1}{k}$.
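A rough sketch of the GLIE MC control loop, assuming a toy tabular environment where `env.reset()` returns a state index and `env.step(a)` returns `(next_state, reward, done)`; all names here are placeholders, not from the notes:

```python
import numpy as np

def glie_mc_control(env, n_states, n_actions, n_episodes=10_000, gamma=1.0):
    Q = np.zeros((n_states, n_actions))
    N = np.zeros((n_states, n_actions))              # visit counts
    rng = np.random.default_rng()

    for k in range(1, n_episodes + 1):
        eps = 1.0 / k                                # GLIE schedule: eps_k = 1/k
        # generate one episode with the current eps-greedy policy
        episode, state, done = [], env.reset(), False
        while not done:
            if rng.random() < eps:
                action = int(rng.integers(n_actions))
            else:
                action = int(np.argmax(Q[state]))
            next_state, reward, done = env.step(action)
            episode.append((state, action, reward))
            state = next_state
        # every-visit MC update, working backwards through the episode
        G = 0.0
        for state, action, reward in reversed(episode):
            G = reward + gamma * G
            N[state, action] += 1
            Q[state, action] += (G - Q[state, action]) / N[state, action]
    return Q
```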
Now,
Use TD instead of MC:
- apply TD to $Q(S,A)$
- use an $\epsilon$-greedy policy
- update at every time-step
### SARSA :

$$Q(S, A) \leftarrow Q(S,A) + \alpha (R + \gamma Q(S', A') - Q(S, A))$$
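A SARSA sketch under the same assumed toy-environment interface as the GLIE MC sketch above; everything other than the update rule itself is an illustrative assumption:

```python
import numpy as np

def sarsa(env, n_states, n_actions, n_episodes=5000,
          alpha=0.1, gamma=1.0, epsilon=0.1):
    Q = np.zeros((n_states, n_actions))
    rng = np.random.default_rng()

    def eps_greedy(s):
        if rng.random() < epsilon:
            return int(rng.integers(n_actions))
        return int(np.argmax(Q[s]))

    for _ in range(n_episodes):
        S, done = env.reset(), False
        A = eps_greedy(S)
        while not done:
            S2, R, done = env.step(A)
            A2 = eps_greedy(S2)
            # SARSA: bootstrap from the action actually taken next (A')
            Q[S, A] += alpha * (R + gamma * Q[S2, A2] * (not done) - Q[S, A])
            S, A = S2, A2
    return Q
```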
Convergence of SARSA:
1. GLIE sequence of policies
2. Robbins-Monro sequence of step-sizes $\alpha_t$:
$$\sum_{t=1}^{\infty} \alpha_t= \infty$$
$$\sum_{t=1}^{\infty} \alpha_t^2 < \infty$$
However, in practice the step-size condition is usually ignored, and SARSA often works reasonably well even when the GLIE condition doesn't strictly hold.
### n-step SARSA:
Define the $n$-step Q-return
$$q_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^{n} Q(S_{t+n}, A_{t+n})$$
and update towards it: $Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha (q_t^{(n)} - Q(S_t, A_t))$.
#### Forward view of SARSA($\lambda$):

$$ q_t^{\lambda} = (1-\lambda)\sum_{n=1}^{\infty} \lambda^{n-1}q_t^{(n)}$$
$$ Q(S_t, A_t) \leftarrow Q(S_t,A_t) + \alpha (q_t^{\lambda} - Q(S_t, A_t))$$
However, this isn't online: like MC, we have to wait until the end of the episode before updating.
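To make the "wait until the end of the episode" point concrete, here is a sketch of computing $q_t^{\lambda}$ offline from a recorded episode; the list layout (`rewards[k]` holding $R_{k+1}$) and function name are assumptions for illustration:

```python
import numpy as np

def lambda_return(t, states, actions, rewards, Q, gamma, lam):
    """q_t^lambda for one time step of a *finished* episode.

    The episode terminates after T = len(rewards) steps, so n-step returns
    with t + n >= T collapse to the full MC return G_t (no bootstrap).
    """
    T = len(rewards)
    q_n, ret = [], 0.0
    for n in range(1, T - t + 1):                 # n-step returns q_t^(n)
        ret += gamma ** (n - 1) * rewards[t + n - 1]
        if t + n < T:
            q_n.append(ret + gamma ** n * Q[states[t + n], actions[t + n]])
        else:
            q_n.append(ret)                       # full return G_t
    # geometric weighting (1 - lam) * lam^(n-1); remaining tail weight on G_t
    out = sum((1 - lam) * lam ** (n - 1) * q for n, q in enumerate(q_n[:-1], 1))
    return out + lam ** (len(q_n) - 1) * q_n[-1]
```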
### Backward View SARSA($\lambda$):
- use eligibility traces
At the beginning of each episode:
$$E_0(s,a) = 0$$
$$E_t(s, a) = \gamma E_{t-1}(s,a) + \mathbb{1}(S_t = s, A_t = a)$$
Q(s, a) is updated for every state-action pair in proportion to TD-error $\delta_t$ and eligibility traces:
$$ \delta_t = R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) $$
$$Q(s, a) \leftarrow Q(s,a) + \alpha \delta_t E_t(s,a)$$
$\lambda$ determines how far the reward information propagates back along the trajectory.
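A sketch of one episode of backward-view SARSA($\lambda$) with accumulating eligibility traces, again under the assumed toy-environment interface used in the sketches above:

```python
import numpy as np

def sarsa_lambda_episode(env, Q, alpha, gamma, lam, epsilon, rng):
    """Run one episode of backward-view SARSA(lambda), updating Q in place."""
    E = np.zeros_like(Q)                          # eligibility traces, E_0 = 0

    def eps_greedy(s):
        if rng.random() < epsilon:
            return int(rng.integers(Q.shape[1]))
        return int(np.argmax(Q[s]))

    S, done = env.reset(), False
    A = eps_greedy(S)
    while not done:
        S2, R, done = env.step(A)
        A2 = eps_greedy(S2)
        delta = R + gamma * Q[S2, A2] * (not done) - Q[S, A]   # TD error
        E[S, A] += 1.0                            # accumulate trace for (S, A)
        Q += alpha * delta * E                    # update *all* (s, a) pairs
        E *= gamma * lam                          # decay every trace
        S, A = S2, A2
    return Q
```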
## 2. Off-Policy:
### Importance Sampling :
$$ \mathbb{E}_{X \sim P}[f(X)] = \mathbb{E}_{X \sim Q}[\frac{P(X)}{Q(X)}f(X)] $$
However, applying importance sampling corrections to full MC returns is a very bad idea in practice: the product of ratios over a whole episode has extremely high variance.
Thus use TD learning, where only a single-step importance-sampling correction is needed.
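A tiny numerical check of the importance-sampling identity; the two Gaussians chosen for $P$ and $Q$ are arbitrary illustrative choices:

```python
import numpy as np

def gauss_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

rng = np.random.default_rng(0)

# Target P = N(0, 1); sampling distribution Q = N(1, 2).
# Estimate E_{X~P}[X^2] = 1 using only samples from Q, reweighted by P(x)/Q(x).
x = rng.normal(loc=1.0, scale=2.0, size=100_000)
w = gauss_pdf(x, 0.0, 1.0) / gauss_pdf(x, 1.0, 2.0)
print(np.mean(w * x ** 2))                        # close to 1.0
```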
### Q-Learning:

Q-learning needs no importance sampling at all, which makes it better suited than either of these approaches for off-policy learning of action-values.
- The action $A$ actually taken comes from the behaviour policy $\mu$, while the successor action $A'$ used in the target comes from the target policy $\pi$ that we want to optimise.
- Special case: the target policy $\pi$ is greedy w.r.t. $Q(s,a)$
- and the behaviour policy is $\epsilon$-greedy, giving the update:
$$Q(S, A) \leftarrow Q(S, A) + \alpha ( R + \gamma \max_{a'} Q(S' , a') - Q(S, A))$$
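A Q-learning sketch under the same assumed toy-environment interface; only the $\max_{a'}$ backup differs from the SARSA sketch earlier:

```python
import numpy as np

def q_learning(env, n_states, n_actions, n_episodes=5000,
               alpha=0.1, gamma=1.0, epsilon=0.1):
    Q = np.zeros((n_states, n_actions))
    rng = np.random.default_rng()

    for _ in range(n_episodes):
        S, done = env.reset(), False
        while not done:
            # behaviour policy mu: eps-greedy w.r.t. Q
            if rng.random() < epsilon:
                A = int(rng.integers(n_actions))
            else:
                A = int(np.argmax(Q[S]))
            S2, R, done = env.step(A)
            # target policy pi: greedy w.r.t. Q -> max over a' in the backup
            Q[S, A] += alpha * (R + gamma * np.max(Q[S2]) * (not done) - Q[S, A])
            S = S2
    return Q
```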

