# Notes from RL theory (AJKS) <iframe src="https://rltheorybook.github.io/rltheorybook_AJKS.pdf" width=680 height=400></iframe> # Chapter 2 ## Proof of Proposition 2.1 (by Howard Huang) ### Part 1 (Application of Lemma A.4) A.4 (Concentration for discrete distributions) says $$P\left(||\hat q - q||_1 \geq \sqrt{d}(1/\sqrt{N}+\epsilon)\right) \leq e^{-N\epsilon^2}$$ The bound we want to end up with is $$\mathbb{P}\left(||P(\cdot|s,a)-\hat P(\cdot|s,a)||_1 \leq c\sqrt{\frac{|\mathcal{S}|\log(1/\delta)}{m}}\right) > 1-\delta$$ We can swap things around so that it matches up with the inequality in A.4 (i.e. by using $P(A)+P(\neg A) = 1$, or $P(A>x)+P(A\leq x) = 1$, and doing a bit of algebra). $$\mathbb{P}\left(||P(\cdot|s,a)-\hat P(\cdot|s,a)||_1 \geq c\sqrt{\frac{|\mathcal{S}|\log(1/\delta)}{m}}\right) < \delta$$ In our case, the probability distribution over next states $P(\cdot | s,a)$ maps to $q$, the number of samples $m$ maps to $N$, and the probability $\delta$ maps to $e^{-N\epsilon^2}$. \begin{align*} P\left(||P(\cdot|s,a)-\hat P(\cdot|s,a)||_1 \geq \sqrt{|S|}(1/\sqrt{m}+\epsilon)\right) \leq e^{-m\epsilon^2} \end{align*} We want a bound with probability less than $\delta$ \begin{align*} e^{-m\epsilon^2} &= \delta \\ -m\epsilon^2 &= \log(\delta) \\ \epsilon &= \sqrt{-\frac{\log(\delta)}{m}} \\ \end{align*} Substitute in inner inequality with this value of $\epsilon$: \begin{align*} \sqrt{|S|}(1/\sqrt{m}+\epsilon) &= \sqrt{|S|}\left(1/\sqrt{m}+\sqrt{-\frac{\log(\delta)}{m}}\right) \\ &\geq \sqrt{|S|}\sqrt{-\frac{\log(\delta)}{m}} \\ &= \sqrt{\frac{|S|\log(1/\delta)}{m}} \\ \end{align*} ## Part 2 (Application of Union Bound) [Union bound](https://en.wikipedia.org/wiki/Boole%27s_inequality) We want to prove that \begin{align*} \mathbb{P}\left( \left[\max_{(s,a)\in \mathcal{S}\times\mathcal{A}} ||P(\cdot|s,a)-\hat P(\cdot|s,a)||_1 \right] \leq (1-\gamma)^2\epsilon \right) \geq 1-\delta \end{align*} --- Starting from the inequality previously proved ![](https://i.imgur.com/lEFA70l.png) The above inequality occurs with probability greater than $1-\delta$ \begin{align*} \mathbb{P}\left(||P(\cdot|s,a)-\hat P(\cdot|s,a)||_1 \leq c\sqrt{\frac{|\mathcal{S}|\log(1/\delta)}{m}} \right) &\geq 1-\delta \\ \end{align*} Consider the opposite event. \begin{align*} \mathbb{P}\left(||P(\cdot|s,a)-\hat P(\cdot|s,a)||_1 \geq c\sqrt{\frac{|\mathcal{S}|\log(1/\delta)}{m}} \right) &\leq \delta \\ \end{align*} Then apply the union bound to get \begin{align*} \mathbb{P}\left(\bigcup_{(s,a)\in\mathcal{S}\times\mathcal{A}}\left[||P(\cdot|s,a)-\hat P(\cdot|s,a)||_1 \geq c\sqrt{\frac{|\mathcal{S}|\log(1/\delta)}{m}} \right] \right) &\leq \delta |\mathcal{S}||\mathcal{A}| \\ \end{align*} We can negate this again (note: $P\left(\bigcup_i A_i\right)+P\left(\neg\bigcup_i A_i\right) = P\left(\bigcup_i A_i\right)+P\left(\bigcap_i \neg A_i\right) = 1$) \begin{align*} \mathbb{P}\left(\bigcap_{(s,a)\in\mathcal{S}\times\mathcal{A}}\left[||P(\cdot|s,a)-\hat P(\cdot|s,a)||_1 \leq c\sqrt{\frac{|\mathcal{S}|\log(1/\delta)}{m}} \right] \right) &\geq 1-\delta |\mathcal{S}||\mathcal{A}| \\ \end{align*} We can reason out from first principles that the probability that inequality holds when taking the max of the left-hand side is equivalent to the intersection of the event that the inequality holds for all $s$ and $a$. This can be stated mathematically as \begin{align*} \mathbb{P}\left( \bigcap_{(s,a)\in \mathcal{S}\times\mathcal{A}} \left[||P(\cdot|s,a)-\hat P(\cdot|s,a)||_1 \leq M\right] \right) &= \mathbb{P}\left( \left[\max_{(s,a)\in \mathcal{S}\times\mathcal{A}} ||P(\cdot|s,a)-\hat P(\cdot|s,a)||_1 \right] \leq M \right) \\ \end{align*} Where $M = c\sqrt{\frac{|\mathcal{S}|\log(1/\delta)}{m}}$ is our probabilistic upper bound. Combine to get \begin{align*} \mathbb{P}\left( \left[\max_{(s,a)\in \mathcal{S}\times\mathcal{A}} ||P(\cdot|s,a)-\hat P(\cdot|s,a)||_1 \right] \leq c\sqrt{\frac{|\mathcal{S}|\log(1/\delta)}{m}} \right) > 1-\delta |\mathcal{S}||\mathcal{A}| \end{align*} Substitute $\delta$ with $\delta/(|\mathcal{S}||\mathcal{A}|)$ \begin{align*} \mathbb{P}\left( \left[\max_{(s,a)\in \mathcal{S}\times\mathcal{A}} ||P(\cdot|s,a)-\hat P(\cdot|s,a)||_1 \right] \leq c\sqrt{\frac{|\mathcal{S}|\log(|\mathcal{S}||\mathcal{A}|/\delta)}{m}} \right) > 1-\delta \end{align*} This is consistent with the bound we want to prove as long as this is a tighter bound. That is, we want to show Recall that the total number of samples is given by $$|\mathcal{S}||\mathcal{A}|N \geq \frac{\gamma}{(1-\gamma)^4}\frac{|\mathcal{S}|^2|\mathcal{A}|\log(c|\mathcal{S}||\mathcal{A}|/\delta)}{\epsilon^2}$$ meaning that the number of samples for each state-action pair is $$N \geq \frac{\gamma}{(1-\gamma)^4}\frac{|\mathcal{S}|\log(c|\mathcal{S}||\mathcal{A}|/\delta)}{\epsilon^2}$$ We can replace $m$ by the right-hand size of the above expression \begin{align*} c\sqrt{\frac{|\mathcal{S}|\log(|\mathcal{S}||\mathcal{A}|/\delta)}{m}} &\leq c\sqrt{\frac{|\mathcal{S}|\log(|\mathcal{S}||\mathcal{A}|/\delta)}{\frac{\gamma}{(1-\gamma)^4}\frac{|\mathcal{S}|\log(c|\mathcal{S}||\mathcal{A}|/\delta)}{\epsilon^2}}} \\ &= \epsilon(1-\gamma)^2\sqrt{\frac{c^2\log(|\mathcal{S}||\mathcal{A}|/\delta)}{\gamma\log(c|\mathcal{S}||\mathcal{A}|/\delta)}} \\ \end{align*} The original bound holds as long as $\sqrt{\frac{c^2\log(|\mathcal{S}||\mathcal{A}|/\delta)}{\gamma\log(c|\mathcal{S}||\mathcal{A}|/\delta)}} < 1$. This is not always true though. We can choose $c$ such that this expression is greater than $1$. ### Second claim Other proof for the passage $||(P - \hat{P})V^{\pi}||_{\infty} \leq \left( \max_{s,a} || P(\cdot | s,a) - \hat{P}(\cdot | s, a) || \right)||V^\pi||_{\infty}$ The left hand side is the norm of a matrix-vector product. We can thus upper-bound it by $||(P - \hat{P})|| \times ||V^\pi||_{\infty}$, where $|| \cdot||$ is the matrix norm induced by the $\infty$ norm. The induced matrix norm is precisely the maximum L1 norm over the rows of $P - \hat{P}$. Reference [here](https://www.cis.upenn.edu/~cis515/cis515-11-sl4.pdf), p.25.