# Notes from RL theory (AJKS)
<iframe src="https://rltheorybook.github.io/rltheorybook_AJKS.pdf" width=680 height=400></iframe>
# Chapter 2
## Proof of Proposition 2.1 (by Howard Huang)
### Part 1 (Application of Lemma A.4)
A.4 (Concentration for discrete distributions) says
$$P\left(||\hat q - q||_1 \geq \sqrt{d}(1/\sqrt{N}+\epsilon)\right) \leq e^{-N\epsilon^2}$$
The bound we want to end up with is
$$\mathbb{P}\left(||P(\cdot|s,a)-\hat P(\cdot|s,a)||_1 \leq c\sqrt{\frac{|\mathcal{S}|\log(1/\delta)}{m}}\right) > 1-\delta$$
We can swap things around so that it matches up with the inequality in A.4
(i.e. by using $P(A)+P(\neg A) = 1$, or $P(A>x)+P(A\leq x) = 1$, and doing a bit of algebra).
$$\mathbb{P}\left(||P(\cdot|s,a)-\hat P(\cdot|s,a)||_1 \geq c\sqrt{\frac{|\mathcal{S}|\log(1/\delta)}{m}}\right) < \delta$$
In our case, the probability distribution over next states $P(\cdot | s,a)$ maps to $q$,
the number of samples $m$ maps to $N$, and the probability $\delta$ maps to $e^{-N\epsilon^2}$.
\begin{align*}
P\left(||P(\cdot|s,a)-\hat P(\cdot|s,a)||_1 \geq \sqrt{|S|}(1/\sqrt{m}+\epsilon)\right) \leq e^{-m\epsilon^2}
\end{align*}
We want a bound with probability less than $\delta$
\begin{align*}
e^{-m\epsilon^2} &= \delta \\
-m\epsilon^2 &= \log(\delta) \\
\epsilon &= \sqrt{-\frac{\log(\delta)}{m}} \\
\end{align*}
Substitute in inner inequality with this value of $\epsilon$:
\begin{align*}
\sqrt{|S|}(1/\sqrt{m}+\epsilon)
&= \sqrt{|S|}\left(1/\sqrt{m}+\sqrt{-\frac{\log(\delta)}{m}}\right) \\
&\geq \sqrt{|S|}\sqrt{-\frac{\log(\delta)}{m}} \\
&= \sqrt{\frac{|S|\log(1/\delta)}{m}} \\
\end{align*}
## Part 2 (Application of Union Bound)
[Union bound](https://en.wikipedia.org/wiki/Boole%27s_inequality)
We want to prove that
\begin{align*}
\mathbb{P}\left( \left[\max_{(s,a)\in \mathcal{S}\times\mathcal{A}} ||P(\cdot|s,a)-\hat P(\cdot|s,a)||_1 \right] \leq (1-\gamma)^2\epsilon \right) \geq 1-\delta
\end{align*}
---
Starting from the inequality previously proved

The above inequality occurs with probability greater than $1-\delta$
\begin{align*}
\mathbb{P}\left(||P(\cdot|s,a)-\hat P(\cdot|s,a)||_1 \leq c\sqrt{\frac{|\mathcal{S}|\log(1/\delta)}{m}} \right) &\geq 1-\delta \\
\end{align*}
Consider the opposite event.
\begin{align*}
\mathbb{P}\left(||P(\cdot|s,a)-\hat P(\cdot|s,a)||_1 \geq c\sqrt{\frac{|\mathcal{S}|\log(1/\delta)}{m}} \right) &\leq \delta \\
\end{align*}
Then apply the union bound to get
\begin{align*}
\mathbb{P}\left(\bigcup_{(s,a)\in\mathcal{S}\times\mathcal{A}}\left[||P(\cdot|s,a)-\hat P(\cdot|s,a)||_1 \geq c\sqrt{\frac{|\mathcal{S}|\log(1/\delta)}{m}} \right] \right) &\leq \delta |\mathcal{S}||\mathcal{A}| \\
\end{align*}
We can negate this again
(note: $P\left(\bigcup_i A_i\right)+P\left(\neg\bigcup_i A_i\right) = P\left(\bigcup_i A_i\right)+P\left(\bigcap_i \neg A_i\right) = 1$)
\begin{align*}
\mathbb{P}\left(\bigcap_{(s,a)\in\mathcal{S}\times\mathcal{A}}\left[||P(\cdot|s,a)-\hat P(\cdot|s,a)||_1 \leq c\sqrt{\frac{|\mathcal{S}|\log(1/\delta)}{m}} \right] \right) &\geq 1-\delta |\mathcal{S}||\mathcal{A}| \\
\end{align*}
We can reason out from first principles that the probability that inequality holds when taking the max of the left-hand side is equivalent to the intersection of the event that the inequality holds for all $s$ and $a$. This can be stated mathematically as
\begin{align*}
\mathbb{P}\left( \bigcap_{(s,a)\in \mathcal{S}\times\mathcal{A}} \left[||P(\cdot|s,a)-\hat P(\cdot|s,a)||_1 \leq M\right] \right)
&= \mathbb{P}\left( \left[\max_{(s,a)\in \mathcal{S}\times\mathcal{A}} ||P(\cdot|s,a)-\hat P(\cdot|s,a)||_1 \right] \leq M \right) \\
\end{align*}
Where $M = c\sqrt{\frac{|\mathcal{S}|\log(1/\delta)}{m}}$ is our probabilistic upper bound.
Combine to get
\begin{align*}
\mathbb{P}\left( \left[\max_{(s,a)\in \mathcal{S}\times\mathcal{A}} ||P(\cdot|s,a)-\hat P(\cdot|s,a)||_1 \right] \leq c\sqrt{\frac{|\mathcal{S}|\log(1/\delta)}{m}} \right) > 1-\delta |\mathcal{S}||\mathcal{A}|
\end{align*}
Substitute $\delta$ with $\delta/(|\mathcal{S}||\mathcal{A}|)$
\begin{align*}
\mathbb{P}\left( \left[\max_{(s,a)\in \mathcal{S}\times\mathcal{A}} ||P(\cdot|s,a)-\hat P(\cdot|s,a)||_1 \right] \leq c\sqrt{\frac{|\mathcal{S}|\log(|\mathcal{S}||\mathcal{A}|/\delta)}{m}} \right) > 1-\delta
\end{align*}
This is consistent with the bound we want to prove as long as this is a tighter bound. That is, we want to show
Recall that the total number of samples is given by
$$|\mathcal{S}||\mathcal{A}|N \geq \frac{\gamma}{(1-\gamma)^4}\frac{|\mathcal{S}|^2|\mathcal{A}|\log(c|\mathcal{S}||\mathcal{A}|/\delta)}{\epsilon^2}$$
meaning that the number of samples for each state-action pair is
$$N \geq \frac{\gamma}{(1-\gamma)^4}\frac{|\mathcal{S}|\log(c|\mathcal{S}||\mathcal{A}|/\delta)}{\epsilon^2}$$
We can replace $m$ by the right-hand size of the above expression
\begin{align*}
c\sqrt{\frac{|\mathcal{S}|\log(|\mathcal{S}||\mathcal{A}|/\delta)}{m}}
&\leq c\sqrt{\frac{|\mathcal{S}|\log(|\mathcal{S}||\mathcal{A}|/\delta)}{\frac{\gamma}{(1-\gamma)^4}\frac{|\mathcal{S}|\log(c|\mathcal{S}||\mathcal{A}|/\delta)}{\epsilon^2}}} \\
&= \epsilon(1-\gamma)^2\sqrt{\frac{c^2\log(|\mathcal{S}||\mathcal{A}|/\delta)}{\gamma\log(c|\mathcal{S}||\mathcal{A}|/\delta)}} \\
\end{align*}
The original bound holds as long as $\sqrt{\frac{c^2\log(|\mathcal{S}||\mathcal{A}|/\delta)}{\gamma\log(c|\mathcal{S}||\mathcal{A}|/\delta)}} < 1$. This is not always true though. We can choose $c$ such that this expression is greater than $1$.
### Second claim
Other proof for the passage $||(P - \hat{P})V^{\pi}||_{\infty} \leq \left( \max_{s,a} || P(\cdot | s,a) - \hat{P}(\cdot | s, a) || \right)||V^\pi||_{\infty}$
The left hand side is the norm of a matrix-vector product.
We can thus upper-bound it by $||(P - \hat{P})|| \times ||V^\pi||_{\infty}$, where $||
\cdot||$ is the matrix norm induced by the $\infty$ norm.
The induced matrix norm is precisely the maximum L1 norm over the rows of $P - \hat{P}$. Reference [here](https://www.cis.upenn.edu/~cis515/cis515-11-sl4.pdf), p.25.