###### tags: `IAR`

<!--
# IAR - Homework 2

[Link to the paper](https://arxiv.org/pdf/1509.06461.pdf)
[Medium article](https://towardsdatascience.com/deep-double-q-learning-7fca410b193a)
[CartPole GitHub code](https://github.com/amirmirzaei79/CartPole-DQN-And-DDQN)
[Slides](https://docs.google.com/presentation/d/1Zi2KrU9zVyoPcsjIq774doxFUHvPdiJNoQlLhif6fYg/edit#slide=id.p)

## Equations

$Q_\pi(s, a) \equiv \mathbb{E}\left[R_1+\gamma R_2+\ldots \mid S_0=s, A_0=a, \pi\right]$

3. $Y_t^{\mathrm{DQN}} \equiv R_{t+1}+\gamma \max _a Q\left(S_{t+1}, a ; \boldsymbol{\theta}_t^{-}\right)$

$Y_t^{\mathrm{Q}}=R_{t+1}+\gamma Q\left(S_{t+1}, \underset{a}{\operatorname{argmax}} Q\left(S_{t+1}, a ; \boldsymbol{\theta}_t\right) ; \boldsymbol{\theta}_t\right)$

4. $Y_t^{\mathrm{DoubleQ}}=R_{t+1}+\gamma Q\left(S_{t+1}, \underset{a}{\operatorname{argmax}} Q\left(S_{t+1}, a ; \boldsymbol{\theta}_t\right) ; \boldsymbol{\theta}_t'\right)$

## Roadmap

- Explanations of the paper
-->

# Deep Double Q-Learning (DDQN to its friends)

---

# Summary

1. Introduction
2. Q-learning
3. Deep Q-learning
4. Double Q-learning
5. Deep Double Q-learning
6. Evaluation
7. Conclusion

---

### Introduction

----

> The popular Q-learning algorithm is known to overestimate action values under certain conditions. van Hasselt et al. (2015)

Note: It is known to sometimes learn unrealistically high action values because it includes a maximization step over estimated action values, which tends to prefer overestimated to underestimated values.

----

> Thrun and Schwartz (1993) showed that if the action values contain random errors uniformly distributed in an interval $[-\epsilon, \epsilon]$ then each target is overestimated up to $\gamma\epsilon\frac{m-1}{m+1}$, where $m$ is the number of actions. van Hasselt et al. (2015)

----

### Let's recap things a bit

---

### Q-learning

----

The true value of each action is defined as the expected sum of discounted future rewards when taking that action and following policy $\pi$ thereafter.

$\boxed{Q_\pi(s, a) \equiv \mathbb{E}\left[R_1+\gamma R_2+\ldots \mid S_0=s, A_0=a, \pi\right]}$

with the optimal value

$\boxed{Q_*(s, a) = \max_{\pi} Q_{\pi}(s, a)}$

Note: $\gamma$ is the discount factor

----

However, most interesting problems are too large to learn all action values in all states separately. Instead, we can learn a parameterized value function $Q(s, a; \boldsymbol{\theta}_t)$.

$\boxed{\boldsymbol{\theta}_{t+1}=\boldsymbol{\theta}_t+\alpha\left(Y_t^{\mathrm{Q}}-Q\left(S_t, A_t ; \boldsymbol{\theta}_t\right)\right) \nabla_{\boldsymbol{\theta}_t} Q\left(S_t, A_t ; \boldsymbol{\theta}_t\right)}$

where

$\boxed{Y_t^{\mathrm{Q}} \equiv R_{t+1}+\gamma \max _a Q\left(S_{t+1}, a ; \boldsymbol{\theta}_t\right)}$

Note: $\alpha$ is a scalar step size

---

### Deep Q-learning

<!--DQN improvements-->

![DQN diagram](https://miro.medium.com/max/1276/1*Vd1kcpLoQDnM5vrKnvzxbw.png)

----

The target network, with parameters $\theta^-$, is the same as the online network except that its parameters are copied every $\tau$ steps from the online network, so that $\theta^-_t = \theta_t$, and kept fixed on all other steps.

$\boxed{Y_t^{\mathrm{DQN}} \equiv R_{t+1}+\gamma \max _a Q\left(S_{t+1}, a ; \boldsymbol{\theta}_t^{-}\right)}$

----

With experience replay (Lin, 1992), observed transitions are stored for some time and sampled uniformly from this memory bank to update the network.
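----

To make the update concrete, here is a minimal PyTorch-style sketch of the DQN target and loss. This is only an illustration, not code from the paper or the linked repository; `online_net`, `target_net`, and the batch tensors are assumed to exist.

```python
import torch
import torch.nn.functional as F

def dqn_targets(target_net, rewards, next_states, dones, gamma=0.99):
    # Y^DQN = r + gamma * max_a Q(s', a; theta^-), with terminal states masked out.
    # `dones` is assumed to be a float tensor of 0/1 episode-end flags.
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
    return rewards + gamma * (1.0 - dones) * next_q

def dqn_loss(online_net, target_net, states, actions, rewards, next_states, dones):
    # Q(s, a; theta) for the actions actually taken (actions is an int64 tensor)
    q_sa = online_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    y = dqn_targets(target_net, rewards, next_states, dones)
    return F.mse_loss(q_sa, y)
```

Note: here the target network both selects (through the max) and evaluates the next action; Double DQN, introduced next, only changes how `next_q` is formed.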
---

### Double Q-learning

----

### What are the issues still?

> The max operator in standard Q-learning and DQN, in (2) and (3), uses the same values both to select and to evaluate an action. van Hasselt et al. (2015)

This also applies to noise or any other kind of approximation error.

----

$Y_t^{\mathrm{Q}}=R_{t+1}+\gamma Q\left(S_{t+1}, \underset{a}{\operatorname{argmax}} Q\left(S_{t+1}, a ; \boldsymbol{\theta}_t\right) ; \boldsymbol{\theta}_t\right)$

----

$Y_t^{\mathrm{DoubleQ}}=R_{t+1}+\gamma Q\left(S_{t+1}, \underset{a}{\operatorname{argmax}} Q\left(S_{t+1}, a ; \boldsymbol{\theta}_t\right) ; \boldsymbol{\theta}_t'\right)$

Note: In the new equation a second set of weights $\theta'_t$ is used to evaluate the selected action, which decouples selection from evaluation.

----

![](https://i.imgur.com/87P9UyD.png)

> Figure 1: Estimate of the Q-learning bias in a single update, averaged over 100 repetitions, as the number of actions grows

Note: The orange bars show the bias in a single Q-learning update when the action values are $Q(s, a) = V_*(s) + \epsilon_a$ and the errors $\{\epsilon_a\}^m_{a=1}$ are independent standard normal random variables. The second set of action values $Q'$, used for the blue bars, was generated identically and independently. All bars are the average of 100 repetitions.

---

## Adaptation to Double DQN

----

> We propose to evaluate the greedy policy according to the online network, but using the target network to estimate its value. van Hasselt et al. (2015)

----

<center>
<img src="https://builtin.com/sites/www.builtin.com/files/styles/ckeditor_optimize/public/inline-images/7_double-deep-q-learning.png" />
</center>

> Decision flow of Deep Double Q-learning

----

$Y_t^{\mathrm{DQN}} \equiv R_{t+1}+\gamma \max _a Q\left(S_{t+1}, a ; \boldsymbol{\theta}_t^{-}\right)$

$Y_t^{\text{DoubleDQN}} \equiv R_{t+1}+\gamma Q\left(S_{t+1}, \underset{a}{\operatorname{argmax}} Q\left(S_{t+1}, a ; \boldsymbol{\theta}_t\right) ; \boldsymbol{\theta}_t^{-}\right)$

Note: In comparison to Double Q-learning (4), the weights of the second network $\theta'_t$ are replaced with the weights of the target network $\theta^-_t$ for the evaluation of the current greedy policy.

---

## Evaluation

----

![](https://i.imgur.com/TNHMaxa.png)

> Figure 2: DQN vs Double DQN experimental results on Atari games

Note: The straight horizontal orange (for DQN) and blue (for Double DQN) lines in the top row are computed by running the corresponding agents after learning concluded, and averaging the actual discounted return obtained from each visited state. These straight lines would match the learning curves at the right side of the plots if there were no bias.

----

![](https://i.imgur.com/f64zuTV.png)

> Figure 3: Normalized scores of reinforcement learning agents trained with different algorithms on Atari games

---

## Conclusion

- Q-learning suffers from overestimation
- DQN can solve this problem in some cases
- Double DQN solves it in most cases
- Double DQN performs better
- Double DQN can provide more stable and reliable learning results

----

# Sources

* https://arxiv.org/pdf/1509.06461.pdf
* https://towardsdatascience.com/deep-double-q-learning-7fca410b193a
* https://github.com/amirmirzaei79/CartPole-DQN-And-DDQN

----

# Thanks for your attention, questions?

<p style="font-size: 10px;">please no</p>
{"metaMigratedAt":"2023-06-17T11:15:05.569Z","metaMigratedFrom":"Content","title":"Deep Double Q-Learning (DDQN pour les intimes)","breaks":true,"contributors":"[{\"id\":\"f9ff4bf0-f00f-448b-8cb8-1e4694888f19\",\"add\":6343,\"del\":827},{\"id\":null,\"add\":995,\"del\":62},{\"id\":\"fe9db633-add7-40f5-a8ca-3c884a3f0ec9\",\"add\":306,\"del\":11}]"}
    552 views
   Owned this note