Double DQN - HackMD

--- title: Double DQN tags: DQN --- ## 문제 논문에서는 Q-learning의 max 연산에 의해 overestimation이 나타난다고 한다. ## 원인 가설 이는 이전부터 존재해왔으며 원인으로는 environmental noise, function approximation, non-stationarity 등 다양한 의견이 있다. 논문에서는 이러한 원인이 어떻든 estimation error 자체가 upward bias를 일으킨다는 것을 lower bound가 양수임을 보이면서 증명한다. ### Theorem 1. - 모든 optimal action value가 $V_*(s)$로 동일한 상태 $s$ $$Q_*(s,a)=V_*(s)$$ - 임의의 unbiased한 value estimates. $Q_t$ $$\sum_a(Q_t(s,a)-V_*(s)) = 0$$ - 하지만 값이 모두 $V_*(s)$와 일치 하지 않아서 제곱했을 때 임의의 양수 $C (C>0)$가 존재. ($m$은 행동의 개수. $m\ge2$) $$ \frac{1}{m}\sum_a(Q_t(s,a)-V_*(s))^2=C $$ 위 조건들 하에, 다음과 같은 lower-bound가 존재한다. $$ \max_a Q_t(s,a) \ge V_*(s) + \sqrt{\frac{C}{m-1}}$$ ## 해결 방안 Double Q-learning과 Deep Learning의 결합. ### DQN $$ \theta_{t+1} = \theta_t + \alpha (R_{t+1} + \gamma \color{blue}{\max_a Q(S_{t+1}, a; \theta_t)} - Q(S_t, A_t; \theta_t) )\nabla_{\theta_t}Q(S_t,A_t;\theta_t)$$ ### Double DQN $$ \theta_{t+1} = \theta_t + \alpha (R_{t+1} + \gamma \color{red}{Q(S_{t+1}, \arg\max_a Q(S_{t+1}, a; \theta_t);\theta'_t)} - Q(S_t, A_t; \theta_t) )\nabla_{\theta_t}Q(S_t,A_t;\theta_t)$$ 초기에 zero-based를 출력하는 네트워크가 Return의 값에 맞춰가기 위해 조금씩 scale이 커질 것이다. target network는 online network에 비해 보수적으로 업데이트 되므로 이보다 낮을 확률이 높기에 overestimation을 줄이는 효과를 기대할 수 있다. ## 실험 ### 1 - Q-learning과 Double Q-learning 차이 ![](https://i.imgur.com/HC4TB9Q.png) - x-axis 연속적인 상태 $s$ - y-axis 행동의 $a$의 estimated value - 모든 행동이 동일한 $Q_*$를 가짐($Q_*(s,a)=V_*(s)$) - 행동들은 서로 다른 sample들을 사용해서 approximation - function approximation(polynomial) - top : 6-degree - middle : 6-degree - bottom : 9-degree - True function - top : $\sin(s)$ - middle : $\exp(-s^2)$ - bottom : $\exp(-s^2)$ 결과적으로 우측에서 Q-learning은 function approximation이 flexible한지 여부와 상관없이 upward bias함을 알 수 있고, Double Q-learning은 조금 더 zero에 근접하다고 볼 수 있다. ### 2 - DQN과 Double DQN의 차이 (Atari 일부) ![](https://i.imgur.com/wZQZxAk.png) - top : DQN과 Double DQN의 value estimate - middle : 두 게임의 value estimate - bottom : 두 게임의 performance - 직선 : 학습 후 interaction 했을 때 actual cumulative rewards 학습 곡선이 최종적으로 직선과 일치한다면 bias가 없다고 볼 수 있다. DQN은 overestimate 한다. Double DQN은 그나마 덜 하지만 estimate가 정확하다고 볼 수는 없다. 하지만 하단을 보면 overestimation이 성능을 떨어뜨린다는 것을 알 수 있다. 결론적으로 Double DQN은 정확한 estimate를 하지는 않지만, 안정적으로 만들어줌으로써 더 나은 정책을 낼 수 있다. ### 3 - Double DQN Atari 성능 비교 $$ \text{score}_{\text{normalized}} = \frac{\text{score}_{\text{agent}}-\text{score}_{\text{random}}}{\text{score}_{\text{human}}-\text{score}_{\text{random}}}$$ #### No-op ![](https://i.imgur.com/zW0W7nN.png) #### Human Starts ![](https://i.imgur.com/GCvNI6u.png) Human Starts에서 더 좋은 성능을 보인 것으로 보아 좀 더 general하게 문제를 잘푼 것으로 볼 수 있다. ## 의문점과 생각 정리 1. Theorem 1.에서 왜 $\sum_a(Q_t(s,a) -V_*(s)) = 0$이라는 조건을 취했을까? $\sum_a(Q_t(s,a) -V_*(s)) \ge 0$로 했다면 더 general한 lower-bound를 구할 수 있지 않았을까? 2. 왜 lower-bound가 $\sqrt{\frac{C}{m-1}}$인 것일까? 3. lower-bound가 tight하다는 것을 얘기하는 것일까? **A)** 이 세 가지 의문점들과 모순을 통한 증명을 가지고 생각해보면, lower-bound를 정확히 구할 수 없어서 average correct한 value estimate에도 upward bias가 있다는 것을 보이고 싶었던 것 뿐인 것 같다. 그래서 tight하다는 것을 통해 나름 타당성을 보이고 싶었던 게 아닐까. 4. Q-learning과 Double-Q-learning이 보여준 실험처럼 더 복잡한 딥러닝에서도 overestimation이 나타날 것이라는 보장이 있는가? 즉, general 한가? **A)** 이미 Theorem 1.에서 estimation error가 upward bias를 일으킬 수 있다는 것을 보였다. 그리고 이 실험은 Double q-learning과 달리 statistical한 argument를 사용하지 않았고, flexible function approximation을 사용하여 충분히 general하게 볼 수 있다고 하는 것 같다. 5. 타겟 네트워크의 보수적인 업데이트로 Double DQN이 효과가 있다면, 둘 다 처음부터 높은 출력을 내도록 파라미터를 초기했을 때엔 어떻게 될까? (실험 해봐야함) ---