###### tags: `IAR`
# IAR - Homework 3 - Double Deep Q-Learning
Double deep Q-learning (DDQN) is an algorithm that reduces overestimation by decomposing the max operation in the target into action selection and action evaluation.
The greedy policy is evaluated according to the online network ($\theta_t$), whereas the target network ($\theta_t^{-}$) is used to estimate its value. The name double DQN comes from "Double Q-learning" and "Deep Q-learning" (DQN), which are both employed to implement it.
The target in DDQN is defined as follows:
$Y_t^{\text {DoubleDQN }} \equiv R_{t+1}+\gamma Q\left(S_{t+1}, \underset{a}{\operatorname{argmax}} Q\left(S_{t+1}, a ; \boldsymbol{\theta}_t\right), \boldsymbol{\theta}_t^{-}\right)$
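As an illustration, here is a minimal PyTorch sketch of how this target could be computed for a batch of transitions. The names `online_net`, `target_net` and the tensor arguments are placeholders for this sketch, not the exact variables of our implementation, and `gamma` denotes the discount factor.

```python
import torch


def ddqn_target(online_net, target_net, rewards, next_states, dones, gamma):
    """Compute Y_t^DoubleDQN for a batch of transitions (sketch).

    `rewards` and `dones` are 1-D float tensors; `next_states` is a batch of states.
    """
    with torch.no_grad():
        # Action selection: argmax_a Q(S_{t+1}, a; theta_t) with the online network
        next_actions = online_net(next_states).argmax(dim=1, keepdim=True)
        # Action evaluation: Q(S_{t+1}, a*; theta_t^-) with the target network
        next_q = target_net(next_states).gather(1, next_actions).squeeze(1)
        # No bootstrapping on terminal transitions
        return rewards + gamma * next_q * (1.0 - dones)
```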
In our project, we used DDQN to train an agent to play the game Flappy Bird. In Flappy Bird the player controls a bird and tries to fly between green pipes without hitting them; the more pipes the bird passes, the more points it scores.
## Pseudo-code of the algorithm

Most of the code is commented to make it easier to follow.
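For reference, the overall training procedure can be summarised by the following Python-style sketch. It assumes a Gym-like environment interface (`env.reset`, `env.step`, `env.action_space.sample`), a replay buffer exposing `push`, `sample` and `__len__`, and the `ddqn_target` helper sketched above; it also interprets *update step* as the interval (in steps) between target-network synchronisations and anneals epsilon linearly. None of these names or choices are taken verbatim from our code.

```python
import random

import torch
import torch.nn.functional as F


def train(env, online_net, target_net, optimizer, replay_buffer,
          total_episodes, batch_size, update_step, gamma,
          initial_epsilon, final_epsilon):
    epsilon = initial_epsilon
    step = 0
    for episode in range(total_episodes):
        state = env.reset()
        done = False
        while not done:
            # Epsilon-greedy action selection with the online network
            if random.random() < epsilon:
                action = env.action_space.sample()
            else:
                with torch.no_grad():
                    q_values = online_net(torch.as_tensor(state, dtype=torch.float32))
                action = int(q_values.argmax().item())

            next_state, reward, done, _ = env.step(action)
            replay_buffer.push(state, action, reward, next_state, done)
            state = next_state
            step += 1

            if len(replay_buffer) >= batch_size:
                # Sample a minibatch of past transitions (returned as tensors)
                states, actions, rewards, next_states, dones = replay_buffer.sample(batch_size)
                # Double DQN target: select with the online net, evaluate with the target net
                targets = ddqn_target(online_net, target_net, rewards, next_states, dones, gamma)
                q = online_net(states).gather(1, actions.long().unsqueeze(1)).squeeze(1)
                loss = F.mse_loss(q, targets)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()

            # Periodically copy the online weights into the target network
            if step % update_step == 0:
                target_net.load_state_dict(online_net.state_dict())

        # Linearly anneal epsilon towards its final value
        epsilon = max(final_epsilon,
                      epsilon - (initial_epsilon - final_epsilon) / total_episodes)
```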
## Explanations of choices
For the algorithm and the network we chose the following hyperparameters. During implementation, different combinations of these parameters were tried as experimental setups.
| Hyperparameter | Value |
| --------------- | ----------- |
| total episodes | 1 000 000 |
| memory size | 100 000 |
| update step | 100 |
| initial epsilon | 0.9 |
| final epsilon | 0.01 |
| batch size | 32 |
| learning rate | $1 \times 10^{-3}$ |
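For readability, these values can also be gathered in a single configuration object, as in the sketch below; the attribute names and comments are illustrative assumptions, not necessarily the exact names used in the code.

```python
from dataclasses import dataclass


@dataclass
class Config:
    total_episodes: int = 1_000_000
    memory_size: int = 100_000   # replay buffer capacity
    update_step: int = 100       # interpreted here as the target-network update interval
    initial_epsilon: float = 0.9
    final_epsilon: float = 0.01
    batch_size: int = 32
    learning_rate: float = 1e-3
```

Such an object can then be passed around instead of a long list of individual arguments.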
We wanted to focus more on the implementation of the algorithm than on a complicated neural-network structure. That is why the network we use to train the agent consists of three linear layers (fully connected layers), in which every element of the input vector influences every element of the output vector. It is a common and simple network architecture.
Each linear layer takes the arguments *in_features* (size of each input sample), *out_features* (size of each output sample) and *bias* (if set to `False`, the layer does not learn an additive bias).
In our case, the input of the first layer is the state of the agent (*state_dim*) and the output of the last layer has one unit per action (*action_dim*), so the network returns a Q-value for every possible action. In the forward pass (traversing all neurons from the first to the last layer), the output of every hidden linear layer is passed through a leaky ReLU.
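A minimal PyTorch sketch of such a network is given below; the hidden layer width of 128 is an illustrative assumption.

```python
import torch.nn as nn
import torch.nn.functional as F


class QNetwork(nn.Module):
    """Three fully connected layers; each hidden layer is followed by a leaky ReLU."""

    def __init__(self, state_dim, action_dim, hidden_dim=128):
        super().__init__()
        self.fc1 = nn.Linear(state_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, hidden_dim)
        self.fc3 = nn.Linear(hidden_dim, action_dim)

    def forward(self, x):
        # Forward pass from the first to the last layer
        x = F.leaky_relu(self.fc1(x))
        x = F.leaky_relu(self.fc2(x))
        return self.fc3(x)  # one Q-value per action
```

Returning one Q-value per action means that both the epsilon-greedy policy and the argmax in the double-DQN target only need a single forward pass.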
## Learning curves and graphs
The following curves show the evolution of epsilon and of the learning rate.


We can observe that after 700 000 episodes, the mean squared error starts to grow, so the training may no longer be as effective. This is reflected in the average episode reward, which does not increase beyond this threshold.

