---
title: "Rainbow: Combining Improvements in Deep Reinforcement Learning - Notes"
tags: "IvLabs, RL"
---

# Rainbow DRL

Link to the [Research Paper](https://arxiv.org/pdf/1710.02298.pdf)

{%pdf https://arxiv.org/pdf/1710.02298.pdf%}

## Introduction

Deep Q-Network (DQN)
- Q-Learning
- CNN function approximation
- Experience Replay

Double DQN
- Addresses the overestimation bias of Q-learning by decoupling the selection and evaluation of the bootstrap action

Prioritized Experience Replay
- Improves data efficiency by replaying more often those transitions from which there is more to learn

Dueling Network Architecture
- Helps to generalize across actions by separately representing state values and action advantages

A3C
- Multi-step bootstrap targets
- Shifts the bias-variance trade-off and helps propagate newly observed rewards faster to earlier visited states

Distributional Q-Learning
- Learns a categorical distribution of discounted returns instead of estimating the mean

Noisy DQN
- Uses stochastic network layers for exploration

## Background

### DQN

- Experience Replay
- RMSprop for optimization

Loss minimized on sampled transitions:

$(R_{t+1} + \gamma_{t+1} \underset{a'}{\text{max }} q_{\bar{\theta}}(S_{t+1}, a') - q_{\theta}(S_t, A_t))^2$

where $\bar{\theta}$ are the parameters of a periodically copied target network.

### Extensions to DQN

Small code sketches for these extensions are collected at the end of these notes.

#### Double Q-learning

Overestimation arises from the maximization step in the target; Double Q-learning decouples the selection of the bootstrap action from its evaluation.

![](https://i.imgur.com/tsNCKKq.png)

B → D: reward drawn from $\mathcal{N}(\mu = -0.5, \sigma^2 = 1)$

![](https://i.imgur.com/LlhQWXx.png)

Since some of these actions occasionally produce a positive reward, a maximizing agent would prefer moving left, even though this is unwise: the expected return of doing so is -0.5. Decoupling action selection from action evaluation removes this bias towards actions whose value estimates happen to be too high.

![](https://i.imgur.com/Arueomz.png)

:::danger
May not work for continuous action spaces
:::

#### Prioritized Replay

- Probability of sampling a transition is proportional to its last encountered TD error
- New transitions are inserted into the buffer with maximum priority
- Stochastic transitions might also be favoured even when there is little left to learn about them

#### Dueling Networks

- A value stream and an advantage stream sharing a convolutional encoder, merged by a special aggregator:

$q_{\theta}(s,a) = v_\eta(f_\xi(s)) + a_{\psi}(f_\xi(s),a) - \dfrac{\sum_{a'}a_\psi(f_\xi(s),a')}{N_{\text{actions}}}$

- $\xi$: parameters of the shared encoder $f_\xi$
- $\eta$: parameters of the value stream $v_\eta$
- $\psi$: parameters of the advantage stream $a_\psi$
- $\theta$: their concatenation

#### Multi-step learning

Truncated n-step return from a given state $S_t$:

$\displaystyle R_t^{(n)} \equiv \sum_{k=0}^{n-1} \gamma_t^{(k)} R_{t+k+1}$

A multi-step variant of DQN is then defined by minimizing the alternative loss:

$(R_t^{(n)} + \gamma_t^{(n)}\underset{a'}{\text{max }}q_{\bar{\theta}}(S_{t+n},a')-q_{\theta}(S_t, A_t))^2$

Suitably tuned n-step targets may lead to faster learning.

#### Distributional RL

Approximates the distribution of returns instead of only their expected value.

#### Noisy Nets

An ε-greedy policy explores poorly in cases where many actions must be executed before the first reward is collected.

A noisy linear layer combines a deterministic and a noisy stream:

$y = (b + Wx) + (b_{\text{noisy}} \odot \epsilon^b + (W_\text{noisy}\odot\epsilon^w)x)$

- $\epsilon^b, \epsilon^w$: random variables
- $\odot$: element-wise product

Over time the network can learn to ignore the noisy stream, at different rates in different parts of the state space, allowing state-conditional exploration.

## Key Takeaways

- Prioritized Replay
- Noisy Nets
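
## Code Sketches

The snippets below are minimal PyTorch/NumPy sketches of the extensions above, not the paper's implementation; network sizes, hyperparameters, and helper names are illustrative assumptions.

A sketch of the decoupled Double DQN target, assuming `online_net` and `target_net` are `torch.nn.Module` Q-networks that map a batch of states to per-action Q-values, and that `rewards`, `next_states`, and `dones` are batched tensors.

```python
import torch

def double_dqn_target(online_net, target_net, rewards, next_states, dones, gamma=0.99):
    """r + gamma * Q_target(s', argmax_a Q_online(s', a)) for non-terminal s'."""
    with torch.no_grad():
        # Selection: greedy action according to the online network.
        next_actions = online_net(next_states).argmax(dim=1, keepdim=True)
        # Evaluation: value of that action according to the target network.
        next_q = target_net(next_states).gather(1, next_actions).squeeze(1)
        return rewards + gamma * (1.0 - dones) * next_q
```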
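
A toy proportional prioritized replay buffer. Priorities live in a flat NumPy array; a practical implementation would use a sum-tree and importance-sampling weights, both omitted here for brevity. Class and method names are assumptions.

```python
import numpy as np

class ProportionalBuffer:
    def __init__(self, capacity, alpha=0.6):
        self.capacity, self.alpha = capacity, alpha
        self.data, self.priorities, self.pos = [], np.zeros(capacity), 0

    def add(self, transition):
        # New transitions enter with the maximum priority seen so far.
        max_p = self.priorities.max() if self.data else 1.0
        if len(self.data) < self.capacity:
            self.data.append(transition)
        else:
            self.data[self.pos] = transition
        self.priorities[self.pos] = max_p
        self.pos = (self.pos + 1) % self.capacity

    def sample(self, batch_size):
        # Sampling probability is proportional to priority^alpha.
        p = self.priorities[:len(self.data)] ** self.alpha
        p /= p.sum()
        idx = np.random.choice(len(self.data), batch_size, p=p)
        return idx, [self.data[i] for i in idx]

    def update(self, idx, td_errors, eps=1e-6):
        # Priority is proportional to the last encountered absolute TD error.
        self.priorities[idx] = np.abs(td_errors) + eps
```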
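
A minimal dueling head implementing the aggregator above, assuming a shared encoder has already produced a flat feature vector of size `feat_dim`; layer shapes are illustrative.

```python
import torch
import torch.nn as nn

class DuelingHead(nn.Module):
    def __init__(self, feat_dim, n_actions):
        super().__init__()
        self.value = nn.Linear(feat_dim, 1)               # v_eta
        self.advantage = nn.Linear(feat_dim, n_actions)   # a_psi

    def forward(self, features):
        v = self.value(features)        # shape (batch, 1)
        a = self.advantage(features)    # shape (batch, n_actions)
        # q = v + a - mean_a' a, matching the aggregator above.
        return v + a - a.mean(dim=1, keepdim=True)
```

Subtracting the mean advantage makes the decomposition identifiable: otherwise a constant could be added to all advantages and subtracted from the value without changing q.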
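
A small helper for the truncated n-step return and its bootstrapped target, assuming a constant per-step discount and a list `rewards` holding $R_{t+1}, \dots, R_{t+n}$.

```python
def n_step_return(rewards, gamma):
    """sum_{k=0}^{n-1} gamma^k * R_{t+k+1}, with a constant discount gamma."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

def n_step_target(rewards, bootstrap_value, gamma):
    """n-step return plus the discounted bootstrap value max_a' q(S_{t+n}, a')."""
    return n_step_return(rewards, gamma) + (gamma ** len(rewards)) * bootstrap_value
```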
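
A categorical value head in the spirit of distributional RL: the network outputs a probability distribution over a fixed support of returns, and the Q-value is that distribution's mean. The distributional Bellman projection used for training is omitted; the support range and atom count are assumptions.

```python
import torch
import torch.nn as nn

class CategoricalHead(nn.Module):
    def __init__(self, feat_dim, n_actions, n_atoms=51, v_min=-10.0, v_max=10.0):
        super().__init__()
        self.n_actions, self.n_atoms = n_actions, n_atoms
        # Fixed support of possible returns, z_1 ... z_N.
        self.register_buffer("support", torch.linspace(v_min, v_max, n_atoms))
        self.logits = nn.Linear(feat_dim, n_actions * n_atoms)

    def forward(self, features):
        logits = self.logits(features).view(-1, self.n_actions, self.n_atoms)
        probs = logits.softmax(dim=-1)              # distribution over returns per action
        q_values = (probs * self.support).sum(-1)   # Q-value as the distribution's mean
        return probs, q_values
```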
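
A simplified noisy linear layer implementing $y = (b + Wx) + (b_{\text{noisy}} \odot \epsilon^b + (W_\text{noisy}\odot\epsilon^w)x)$. Independent Gaussian noise is sampled on every forward pass instead of the factorised noise used in the paper, and the initialization constants are assumptions.

```python
import torch
import torch.nn as nn

class NoisyLinear(nn.Module):
    def __init__(self, in_features, out_features, sigma0=0.017):
        super().__init__()
        # Deterministic stream: b + Wx.
        self.weight = nn.Parameter(torch.empty(out_features, in_features).uniform_(-0.1, 0.1))
        self.bias = nn.Parameter(torch.zeros(out_features))
        # Learned noise scales for the noisy stream.
        self.weight_noisy = nn.Parameter(torch.full((out_features, in_features), sigma0))
        self.bias_noisy = nn.Parameter(torch.full((out_features,), sigma0))

    def forward(self, x):
        # Fresh noise every forward pass.
        eps_w = torch.randn_like(self.weight_noisy)
        eps_b = torch.randn_like(self.bias_noisy)
        deterministic = nn.functional.linear(x, self.weight, self.bias)
        noisy = nn.functional.linear(x, self.weight_noisy * eps_w, self.bias_noisy * eps_b)
        return deterministic + noisy
```

Because $W_\text{noisy}$ and $b_{\text{noisy}}$ are learned, the network can drive them towards zero in parts of the state space where exploration is no longer useful, which is the state-conditional exploration described above.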