---
title : "Rainbow: Combining Improvements in Deep Reinforcement Learning - Notes"
tags : "IvLabs, RL"
---
# Rainbow DRL
Link to the [Research Paper](https://arxiv.org/pdf/1710.02298.pdf)
{%pdf https://arxiv.org/pdf/1710.02298.pdf%}
## Introduction
Deep Q-Network (DQN) combines
- Q-Learning
- CNN
- Experience Replay
Double DQN
- Addresses the overestimation bias of Q-learning by decoupling the selection and evaluation of the bootstrap action
Prioritized Experience Replay
- Improves data efficiency by replaying more often the transitions from which there is more to learn
Dueling Network Architecture
- Helps to generalize across actions by separately representing state values and action advantages
A3C
- Uses multi-step bootstrap targets, which shifts the bias-variance trade-off and helps propagate newly observed rewards faster to earlier visited states
Distributional Q-Learning
- Learns a categorical distribution of discounted returns instead of estimating the mean
Noisy DQN
- Uses stochastic network layers for exploration
## Background
### DQN
Experience Replay
RMSprop - Optimization
$(R_{t+1} + \gamma_{t+1} \underset{a'}{\text{ max }}q_{\bar{\theta}}(S_{t+1}, a') - q_{\theta}(S_t, A_t))^2$
$\bar{\theta}$ - Parameters of the periodically updated target network
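A minimal PyTorch sketch of this loss, assuming placeholder networks `q_net` (online, $\theta$) and `q_target` (target, $\bar{\theta}$) and batched tensors:
```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, q_target, s, a, r, s_next, gamma):
    # q_theta(S_t, A_t): value of the action actually taken, from the online network
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # max_a' q_theta_bar(S_{t+1}, a'): greedy bootstrap from the target network
        q_next = q_target(s_next).max(dim=1).values
    target = r + gamma * q_next
    return F.mse_loss(q_sa, target)
```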
### Extensions to DQN
#### Double Q-learning
Overestimation arises from the maximization step in the target.
Double Q-learning decouples the selection and evaluation of the bootstrap action.
![](https://i.imgur.com/tsNCKKq.png)
B → D: Reward ~ N(mean = -0.5, variance = 1)
![](https://i.imgur.com/LlhQWXx.png)
Since some actions may occasionally produce a positive reward, the agent would pick moving left, but this is unwise: the expected return of doing so is -0.5.
Double Q-learning addresses this by decoupling the selection of the bootstrap action from its evaluation, so the estimate tracks the expected return instead of the optimistic maximum.
![](https://i.imgur.com/Arueomz.png)
:::danger
May not work for Continuous Action Spaces
:::
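The decoupled target can be sketched with the same placeholder setup as the DQN loss above: the online network selects the bootstrap action, the target network evaluates it.
```python
import torch
import torch.nn.functional as F

def double_dqn_loss(q_net, q_target, s, a, r, s_next, gamma):
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # Selection: argmax under the online network
        a_star = q_net(s_next).argmax(dim=1, keepdim=True)
        # Evaluation: value of that action under the target network
        q_next = q_target(s_next).gather(1, a_star).squeeze(1)
    target = r + gamma * q_next
    return F.mse_loss(q_sa, target)
```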
#### Prioritized Replay
- Probability of picking a sample is proportional to its last encountered TD error
- New Transitions are fed into the buffer with maximum priority
- Stochastic transitions might also be favoured even when there is little left to learn about them (see the sketch below)
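A toy sketch of proportional prioritization (no sum-tree, purely illustrative; the class and its parameters are assumptions, not the paper's implementation):
```python
import numpy as np

class ProportionalReplay:
    """Toy proportional prioritized replay buffer (illustrative sketch)."""
    def __init__(self, capacity, omega=0.5):
        self.capacity, self.omega = capacity, omega
        self.buffer, self.priorities = [], []

    def add(self, transition):
        # New transitions enter with maximum priority so they are replayed at least once
        self.buffer.append(transition)
        self.priorities.append(max(self.priorities, default=1.0))
        if len(self.buffer) > self.capacity:
            self.buffer.pop(0)
            self.priorities.pop(0)

    def sample(self, batch_size):
        # Probability of picking a sample is proportional to priority^omega
        p = np.asarray(self.priorities) ** self.omega
        idx = np.random.choice(len(self.buffer), batch_size, p=p / p.sum())
        return idx, [self.buffer[i] for i in idx]

    def update(self, idx, td_errors):
        # Priority tracks the last encountered absolute TD error
        for i, err in zip(idx, td_errors):
            self.priorities[i] = abs(err) + 1e-6
```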
#### Dueling Networks
- Value stream and Advantage stream
- Sharing a convolution encoder, merged by a special aggregator
$q_{\theta} (s,a) = v_\eta(f_\xi(s)) + a_{\Psi}(f_\xi(s),a) - \dfrac{\sum_{a'}a_\Psi(f_\xi(s),a')}{N_{\text{actions}}}$
$\xi$ - Parameters of the shared encoder $f_\xi$
$\eta$ - Parameters of the value stream $v_\eta$
$\Psi$ - Parameters of the advantage stream $a_\Psi$
$\theta$ - Their concatenation
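A minimal PyTorch sketch of this aggregator (`encoder`, `feat_dim`, and `n_actions` are assumed placeholders):
```python
import torch.nn as nn

class DuelingHead(nn.Module):
    """Value and advantage streams on top of a shared encoder f_xi."""
    def __init__(self, encoder, feat_dim, n_actions):
        super().__init__()
        self.encoder = encoder                            # f_xi
        self.value = nn.Linear(feat_dim, 1)               # v_eta
        self.advantage = nn.Linear(feat_dim, n_actions)   # a_psi

    def forward(self, s):
        f = self.encoder(s)
        v = self.value(f)                  # shape (batch, 1)
        a = self.advantage(f)              # shape (batch, n_actions)
        # Aggregator: q = v + a - mean_a' a, matching the formula above
        return v + a - a.mean(dim=1, keepdim=True)
```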
#### Multi-step learning
Truncated n-step return from a given state S~t~
$\displaystyle R_t^{(n)} \equiv \underset{k=0}{\overset{n-1}{\sum}} \gamma_t^{(k)} R_{t+k+1}$
A multi-step variant of DQN is then defined by minimizing the alternative loss
$(R_t^{(n)} + \gamma_t^{(n)}\underset{a'}{\text{max }}q_{\bar{\theta}}(S_{t+n},a')-q_{\theta}(S_t, A_t))^2$
May lead to faster learning
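A small Python sketch of the full multi-step target (names are illustrative), accumulating the return backwards so that episode ends can zero out the discount:
```python
def n_step_target(rewards, discounts, bootstrap_value):
    """Truncated n-step return plus the discounted bootstrap term.

    rewards:         [R_{t+1}, ..., R_{t+n}]
    discounts:       [gamma_{t+1}, ..., gamma_{t+n}] (0 at episode termination)
    bootstrap_value: max_a' q(S_{t+n}, a') from the target network
    """
    g = bootstrap_value
    for r, gamma in zip(reversed(rewards), reversed(discounts)):
        # Each step: R_{t+k+1} + gamma_{t+k+1} * (return from the remaining steps)
        g = r + gamma * g
    return g
```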
#### Distributional RL
Approximate the distribution of returns instead of the expected return
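Rainbow follows the categorical (C51) parametrisation: a fixed support of 51 atoms on $[-10, 10]$, with Q-values recovered as the expectation of the predicted distribution. A minimal sketch (the projection of the Bellman target onto the support is omitted):
```python
import torch

# C51-style fixed support: 51 atoms equally spaced between v_min and v_max
n_atoms, v_min, v_max = 51, -10.0, 10.0
support = torch.linspace(v_min, v_max, n_atoms)

def q_values(prob_dist):
    """Expected return per action.

    prob_dist: (batch, n_actions, n_atoms) probabilities, e.g. a softmax over atoms.
    """
    return (prob_dist * support).sum(dim=-1)
```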
#### Noisy Nets
ε-Greedy policy - limited exploration in cases where many actions must be executed before the first reward is collected
A noisy linear layer that combines a deterministic and noisy stream
$y = (b + Wx) + (b_{\text{noisy}} \odot \epsilon^b + (W_\text{noisy}\odot\epsilon^w)x)$
$\epsilon^b, \epsilon^w$ - Random Variables
$\odot$ - element wise product
Over time, the network can learn to ignore the noisy stream, at different rates in different parts of the state space, allowing a form of state-conditional exploration
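A sketch of such a layer in PyTorch, using independent Gaussian noise for simplicity (Rainbow itself uses factorised Gaussian noise); the initialisation scale `sigma0` is illustrative:
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoisyLinear(nn.Module):
    """y = (b + Wx) + (b_noisy * eps_b + (W_noisy * eps_w) x), noise resampled each forward pass."""
    def __init__(self, in_dim, out_dim, sigma0=0.5):
        super().__init__()
        self.w = nn.Parameter(torch.randn(out_dim, in_dim) * 0.01)   # deterministic stream W
        self.b = nn.Parameter(torch.zeros(out_dim))                  # deterministic stream b
        # Learnable scales of the noisy stream; can shrink towards zero to ignore the noise
        self.w_noisy = nn.Parameter(torch.full((out_dim, in_dim), sigma0 / in_dim ** 0.5))
        self.b_noisy = nn.Parameter(torch.full((out_dim,), sigma0 / in_dim ** 0.5))

    def forward(self, x):
        eps_w = torch.randn_like(self.w)   # epsilon^w
        eps_b = torch.randn_like(self.b)   # epsilon^b
        return F.linear(x, self.w + self.w_noisy * eps_w, self.b + self.b_noisy * eps_b)
```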
## Key Takeaways
- Prioritized Replay
- Noisy Nets