# Notes on "[CONTINUOUS CONTROL WITH DEEP REINFORCEMENT LEARNING](https://arxiv.org/pdf/1509.02971.pdf)"
###### tags: `incomplete` `research paper notes` `Policy gradient`
#### Author
[Raj Ghugare](https://github.com/RajGhugare19)
### Introduction:
##
The main reason behind the success of Deep Q-Networks (DQN) was their ability to learn directly from pixels. Before DQN emerged, it was considered difficult to stably approximate high-dimensional, non-linear policies and value functions. The core ideas behind DQN were:
1) Off-policy learning: the network learns from experiences randomly sampled from a replay memory (a minimal sketch follows this list).
2) It uses a separate target network for its temporal-difference-style updates.
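As a rough sketch of the first idea, a replay memory is just a fixed-size store of transitions from which uniform mini-batches are drawn. The minimal Python class below is my own illustration, not code from the paper.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size store of (state, action, reward, next_state, done) transitions."""
    def __init__(self, capacity):
        # a deque with maxlen silently discards the oldest transitions once full
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # uniform random sampling breaks the temporal correlation between consecutive transitions
        return random.sample(self.buffer, batch_size)
```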
But DQN fails to scale to continuous action spaces. The authors apply these ideas to derive an off-policy actor-critic algorithm that can learn complex control policies in continuous action spaces, even directly from raw pixel observations.
### Setting:
##
They assume the standard reinforcement learning setting where the agent interacts with the environment in discrete time-steps, and at every time-step it makes an observation $x_{t}$, takes an action $a_{t}$ and obtains a reward $r_{t}$.
The goal of the agent is to approximate a policy which maximises the expected discounted return over all time-steps:
$G_{t} = \sum_{i=t}^{T} \gamma^{i-t}r_{i}$
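As a small numeric check of this formula, here is a plain-Python sketch (the function name is mine, not the paper's):

```python
def discounted_return(rewards, gamma):
    """G_t for t = 0: the sum over i of gamma**i * r_i."""
    g, discount = 0.0, 1.0
    for r in rewards:
        g += discount * r
        discount *= gamma
    return g

# e.g. rewards [1, 1, 1] with gamma = 0.9 give 1 + 0.9 + 0.81 = 2.71
print(discounted_return([1, 1, 1], 0.9))
```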
Action-value functions are often used in off-policy algorithms. They give the expected return from a time-step, given that we take a particular action at that time-step and follow a particular policy thereafter:
$Q^{\pi}(s_{t},a_{t}) = E_{\tau \sim \pi}[G_{t}|s_{t},a_{t}]$
If we consider our policy to be deterministic, i.e. a direct function from states to actions, then we don't need to take an expectation over actions, because the action given a particular state is no longer a random variable.
Incorporating the Markov property of the environment, we can write the Q function for a deterministic policy recursively:
$Q^{\pi}(s_{t},a_{t}) = E_{r_{t},s_{t+1} \sim E}[r_{t} + \gamma Q^{\pi}(s_{t+1},a_{t+1})]$
where $a_{t+1} = \pi(s_{t+1})$ and $E$ denotes the environment.
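This recursion is exactly what the critic's temporal-difference target looks like in code. A minimal sketch, assuming PyTorch and target networks `target_actor` / `target_critic`; the names and function signature are my own, not the paper's.

```python
import torch

def td_target(reward, next_state, done, target_actor, target_critic, gamma):
    """y_t = r_t + gamma * Q'(s_{t+1}, mu'(s_{t+1})), with bootstrapping cut at terminal states."""
    with torch.no_grad():                                # targets are not differentiated through
        next_action = target_actor(next_state)           # a_{t+1} = mu'(s_{t+1})
        next_q = target_critic(next_state, next_action)  # Q'(s_{t+1}, a_{t+1})
    return reward + gamma * (1.0 - done) * next_q
```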
### DDPG - Idea and key implementation details:
##
#### The major components in DDPG:
* $\mu(s|\theta^{\mu}) \rightarrow$ the actor, which maps states to deterministic actions.
* $Q(s,a|\theta^{Q}) \rightarrow$ the critic, learned using a Q-learning-style update (both networks are sketched below).
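A minimal PyTorch sketch of these two networks for the low-dimensional case. The two hidden layers of 400 and 300 units and the point where the action enters the critic follow the paper's supplementary details; everything else (class and variable names) is my own illustration.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """mu(s | theta_mu): deterministic policy; tanh bounds each action dimension to [-1, 1]."""
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 400), nn.ReLU(),
            nn.Linear(400, 300), nn.ReLU(),
            nn.Linear(300, action_dim), nn.Tanh(),
        )

    def forward(self, state):
        return self.net(state)

class Critic(nn.Module):
    """Q(s, a | theta_Q): the action is concatenated in at the second hidden layer."""
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.fc1 = nn.Linear(state_dim, 400)
        self.fc2 = nn.Linear(400 + action_dim, 300)
        self.out = nn.Linear(300, 1)

    def forward(self, state, action):
        x = torch.relu(self.fc1(state))
        x = torch.relu(self.fc2(torch.cat([x, action], dim=1)))
        return self.out(x)
```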
#### Implementation details:
* learning rate for the actor networks = 0.0001
* learning rate for the critic networks = 0.001
* For the critic network they used an L2 weight decay of = 0.01
* For updating the target networks they used $\tau$ = 0.001
* The activation for all layers except the final layer was ReLU. In the final layer of the actor network they used tanh to bound the actions.
* The final layer's weights and biases were initialised from a uniform distribution over $[-3 \times 10^{-3}, 3 \times 10^{-3}]$ when using the low-dimensional state representation and $[-3 \times 10^{-4}, 3 \times 10^{-4}]$ when learning directly from pixels.
* The other layers were initialised from uniform distributions $[-\frac{1}{\sqrt f},\frac{1}{\sqrt f}]$ where $f$ is the fan-in of the layer (the number of input units to that layer).
* Mini-batch size of 64 for the low-dimensional case and 16 when learning from pixels.
* For exploration noise they use an Ornstein-Uhlenbeck process with $\theta = 0.15$ and $\sigma = 0.2$ (sketched after this list).
* For the critic network, the actions were not included until the fully connected layers when learning from pixels, and not until the second hidden layer when learning from low-dimensional inputs.
* The replay buffer size used was $10^{6}$.
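A few of these details written out as code: the Ornstein-Uhlenbeck exploration noise, the soft target-network update with $\tau$, and the fan-in initialisation. This is a sketch under the hyper-parameters listed above, assuming numpy and PyTorch; the class and function names are my own, not the paper's.

```python
import numpy as np
import torch.nn as nn

class OrnsteinUhlenbeckNoise:
    """Temporally correlated noise: x <- x + theta * (mu - x) + sigma * N(0, 1), i.e. dt = 1."""
    def __init__(self, action_dim, mu=0.0, theta=0.15, sigma=0.2):
        self.mu, self.theta, self.sigma = mu, theta, sigma
        self.x = np.full(action_dim, mu)

    def sample(self):
        self.x += self.theta * (self.mu - self.x) + self.sigma * np.random.randn(*self.x.shape)
        return self.x

def soft_update(target_net, online_net, tau=1e-3):
    """theta_target <- tau * theta_online + (1 - tau) * theta_target."""
    for t_param, param in zip(target_net.parameters(), online_net.parameters()):
        t_param.data.copy_(tau * param.data + (1.0 - tau) * t_param.data)

def fan_in_init(layer):
    """Uniform init in [-1/sqrt(f), 1/sqrt(f)], where f is the number of input units of the layer."""
    if isinstance(layer, nn.Linear):
        bound = 1.0 / np.sqrt(layer.in_features)
        nn.init.uniform_(layer.weight, -bound, bound)
        nn.init.uniform_(layer.bias, -bound, bound)
```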