# Introduction to Deep Reinforcement Learning
###### tags: `DL`
This is a note on the Machine Learning course lectured by Prof. Hung-yi Lee, NTUEE. Readers can find all the slides [here](http://speech.ee.ntu.edu.tw/~tlkagk/courses_ML20.html).
## Outline
1. Basic setting of DRL
2. Policy-based Algorithm: PPO
3. Value-based Algorithm: Q-learning (DQN)
4. Actor-Critic Algorithm (A2C)
## Basic setting of RL
Following Prof. Lee's lecture notes, we do not take the Markov Decision Process (MDP) approach, which is the classic way to introduce RL in courses such as the renowned [UCL Course on RL](https://www.davidsilver.uk/teaching/).

An RL problem basically consists of the following elements:
1. State determined by the environment at time $t$: $s_t$
2. Action given by an agent at time $t$: $a_t$, associated with a "policy" $\pi$
3. Reward given by the environment at time $t$: $r_t$

Such a sequence $\tau=\{s_1, a_1, r_1, s_2, a_2, r_2, \dots, s_T, a_T ,r_T\}$ is defined to be an "episode". For example, an episode can be one whole game played by the machine.
The goal is to maximize the expected cumulative reward per episode through a policy, which in DRL can be a neural network.
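As a concrete picture of the interaction loop, here is a minimal sketch of how one episode $\tau$ and its total reward are collected; `env` and `policy` are hypothetical stand-ins for a real environment and a (possibly neural-network) policy, not a specific library API.
```python
# Minimal sketch of collecting one episode tau = {s_1, a_1, r_1, ...}
# `env` and `policy` are hypothetical placeholders.
def run_episode(env, policy):
    tau, total_reward = [], 0.0
    s, done = env.reset(), False
    while not done:
        a = policy(s)                    # sample an action a_t ~ pi(a|s_t)
        s_next, r, done = env.step(a)    # environment returns reward and next state
        tau.append((s, a, r))
        total_reward += r                # cumulative reward R(tau)
        s = s_next
    return tau, total_reward
```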

Here are some things we should know for the later discussion:
1. We only consider the "model-free" approach here, which means we do not model (or assume any knowledge of) the behavior of the environment, such as the transition probability $p(s_{t+1}|s_{t},a_{t})$.
2. The state value function $V^\pi(s_t)$ is defined to be the expected cumulative reward starting from state $s_t$ and following a specific policy $\pi$.
3. The state-action value function $Q^\pi(s_t,a_t)$ is defined similarly, but note that under $\pi$, this "Q function" is defined over all possible actions $a_t$ taken at $s_t$ (see the definitions written out below).
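
For concreteness, with a finite horizon $T$ and no discounting (consistent with the episode return above), these two functions can be written as:
$$
V^\pi(s_t)=\mathbb{E}_{\pi}\left[\sum_{t'=t}^{T} r_{t'}\,\middle|\,s_t\right],\qquad
Q^\pi(s_t,a_t)=\mathbb{E}_{\pi}\left[\sum_{t'=t}^{T} r_{t'}\,\middle|\,s_t,a_t\right]
$$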

Next we talk about some classical algorithms with respect to the different types of learning: policy-based and value-based algorithms, and finally a mixture of these two types.
In order to clearly introduce the concepts of how these algorithms work, we will not focus on "tips" such as double DQN, dueling DQN and so on here. Also, this note will not include implementation details.
## Policy-based Algorithm: PPO
Before we start the part of PPO, Proximal Policy Optimization, we first introduce the concept of policy gradient.

As with the basic recipe in ML or DL, we want to find a function that handles both training data and testing data well. In DRL, we model the policy $\pi$ as a neural network (parametrized as $\pi_\theta$), which may be a CNN in the case of video games. The output of the NN is typically the probability of taking each action in the action space (either discrete or continuous).

The goodness of the function (the actor) is evaluated by the total reward $R_\theta=\sum_{t=1}^{T} r_t$, which is a random variable. Thus we replace it by the expected reward $\bar{R}_\theta$.
The model-free approach is mainly built on sampling: we have no information about the environment, and obviously none about what reward it may give, so everything is observed through sampling. For example, the expected value is usually replaced by a sample average.
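Concretely, sampling $N$ episodes $\tau^1,\dots,\tau^N$ by playing with $\pi_\theta$ gives the Monte Carlo estimate
$$
\bar{R}_\theta=\mathbb{E}_{\tau\sim p_\theta(\tau)}\left[R(\tau)\right]\approx\frac{1}{N}\sum_{n=1}^{N}R(\tau^n),\qquad R(\tau)=\sum_{t=1}^{T}r_t
$$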

To train the actor as a NN, we rely on the usual optimization technique, namely gradient descent (here, gradient ascent on the reward).
We can formulate the optimization problem and the gradient ascent step as follows. The idea is to replace the exact expectation by a Monte Carlo estimate.
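A sketch of the standard derivation, using the log-derivative trick $\nabla P = P\,\nabla \log P$:
$$
\theta^{*}=\arg\max_{\theta}\bar{R}_\theta,\qquad \theta\leftarrow\theta+\eta\,\nabla\bar{R}_\theta
$$
$$
\nabla\bar{R}_\theta=\sum_{\tau}R(\tau)\,\nabla P(\tau|\theta)=\sum_{\tau}R(\tau)\,P(\tau|\theta)\,\nabla\log P(\tau|\theta)\approx\frac{1}{N}\sum_{n=1}^{N}R(\tau^n)\,\nabla\log P(\tau^n|\theta)
$$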


The term $\nabla \log P(\tau^n|\theta)$ can be expanded by the chain rule of conditional probability, and we finally obtain the form of the policy gradient.
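Since the environment terms $p(s_1)$ and $p(s_{t+1}|s_t,a_t)$ do not depend on $\theta$, they drop out of the gradient:
$$
P(\tau|\theta)=p(s_1)\prod_{t=1}^{T}p_\theta(a_t|s_t)\,p(s_{t+1}|s_t,a_t)
\;\Rightarrow\;
\nabla\log P(\tau|\theta)=\sum_{t=1}^{T}\nabla\log p_\theta(a_t|s_t)
$$
$$
\nabla\bar{R}_\theta\approx\frac{1}{N}\sum_{n=1}^{N}\sum_{t=1}^{T_n}R(\tau^n)\,\nabla\log p_\theta(a_t^n|s_t^n)
$$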


The intuition of the update is quite clear, as stated in the slides. The point here is that we update the parameters based on the total cumulative reward instead of the immediate reward, since the action $a_t^n$ should be evaluated with a long-term view rather than a short-sighted one.
Below we give some remarks and implementation issues:
First, the training process is inefficient because each batch of data is used for only one update. We should not reuse the data obtained by $\theta_{old}$ to train $\theta_{new}$, which makes the training time of policy gradient too long and motivates the appearance of PPO.

Second, the implementation of policy gradient can be seen as training a NN for a classification problem in which the sampled action is taken as the ground truth, but with a loss different from plain cross entropy: it can be seen as a weighted version of maximum likelihood. To be more specific, we record the pair $(r_t,\log p_\theta(a_t|s_t))$ at each time step so as to calculate the terms in the objective function (the negative loss). After $N$ episodes, we calculate the gradient of the loss through backpropagation, as sketched below.
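A minimal PyTorch-style sketch of this weighted maximum-likelihood view, assuming a categorical `policy_net` and pre-collected batches of states, actions, and per-episode returns (all names here are illustrative, not from the slides):
```python
import torch

def policy_gradient_loss(policy_net, states, actions, returns):
    """Weighted negative log-likelihood: each (s_t, a_t) pair is treated as a
    classification target, weighted by the return R(tau) of its episode."""
    logits = policy_net(states)                       # (batch, num_actions)
    log_probs = torch.log_softmax(logits, dim=-1)     # log p_theta(a|s)
    chosen = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    # Maximizing sum_t R(tau) * log p_theta(a_t|s_t) == minimizing its negative
    return -(returns * chosen).mean()
```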
Third, there can be some adjustment to the update, which can be seen as modifying the loss function in order to get a better critic. Instead of evaluating the actor by the overall reward $R(\tau^n)-b$, here we consider evaluating each move depending on how the episode develops afterwards, written as $A^\theta(s_t,a_t)$ (one common form is given below).
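As an example of such an adjustment (this particular form is a common choice, stated here as an assumption rather than the unique definition), the weight of each move can be taken as the discounted reward-to-go minus a baseline $b$, with discount factor $\gamma$:
$$
A^\theta(s_t,a_t)\;=\;\sum_{t'=t}^{T_n}\gamma^{\,t'-t}\,r_{t'}^{\,n}\;-\;b
$$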


Regarding the first problem, PPO is developed to learn "off-policy", which means we let another actor do the sampling (playing) and take that experience to update the main actor.

Recall that we evaluate the function through $\bar{R}_\theta$, the expected cumulative reward obtained by playing with $\pi_\theta$; the calculation depends on the sequences $\tau$ sampled from $p_\theta(\tau)$. The idea here is to generate the sequences $\tau$ with another NN $\pi_{\theta'}$.
Statistically speaking, we want to estimate an expectation through samples from another distribution, which is known as "importance sampling". The trick is shown below; note that to avoid large variance, we want the proposal distribution to be close to the target distribution.
Note that the "distribution" we talk about here refers to the output layer of the NN. It can be seen as a multinomial (categorical) distribution associated with $\theta$ if the action space is discrete.
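The importance-sampling identity, for a target distribution $p$ and a proposal distribution $q$ covering its support, is
$$
\mathbb{E}_{x\sim p}\left[f(x)\right]=\int f(x)\,p(x)\,dx=\int f(x)\,\frac{p(x)}{q(x)}\,q(x)\,dx=\mathbb{E}_{x\sim q}\left[f(x)\,\frac{p(x)}{q(x)}\right]
$$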

*In the off-policy gradient below, the expectation is taken over $(s_t,a_t)\sim p_{\theta'}(s_t,a_t)$.
On-policy gradient:$$\nabla \bar{R}_\theta\approx \frac{1}{N}\sum_{n=1}^N\sum_{t=1}^{T_n} A^{\theta}(s_t^n,a_t^n)\, \nabla \log p_\theta(a_t^n|s_t^n)
$$
Off-policy gradient:$$\nabla \bar{R}_\theta\approx \frac{1}{N}\sum_{n=1}^N\sum_{t=1}^{T_n} \frac{p_{\theta}(a_{t}^n|s_{t}^n)}{p_{\theta'}(a_{t}^n|s_{t}^n)}\, A^{\theta'}(s_t^n,a_t^n)\, \nabla \log p_\theta(a_t^n|s_t^n)
$$
In summary, the PPO algorithm optimizes the off-policy objective while also taking into account the constraint that the distributions of the two actors should be similar.
$J^{\theta'}(\theta)$ stands for the off-policy objective function.
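Written out with the notation above (a sketch following the lecture's formulation), the PPO objective adds an adaptive KL penalty with weight $\beta$:
$$
J_{\mathrm{PPO}}^{\theta'}(\theta)=J^{\theta'}(\theta)-\beta\,\mathrm{KL}(\theta,\theta'),\qquad
J^{\theta'}(\theta)=\mathbb{E}_{(s_t,a_t)\sim\pi_{\theta'}}\left[\frac{p_\theta(a_t|s_t)}{p_{\theta'}(a_t|s_t)}\,A^{\theta'}(s_t,a_t)\right]
$$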

Also, the KL divergence term is complex to compute and introduces additional hyperparameters, so the PPO2 algorithm takes another approach to the constraint: it clips the importance ratio into $[1-\varepsilon,\,1+\varepsilon]$ and takes the more pessimistic of the clipped and unclipped surrogates.
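A minimal PyTorch-style sketch of the PPO2 clipped surrogate loss, assuming we already have log-probabilities from the current and the old (sampling) policy plus advantage estimates; `eps` is the clipping hyperparameter and all names are illustrative:
```python
import torch

def ppo_clip_loss(log_probs_new, log_probs_old, advantages, eps=0.2):
    """Clipped surrogate objective of PPO2, negated for gradient descent."""
    ratio = torch.exp(log_probs_new - log_probs_old)         # p_theta / p_theta'
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # Element-wise minimum (pessimistic bound), averaged over the batch
    return -torch.min(unclipped, clipped).mean()
```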

## Value-based Algorithm: Q-learning
## Actor-Critic Algorithm