# NYU ML final project
## Attribution
* main paper assigned: ***Asynchronous Methods for Deep Reinforcement Learning*** https://arxiv.org/abs/1602.01783
* other paper: ***Human-level control through deep reinforcement learning*** doi:10.1038/nature14236
* My main code skeleton [ref](https://github.com/wweichn/Actor-Critic-cart-pole/blob/master/main.py)
* [notebook from professor's link](https://github.com/yfletberliac/rlss-2019/blob/master/labs/DRL.01.REINFORCE%2BA2C.ipynb)
* A3C code [ref1](https://github.com/tensorflow/models/blob/master/research/a3c_blogpost/a3c_cartpole.py)
* A3C code [ref2](https://blog.tensorflow.org/2018/07/deep-reinforcement-learning-keras-eager-execution.html)
## Introduction
In the main paper, the authors propose a new framework that improves on the replay-memory approach. But why do we need Replay Memory in the first place? (asked by me)
### What is Replay Memory/Experience Replay and why do you need it?
> A key reason for using replay memory is to break the correlation between consecutive samples. [ref](https://deeplizard.com/learn/video/Bcuj2fTH4_4)
(my words) In my understanding, for learning environments like Atari games, states and controls are strongly correlated in time. Take Galaxian for example: the position of my starfighter changes only slightly from frame to frame, and it may keep moving in the same direction for several frames in a row. If we only use the most recent states and update our policy based on them, the algorithm sees almost the same state-action pair many times and updates the parameters in the same direction multiple times, which can lead to non-convergence. So we store previous experiences (i.e., states, controls, rewards/costs, and the resulting states) in a pool and sample from it to obtain non-consecutive data points.
(my words) Also, we know that a NN is non-linear, and a small change in its weights may lead to a dramatically different output. Strongly correlated data will push the NN's updates in the same direction over and over and result in instability.
#### How does Replay Memory work?
(my words) Let's take Q-learning for discussion. Q-learning is an approach to learn a policy when we don't have the system dynamics. (I learned the definition in a robotics course before.) Instead of updating the Q-function's parameters (no matter whether it is a network, a deep network, or a linear combination of basis functions) from only the transition we just observed, we average the update over samples drawn from the memory pool.
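(my words) To make this concrete, here is a minimal replay-memory sketch I put together (a toy illustration of the mechanism, not code from either paper): transitions go into a fixed-size pool, and updates are computed on randomly sampled mini-batches instead of the most recent transition.
```python
import random
from collections import deque

import numpy as np

class ReplayMemory:
    """Fixed-size pool of (state, action, reward, next_state, done) transitions."""

    def __init__(self, capacity=10000):
        # Old transitions fall out automatically once capacity is reached.
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # Uniform random sampling breaks the temporal correlation
        # between consecutive frames.
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = map(np.array, zip(*batch))
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)
```
The Q-update is then averaged over such a mini-batch, so each parameter step is based on non-consecutive experiences.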
> Reinforcement learning is known to be unstable or even to diverge when a nonlinear function approximator such as a neural network is used to represent the action-value (also known as Q) function [doi:10.1038/nature14236]
(my words) I learned before that non-linear systems are very sensitive to initial conditions: their trajectories in phase space can differ dramatically even if the two starting points are very close. In our case, since several activation functions are composed (even if they are just ReLUs), a DNN is highly non-linear.
### What does this paper want to do?
(taking information from the paper and using my words) Instead of using a memory pool, the authors use multiple behavior-policy learners (explorers) running in parallel to produce training data (each agent may even use a different exploration policy if needed). This approach keeps the target policy learner (the valuer) from training on correlated data.
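(my words) A rough toy sketch of the mechanism with Python threads and one shared parameter vector; this is only my illustration of the idea, not the paper's implementation, and the random "gradient" just stands in for whatever each worker would compute from its own copy of the environment.
```python
import threading

import numpy as np

shared_theta = np.zeros(8)   # the one shared set of policy/value parameters
lock = threading.Lock()      # a lock keeps this toy simple; the paper applies updates asynchronously

def worker(worker_id, n_steps=100, lr=0.01):
    rng = np.random.default_rng(worker_id)   # each worker explores differently
    for _ in range(n_steps):
        # Stand-in for: run the worker's private environment copy and
        # compute a policy/value gradient from its own experience.
        grad = rng.normal(size=shared_theta.shape)
        with lock:
            shared_theta[:] += lr * grad     # apply the worker's update to the shared weights

threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```
Because each worker sees a different part of the state space at any given moment, the stream of updates arriving at the shared parameters is much less correlated than updates from a single agent.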
### What are the achievements of the paper?
**Explain why the topic is important and how it fits in to ML in general. If there is something you don't really understand, it's better to leave it out than to try and write about it without understanding it.**
I will focus more on A3C.
* My understanding from the paper is that, previously, the main way to combine RL with DNNs was replay memory, and it only works in an off-policy fashion. Here the authors argue that, since they use multiple agents instead of a memory pool, they can now also train DNNs with on-policy methods on these sequential, correlated environments. (I looked up some definitions.)
* (my words, learned from the paper) It needs less computational power than the memory-replay framework.
* (my words, learned from the paper) We can now use on-policy methods with DNNs.
* A3C is one of the most powerful frameworks so far. [ref](https://github.com/Achronus/Machine-Learning-101/wiki/Asynchronous-Advantage-Actor-Critic-(A3C)#asynchronous)
* (my words, from the paper) We can use a multi-core CPU to train multiple actor-learners in parallel to speed things up. In the paper they show that the speedup is roughly proportional to the number of cores (in some cases the speedups are even better than linear).
### Terminology
* on-policy/off-policy
* > On-policy methods attempt to evaluate or improve the policy that is used to make decisions, whereas off-policy methods evaluate or improve a policy different from that used to generate the data. [ref: R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction, 2018]
* (my words) If we use delayed updates, a replay memory, or separate explorer and exploiter policies, then we are doing off-policy learning.
* Q-learning
* (my words) A model-free RL method. We use a Q function to estimate how promising each state-action pair is and choose actions accordingly. The Q function is updated on the fly, and it can be a NN, a DNN, or a linear combination of a set of basis functions like sin/cos.
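(my words) To make the last two terms concrete, here is a tiny tabular Q-learning sketch (my own toy example, using a table instead of a network): the behavior policy is ε-greedy, but the update bootstraps from the greedy max over actions, which is exactly why Q-learning counts as off-policy.
```python
import numpy as np

n_states, n_actions = 10, 2
Q = np.zeros((n_states, n_actions))   # tabular Q function (a NN/DNN would replace this table)
alpha, gamma, eps = 0.1, 0.99, 0.1
rng = np.random.default_rng(0)

def choose_action(s):
    # Behavior policy: epsilon-greedy exploration.
    if rng.random() < eps:
        return int(rng.integers(n_actions))
    return int(np.argmax(Q[s]))

def q_update(s, a, r, s_next, done):
    # The target uses max_a Q(s', a): we learn about the greedy policy
    # even though we behave epsilon-greedily, hence "off-policy".
    target = r if done else r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])

# One fake transition just to show the call pattern.
q_update(0, choose_action(0), 1.0, 1, False)
```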
## Replication
In this part, you will replicate some or all of the implementation results that were given.
### A3C on cartpole
* Explain what you are going to replicate.
* I am going to implement A3C to keep the CartPole pole balanced upright.
* If it is a figure from a paper or a notebook, show the figure and explain what it means, and why it is important or useful.
* I will show an mp4 movie of the final model's performance and a plot showing the total reward versus training epochs.
* Explain what work was involved in this replication.
* I learned from several tutorials to make my code work. None of the code I referenced could be run directly; there were dependency and environment issues. I also wrote my own code to collect the data I want to show and to make the plots. I read every line and added my own comments.
* Did you just run a notebook that was given to you?
* No. The notebook that was given implements A2C, not A3C.
* Did you have to port the setup process to Colab?
* Yes, I combined several repos into one Colab notebook.
* Was there no notebook, and you had to go through examples in a Github repo?
* I didn't find a directly runnable notebook.
### Some details of the implementation
* actor network
* softmax for output layer of actor network
* ReLU for the 2 fully connected hidden layers
* Adam as optimizer
* the advantage (the difference between the observed return and the critic's value estimate) as the training signal for the actor
* critic network
* a linear (identity) output layer for the critic network
* ReLU for the 2 fully connected hidden layers
* Adam as optimizer
* the squared difference between the observed return and the estimated value as the loss
* sampling to decide an action
* the CartPole state is continuous, but the action space is discrete (left/right); I sample an action from the actor's softmax output and one-hot encode it for training (see the sketch after this list)
* some parameters I use
* Value Network
* input: 4
* hidden1: 40
* hidden2: 40
* output: 1
* Actor Network
* input: 4
* hidden1: 40
* hidden2: 40
* output: 2
* For the code part, the training ran for some hours to complete.
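Below is a minimal sketch of the two networks with the sizes above, the way I would write them in `tf.keras` (this is my own reconstruction for the report, not a copy of the skeleton I ported; the helper `sample_action` is mine):
```python
import numpy as np
import tensorflow as tf

STATE_DIM, N_ACTIONS, HIDDEN = 4, 2, 40

# Actor: state -> action probabilities (softmax output).
actor = tf.keras.Sequential([
    tf.keras.layers.Dense(HIDDEN, activation='relu', input_shape=(STATE_DIM,)),
    tf.keras.layers.Dense(HIDDEN, activation='relu'),
    tf.keras.layers.Dense(N_ACTIONS, activation='softmax'),
])
actor.compile(optimizer=tf.keras.optimizers.Adam(), loss='categorical_crossentropy')

# Critic: state -> scalar value estimate (linear output).
critic = tf.keras.Sequential([
    tf.keras.layers.Dense(HIDDEN, activation='relu', input_shape=(STATE_DIM,)),
    tf.keras.layers.Dense(HIDDEN, activation='relu'),
    tf.keras.layers.Dense(1),
])
critic.compile(optimizer=tf.keras.optimizers.Adam(), loss='mse')

def sample_action(state):
    # Sample from the actor's softmax output and one-hot encode the chosen action.
    probs = actor.predict(state[np.newaxis, :])[0]
    probs = probs / probs.sum()               # guard against float32 rounding
    action = np.random.choice(N_ACTIONS, p=probs)
    return action, np.eye(N_ACTIONS)[action]
```
With this layout, the actor can be fit with cross-entropy on the chosen one-hot actions weighted by the advantage, and the critic with MSE against the observed returns.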
### Some technical issues I solved
* The skeleton I mainly referenced is in Python 2; I rewrote it in Python 3. It was designed for the command line, so I restructured it to fit Colab.
* to fix `xdpyinfo was not found, X start can not be checked! Please install xdpyinfo!` [ref](https://stackoverflow.com/questions/57019490/missing-package-to-enable-rendering-openai-gym-in-colab)
* to store mp4 videos with `gym`, we need the following [ref](https://stackoverflow.com/questions/52636899/python-openai-gym-monitor-creates-json-files-in-the-recording-directory)
```python
import gym

# Wrap the environment so that every episode is recorded as an mp4 file.
env = gym.make('NAME')
env = gym.wrappers.Monitor(env, 'PATH', force=True,
                           video_callable=lambda episode: True)
```
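* to actually watch the recorded mp4 inside Colab, I embed it with IPython; this is a small helper of my own (adjust `'PATH'` to the same directory passed to the Monitor wrapper above)
```python
import base64
import glob

from IPython.display import HTML, display

def show_latest_video(directory='PATH'):
    # Grab the most recently written recording from the Monitor directory
    # and embed it as an HTML5 video in the notebook output.
    mp4_files = sorted(glob.glob(directory + '/*.mp4'))
    encoded = base64.b64encode(open(mp4_files[-1], 'rb').read()).decode('ascii')
    display(HTML('<video width="400" controls>'
                 '<source src="data:video/mp4;base64,' + encoded + '" type="video/mp4">'
                 '</video>'))
```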
* the `gym.render()` issue when running on a server without a display [ref](https://stackoverflow.com/questions/40195740/how-to-run-openai-gym-render-over-a-server)
* Much of the code on the net is for TensorFlow 1. To use it in today's Colab, we need `%tensorflow_version 1.x` to select that version.
* remember to disable gym's `render()` during training; it is slow.
> env.render() : This command will display a popup window. Since it is written within a loop, an updated popup window will be rendered for every new action taken in each step. [ref](https://medium.com/@ashish_fagna/understanding-openai-gym-25c79c06eccb)
## Extension
For the policy network, the update is gradient ascent on the expected return:
$$\theta_\pi \leftarrow \theta_\pi + \gamma_\pi \alpha^t \delta_t \nabla \ln \pi(u_t | x_t, \theta_\pi)$$
where
$$\delta_t = G_t - V(x_t, \theta_V)$$
is the advantage.
And the value-network parameters are updated by gradient descent on the squared error $\delta_t^2$, which gives
$$\theta_V \leftarrow \theta_V + \gamma_V \alpha^t \delta_t \nabla V(x_t, \theta_V)$$
I tried to derive
$$\nabla V(x, \theta_V)$$
and
$$\nabla \ln \pi(u_t | x_t, \theta_\pi)$$.
For simplicity I assume a basis function approach.
$$V(x, \theta_V) = \theta_V^T B(x)$$
and
$$h(x,u,\theta_\pi) = \theta_\pi^T \Psi(x,u)$$
$$\pi(u|x,\theta_\pi) = \frac{e^{h(x,u,\theta_\pi)}}{\sum_a e^{h(x,a,\theta_\pi)}}$$
Please see the attached pdf.
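For reference, here is what the two gradients come out to under the basis-function assumption above (the detailed steps are in the attached pdf). Since the value function is linear in its parameters,
$$\nabla V(x, \theta_V) = B(x)$$
and, taking the log of the softmax policy, $\ln \pi(u|x,\theta_\pi) = h(x,u,\theta_\pi) - \ln \sum_a e^{h(x,a,\theta_\pi)}$, so
$$\nabla \ln \pi(u | x, \theta_\pi) = \Psi(x,u) - \sum_a \pi(a|x,\theta_\pi)\, \Psi(x,a)$$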