# [ML Lecture 23-1: Deep Reinforcement Learning](https://www.youtube.com/watch?v=W8XF3ME8G2I&feature=youtu.be)

## RL: Scratching the Surface

Kreeger published an RL paper in Nature on playing small Atari games, where the agent could beat human players. In spring 2016 AlphaGo beat human players, and David Silver later said AI = RL + DL.

#### Scenario of Reinforcement Learning

The state is the state of the environment, i.e., what the machine sees; the state is the observation. Actions change the environment.

![](https://i.imgur.com/xwtTrTO.png)
![](https://i.imgur.com/jY1P4cE.png)

The machine learns to take actions that maximize the reward.

![](https://i.imgur.com/p1tYAeR.png)

The hard part is that the reward is often sparse: while playing, the reward at most steps is 0, and the reward only arrives after finishing the whole game.

---

### Learning to Play Go

* Supervised: learning from a teacher
* Reinforcement learning: learning from experience
    * First move -> ... many moves ... -> win
    * (two agents play with each other)
* AlphaGo is supervised learning + reinforcement learning

---

### Learning a Chat-bot

![](https://i.imgur.com/SHCU3xN.png)
![](https://i.imgur.com/0pI8yrp.png)

In Go it is easy to tell good from bad: winning is good, losing is bad. For a conversation it is hard to define what counts as good or bad.

---

![](https://i.imgur.com/UBIE6fo.png)

More applications

* Flying helicopter
* Driving
* Google Cuts Its Giant Electricity Bill With DeepMind-Powered AI
* Text generation

---

### Example: Playing Video Games

* Widely studied:
    * Gym: https://gym.openai.com/
    * Universe: https://openai.com/blog/universe/
* :robot_face: Machine learns to play video games as human players do
    * What the machine observes is pixels
    * Machine learns to take proper actions by itself 😁

![](https://i.imgur.com/LWakpMW.png)

* The machine learns to maximize the reward in each episode.

---

## Difficulties of Reinforcement Learning

* Reward delay
    * In Space Invaders, only "fire" obtains a reward
        * although the moves before "fire" are important
    * In Go, it may be better to sacrifice immediate reward to gain more long-term reward
* The agent's actions affect the subsequent data it receives
    * e.g. exploration

---

## Two Broad Categories of RL

1. Value-based
2. Policy-based

![](https://i.imgur.com/JEdcGG6.png)

Model-based: predict what will happen in the environment in the future.

* To learn deep reinforcement learning ......
    * Textbook: Reinforcement Learning: An Introduction
        * https://webdocs.cs.ualberta.ca/~sutton/book/the-book.html
    * Lectures of David Silver
        * http://www0.cs.ucl.ac.uk/staff/D.Silver/web/Teaching.html (10 lectures, 1:30 each)
        * http://videolectures.net/rldm2015_silver_reinforcement_learning/ (Deep Reinforcement Learning)
    * Lectures of John Schulman
        * https://youtu.be/aUrX-rP_ss4

## Policy-based Approach

* Learning an Actor

Machine Learning ≈ Looking for a Function

![](https://i.imgur.com/jM38o7E.png)

* Three steps for deep learning:
    1. Neural network as actor
    2. Goodness of function
    3. Pick the best function
* Neural Network as Actor
    * Input of the neural network: the observation of the machine, represented as a vector or a matrix
    * Output of the neural network: each action corresponds to a neuron in the output layer
    * The NN plays the role of the actor
    * The outputs are probabilities, assuming the actor is stochastic
    * What is the benefit of using a network instead of a lookup table?
        * A neural network can still produce an output for an input it has never seen before, so it generalizes better.
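As a concrete illustration of the "neural network as actor" idea above, here is a minimal sketch, assuming PyTorch and made-up dimensions (the lecture does not prescribe a framework, network size, or the `PolicyNetwork` name): the observation vector goes in, each output neuron corresponds to one action, and the softmax output is treated as a distribution from which the stochastic actor samples.

```python
import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    """A stochastic actor: observation in, one probability per action out."""
    def __init__(self, obs_dim, n_actions, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, obs):
        # Softmax turns the output neurons into action probabilities.
        return torch.softmax(self.net(obs), dim=-1)

# Illustrative usage with made-up sizes (e.g. actions: left / right / fire).
actor = PolicyNetwork(obs_dim=4, n_actions=3)
obs = torch.randn(4)                          # stand-in for a real observation
probs = actor(obs)
action = torch.distributions.Categorical(probs).sample()  # stochastic choice
print(probs.tolist(), action.item())
```

Unlike a lookup table, the same network produces a distribution even for observations it has never seen, which is the generalization benefit mentioned above.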
* Goodness of Actor
    * Review: supervised learning

    ![](https://i.imgur.com/BaYFOZl.png)

    * Given an actor
        * The actions the machine takes may be different every time
        * Every game may also be different
        * So $R_\theta$ is different every time the game is played
        * So what we want is not a single $R_\theta$, but to maximize the expected value of $R_\theta$
    * An episode is considered as a trajectory $\tau$
        * $\tau = \{s_1, a_1, r_1, s_2, a_2, r_2, \dots, s_T, a_T, r_T\}$
        * $R(\tau) = \sum_{t=1}^T r_t$
    * If you use an actor to play the game, each $\tau$ has a probability of being sampled
        * The probability depends on the actor parameter $\theta$: $P(\tau|\theta)$

$$\bar{R}_\theta = \sum_\tau R(\tau)P(\tau|\theta) \approx \frac{1}{N}\sum_{n=1}^N R(\tau^n)$$

* Each $\tau$ has a probability and a total reward; multiply them and sum over all possible $\tau$ of the game to obtain the expected reward of this actor.
* Use $\pi_\theta$ to play the game N times, obtaining $\{\tau^1, \tau^2, \dots, \tau^N\}$
    * Letting the actor play N games is like collecting N training examples.
    * Sampling $\tau$ from $P(\tau|\theta)$ N times approximates the sum over all possible trajectories.

![](https://i.imgur.com/4Ys9poM.png)
![](https://i.imgur.com/PETqyCb.png)

* Pick the best function
    * Gradient ascent
        * $\theta^\star = \arg\max_\theta \bar{R}_\theta$, where $\bar{R}_\theta = \sum_\tau R(\tau)P(\tau|\theta)$
    * Gradient ascent
        * Start with $\theta^0$
        * $\theta^1 \leftarrow \theta^0 + \eta\nabla\bar{R}_{\theta^0}$
        * $\theta^2 \leftarrow \theta^1 + \eta\nabla\bar{R}_{\theta^1}$
        * ......
    * $\theta = \{w_1, w_2, \dots, b_1, \dots\}$

$$\nabla\bar{R}_\theta = \begin{pmatrix} \partial\bar{R}_\theta / \partial w_1 \\ \partial\bar{R}_\theta / \partial w_2 \\ \vdots \\ \partial\bar{R}_\theta / \partial b_1 \\ \vdots \end{pmatrix}$$

* $\bar{R}_\theta = \sum_\tau R(\tau)P(\tau|\theta) \qquad \nabla\bar{R}_\theta = ?$
* $\nabla\bar{R}_\theta = \sum_\tau R(\tau)\nabla P(\tau|\theta) = \sum_\tau R(\tau)P(\tau|\theta)\frac{\nabla P(\tau|\theta)}{P(\tau|\theta)} = \sum_\tau R(\tau)P(\tau|\theta)\nabla\log P(\tau|\theta) \approx \frac{1}{N}\sum_{n=1}^N R(\tau^n)\nabla\log P(\tau^n|\theta)$
* $R(\tau)$ does not have to be differentiable; it can even be a black box.

![](https://i.imgur.com/TJSDKhY.png)
![](https://i.imgur.com/KmNzJm7.png)
![](https://i.imgur.com/qNKvevB.png)

* If, when playing the game, we saw $s_t^n$ and took $a_t^n$, and $R(\tau^n)$ is positive, then we want to adjust the parameters so that the probability of taking that action increases.
* Conversely, if the reward is negative, we decrease the probability of taking that action.
* What we weight by is the reward of the whole episode, not the reward of a single step; otherwise the agent would just stay in place and fire.

![](https://i.imgur.com/zFULL5Q.png)

* Why take the log?

![](https://i.imgur.com/vx5j9Sh.png)

* The update sums the reward over all sampled actions, so actions that are sampled more often accumulate a larger total. Taking the log normalizes for this: frequently sampled actions are divided by a larger value, so the update does not favour actions just because they were sampled more often.
* Trajectories that are never sampled end up with lower probability, so we introduce a baseline: we want $R(\tau)$ to have both signs rather than always being positive, so that only a $\tau$ whose reward exceeds the baseline gets its probability increased; this way, good trajectories that simply were not sampled are not unfairly suppressed. (A minimal code sketch of this update appears at the end of these notes.)

![](https://i.imgur.com/gvrWWEV.png)

* Critic
    * A critic does not determine the action
    * Given an actor, it evaluates how good the actor is
    * An actor can be found from a critic
        * e.g. Q-learning
* Three kinds of critics
    * A critic is a function depending on the actor $\pi$ it evaluates
    * The function is represented by a neural network
    * State value function $V^\pi(s)$
        * When using actor $\pi$, the cumulative reward expected to be obtained after seeing observation (state) $s$
        * When we see an observation, we feed it into the critic and it outputs a value telling us how good that state is.
* Demo of A3C
    * https://www.youtube.com/watch?v=0xo1Ldx3L5Q
* 王立祥, Statistical Machine Learning Theory
* 余天力, Introduction to Machine Learning

###### tags: `mllearning2020`
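To tie the policy-gradient derivation above together, here is a minimal REINFORCE-style sketch, again assuming PyTorch; the environment object and its `reset`/`step` interface, the network sizes, and the fixed baseline are all illustrative assumptions rather than the lecture's own implementation. Each sampled episode contributes $(R(\tau^n) - b)\sum_t \nabla\log p(a_t^n|s_t^n,\theta)$, and gradient ascent on $\bar{R}_\theta$ is done as gradient descent on its negative.

```python
import torch
import torch.nn as nn

# Stochastic actor as in the earlier sketch (made-up sizes).
actor = nn.Sequential(nn.Linear(4, 128), nn.ReLU(), nn.Linear(128, 3))
optimizer = torch.optim.Adam(actor.parameters(), lr=1e-3)

def play_one_episode(env):
    """Roll out one episode; collect log-probs of the chosen actions and R(tau)."""
    log_probs, total_reward = [], 0.0
    obs, done = env.reset(), False        # assumed interface: reset() -> observation
    while not done:
        probs = torch.softmax(actor(torch.as_tensor(obs, dtype=torch.float32)), dim=-1)
        dist = torch.distributions.Categorical(probs)
        action = dist.sample()            # sample an action, do not take the argmax
        log_probs.append(dist.log_prob(action))
        obs, reward, done = env.step(action.item())  # assumed interface
        total_reward += reward
    return torch.stack(log_probs), total_reward

def policy_gradient_update(episodes, baseline=0.0):
    """Gradient ascent on (1/N) sum_n (R(tau^n) - b) * sum_t log p(a_t^n | s_t^n, theta)."""
    loss = torch.zeros(())
    for log_probs, episode_reward in episodes:
        # Weight the whole episode's log-probs by its total reward minus the baseline.
        loss = loss - (episode_reward - baseline) * log_probs.sum()
    loss = loss / len(episodes)
    optimizer.zero_grad()
    loss.backward()                       # the minus sign above turns ascent into descent
    optimizer.step()
```

In practice the fixed baseline here is often replaced by a learned state value, i.e. the critic $V^\pi(s)$ introduced in the last section, which leads toward actor-critic methods such as the A3C demo linked above.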