# [ML Lecture 23-1: Deep Reinforcement Learning](https://www.youtube.com/watch?v=W8XF3ME8G2I&feature=youtu.be)
## Reinforcement Learning
Scratching the surface
DeepMind published a paper in Nature on using RL to play Atari games, and the agent could beat human players. In spring 2016 AlphaGo also beat human players, and David Silver later said AI = RL + DL.
#### Scenario of Reinforcement Learning
The state is the state of the environment, i.e., what the machine observes; here the state is the observation.
An action changes the environment.


The machine learns to take actions that maximize the reward.

The difficult part is that the reward is often sparse:
while playing, the reward at most steps is 0;
the reward only comes after the whole game is finished.
---
### Learning to play Go
* Supervised: learning from a teacher
* Reinforcement Learning: Learning from experience
* First move -> ... many moves ... -> win
* (two agents play with each other)
* AlphaGo is supervised learning + reinforcement learning
---
### Learning a chat-bot


In Go, good and bad are easy to tell apart: winning is good, losing is bad.
But for a chat-bot, it is hard to define whether a response is good or bad.
---

More applications
* Flying Helicopter
* Driving
* Google Cuts its Giant Electricity Bill With DeepMind-Powered AI
* Text generation
---
### Example: Playing Video Game
* Widely studied:
* Gym: https://gym.openai.com/
* Universe: https://openai.com/blog/universe/
* :robot_face: Machine learns to play video games as human players
* What machine observes is pixels
* Machine learns to take proper action itself

* The machine learns to maximize the reward it gets in each episode (a rollout sketch follows below)
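A minimal sketch of this interaction loop, assuming the classic OpenAI Gym API (pre-0.26 `reset`/`step` signatures) and using `CartPole-v0` as a stand-in environment; the random actor is only a placeholder:

```python
import gym

# One episode: observe -> act -> receive reward, accumulating the episode reward.
env = gym.make("CartPole-v0")    # stand-in for an Atari game such as Space Invaders

observation = env.reset()        # what the machine observes (pixels in the Atari case)
episode_reward = 0.0
done = False
while not done:
    action = env.action_space.sample()                   # placeholder for the actor
    observation, reward, done, info = env.step(action)   # the action changes the environment
    episode_reward += reward                              # per-step reward (often 0 / sparse)

print("total reward of this episode:", episode_reward)
```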
---
## Difficulties of Reinforcement Learning
* Reward delay
* In space invader, only "fire" obtains reward
* Although the moves before "fire" are important
* In Go playing, it may be better to sacrifice immediate reward to gain more long-term reward
* Agent's actions affect the subsequent data it receives
* E.g. Exploration
---
## Two main categories of RL
1. Value-based
2. Policy-based

model-based: predicting what will happen in the future
* To learn deep reinforcement learning ......
* Textbook: Reinforcement Learning: An Introduction
* https://webdocs.cs.ualberta.ca/~sutton/book/the-book.html
* Lectures of David Silver
* http://www0.cs.ucl.ac.uk/staff/D.Silver/web/Teaching.html (10 lectures, 1:30 each)
* http://videolectures.net/rldm2015_silver_reinforcement_learning/ (Deep Reinforcement Learning)
* Lectures of John Schulman
* https://youtu.be/aUrX-rP_ss4
## Policy-based Approach
* Learning an Actor
Machine learning ≈ looking for a function

* Three Steps for Deep Learning:
1. Neural Network as Actor
2. Goodness of function
3. Pick the best function
* Neural network as Actor
* Input of the neural network: the observation of the machine, represented as a vector or a matrix
* Output of the neural network: each action corresponds to a neuron in the output layer
* The NN itself is the actor
* The outputs are probabilities, since we assume the actor is stochastic
* What is the benefit of using a network instead of a lookup table?
* A neural network can still produce an output for observations it has never seen, so it generalizes better than a lookup table (see the sketch below)
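A minimal sketch of such a neural-network actor, assuming PyTorch and hypothetical sizes (a 4-dimensional observation vector and 3 actions, e.g. left / right / fire); the softmax output makes the actor stochastic:

```python
import torch
import torch.nn as nn

OBS_DIM, N_ACTIONS = 4, 3   # hypothetical observation size and action count

# The actor: observation in, one probability per action out.
actor = nn.Sequential(
    nn.Linear(OBS_DIM, 32),
    nn.ReLU(),
    nn.Linear(32, N_ACTIONS),
    nn.Softmax(dim=-1),      # stochastic actor: a distribution over actions
)

observation = torch.randn(OBS_DIM)           # even a never-seen observation still gives an output
action_probs = actor(observation)            # e.g. tensor([0.7, 0.2, 0.1])
action = torch.multinomial(action_probs, 1)  # sample an action from the distribution
```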
* Goodness of Actor
* Review: supervised learning
* Given an actor
* The actions the actor takes may be different each time (the actor is stochastic)
* The game itself may also be different each time (the environment is random)
* So $R_\theta$ is different every time we play
* So what we want to maximize is not a single $R_\theta$ but the expected value of $R_\theta$
* An episode is considered as a trajectory $\tau$
* $\tau = \{s_1,a_1,r_1,s_2,a_2,r_2,...,s_T,a_T,r_T\}$
* $R(\tau) = \sum_{t=1}^T r_t$
* If you use an actor to play the game, each $\tau$ has a probability to be sampled
* The probability depends on actor parameter $\theta: P(\tau|\theta)$
$\bar{R}_\theta = \sum_\tau R(\tau)P(\tau|\theta) \approx {1 \over N} \sum_{n=1}^N R(\tau^n)$
* Each $\tau$ has a probability of occurring and a total reward; multiplying them and summing over all possible $\tau$ of the game gives the expected reward of this actor.
* Use $\pi_\theta$ to play the game N times, obtaining $\{\tau^1,\tau^2,...,\tau^N\}$
* Letting the actor play N games gives N episodes that act like N training examples.
* Sampling $\tau$ from $P(\tau|\theta)$ N times replaces the intractable sum over all possible trajectories (a sampling sketch follows below).
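A small sketch of this sampling approximation; `play_one_episode` is a hypothetical helper that runs the current actor $\pi_\theta$ for one episode and returns the total reward $R(\tau)$:

```python
def estimate_expected_reward(play_one_episode, n_episodes=100):
    """Approximate the expected reward by sampling N trajectories.

    Each call to play_one_episode() samples one trajectory tau^n from
    P(tau|theta) and returns its total reward R(tau^n).
    """
    total = 0.0
    for _ in range(n_episodes):
        total += play_one_episode()   # R(tau^n)
    return total / n_episodes         # (1/N) * sum_n R(tau^n) ≈ expected reward
```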


* Pick the best function
* Gradient Ascent
* $\theta^\star = \arg\max_\theta \bar{R}_\theta$
* $\bar R_\theta = \sum_\tau R(\tau)P(\tau|\theta)$
* Gradient ascent
* start with $\theta^0$
* $\theta^1 \leftarrow \theta^0 + \eta\nabla\bar{R}_{\theta^0}$
* $\theta^2 \leftarrow \theta^1 + \eta\nabla\bar{R}_{\theta^1}$
* ......
* $\theta = \{w_1,w_2,...,b_1,...\}$
* $\nabla\bar{R}_\theta = \begin{pmatrix} \partial\bar{R}_\theta/\partial w_1 \\ \partial\bar{R}_\theta/\partial w_2 \\ \vdots \\ \partial\bar{R}_\theta/\partial b_1 \\ \vdots \end{pmatrix}$
* $\bar{R}_\theta = \sum_\tau R(\tau)P(\tau|\theta) \ \ \ \nabla\bar{R}_\theta = ?$
* $\nabla\bar{R}_\theta = \sum_\tau R(\tau)\nabla P(\tau|\theta) = \sum_\tau R(\tau)P(\tau|\theta)\frac{\nabla P(\tau|\theta)}{P(\tau|\theta)} = \sum_\tau R(\tau)P(\tau|\theta)\nabla\log P(\tau|\theta) \approx \frac{1}{N}\sum_{n=1}^N R(\tau^n)\nabla\log P(\tau^n|\theta)$
* Since $\nabla\log P(\tau|\theta) = \sum_{t=1}^T \nabla\log p(a_t|s_t,\theta)$, the estimate becomes $\nabla\bar{R}_\theta \approx \frac{1}{N}\sum_{n=1}^N \sum_{t=1}^{T_n} R(\tau^n)\nabla\log p(a_t^n|s_t^n,\theta)$
* $R(\tau)$ does not have to be differentiable; it can even be a black box (a code sketch follows below)
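A compact sketch of one gradient-ascent step based on this estimate, assuming PyTorch and an `actor` network like the one sketched earlier (the way trajectories are batched here is a simplifying assumption); note that $R(\tau)$ only appears as a scalar weight, so it never needs to be differentiated:

```python
import torch

def policy_gradient_step(actor, optimizer, trajectories):
    """One update of theta from N sampled trajectories.

    trajectories: list of (observations, actions, total_reward), where
    observations is a (T, OBS_DIM) float tensor, actions a (T,) long tensor
    of chosen action indices, and total_reward the scalar R(tau^n).
    """
    loss = 0.0
    for observations, actions, total_reward in trajectories:
        probs = actor(observations)                                       # (T, N_ACTIONS)
        log_probs = torch.log(probs.gather(1, actions.unsqueeze(1))).squeeze(1)
        # Maximize R(tau^n) * sum_t log p(a_t | s_t, theta) -> minimize its negative.
        loss = loss - total_reward * log_probs.sum()
    loss = loss / len(trajectories)   # average over the N trajectories
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                  # theta <- theta + eta * gradient (handled by the optimizer)
```

Here `optimizer` would be something like `torch.optim.Adam(actor.parameters(), lr=1e-3)`.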
* While playing the game, if we saw $s_t^n$, took $a_t^n$, and $R(\tau^n)$ turned out to be positive, we want to tune the parameters so that the probability of taking that action increases.
* Conversely, if $R(\tau^n)$ is negative, we decrease the probability of taking that action.
* The weight is the reward of the whole episode, not the single-step reward; otherwise the agent would just stay in place and fire.
* Why take the log?
* Because we sum over all sampled actions: an action that appears more often contributes a larger total, so we need a normalization. Dividing by the probability (which is exactly what the log gradient does, since $\nabla\log p = \nabla p / p$) divides frequently sampled actions by a larger value, so we do not end up favouring actions just because they appear often.
* Trajectories that are never sampled effectively have their probability pushed down, so we add a baseline $b$ so that $R(\tau)$ is not always positive. With both positive and negative weights, only trajectories whose reward exceeds the baseline have their probability increased, and good actions that happen not to be sampled are not unfairly suppressed (a sketch follows below).
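A tiny sketch of the baseline idea, using the mean of the sampled rewards as the baseline $b$ (one simple choice, not the only one):

```python
def reward_weights_with_baseline(total_rewards):
    """Turn each R(tau^n) into (R(tau^n) - b) with b = mean sampled reward.

    Above-average trajectories get a positive weight (probability pushed up),
    below-average ones get a negative weight (probability pushed down).
    """
    baseline = sum(total_rewards) / len(total_rewards)
    return [r - baseline for r in total_rewards]

# Example: rewards 3, 5, 10 -> weights -3, -1, +4
print(reward_weights_with_baseline([3.0, 5.0, 10.0]))
```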
* Critic
* A critic does not determine the action
* Given an actor, it evaluates how good the actor is
* An actor can be found from a critic
* e.g. Q-learning
* Three kinds of Critics
* A critic is a function that depends on the actor $\pi$ being evaluated
* The function is represented by a neural network
* State value function $V^\pi(s)$
* When using actor $\pi$, the cumulated reward expected to be obtained after seeing observation (state) $s$
* When we see an observation, we feed it into this function, and it outputs a value telling us how good that state is under the current actor (a sketch follows below).
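A minimal sketch of such a state value function as a neural network, again assuming PyTorch and the hypothetical 4-dimensional observation used in the actor sketch; unlike the actor, it outputs a single scalar:

```python
import torch
import torch.nn as nn

OBS_DIM = 4   # same hypothetical observation size as in the actor sketch

# Critic V^pi(s): state in, one scalar "how good is this state under pi" out.
critic = nn.Sequential(
    nn.Linear(OBS_DIM, 32),
    nn.ReLU(),
    nn.Linear(32, 1),        # scalar value, no softmax
)

state = torch.randn(OBS_DIM)
value = critic(state)        # estimated cumulated reward after seeing this state
print(value.item())
```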
* Demo of A3C
* https://www.youtube.com/watch?v=0xo1Ldx3L5Q
* 王立祥, Statistical Machine Learning Theory
* 余天力, Introduction to Machine Learning
###### tags: `mllearning2020`