**An Outsider’s Tour of Reinforcement Learning (Part2)**

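[TOC]

## **[Total Control](http://www.argmin.net/2018/02/01/control-tour/)**

Reinforcement learning operates at the intersection of two areas:

:::info
1. Machine learning
2. Control
:::

Yet we understand very little about how these two fields combine, and the two disciplines also differ in the methods they use.

:::success
Control is the theory of designing complex actions from well-specified models, while machine learning makes intricate, model-free predictions from data alone.
:::

### **Control**

At the core of control theory is a dynamical system with inputs and outputs. Systems have internal state which reacts to current conditions and the inputs, and the outputs are some function of the state and the input.

Let's take a simple example:

![](https://i.imgur.com/twQPosf.png)

Following Newton's laws, **F = ma**:

- Acceleration is proportional to (total applied force minus gravity) and inversely proportional to the vehicle's mass.
- Velocity equals the previous velocity plus the acceleration.
- Position equals the previous position plus the velocity.

**Inputs**: the force generated by the spinning propellers, and gravity

**State**: (position, velocity)

From these equations, we can compute the set of forces needed to lift the vehicle to a target altitude.

![](https://i.imgur.com/owsnPln.png)

The function $f$ gives the next state from the current state, the input, and an error signal. $e_t$ can be arbitrary noise in the system or error in the model.

We assume that at every time step we receive some reward for our current $x_t$ and $u_t$, and we want to maximize this reward.

![](https://i.imgur.com/rC3HRWn.png)

This can be cast as an optimal control problem and solved with your favorite method. In fact, many control problems are solved in exactly this way.

The function $f$ can also be described by a **[Markov Decision Process (MDP)](https://zh.wikipedia.org/wiki/%E9%A6%AC%E5%8F%AF%E5%A4%AB%E6%B1%BA%E7%AD%96%E9%81%8E%E7%A8%8B)**. Here $x_t$ takes discrete values, $u_t$ is a discrete control action, and $x_t$ and $u_t$ together determine the probability distribution of $x_{t+1}$. In an MDP, everything can be written as a probability table.

But what if we do not know $f$? In the example above, for instance, we may not know the force produced by the spinning propellers.

:::info
Test the system under different inputs, build a dynamics model from the results, and then use that model in the optimal control problem.
:::

For more complex systems, it can be hard even to write down a parametric model.

:::success
Ignore the model and simply increase the reward based on the measured states $x_t$.
:::

This approach matches the "prescriptive analytics" character of reinforcement learning mentioned earlier. However, it disregards important time-dependent information and asks you to ignore Enlightenment-era physics in your control design.

### **Summary**

How well must we understand and model a dynamical system in order to control it optimally? What is the best way to probe a system and achieve high-quality control with as few interventions as possible?

These are the core questions of reinforcement learning, yet the field's results say little about how many samples are needed or which methods are more or less efficient.

:::info
Compare the strengths and weaknesses of the various optimal control methods.
:::

Reference: http://www.argmin.net/2018/02/01/control-tour/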
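To make the "test the system, build a model, then control" idea from the Control section more concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the method of the referenced post. The vertical-flight model follows the F = ma description above. The propeller `gain` (force per unit of command), the time step `dt`, the Gaussian noise, and the function names `step`, `rollout`, `identify_gain`, and `hover_input` are all hypothetical choices made for this example.

```python
import numpy as np

G = 9.81  # gravitational acceleration in m/s^2


def step(state, u, mass=1.0, gain=1.0, dt=0.1, noise=0.0):
    """One step of x_{t+1} = f(x_t, u_t, e_t) for vertical flight.

    state = (position, velocity); u is the propeller command.
    `gain` (force per unit command) and `dt` are assumptions of this sketch.
    """
    z, v = state
    accel = (gain * u - mass * G) / mass + noise  # F = ma, plus disturbance e_t
    return np.array([z + dt * v, v + dt * accel])


def rollout(inputs, true_gain, mass=1.0, dt=0.1, noise_std=0.0, seed=0):
    """Apply a sequence of commands from rest and record every state."""
    rng = np.random.default_rng(seed)
    states = [np.zeros(2)]
    for u in inputs:
        e = noise_std * rng.standard_normal()
        states.append(step(states[-1], u, mass, true_gain, dt, e))
    return np.array(states)


def identify_gain(inputs, states, mass=1.0, dt=0.1):
    """Least-squares estimate of the unknown propeller gain from data."""
    v = states[:, 1]
    accel = (v[1:] - v[:-1]) / dt   # measured acceleration at each step
    force = mass * (accel + G)      # equals gain * u plus noise
    return float(inputs @ force / (inputs @ inputs))


def hover_input(gain_estimate, mass=1.0):
    """Command that cancels gravity according to the identified model."""
    return mass * G / gain_estimate


# Probe the system with random commands, fit the model, then use it.
probe = np.random.default_rng(1).uniform(5.0, 15.0, size=200)
data = rollout(probe, true_gain=2.5, noise_std=0.5)
gain_hat = identify_gain(probe, data)
print("estimated gain:", gain_hat, "hover command:", hover_input(gain_hat))
```

The estimate improves as more probing inputs are collected, which is one simple way to see the sample-complexity question raised in the summary.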