**An Outsider’s Tour of Reinforcement Learning (Part4)**

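# **An Outsider’s Tour of Reinforcement Learning (Part4)**

[TOC]

## [The Linear Quadratic Regulator](http://www.argmin.net/2018/02/08/lqr/)

Optimal control theory studies how to operate a dynamical system at minimum cost. When the system dynamics are described by a set of linear differential equations and the cost is a quadratic functional, the problem is called a linear-quadratic (LQ) problem. The solution to this class of problems is the linear quadratic regulator (LQR).

:::info
1. Define the cost function.
2. Choose the weights, then use a mathematical algorithm to find the settings that minimize the cost function.
:::

The LQR algorithm reduces the effort needed to optimize a controller. We still have to specify the parameters of the objective function and compare the result against the design goals. Controller design is therefore usually **iterative**: we determine the optimal controller in simulation, then adjust the parameters so that the result moves closer to the design goals.

![](https://i.imgur.com/U3CDZVB.png)

![](https://i.imgur.com/GQhzuSw.png)

- **(zt;vt)** : the state of the system
- **ut** : control action
- **et** : random disturbance
- **Rt** : the reward gained at each time step given the state and control action

### First simplification : convexity ###

dynamical system :

![](https://i.imgur.com/QfAh7eD.png)

With few exceptions, keeping the problem convex means the dynamics must be linear, and for many systems we also want them to behave linearly over their operating range.

![](https://i.imgur.com/keUzeEz.png)

Newton’s Laws written in matrix form are :

![](https://i.imgur.com/AuOf634.png)

### Cost function

:::info
- In mathematical optimization, the cost (loss) function is the function to be minimized.
- Cost functions are a recurring theme in reinforcement learning.
:::

![](https://i.imgur.com/C4eFoO3.png)

or

![](https://i.imgur.com/z7agU5m.png)

or

![](https://i.imgur.com/PYatGsb.png)

Which one is best?
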
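We usually have to trade off the controller's output characteristics against ease of computation. Since we are the ones designing the cost function, we should favor costs that are easier to solve. Choosing a suitable quadratic cost simplifies both the analysis and the design of the control problem. Cost functions are therefore an important part of optimal control design. Reinforcement learning researchers call cost design “reward shaping”.

![](https://i.imgur.com/hzdPOpn.png)

![](https://i.imgur.com/wjZstTt.png)

:::info
- **Q , R** are real symmetric matrices with **Q ≧ 0 , R > 0** (Q positive semidefinite, R positive definite). The values of Q and R shape the behavior of the controlled system.
- Tune Q and R to reach the desired goal.
:::

### Solving LQR with backpropagation (sort of) ###

Backpropagation is a common method for training artificial neural networks, used together with an optimization method. It computes the gradient of the cost function with respect to all the weights in the network. This gradient is fed to the optimization method, which uses it to update the weights so as to minimize the cost function.

In optimization, the method of Lagrange multipliers finds the extrema of a multivariable function whose variables are subject to one or more constraints. It converts an optimization problem with n variables and k constraints into the problem of solving a system of equations in n + k variables.

The [Lagrangian](https://en.wikipedia.org/wiki/Lagrange_multiplier) for the LQR problem has the form

![](https://i.imgur.com/YqxwuZh.png)

The gradients of the Lagrangian are given by the expressions

![](https://i.imgur.com/asxFjZT.png)

Find settings for all of the variables to make these gradients vanish:

![](https://i.imgur.com/CQXOMhb.png)

![](https://i.imgur.com/yYBJiZY.png)

![](https://i.imgur.com/1ApuPx7.png)

![](https://i.imgur.com/z87NL9i.png)

![](https://i.imgur.com/7jZ410v.png)

with

![](https://i.imgur.com/ecsykIU.png)

**The final control action is a linear function of the final state.**

![](https://i.imgur.com/MSmaJcJ.png)

![](https://i.imgur.com/wCVSJ97.png)

for some matrix Mt+1

![](https://i.imgur.com/3NP5NVZ.png)

What is this sequence of matrices Mt?

![](https://i.imgur.com/5F3TBH3.png)

We get the formula

![](https://i.imgur.com/okcsiJC.png)

As N tends to infinity,

![](https://i.imgur.com/pwIp6m6.png)

This equation is called the *Discrete Algebraic Riccati Equation*.
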
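
The backward recursion for Mt and its fixed point can be sketched in a few lines of NumPy. This is a minimal sketch, not code from the original post. It assumes the recursion takes the standard form `M_t = Q + A^T M_{t+1} A - A^T M_{t+1} B (R + B^T M_{t+1} B)^{-1} B^T M_{t+1} A` with terminal matrix `S`, and that the control is `u_t = -K_t x_t`. The function names, the unit mass, `Q = diag(1, 0)`, and `R = 1` are illustrative assumptions.

```python
import numpy as np

def riccati_step(M_next, A, B, Q, R):
    """One backward step: map M_{t+1} to (M_t, K_t), where u_t = -K_t x_t."""
    BtM = B.T @ M_next
    # K_t = (R + B^T M B)^{-1} B^T M A, computed without an explicit inverse.
    K = np.linalg.solve(R + BtM @ B, BtM @ A)
    M = Q + A.T @ M_next @ A - A.T @ M_next @ B @ K
    return M, K

def finite_horizon_gains(A, B, Q, R, S, N):
    """Gains K_0..K_N for horizon N, starting from terminal matrix M_{N+1} = S."""
    M = S
    gains = []
    for _ in range(N + 1):
        M, K = riccati_step(M, A, B, Q, R)
        gains.append(K)          # collected as K_N, K_{N-1}, ..., K_0
    return gains[::-1], M        # reorder to K_0..K_N; M is now M_0

def solve_dare_by_iteration(A, B, Q, R, tol=1e-10, max_iter=10_000):
    """Iterate the recursion until M stops changing (the N -> infinity limit)."""
    M = Q.copy()
    for _ in range(max_iter):
        M_new, K = riccati_step(M, A, B, Q, R)
        if np.max(np.abs(M_new - M)) < tol:
            return M_new, K
        M = M_new
    raise RuntimeError("Riccati iteration did not converge")

# Position example (assumed values): state x = (z, v), unit mass.
A = np.array([[1.0, 1.0], [0.0, 1.0]])
B = np.array([[0.0], [1.0]])
Q = np.diag([1.0, 0.0])          # penalize position error only
R = np.array([[1.0]])            # penalize control effort

M_inf, K_inf = solve_dare_by_iteration(A, B, Q, R)
gains, M0 = finite_horizon_gains(A, B, Q, R, S=Q, N=50)

# Simulate the noiseless closed loop from z = -1, v = 0.
x = np.array([-1.0, 0.0])
for _ in range(50):
    u = -K_inf @ x
    x = A @ x + B @ u

print("steady-state gain K:", K_inf)
print("final state:", x)
```

For a long horizon the first finite-horizon gain `gains[0]` should be close to the steady-state gain `K_inf`, which is why the infinite-horizon controller can use a single fixed matrix. Plain fixed-point iteration is used here only because it mirrors the recursion; a dedicated solver such as `scipy.linalg.solve_discrete_are` would be the usual choice in practice.
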
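### LQR for the position problem ###

![](https://i.imgur.com/hzdPOpn.png)

![](https://i.imgur.com/wjZstTt.png)

&emsp;

:::info
- The LQR control action is a weighted combination of the current position and the current velocity. Since velocity is the derivative of position, this is a proportional-derivative (PD) controller.
- If the upward velocity is too high, the controller applies less force. If the aircraft is descending, the controller increases the propeller speed.
- For larger values of R, the aircraft takes longer to reach the desired position, but the total input force is smaller.
:::

&emsp;

&emsp;

![](https://i.imgur.com/TkeshKE.png)![](https://i.imgur.com/CXqVVEP.png)

&emsp;

&emsp;

### Takeaways ###

:::success
- LQR cannot solve every optimal control problem, even when the dynamics are linear.
- The dynamic programming recursion lets us compute the control actions more efficiently.
- Iterative LQR solves Taylor approximations of the optimal control problem.
:::

&emsp;

What happens when we do not know A and B?

&emsp;

&emsp;

&emsp;

Reference : http://www.argmin.net/2018/02/08/lqr/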