# **An Outsider’s Tour of Reinforcement Learning (Part 8)**

[TOC]

## [Updates on Policy Gradients](http://www.argmin.net/2018/03/13/pg-saga/)

How does policy gradient perform when the optimizer is switched from SGD to [Adam](https://nbviewer.jupyter.org/url/argmin.net/code/lqr_policy_comparisons.ipynb)?

[lqrpolsAdam](https://hackmd.io/py43DkisTZOhVuic-GAD8g?view)  [random.seed](https://hackmd.io/apGtHGpuSDqGoJfhdqg99A)

**Finite time horizon:**

![](https://i.imgur.com/cugXm1h.png)

*Policy gradient looks a lot better!*

**Infinite time horizon:**

![](https://i.imgur.com/7jJaEuz.png)

### [RL is the go-to approach for datacenter cooling](https://deepmind.com/blog/deepmind-ai-reduces-google-data-centre-cooling-bill-40/)

*DeepMind AI Reduces Google Data Centre Cooling Bill by 40%:*

:::info
Problem
- The equipment and the environment interact in complex, nonlinear ways.
- Each data centre has a unique architecture and environment, so a model custom-tuned for one system may not transfer to another. A general intelligence framework is therefore needed.

Approach
- Collect the data gathered by thousands of sensors in the data centre (temperatures, power, pump speeds, setpoints, and so on) and use it to train deep neural networks.
- The networks are trained to predict the data centre's temperature and other environmental conditions over the next hour. These predictions are used to simulate the recommended actions in the model.
:::

![](https://i.imgur.com/lN0Ws58.png)

$$x_{t+1} = A x_t + B u_t$$

Each component of the state $x$ is the internal temperature of one of the racks.

![](https://i.imgur.com/BQzYaYm.png)

$$A = \begin{bmatrix} 1.01 & 0.01 & 0\\ 0.01 & 1.01 & 0.01\\ 0 & 0.01 & 1.01 \end{bmatrix}, \qquad B = I, \qquad Q = I, \qquad R = 1000\,I$$

A minimal code sketch of this toy model appears at the end of these notes.

[datacenter_demo](https://nbviewer.jupyter.org/url/argmin.net/code/lqr_fake_datacenter_demo.ipynb)

![](https://i.imgur.com/v5fSN9h.png)

![](https://i.imgur.com/wwzYJj4.png)

**Change the random seed to 1336:**

![](https://i.imgur.com/YzvNo7E.png)

![](https://i.imgur.com/OZawgZq.png)

*That means that we’re still very much in a high variance regime for Policy Gradient.*

:::warning
- We can keep tuning model-free methods, but this approach has fundamental limitations.
- The underlying dynamics are unstable: unless appropriate control is applied, the system will blow up (the servers will catch fire).
:::

Reference: http://www.argmin.net/2018/03/13/pg-saga/
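
---

Below is a minimal sketch of the toy three-rack "datacenter" LQR model described above, not the code from the linked `datacenter_demo` notebook. It sets up the $A$, $B$, $Q$, $R$ matrices from these notes and compares the open-loop (unstable) system against the nominal LQR controller from the discrete Riccati equation. The horizon `T`, the initial state, and the `simulate` helper are illustrative choices, not taken from the post.

```python
import numpy as np
from scipy.linalg import solve_discrete_are

# Toy model: three racks whose internal temperatures couple to their
# neighbours. The open-loop dynamics are unstable (largest eigenvalue
# of A exceeds 1), so without control the state grows without bound.
A = np.array([[1.01, 0.01, 0.00],
              [0.01, 1.01, 0.01],
              [0.00, 0.01, 1.01]])
B = np.eye(3)
Q = np.eye(3)          # state cost
R = 1000 * np.eye(3)   # control effort is heavily penalized

# Nominal LQR gain from the discrete algebraic Riccati equation:
#   u_t = -K x_t,  K = (R + B'PB)^{-1} B'PA
P = solve_discrete_are(A, B, Q, R)
K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)

def simulate(x0, K=None, T=100):
    """Roll out x_{t+1} = A x_t + B u_t, optionally with u = -K x,
    and return the accumulated LQR cost and the final state."""
    x, cost = x0.copy(), 0.0
    for _ in range(T):
        u = -K @ x if K is not None else np.zeros(3)
        cost += x @ Q @ x + u @ R @ u
        x = A @ x + B @ u
    return cost, x

x0 = np.ones(3)
print("open loop:", simulate(x0))      # cost and state keep growing
print("LQR:      ", simulate(x0, K))   # state is driven toward zero
```

The point of the comparison is the one the warning box makes: because the uncontrolled system diverges, any method (model-free or otherwise) has to stabilize it, and high-variance estimates of the policy gradient make that harder than solving the Riccati equation directly when the model is known.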