# Reinforcement Learning Introduction

**Book**: Reinforcement Learning: An Introduction
**Author**: Richard S. Sutton & Andrew G. Barto

###### tags: `study_note` `reinforcement-learning`

:::info
Learning from the environment, learning from interaction, is the foundation of learning and intelligence.
:::

:::info
The approach we explore, called **reinforcement learning**, is much more focused on goal-directed learning from interaction than are other approaches to machine learning.
:::

[TOC]

## Reinforcement Learning

[[Wiki: Dynamical System Theory](https://www.wikiwand.com/en/Dynamical_systems_theory)] [[Wiki: Markov Decision Process](https://www.wikiwand.com/en/Markov_decision_process)] [[Wiki: Optimal Control](https://www.wikiwand.com/en/Optimal_control)]

Reinforcement learning is learning how to map situations to actions so as to maximize a numerical reward signal. In the most interesting and challenging cases, actions may affect not only the immediate reward but also the next situation and, through that, all subsequent rewards. These two characteristics:

- trial-and-error search
- delayed reward

are the two most important distinguishing features of reinforcement learning.

Using ideas from [dynamical systems theory](https://www.wikiwand.com/en/Dynamical_systems_theory), the reinforcement learning problem is formalized as the optimal control of Markov decision processes.

:::info
A dynamical system is a system whose state changes over time and can be written as $S(t)$; contrast this with the input data in (supervised) machine learning, which is assumed to come from a fixed distribution.
:::

:::info
**[Definition](https://www.wikiwand.com/en/Markov_decision_process#/Definition)**
A **Markov decision process** is a 4-tuple $(S, A, P_a, R_a)$, where
- $S$ is a finite set of states,
- $A$ is a finite set of actions (alternatively, $A_s$ is the finite set of actions available from state $s$),
- $P_a(s, s')=\text{Pr}(s_{t+1}=s'\mid s_t=s, a_t=a)$ is the probability that action $a$ in state $s$ at time $t$ will lead to state $s'$ at time $t+1$,
- $R_a(s,s')$ is the immediate reward (or expected immediate reward) received after transitioning from state $s$ to state $s'$ due to action $a$.

(Note: the theory of Markov decision processes does not require $S$ or $A$ to be finite, but the basic algorithms assume that they are.)
:::

Markov decision processes are intended to include just these three aspects—sensation, action, and goal—in their simplest possible forms without trivializing any of them.

### Difference from Supervised Learning

Supervised learning
- learns from a training set of labeled examples provided by an external supervisor,
- where each example is a description of a situation together with a label (the correct action),
- and is not adequate for learning from interaction.

Reinforcement learning
- learns from interaction with the environment, i.e. from immediate and possibly delayed reward;
- in interactive problems it is often impractical to obtain examples of desired behavior that are both correct and representative of all the situations in which the agent has to act (it is difficult to enumerate the correct action for every situation the agent may encounter).

### Difference from Unsupervised Learning

Unsupervised learning
- finds structure hidden in collections of unlabeled data.

Reinforcement learning
- tries to maximize a reward signal instead of trying to find hidden structure, although uncovering structure in an agent's experience can certainly be useful in RL.

### Exploration and Exploitation

[[Zhihu: Exploitation-exploration trade-off](https://zhuanlan.zhihu.com/p/36428707)]

The trade-off between exploration and exploitation is one of the challenges that arises in reinforcement learning and not in other kinds of learning. The agent must

- **exploit** what it has already experienced in order to obtain reward, and
- **explore** in order to make better action selections in the future.
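
A minimal sketch of one common way to handle this trade-off, an ε-greedy selection rule (not something introduced at this point in the book; the `q_estimates` list and the value of `epsilon` are assumptions for illustration):

```python
import random

def epsilon_greedy(q_estimates, epsilon=0.1):
    """Select an action index given estimated action values.

    With probability `epsilon` the agent explores (picks a random action);
    otherwise it exploits its current best estimate.
    """
    if random.random() < epsilon:
        return random.randrange(len(q_estimates))                      # explore
    return max(range(len(q_estimates)), key=lambda a: q_estimates[a])  # exploit

# Example: three actions with estimated values 0.2, 0.5, 0.1
print(epsilon_greedy([0.2, 0.5, 0.1]))   # usually 1, occasionally a random action
```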
### Global View of the Whole Problem

As long as the reward function is defined clearly, RL does not require a human to decompose the overall problem into subproblems in advance.

### Complete, Interactive and Goal-seeking

A complete, interactive, goal-seeking agent can itself be a component of a larger system and then interacts (indirectly) with the larger system's environment; it can be regarded as an independent module. Each such agent has its own goal, and its environment consists of the rest of the larger system together with that system's environment.

### AI Towards Simple General Principles

- "Weak methods": based on general principles, such as search or learning
- "Strong methods": based on specific knowledge

## Examples

- A chess player makes a move. The choice is informed by
    - planning: anticipating the opponent's possible replies and the counter-replies to them
    - judgment: immediate, intuitive assessment of the moves available in the current position
- An adaptive controller adjusts the parameters of a petroleum refinery's operation in real time, optimizing the **yield/cost/quality** trade-off rather than strictly sticking to the set points originally specified by the engineers.
- A newborn calf keeps falling over at first, yet half an hour after birth it can run at 20 miles per hour.
- A mobile robot decides whether to keep entering new rooms to collect trash or to head back to its battery-charging station; it decides whether to push on based on its current charge level and on past experience.

All involve interaction between an active decision-making agent and its environment,
- within which the agent seeks to achieve a goal despite uncertainty about its environment;
- the agent's actions are permitted to affect the future state of the environment, thereby affecting the actions and opportunities available to the agent at later times;
- correct choice requires taking into account indirect, delayed consequences of actions, and thus may require foresight or planning.

## Elements of Reinforcement Learning

[[Wiki: Reinforcement Learning](https://www.wikiwand.com/en/Reinforcement_learning#/Algorithms_for_control_learning)]

![](https://i.imgur.com/1Fu95Rl.png)

### Policy

:::info
The agent's action selection is modeled as a map called the **policy**
$$\pi:A\times S\rightarrow [0, 1],\ \text{where}\ \pi(a, s)=\text{Pr}(a_t=a\mid s_t=s)$$
- A policy is a mapping from perceived states of the environment to actions to be taken when in those states.
- In some cases the policy may be a simple function or **lookup table**, whereas in others it may involve extensive computation such as a search process.
- Policies may be stochastic, specifying probabilities for each action.
:::

### Reward Signal

:::info
The reward signal defines the goal of a reinforcement learning problem. On each time step the environment sends the agent a single number,
$$r_t\in \mathbb{R}.$$
:::

### Value Function

:::info
The value function $V_\pi(s)$ is defined as the *expected return* when starting in state $s$, i.e. $s_0=s$, and following policy $\pi$ thereafter:
$$V_\pi(s)=\mathbb{E}[R\mid s_0=s]=\mathbb{E}\left[\sum_{t=0}^\infty\gamma^t r_t \,\middle|\, s_0=s\right],$$
where the random variable $R$ denotes the **return**, defined as the sum of future discounted rewards
$$R=\sum_{t=0}^\infty \gamma^t r_t,$$
and $\gamma\in[0, 1]$ is the discount rate.
:::

Rewards are given directly by the environment, but values must be estimated and re-estimated from the sequences of observations an agent makes over its entire lifetime. The central role of value estimation is arguably the most important thing that has been learned about reinforcement learning over the last six decades.
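
A small numerical sketch of these definitions, estimating $V_\pi(s)$ as the average of observed discounted returns (the reward sequences and $\gamma$ below are made-up values for illustration):

```python
# Discounted return R = sum_t gamma^t * r_t for a finite reward sequence,
# and a crude value estimate as the mean return over several episodes.
# The rewards and gamma here are made-up numbers, not taken from the book.

def discounted_return(rewards, gamma=0.9):
    # R = sum_t gamma^t * r_t
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Rewards observed in three hypothetical episodes that all start in state s
episodes = [
    [0, 0, 1],
    [0, 1, 0],
    [1, 0, 0],
]

returns = [discounted_return(ep) for ep in episodes]   # approximately [0.81, 0.9, 1.0]
v_estimate = sum(returns) / len(returns)               # crude estimate of V_pi(s)
print(returns, v_estimate)
```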
### Model

A model mimics the behavior of the environment: given a state and an action, the model predicts the resulting next state and next reward.

## Limitations and Scope

[[Types of control (overview)](https://pojenlai.wordpress.com/2014/09/17/%E5%90%84%E7%A8%AE%E6%8E%A7%E5%88%B6%EF%BC%8C%E4%B8%80%E5%80%8Boverview/)]

- **Evolutionary algorithms**: genetic algorithms, genetic programming, simulated annealing, and other optimization methods never estimate value functions. These methods apply a policy that stays fixed throughout a lifetime (one run of the optimization process), much like a biological organ whose function does not change during an individual's life; only in the next generation is the design varied slightly.

## An Extended Example: Tic-Tac-Toe

[[Wiki: Minimax](https://www.wikiwand.com/en/Minimax)]

> This part is as tangled as a plate of spaghetti, so these are just loose notes.

![](https://i.imgur.com/EwFrXYI.png)

Minimax gives the best outcome that can be guaranteed under the worst-case (unknown) play of the opponent; it is a conservative approach.

If, in tic-tac-toe, a draw is treated as just as bad as a loss, the value-function approach in the book keeps a table with one number per board state: states with three X's in a row have value 1, states with three O's in a row or a full board have value 0, and every other state starts at 0.5, an initial estimate of the probability of winning from that state.
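
The book improves these estimates by playing many games and, after each greedy move, shifting the earlier state's value toward the later state's value, $V(s)\leftarrow V(s)+\alpha\,[V(s')-V(s)]$. A minimal sketch of that update, where the string board encoding, the 0.5 default, and the step size $\alpha$ are illustrative assumptions:

```python
# A minimal sketch of the tic-tac-toe value-table update described in the
# book (a temporal-difference update).  The string state encoding, the
# default value of 0.5, and the step size `alpha` are illustrative choices.

alpha = 0.1          # step-size parameter
values = {}          # board state -> estimated probability of winning

def value(state):
    # Unseen, non-terminal states start at 0.5; terminal states are stored
    # explicitly as 1.0 (win) or 0.0 (loss/draw).
    return values.get(state, 0.5)

def td_update(state, next_state):
    # V(s) <- V(s) + alpha * (V(s') - V(s)):
    # move the earlier state's value toward the later state's value.
    values[state] = value(state) + alpha * (value(next_state) - value(state))

# Example: back up the value of a winning terminal position to its predecessor.
values["XXXOO...."] = 1.0                 # terminal state: three X's in the top row
td_update("XX.OO....", "XXXOO....")
print(values["XX.OO...."])                # 0.5 + 0.1 * (1.0 - 0.5) = 0.55
```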