---
title: introduction to reinforcement learning
tags: rl
---

# introduction to reinforcement learning!

#### how is reinforcement learning different from other machine learning paradigms?

- there is no supervisor, only a *reward* signal
- feedback is delayed, not instantaneous
- time really matters (sequential, non-i.i.d. data)
- agent's actions affect the subsequent data it receives

### rewards

- a reward $R_t$ is a scalar feedback signal
- indicates how well the agent is doing at step t
- the agent's job is to maximize cumulative reward

reinforcement learning is based on the reward hypothesis

#### reward hypothesis definition

all goals can be described by the maximization of expected cumulative reward

#### sequential decision making

- goal: select actions to maximize total future reward
- actions may have long-term consequences
- reward may be delayed
- it may be better to sacrifice immediate reward to gain more long-term reward

### environment

![](https://i.imgur.com/K7jtlYw.png)

at each step t, the agent:

- executes action $A_t$
- receives observation $O_t$
- receives scalar reward $R_t$

the environment:

- receives action $A_t$
- emits observation $O_{t+1}$
- emits scalar reward $R_{t+1}$

t increments at the environment step

### state

#### history and state

the history is the sequence of observations, actions, rewards

$H_t = O_1, R_1, A_1, ..., A_{t-1}, O_t, R_t$

i.e. all observable variables up to time t, i.e. the sensorimotor stream of a robot or embodied agent

what happens next depends on the history

- the agent selects actions
- the environment selects observations/rewards

**state** is the information used to determine what happens next

formally, state is a function of the history: $S_t = f(H_t)$

#### environment state

the **environment state** $S^e_t$ is the environment's private representation, i.e. whatever data the environment uses to pick the next observation/reward

- the environment state is not usually visible to the agent
- even if $S^e_t$ is visible, it may contain irrelevant information

#### agent state

the **agent state** $S^a_t$ is the agent's internal representation, i.e. whatever information the agent uses to pick the next action, i.e. the information used by reinforcement learning algorithms

- it can be any function of the history: $S^a_t = f(H_t)$

#### information state

an **information state** (**Markov state**) contains all useful information from the history

a state $S_t$ is **Markov** if and only if

$\mathbb{P}[S_{t+1} | S_t] = \mathbb{P}[S_{t+1} | S_1, ..., S_t]$

- the future is independent of the past given the present

$H_{1:t} \rightarrow S_t \rightarrow H_{t+1:\infty}$

- once the state is known, the history may be thrown away
- i.e. the state is a sufficient statistic of the future
- the environment state $S^e_t$ is Markov
- the history $H_t$ is Markov

#### fully observable environments

**full observability**: agent **directly** observes environment state

$O_t = S^a_t = S^e_t$

- agent state = environment state = information state
- formally, this is a **Markov decision process** (MDP)

#### partially observable environments

**partial observability**: agent **indirectly** observes the environment:

- a robot with camera vision isn't told its absolute location
- a trading agent only observes current prices
- a poker playing agent only observes public cards

now agent state ≠ environment state

- formally this is a **partially observable Markov decision process** (POMDP)

the agent must construct its own state representation $S^a_t$, e.g.

- complete history: $S^a_t = H_t$
- beliefs of environment state: $S^a_t = (\mathbb{P}[S^e_t = s^1], ..., \mathbb{P}[S^e_t = s^n])$
- recurrent neural network: $S^a_t = \sigma(S^a_{t-1} W_s + O_t W_o)$ (sketched below)
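the recurrent option can be made concrete in a few lines. this is only a minimal numpy sketch of the update $S^a_t = \sigma(S^a_{t-1} W_s + O_t W_o)$: the dimensions, the random weights, and the `update_agent_state` helper are all illustrative assumptions, not anything from the notes above.

```python
# minimal sketch of the recurrent agent-state update S^a_t = sigma(S^a_{t-1} W_s + O_t W_o)
# dimensions, weights, and helper names are illustrative assumptions
import numpy as np

rng = np.random.default_rng(0)
obs_dim, state_dim = 4, 8                              # hypothetical sizes
W_s = 0.1 * rng.normal(size=(state_dim, state_dim))    # recurrent weights
W_o = 0.1 * rng.normal(size=(obs_dim, state_dim))      # observation weights

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def update_agent_state(s_prev, o_t):
    """Fold the newest observation into the agent's internal state."""
    return sigmoid(s_prev @ W_s + o_t @ W_o)

# roll the agent state forward over a stream of observations
s = np.zeros(state_dim)
for _ in range(5):
    o = rng.normal(size=obs_dim)   # stand-in for O_t from the environment
    s = update_agent_state(s, o)
```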
### major components of an RL agent

an RL agent may include one or more of these components

- policy: agent's behavior function
- value function: how good is each state and/or action
- model: agent's representation of the environment

#### policy

a policy is the agent's behavior, a map from state to action

- deterministic policy: $a = \pi(s)$
- stochastic policy: $\pi(a | s) = \mathbb{P}[A_t = a | S_t = s]$

#### value function

a value function is a prediction of future reward, used to evaluate the goodness/badness of states and therefore to select between actions, e.g.

$v_\pi(s) = \mathbb{E}_\pi[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + ... | S_t = s]$

#### model

a model predicts what the environment will do next: $P$ predicts the next state, $R$ predicts the next (immediate) reward, e.g.

$P^a_{ss'} = \mathbb{P}[S_{t+1} = s' | S_t = s, A_t = a]$

$R^a_s = \mathbb{E}[R_{t+1} | S_t = s, A_t = a]$

### categorizing RL agents

value based
- no policy (implicit)
- value function

policy based
- policy
- no value function

actor critic
- policy
- value function

model free
- policy and/or value function
- no model

model based
- policy and/or value function
- model

### problems within RL

two fundamental problems in sequential decision making

- reinforcement learning
  - the environment is initially unknown
  - the agent interacts with the environment
  - the agent improves the policy
- planning
  - a model of the environment is known
  - the agent performs computations with its model (without any external interaction)
  - the agent improves its policy
  - aka deliberation, reasoning, introspection, pondering, thought, search

exploration and exploitation

- reinforcement learning is like trial-and-error learning
- the agent should discover a good policy from its experiences of the environment without losing too much reward along the way
- exploration finds more information about the environment
- exploitation exploits known information to maximize reward
- it is usually important to explore as well as exploit (see the sketch at the end of this section)

prediction and control

- prediction: evaluate the future (given a policy)
- control: optimize the future (find the best policy)
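to make the exploration/exploitation trade-off a bit more tangible, here is a minimal ε-greedy action-selection sketch. the tabular `Q` estimates, the `epsilon_greedy` helper, and the toy numbers are assumptions for illustration only, not a specific algorithm from these notes.

```python
# minimal epsilon-greedy sketch: explore with probability epsilon, otherwise
# exploit the current action-value estimates; the Q-table is a toy assumption
import random

def epsilon_greedy(Q, state, n_actions, epsilon=0.1):
    if random.random() < epsilon:
        return random.randrange(n_actions)                        # explore
    return max(range(n_actions), key=lambda a: Q[(state, a)])     # exploit

# toy usage: two actions, action 1 currently looks better in state 0
Q = {(s, a): 0.0 for s in range(3) for a in range(2)}
Q[(0, 1)] = 1.0
action = epsilon_greedy(Q, state=0, n_actions=2)
print(action)  # usually 1 (exploit), occasionally 0 (explore)
```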