---
title: introduction to reinforcement learning
tags: rl
---
# introduction to reinforcement learning!
#### how is reinforcement learning different from other machine learning paradigms?
- there is no supervisor, only a *reward* signal
- feedback is delayed, not instantaneous
- time really matters (sequential, non-i.i.d. data)
- agent's actions affect the subsequent data it receives
### rewards
- a reward $R_t$ is a scalar feedback signal
- indicates how well the agent is doing at step $t$
- the agent's job is to maximize cumulative reward
reinforcement learning is based on the reward hypothesis
#### reward hypothesis definition
all goals can be described by the maximization of expected cumulative reward
#### sequential decision making
- goal: select actions to maximize total future reward
- actions may have long term consequences
- reward may be delayed
- it may be better to sacrifice immediate reward to gain more long-term reward
### environment

at each step t, the agent:
- executes action $A_t$
- receives observation $O_t$
- receives scalar reward $R_t$
the environment:
- receives action $A_t$
- emits observation $O_{t+1}$
- emits scalar reward $R_{t + 1}$
$t$ increments at the environment step; a minimal interaction loop is sketched below
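a minimal python sketch of this agent–environment loop; the toy `Environment` class, its `step` method, and the random action choice are made-up assumptions for illustration, not any particular library's API:

```python
import random

class Environment:
    """toy environment: reward +1 for action 1, 0 otherwise; episode lasts 10 steps"""
    def __init__(self):
        self.t = 0

    def step(self, action):
        # receives A_t, emits O_{t+1} and R_{t+1}
        self.t += 1                       # t increments at the environment step
        observation = self.t              # toy observation
        reward = 1.0 if action == 1 else 0.0
        done = self.t >= 10
        return observation, reward, done

env = Environment()
observation, total_reward, done = 0, 0.0, False
while not done:
    action = random.choice([0, 1])                # agent executes A_t
    observation, reward, done = env.step(action)  # receives O_{t+1} and R_{t+1}
    total_reward += reward                        # cumulative reward the agent tries to maximize

print(total_reward)
```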
### state
#### history and state
the history is the sequence of observations, actions, rewards
$H_t = O_1, R_1, A_1, \ldots, A_{t-1}, O_t, R_t$
i.e. all observable variables up to time $t$
i.e. the sensorimotor stream of a robot or embodied agent
what happens next depends on the history
- the agent selects actions
- the environment selects observations/rewards
**state** is the information used to determine what happens next
formally, state is a function of the history
$S_t = f(H_t)$
#### environment state
the **environment state** $S^e_t$ is the environment's private representation
i.e. whatever data the environment uses to pick the next observation/reward
- the environment state is not usually visible to the agent
- even if $S^e_t$ is visible, it may contain irrelevant information
#### agent state
the **agent state** $S^a_t$ is the agent's internal representation
i.e. whatever information the agent uses to pick the next action
i.e. it is the information used by reinforcement learning algorithms
- it can be any function of history:
$S^a_t = f(H_t)$
#### information state
an **information state** (**Markov state**) contains all useful information from the history
a state $S_t$ is **Markov** if and only if
$\mathbb{P}[S_{t+1} \mid S_t] = \mathbb{P}[S_{t+1} \mid S_1, \ldots, S_t]$
- the future is independent of the past given the present
$H_{1:t} \rightarrow S_t \rightarrow H_{t+1:\infty}$
- once the state is known, the history may be thrown away
- i.e. the state is a sufficient statistic of the future
- the environment state $S^e_t$ is Markov
- the history $H_t$ is Markov
#### fully observable environments
**full observability**: agent **directly** observes environment state
$O_t = S^a_t = S^e_t$
- agent state = environment state = information state
- formally, this is a **Markov decision process** (MDP)
#### partially observable environments
**partial observability**: agent **indirectly** observes environment:
- a robot with camera vision isn't told its absolute location
- a trading agent only observes current prices
- a poker playing agent only observes public cards
now agent state ≠ environment state
- formally, this is a **partially observable Markov decision process** (POMDP)
agent must construct its own state representation $S^a_t$, e.g.
- complete history: $S^a_t = H_t$
- beliefs of environment state: $S^a_t = (\mathbb{P}[S^e_t = s^1], \ldots, \mathbb{P}[S^e_t = s^n])$
- recurrent neural network: $S^a_t = \sigma(S^a_{t-1} W_s + O_t W_o)$, as sketched below
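a minimal numpy sketch of the recurrent agent-state update above; the dimensions, the random weights, and the sigmoid nonlinearity are assumptions for illustration:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

state_dim, obs_dim = 4, 3
rng = np.random.default_rng(0)
W_s = rng.normal(size=(state_dim, state_dim))  # recurrent weights on the previous agent state
W_o = rng.normal(size=(obs_dim, state_dim))    # weights on the current observation

s = np.zeros(state_dim)                        # initial agent state S^a_0
for t in range(5):
    o = rng.normal(size=obs_dim)               # placeholder observation O_t
    s = sigmoid(s @ W_s + o @ W_o)             # S^a_t = sigma(S^a_{t-1} W_s + O_t W_o)

print(s)
```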
### major components of an RL agent
an RL agent may include one or more of these components
- policy: agent's behavior function
- value function: how good is each state and/or action
- model: agent's representation of the environment
#### policy
a policy is the agent's behavior, a map from state to action
deterministic policy: $a = \pi(s)$
stochastic policy: $\pi(a \mid s) = \mathbb{P}[A_t = a \mid S_t = s]$
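a minimal python sketch of the two kinds of policy for a discrete action set; the states, actions, and probabilities are made up for illustration:

```python
import random

ACTIONS = ["left", "right"]

def deterministic_policy(state):
    # a = pi(s): the same state always maps to the same action
    return "right" if state >= 0 else "left"

def stochastic_policy(state):
    # pi(a | s): a probability distribution over actions given the state
    probs = {"left": 0.3, "right": 0.7} if state >= 0 else {"left": 0.7, "right": 0.3}
    return random.choices(ACTIONS, weights=[probs[a] for a in ACTIONS])[0]

print(deterministic_policy(1), stochastic_policy(1))
```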
#### value function
a value function is a prediction of future reward; it is used to evaluate the goodness/badness of states, and therefore to select between actions, e.g.
$v_\pi(s) = \mathbb{E}_\pi[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \ldots \mid S_t = s]$
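a minimal monte-carlo sketch of estimating $v_\pi(s)$ by averaging sampled discounted returns; the reward sampler here is a placeholder (random rewards), standing in for rewards generated by following $\pi$ from $s$:

```python
import random

GAMMA = 0.9

def sample_episode_rewards(start_state, length=20):
    # placeholder for R_{t+1}, R_{t+2}, ... obtained by following pi from start_state
    return [random.random() for _ in range(length)]

def discounted_return(rewards):
    # R_{t+1} + gamma R_{t+2} + gamma^2 R_{t+3} + ...
    return sum(GAMMA ** k * r for k, r in enumerate(rewards))

def estimate_value(state, episodes=1000):
    # v_pi(s) is approximated by the average sampled return starting from s
    returns = [discounted_return(sample_episode_rewards(state)) for _ in range(episodes)]
    return sum(returns) / len(returns)

print(estimate_value(state=0))
```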
#### model
a model predicts what the environment will do next
$P$ predicts the next state, $R$ predicts the next (immediate) reward, e.g.
$P^a_{ss'} = \mathbb{P}[S_{t+1} = s' \mid S_t = s, A_t = a]$
$R^a_s = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a]$
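a minimal tabular sketch of estimating $P^a_{ss'}$ and $R^a_s$ from observed transitions, using relative frequencies and sample means; the list of transitions is made-up data for illustration:

```python
from collections import defaultdict

# observed transitions as (s, a, r, s') tuples — made-up data for illustration
transitions = [(0, "right", 1.0, 1), (0, "right", 0.0, 0), (0, "right", 1.0, 1)]

counts = defaultdict(lambda: defaultdict(int))   # counts[(s, a)][s']
reward_sum = defaultdict(float)                  # sum of rewards per (s, a)
visits = defaultdict(int)                        # number of visits per (s, a)

for s, a, r, s_next in transitions:
    counts[(s, a)][s_next] += 1
    reward_sum[(s, a)] += r
    visits[(s, a)] += 1

# P^a_{ss'} = P[S_{t+1} = s' | S_t = s, A_t = a], estimated by relative frequency
P = {sa: {s_next: n / visits[sa] for s_next, n in nexts.items()}
     for sa, nexts in counts.items()}
# R^a_s = E[R_{t+1} | S_t = s, A_t = a], estimated by the sample mean
R = {sa: reward_sum[sa] / visits[sa] for sa in visits}

print(P[(0, "right")], R[(0, "right")])
```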
### categorizing RL agents
**value based**
- no policy (implicit)
- value function
**policy based**
- policy
- no value function
**actor critic**
- policy
- value function
**model free**
- policy and/or value function
- no model
**model based**
- policy and/or value function
- model
### problems within RL
two fundamental problems in sequential decision making
- reinforcement learning
- the environment is initially unknown
- the agent interacts with the environment
- the agent improves the policy
- planning
- a model of the environment is known
- the agent performs computations with its model (without any external interaction)
- the agent improves its policy
- aka deliberation, reasoning, introspection, pondering, thought, search
exploration and exploitation
- reinforcement learning is like trial-and-error learning
- the agent should discover a good policy from its experiences of the environment without losing too much reward along the way
- exploration finds more information about the environment
- exploitation exploits known information to maximize reward
- it is usually important to explore as well as exploit
prediction and control
- prediction: evaluate the future (given a policy)
- control: optimize the future (find the best policy)
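a minimal sketch of both problems on a tiny MDP whose model is fully known (so this is the planning setting): iterative policy evaluation for prediction and value iteration for control; the two-state MDP and its rewards are a made-up example:

```python
GAMMA = 0.9

# tiny known MDP: P[s][a] is a list of (prob, next_state, reward) outcomes
P = {
    0: {"stay": [(1.0, 0, 0.0)], "move": [(1.0, 1, 1.0)]},
    1: {"stay": [(1.0, 1, 2.0)], "move": [(1.0, 0, 0.0)]},
}

def evaluate(policy, iters=100):
    # prediction: estimate v_pi for a fixed policy
    v = {s: 0.0 for s in P}
    for _ in range(iters):
        v = {s: sum(p * (r + GAMMA * v[s2]) for p, s2, r in P[s][policy[s]]) for s in P}
    return v

def value_iteration(iters=100):
    # control: find the optimal value function (from which the best policy can be read off)
    v = {s: 0.0 for s in P}
    for _ in range(iters):
        v = {s: max(sum(p * (r + GAMMA * v[s2]) for p, s2, r in P[s][a]) for a in P[s])
             for s in P}
    return v

print(evaluate({0: "stay", 1: "stay"}))  # prediction: value of an arbitrary fixed policy
print(value_iteration())                 # control: optimal values
```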