# Extra Definitions People Might Want for Midterm II
### CS181
### Spring 2023

## Reinforcement Learning
**Definition 1: (Agent's Representation of the World)** A ***Markov Decision Process (MDP)*** is defined by a set of 5 parameters, $\{S, A, T, R, \gamma\}$:
1. The state space, $S$
2. The action space, $A$
3. The dynamics or *transition function*, $T$, which is a function $T: S\times A \times S \to [0,1]$, where $T^a(s, s') = \mathbb{P}[s' | s, a]$ for $a\in A$; we can also write $T(s' | s, a)$ to denote the probability of transitioning to $s'$ by taking action $a$ in state $s$.
When the state space $S$ is finite, we often represent $T$ as a set of matrices, $T^a \in \mathbb{R}^{|S| \times |S|}$, where $T^a_{ij} = \mathbb{P}[s_i | s_j, a]$.
4. The reward function, $R: S\times A \times S \to \mathbb{R}$, which gives the reward the agent receives for transitioning to state $s'$ after starting in state $s$ and taking action $a$.
**Note:** in some texts, the reward function is defined only in terms of the current state and the action, i.e. $R: S\times A \to \mathbb{R}$.
5. The discount factor, $\gamma\in [0,1]$, which describes how the agent prioritizes immediate versus future rewards.
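As a concrete illustration of the five components above, here is a minimal sketch of how a small finite MDP might be stored in code. The 2-state, 2-action MDP below is a hypothetical toy example (not from the course materials), and the sketch indexes transitions as `T[a][s, s_next] = P(s_next | s, a)`, so each row sums to 1 (the transpose of the $T^a_{ij}$ indexing written above).

```python
import numpy as np

# Hypothetical toy MDP with 2 states and 2 actions (illustration only).
S = [0, 1]              # state space
A = [0, 1]              # action space
gamma = 0.9             # discount factor

# One |S| x |S| matrix per action; T[a][s, s_next] = P(s_next | s, a).
T = {
    0: np.array([[0.9, 0.1],    # action 0 mostly keeps the current state
                 [0.1, 0.9]]),
    1: np.array([[0.2, 0.8],    # action 1 mostly switches states
                 [0.8, 0.2]]),
}

# Reward R(s, a, s'): +1 for landing in state 1, 0 otherwise (hypothetical choice).
def R(s, a, s_next):
    return 1.0 if s_next == 1 else 0.0

# Sanity check: every row of every transition matrix is a probability distribution.
for a in A:
    assert np.allclose(T[a].sum(axis=1), 1.0)
```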
**Definition 2: (Agent's Strategy for Action Selection)** A function $\pi$ is called a ***policy*** if it is either deterministic,
$$
\pi: S \to A,
$$
or stochastic,
$$
\pi: S \to [0,1]^{|A|},
$$
where $\pi(s)$ is a distribution over actions. When we want to refer to the *probability of selecting action $a$ at state $s$ under the policy $\pi$*, we write $\pi(a|s)$.
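Continuing the same hypothetical toy MDP, a deterministic policy can be stored as one action index per state, and a stochastic policy as a $|S|\times|A|$ matrix whose row $s$ is the distribution $\pi(\cdot \mid s)$:

```python
import numpy as np

# Deterministic policy: one action index per state, pi(s0) = 1, pi(s1) = 0.
pi_det = np.array([1, 0])

# Stochastic policy: pi_stoch[s, a] = pi(a | s); each row sums to 1.
pi_stoch = np.array([[0.3, 0.7],   # at state 0: P(a=0) = 0.3, P(a=1) = 0.7
                     [0.5, 0.5]])  # at state 1: uniform over the two actions
assert np.allclose(pi_stoch.sum(axis=1), 1.0)
```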
**Definition 3: (Cumulative Reward)** Given a trajectory, a sequence of state-action pairs $\{(Z_0, A_0), (Z_1, A_1), (Z_2, A_2),\dots\}$, we quantify the ***return*** of that trajectory as:
$$
G = R_{0} + \gamma R_{1} + \gamma^2R_{2} + \ldots = \sum_{n=0}^{\infty} \gamma^n R_{n}
$$
where $R_{n}$ is the reward collected at time $n$.
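For example, with hypothetical values $\gamma = 0.9$ and rewards $R_0 = 1$, $R_1 = 0$, $R_2 = 2$, followed by zero reward thereafter,
$$
G = 1 + 0.9 \cdot 0 + 0.9^2 \cdot 2 = 1 + 0 + 1.62 = 2.62.
$$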
**Definition 4: (Value of a Policy)** The ***value function*** $V^\pi: S\to \mathbb{R}$ of a policy $\pi$ in an MDP is the expected return
starting from state $s$ and then following policy $\pi$:
$$
V^\pi(s) = \mathbb{E}_{\pi}[G|Z_0 = s]
$$
The expectation above is taken over all randomly varying quantities in $G$.
**Definition 5: (Action-Value of a Policy)** The ***action-value function***, or $Q$-function, $Q^\pi: S \times A \to \mathbb{R}$, is the expected return starting at state $s$, taking action $a$, and then following policy $\pi$:
$$
Q^\pi(s, a) = \mathbb{E}_\pi[G|Z_0=s, A_0 = a]
$$
The expectation above is taken over all randomly varying quantities in $G$.
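To make these expectations concrete, here is a minimal sketch (an illustration only, assuming the hypothetical toy MDP and stochastic policy from the earlier sketches) that estimates $Q^\pi(s,a)$ by averaging truncated sample returns:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_return(s, a, T, R, pi, gamma, horizon=200):
    """One truncated sample of the return G, starting from (s, a) and then following pi."""
    G, discount = 0.0, 1.0
    for _ in range(horizon):
        s_next = rng.choice(len(T[a]), p=T[a][s])      # s' ~ T(. | s, a)
        G += discount * R(s, a, s_next)
        discount *= gamma
        s = s_next
        a = rng.choice(pi.shape[1], p=pi[s])           # a ~ pi(. | s)
    return G

def mc_q_estimate(s, a, T, R, pi, gamma, n_rollouts=2000):
    """Monte Carlo estimate of Q^pi(s, a): average of sampled (truncated) returns."""
    return np.mean([sample_return(s, a, T, R, pi, gamma) for _ in range(n_rollouts)])

# Example usage with the toy MDP and pi_stoch defined above:
# q_hat = mc_q_estimate(0, 1, T, R, pi_stoch, gamma)
# v_hat = sum(pi_stoch[0, a] * mc_q_estimate(0, a, T, R, pi_stoch, gamma) for a in A)
```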
**Definition 6: (Planning)** We call $\pi^*$ the ***optimal policy*** if
$$
V^{\pi^*}(s) = \max_\pi V^{\pi}(s),\; \text{for all } s\in S.
$$
We call $V^{\pi^*}$ or $V^*$ the ***optimal value function***. We call an MDP ***solved*** when we know the optimal value function. The task of finding the optimal policy or the optimal value function is called ***planning***.
**Definition 7: (Bellman Equations)** The value function and $Q$-function of a policy can be defined via a set of recursive equations:
$$
\begin{aligned}
V^\pi(s) &= \sum_{a\in A}\sum_{s'\in S}\pi(a|s)T(s'|s, a)R(s, a, s') + \gamma \sum_{a\in A} \sum_{s'\in S} \pi(a|s)T(s'|s, a) V^\pi(s')\\
Q^\pi(s, a) &= \sum_{s'\in S}T(s'|s, a)R(s, a, s') + \gamma \sum_{s'\in S} T(s'|s, a) \sum_{a'\in A}\pi(a'|s')Q^\pi(s', a')
\end{aligned}
$$
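Because the Bellman equation for $V^\pi$ is a fixed-point equation, one standard way to compute $V^\pi$ is to apply its right-hand side repeatedly until convergence (iterative policy evaluation). The sketch below is an illustration only, assuming the same toy-MDP conventions as the earlier sketches:

```python
import numpy as np

def policy_evaluation(T, R, pi, gamma, n_states, n_actions, tol=1e-8):
    """Compute V^pi by iterating the Bellman equation until convergence.

    Assumes T[a][s, s'] = P(s' | s, a), R(s, a, s') is the reward function,
    and pi[s, a] = pi(a | s), as in the earlier sketches.
    """
    V = np.zeros(n_states)
    while True:
        V_new = np.zeros(n_states)
        for s in range(n_states):
            for a in range(n_actions):
                for s_next in range(n_states):
                    V_new[s] += pi[s, a] * T[a][s, s_next] * (
                        R(s, a, s_next) + gamma * V[s_next])
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

# Example usage with the toy MDP above:
# V_pi = policy_evaluation(T, R, pi_stoch, gamma, n_states=2, n_actions=2)
```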
**Definition 8: (Bellman Optimality Equations)** The optimal value function and optimal $Q$-function can be similarly defined via a set of recursive equations:
$$
\begin{aligned}
V^*(s) &= \max_{a\in A}\left(\sum_{s'\in S}T(s'|s, a)R(s, a, s') + \gamma \sum_{s'\in S} T(s'|s, a) V^*(s')\right)\\
Q^*(s, a) &= \sum_{s'\in S}T(s'|s, a)R(s, a, s') + \gamma \sum_{s'\in S} T(s'|s, a) \max_{a'\in A}Q^*(s', a')
\end{aligned}
$$
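Turning the Bellman optimality equation for $V^*$ into an update rule gives value iteration, one standard planning algorithm. The following is only an illustrative sketch under the same toy-MDP conventions as above; it is not part of the definitions themselves:

```python
import numpy as np

def value_iteration(T, R, gamma, n_states, n_actions, tol=1e-8):
    """Approximate V* by repeatedly applying the Bellman optimality backup,
    then read off a greedy policy from the resulting Q values."""
    V = np.zeros(n_states)
    while True:
        Q = np.zeros((n_states, n_actions))
        for s in range(n_states):
            for a in range(n_actions):
                for s_next in range(n_states):
                    Q[s, a] += T[a][s, s_next] * (R(s, a, s_next) + gamma * V[s_next])
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)   # approximate V* and a greedy policy
        V = V_new

# Example usage with the toy MDP above:
# V_star, pi_star = value_iteration(T, R, gamma, n_states=2, n_actions=2)
```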