# Extra Definitions People Might Want for Midterm II

### CS181

### Spring 2023

![](https://i.imgur.com/xDR9VQd.png)

## Reinforcement Learning

**Definition 1: (Agent's Representation of the World)** A ***Markov Decision Process (MDP)*** is defined by a set of 5 parameters, $\{S, A, T, R, \gamma\}$:

1. The state space, $S$.
2. The action space, $A$.
3. The dynamics or *transition function*, $T$, which is a function $T: S\times A \times S \to [0,1]$, where $T^a(s, s') = \mathbb{P}[s' | s, a]$ for $a\in A$; we can also write $T[s' | s, a]$ to denote the probability of transitioning to $s'$ by taking action $a$ in state $s$. When the state space $S$ is finite, we often represent $T$ as a set of matrices, $T^a \in \mathbb{R}^{|S| \times |S|}$, where $T^a_{ij} = \mathbb{P}[s_i | s_j, a]$.
4. The reward function, $R: S\times A \times S \to \mathbb{R}$, which defines the reward received by the agent for transitioning to state $s'$ given that the agent starts at state $s$ and performs action $a$. **Note:** in some texts, the reward function is defined only in terms of the current state and the action, i.e. $R: S\times A \to \mathbb{R}$.
5. The discount factor, $\gamma\in [0,1]$, which describes how the agent prioritizes immediate versus future rewards.

**Definition 2: (Agent's Strategy for Action Selection)** A function $\pi$ is called a ***policy*** if
$$
\pi: S \to A
$$
or
$$
\pi: S \to [0,1]^{|A|},
$$
where $\pi(s)$ is a distribution over actions. When we want to reference the *probability of selecting action $a$ at state $s$ under the policy $\pi$*, we write $\pi(a|s)$.

**Definition 3: (Cumulative Reward)** Given a trajectory as a sequence of states and actions, $\{(Z_0, A_0), (Z_1, A_1), (Z_2, A_2),\dots\}$, we quantify the ***return*** for that trajectory as:
$$
G = R_{0} + \gamma R_{1} + \gamma^2 R_{2} + \ldots = \sum_{n=0}^{\infty} \gamma^n R_{n},
$$
where $R_{n}$ is the reward collected at time $n$.

**Definition 4: (Value of a Policy)** The ***value function*** $V^\pi: S\to \mathbb{R}$ of an MDP is the expected return starting from state $s$ and then following policy $\pi$:
$$
V^\pi(s) = \mathbb{E}_{\pi}[G|Z_0 = s]
$$
The expectation above is taken over all randomly varying quantities in $G$.

**Definition 5: (Action-Value of a Policy)** The ***action-value function***, or the $Q$-function, $Q^\pi: S \times A \to \mathbb{R}$, quantifies the value of a policy starting at a state $s$, taking the action $a$, and then following policy $\pi$:
$$
Q^\pi(s, a) = \mathbb{E}_\pi[G|Z_0=s, A_0 = a]
$$
The expectation above is taken over all randomly varying quantities in $G$.

**Definition 6: (Planning)** We call $\pi^*$ the ***optimal policy*** if
$$
V^{\pi^*}(s) = \max_\pi V^{\pi}(s),\; \text{for all } s\in S.
$$
We call $V^{\pi^*}$ or $V^*$ the ***optimal value function***. We call an MDP ***solved*** when we know the optimal value function. The task of finding the optimal policy or the optimal value function is called ***planning***.
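
To make the objects in Definitions 1–3 concrete, here is a minimal Python sketch of a tabular MDP, a stochastic policy, and the discounted return of a sampled trajectory. The 2-state, 2-action toy numbers, the variable names, and the finite rollout horizon are all illustrative assumptions, not part of the course material; storing $T$ as an array indexed by $(a, s, s')$ simply mirrors the set of matrices $T^a$ from Definition 1.

```python
# A toy tabular MDP; all numbers below are made up for illustration.
import numpy as np

rng = np.random.default_rng(0)

n_states, n_actions = 2, 2      # |S| and |A|
gamma = 0.9                     # discount factor

# T[a, s, s'] = P[s' | s, a]; each row T[a, s, :] sums to 1.
T = np.array([
    [[0.8, 0.2],                # action 0
     [0.1, 0.9]],
    [[0.5, 0.5],                # action 1
     [0.6, 0.4]],
])

# R[a, s, s'] = reward for transitioning from s to s' under action a.
R = np.array([
    [[1.0, 0.0],
     [0.0, 2.0]],
    [[0.5, 0.5],
     [1.0, -1.0]],
])

# A stochastic policy: pi[s, a] = pi(a | s).
pi = np.array([
    [0.7, 0.3],
    [0.4, 0.6],
])

def sample_trajectory(s0, horizon=50):
    """Roll out the policy for `horizon` steps and collect the rewards."""
    s, rewards = s0, []
    for _ in range(horizon):
        a = rng.choice(n_actions, p=pi[s])          # a ~ pi(. | s)
        s_next = rng.choice(n_states, p=T[a, s])    # s' ~ T(. | s, a)
        rewards.append(R[a, s, s_next])
        s = s_next
    return rewards

def discounted_return(rewards, gamma):
    """G = sum_n gamma^n R_n (Definition 3), truncated at the rollout horizon."""
    return sum(gamma**n * r for n, r in enumerate(rewards))

print(discounted_return(sample_trajectory(s0=0), gamma))
```

Averaging `discounted_return` over many sampled trajectories from the same start state gives a Monte Carlo estimate of $V^\pi(s_0)$ from Definition 4.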
**Definition 7: (Bellman Equations)** The value function and the $Q$-function can be defined via a set of recursive equations:
$$
\begin{aligned}
V^\pi(s) &= \sum_{a\in A}\sum_{s'\in S}\pi(a|s)T(s'|s, a)R(s, a, s') + \gamma \sum_{a\in A} \sum_{s'\in S} \pi(a|s)T(s'|s, a) V^\pi(s')\\
Q^\pi(s, a) &= \sum_{s'\in S}T(s'|s, a)R(s, a, s') + \gamma \sum_{s'\in S} T(s'|s, a) \sum_{a'\in A}\pi(a'|s')Q^\pi(s', a')
\end{aligned}
$$

**Definition 8: (Bellman Optimality Equations)** The optimal value function and the optimal $Q$-function can similarly be defined via a set of recursive equations:
$$
\begin{aligned}
V^*(s) &= \max_{a\in A}\left(\sum_{s'\in S}T(s'|s, a)R(s, a, s') + \gamma \sum_{s'\in S} T(s'|s, a) V^*(s')\right)\\
Q^*(s, a) &= \sum_{s'\in S}T(s'|s, a)R(s, a, s') + \gamma \sum_{s'\in S} T(s'|s, a) \max_{a'\in A}Q^*(s', a')
\end{aligned}
$$
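
For a finite MDP, the Bellman equations suggest a direct way to compute $V^\pi$ and $V^*$: treat each equation as an update rule and iterate it until the values stop changing. The sketch below is one such illustrative implementation, not the only one (iterative policy evaluation for Definition 7 and value iteration for Definition 8); it assumes the same $(a, s, s')$ array layout as the toy example above, and the function names and tolerance are my own choices rather than notation from the notes.

```python
import numpy as np

def bellman_backup_q(T, R, V, gamma):
    """Q(s, a) = sum_s' T(s'|s,a) [ R(s,a,s') + gamma * V(s') ]."""
    # T and R have shape (|A|, |S|, |S|); V has shape (|S|,).
    return np.einsum('ask,ask->sa', T, R + gamma * V)

def policy_evaluation(T, R, pi, gamma, tol=1e-8):
    """Iterate the Bellman equation (Definition 7) until V^pi converges."""
    V = np.zeros(T.shape[1])
    while True:
        Q = bellman_backup_q(T, R, V, gamma)
        V_new = (pi * Q).sum(axis=1)          # average over a ~ pi(. | s)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

def value_iteration(T, R, gamma, tol=1e-8):
    """Iterate the Bellman optimality equation (Definition 8) to get V* and a greedy pi*."""
    V = np.zeros(T.shape[1])
    while True:
        Q = bellman_backup_q(T, R, V, gamma)
        V_new = Q.max(axis=1)                 # max over actions instead of an average
        if np.max(np.abs(V_new - V)) < tol:
            greedy_policy = Q.argmax(axis=1)  # deterministic pi*(s) = argmax_a Q*(s, a)
            return V_new, greedy_policy
        V = V_new
```

For $\gamma < 1$, both updates are contractions, so each iteration converges to the unique fixed point of its equation: $V^\pi$ for policy evaluation and $V^*$ for value iteration, with the greedy policy extracted from $Q^*$ serving as $\pi^*$.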