# Markov Decision Processes
A **Markov Decision Process (MDP)** is a mathematical framework used to model decision-making in situations where:
* **Outcomes are partially random:** Your choices influence probabilities but don't guarantee specific results.
* **Decisions are sequential:** The best action now depends on the potential future consequences.
## Formal Definition
A **discrete Markov Decision Process** consists of:
* A countable set of states $\mathcal{S}$ (State Space), including a set of terminal states $\mathcal{T} \subset \mathcal{S}$, and a countable set of actions $\mathcal{A}$ (Action Space).
* A time-indexed sequence of environment-generated pairs of random states $S_t \in \mathcal{S}$ and random rewards $R_t \in \mathcal{D}$ (a countable subset of $\mathbb{R}$), alternating with agent-controllable actions $A_t \in \mathcal{A}$ for time steps $t = 0, 1, 2, \dots$, satisfying the Markov Property: $$\mathbb{P}[(R_{t+1},S_{t+1})|S_t, A_t, S_{t-1}, A_{t-1}, \dots , S_0, A_0] = \mathbb{P}[(R_{t+1},S_{t+1})|(S_t, A_t)] \quad \forall t \geq 0.$$
* Termination: If the outcome for $S_T$ (for some time step $T$) is a state in the set $\mathcal{T}$, then the sequence terminates at time step $T$.
We refer to $\mathcal{N} = \mathcal{S} \setminus \mathcal{T}$ as the set of non-terminal states.
### Time Homogeneity
An MDP is **time-homogeneous** if the transition probabilities and reward function are constant over time steps. This simplifies analysis.
#### Defining Transition Dynamics and Rewards
* **Transition Probability Function** $\mathcal{P_R}$: $$\mathcal{P_R}(s,a,r,s') = \mathbb{P} [(R_{t+1}=r,S_{t+1}=s')|S_t=s, A_t=a].$$
* **State Transition Probability Function:** $$\mathcal{P}(s,a,s') = \sum_{r \in \mathcal{D}} \mathcal{P_R}(s,a,r,s').$$
* **Reward Transition Function**: \begin{split} \mathcal{R_T}(s,a,s') &= \mathbb{E}[R_{t+1}|S_{t+1}=s', S_t=s,A_t=a] \\ &=\sum_{r \in \mathcal{D}} \frac{\mathcal{P_R}(s,a,r,s')}{\mathcal{P}(s,a,s')} \cdot r = \sum_{r \in \mathcal{D}} \frac{\mathcal{P_R}(s,a,r,s')}{\sum_{r' \in \mathcal{D}}\mathcal{P_R}(s,a,r',s')} \cdot r. \end{split}
* **Reward Function**:
\begin{split} \mathcal{R}(s,a) &= \mathbb{E}[R_{t+1}|(S_t=s,A_t=a)]\\ &=\sum_{s' \in \mathcal{S}} \sum_{r \in \mathcal{D}} \mathcal{P_{R}}(s,a,r,s') \cdot r. \end{split}
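To make these definitions concrete, here is a small numeric sketch (my own toy numbers, not an example from the text) that stores $\mathcal P_R$ as a nested dictionary and recovers $\mathcal P$, $\mathcal R_T$ and $\mathcal R$ directly from the formulas above.
```python
# Toy P_R: P_R[(s, a)][(r, s_next)] = probability (hypothetical numbers).
P_R = {
    ("s0", "a"): {(1.0, "s0"): 0.3, (1.0, "s1"): 0.2, (0.0, "s1"): 0.5},
}

def state_transition_prob(s, a, s_next) -> float:
    # P(s, a, s') = sum over rewards r of P_R(s, a, r, s')
    return sum(p for (r, sp), p in P_R[(s, a)].items() if sp == s_next)

def reward_transition(s, a, s_next) -> float:
    # R_T(s, a, s') = E[R_{t+1} | S_t = s, A_t = a, S_{t+1} = s']
    return sum(r * p for (r, sp), p in P_R[(s, a)].items()
               if sp == s_next) / state_transition_prob(s, a, s_next)

def reward(s, a) -> float:
    # R(s, a) = sum over (s', r) of P_R(s, a, r, s') * r
    return sum(r * p for (r, sp), p in P_R[(s, a)].items())

print(state_transition_prob("s0", "a", "s1"))  # 0.3? no: 0.2 + 0.5 = 0.7
print(reward_transition("s0", "a", "s1"))      # 0.2 / 0.7 ≈ 0.286
print(reward("s0", "a"))                       # 1.0*0.3 + 1.0*0.2 = 0.5
```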
#### MDP Class
```python=
from __future__ import annotations
from abc import ABC, abstractmethod
from collections import defaultdict
from dataclasses import dataclass
from typing import (DefaultDict, Dict, Iterable, Generic, Mapping,
Tuple, Sequence, TypeVar, Set)
from distribution import (Categorical, Distribution, FiniteDistribution)
from markov_process import (
FiniteMarkovRewardProcess, MarkovRewardProcess, StateReward, State,
NonTerminal, Terminal)
from policy import FinitePolicy, Policy
A = TypeVar('A')
S = TypeVar('S')
@dataclass(frozen=True)
class TransitionStep(Generic[S, A]):
'''A single step in the simulation of an MDP, containing:
state -- the state we start from
action -- the action we took at that state
next_state -- the state we ended up in after the action
reward -- the instantaneous reward we got for this transition
'''
state: NonTerminal[S]
action: A
next_state: State[S]
reward: float
class MarkovDecisionProcess(ABC, Generic[S, A]):
def apply_policy(self, policy: Policy[S, A]) -> MarkovRewardProcess[S]:
mdp = self
class RewardProcess(MarkovRewardProcess[S]):
def transition_reward(
self,
state: NonTerminal[S]
) -> Distribution[Tuple[State[S], float]]:
actions: Distribution[A] = policy.act(state)
return actions.apply(lambda a: mdp.step(state, a))
return RewardProcess()
@abstractmethod
def actions(self, state: NonTerminal[S]) -> Iterable[A]:
pass
@abstractmethod
def step(
self,
state: NonTerminal[S],
action: A
) -> Distribution[Tuple[State[S], float]]:
pass
def simulate_actions(
self,
start_states: Distribution[NonTerminal[S]],
policy: Policy[S, A]
) -> Iterable[TransitionStep[S, A]]:
'''Simulate this MDP with the given policy,
        yielding the sequence of (state, action, next state, reward)
4-tuples encountered in the simulation trace.
'''
state: State[S] = start_states.sample()
while isinstance(state, NonTerminal):
action_distribution = policy.act(state)
action = action_distribution.sample()
next_distribution = self.step(state, action)
next_state, reward = next_distribution.sample()
yield TransitionStep(state, action, next_state, reward)
state = next_state
```
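As a usage sketch (my own toy example, not from the book), the abstract class above is used by subclassing it and implementing `actions` and `step`. The `CoinBettingMDP` below is hypothetical: the agent either bets one unit of wealth on a fair coin flip or stops, and wealth 0 is terminal.
```python
from typing import Iterable, Tuple
from distribution import Categorical, Constant, Distribution
from markov_process import NonTerminal, State, Terminal

class CoinBettingMDP(MarkovDecisionProcess[int, str]):
    def actions(self, state: NonTerminal[int]) -> Iterable[str]:
        return ["bet", "stop"]

    def step(
        self,
        state: NonTerminal[int],
        action: str
    ) -> Distribution[Tuple[State[int], float]]:
        wealth: int = state.state
        if action == "stop":
            return Constant((Terminal(wealth), 0.0))
        # Betting wins or loses one unit with equal probability;
        # losing the last unit lands in the terminal state 0.
        down: State[int] = Terminal(0) if wealth == 1 else NonTerminal(wealth - 1)
        return Categorical({
            (NonTerminal(wealth + 1), 1.0): 0.5,
            (down, -1.0): 0.5,
        })
```
With a policy such as the `UniformPolicy` defined in the Policy section below, `CoinBettingMDP().simulate_actions(Constant(NonTerminal(3)), UniformPolicy(lambda _: ["bet", "stop"]))` would yield a stream of `TransitionStep` objects until the agent stops or goes broke.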
### Finite MDP
A Markov Decision Process is **finite** when it meets the following criteria:
* **Finite State Space**: The set of all possible states $\mathcal{S}$ contains a finite number of states, denoted as $\mathcal{S} = \{s_1, s_2, \dots, s_n \}$.
* **Finite Action Space for Each State**: For each state $s \in \mathcal{N}$, the action space $\mathcal{A}(s)$ comprises a finite set of actions available in that state.
* **Finite Set of Outcome Pairs**: For every combination of a non-terminal state and an action, the set of possible next-state and reward pairs is finite.
#### Finite MDP Class
```python=
ActionMapping = Mapping[A, StateReward[S]]
StateActionMapping = Mapping[NonTerminal[S], ActionMapping[A, S]]
class FiniteMarkovDecisionProcess(MarkovDecisionProcess[S, A]):
'''A Markov Decision Process with finite state and action spaces.
'''
mapping: StateActionMapping[S, A]
non_terminal_states: Sequence[NonTerminal[S]]
def __init__(
self,
mapping: Mapping[S, Mapping[A, FiniteDistribution[Tuple[S, float]]]]
):
non_terminals: Set[S] = set(mapping.keys())
self.mapping = {NonTerminal(s): {a: Categorical(
{(NonTerminal(s1) if s1 in non_terminals else Terminal(s1), r): p
for (s1, r), p in v}
) for a, v in d.items()} for s, d in mapping.items()}
self.non_terminal_states = list(self.mapping.keys())
def __repr__(self) -> str:
display = ""
for s, d in self.mapping.items():
display += f"From State {s.state}:\n"
for a, d1 in d.items():
display += f" With Action {a}:\n"
for (s1, r), p in d1:
opt = "Terminal " if isinstance(s1, Terminal) else ""
display += f" To [{opt}State {s1.state} and "\
+ f"Reward {r:.3f}] with Probability {p:.3f}\n"
return display
def step(self, state: NonTerminal[S], action: A) -> StateReward[S]:
action_map: ActionMapping[A, S] = self.mapping[state]
return action_map[action]
def apply_finite_policy(self, policy: FinitePolicy[S, A])\
-> FiniteMarkovRewardProcess[S]:
transition_mapping: Dict[S, FiniteDistribution[Tuple[S, float]]] = {}
for state in self.mapping:
action_map: ActionMapping[A, S] = self.mapping[state]
outcomes: DefaultDict[Tuple[S, float], float]\
= defaultdict(float)
actions = policy.act(state)
for action, p_action in actions:
for (s1, r), p in action_map[action]:
outcomes[(s1.state, r)] += p_action * p
transition_mapping[state.state] = Categorical(outcomes)
return FiniteMarkovRewardProcess(transition_mapping)
def actions(self, state: NonTerminal[S]) -> Iterable[A]:
'''All the actions allowed for the given state.
This will be empty for terminal states.
'''
return self.mapping[state].keys()
```
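A tiny construction sketch (hypothetical states and numbers, not from the text): the constructor takes a nested mapping from state to action to a `Categorical` over `(next state, reward)` pairs, and wraps the states in `NonTerminal`/`Terminal` itself.
```python
from distribution import Categorical

# A two-state machine-maintenance toy: states and actions are plain strings.
machine_mdp = FiniteMarkovDecisionProcess({
    "broken": {
        "repair": Categorical({("working", -5.0): 0.9, ("broken", -5.0): 0.1}),
        "ignore": Categorical({("broken", -1.0): 1.0}),
    },
    "working": {
        "use": Categorical({("working", 2.0): 0.95, ("broken", 2.0): 0.05}),
    },
})

print(machine_mdp)  # prints the transition map in the __repr__ format above
```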
### Policy
A policy is the core of how an agent makes decisions in an MDP. It's like a rulebook the agent follows at each state:
* **Input:** The current state of the environment.
* **Output:** A probability distribution over actions. This means for each possible action, the policy tells you how likely the agent is to take it.
#### Formal Definition
A **policy** is a function $\pi: \mathcal{N} \times \mathcal{A} \rightarrow [0,1]$ defined by
$$
\pi(s,a) = \mathbb{P}[A_t=a|S_t=s].
$$
#### Types of Policies
* **Deterministic Policy:** In each state, there's only one action with a probability of 1, the rest are 0. Think of this as a very strict rulebook.
* **Stochastic Policy:** In each state, there might be several actions with non-zero probabilities. This allows for more flexible and potentially adaptive behavior.
#### Policy Class
```python=
from __future__ import annotations
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Callable, Generic, Iterable, Mapping, TypeVar
from distribution import Choose, Constant, Distribution, FiniteDistribution
from markov_process import NonTerminal
A = TypeVar('A')
S = TypeVar('S')
class Policy(ABC, Generic[S, A]):
'''
A policy is a function that specifies what
we should do (the action) at a given state
of our MDP.
'''
@abstractmethod
def act(self, state: NonTerminal[S]) -> Distribution[A]:
'''
A distribution of actions to take from the
given non-terminal state.
'''
@dataclass(frozen=True)
class UniformPolicy(Policy[S, A]):
valid_actions: Callable[[S], Iterable[A]]
def act(self, state: NonTerminal[S]) -> Choose[A]:
return Choose(self.valid_actions(state.state))
@dataclass(frozen=True)
class RandomPolicy(Policy[S, A]):
'''
    A policy that randomly selects one of several specified policies
    for each action. Given the right inputs, this could simulate
    things like epsilon-greedy policies.
    '''
policy_choices: Distribution[Policy[S, A]]
def act(self, state: NonTerminal[S]) -> Distribution[A]:
policy: Policy[S, A] = self.policy_choices.sample()
return policy.act(state)
@dataclass(frozen=True)
class DeterministicPolicy(Policy[S, A]):
action_for: Callable[[S], A]
def act(self, state: NonTerminal[S]) -> Constant[A]:
return Constant(self.action_for(state.state))
class Always(DeterministicPolicy[S, A]):
'''A constant policy: always return the same (specified) action for
every possible state.
'''
action: A
def __init__(self, action: A):
self.action = action
super().__init__(lambda _: action)
@dataclass(frozen=True)
class FinitePolicy(Policy[S, A]):
''' A policy where the state and action spaces are finite.
'''
policy_map: Mapping[S, FiniteDistribution[A]]
def __repr__(self) -> str:
display = ""
for s, d in self.policy_map.items():
display += f"For State {s}:\n"
for a, p in d:
display += f" Do Action {a} with Probability {p:.3f}\n"
return display
def act(self, state: NonTerminal[S]) -> FiniteDistribution[A]:
return self.policy_map[state.state]
class FiniteDeterministicPolicy(FinitePolicy[S, A]):
'''A deterministic policy where the state and action spaces are
finite.
'''
action_for: Mapping[S, A]
def __init__(self, action_for: Mapping[S, A]):
self.action_for = action_for
super().__init__(policy_map={s: Constant(a) for s, a in
self.action_for.items()})
def __repr__(self) -> str:
display = ""
for s, a in self.action_for.items():
display += f"For State {s}: Do Action {a}\n"
return display
```
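A short usage sketch for these classes, reusing the hypothetical machine-maintenance states from earlier (my own example, not from the text): a stochastic `FinitePolicy` and a deterministic `FiniteDeterministicPolicy`.
```python
from distribution import Categorical, Constant
from markov_process import NonTerminal

repair_sometimes = FinitePolicy({
    "broken": Categorical({"repair": 0.8, "ignore": 0.2}),
    "working": Constant("use"),
})
always_repair = FiniteDeterministicPolicy({"broken": "repair", "working": "use"})

print(repair_sometimes)                          # uses the __repr__ above
print(always_repair.act(NonTerminal("broken")))  # a Constant distribution over "repair"
```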
### From MDP to MRP: How a Policy Simplifies the Problem
Imagine the MDP as a game where you have multiple choices at each step. A policy is like picking a specific strategy for how you'll play. Applying a policy does two main things:
1. **Locks in Decisions:** Instead of considering all possible actions at every state, the policy tells us the probability of taking each action. This removes a layer of uncertainty.
2. **Blends Together Outcomes:** The MRP's transition probabilities and rewards are calculated by taking the possible outcomes of the MDP and averaging them according to the policy's probabilities.
#### Why This Matters
* **Easier Analysis:** MRPs are generally simpler to analyze than MDPs. Finding good policies for MDPs often involves converting them to MRPs!
* **Focus on the Strategy:** Once we have a policy, we can analyze how well it will perform as if it were the only way to play the game.
#### The Transformation in Formulas
* **Transition Probability Function** $$\mathcal P^{\pi}_{R}(s,r,s') = \sum_{a \in \mathcal A} \pi(s,a) \cdot \mathcal P_R(s,a,r,s').$$
* **State Transition Probability Function** $$\mathcal P^{\pi}(s,s') = \sum_{a \in \mathcal A} \pi(s,a) \cdot \mathcal P(s,a,s').$$
* **Reward Transition Function** $$\mathcal R^{\pi}_{T}(s,s') = \sum_{a \in \mathcal A} \pi(s,a) \cdot \mathcal R_T(s,a,s').$$
* **Reward Function** $$\mathcal R^{\pi}(s) = \sum_{a \in \mathcal A} \pi(s,a) \cdot \mathcal R(s,a).$$
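The `apply_finite_policy` method above implements exactly this blending. As a stripped-down sketch of the same computation on plain dictionaries (my own hypothetical structures, not the book's classes):
```python
def mrp_transition_reward(P_R, pi, s):
    """P_R^pi(s, ., .): blend each action's (r, s') distribution
    with the policy's action probabilities pi[s][a]."""
    blended = {}
    for a, p_a in pi[s].items():
        for (r, sp), p in P_R[(s, a)].items():
            blended[(r, sp)] = blended.get((r, sp), 0.0) + p_a * p
    return blended

def mrp_reward(P_R, pi, s):
    """R^pi(s) = sum_a pi(s, a) * R(s, a)."""
    return sum(p_a * sum(r * p for (r, _), p in P_R[(s, a)].items())
               for a, p_a in pi[s].items())

# With the toy P_R from the earlier sketch and pi = {"s0": {"a": 1.0}},
# mrp_reward(P_R, pi, "s0") reproduces R("s0", "a") = 0.5.
```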
## Example: Simple Inventory MDP
The setup is the same as in the MRP example.
### Alteration
- To qualify as a Finite Markov Decision Process, we model the pair of next state and reward as $(S_{t+1},\mathbb{E}[R_{t+1}|(S_t,A_t,S_{t+1})])$, so that the set of possible next-state and reward pairs stays finite.
- So given a state $s$ and action $a$, the pairs of next state and reward would be: $(s', \mathcal R_T (s, a, s'))$ for all the $s'$ we transition to from $(s, a)$.
### Calculation of $\mathcal R_T$
* When the next state has positive on-hand inventory, i.e. $S_{t+1} = (\alpha+\beta-i,\theta)$ with $i \leq \alpha+\beta-1$, for all $\alpha,\beta$ with $0\leq\alpha+\beta\leq\mathcal C$ and for all $\theta$ with $0\leq\theta\leq\mathcal C-(\alpha+\beta)$:
$$
\mathcal R_T((\alpha,\beta),\theta,(\alpha+\beta-i,\theta)) = -h\alpha
$$ for $0\leq i\leq\alpha+\beta-1$.
* When the next state has zero on-hand inventory, i.e. $S_{t+1} = (0,\theta)$, two demand scenarios are lumped together:
    * the demand was exactly $\alpha+\beta$;
    * the demand was strictly greater than $\alpha+\beta$, in which case the stockout cost $p$ is charged on the expected unmet demand.
* Conditioning on demand $D_{t+1} \geq \alpha+\beta$ (Poisson demand with mean $\lambda$, PMF $f$ and CDF $F$, as in the MRP example):
$$
\mathcal R_T((\alpha,\beta),\theta,(0,\theta)) = -h\alpha - p\cdot\mathbb{E}[D_{t+1}-(\alpha+\beta)\mid D_{t+1}\geq\alpha+\beta] = -h\alpha - p\left(\lambda - (\alpha+\beta)\left(1-\frac{f(\alpha+\beta)}{1-F(\alpha+\beta-1)}\right)\right),
$$
which is the expression implemented in the code below; a quick numerical check follows this list.
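As a sanity check of the stockout case (a sketch of my own, using the same toy parameters as the run below: $h=1$, $p=10$, $\lambda=1$), the closed-form expression can be compared against a direct truncated-tail computation of the conditional expectation:
```python
from scipy.stats import poisson

h, p, lam = 1.0, 10.0, 1.0   # holding cost, stockout cost, Poisson mean (toy values)
alpha, beta = 0, 1           # state (on-hand, on-order); inventory position ip = 1
ip = alpha + beta
distr = poisson(lam)

prob_ge_ip = 1 - distr.cdf(ip - 1)            # P[demand >= ip]
# Direct conditional expectation E[D - ip | D >= ip], truncating the Poisson tail.
excess = sum((j - ip) * distr.pmf(j) for j in range(ip, 200)) / prob_ge_ip
direct = -h * alpha - p * excess

# Closed form used in SimpleInventoryMDPCap below.
closed = -h * alpha - p * (lam - ip * (1 - distr.pmf(ip) / prob_ge_ip))

print(direct, closed)  # both ≈ -5.82, matching the transition map printed below
```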
#### Python Implementation
```python=
from dataclasses import dataclass
from typing import Tuple, Dict, Mapping
from markov_decision_process import FiniteMarkovDecisionProcess
from policy import FiniteDeterministicPolicy
from markov_process import FiniteMarkovProcess, FiniteMarkovRewardProcess
from distribution import Categorical
from scipy.stats import poisson
@dataclass(frozen=True)
class InventoryState:
on_hand: int
on_order: int
def inventory_position(self) -> int:
return self.on_hand + self.on_order
InvOrderMapping = Mapping[
InventoryState,
Mapping[int, Categorical[Tuple[InventoryState, float]]]
]
class SimpleInventoryMDPCap(FiniteMarkovDecisionProcess[InventoryState, int]):
def __init__(
self,
capacity: int,
poisson_lambda: float,
holding_cost: float,
stockout_cost: float
):
self.capacity: int = capacity
self.poisson_lambda: float = poisson_lambda
self.holding_cost: float = holding_cost
self.stockout_cost: float = stockout_cost
self.poisson_distr = poisson(poisson_lambda)
super().__init__(self.get_action_transition_reward_map())
def get_action_transition_reward_map(self) -> InvOrderMapping:
d: Dict[InventoryState, Dict[int, Categorical[Tuple[InventoryState,
float]]]] = {}
for alpha in range(self.capacity + 1):
for beta in range(self.capacity + 1 - alpha):
state: InventoryState = InventoryState(alpha, beta)
ip: int = state.inventory_position()
base_reward: float = - self.holding_cost * alpha
d1: Dict[int, Categorical[Tuple[InventoryState, float]]] = {}
for order in range(self.capacity - ip + 1):
sr_probs_dict: Dict[Tuple[InventoryState, float], float] =\
{(InventoryState(ip - i, order), base_reward):
self.poisson_distr.pmf(i) for i in range(ip)}
probability: float = 1 - self.poisson_distr.cdf(ip - 1)
reward: float = base_reward - self.stockout_cost * \
(self.poisson_lambda - ip *
(1 - self.poisson_distr.pmf(ip) / probability))
sr_probs_dict[(InventoryState(0, order), reward)] = \
probability
d1[order] = Categorical(sr_probs_dict)
d[state] = d1
return d
if __name__ == '__main__':
from pprint import pprint
user_capacity = 2
user_poisson_lambda = 1.0
user_holding_cost = 1.0
user_stockout_cost = 10.0
user_gamma = 0.9
si_mdp: FiniteMarkovDecisionProcess[InventoryState, int] =\
SimpleInventoryMDPCap(
capacity=user_capacity,
poisson_lambda=user_poisson_lambda,
holding_cost=user_holding_cost,
stockout_cost=user_stockout_cost
)
print("MDP Transition Map")
print("------------------")
print(si_mdp)
fdp: FiniteDeterministicPolicy[InventoryState, int] = \
FiniteDeterministicPolicy(
{InventoryState(alpha, beta): user_capacity - (alpha + beta)
for alpha in range(user_capacity + 1)
for beta in range(user_capacity + 1 - alpha)}
)
print("Deterministic Policy Map")
print("------------------------")
print(fdp)
implied_mrp: FiniteMarkovRewardProcess[InventoryState] =\
si_mdp.apply_finite_policy(fdp)
print("Implied MP Transition Map")
print("--------------")
print(FiniteMarkovProcess(
{s.state: Categorical({s1.state: p for s1, p in v.table().items()})
for s, v in implied_mrp.transition_map.items()}
))
print("Implied MRP Transition Reward Map")
print("---------------------")
print(implied_mrp)
print("Implied MP Stationary Distribution")
print("-----------------------")
implied_mrp.display_stationary_distribution()
print()
print("Implied MRP Reward Function")
print("---------------")
implied_mrp.display_reward_function()
print()
print("Implied MRP Value Function")
print("--------------")
implied_mrp.display_value_function(gamma=user_gamma)
print()
from dynamic_programming import evaluate_mrp_result
from dynamic_programming import policy_iteration_result
from dynamic_programming import value_iteration_result
print("Implied MRP Policy Evaluation Value Function")
print("--------------")
pprint(evaluate_mrp_result(implied_mrp, gamma=user_gamma))
print()
print("MDP Policy Iteration Optimal Value Function and Optimal Policy")
print("--------------")
opt_vf_pi, opt_policy_pi = policy_iteration_result(
si_mdp,
gamma=user_gamma
)
pprint(opt_vf_pi)
print(opt_policy_pi)
print()
print("MDP Value Iteration Optimal Value Function and Optimal Policy")
print("--------------")
opt_vf_vi, opt_policy_vi = value_iteration_result(si_mdp, gamma=user_gamma)
pprint(opt_vf_vi)
print(opt_policy_vi)
print()
```
``` python
MDP Transition Map
------------------
From State InventoryState(on_hand=0, on_order=0):
With Action 0:
To [State InventoryState(on_hand=0, on_order=0) and Reward -10.000] with Probability 1.000
With Action 1:
To [State InventoryState(on_hand=0, on_order=1) and Reward -10.000] with Probability 1.000
With Action 2:
To [State InventoryState(on_hand=0, on_order=2) and Reward -10.000] with Probability 1.000
From State InventoryState(on_hand=0, on_order=1):
With Action 0:
To [State InventoryState(on_hand=1, on_order=0) and Reward -0.000] with Probability 0.368
To [State InventoryState(on_hand=0, on_order=0) and Reward -5.820] with Probability 0.632
With Action 1:
To [State InventoryState(on_hand=1, on_order=1) and Reward -0.000] with Probability 0.368
To [State InventoryState(on_hand=0, on_order=1) and Reward -5.820] with Probability 0.632
From State InventoryState(on_hand=0, on_order=2):
With Action 0:
To [State InventoryState(on_hand=2, on_order=0) and Reward -0.000] with Probability 0.368
To [State InventoryState(on_hand=1, on_order=0) and Reward -0.000] with Probability 0.368
To [State InventoryState(on_hand=0, on_order=0) and Reward -3.922] with Probability 0.264
From State InventoryState(on_hand=1, on_order=0):
With Action 0:
To [State InventoryState(on_hand=1, on_order=0) and Reward -1.000] with Probability 0.368
To [State InventoryState(on_hand=0, on_order=0) and Reward -6.820] with Probability 0.632
With Action 1:
To [State InventoryState(on_hand=1, on_order=1) and Reward -1.000] with Probability 0.368
To [State InventoryState(on_hand=0, on_order=1) and Reward -6.820] with Probability 0.632
From State InventoryState(on_hand=1, on_order=1):
With Action 0:
To [State InventoryState(on_hand=2, on_order=0) and Reward -1.000] with Probability 0.368
To [State InventoryState(on_hand=1, on_order=0) and Reward -1.000] with Probability 0.368
To [State InventoryState(on_hand=0, on_order=0) and Reward -4.922] with Probability 0.264
From State InventoryState(on_hand=2, on_order=0):
With Action 0:
To [State InventoryState(on_hand=2, on_order=0) and Reward -2.000] with Probability 0.368
To [State InventoryState(on_hand=1, on_order=0) and Reward -2.000] with Probability 0.368
To [State InventoryState(on_hand=0, on_order=0) and Reward -5.922] with Probability 0.264
Deterministic Policy Map
------------------------
For State InventoryState(on_hand=0, on_order=0): Do Action 2
For State InventoryState(on_hand=0, on_order=1): Do Action 1
For State InventoryState(on_hand=0, on_order=2): Do Action 0
For State InventoryState(on_hand=1, on_order=0): Do Action 1
For State InventoryState(on_hand=1, on_order=1): Do Action 0
For State InventoryState(on_hand=2, on_order=0): Do Action 0
Implied MP Transition Map
--------------
From State InventoryState(on_hand=0, on_order=0):
To State InventoryState(on_hand=0, on_order=2) with Probability 1.000
From State InventoryState(on_hand=0, on_order=1):
To State InventoryState(on_hand=1, on_order=1) with Probability 0.368
To State InventoryState(on_hand=0, on_order=1) with Probability 0.632
From State InventoryState(on_hand=0, on_order=2):
To State InventoryState(on_hand=2, on_order=0) with Probability 0.368
To State InventoryState(on_hand=1, on_order=0) with Probability 0.368
To State InventoryState(on_hand=0, on_order=0) with Probability 0.264
From State InventoryState(on_hand=1, on_order=0):
To State InventoryState(on_hand=1, on_order=1) with Probability 0.368
To State InventoryState(on_hand=0, on_order=1) with Probability 0.632
From State InventoryState(on_hand=1, on_order=1):
To State InventoryState(on_hand=2, on_order=0) with Probability 0.368
To State InventoryState(on_hand=1, on_order=0) with Probability 0.368
To State InventoryState(on_hand=0, on_order=0) with Probability 0.264
From State InventoryState(on_hand=2, on_order=0):
To State InventoryState(on_hand=2, on_order=0) with Probability 0.368
To State InventoryState(on_hand=1, on_order=0) with Probability 0.368
To State InventoryState(on_hand=0, on_order=0) with Probability 0.264
Implied MRP Transition Reward Map
---------------------
From State InventoryState(on_hand=0, on_order=0):
To [State InventoryState(on_hand=0, on_order=2) and Reward -10.000] with Probability 1.000
From State InventoryState(on_hand=0, on_order=1):
To [State InventoryState(on_hand=1, on_order=1) and Reward -0.000] with Probability 0.368
To [State InventoryState(on_hand=0, on_order=1) and Reward -5.820] with Probability 0.632
From State InventoryState(on_hand=0, on_order=2):
To [State InventoryState(on_hand=2, on_order=0) and Reward -0.000] with Probability 0.368
To [State InventoryState(on_hand=1, on_order=0) and Reward -0.000] with Probability 0.368
To [State InventoryState(on_hand=0, on_order=0) and Reward -3.922] with Probability 0.264
From State InventoryState(on_hand=1, on_order=0):
To [State InventoryState(on_hand=1, on_order=1) and Reward -1.000] with Probability 0.368
To [State InventoryState(on_hand=0, on_order=1) and Reward -6.820] with Probability 0.632
From State InventoryState(on_hand=1, on_order=1):
To [State InventoryState(on_hand=2, on_order=0) and Reward -1.000] with Probability 0.368
To [State InventoryState(on_hand=1, on_order=0) and Reward -1.000] with Probability 0.368
To [State InventoryState(on_hand=0, on_order=0) and Reward -4.922] with Probability 0.264
From State InventoryState(on_hand=2, on_order=0):
To [State InventoryState(on_hand=2, on_order=0) and Reward -2.000] with Probability 0.368
To [State InventoryState(on_hand=1, on_order=0) and Reward -2.000] with Probability 0.368
To [State InventoryState(on_hand=0, on_order=0) and Reward -5.922] with Probability 0.264
Implied MP Stationary Distribution
-----------------------
{InventoryState(on_hand=0, on_order=2): 0.117,
InventoryState(on_hand=0, on_order=1): 0.279,
InventoryState(on_hand=1, on_order=0): 0.162,
InventoryState(on_hand=0, on_order=0): 0.117,
InventoryState(on_hand=2, on_order=0): 0.162,
InventoryState(on_hand=1, on_order=1): 0.162}
Implied MRP Reward Function
---------------
{NonTerminal(state=InventoryState(on_hand=0, on_order=0)): -10.0,
NonTerminal(state=InventoryState(on_hand=0, on_order=1)): -3.679,
NonTerminal(state=InventoryState(on_hand=0, on_order=2)): -1.036,
NonTerminal(state=InventoryState(on_hand=1, on_order=0)): -4.679,
NonTerminal(state=InventoryState(on_hand=1, on_order=1)): -2.036,
NonTerminal(state=InventoryState(on_hand=2, on_order=0)): -3.036}
Implied MRP Value Function
--------------
{NonTerminal(state=InventoryState(on_hand=0, on_order=0)): -43.596,
NonTerminal(state=InventoryState(on_hand=0, on_order=1)): -37.971,
NonTerminal(state=InventoryState(on_hand=0, on_order=2)): -37.329,
NonTerminal(state=InventoryState(on_hand=1, on_order=0)): -38.971,
NonTerminal(state=InventoryState(on_hand=1, on_order=1)): -38.329,
NonTerminal(state=InventoryState(on_hand=2, on_order=0)): -39.329}
```
## Value Function in MDPs
Value functions are the core of solving and analyzing MDPs. Let's break them down:
### State-Value Function $V$
* **What it means:** The expected total discounted reward you'll get starting in a state $s$ and following a specific policy $\pi$: $$V^\pi(s)= \mathbb E_{\pi,\mathcal P_R}[G_t|S_t=s].$$
* **Why it's important:** Compares how good different states are under a fixed policy.
* **Bellman Equation:** The key relationship defining $V$: $$V^\pi(s) = \mathcal R^\pi(s)+\gamma\cdot\sum_{s'\in \mathcal N}\mathcal P^\pi(s,s')\cdot V^\pi(s') \quad (*)$$
### Action-Value Function $Q$
* **What it means:** The expected total discounted reward you'll get by taking action $a$ in state $s$, and then following policy $\pi$: $$Q^\pi(s,a)=\mathbb E_{\pi,\mathcal P_R}[G_t|(S_t=s,A_t=a)].$$
* **Why it's important:** Compares how good different actions are in a given state under a fixed policy.
* **Relationship to $V^\pi$:** $$V^\pi(s)=\sum_{a\in \mathcal A}\pi(s,a)\cdot Q^\pi(s,a)\text{ for all }s\in \mathcal N. \quad (\triangle)$$
**MDP Prediction problem**: Evaluating $V^\pi(\cdot)$ and $Q^\pi(\cdot)$ for a fixed policy $\pi$.
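For a finite MDP, one way to solve the prediction problem is to apply the policy to obtain the implied MRP (as above) and then iterate the Bellman equation $(*)$ to a fixed point. Below is a minimal numpy sketch (my own, not the book's `dynamic_programming.evaluate_mrp_result`), assuming $\mathcal R^\pi$ is given as a vector and $\mathcal P^\pi$ as a matrix over non-terminal states.
```python
import numpy as np

def evaluate_policy(r_pi: np.ndarray, p_pi: np.ndarray, gamma: float,
                    tol: float = 1e-10) -> np.ndarray:
    """Iterate V <- R^pi + gamma * P^pi V until convergence."""
    v = np.zeros(r_pi.shape)
    while True:
        v_new = r_pi + gamma * p_pi @ v
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new
```
Since $(*)$ is linear, the same value function is also available in closed form as $V^\pi = (I - \gamma \mathcal P^\pi)^{-1} \mathcal R^\pi$ when $\gamma < 1$.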
### Optimal Value Function $V^*, Q^*$
* **What they mean:** The highest achievable values for each state $V^*$ and state-action pair $Q^*$ by following any policy: \begin{split} V^*(s) &= \max_{\pi \in \Pi} V^\pi(s) \\ Q^*(s,a) &= \max_{\pi \in \Pi} Q^\pi(s,a) \end{split}
* **Why they're important:** They represent the MDP's true potential. Once we have these, we can find the best policy.
* **Bellman Optimality Equations:** The key relationships defining optimal value functions
* For $V^*:$ $$V^*(s) = \max_{a \in \mathcal{A}} \left\{ \mathcal{R}(s,a) + \gamma \cdot \sum_{s' \in \mathcal{N}} \mathcal{P}(s,a,s') \cdot V^*(s') \right\}$$
* For $Q^*:$ $$Q^*(s,a) = \mathcal{R}(s,a) + \gamma \cdot \sum_{s' \in \mathcal{N}} \mathcal{P}(s,a,s') \cdot \max_{a' \in \mathcal{A}} Q^*(s',a')$$
**MDP Control Problem**: Computing $V^*(\cdot)$ and $Q^*(\cdot)$.
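A minimal sketch of the control problem solved by iterating the Bellman Optimality Equation for $V^*$ (my own array-based toy, not the book's `value_iteration_result`), with $\mathcal R$ stored as an array `R[s, a]` and $\mathcal P$ as `P[s, a, s']`:
```python
import numpy as np

def value_iteration(R: np.ndarray, P: np.ndarray, gamma: float,
                    tol: float = 1e-10):
    """Return (V*, greedy policy) for tabular R[s, a] and P[s, a, s']."""
    v = np.zeros(R.shape[0])
    while True:
        q = R + gamma * P @ v   # Q(s, a) = R(s, a) + gamma * sum_s' P(s, a, s') V(s')
        v_new = q.max(axis=1)   # V(s) = max_a Q(s, a)
        if np.max(np.abs(v_new - v)) < tol:
            return v_new, q.argmax(axis=1)
        v = v_new
```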
### Key Ideas
* **Policies matter:** Value functions depend on the policy being followed.
* **Optimal policy:** A policy is *optimal* if it achieves the optimal value functions at every state: $$\pi^*\in\Pi\text{ is optimal if }V^{\pi^*}(s)\geq V^\pi(s) \quad \forall \pi \in \Pi, \ \forall s \in \mathcal{N}$$
* **Finding the best:** Dynamic programming (like Policy Iteration, Value Iteration) and reinforcement learning algorithms focus on finding optimal policies using the information from value functions.
**Lemma** For any two optimal policies $\pi^*_1$ and $\pi^*_2$, $V^{\pi^*_1}(s)=V^{\pi^*_2}(s)\text{ for all }s \in \mathcal{N}$.
**Proof** Since $\pi^*_1$ is an optimal policy, we have $V^{\pi^*_1}(s)\geq V^{\pi^*_2}(s)$ for all $s\in\mathcal{N}$. Similarly, $V^{\pi^*_2}(s)\geq V^{\pi^*_1}(s)$ for all $s\in\mathcal{N}$. It follows that $V^{\pi^*_1}(s)=V^{\pi^*_2}(s)$ for all $s\in\mathcal{N}$. $\Box$
**Theorem** For any discrete-time, countable-state, time-homogeneous MDP, the following statements hold:
* (Existence) There exists at least one optimal policy $\pi^* \in \Pi$ that dominates or equals the performance of all other policies for every state $s \in \mathcal{N}$.
* (Optimal Value Function Achievement) Any Optimal Policy $\pi^*$ attains the optimal value function, i.e. $V^{\pi^*}(s) = V^*(s)$ for all states $s \in \mathcal{N}$.
* (Optimal Action-Value Function Achievement) Similarly, every optimal policy $\pi^*$ realizes the Optimal Action-Value Function, i.e. $Q^{\pi^*}(s,a) = Q^*(s,a)$ for all state-action pairs $(s,a) \in \mathcal{N} \times \mathcal{A}$.
**Proof**
By the preceding lemma, it suffices to exhibit a single optimal policy that achieves both the optimal value function and the optimal action-value function.
We define a deterministic policy $\pi^*_D$ as
$$
\pi_D^*(s) = \arg \max_{a \in \mathcal{A}} Q^*(s, a) \text{ for all } s \in \mathcal{N}.
$$
Then,
* (Achievement of $V^*$ and $Q^*$): Given that $\pi_D^*(s) = \arg \max_{a \in \mathcal{A}} Q^*(s, a)$ and $V^*(s) = \max_{a \in \mathcal{A}} Q^*(s, a)$ for all $s \in \mathcal{N}$, it follows that for every $s \in \mathcal{N}$
$$
V^*(s) = Q^*(s, \pi_D^*(s)).
$$
Substituting this into the Bellman Optimality Equation shows that $V^*$ satisfies the ordinary Bellman equation of the MRP induced by $\pi_D^*$; unrolling that recursion, following $\pi_D^*$ at every step attains $V^*$, so
\begin{split}
V^{\pi_D^*}(s) &= V^*(s) \quad \forall s \in \mathcal{N}, \\
Q^{\pi_D^*}(s, a) &= Q^*(s, a) \quad \forall s \in \mathcal{N}, \ \forall a \in \mathcal{A}.
\end{split}
* (Optimality of $\pi_D^*$): Assume the contrary. Then there exists a policy $\pi \in \Pi$ and a state $s \in \mathcal{N}$ such that $V^\pi(s) > V^{\pi_D^*}(s)$. However, because $V^{\pi_D^*}(s) = V^*(s)$, this implies $V^\pi(s) > V^*(s)$, which is a contradiction.
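The deterministic policy constructed in the proof is easy to extract once $Q^*$ is available in tabular form; a small sketch with hypothetical numbers:
```python
# Hypothetical tabular Q*: Q_star[(state, action)] = optimal action-value.
Q_star = {("s0", "a1"): 1.0, ("s0", "a2"): 2.5,
          ("s1", "a1"): 0.0, ("s1", "a2"): -1.0}

pi_star: dict = {}
for (s, a), q in Q_star.items():
    # Keep the argmax action per state.
    if s not in pi_star or q > Q_star[(s, pi_star[s])]:
        pi_star[s] = a

print(pi_star)  # {'s0': 'a2', 's1': 'a1'}
```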
## Going Beyond the Basics: MDP Variants and Extensions
While the core concepts of MDPs focus on discrete time steps and finite state/action spaces, let's explore how we can adapt the framework to handle more complex and realistic scenarios:
### Tackling Large or Infinite State/Action Spaces
* **Function approximation:** Use machine learning models (like neural networks) to efficiently represent and estimate value functions (state or action values) over vast spaces. This avoids the need to explicitly calculate values for every possible state.
* **Policy Gradient Methods:** For large or continuous action spaces, these methods optimize the policy itself without calculating values for all actions. Think of it as directly searching for a good way to act, rather than evaluating every possible choice.
### Different Notions of Time
* **Episodic MDPs:** Tasks have a clear end (like reaching a goal).
* **Continuing MDPs:** No end state, the process goes on indefinitely (think of a robot that needs to keep operating).
* **Average-Reward MDPs:** Focus on maximizing the average reward per time step, good for long-term tasks where immediate rewards aren't the only focus.
* **Continuous-Time MDPs:** Model time as continuous, not in discrete steps. This gets mathematically trickier (think calculus), but essential for some real-world problems.
### When You Can't See Everything: POMDPs
* **The Challenge:** In a **Partially Observable MDP (POMDP)**, you can't directly see the true state of the world. Think of a robot with noisy sensors, not knowing exactly where it is.
* **Belief States:** The robot keeps track of a probability distribution over possible states, its 'belief' about where it could be.
* **Acting on Uncertainty:** The policy decides the best action based on this uncertain belief, not a single known state.
### Why This Matters
These extensions make MDPs applicable to a wider range of problems:
* **Complex systems:** Handling massive state spaces needed for complex systems.
* **Realistic decision-making:** Modeling uncertainty in how an agent perceives the world.
* **Flexibility:** Accommodating different goals (short-term vs. long-term rewards) and how time is modeled.
## References
- Chapter 4 of the [RLForFinanceBook](https://stanford.edu/~ashlearn/RLForFinanceBook/book.pdf)
- [Markov Decision Process and Bellman Equations](https://github.com/coverdrive/technical-documents/blob/master/finance/cme241/Tour-MDP.pdf) slides for CME 241: Foundations of Reinforcement Learning with Applications in Finance