# Markov Processes
## Concepts of Process and State
* **Process:** A sequence of random events or outcomes over time. Examples include stock price fluctuations, weather patterns, or a robot's movement through a maze.
* **State:** The internal condition of the process at a specific time, denoted as $S_t$. The state captures all the information necessary to predict the probabilities of future states, without requiring the entire history.
**Key Objective:** Our goal is to understand and calculate the probability $\mathbb{P}[S_{t+1} \mid S_t, S_{t-1}, \dots, S_0]$, which describes the likelihood of the process transitioning to a specific future state given its past and present states.
### Random Walks: Illustrative Examples
**Random walks** are fundamental examples of stochastic processes. Let's explore different variations:
1. **Simple Random Walk:** In each time step, the process moves one step to the right or left with equal probability:
$$
\mathbb{P}[X_{t+1} = X_t + 1] = \mathbb{P}[X_{t+1} = X_t - 1] = \frac{1}{2}.
$$
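A minimal simulation sketch of this walk (assuming $X_0 = 0$ and a horizon of $T$ steps):
```python=
import numpy as np

T = 100                                      # number of time steps
np.random.seed(0)                            # for reproducibility
steps = np.random.choice([1, -1], size=T)    # i.i.d. +/-1 moves, probability 1/2 each
X = np.concatenate(([0], np.cumsum(steps)))  # X_0 = 0, X_t = sum of the first t moves
```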
2. **Mean-Reverting Random Walk:** This walk exhibits a tendency to move towards a central level $L$. The probability of an upward move depends on the distance from $L$: $$\mathbb{P}[X_{t+1} = X_t + 1] = \frac{1}{1 + e^{-\alpha(L-X_t)}}.$$
* $\alpha$ controls the strength of the mean-reversion.
```python=
import numpy as np
import matplotlib.pyplot as plt

# Parameters
alphas = [0.5, 1, 2]               # different alpha values
colors = ['red', 'blue', 'green']  # colors for the lines
L = 0                              # level towards which the process reverts
T = 100                            # number of time steps
np.random.seed(0)                  # for reproducibility

# Plotting
plt.figure(figsize=(10, 6))

# Simulate and plot the process for different alphas
for alpha, color in zip(alphas, colors):
    X_t = np.zeros(T)  # initialize the process
    for t in range(1, T):
        prob_up = 1 / (1 + np.exp(-alpha * (L - X_t[t-1])))
        if np.random.rand() < prob_up:
            X_t[t] = X_t[t-1] + 1
        else:
            X_t[t] = X_t[t-1] - 1
    plt.plot(X_t, label=f'Random Walk $X_t$ with $\\alpha$={alpha}', color=color)

plt.axhline(y=L, color='r', linestyle='--', label='Level $L$')
plt.xlabel('Time $t$')
plt.ylabel('$X_t$')
plt.title('Mean-Reverting Random Walk with Different $\\alpha$')
plt.legend()
plt.show()
```

3. **One-Step Reversal Random Walk:** Introduces a bias in the opposite direction of the previous step:
$$\mathbb{P}[X_{t+1} = X_t + 1] =
\begin{cases}
0.5\,(1 - \alpha(X_t - X_{t-1})) &\text{ if } t>0 \\
0.5 &\text{ if } t=0
\end{cases}
$$
* $\alpha \in [0,1]$ is the pull-strength parameter.
```python=
import numpy as np
import matplotlib.pyplot as plt

def one_step_reversal_random_walk(steps, alpha):
    X = np.zeros(steps)
    for t in range(1, steps):
        if t == 1:
            p = 0.5  # first step: there is no previous move, so p defaults to 0.5
        else:
            delta = X[t-1] - X[t-2]        # direction of the previous step (+1 or -1)
            p = 0.5 * (1 - alpha * delta)  # bias the next step against the last one
        if np.random.rand() < p:
            X[t] = X[t-1] + 1
        else:
            X[t] = X[t-1] - 1
    return X

steps = 100

# Plotting
plt.figure(figsize=(10, 6))
alphas = [0.0, 0.5, 1.0]
colors = ['red', 'blue', 'green']
for alpha, color in zip(alphas, colors):
    X = one_step_reversal_random_walk(steps, alpha)
    plt.plot(X, label=f'alpha = {alpha}', color=color)

plt.xlabel('Time step')
plt.ylabel('Position')
plt.title('One-Step Reversal Random Walk')
plt.legend()
plt.grid(True)
plt.show()
```

4. **Momentum Reversal Random Walk:** Biases the next move against the historical balance of up-moves and down-moves: the larger the share of past up-moves, the smaller the probability of another up-move:
$$
\mathbb{P}[X_{t+1} = X_t + 1] =
\begin{cases}
\frac{1}{1+(\frac{U_t+D_t}{D_t}-1)^{\alpha}} &\text{ if } t>0 \\
0.5 &\text{ if } t=0
\end{cases}
$$
* $\alpha \in \mathbb{R}_{\geq 0}$ is the pull-strength parameter.
* $U_t$ and $D_t$ track cumulative upward and downward movements, respectively.
```python=
import numpy as np
import matplotlib.pyplot as plt

def momentum_reversal_random_walk(T, alpha):
    X = np.zeros(T)
    U = np.zeros(T)  # cumulative number of up-moves
    D = np.zeros(T)  # cumulative number of down-moves
    for t in range(1, T):
        U[t] = U[t-1] + max(X[t-1] - X[t-2], 0) if t > 1 else 0
        D[t] = D[t-1] + max(X[t-2] - X[t-1], 0) if t > 1 else 0
        if t == 1:  # first step: no history yet
            p = 0.5
        elif D[t] == 0:
            # All moves so far were up: the formula's limit is p -> 0 for alpha > 0
            # (certain reversal); for alpha = 0 the walk stays unbiased.
            p = 0.5 if alpha == 0 else 0.0
        else:
            ratio = (U[t] + D[t]) / D[t]
            p = 1 / (1 + (ratio - 1) ** alpha)
        X[t] = X[t-1] + 1 if np.random.rand() < p else X[t-1] - 1
    return X

# Parameters
T = 100  # time steps
alpha_values = [0.5, 1, 2]

# Plotting
plt.figure(figsize=(12, 8))
for alpha in alpha_values:
    np.random.seed(0)
    X = momentum_reversal_random_walk(T, alpha)
    plt.plot(X, label=f'α={alpha}')

plt.title('Momentum Reversal Random Walk')
plt.xlabel('Time')
plt.ylabel('Position')
plt.legend()
plt.grid(True)
plt.show()
```

## Markov Processes
A **Markov Process** is a mathematical model describing a sequence of random events where the probability of each event depends only on the state attained in the previous event.
### Key Components:
* **State Space $\mathcal{S}$:** A countable set of possible states the process can occupy.
* **Terminal States $\mathcal{T} \subset \mathcal{S}$:** A subset of states where the process ends.
* **Non-Terminal States $\mathcal{N} = \mathcal{S} - \mathcal{T}$:** States where the process continues to evolve.
* **Transition Probabilities:** The probabilities of moving from one state to another at a given time step.
* **Markov Property:** The probability of the next state depends solely on the current state, regardless of the history: $$ \mathbb{P}[S_{t+1}|S_t, S_{t−1}, \dots , S_0] = \mathbb{P}[S_{t+1}|S_t] \quad \forall t \geq 0 $$
### Types of Markov Processes
* **Discrete Markov Process:** State space is countable (e.g., the outcome of a coin toss).
```python=
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Generic, Iterator, TypeVar

S = TypeVar('S')  # generic type variable for the underlying state

@dataclass(frozen=True)
class State(Generic[S]):
    value: S

class Distribution(ABC, Generic[S]):
    @abstractmethod
    def sample(self) -> S:
        pass  # concrete subclasses implement the sampling logic

@dataclass(frozen=True)
class NonTerminal(State[S]):  # NonTerminal as a specific form of State
    pass

# Use generics in MarkovProcess
class MarkovProcess(ABC, Generic[S]):
    @abstractmethod
    def transition(self, state: NonTerminal[S]) -> Distribution[State[S]]:
        pass

    def simulate(self, start_state_distr: Distribution[NonTerminal[S]]) -> Iterator[State[S]]:
        st: State[S] = start_state_distr.sample()
        yield st
        while isinstance(st, NonTerminal):
            st = self.transition(st).sample()
            yield st
```
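To see this interface in use, here is a hypothetical two-state weather chain; `Categorical` below is a stand-in sampler (not the book's `distribution.Categorical`), and a plain `State('closed')` plays the role of a terminal state:
```python=
import random

# A concrete Distribution over explicit (outcome, probability) pairs -- just enough
# to drive `simulate`; purely illustrative, not the book's class.
class Categorical(Distribution[S]):
    def __init__(self, outcome_probs):
        self.outcomes, self.probs = zip(*outcome_probs)

    def sample(self) -> S:
        return random.choices(self.outcomes, weights=self.probs, k=1)[0]

# Hypothetical two-state weather chain with a small chance of terminating each day.
class WeatherProcess(MarkovProcess[str]):
    def transition(self, state: NonTerminal[str]) -> Distribution[State[str]]:
        if state.value == 'sunny':
            return Categorical([(NonTerminal('sunny'), 0.7),
                                (NonTerminal('rainy'), 0.2),
                                (State('closed'), 0.1)])
        return Categorical([(NonTerminal('rainy'), 0.6),
                            (NonTerminal('sunny'), 0.3),
                            (State('closed'), 0.1)])

random.seed(0)
start = Categorical([(NonTerminal('sunny'), 1.0)])  # deterministic start in 'sunny'
print([s.value for s in WeatherProcess().simulate(start)])
```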
* **Time-Homogeneous Markov Process:** Transition probabilities are independent of time. This allows us to define the entire dynamics using a transition probability function $\mathcal{P}: \mathcal{N} \times \mathcal{S} \to [0,1]$.
```python=
import numpy as np

def simulate_markov_process(P, mu, num_steps):
    """
    Simulate a time-homogeneous Markov process.

    Parameters:
    - P: A transition probability function, represented as a 2D numpy array.
         P[i, j] = Probability of transitioning from state i to state j.
    - mu: An initial state distribution, represented as a 1D numpy array.
          mu[i] = Probability of starting in state i.
    - num_steps: Number of steps to simulate.

    Returns:
    - A list representing the simulated state sequence.
    """
    # Determine the number of states from the transition matrix
    num_states = P.shape[0]

    # Sample the initial state based on the initial distribution mu
    current_state = np.random.choice(range(num_states), p=mu)

    # Initialize the state sequence with the initial state
    state_sequence = [current_state]

    # Simulate the process for the specified number of steps
    for _ in range(1, num_steps):
        current_state = np.random.choice(range(num_states), p=P[current_state])
        state_sequence.append(current_state)

    return state_sequence
```
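A quick sanity check of `simulate_markov_process` on a hypothetical two-state chain (illustrative numbers):
```python=
import numpy as np

np.random.seed(0)
P = np.array([[0.9, 0.1],
              [0.4, 0.6]])   # row i: transition probabilities out of state i
mu = np.array([1.0, 0.0])    # always start in state 0
print(simulate_markov_process(P, mu, num_steps=10))
```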
* **Finite Markov Process:** Both the state space and the set of possible transitions from each state are finite.
```python=
import numpy as np

class FiniteMarkovProcess:
    def __init__(self, P, non_terminal_states):
        """
        Initialize a Finite Markov Process.

        Parameters:
        - P: A transition probability matrix of shape (n, n), where n is the
             total number of states. P[i, j] is the probability of moving
             from state i to state j.
        - non_terminal_states: A list of indices representing non-terminal states.
        """
        self.P = P
        self.non_terminal_states = non_terminal_states
        self.n = P.shape[0]                # Total number of states
        self.m = len(non_terminal_states)  # Number of non-terminal states

    def transition(self, current_state):
        """
        Perform a transition from the current state.

        Parameters:
        - current_state: The index of the current state.

        Returns:
        - The index of the next state.
        """
        return np.random.choice(range(self.n), p=self.P[current_state])

    def simulate(self, start_state, num_steps):
        """
        Simulate the Markov process from a start state for a given number of steps.

        Parameters:
        - start_state: The index of the start state.
        - num_steps: Number of steps to simulate.

        Returns:
        - A list representing the simulated state sequence.
        """
        current_state = start_state
        state_sequence = [current_state]
        for _ in range(1, num_steps):
            if current_state not in self.non_terminal_states:
                break  # stop once a terminal state is reached
            current_state = self.transition(current_state)
            state_sequence.append(current_state)
        return state_sequence
```
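A usage sketch of the `FiniteMarkovProcess` class above, with a hypothetical three-state chain in which state 2 is absorbing/terminal:
```python=
import numpy as np

np.random.seed(0)
P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.6, 0.2],
              [0.0, 0.0, 1.0]])   # state 2 only transitions to itself
fmp = FiniteMarkovProcess(P, non_terminal_states=[0, 1])
print(fmp.simulate(start_state=0, num_steps=10))
```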
### Stationary Distribution
A stationary distribution $\pi$ describes the long-term probability of the process being in each state, satisfying:
$$
\pi(s') = \sum_{s \in \mathcal{N}} \pi(s) \cdot \mathcal{P}(s, s') \quad \forall s' \in \mathcal{N}
$$
For finite Markov processes, this is the left-eigenvector equation $\pi^T = \pi^T P$: $\pi$ is an eigenvector of $P^T$ (with $P$ restricted to the non-terminal states) for eigenvalue 1, normalized so its entries sum to 1.
```python=
import numpy as np
from distribution import FiniteDistribution, Categorical

# Methods of the book's FiniteMarkovProcess class (shown out of their class
# context here, hence the `self` parameter).

def get_transition_matrix(self) -> np.ndarray:
    sz = len(self.non_terminal_states)
    mat = np.zeros((sz, sz))
    for i, s1 in enumerate(self.non_terminal_states):
        for j, s2 in enumerate(self.non_terminal_states):
            mat[i, j] = self.transition(s1).probability(s2)
    return mat

def get_stationary_distribution(self) -> FiniteDistribution[S]:
    eig_vals, eig_vecs = np.linalg.eig(self.get_transition_matrix().T)
    index_of_first_unit_eig_val = np.where(
        np.abs(eig_vals - 1) < 1e-8)[0][0]
    eig_vec_of_unit_eig_val = np.real(
        eig_vecs[:, index_of_first_unit_eig_val])
    return Categorical({
        self.non_terminal_states[i].state: ev
        for i, ev in enumerate(eig_vec_of_unit_eig_val /
                               sum(eig_vec_of_unit_eig_val))
    })
```
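As a self-contained numerical check of $\pi^T = \pi^T P$, here is a minimal sketch with a hypothetical 3-state transition matrix:
```python=
import numpy as np

P = np.array([[0.9, 0.1, 0.0],
              [0.5, 0.0, 0.5],
              [0.0, 0.1, 0.9]])   # rows sum to 1

# pi is a left eigenvector of P with eigenvalue 1, normalized to sum to 1.
eig_vals, eig_vecs = np.linalg.eig(P.T)
idx = int(np.argmin(np.abs(eig_vals - 1)))
pi = np.real(eig_vecs[:, idx])
pi = pi / pi.sum()

print(pi)                       # stationary distribution
print(np.allclose(pi @ P, pi))  # verifies pi^T P = pi^T
```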
## Markov Reward Processes (MRPs)
**Markov Reward Processes** extend Markov Processes by incorporating rewards associated with each state transition. This allows us to model scenarios where actions and state changes lead to positive or negative consequences.
### Key Concepts
* **Rewards $R_t$:** A sequence of numerical values earned at each time step. Rewards follow the Markov Property: $$ \mathbb{P}[(R_{t+1}, S_{t+1}) \mid S_t, S_{t-1}, \dots, S_0] = \mathbb{P}[(R_{t+1}, S_{t+1}) \mid S_t] \quad \forall t\geq0.$$
* **Transition Probability Function $\mathcal{P}_R$:** In a time-homogeneous MRP, this function defines the probability of transitioning to a new state and receiving a reward: $$ \mathcal{P}_R(s,r,s') = \mathbb{P}[(R_{t+1}=r, S_{t+1}=s') \mid S_t=s] \quad \forall t. $$
* **Reward Transition Function $\mathcal{R}_T$:** Calculates the expected reward for transitioning from state $s$ to $s'$: $$\begin{split} \mathcal{R}_T(s,s') &= \mathbb{E}[R_{t+1} \mid S_{t+1} = s', S_t = s] \\ &= \sum_{r \in \mathcal{D}} \frac{\mathcal{P}_R(s,r,s')}{\mathcal{P}(s,s')} \cdot r \\ &= \sum_{r \in \mathcal{D}} \frac{\mathcal{P}_R(s,r,s')}{\sum_{r \in \mathcal{D}} \mathcal{P}_R(s,r,s')} \cdot r. \end{split}$$
* **Reward Function $\mathcal{R}$:** Represents the expected reward from being in state $s$: $$\begin{split}\mathcal{R}(s) &= \mathbb{E}[R_{t+1} \mid S_t = s] \\ &= \sum_{s' \in \mathcal{S}} \mathcal{P}(s,s') \cdot \mathcal{R}_T(s,s') \\ &= \sum_{s' \in \mathcal{S}} \sum_{r \in \mathcal{D}} \mathcal{P}_R(s,r,s') \cdot r. \end{split}$$
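A toy numerical check of these definitions, assuming a single non-terminal state `'A'` with hypothetical values of $\mathcal{P}_R$:
```python=
# P_R maps (s, r, s') to its probability; the numbers are purely illustrative.
P_R = {('A', 1.0, 'A'): 0.3, ('A', 0.0, 'A'): 0.3, ('A', 2.0, 'B'): 0.4}

def P(s, s_next):    # P(s, s') = sum_r P_R(s, r, s')
    return sum(p for (s0, r, s1), p in P_R.items() if s0 == s and s1 == s_next)

def R_T(s, s_next):  # expected reward given the transition s -> s'
    return sum(p * r for (s0, r, s1), p in P_R.items()
               if s0 == s and s1 == s_next) / P(s, s_next)

def R(s):            # expected reward from state s
    return sum(p * r for (s0, r, s1), p in P_R.items() if s0 == s)

print(P('A', 'A'), R_T('A', 'A'), R('A'))   # -> 0.6, 0.5, 1.1 (up to float rounding)
```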
#### Class of MRP
```python=
from __future__ import annotations
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Callable, Generic, Iterable, Tuple, TypeVar
from distribution import Distribution, SampledDistribution

S = TypeVar('S')
X = TypeVar('X')

class State(ABC, Generic[S]):
    state: S

    def on_non_terminal(
        self,
        f: Callable[[NonTerminal[S]], X],
        default: X
    ) -> X:
        if isinstance(self, NonTerminal):
            return f(self)
        else:
            return default

@dataclass(frozen=True)
class Terminal(State[S]):
    state: S

@dataclass(frozen=True)
class NonTerminal(State[S]):
    state: S

# Reward processes
@dataclass(frozen=True)
class TransitionStep(Generic[S]):
    state: NonTerminal[S]
    next_state: State[S]
    reward: float

    def add_return(self, gamma: float, return_: float) -> ReturnStep[S]:
        '''Given a gamma and the return from 'next_state', this annotates the
        transition with a return for 'state'.
        '''
        return ReturnStep(
            self.state,
            self.next_state,
            self.reward,
            return_=self.reward + gamma * return_
        )

# A TransitionStep annotated with the return from 'state' (reproduced here so
# the block is self-contained).
@dataclass(frozen=True)
class ReturnStep(TransitionStep[S]):
    return_: float

class MarkovRewardProcess(ABC, Generic[S]):
    @abstractmethod
    def transition_reward(self, state: NonTerminal[S])\
            -> Distribution[Tuple[State[S], float]]:
        pass

    def simulate_reward(
        self,
        start_state_distribution: Distribution[NonTerminal[S]]
    ) -> Iterable[TransitionStep[S]]:
        state: State[S] = start_state_distribution.sample()
        while isinstance(state, NonTerminal):
            next_distribution = self.transition_reward(state)
            next_state, reward = next_distribution.sample()
            yield TransitionStep(state, next_state, reward)
            state = next_state

    def transition(self, state: NonTerminal[S]) -> Distribution[State[S]]:
        # Viewing the MRP as a plain Markov Process: marginalize out the reward.
        distribution = self.transition_reward(state)

        def next_state(distribution=distribution):
            next_s, _ = distribution.sample()
            return next_s

        return SampledDistribution(next_state)
```
#### Class of finite MRP
```python=
from typing import Dict, Tuple

State = int      # Type alias for states
Reward = float   # Type alias for rewards
TransitionProbabilities = Dict[State, Dict[State, float]]  # Nested dict for transition probabilities
RewardFunction = Dict[Tuple[State, State], Reward]         # Reward for transitioning from state to state

class FiniteMarkovRewardProcess:
    def __init__(self, transition_probabilities: TransitionProbabilities, reward_function: RewardFunction):
        self.transition_probabilities = transition_probabilities
        self.reward_function = reward_function
        self.states = list(transition_probabilities.keys())

    def get_expected_reward(self, current_state: State) -> float:
        """
        Calculates the expected reward for being in a given state.
        """
        expected_reward = 0.0
        for next_state, prob in self.transition_probabilities[current_state].items():
            reward = self.reward_function.get((current_state, next_state), 0)
            expected_reward += prob * reward
        return expected_reward

    def get_transition_probabilities(self, current_state: State) -> Dict[State, float]:
        """
        Returns the transition probabilities from the given state to all possible next states.
        """
        return self.transition_probabilities[current_state]

# Example usage with a small two-state chain (illustrative numbers)
transition_probabilities = {0: {0: 0.7, 1: 0.3},
                            1: {0: 0.4, 1: 0.6}}
reward_function = {(0, 0): 1.0, (0, 1): 0.0,
                   (1, 0): 2.0, (1, 1): -1.0}
finite_mrp = FiniteMarkovRewardProcess(transition_probabilities, reward_function)
print(finite_mrp.get_expected_reward(0))   # 0.7 * 1.0 + 0.3 * 0.0 = 0.7
```
### Accumulated Discounted Rewards
We often want to evaluate the total expected return from a given state. The **accumulated discounted rewards**, denoted by $G_t$, account for both immediate rewards and future rewards, with future rewards being discounted to reflect uncertainty and time preference:
$$
G_t = \sum_{i=t+1}^\infty \gamma^{i-t-1} R_i = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots,
$$
* **Discount Factor $\gamma$:** A value between 0 and 1 determining how much we prioritize immediate rewards versus future rewards.
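A quick numeric illustration of the return for a short, truncated reward trace (hypothetical rewards):
```python=
gamma = 0.9
rewards = [1.0, 0.0, -1.0, 2.0]   # R_{t+1}, R_{t+2}, R_{t+3}, R_{t+4}
G_t = sum(gamma**k * r for k, r in enumerate(rewards))
print(G_t)   # 1.0 + 0.9*0.0 + 0.81*(-1.0) + 0.729*2.0 = 1.648
```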
### Value Function
The **value function** $V$ is the core concept in MRPs. It represents the long-term expected return starting from a non-terminal state $s$:
$$
V(s) = \mathbb{E}[G_t \mid S_t = s] \quad \forall s \in \mathcal{N}.
$$
#### Bellman Equation
The Bellman Equation expresses the value function recursively:
$$
V(s) = \mathcal{R}(s) + \gamma \sum_{s' \in \mathcal{N}} \mathcal{P}(s, s') \cdot V(s').
$$
This equation states that the value of a state is the immediate reward plus the discounted expected value of the next state.
#### Computational Considerations
* **Finite MRPs:** The value function can be solved exactly with linear algebra: $$V = R + \gamma P V \Rightarrow V = (I - \gamma P)^{-1} R$$ (see the sketch after this list).
* **Large State Spaces:** Exact matrix inversion becomes infeasible, so iterative methods such as dynamic programming are typically used.
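For a small finite MRP the closed form can be evaluated directly; a minimal sketch using the same $R$, $P$, and $\gamma$ as the iterative example below:
```python=
import numpy as np

R = np.array([1.0, 0.0, -1.0])    # expected reward per state
P = np.array([[0.9, 0.1, 0.0],
              [0.5, 0.0, 0.5],
              [0.0, 0.1, 0.9]])
gamma = 0.99

# Solve (I - gamma * P) V = R rather than forming the inverse explicitly.
V = np.linalg.solve(np.eye(len(R)) - gamma * P, R)
print(V)   # should match the iterative result below up to the convergence threshold
```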
#### Get value function
```python=
import numpy as np

def get_value_function(R, P, gamma, threshold=0.001):
    """
    Calculate the value function for a finite Markov Reward Process.

    Parameters:
    - R: Numpy array representing the expected rewards for each state.
    - P: Numpy matrix representing the transition probabilities.
    - gamma: Discount factor.
    - threshold: Convergence threshold.

    Returns:
    - V: Numpy array representing the value function for each state.
    """
    num_states = R.shape[0]
    V = np.zeros(num_states)  # Initialize the value function with zeros

    while True:
        delta = 0  # Initialize delta to track changes in the value function
        for s in range(num_states):
            v = V[s]  # Store the current value function for state s
            # Update the value function based on the Bellman equation
            V[s] = R[s] + gamma * np.dot(P[s, :], V)
            # Update delta to check for convergence
            delta = max(delta, abs(v - V[s]))
        # Check if the change in the value function is below the threshold for all states
        if delta < threshold:
            break

    return V

# Example usage
R = np.array([1, 0, -1])        # Example rewards for each state
P = np.array([[0.9, 0.1, 0.0],  # Example transition probabilities matrix
              [0.5, 0.0, 0.5],
              [0.0, 0.1, 0.9]])
gamma = 0.99                    # Discount factor

V = get_value_function(R, P, gamma)
print("Value Function:", V)
```
```
Output:
Value Function: [ 9.26349365 0.08829004 -9.08611006]
```
## Example: A Simple Inventory Management Process
### Scenario
A small store wants to optimize its inventory management to minimize costs associated with holding excess stock and missing out on sales due to stockouts.
### Key Parameters
* **Capacity $C$:** Maximum inventory the store can hold.
* **In-hand Inventory $\alpha$:** Stock on hand at the end of the day.
* **On-order Inventory $\beta$:** Stock ordered 36 hours ago, arriving at the start of the next day.
* **Holding Cost $h$:** Cost per unit for storing unsold items overnight.
* **Stockout Cost $p$:** Cost per unit of unmet demand.
### Daily Operations
1. **Receive Previous Order (6 AM):** The inventory ordered 36 hours earlier arrives before the store opens.
2. **Store Opens (8 AM):** The current state is $S_t = (\alpha, \beta)$.
3. **Order Calculation:** Order quantity $= \max(C - (\alpha + \beta), 0)$, i.e., order just enough to bring the inventory position back up to capacity $C$.
4. **Demand:** Customer demand $i$ during the day is random, modeled by a Poisson distribution with parameter $\lambda$.
5. **Sales:** Items sold $= \min(\alpha + \beta, i)$.
6. **Record Costs (6 PM):**
    * Holding cost: $h \cdot \alpha$
    * Stockout cost: $p \cdot \max(i - (\alpha + \beta), 0)$, where $i$ is the day's demand.
7. **Calculate Reward:** $R_{t+1}$ is the negative sum of the holding and stockout costs.
8. **New State:** Inventory levels at the start of the next day: $$S_{t+1} = (\max(\alpha + \beta - i, 0),\ \max(C - (\alpha + \beta), 0))$$
### Modeling with a MRP
* **State Space:** Possible combinations of in-hand and on-order inventory $(\alpha,\beta)$.
* **Transition Probabilities:** Determined by the Poisson demand distribution.
* **Rewards:** Negative costs incurred at each step.
* **Discount Factor $\gamma$:** Reflects preference for immediate vs. future rewards (e.g., $\gamma = 0.9$ indicates we value today's costs more than tomorrow's).
**Goal:** Evaluate the long-run expected discounted return (i.e., the negative of expected costs) of this order-up-to-capacity policy by computing the value function of the MRP.
#### Python Implementation
```python=
from dataclasses import dataclass
from typing import Tuple, Dict, Mapping
from markov_process import MarkovRewardProcess
from markov_process import FiniteMarkovRewardProcess
from markov_process import State, NonTerminal
from scipy.stats import poisson
from distribution import SampledDistribution, Categorical, \
    FiniteDistribution
import numpy as np

@dataclass(frozen=True)
class InventoryState:
    on_hand: int
    on_order: int

    def inventory_position(self) -> int:
        return self.on_hand + self.on_order

class SimpleInventoryMRP(MarkovRewardProcess[InventoryState]):
    def __init__(
        self,
        capacity: int,
        poisson_lambda: float,
        holding_cost: float,
        stockout_cost: float
    ):
        self.capacity = capacity
        self.poisson_lambda: float = poisson_lambda
        self.holding_cost: float = holding_cost
        self.stockout_cost: float = stockout_cost

    def transition_reward(
        self,
        state: NonTerminal[InventoryState]
    ) -> SampledDistribution[Tuple[State[InventoryState], float]]:

        def sample_next_state_reward(state=state) ->\
                Tuple[State[InventoryState], float]:
            demand_sample: int = np.random.poisson(self.poisson_lambda)
            ip: int = state.state.inventory_position()
            next_state: InventoryState = InventoryState(
                max(ip - demand_sample, 0),
                max(self.capacity - ip, 0)
            )
            reward: float = - self.holding_cost * state.state.on_hand\
                - self.stockout_cost * max(demand_sample - ip, 0)
            return NonTerminal(next_state), reward

        return SampledDistribution(sample_next_state_reward)

class SimpleInventoryMRPFinite(FiniteMarkovRewardProcess[InventoryState]):
    def __init__(
        self,
        capacity: int,
        poisson_lambda: float,
        holding_cost: float,
        stockout_cost: float
    ):
        self.capacity: int = capacity
        self.poisson_lambda: float = poisson_lambda
        self.holding_cost: float = holding_cost
        self.stockout_cost: float = stockout_cost

        self.poisson_distr = poisson(poisson_lambda)
        super().__init__(self.get_transition_reward_map())

    def get_transition_reward_map(self) -> \
            Mapping[
                InventoryState,
                FiniteDistribution[Tuple[InventoryState, float]]
            ]:
        d: Dict[InventoryState, Categorical[Tuple[InventoryState, float]]] = {}
        for alpha in range(self.capacity + 1):
            for beta in range(self.capacity + 1 - alpha):
                state = InventoryState(alpha, beta)
                ip = state.inventory_position()
                beta1 = self.capacity - ip
                base_reward = - self.holding_cost * state.on_hand
                sr_probs_map: Dict[Tuple[InventoryState, float], float] =\
                    {(InventoryState(ip - i, beta1), base_reward):
                     self.poisson_distr.pmf(i) for i in range(ip)}
                probability = 1 - self.poisson_distr.cdf(ip - 1)
                reward = base_reward - self.stockout_cost * \
                    (self.poisson_lambda - ip *
                     (1 - self.poisson_distr.pmf(ip) / probability))
                sr_probs_map[(InventoryState(0, beta1), reward)] = probability
                d[state] = Categorical(sr_probs_map)
        return d

if __name__ == '__main__':
    user_capacity = 2
    user_poisson_lambda = 1.0
    user_holding_cost = 1.0
    user_stockout_cost = 10.0

    user_gamma = 0.9

    si_mrp = SimpleInventoryMRPFinite(
        capacity=user_capacity,
        poisson_lambda=user_poisson_lambda,
        holding_cost=user_holding_cost,
        stockout_cost=user_stockout_cost
    )

    from markov_process import FiniteMarkovProcess
    print("Transition Map")
    print("--------------")
    print(FiniteMarkovProcess(
        {s.state: Categorical({s1.state: p for s1, p in v.table().items()})
         for s, v in si_mrp.transition_map.items()}
    ))

    print("Transition Reward Map")
    print("---------------------")
    print(si_mrp)

    print("Stationary Distribution")
    print("-----------------------")
    si_mrp.display_stationary_distribution()
    print()

    print("Reward Function")
    print("---------------")
    si_mrp.display_reward_function()
    print()

    print("Value Function")
    print("--------------")
    si_mrp.display_value_function(gamma=user_gamma)
    print()
```
```python
Transition Map
--------------
From State InventoryState(on_hand=0, on_order=0):
To State InventoryState(on_hand=0, on_order=2) with Probability 1.000
From State InventoryState(on_hand=0, on_order=1):
To State InventoryState(on_hand=1, on_order=1) with Probability 0.368
To State InventoryState(on_hand=0, on_order=1) with Probability 0.632
From State InventoryState(on_hand=0, on_order=2):
To State InventoryState(on_hand=2, on_order=0) with Probability 0.368
To State InventoryState(on_hand=1, on_order=0) with Probability 0.368
To State InventoryState(on_hand=0, on_order=0) with Probability 0.264
From State InventoryState(on_hand=1, on_order=0):
To State InventoryState(on_hand=1, on_order=1) with Probability 0.368
To State InventoryState(on_hand=0, on_order=1) with Probability 0.632
From State InventoryState(on_hand=1, on_order=1):
To State InventoryState(on_hand=2, on_order=0) with Probability 0.368
To State InventoryState(on_hand=1, on_order=0) with Probability 0.368
To State InventoryState(on_hand=0, on_order=0) with Probability 0.264
From State InventoryState(on_hand=2, on_order=0):
To State InventoryState(on_hand=2, on_order=0) with Probability 0.368
To State InventoryState(on_hand=1, on_order=0) with Probability 0.368
To State InventoryState(on_hand=0, on_order=0) with Probability 0.264
Transition Reward Map
---------------------
From State InventoryState(on_hand=0, on_order=0):
To [State InventoryState(on_hand=0, on_order=2) and Reward -10.000] with Probability 1.000
From State InventoryState(on_hand=0, on_order=1):
To [State InventoryState(on_hand=1, on_order=1) and Reward -0.000] with Probability 0.368
To [State InventoryState(on_hand=0, on_order=1) and Reward -5.820] with Probability 0.632
From State InventoryState(on_hand=0, on_order=2):
To [State InventoryState(on_hand=2, on_order=0) and Reward -0.000] with Probability 0.368
To [State InventoryState(on_hand=1, on_order=0) and Reward -0.000] with Probability 0.368
To [State InventoryState(on_hand=0, on_order=0) and Reward -3.922] with Probability 0.264
From State InventoryState(on_hand=1, on_order=0):
To [State InventoryState(on_hand=1, on_order=1) and Reward -1.000] with Probability 0.368
To [State InventoryState(on_hand=0, on_order=1) and Reward -6.820] with Probability 0.632
From State InventoryState(on_hand=1, on_order=1):
To [State InventoryState(on_hand=2, on_order=0) and Reward -1.000] with Probability 0.368
To [State InventoryState(on_hand=1, on_order=0) and Reward -1.000] with Probability 0.368
To [State InventoryState(on_hand=0, on_order=0) and Reward -4.922] with Probability 0.264
From State InventoryState(on_hand=2, on_order=0):
To [State InventoryState(on_hand=2, on_order=0) and Reward -2.000] with Probability 0.368
To [State InventoryState(on_hand=1, on_order=0) and Reward -2.000] with Probability 0.368
To [State InventoryState(on_hand=0, on_order=0) and Reward -5.922] with Probability 0.264
Stationary Distribution
-----------------------
{InventoryState(on_hand=2, on_order=0): 0.162,
InventoryState(on_hand=1, on_order=1): 0.162,
InventoryState(on_hand=0, on_order=2): 0.117,
InventoryState(on_hand=0, on_order=1): 0.279,
InventoryState(on_hand=1, on_order=0): 0.162,
InventoryState(on_hand=0, on_order=0): 0.117}
Reward Function
---------------
{NonTerminal(state=InventoryState(on_hand=1, on_order=1)): -2.036,
NonTerminal(state=InventoryState(on_hand=2, on_order=0)): -3.036,
NonTerminal(state=InventoryState(on_hand=1, on_order=0)): -4.679,
NonTerminal(state=InventoryState(on_hand=0, on_order=2)): -1.036,
NonTerminal(state=InventoryState(on_hand=0, on_order=1)): -3.679,
NonTerminal(state=InventoryState(on_hand=0, on_order=0)): -10.0}
Value Function
--------------
{NonTerminal(state=InventoryState(on_hand=1, on_order=1)): -38.329,
NonTerminal(state=InventoryState(on_hand=2, on_order=0)): -39.329,
NonTerminal(state=InventoryState(on_hand=1, on_order=0)): -38.971,
NonTerminal(state=InventoryState(on_hand=0, on_order=2)): -37.329,
NonTerminal(state=InventoryState(on_hand=0, on_order=1)): -37.971,
NonTerminal(state=InventoryState(on_hand=0, on_order=0)): -43.596}
```
## References
- Chapter 3 of the [RLForFinanceBook](https://stanford.edu/~ashlearn/RLForFinanceBook/book.pdf)
- [Markov Process and Markov Reward Process](https://github.com/coverdrive/technical-documents/blob/master/finance/cme241/Tour-MP.pdf) slides for CME 241: Foundations of Reinforcement Learning with Applications in Finance