# Markov Processes

## Concepts of Process and State

* **Process:** A sequence of random events or outcomes over time. Examples include stock price fluctuations, weather patterns, or a robot's movement through a maze.
* **State:** The internal condition of the process at a specific time, denoted as $S_t$. The state captures all the information necessary to predict the probabilities of future states, without requiring the entire history.

**Key Objective:** Our goal is to understand and calculate the probability $\mathbb{P}[S_{t+1} \mid S_t, S_{t-1}, \dots, S_0]$, which describes the likelihood of the process transitioning to a specific future state given its past and present states.

### Random Walks: Illustrative Examples

**Random walks** are fundamental examples of stochastic processes. Let's explore different variations:

1. **Simple Random Walk:** In each time step, the process moves one step to the right or left with equal probability (a minimal simulation of this walk appears right after this list):
   $$
   \mathbb{P}[X_{t+1} = X_t + 1] = \mathbb{P}[X_{t+1} = X_t - 1] = \frac{1}{2}.
   $$

2. **Mean-Reverting Random Walk:** This walk exhibits a tendency to move towards a central level $L$. The probability of an upward move depends on the distance from $L$:
   $$
   \mathbb{P}[X_{t+1} = X_t + 1] = \frac{1}{1 + e^{-\alpha(L-X_t)}}.
   $$
   * $\alpha$ controls the strength of the mean-reversion.

```python=
import numpy as np
import matplotlib.pyplot as plt

# Parameters
alphas = [0.5, 1, 2]               # different alpha values
colors = ['red', 'blue', 'green']  # colors for the lines
L = 0                              # level towards which the process reverts
T = 100                            # number of time steps
np.random.seed(0)                  # for reproducibility

# Plotting
plt.figure(figsize=(10, 6))

# Simulate and plot the process for different alphas
for alpha, color in zip(alphas, colors):
    X_t = np.zeros(T)  # initialize the process
    for t in range(1, T):
        prob_up = 1 / (1 + np.exp(-alpha * (L - X_t[t-1])))
        if np.random.rand() < prob_up:
            X_t[t] = X_t[t-1] + 1
        else:
            X_t[t] = X_t[t-1] - 1
    plt.plot(X_t, label=f'Random Walk $X_t$ with $\\alpha$={alpha}', color=color)

plt.axhline(y=L, color='r', linestyle='--', label='Level $L$')
plt.xlabel('Time $t$')
plt.ylabel('$X_t$')
plt.title('Mean-Reverting Random Walk with Different $\\alpha$')
plt.legend()
plt.show()
```

![Mean-Reverting_random_walk](https://hackmd.io/_uploads/HkOobDNpa.png)

3. **One-Step Reversal Random Walk:** Introduces a bias in the opposite direction of the previous step:
   $$
   \mathbb{P}[X_{t+1} = X_t + 1] = \begin{cases} 0.5\,(1 - \alpha(X_t - X_{t-1})) &\text{ if } t>0 \\ 0.5 &\text{ if } t=0 \end{cases}
   $$
   * $\alpha \in [0,1]$ is the pull-strength parameter.

```python=
import numpy as np
import matplotlib.pyplot as plt

def one_step_reversal_random_walk(steps, alpha):
    X = np.zeros(steps)
    for t in range(1, steps):
        if t == 1:   # no previous step yet
            p = 0.5  # default
        else:
            delta = X[t-1] - X[t-2]
            p = 0.5 * (1 - alpha * delta)  # adjust p according to the last step
        if np.random.rand() < p:
            X[t] = X[t-1] + 1
        else:
            X[t] = X[t-1] - 1
    return X

steps = 100

# Plotting
plt.figure(figsize=(10, 6))

alphas = [0.0, 0.5, 1.0]
colors = ['red', 'blue', 'green']

for alpha, color in zip(alphas, colors):
    X = one_step_reversal_random_walk(steps, alpha)
    plt.plot(X, label=f'alpha = {alpha}', color=color)

plt.xlabel('Time step')
plt.ylabel('Position')
plt.title('One-Step Reversal Random Walk')
plt.legend()
plt.grid(True)
plt.show()
```

![one-step_reversal_graph](https://hackmd.io/_uploads/rkYfg6Npp.png)

4. **Momentum Reversal Random Walk:** Here the up-move probability depends on the entire history of moves: the larger the proportion of past up-moves, the stronger the pull back down (and vice versa), with $\alpha$ controlling the strength of that pull:
   $$
   \mathbb{P}[X_{t+1} = X_t + 1] = \begin{cases} \frac{1}{1+(\frac{U_t+D_t}{D_t}-1)^{\alpha}} &\text{ if } t>0 \\ 0.5 &\text{ if } t=0 \end{cases}
   $$
   * $\alpha \in \mathbb{R}_{\geq 0}$ is the pull-strength parameter.
   * $U_t$ and $D_t$ track cumulative upward and downward movements, respectively.

```python=
import numpy as np
import matplotlib.pyplot as plt

def momentum_reversal_random_walk(T, alpha):
    X = np.zeros(T)
    U = np.zeros(T)  # number of up-moves so far
    D = np.zeros(T)  # number of down-moves so far
    for t in range(1, T):
        if t > 1:
            U[t] = U[t-1] + max(X[t-1] - X[t-2], 0)
            D[t] = D[t-1] + max(X[t-2] - X[t-1], 0)
        if t == 1:  # no history yet
            p = 0.5
        elif D[t] == 0:
            # no down-moves yet: (U+D)/D - 1 diverges, so the up-probability tends to 0
            p = 0.0 if alpha > 0 else 0.5
        else:
            ratio = (U[t] + D[t]) / D[t]
            p = 1 / (1 + (ratio - 1)**alpha)
        X[t] = X[t-1] + 1 if np.random.rand() < p else X[t-1] - 1
    return X

# Parameters
T = 100  # time steps
alpha_values = [0.5, 1, 2]

# Plotting
plt.figure(figsize=(12, 8))
for alpha in alpha_values:
    np.random.seed(0)
    X = momentum_reversal_random_walk(T, alpha)
    plt.plot(X, label=f'α={alpha}')

plt.title('Momentum Reversal Random Walk')
plt.xlabel('Time')
plt.ylabel('Position')
plt.legend()
plt.grid(True)
plt.show()
```

![Momentum_graph](https://hackmd.io/_uploads/HJPx_T4ap.png)
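
As referenced in item 1, here is a minimal simulation of the simple random walk, written in the same style as the plots above (a sketch; the variable names are illustrative and not part of any library):

```python=
import numpy as np
import matplotlib.pyplot as plt

T = 100            # number of time steps
np.random.seed(0)  # for reproducibility

# Each step is +1 or -1 with probability 1/2 each; X_0 = 0
steps = np.random.choice([1, -1], size=T - 1)
X = np.concatenate([[0], np.cumsum(steps)])

plt.plot(X, label='Simple Random Walk $X_t$')
plt.xlabel('Time $t$')
plt.ylabel('$X_t$')
plt.legend()
plt.show()
```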
## Markov Processes

A **Markov Process** is a mathematical model describing a sequence of random events where the probability of each event depends only on the state attained in the previous event.

### Key Components:

* **State Space $\mathcal{S}$:** A countable set of possible states the process can occupy.
* **Terminal States $\mathcal{T} \subset \mathcal{S}$:** A subset of states where the process ends.
* **Non-Terminal States $\mathcal{N} = \mathcal{S} - \mathcal{T}$:** States where the process continues to evolve.
* **Transition Probabilities:** The probabilities of moving from one state to another at a given time step.
* **Markov Property:** The probability of the next state depends solely on the current state, regardless of the history:
  $$
  \mathbb{P}[S_{t+1} \mid S_t, S_{t-1}, \dots , S_0] = \mathbb{P}[S_{t+1} \mid S_t] \quad \forall t \geq 0
  $$

### Types of Markov Processes

* **Discrete Markov Process:** The state space is countable and the process evolves in discrete time steps (e.g., the outcome of a coin toss). A short usage sketch follows the class definitions below.

```python=
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Generic, TypeVar, Iterator

S = TypeVar('S')  # Define a generic type variable

@dataclass(frozen=True)
class State(Generic[S]):
    state: S

# Ensure the Distribution class also supports generics
class Distribution(ABC, Generic[S]):
    @abstractmethod
    def sample(self) -> S:
        pass  # Subclasses implement the specific sampling logic

# NonTerminal as a specific form of State
@dataclass(frozen=True)
class NonTerminal(State[S]):
    pass

# Use generics in MarkovProcess
class MarkovProcess(ABC, Generic[S]):

    @abstractmethod
    def transition(self, state: NonTerminal[S]) -> Distribution[State[S]]:
        pass

    def simulate(
        self,
        start_state_distr: Distribution[NonTerminal[S]]
    ) -> Iterator[State[S]]:
        st: State[S] = start_state_distr.sample()
        yield st
        while isinstance(st, NonTerminal):
            st = self.transition(st).sample()
            yield st
```
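
To make the abstract interface concrete, here is a minimal sketch that reuses `S`, `State`, `NonTerminal`, `Distribution`, and `MarkovProcess` from the block above. The `SimpleCategorical` class and the two-state weather chain are made up for illustration (the book's library provides a richer `Categorical` distribution):

```python=
import random
from itertools import islice
from typing import Mapping

# A tiny concrete Distribution: samples from a finite probability table.
class SimpleCategorical(Distribution[S]):
    def __init__(self, table: Mapping[S, float]):
        self.table = dict(table)

    def sample(self) -> S:
        outcomes = list(self.table.keys())
        probs = list(self.table.values())
        return random.choices(outcomes, weights=probs)[0]

# A two-state weather chain: both states are non-terminal.
class WeatherProcess(MarkovProcess[str]):
    def transition(self, state: NonTerminal[str]) -> Distribution[State[str]]:
        if state.state == 'Sunny':
            return SimpleCategorical({NonTerminal('Sunny'): 0.8,
                                      NonTerminal('Rainy'): 0.2})
        return SimpleCategorical({NonTerminal('Rainy'): 0.6,
                                  NonTerminal('Sunny'): 0.4})

# Simulate 10 days starting from a sunny day (islice because the
# generator never terminates when there are no terminal states).
start = SimpleCategorical({NonTerminal('Sunny'): 1.0})
trace = list(islice(WeatherProcess().simulate(start), 10))
print([s.state for s in trace])
```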
* **Time-Homogeneous Markov Process:** Transition probabilities are independent of time. This allows us to define the entire dynamics using a transition probability function $\mathcal{P}: \mathcal{N} \times \mathcal{S} \to [0,1]$.

```python=
import numpy as np

def simulate_markov_process(P, mu, num_steps):
    """
    Simulate a time-homogeneous Markov process.

    Parameters:
    - P: A transition probability function, represented as a 2D numpy array.
         P[i, j] = Probability of transitioning from state i to state j.
    - mu: An initial state distribution, represented as a 1D numpy array.
          mu[i] = Probability of starting in state i.
    - num_steps: Number of steps to simulate.

    Returns:
    - A list representing the simulated state sequence.
    """
    # Determine the number of states from the transition matrix
    num_states = P.shape[0]

    # Sample the initial state based on the initial distribution mu
    current_state = np.random.choice(range(num_states), p=mu)

    # Initialize the state sequence with the initial state
    state_sequence = [current_state]

    # Simulate the process for the specified number of steps
    for _ in range(1, num_steps):
        current_state = np.random.choice(range(num_states), p=P[current_state])
        state_sequence.append(current_state)

    return state_sequence
```

* **Finite Markov Process:** Both the state space and the set of possible transitions from each state are finite.

```python=
import numpy as np

class FiniteMarkovProcess:
    def __init__(self, P, non_terminal_states):
        """
        Initialize a Finite Markov Process.

        Parameters:
        - P: A transition probability matrix of shape (n, n), where n is the
             total number of states. P[i, j] = Probability of transitioning
             from state i to state j.
        - non_terminal_states: A list of indices representing non-terminal states.
        """
        self.P = P
        self.non_terminal_states = non_terminal_states
        self.n = P.shape[0]                # Total number of states
        self.m = len(non_terminal_states)  # Number of non-terminal states

    def transition(self, current_state):
        """
        Perform a transition from the current state.

        Parameters:
        - current_state: The index of the current state.

        Returns:
        - The index of the next state.
        """
        return np.random.choice(range(self.n), p=self.P[current_state])

    def simulate(self, start_state, num_steps):
        """
        Simulate the Markov process from a start state for a given number of steps.

        Parameters:
        - start_state: The index of the start state.
        - num_steps: Number of steps to simulate.

        Returns:
        - A list representing the simulated state sequence.
        """
        current_state = start_state
        state_sequence = [current_state]
        for _ in range(1, num_steps):
            current_state = self.transition(current_state)
            state_sequence.append(current_state)
        return state_sequence
```

### Stationary Distribution

A stationary distribution $\pi$ describes the long-term probability of the process being in each state, satisfying:
$$
\pi(s') = \sum_{s \in \mathcal{N}} \pi(s) \cdot \mathcal{P}(s, s') \quad \forall s' \in \mathcal{N}
$$
For finite Markov processes, this translates to the eigenvector equation $\pi^T = \pi^T P$, i.e., $\pi$ is an eigenvector of $P^T$ with eigenvalue 1 (which is exactly what the code below extracts).

```python=
import numpy as np
from distribution import FiniteDistribution, Categorical

# These two functions are methods of the book's FiniteMarkovProcess class,
# shown here standalone.
def get_transition_matrix(self) -> np.ndarray:
    sz = len(self.non_terminal_states)
    mat = np.zeros((sz, sz))
    for i, s1 in enumerate(self.non_terminal_states):
        for j, s2 in enumerate(self.non_terminal_states):
            mat[i, j] = self.transition(s1).probability(s2)
    return mat

def get_stationary_distribution(self) -> FiniteDistribution[S]:
    eig_vals, eig_vecs = np.linalg.eig(self.get_transition_matrix().T)
    index_of_first_unit_eig_val = np.where(
        np.abs(eig_vals - 1) < 1e-8)[0][0]
    eig_vec_of_unit_eig_val = np.real(
        eig_vecs[:, index_of_first_unit_eig_val])
    return Categorical({
        self.non_terminal_states[i].state: ev
        for i, ev in enumerate(eig_vec_of_unit_eig_val /
                               sum(eig_vec_of_unit_eig_val))
    })
```
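
As a quick, self-contained illustration of the same eigenvector computation (independent of the book's `FiniteDistribution`/`Categorical` classes, and using a made-up 3-state transition matrix):

```python=
import numpy as np

# A made-up 3-state row-stochastic transition matrix (illustrative values)
P = np.array([[0.9, 0.1, 0.0],
              [0.5, 0.0, 0.5],
              [0.0, 0.1, 0.9]])

# Left eigenvector of P (i.e., eigenvector of P^T) with eigenvalue 1
eig_vals, eig_vecs = np.linalg.eig(P.T)
idx = np.argmin(np.abs(eig_vals - 1))
pi = np.real(eig_vecs[:, idx])
pi = pi / pi.sum()  # normalize to a probability distribution

print("Stationary distribution:", pi)
print("Check pi = pi P:", np.allclose(pi, pi @ P))
```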
## Markov Reward Processes (MRPs)

**Markov Reward Processes** extend Markov Processes by incorporating rewards associated with each state transition. This allows us to model scenarios where state changes carry positive or negative consequences.

### Key Concepts

* **Rewards $R_t$:** A sequence of numerical values earned at each time step. Rewards follow the Markov Property:
  $$
  \mathbb{P}[(R_{t+1}, S_{t+1}) \mid S_t, S_{t-1}, \dots, S_0] = \mathbb{P}[(R_{t+1}, S_{t+1}) \mid S_t] \quad \forall t\geq0.
  $$
* **Transition Probability Function $\mathcal{P}_R$:** In a time-homogeneous MRP, this function defines the probability of transitioning to a new state and receiving a reward:
  $$
  \mathcal{P}_R(s,r,s') = \mathbb{P}[(R_{t+1}=r, S_{t+1}=s') \mid S_t=s] \quad \forall t.
  $$
* **Reward Transition Function $\mathcal{R}_T$:** Calculates the expected reward for transitioning from state $s$ to $s'$:
  $$
  \begin{split}
  \mathcal{R}_T(s,s') &= \mathbb{E}[R_{t+1} \mid S_{t+1} = s', S_t = s] \\
  &= \sum_{r \in \mathcal{D}} \frac{\mathcal{P}_R(s,r,s')}{\mathcal{P}(s,s')} \cdot r \\
  &= \sum_{r \in \mathcal{D}} \frac{\mathcal{P}_R(s,r,s')}{\sum_{r \in \mathcal{D}} \mathcal{P}_R(s,r,s')} \cdot r.
  \end{split}
  $$
* **Reward Function $\mathcal{R}$:** Represents the expected reward from being in state $s$ (a small numeric sketch follows this list):
  $$
  \begin{split}
  \mathcal{R}(s) &= \mathbb{E}[R_{t+1} \mid S_t = s] \\
  &= \sum_{s' \in \mathcal{S}} \mathcal{P}(s,s') \cdot \mathcal{R}_T(s,s') \\
  &= \sum_{s' \in \mathcal{S}} \sum_{r \in \mathcal{D}} \mathcal{P}_R(s,r,s') \cdot r.
  \end{split}
  $$
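
To see how these quantities relate, here is a small sketch with a made-up $\mathcal{P}_R$ table for a single state `'A'` (all states, rewards, and probabilities are illustrative); it computes $\mathcal{P}$, $\mathcal{R}_T$, and $\mathcal{R}$ directly from the definitions above:

```python=
from collections import defaultdict

# Made-up P_R: maps (s, r, s') -> probability
P_R = {
    ('A',  1.0, 'A'): 0.3,
    ('A', -2.0, 'A'): 0.2,
    ('A',  0.5, 'B'): 0.5,
}

# P(s, s') = sum_r P_R(s, r, s')
P = defaultdict(float)
for (s, r, s1), p in P_R.items():
    P[(s, s1)] += p

# R_T(s, s') = sum_r ( P_R(s, r, s') / P(s, s') ) * r
R_T = defaultdict(float)
for (s, r, s1), p in P_R.items():
    R_T[(s, s1)] += p / P[(s, s1)] * r

# R(s) = sum_{s'} P(s, s') * R_T(s, s')
R = defaultdict(float)
for (s, s1), p in P.items():
    R[s] += p * R_T[(s, s1)]

print(dict(P))    # {('A', 'A'): 0.5, ('A', 'B'): 0.5}
print(dict(R_T))  # R_T('A','A') ≈ -0.2, R_T('A','B') = 0.5
print(dict(R))    # R('A') ≈ 0.15
```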
#### Class of MRP

```python=
from __future__ import annotations
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Callable, Generic, Iterable, Tuple, TypeVar

from distribution import Distribution, SampledDistribution

S = TypeVar('S')
X = TypeVar('X')

class State(ABC, Generic[S]):
    state: S

    def on_non_terminal(
        self,
        f: Callable[[NonTerminal[S]], X],
        default: X
    ) -> X:
        if isinstance(self, NonTerminal):
            return f(self)
        else:
            return default

@dataclass(frozen=True)
class Terminal(State[S]):
    state: S

@dataclass(frozen=True)
class NonTerminal(State[S]):
    state: S

# Reward processes
@dataclass(frozen=True)
class TransitionStep(Generic[S]):
    state: NonTerminal[S]
    next_state: State[S]
    reward: float

    def add_return(self, gamma: float, return_: float) -> ReturnStep[S]:
        '''Given a gamma and the return from 'next_state', this annotates
        the transition with a return for 'state'.
        '''
        return ReturnStep(
            self.state,
            self.next_state,
            self.reward,
            return_=self.reward + gamma * return_
        )

# A TransitionStep annotated with the return from its starting state
@dataclass(frozen=True)
class ReturnStep(TransitionStep[S]):
    return_: float

class MarkovRewardProcess(ABC, Generic[S]):

    @abstractmethod
    def transition_reward(self, state: NonTerminal[S])\
            -> Distribution[Tuple[State[S], float]]:
        pass

    def simulate_reward(
        self,
        start_state_distribution: Distribution[NonTerminal[S]]
    ) -> Iterable[TransitionStep[S]]:
        state: State[S] = start_state_distribution.sample()
        reward: float = 0.

        while isinstance(state, NonTerminal):
            next_distribution = self.transition_reward(state)
            next_state, reward = next_distribution.sample()
            yield TransitionStep(state, next_state, reward)
            state = next_state

    # Every MRP is also a Markov Process: the state-transition distribution
    # is obtained from transition_reward by dropping the reward component.
    def transition(self, state: NonTerminal[S]) -> Distribution[State[S]]:
        distribution = self.transition_reward(state)

        def next_state(distribution=distribution):
            next_s, _ = distribution.sample()
            return next_s

        return SampledDistribution(next_state)
```

#### Class of finite MRP

```python=
from typing import Dict, Tuple

State = int     # Define a type for states
Reward = float  # Define a type for rewards
TransitionProbabilities = Dict[State, Dict[State, float]]  # Nested dict of transition probabilities
RewardFunction = Dict[Tuple[State, State], Reward]         # Reward for transitioning from state to state

class FiniteMarkovRewardProcess:
    def __init__(self,
                 transition_probabilities: TransitionProbabilities,
                 reward_function: RewardFunction):
        self.transition_probabilities = transition_probabilities
        self.reward_function = reward_function
        self.states = list(transition_probabilities.keys())

    def get_expected_reward(self, current_state: State) -> float:
        """
        Calculates the expected reward for being in a given state.
        """
        expected_reward = 0.0
        for next_state, prob in self.transition_probabilities[current_state].items():
            reward = self.reward_function.get((current_state, next_state), 0)
            expected_reward += prob * reward
        return expected_reward

    def get_transition_probabilities(self, current_state: State) -> Dict[State, float]:
        """
        Returns the transition probabilities from the given state to all possible next states.
        """
        return self.transition_probabilities[current_state]

# Example usage with small, made-up transition and reward tables (illustrative values)
transition_probabilities = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.4, 1: 0.6}}
reward_function = {(0, 0): 1.0, (0, 1): -1.0, (1, 0): 2.0, (1, 1): 0.0}
finite_mrp = FiniteMarkovRewardProcess(transition_probabilities, reward_function)
```

### Accumulated Discounted Rewards

We often want to evaluate the total expected return from a given state. The **accumulated discounted rewards**, denoted by $G_t$, account for both immediate rewards and future rewards, with future rewards being discounted to reflect uncertainty and time preference:
$$
G_t = \sum_{i=t+1}^\infty \gamma^{i-t-1} R_i = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots
$$
* **Discount Factor $\gamma$:** A value between 0 and 1 determining how much we prioritize immediate rewards versus future rewards.

### Value Function

The **value function** $V$ is the core concept in MRPs. It represents the long-term expected return starting from a non-terminal state $s$:
$$
V(s) = \mathbb{E}[G_t \mid S_t = s] \quad \forall s \in \mathcal{N}.
$$

#### Bellman Equation

The Bellman equation expresses the value function recursively:
$$
V(s) = \mathcal{R}(s) + \gamma \sum_{s' \in \mathcal{N}} \mathcal{P}(s, s') \cdot V(s').
$$
This equation states that the value of a state is the expected immediate reward plus the discounted expected value of the next state.

#### Computational Considerations

* **Finite MRPs:** The value function can be solved directly using linear algebra (a one-line sketch follows this list):
  $$
  V = R + \gamma P V \;\Rightarrow\; V = (I - \gamma P)^{-1} R.
  $$
* **Large State Spaces:** Direct matrix inversion becomes impractical, so iterative dynamic programming methods are often necessary.
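
As a sketch of the matrix solution, the linear system can be solved in one call with `numpy`, using the same illustrative rewards, transition matrix, and discount factor as the iterative example below:

```python=
import numpy as np

# Same illustrative R, P, and gamma as the iterative example below
R = np.array([1, 0, -1])
P = np.array([[0.9, 0.1, 0.0],
              [0.5, 0.0, 0.5],
              [0.0, 0.1, 0.9]])
gamma = 0.99

# Solve (I - gamma * P) V = R directly instead of iterating
V_direct = np.linalg.solve(np.eye(len(R)) - gamma * P, R)
print("Value Function (direct solve):", V_direct)
```

This should agree, up to the convergence threshold, with the iterative result below.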
#### Get value function

```python=
import numpy as np

def get_value_function(R, P, gamma, threshold=0.001):
    """
    Calculate the value function for a finite Markov Reward Process.

    Parameters:
    - R: Numpy array representing the expected rewards for each state.
    - P: Numpy matrix representing the transition probabilities.
    - gamma: Discount factor.
    - threshold: Convergence threshold.

    Returns:
    - V: Numpy array representing the value function for each state.
    """
    num_states = R.shape[0]
    V = np.zeros(num_states)  # Initialize the value function with zeros

    while True:
        delta = 0  # Initialize delta to track changes in the value function
        for s in range(num_states):
            v = V[s]  # Store the current value function for state s
            # Update the value function based on the Bellman equation
            V[s] = R[s] + gamma * np.dot(P[s, :], V)
            # Update delta to check for convergence
            delta = max(delta, abs(v - V[s]))
        # Check if the change in the value function is below the threshold for all states
        if delta < threshold:
            break

    return V

# Example usage
R = np.array([1, 0, -1])        # Example rewards for each state
P = np.array([[0.9, 0.1, 0.0],  # Example transition probabilities matrix
              [0.5, 0.0, 0.5],
              [0.0, 0.1, 0.9]])
gamma = 0.99                    # Discount factor

V = get_value_function(R, P, gamma)
print("Value Function:", V)
```

Output:
```
Value Function: [ 9.26349365  0.08829004 -9.08611006]
```

## Example: A Simple Inventory Management Process

### Scenario

A small store wants to optimize its inventory management to minimize costs associated with holding excess stock and missing out on sales due to stockouts.

### Key Parameters

* **Capacity $C$:** Maximum inventory the store can hold.
* **In-hand Inventory $\alpha$:** Stock on hand at the end of the day.
* **On-order Inventory $\beta$:** Stock that has been ordered but not yet received (it was ordered 36 hours earlier and arrives at the start of the next day).
* **Holding Cost $h$:** Cost per unit for storing unsold items overnight.
* **Stockout Cost $p$:** Cost per unit of unmet demand.

### Daily Operations

1. **Store Opens (8 AM):** Current state is $S_t = (\alpha, \beta)$.
2. **Order Calculation:** Order quantity $= \max(C-(\alpha+\beta),0)$ (order whenever current inventory falls below capacity).
3. **Receive Previous Order (6 AM, before opening):** Inventory ordered 36 hours ago arrives.
4. **Demand:** Customer demand $i$ is random, modeled by a Poisson distribution with parameter $\lambda$.
5. **Sales:** Items sold $= \min(\alpha+\beta,i)$.
6. **New State:** Update inventory levels for the start of the next day:
   $$
   S_{t+1} = (\max(\alpha+\beta-i, 0),\; \max(C-(\alpha+\beta), 0))
   $$
7. **Record Costs (6 PM):**
   * Holding cost: $h \cdot \alpha$
   * Stockout cost: $p \cdot \max(i - (\alpha+\beta), 0)$, where $i$ is the day's demand.
8. **Calculate Reward:** $R_{t+1}$ is the negative of the sum of holding and stockout costs.

### Modeling with an MRP

* **State Space:** Possible combinations of in-hand and on-order inventory $(\alpha,\beta)$.
* **Transition Probabilities:** Determined by the Poisson demand distribution.
* **Rewards:** Negative costs incurred at each step.
* **Discount Factor $\gamma$:** Reflects preference for immediate vs. future rewards (e.g., $\gamma = 0.9$ indicates we value today's costs more than tomorrow's).

**Goal:** Use the value function of the MRP to evaluate the long-term expected return (the negative of the expected discounted costs) from each possible inventory state.
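
Before looking at the full implementation, here is a quick sanity check of the transition structure for one state, using $C=2$ and $\lambda=1$ (the same parameters as the code below). From state $(\alpha,\beta)=(0,1)$ the inventory position is $\alpha+\beta=1$, so the next on-order quantity is $C-1=1$; demand $i=0$ (probability $e^{-1}\approx0.368$) leads to state $(1,1)$, while demand $i\ge1$ (probability $1-e^{-1}\approx0.632$) leads to state $(0,1)$. A minimal check with `scipy`:

```python=
from scipy.stats import poisson

capacity, lam = 2, 1.0
alpha, beta = 0, 1                     # current state (on_hand, on_order)
ip = alpha + beta                      # inventory position
next_on_order = max(capacity - ip, 0)  # quantity ordered today

p_demand_0 = poisson.pmf(0, lam)          # demand i = 0
p_demand_ge_1 = 1 - poisson.cdf(0, lam)   # demand i >= 1

print((max(ip - 0, 0), next_on_order), round(p_demand_0, 3))     # (1, 1) 0.368
print((max(ip - 1, 0), next_on_order), round(p_demand_ge_1, 3))  # (0, 1) 0.632
```

These probabilities reappear in the `Transition Map` printed by the program below.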
#### Python Implementation

```python=
from dataclasses import dataclass
from typing import Tuple, Dict, Mapping
from markov_process import MarkovRewardProcess
from markov_process import FiniteMarkovRewardProcess
from markov_process import State, NonTerminal
from scipy.stats import poisson
from distribution import SampledDistribution, Categorical, FiniteDistribution
import numpy as np

@dataclass(frozen=True)
class InventoryState:
    on_hand: int
    on_order: int

    def inventory_position(self) -> int:
        return self.on_hand + self.on_order

class SimpleInventoryMRP(MarkovRewardProcess[InventoryState]):

    def __init__(
        self,
        capacity: int,
        poisson_lambda: float,
        holding_cost: float,
        stockout_cost: float
    ):
        self.capacity = capacity
        self.poisson_lambda: float = poisson_lambda
        self.holding_cost: float = holding_cost
        self.stockout_cost: float = stockout_cost

    def transition_reward(
        self,
        state: NonTerminal[InventoryState]
    ) -> SampledDistribution[Tuple[State[InventoryState], float]]:

        def sample_next_state_reward(state=state) ->\
                Tuple[State[InventoryState], float]:
            demand_sample: int = np.random.poisson(self.poisson_lambda)
            ip: int = state.state.inventory_position()
            next_state: InventoryState = InventoryState(
                max(ip - demand_sample, 0),
                max(self.capacity - ip, 0)
            )
            reward: float = - self.holding_cost * state.state.on_hand\
                - self.stockout_cost * max(demand_sample - ip, 0)
            return NonTerminal(next_state), reward

        return SampledDistribution(sample_next_state_reward)

class SimpleInventoryMRPFinite(FiniteMarkovRewardProcess[InventoryState]):

    def __init__(
        self,
        capacity: int,
        poisson_lambda: float,
        holding_cost: float,
        stockout_cost: float
    ):
        self.capacity: int = capacity
        self.poisson_lambda: float = poisson_lambda
        self.holding_cost: float = holding_cost
        self.stockout_cost: float = stockout_cost

        self.poisson_distr = poisson(poisson_lambda)
        super().__init__(self.get_transition_reward_map())

    def get_transition_reward_map(self) -> \
            Mapping[
                InventoryState,
                FiniteDistribution[Tuple[InventoryState, float]]
            ]:
        d: Dict[InventoryState, Categorical[Tuple[InventoryState, float]]] = {}
        for alpha in range(self.capacity + 1):
            for beta in range(self.capacity + 1 - alpha):
                state = InventoryState(alpha, beta)
                ip = state.inventory_position()
                beta1 = self.capacity - ip
                base_reward = - self.holding_cost * state.on_hand
                sr_probs_map: Dict[Tuple[InventoryState, float], float] =\
                    {(InventoryState(ip - i, beta1), base_reward):
                     self.poisson_distr.pmf(i) for i in range(ip)}
                probability = 1 - self.poisson_distr.cdf(ip - 1)
                reward = base_reward - self.stockout_cost * \
                    (self.poisson_lambda - ip *
                     (1 - self.poisson_distr.pmf(ip) / probability))
                sr_probs_map[(InventoryState(0, beta1), reward)] = probability
                d[state] = Categorical(sr_probs_map)
        return d

if __name__ == '__main__':
    user_capacity = 2
    user_poisson_lambda = 1.0
    user_holding_cost = 1.0
    user_stockout_cost = 10.0

    user_gamma = 0.9

    si_mrp = SimpleInventoryMRPFinite(
        capacity=user_capacity,
        poisson_lambda=user_poisson_lambda,
        holding_cost=user_holding_cost,
        stockout_cost=user_stockout_cost
    )

    from markov_process import FiniteMarkovProcess
    print("Transition Map")
    print("--------------")
    print(FiniteMarkovProcess(
        {s.state: Categorical({s1.state: p for s1, p in v.table().items()})
         for s, v in si_mrp.transition_map.items()}
    ))

    print("Transition Reward Map")
    print("---------------------")
    print(si_mrp)

    print("Stationary Distribution")
    print("-----------------------")
    si_mrp.display_stationary_distribution()
    print()

    print("Reward Function")
    print("---------------")
    si_mrp.display_reward_function()
    print()

    print("Value Function")
    print("--------------")
    si_mrp.display_value_function(gamma=user_gamma)
    print()
```

```
Transition Map
--------------
From State InventoryState(on_hand=0, on_order=0):
  To State InventoryState(on_hand=0, on_order=2) with Probability 1.000
From State InventoryState(on_hand=0, on_order=1):
  To State InventoryState(on_hand=1, on_order=1) with Probability 0.368
  To State InventoryState(on_hand=0, on_order=1) with Probability 0.632
From State InventoryState(on_hand=0, on_order=2):
  To State InventoryState(on_hand=2, on_order=0) with Probability 0.368
  To State InventoryState(on_hand=1, on_order=0) with Probability 0.368
  To State InventoryState(on_hand=0, on_order=0) with Probability 0.264
From State InventoryState(on_hand=1, on_order=0):
  To State InventoryState(on_hand=1, on_order=1) with Probability 0.368
  To State InventoryState(on_hand=0, on_order=1) with Probability 0.632
From State InventoryState(on_hand=1, on_order=1):
  To State InventoryState(on_hand=2, on_order=0) with Probability 0.368
  To State InventoryState(on_hand=1, on_order=0) with Probability 0.368
  To State InventoryState(on_hand=0, on_order=0) with Probability 0.264
From State InventoryState(on_hand=2, on_order=0):
  To State InventoryState(on_hand=2, on_order=0) with Probability 0.368
  To State InventoryState(on_hand=1, on_order=0) with Probability 0.368
  To State InventoryState(on_hand=0, on_order=0) with Probability 0.264

Transition Reward Map
---------------------
From State InventoryState(on_hand=0, on_order=0):
  To [State InventoryState(on_hand=0, on_order=2) and Reward -10.000] with Probability 1.000
From State InventoryState(on_hand=0, on_order=1):
  To [State InventoryState(on_hand=1, on_order=1) and Reward -0.000] with Probability 0.368
  To [State InventoryState(on_hand=0, on_order=1) and Reward -5.820] with Probability 0.632
From State InventoryState(on_hand=0, on_order=2):
  To [State InventoryState(on_hand=2, on_order=0) and Reward -0.000] with Probability 0.368
  To [State InventoryState(on_hand=1, on_order=0) and Reward -0.000] with Probability 0.368
  To [State InventoryState(on_hand=0, on_order=0) and Reward -3.922] with Probability 0.264
From State InventoryState(on_hand=1, on_order=0):
  To [State InventoryState(on_hand=1, on_order=1) and Reward -1.000] with Probability 0.368
  To [State InventoryState(on_hand=0, on_order=1) and Reward -6.820] with Probability 0.632
From State InventoryState(on_hand=1, on_order=1):
  To [State InventoryState(on_hand=2, on_order=0) and Reward -1.000] with Probability 0.368
  To [State InventoryState(on_hand=1, on_order=0) and Reward -1.000] with Probability 0.368
  To [State InventoryState(on_hand=0, on_order=0) and Reward -4.922] with Probability 0.264
From State InventoryState(on_hand=2, on_order=0):
  To [State InventoryState(on_hand=2, on_order=0) and Reward -2.000] with Probability 0.368
  To [State InventoryState(on_hand=1, on_order=0) and Reward -2.000] with Probability 0.368
  To [State InventoryState(on_hand=0, on_order=0) and Reward -5.922] with Probability 0.264

Stationary Distribution
-----------------------
{InventoryState(on_hand=2, on_order=0): 0.162,
 InventoryState(on_hand=1, on_order=1): 0.162,
 InventoryState(on_hand=0, on_order=2): 0.117,
 InventoryState(on_hand=0, on_order=1): 0.279,
 InventoryState(on_hand=1, on_order=0): 0.162,
 InventoryState(on_hand=0, on_order=0): 0.117}

Reward Function
---------------
{NonTerminal(state=InventoryState(on_hand=1, on_order=1)): -2.036,
 NonTerminal(state=InventoryState(on_hand=2, on_order=0)): -3.036,
 NonTerminal(state=InventoryState(on_hand=1, on_order=0)): -4.679,
 NonTerminal(state=InventoryState(on_hand=0, on_order=2)): -1.036,
 NonTerminal(state=InventoryState(on_hand=0, on_order=1)): -3.679,
 NonTerminal(state=InventoryState(on_hand=0, on_order=0)): -10.0}

Value Function
--------------
{NonTerminal(state=InventoryState(on_hand=1, on_order=1)): -38.329,
 NonTerminal(state=InventoryState(on_hand=2, on_order=0)): -39.329,
 NonTerminal(state=InventoryState(on_hand=1, on_order=0)): -38.971,
 NonTerminal(state=InventoryState(on_hand=0, on_order=2)): -37.329,
 NonTerminal(state=InventoryState(on_hand=0, on_order=1)): -37.971,
 NonTerminal(state=InventoryState(on_hand=0, on_order=0)): -43.596}
```
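
As a cross-check, the printed Value Function can be reproduced (approximately, because the probabilities and rewards above are rounded to three decimals) from the printed Reward Function and Transition Map via the matrix solution $V = (I - \gamma P)^{-1} R$:

```python=
import numpy as np

# States ordered as: (0,0), (0,1), (0,2), (1,0), (1,1), (2,0)
# Rounded values taken from the Reward Function and Transition Map above
R = np.array([-10.0, -3.679, -1.036, -4.679, -2.036, -3.036])
P = np.array([
    [0.0,   0.0,   1.0,   0.0,   0.0,   0.0],    # from (0,0)
    [0.0,   0.632, 0.0,   0.0,   0.368, 0.0],    # from (0,1)
    [0.264, 0.0,   0.0,   0.368, 0.0,   0.368],  # from (0,2)
    [0.0,   0.632, 0.0,   0.0,   0.368, 0.0],    # from (1,0)
    [0.264, 0.0,   0.0,   0.368, 0.0,   0.368],  # from (1,1)
    [0.264, 0.0,   0.0,   0.368, 0.0,   0.368],  # from (2,0)
])
gamma = 0.9

V = np.linalg.solve(np.eye(6) - gamma * P, R)
for state, v in zip(['(0,0)', '(0,1)', '(0,2)', '(1,0)', '(1,1)', '(2,0)'], V):
    print(state, round(v, 3))  # should be close to the Value Function above
```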
## References

- Chapter 3 of the [RLForFinanceBook](https://stanford.edu/~ashlearn/RLForFinanceBook/book.pdf)
- [Markov Process and Markov Reward Process](https://github.com/coverdrive/technical-documents/blob/master/finance/cme241/Tour-MP.pdf) slides for CME 241: Foundations of Reinforcement Learning with Applications in Finance