# Applying Q-learning to the McCall Search Model
Ya-Zhu Yang
## Introduction
The unemployed often face a challenging dilemma: whether to accept a new job offer or to defer gratification and continue the job search. This report seeks to provide a framework for making optimal decisions in such situations, with a particular focus on maximizing income.
The McCall search model provides a framework for quantitatively analyzing the decision-making process of the unemployed. By maximizing a value function that satisfies the Bellman equation, we can assign numerical values to the utility of each choice, which makes a direct comparison possible.
This report leverages the Q-learning algorithm to create a model that supports the unemployed in making data-driven decisions. By constructing a Q-table and iteratively refining Q-values, we can identify the most advantageous course of action for maximizing income.
## McCall Search Model
McCall's model seeks to investigate the decision-making process of unemployed individuals when confronted with stochastic wage offers. The decision to accept or reject a job is contingent upon current wages, prospective wages, unemployment compensation, and the anticipated duration of job search.
Before analyzing the model in depth, we outline the four stages of the analysis. This roadmap will make the rest of the report easier to follow.
1. **Set initial values for model parameters:** Specify the distribution of wages, discount factor, and waiting cost.
2. **Define a value function:** By applying dynamic programming, we iteratively compute a value function that reflects the optimal expected utility for an unemployed individual facing a specific choice at each time step.
3. **Numerical solution:** By numerically solving the value function, the optimal policy can be determined.
4. **Simulation and data analysis:** Run the model to simulate the system and analyze the results to understand how the threshold wage changes with different economic parameters.
### Parameter Setting
#### Fundamental Assumption
For unemployed individuals seeking employment, we posit the following assumption regarding each job opportunity they encounter:
* New job opportunities emerge randomly over time.
* While each job provides a base salary, the specific salary offered in each round is a random variable drawn from a particular distribution.
* Unemployed individuals commence work immediately upon accepting a job offer; otherwise, they remain in the job search.
The relationship between the unemployed and the wage offered for each job is based on the following assumptions:
* Unemployed individuals are offered a job with a fixed wage of $w_t$ at every opportunity, which occurs randomly over time.
* The sequence $\{w_t\}_{t \geq 0}$ is i.i.d., and $q(w)$ is the probability of observing wage $w$ within the finite offer set $\mathbb{W}$.
* The unemployed individual observes the wage $w_t$ at the start of period $t$.
#### Decision-Making Method
With these basic assumptions in place, we may now turn our attention to the factors that influence the decision-making processes of the unemployed. At time $t$ , the unemployed individual faces the following decision and outcome scenarios:
1. Accept this position with a permanent, fixed-salary $w_t$ contract.
2. Decline this job offer, apply for unemployment benefits $c$ , and reassess future career path.
The primary goal of an unemployed individual is to maximize their earnings, both present and future. This objective can be represented mathematically by:
$$
\max \ \mathbb{E}( \sum_{t=0}^{\infty} \beta^t y_t)
$$
* $\beta \in (0,1)$ is the discount factor, a ratio representing the value of future earnings relative to current earnings.
* $y_t$ is the income in this period.
$y_t=w_t$ when accepting the job ; $y_t=c$ when declining the job.
From an intuitive perspective, the unemployed must weigh two opposing considerations. Waiting too long for a better salary means future earnings are more heavily discounted; accepting too low a salary means forfeiting the chance of higher earnings later.
Consequently, accurate decision-making is of paramount importance yet complex. We shall proceed by quantifying the situation and the decisions involved, and formulating functions to express them.
### Value Function
The value function $v^*(w)$ is defined as the maximum expected discounted income, present and future, that an unemployed individual can achieve when holding a wage offer $w$ in hand. The function must therefore satisfy the following recursive relation in every period.
$$
v^*(w)= \max \left\{ \frac{w}{1 - \beta}, \, c + \beta \sum_{w' \in \mathbb{W}} v^*(w') q (w') \right\}
$$
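The first term reflects the fact that accepting a permanent contract at wage $w$ yields the discounted income stream
$$
\sum_{t=0}^{\infty} \beta^t w = \frac{w}{1 - \beta},
$$
while the second term is the value of rejecting: collecting the benefit $c$ today and drawing a fresh offer next period.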
The solution $v^*$ is the key to making an informed choice between acceptance and rejection: the optimal action is whichever term inside the braces is larger. In what follows, we define a map $\sigma$ that optimally guides this decision based on the current offer $w$. We consider mappings from the real numbers to the binary set $\left\{0,1\right\}$, where $0$ denotes rejection and $1$ denotes acceptance.
$$
\sigma(w) := \mathbf{1}
\left\{
\frac{w}{1 - \beta} \geq c + \beta \sum_{w' \in \mathbb W}
v^*(w') q (w')
\right\}
$$
where $\mathbf{1}\{ P \} = 1$ when statement $P$ is true , and equals $0$ otherwise.
The cutoff value of $w$ in the policy $\sigma$ is called the reservation wage and is denoted $\bar w$; it is given by the expression below. A worker should accept a job offer if and only if the offered wage is at least the reservation wage $\bar w$.
$$
\bar w := (1 - \beta) \left\{ c + \beta \sum_{w'\in \mathbb W} v^*(w') q (w') \right\}
$$
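Equivalently, the optimal policy can be written directly in terms of the reservation wage:
$$
\sigma(w) = \mathbf{1}\{ w \geq \bar w \}.
$$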
### Solving for the Optimal Strategy
To commence, the value function corresponding to each potential $w \in \mathbb W$ must be determined. For the sake of simplicity, we adopt the following notation:
$$
\mathbb W := \{w_1, \ldots, w_n \}
\quad \text{,} \quad
v^*(i) := v^*(w_i)\quad \text{,} \quad v^* = (v^*(i))_{i=1}^n
$$
The recursive relation for the value function can now be expressed as:
$$
v^*(i)
= \max \left\{
\frac{w(i)}{1 - \beta}, \, c + \beta \sum_{1 \leq j \leq n}
v^*(j) q (j)
\right\}
\quad
\text{for } i = 1, \ldots, n
$$
#### Lemma: Banach Contraction Mapping Principle
**Statement:** If $T : \mathbb R^n \to \mathbb R^n$ is a contraction mapping on the complete metric space $(\mathbb R^n, d)$, then $T$ has a unique fixed point $v^*$. Furthermore, for every $v$, $T^k (v) \to v^*$ as $k \to \infty$. In our setting, the relevant operator $T$ is
$$
(Tv)(i)
= \max \left\{
\frac{w(i)}{1 - \beta}, \, c + \beta \sum_{1 \leq j \leq n}
v(j) q (j)
\right\}
\quad
\text{for } i = 1, \ldots, n
$$
Since the operator $T$ defined above is a contraction (with modulus $\beta$ under the supremum norm), we can use this principle to obtain an approximation of $v^*$ by repeatedly iterating $T$ from an initial guess until the error falls within our tolerance.
#### Algorithmic Procedure
1. Choose an arbitrary initial value $v \in \mathbb R^n$; a good initial guess reduces the number of iterations required.
2. Compute the update $v' = Tv \in \mathbb R^n$ using the operator defined above.
3. Compute a measure of the difference between $v$ and $v'$, such as $\max_i |v(i)- v'(i)|$.
4. If the error exceeds the specified tolerance, set $v = v'$ and return to step 2.
5. Otherwise, output $v'$ as an approximation of $v^*$ (see the sketch below).
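Before turning to the full implementation in the next subsection, the following sketch illustrates steps 1–5 on a small toy wage grid; the wages, offer probabilities, and parameters here are illustrative assumptions only, not the values used below.
```python=
import numpy as np

# Toy parameters (illustrative assumptions, not the values used in the next subsection)
c, β, tol = 25.0, 0.99, 1e-6
w = np.array([30.0, 50.0, 70.0, 90.0])   # wage grid W
q = np.array([0.2, 0.3, 0.3, 0.2])       # offer probabilities q(w)

v = w / (1 - β)                          # step 1: initial guess (accept every wage forever)
error = tol + 1
while error > tol:
    # step 2: apply the Bellman operator T elementwise
    v_new = np.maximum(w / (1 - β), c + β * np.sum(v * q))
    # step 3: measure the difference between v and v'
    error = np.max(np.abs(v_new - v))
    # step 4: replace v with v' and repeat until the error is within tolerance
    v = v_new

# step 5: v now approximates v*, and the reservation wage follows directly
w_bar = (1 - β) * (c + β * np.sum(v * q))
print(v, w_bar)
```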
#### Data-Driven Observations and Calculations
We now introduce concrete parameter values in order to examine the wage distribution, the convergence of the value function iteration, and the computation of the reservation wage, as demonstrated below.
```python=
!pip install quantecon
import matplotlib.pyplot as plt
plt.rcParams["figure.figsize"] = (11, 5)
import numpy as np
from numba import jit, float64
from numba.experimental import jitclass
import quantecon as qe
from quantecon.distributions import BetaBinomial
# Set parameters
n, a, b = 50, 250, 150
q_default = BetaBinomial(n, a, b).pdf()
w_min, w_max = 20, 100
w_default = np.linspace(w_min, w_max, n+1)
# Plot PDF of Wage Outcomes
fig, ax = plt.subplots()
ax.plot(w_default, q_default, '-o', label='$q(w(i))$')
ax.set_xlabel("Wages")
ax.set_ylabel("Probabilities")
ax.set_title("PDF of Wage Outcomes")
plt.show()
# Utilize "Numba" to Accelerate Code
mccall_data = [
('c', float64),
('β', float64),
('w', float64[:]),
('q', float64[:]) ]
# Assign parameters to class
@jitclass(mccall_data)
class McCallModel:
def __init__(self, c=25, β=0.99, w=w_default, q=q_default):
self.c, self.β = c, β
        self.w, self.q = w, q
def state_action_values(self, i, v):
c, β, w, q = self.c, self.β, self.w, self.q
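        # Value of accepting w[i] (a wage annuity) vs. rejecting (benefit plus discounted continuation value)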
accept = w[i] / (1 - β)
reject = c + β * np.sum(v * q)
return np.array([accept, reject])
# Construct Value Function
def plot_value_function_seq(mcm, ax, num_plots=6):
n = len(mcm.w)
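    # Initial guess: the value of accepting each wage forever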
v = mcm.w / (1 - mcm.β)
v_next = np.empty_like(v)
for i in range(num_plots):
ax.plot(mcm.w, v, '-', alpha=0.4, label=f"iterate {i}")
# Update guess
for j in range(n):
v_next[j] = np.max(mcm.state_action_values(j, v))
v[:] = v_next
ax.legend(loc='lower right')
#Plot Iteration Process
mcm = McCallModel()
fig, ax = plt.subplots()
ax.set_xlabel("Wage")
ax.set_ylabel("Value")
ax.set_title("Convergence Process of Value Function")
plot_value_function_seq(mcm, ax)
plt.show()
#Compute Reservation Wage
def compute_reservation_wage(mcm,
max_iter=500,
tol=1e-6):
c, β, w, q = mcm.c, mcm.β, mcm.w, mcm.q
n = len(w)
v = w / (1 - β)
v_next = np.empty_like(v)
    i = 0
    error = tol + 1
    while i < max_iter and error > tol:
        for j in range(n):
            v_next[j] = np.max(mcm.state_action_values(j, v))
        error = np.max(np.abs(v_next - v))
        i += 1
        v[:] = v_next
return (1 - β) * (c + β * np.sum(v * q))
compute_reservation_wage(mcm)
```


```
--- Observations and Interpretation ---
The distribution chart depicts the probability of each possible wage offer.
The value function converges toward a fixed limit after repeated iterations.
The reservation wage obtained from the converged value function is:
- Reservation Wage: 75.53162788038951
```
### Bellman Equation
The Bellman equation underpins the principle of optimality in dynamic programming. It expresses the value of a decision problem at any point as the sum of the immediate reward from the current choice and the discounted value of the resulting subproblem. In discrete time, a multi-stage optimization problem can be solved by analyzing the appropriate Bellman equation; in continuous time, the analogous object is a partial differential equation known as the Hamilton–Jacobi–Bellman equation.
#### Discrete State
$$
V(x) = \max_{a \in \Gamma(x)} \left\{ R(x, a) + \beta V(T(x, a)) \right\}
$$
where
* $V$: Value function,
* $\Gamma(x)$: Set of actions available in state $x$,
* $R(x, a)$: Return for taking action $a$ at state $x$,
* $T(x, a)$: The state resulting from $(x, a)$.
#### Continuous State
$$
V(x) = \max_{a \in \Gamma(x)} \left\{ R(x, a) + \beta \int V\big(T(x, a, z)\big) \, dF(z) \right\}
$$
where
* $V$: Value function, $\Gamma(x)$: Set of actions available in state $x$,
* $R(x, a)$: Return for taking action $a$ at state $x$,
* $T(x, a, z)$: The state resulting from $(x, a)$ and the random shock $z$,
* $F$: The distribution of the shock $z$ that determines next period's state.
#### McCall's Model
* Discrete State : $$v^*(w)= \max_{\text{accept, reject}}\;\left\{\frac{w}{1 - \beta}, \, c + \beta \sum_{w' \in \mathbb{W}} v^*(w') q (w') \right\}$$
* Continuous State : $$V\left(w\right)=\max_{\text{accept, reject}}\;\left\{ \frac{w}{1-\beta},c+\beta\int V\left(w'\right)dF\left(w'\right)\right\}$$
At its core, the Q-learning algorithm leverages the Bellman equation to iteratively update its Q-values. These Q-values represent the expected cumulative reward obtained by taking a specific action in a given state. By continuously interacting with the environment, the Q-learning algorithm refines its Q-values based on the Bellman equation, gradually converging to the optimal values.
## Q-learning
Q-learning is a model-free reinforcement learning algorithm that estimates the optimal action-value function, or Q-function, through interactions with an environment. The Q-function represents the expected cumulative discounted reward for taking a particular action in a given state. By iteratively updating the Q-function based on observed rewards and transitions, Q-learning enables agents to learn policies that maximize long-term rewards.

### Quality Function
In the previous section, we introduced the value function, which depends only on the state. We now seek a quality function $Q$ that maps a given state and a chosen action to the expected cumulative discounted reward the unemployed individual obtains after taking that action (and behaving optimally thereafter). In essence, the quality function encapsulates the roles of both the value function $V$ and the policy function $\sigma$.
$$
Q: \mathcal{W}\ \times \mathcal{A}\ \rightarrow \mathbb{R}
$$
State space: $\mathcal{W}=\{w_1,w_2,\ldots,w_n\}$; action space: $\mathcal{A}=\{\text{accept}, \text{reject}\}$.
The Quality Function $Q$ satisfies the following two mathematical equations:
$$
\begin{split}
Q\left(w,\text{accept}\right) & =\frac{w}{1-\beta} \\
Q\left(w,\text{reject}\right) & =c+\beta\int\max_{\text{accept, reject}}\left\{ \frac{w'}{1-\beta},Q\left(w',\text{reject}\right)\right\} dF\left(w'\right)
\end{split}
$$
Since $Q\left(w,\text{reject}\right)$ does not depend on the current state $w$, we can express it in a more abbreviated form:
$$
Q_r := Q\left(w,\text{reject}\right) \quad \forall \, w\in\mathcal{W}.
$$
The optimal value function and the quality function are connected via the following expressions:
$$
V(w) = \max_{\text{accept},\,\text{reject}} \left\{ Q(w, \text{accept}), \, Q\left(w,\text{reject}\right) \right\}
$$
$$
\text{or}
$$
$$
V(w) =\max\left\{ \frac{w}{1-\beta},Q_{r}\right\}
$$
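Comparing this expression with the reservation-wage formula derived earlier shows that the reservation wage can be read off directly from $Q_r$, since accepting is optimal exactly when $w/(1-\beta) \geq Q_r$:
$$
\bar w = (1 - \beta)\, Q_r .
$$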
### Adaptive Learning Function
The Q-function described above satisfies the following two Bellman equations.
$$
\begin{split}
w & + \beta \max_{\textrm{accept, reject}} \left\{ Q (w, \textrm{accept}), Q(w, \textrm{reject}) \right\} - Q (w, \textrm{accept}) = 0 \\
c & +\beta\int\max_{\text{accept, reject}}\left\{ Q(w', \textrm{accept}),Q\left(w',\text{reject}\right)\right\} dF\left(w'\right) - Q\left(w,\text{reject}\right) = 0
\end{split}
$$
To facilitate computation, we replace the integral (the expectation over $w'$) with a single draw $w' \sim F$, so that the second equation holds only approximately.
$$
\begin{split}
w & + \beta \max_{\textrm{accept, reject}} \left\{ Q (w, \textrm{accept}), Q(w, \textrm{reject}) \right\} - Q (w, \textrm{accept}) = 0 \\
c & +\beta \max_{\text{accept, reject}}\left\{ Q(w', \textrm{accept}),Q\left(w',\text{reject}\right)\right\} - Q\left(w,\text{reject}\right) \approx 0
\end{split}
$$
Q-learning is predicated on drawing a large sample of wage offers from the distribution $F$; by the law of large numbers, averaging over many draws approximates the expectation above. To index these draws, we regard $w$ as $w_t$ and $w'$ as $w_{t+1}$, and we add a time subscript to the quality function, denoting its estimate at time $t$ by $\hat Q_t$. Lastly, to quantify the error in each equation, we introduce a variable $\textrm{diff}$.
$$
\begin{split}
w_t & + \beta \max_{\textrm{accept, reject}} \left\{ \hat Q_t (w_t, \textrm{accept}), \hat Q_t(w_t, \textrm{reject}) \right\} - \hat Q_t(w_t, \textrm{accept}) = \textrm{diff}_{\textrm{accept},t} \\
c & +\beta \max_{\text{accept, reject}}\left\{ \hat Q_t(w_{t+1}, \textrm{accept}),\hat Q_t\left(w_{t+1},\text{reject}\right)\right\} - \hat Q_t\left(w_t,\text{reject}\right) = \textrm{diff}_{\textrm{reject},t}
\end{split}
$$
Our simplified adaptive learning scheme is governed by the following recurrence relation.
$$
\hat Q_{t+1} = \hat Q_t + \alpha \ \textrm{diff}_t
$$
where $\alpha \in (0,1)$ is a small gain parameter that governs the rate of learning.
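To make the recurrence concrete, the following sketch performs a single update for each action on a toy two-column Q-table (column 0 = reject, column 1 = accept, matching the implementation later in this section); the wage grid, offer probabilities, and parameter values are illustrative assumptions.
```python=
import numpy as np

# Toy setup (illustrative assumptions)
w = np.array([30.0, 50.0, 70.0, 90.0])   # wage grid
q = np.array([0.2, 0.3, 0.3, 0.2])       # offer probabilities
c, β, α = 25.0, 0.99, 0.5

Q = np.zeros((len(w), 2))                # column 0 = reject, column 1 = accept
s = 1                                    # current state: offer w[1] in hand

# diff for "accept": reward w[s] plus the discounted max over next-period actions
diff_accept = w[s] + β * np.max(Q[s, :]) - Q[s, 1]
Q[s, 1] += α * diff_accept

# diff for "reject": benefit c, then a fresh offer w' drawn from q
s_next = np.random.choice(len(w), p=q)
diff_reject = c + β * np.max(Q[s_next, :]) - Q[s, 0]
Q[s, 0] += α * diff_reject
```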
### Algorithm
With the foundational Q-learning equations established, we proceed to a detailed examination of the algorithm. We first provide an overview of the steps involved in Q-learning, and then explore each of them in greater detail. We assume a finite action space and a finite state space.
1. **Initialize the Q-table**: all possible states and actions are represented within the table.
2. **Compute a Q-value update**: from the current state, choose an action (randomly at first) and compute the resulting reward and next state.
3. **Find the maximum Q-value**: identify the maximum Q-value over the actions available in the resulting state.
4. **Update the Q-value**: write the refined value estimate for the chosen state–action pair back into the Q-table.
5. **Repeat steps 2–4**: iteratively interact with the environment and update the Q-values until the goal is achieved.

#### Temporal Difference Error
Following the adaptive learning rule detailed in the preceding section, the unemployed individual repeatedly takes actions and receives the corresponding monetary rewards throughout the learning process.
$$
\widetilde{Q}^{new}\left(w,a\right)=\widetilde{Q}^{old}\left(w,a\right)+\alpha \widetilde{TD}\left(w,a\right)
$$
where
$$
\begin{split}
\widetilde{TD}\left(w,\text{accept}\right) & = \left[ w+\beta\max_{a'\in\mathcal{A}}\widetilde{Q}^{old}\left(w,a'\right) \right]-\widetilde{Q}^{old}\left(w,\text{accept}\right) \\
\widetilde{TD}\left(w,\text{reject}\right) & = \left[ c+\beta\max_{a'\in\mathcal{A}}\widetilde{Q}^{old}\left(w',a'\right) \right]-\widetilde{Q}^{old}\left(w,\text{reject}\right),\;w'\sim F
\end{split}
$$
The term $\widetilde{TD}(w,a)$, which corresponds to the $\textrm{diff}$ variable in our earlier discussion, is the temporal difference error that drives each update.
#### Random Experiment
While McCall's model optimizes for maximum monetary reward, the notion of an optimal decision can vary across models. For instance, when risk is the primary concern, as in the CAPM, minimizing the Q-value might be the appropriate criterion. Note that this report is confined to the discrete-state setting, but the framework can be extended to continuous-state scenarios with appropriate parameter adjustments.
* Discrete state:
  * **McCall's model:** $\textrm{argmax}_{a'\in\mathcal{A}}\widetilde{Q}^{old}\left(w,a'\right)$
  * **CAPM:** $\textrm{argmin}_{a'\in\mathcal{A}}\widetilde{Q}^{old}\left(w,a'\right)$
* Continuous state:
  * **McCall's model:** $\textrm{argmax}_{a'\in\mathcal{A}}\widetilde{Q}^{old}\left(w',a'\right)$
  * **CAPM:** $\textrm{argmin}_{a'\in\mathcal{A}}\widetilde{Q}^{old}\left(w',a'\right)$
#### Data-Driven Observations and Calculations
Using simulated data, we first compute the value function by value function iteration (VFI) in Python as a benchmark, and then construct the Q-table iteratively through Q-learning. We analyze the optimal decisions and maximum Q-values of McCall job seekers at different wage levels and iteration counts.
```python=
import numpy as np
from numba import jit, float64, int64
from numba.experimental import jitclass
from quantecon.distributions import BetaBinomial
import matplotlib.pyplot as plt
np.random.seed(123)
# Set parameters
n, a, b = 50, 250, 150
q_new = BetaBinomial(n, a, b).pdf()
w_min, w_max = 20, 100
w_new = np.linspace(w_min, w_max, n+1)
params=[
('c', float64),
('β', float64),
('w', float64[:]),
('q', float64[:]),
('eps', float64),
('δ', float64),
('lr', float64),
('T', int64),
('quit_allowed', int64)]
# Assign parameters to class
@jitclass(params)
class McCallModel:
    def __init__(self, c=25, β=0.99, w=w_new, q=q_new):
self.c, self.β = c, β
self.w, self.q = w, q
def state_action_values(self, i, v):
c, β, w, q = self.c, self.β, self.w, self.q
accept = w[i] / (1 - β)
reject = c + β * np.sum(v * q)
return np.array([accept, reject])
def VFI(self, eps=1e-5, max_iter=500):
n = len(self.w)
v = self.w / (1 - self.β)
v_next = np.empty_like(v)
flag=0
for i in range(max_iter):
for j in range(n):
v_next[j] = np.max(self.state_action_values(j, v))
if np.max(np.abs(v_next - v))<=eps:
flag=1
break
v[:] = v_next
return v, flag
def plot_value_function_seq(mcm, ax, num_plots=8):
    n = len(mcm.w)
v = mcm.w / (1 - mcm.β)
v_next = np.empty_like(v)
for i in range(num_plots):
ax.plot(mcm.w, v, '-', alpha=0.4, label=f"iterate {i}")
        for j in range(n):
            v_next[j] = np.max(mcm.state_action_values(j, v))
v[:] = v_next
ax.legend(loc='lower right')
class Qlearning_McCall:
    def __init__(self, c=25, β=0.99, w=w_new, q=q_new, eps=0.1,
δ=1e-5, lr=0.5, T=10000, quit_allowed=0):
self.c, self.β = c, β
self.w, self.q = w, q
self.eps, self.δ, self.lr, self.T = eps, δ, lr, T
self.quit_allowed = quit_allowed
def draw_offer_index(self):
q = self.q
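        # Inverse-transform sampling: draw a wage index i with probability q[i]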
return np.searchsorted(np.cumsum(q), np.random.random(), side="right")
def temp_diff(self, qtable, state, accept):
c, β, w = self.c, self.β, self.w
if accept==0:
state_next = self.draw_offer_index()
TD = c + β*np.max(qtable[state_next, :]) - qtable[state, accept]
else:
state_next = state
if self.quit_allowed == 0:
TD = w[state_next] + β*np.max(qtable[state_next, :]) - qtable[state, accept]
else:
TD = w[state_next] + β*qtable[state_next, 1] - qtable[state, accept]
return TD, state_next
def run_one_epoch(self, qtable, max_times=20000):
c, β, w = self.c, self.β, self.w
eps, δ, lr, T = self.eps, self.δ, self.lr, self.T
s0 = self.draw_offer_index()
s = s0
accept_count = 0
for t in range(max_times):
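            # ε-greedy exploration: follow the current Q-table, but flip the action with probability eps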
accept = np.argmax(qtable[s, :])
if np.random.random()<=eps:
accept = 1 - accept
if accept == 1:
accept_count += 1
else:
accept_count = 0
TD, s_next = self.temp_diff(qtable, s, accept)
qtable_new = qtable.copy()
qtable_new[s, accept] = qtable[s, accept] + lr*TD
if np.max(np.abs(qtable_new-qtable))<=δ:
break
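            # Stop the epoch once the worker has accepted for T consecutive periods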
if accept_count == T:
break
s, qtable = s_next, qtable_new
return qtable_new
def run_epochs(N, qlmc, qtable):
for n in range(N):
if n%(N/10)==0:
print(f"Progress: EPOCHs = {n}")
new_qtable = qlmc.run_one_epoch(qtable)
qtable = new_qtable
return qtable
def valfunc_from_qtable(qtable):
return np.max(qtable, axis=1)
def compute_error(valfunc, valfunc_VFI):
return np.mean(np.abs(valfunc-valfunc_VFI))
# Create an Instance of Qlearning_McCall
qlmc = Qlearning_McCall()
# Compute the benchmark value function via VFI
mcm = McCallModel(w=w_new, q=q_new)
valfunc_VFI, flag = mcm.VFI()
valfunc_VFI
# Run Q-learning and compare the learned value function with the VFI benchmark
def plot_epochs(epochs_to_plot, quit_allowed=1):
qlmc_new = Qlearning_McCall(w=w_new, q=q_new, quit_allowed=quit_allowed)
qtable = np.zeros((len(w_new),2))
epochs_to_plot = np.asarray(epochs_to_plot)
fig, ax = plt.subplots(figsize=(10,6))
ax.plot(w_new, valfunc_VFI, '-o', label='VFI')
max_epochs = np.max(epochs_to_plot)
for n in range(max_epochs + 1):
if n%(max_epochs/10)==0:
print(f"Progress: EPOCHs = {n}")
if n in epochs_to_plot:
valfunc_qlr = valfunc_from_qtable(qtable)
error = compute_error(valfunc_qlr, valfunc_VFI)
ax.plot(w_new, valfunc_qlr, '-o', label=f'QL:epochs={n}, mean error={error}')
new_qtable = qlmc_new.run_one_epoch(qtable)
qtable = new_qtable
ax.set_xlabel('Wages')
ax.set_ylabel('Optimal Value')
ax.set_title("Q-value Under Optimal Decision")
ax.legend(loc='lower right')
plt.show()
plot_epochs(epochs_to_plot=[100, 1000, 10000, 100000, 200000])
```
```
--- Value function computed by VFI (valfunc_VFI) ---
array([ 7553.16274943, 7553.16274943, 7553.16274943, 7553.16274943,
7553.16274943, 7553.16274943, 7553.16274943, 7553.16274943,
7553.16274943, 7553.16274943, 7553.16274943, 7553.16274943,
7553.16274943, 7553.16274943, 7553.16274943, 7553.16274943,
7553.16274943, 7553.16274943, 7553.16274943, 7553.16274943,
7553.16274943, 7553.16274943, 7553.16274943, 7553.16274943,
7553.16274943, 7553.16274943, 7553.16274943, 7553.16274943,
7553.16274943, 7553.16274943, 7553.16274943, 7553.16274943,
7553.16274943, 7553.16274943, 7553.16274943, 7600. ,
7760. , 7920. , 8080. , 8240. ,
8400. , 8560. , 8720. , 8880. ,
9040. , 9200. , 9360. , 9520. ,
9680. , 9840. , 10000. ])
```

It can be seen from the figure that the Q-learning algorithm's performance in learning the Q-table for rarely sampled wages is relatively poor. However, with sufficient iterations, the learned Q-table closely approximates the true optimal value function.
## References
* Intermediate Quantitative Economics with Python. QuantEcon.
https://python.quantecon.org/intro.html
* Contraction Mapping Principle. BYJU's.
https://byjus.com/maths/contraction-mapping-principle/
* Bellman Equation. Wikipedia.
https://en.wikipedia.org/wiki/Bellman_equation
* RL:Q-Learning — CS note by Jimmy Hsieh. Medium.
https://medium.com/@Jimmy_9566/rl-q-learning-cs%E7%AD%86%E8%A8%98-f182ab5252fa