# Reinforcement Learning
## What is Reinforcement Learning?
Reinforcement learning (RL) is a type of machine learning where an agent learns how to interact with an environment by taking actions and receiving feedback in the form of rewards. The agent's goal is to learn a strategy, called a policy, that maximizes cumulative rewards over time. This learning process is similar to how humans learn, by trying different things and observing the outcomes.

### Key Characteristics of RL
* **Goal-oriented:** The agent's objective is to maximize rewards.
* **Sequential decision making:** The agent's actions affect not only immediate rewards but also future situations.
* **Trial-and-error learning:** The agent explores the environment and learns from its experiences.
* **Interaction with an environment:** The agent actively interacts with the environment, receiving feedback and adapting its behavior.
### Why is RL Important?
RL is becoming increasingly important because it can solve complex problems that are difficult to address with traditional programming methods. It has been successfully applied in various areas:
* **Game playing:** RL agents have defeated human champions in games like Go, chess, and Atari games.
* **Robotics:** RL enables robots to learn complex motor skills and navigate challenging environments.
* **Control systems:** RL optimizes control strategies for things like autonomous vehicles and industrial processes.
* **Personalized recommendations:** RL can tailor recommendations to individual preferences.
* **Finance:** RL can develop trading strategies and optimize portfolio management.
### Mathematics in RL
Mathematics plays a crucial role in the foundation and development of RL algorithms:
* **Probability:** Understanding probability is essential for dealing with the uncertainty in RL environments.
* **Optimization:** RL agents aim to find the optimal policy, which involves solving optimization problems.
* **Dynamic programming:** This mathematical technique is used for planning and making sequential decisions in RL.
## Markov Decision Processes (MDPs)
Markov Decision Processes (MDPs) are a powerful framework for modeling sequential decision-making problems where actions have both immediate and long-term consequences. They are widely used in various fields, including robotics, finance, and artificial intelligence.

### Key Components of an MDP
An MDP formally describes an environment where an agent learns to make optimal decisions. It consists of five key elements:
* **State Space $\mathcal{S}$:** The set of all possible situations the agent can find itself in. For example, in a chess game, the state space would be all possible arrangements of pieces on the board.
* **Action Space $\mathcal{A}$:** The set of all actions the agent can take. In chess, this would be all legal moves for a given state.
* **Transition Probability Function $P$:** This function, denoted as $P(s'|s, a)$, specifies the probability of transitioning to a new state $s'$ when taking action $a$ in the current state $s$. It captures the inherent uncertainty in the environment.
* **Reward Function $R$:** This function, denoted as $R(s, a, s')$, defines the immediate reward the agent receives for taking action $a$ in state $s$ and transitioning to state $s'$. Rewards guide the agent towards desired behavior.
* **Discount Factor $\gamma$:** A value between 0 and 1 that determines the importance of future rewards relative to immediate rewards. A higher discount factor prioritizes long-term gains.
#### Markov Property
The Markov Property is fundamental to MDPs. It states that the future is independent of the past given the present. In other words, the current state contains all the information necessary to predict the future, and the history of how the agent reached that state is irrelevant. This property simplifies the problem significantly.
### Policies and Value Functions
* **Policy $\pi$:** A policy dictates the agent's behavior, mapping states to actions. A deterministic policy assigns a specific action to each state, while a stochastic policy assigns probabilities to different actions in each state.
* **Value Functions:** Value functions assess the long-term value of being in a particular state or taking a specific action under a given policy.
* **State-value function $V^{\pi}(s)$:** The expected cumulative discounted reward when starting in state $s$ and following policy $\pi$.
* **Action-value function $Q^{\pi}(s,a)$:** The expected cumulative discounted reward when starting in state $s$, taking action $a$, and then following policy $\pi$.
#### Bellman Equations
The Bellman equations provide a recursive relationship for calculating value functions. They express the value of a state or state-action pair in terms of the values of possible successor states.
* Bellman Equation for $V^{\pi}(s)$: $$V^\pi(s) = \sum_{a \in \mathcal{A}} \pi(s,a) \sum_{s'\in\mathcal{S}} P(s'|s,a) \left[R(s,a,s') + \gamma V^\pi(s') \right]$$
* Bellman Equation for $Q^\pi(s,a)$: $$Q^\pi(s,a) = \sum_{s'\in\mathcal{S}} P(s'|s,a) [R(s,a,s') + \gamma \sum_{a'\in\mathcal{A}} \pi(s',a')Q^\pi(s',a') ]$$
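To make the Bellman expectation equations concrete, here is a minimal sketch (a hypothetical two-state MDP with made-up numbers): once the policy is fixed, the first equation is a linear system $V^\pi = R^\pi + \gamma P^\pi V^\pi$ that NumPy can solve directly.

```python
import numpy as np

# Hypothetical 2-state MDP under a fixed policy pi (made-up numbers):
# P_pi[s, s'] = probability of landing in s' from s when following pi
# R_pi[s]     = expected immediate reward in s when following pi
P_pi = np.array([[0.7, 0.3],
                 [0.4, 0.6]])
R_pi = np.array([1.0, -0.5])
gamma = 0.9

# Bellman expectation equation in matrix form: V = R_pi + gamma * P_pi @ V,
# i.e. (I - gamma * P_pi) V = R_pi.
V = np.linalg.solve(np.eye(2) - gamma * P_pi, R_pi)
print(V)  # value of each state under pi
```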
### Optimal Value Functions and Policies
An optimal policy maximizes the expected cumulative discounted reward. The optimal state-value function $V^*(s)$ and optimal action-value function $Q^*(s,a)$ give the maximum possible expected return from a state or a state-action pair, respectively.
#### Bellman Optimality Equations
These equations define the optimal value functions:
* For $V^*(s)$: $$V^*(s) = \max_{a \in \mathcal{A}} \sum_{s' \in \mathcal{S}} P(s'|s,a) \left[R(s,a,s') + \gamma V^*(s')\right]$$
* For $Q^*(s,a)$: $$Q^*(s,a) = \sum_{s' \in \mathcal{S}} P(s'|s,a) \left[R(s,a,s') + \gamma \max_{a' \in \mathcal{A}} Q^*(s',a')\right]$$
#### Solving MDPs
Various techniques can be used to find optimal policies in MDPs, including dynamic programming (value iteration, policy iteration) and reinforcement learning (Q-learning, SARSA).
### Example: Simple Inventory MDP
Let's formulate a simple inventory management scenario as a formal MDP. We'll simplify some aspects to keep the core concepts clear.
#### MDP Formulation
* **State Space $\mathcal{S}$:** The state is the inventory level $s$ at the end of the day. We'll assume a simplified scenario with no on-order inventory, i.e., the store receives its orders instantaneously.
* $\mathcal{S} = \{0, 1, 2, \dots, C\}$, where $C$ is the capacity of the store.
* **Action Space $\mathcal{A}$:** The action is the amount of inventory to order at the beginning of each day.
* $\mathcal{A} = \{0, 1, 2, \dots, C\}$; in state $s$ only orders up to $C - s$ are feasible, so the inventory never exceeds the store's capacity.
* **Transition Probability Function $P$:** This function, $P(s'|s, a)$, defines the probability of transitioning to state $s'$ (tomorrow's inventory level) given the current state $s$ (today's inventory level) and action $a$ (the amount ordered today). This probability is determined by the demand distribution.
* Let $Poisson(\lambda, i)$ denote the probability of demand being $i$ under a Poisson distribution with parameter $\lambda$.
* If we order $a$ units and the demand is $i$, the next state is:
* $s' = \max(s + a - i, 0)$ (inventory cannot go negative; unmet demand is lost)
* Therefore:
* $P(s'|s, a) = Poisson(\lambda, s + a - s')$ if $1 \leq s' \leq s + a$
* $P(0|s, a) = \sum_{k=s+a}^\infty Poisson(\lambda, k)$ (demand meets or exceeds the available stock)
* $P(s'|s, a) = 0$ otherwise.
* **Reward Function $R$:** The reward $R(s, a, s')$ is the negative of the costs incurred in a day.
* $R(s, a, s') = -h \cdot s$ if $s' > 0$: only the holding cost for carrying $s$ units of inventory overnight is incurred.
* $R(s, a, 0) = -h \cdot s - p \cdot \mathbb{E}[D - (s + a) \mid D \geq s + a]$, where $D \sim Poisson(\lambda)$: when the day ends with empty shelves, an expected stockout penalty for the unmet demand is added. The exact shortfall cannot be recovered from $s'$ alone, so it is charged in expectation; averaged over transitions this amounts to a cost of $p \cdot \mathbb{E}[\max(D - (s + a), 0)]$ per day.
* **Discount Factor $\gamma$:** A value between 0 and 1.
#### Goal
The goal is to find an optimal policy $\pi^*$ that maps states (inventory levels) to actions (order amounts) in order to maximize the expected discounted cumulative reward (minimize the total discounted cost) over time.
#### Finding the Optimal Policy
To find the optimal policy, you can use techniques like:
* **Value Iteration:** Iteratively update the value function until it converges to the optimal value function, and then derive the optimal policy from it.
* **Policy Iteration:** Start with an arbitrary policy, evaluate it, and then improve it iteratively until it converges to the optimal policy.
* **Q-learning:** A model-free reinforcement learning algorithm that can learn the optimal policy directly through interaction with the environment.
#### Example Calculations
Let's say:
* $C$ (capacity) = 5
* $h$ (holding cost per unit per night) = 1
* $p$ (stockout cost per unit of unmet demand) = 5
* $\lambda$ (average daily demand) = 2
* $\gamma$ (discount factor) = 0.9

Now, you can use the Bellman equations and the transition probabilities to calculate the value functions and eventually find the optimal policy. This might involve some computation, but it demonstrates how the MDP framework can be applied to a practical inventory management problem.
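As a hedged illustration of how those calculations might look in code, here is a minimal NumPy/SciPy sketch of this inventory MDP, with the stockout penalty folded into an expected one-step reward $R(s, a)$ (which leaves the optimal values and policy unchanged), solved by value iteration:

```python
import numpy as np
from scipy.stats import poisson

# Parameters from the example above
C, h, p, lam, gamma = 5, 1.0, 5.0, 2.0, 0.9
n = C + 1                                   # inventory levels 0..C

P = np.zeros((n, n, n))                     # P[s, a, s']
R = np.full((n, n), -np.inf)                # expected reward R(s, a); -inf marks infeasible orders

for s in range(n):
    for a in range(C - s + 1):              # feasible orders only (a <= C - s)
        stock = s + a
        # Demand i < stock leaves s' = stock - i; demand >= stock leads to s' = 0.
        for i in range(stock):
            P[s, a, stock - i] += poisson.pmf(i, lam)
        P[s, a, 0] += 1.0 - poisson.cdf(stock - 1, lam)
        # Holding cost plus expected stockout penalty (truncated sum is fine for small lambda).
        expected_shortfall = sum((i - stock) * poisson.pmf(i, lam)
                                 for i in range(stock + 1, stock + 60))
        R[s, a] = -h * s - p * expected_shortfall

# Value iteration with the Bellman optimality backup
V = np.zeros(n)
for _ in range(10_000):
    Q = R + gamma * (P @ V)                 # Q[s, a]; infeasible actions stay at -inf
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-8:
        V = V_new
        break
    V = V_new

pi_star = Q.argmax(axis=1)
print("optimal order quantity per inventory level:", pi_star)
print("optimal state values:", np.round(V, 2))
```

The printed policy is simply the greedy policy with respect to the converged value function, as described in the value iteration technique above.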
#### Important Notes
* This is a simplified model. In real-world scenarios, you might have lead times for orders, different demand distributions, and other factors to consider.
* The complexity of solving the MDP grows with the size of the state and action spaces.
* You can use software tools and libraries to help with the calculations and finding the optimal policy.
## Dynamic Programming
Dynamic Programming (DP) is a powerful algorithmic paradigm for solving problems by breaking them down into smaller, overlapping subproblems and storing their solutions to avoid redundant computations. In reinforcement learning, DP provides a foundation for finding optimal policies in Markov Decision Processes (MDPs) when we have complete knowledge of the environment's dynamics (transition probabilities and rewards). This makes DP a planning method, as opposed to a learning method that must discover the environment's dynamics through interaction.
### Core Concepts in Dynamic Programming
* **Principle of Optimality:** This principle, formulated by Richard Bellman, states that any optimal policy has the property that, regardless of the initial state and decision, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision. This principle is the foundation of DP.
* **Bellman Equation:** The Bellman equation is a recursive equation that expresses the relationship between the value of a state and the values of its successor states. It plays a central role in DP algorithms.
* **Prediction:** This involves evaluating a given policy by calculating its value function, which estimates the expected future rewards the agent will obtain when following that policy.
* **Control:** This focuses on finding an optimal policy that maximizes the expected cumulative reward.
### Fixed-Point Theory and DP
Fixed-point theory provides a mathematical framework for understanding the convergence of DP algorithms.
* **Fixed Point:** A fixed point of a function is a value that remains unchanged when the function is applied to it. Formally, for a function $f(x)$, a fixed point x* satisfies $f(x^*) = x^*$.
* **Iterative Methods:** Many DP algorithms are iterative. They start with an initial guess and repeatedly apply an update rule (based on the Bellman equation) until the value function converges to a fixed point, which corresponds to the true value function or the optimal value function.
* **Banach Fixed-Point Theorem:** This theorem provides conditions under which a function has a unique fixed point and guarantees that iterative methods will converge to that fixed point. It essentially requires the function to be a contraction mapping, meaning it "shrinks" the distance between points. This theorem assures us that value iteration and policy iteration will converge to the correct solutions under appropriate conditions.
### Dynamic Programming Algorithms
#### Policy Evaluation
Policy evaluation aims to determine the value function $V^{\pi}(s)$ for a given policy $\pi$. This tells us how good the policy is by estimating the expected cumulative discounted reward from each state when following $\pi$. The algorithm uses an iterative approach based on the Bellman expectation equation:
$$
V^{\pi}(s) = R^{\pi}(s) + \gamma \sum_{s' \in S} P^{\pi}(s' | s) V^{\pi}(s').
$$
where:
* $V^{\pi}(s)$ is the value of state $s$ under policy $\pi$
* $R^{\pi}(s)$ is the expected immediate reward in state $s$ under policy $\pi$
* $\gamma$ is the discount factor
* $P^{\pi}(s'|s)$ is the probability of transitioning to state $s'$ from state $s$ under policy $\pi$
The algorithm starts with an initial guess for $V^{\pi}(s)$ (usually 0) and iteratively updates it using the Bellman expectation equation until convergence.
:::info
**Algorithm: Policy Evaluation**
**Input:**
- `MDP`: The Markov Decision Process `(states, actions, transition probabilities, rewards)`
- `pi`: The policy to be evaluated
- `gamma`: Discount factor (`0 < gamma < 1`)
- `theta`: A small threshold determining the accuracy of the estimation
**Output:**
- `V`: Estimated value function
**Initialization:**
- `V(s) = 0` for all states `s`
**Algorithm:**
1. Repeat until convergence:
- `delta = 0` (Track the maximum change in value estimates)
- For each state `s`:
- `v = V(s)` (Store the old value estimate)
- `V(s) = R^pi(s) + gamma * sum_{s'} P^pi(s' | s) * V(s')` (Update using the Bellman equation)
- `delta = max(delta, |v - V(s)|)` (Update the maximum change)
- If `delta < theta`: (Check for convergence)
- break
:::
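The pseudocode above translates almost line-for-line into NumPy. The sketch below is a synchronous variant (it backs up all states at once) and assumes the policy has already been folded into an expected-reward vector `R_pi` and a transition matrix `P_pi`:

```python
import numpy as np

def policy_evaluation(P_pi, R_pi, gamma=0.9, theta=1e-8):
    """Iterative policy evaluation.

    P_pi : (n, n) array, P_pi[s, s'] = transition probability under the policy
    R_pi : (n,) array, expected immediate reward in each state under the policy
    """
    V = np.zeros(len(R_pi))
    while True:
        V_new = R_pi + gamma * P_pi @ V      # Bellman expectation backup for every state
        delta = np.max(np.abs(V_new - V))    # largest change in any value estimate
        V = V_new
        if delta < theta:                    # stop once the estimates have stabilized
            return V
```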
#### Policy Iteration
Policy iteration is an algorithm that alternates between policy evaluation and policy improvement to find an optimal policy.
1. **Policy Evaluation:** Evaluate the current policy $\pi$ and obtain its value function $V^\pi$.
2. **Policy Improvement:** Create a new policy $\pi'$ that is greedy with respect to $V^\pi$. This means selecting the action in each state that maximizes the expected value.
3. **Repeat:** Iterate steps 1 and 2 until the policy converges (no further improvements can be made).
Policy iteration is guaranteed to converge to an optimal policy for finite MDPs.
:::info
**Algorithm: Policy Iteration**
**Input:**
- `MDP`: The Markov Decision Process `(states, actions, transition probabilities, rewards)`
- `gamma`: Discount factor (`0 < gamma < 1`)
**Output:**
- `pi`: Optimal policy
- `V`: Optimal value function
**Initialization:**
- `pi(s) = a random action for all states s` (Initialize with an arbitrary policy)
- `V(s) = 0` for all states `s`
**Algorithm:**
1. Repeat until policy converges:
- Policy Evaluation:
- Compute the value function `V^pi` for the current policy `pi`
- Policy Improvement:
- For each state `s`:
- `pi'(s) = argmax_a { R(s, a) + gamma * sum_{s'} P(s, a, s') * V^pi(s') }` (Choose the action that maximizes expected value)
- If `pi' == pi`: (Check if the policy has converged)
- break
- else:
- `pi = pi'` (Update the policy)
2. Return `pi`, `V^pi`
:::
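A compact sketch of the same loop, assuming the MDP is given as dense arrays `P[s, a, s']` and expected rewards `R[s, a]` (array names are illustrative):

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9, theta=1e-8):
    """P: (n_states, n_actions, n_states) transitions; R: (n_states, n_actions) expected rewards."""
    n_states, _, _ = P.shape
    pi = np.zeros(n_states, dtype=int)            # start from an arbitrary (all-zeros) policy
    while True:
        # Policy evaluation: iterate the Bellman expectation backup for the current policy.
        V = np.zeros(n_states)
        while True:
            V_new = R[np.arange(n_states), pi] + gamma * P[np.arange(n_states), pi] @ V
            if np.max(np.abs(V_new - V)) < theta:
                V = V_new
                break
            V = V_new
        # Policy improvement: act greedily with respect to V.
        Q = R + gamma * (P @ V)
        pi_new = Q.argmax(axis=1)
        if np.array_equal(pi_new, pi):            # converged: the greedy policy no longer changes
            return pi, V
        pi = pi_new
```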
#### Value Iteration
Value iteration directly finds the optimal value function $V^*$ by iteratively applying the Bellman optimality equation:
$$
V^*(s) = \max_{a \in \mathcal{A}} \left\{ R(s, a) + \gamma \sum_{s' \in \mathcal{S}} P(s'|s, a) V^*(s') \right\}
$$
This equation expresses the optimal value of a state in terms of the optimal values of its successor states. The algorithm starts with an initial guess for $V^*$ and repeatedly updates it using the Bellman optimality equation until it converges. Once $V^*$ is found, the optimal policy can be extracted by choosing the action in each state that maximizes the expected value.
Value iteration is also guaranteed to converge to the optimal value function for finite MDPs.
:::info
**Value Iteration Algorithm**
**Input:**
- `MDP`: The Markov Decision Process `(states, actions, transition probabilities, rewards)`
- `gamma`: Discount factor (`0 < gamma < 1`)
- `theta`: A small threshold determining the accuracy of the estimation
**Output:**
- `V`: Optimal value function
- `pi`: Optimal policy
**Initialization:**
- `V(s) = 0` for all states `s`
**Algorithm:**
1. Repeat until convergence:
- `delta = 0` (Track the maximum change in value estimates)
- For each state `s`:
- `v = V(s)` (Store the old value estimate)
- `V(s) = max_{a} { R(s, a) + gamma * sum_{s'} P(s, a, s') * V(s') }` (Update using the Bellman optimality equation)
- `delta = max(delta, |v - V(s)|)` (Update the maximum change)
- If `delta < theta`: (Check for convergence)
- break
2. For each state `s`: (Extract the optimal policy)
- `pi(s) = argmax_a { R(s, a) + gamma * sum_{s'} P(s, a, s') * V(s') }`
3. Return `V`, `pi`
:::
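And the corresponding sketch for value iteration, under the same array conventions as the policy-iteration example above:

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, theta=1e-8):
    """P: (n_states, n_actions, n_states) transitions; R: (n_states, n_actions) expected rewards."""
    V = np.zeros(P.shape[0])
    while True:
        Q = R + gamma * (P @ V)              # Bellman optimality backup for every (s, a)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < theta:
            break
        V = V_new
    pi = Q.argmax(axis=1)                    # extract the greedy (optimal) policy
    return V_new, pi
```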
#### Generalized Policy Iteration (GPI)
Generalized Policy Iteration (GPI) is a general framework that encompasses both policy iteration and value iteration. It views the process of finding an optimal policy as an interplay between policy evaluation and policy improvement. GPI allows for flexibility in how these two processes are interleaved: for example, you might perform only partial policy evaluation between improvement steps, or use function approximation to represent the value function.


:::info
**Algorithm: GPI with Monte Carlo Evaluation**
**Input:**
- `environment`: The environment to interact with
- `gamma`: Discount factor (`0 < gamma <= 1`)
- `epsilon`: Exploration probability
**Output:**
- `Q`: Estimated action-value function
- `pi`: Learned policy
**Initialization:**
- `Q(s, a) = 0` for all state-action pairs `(s, a)`
- `pi(s) = a random action for all states s`
**Algorithm:**
1. Repeat for each episode:
- Generate an episode following policy `pi`: `[(S_0, A_0, R_1), (S_1, A_1, R_2), ..., (S_T-1, A_T-1, R_T)]`
- `G = 0`
- For each step `t = T-1, T-2, ..., 0`:
- `G = gamma * G + R_{t+1}`
- If `(S_t, A_t)` is NOT visited earlier in the episode:
- Append `G` to `Returns(S_t, A_t)` (Maintain a list of returns for each state-action pair)
- `Q(S_t, A_t) = average(Returns(S_t, A_t))`
- `pi(S_t) = argmax_a Q(S_t, a)` (with epsilon-greedy exploration)
:::
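A possible Python sketch of this Monte Carlo flavour of GPI is shown below. It assumes a gymnasium-style episodic environment with hashable (e.g. discrete) observations, i.e. `env.reset()` returns `(state, info)` and `env.step(a)` returns `(state, reward, terminated, truncated, info)`; it also uses an incremental average in place of storing the full `Returns(s, a)` lists, which is equivalent.

```python
import numpy as np
from collections import defaultdict

def mc_control(env, num_episodes=5000, gamma=0.99, epsilon=0.1):
    """First-visit Monte Carlo control with epsilon-greedy exploration."""
    n_actions = env.action_space.n
    Q = defaultdict(lambda: np.zeros(n_actions))
    visit_counts = defaultdict(lambda: np.zeros(n_actions))

    for _ in range(num_episodes):
        # Generate one episode under the current epsilon-greedy policy.
        state, _ = env.reset()
        episode, done = [], False
        while not done:
            if np.random.rand() < epsilon:
                action = np.random.randint(n_actions)
            else:
                action = int(np.argmax(Q[state]))
            next_state, reward, terminated, truncated, _ = env.step(action)
            episode.append((state, action, reward))
            state, done = next_state, terminated or truncated

        # First-visit Monte Carlo update, working backwards through the episode.
        first_visit = {}
        for t, (s, a, _) in enumerate(episode):
            first_visit.setdefault((s, a), t)
        G = 0.0
        for t in range(len(episode) - 1, -1, -1):
            s, a, r = episode[t]
            G = gamma * G + r
            if first_visit[(s, a)] == t:
                visit_counts[s][a] += 1
                # Incremental average of returns: equivalent to averaging Returns(s, a).
                Q[s][a] += (G - Q[s][a]) / visit_counts[s][a]

    policy = {s: int(np.argmax(q)) for s, q in Q.items()}
    return Q, policy
```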
### Limitations of Dynamic Programming
* **Curse of Dimensionality:** DP suffers from the curse of dimensionality, meaning that the computational and memory requirements grow exponentially with the size of the state space. This limits its applicability to problems with large state spaces.
* **Model-Based:** DP requires a complete and accurate model of the environment (transition probabilities and rewards). In many real-world problems, this model is not available.
## Temporal-Difference (TD) Learning
Temporal-Difference (TD) learning is a fundamental concept in reinforcement learning. It enables an agent to learn directly from its experiences, continuously refining its knowledge through a process of trial and error. Unlike Monte Carlo methods, which rely on complete episodes to update value estimates, TD learning can update its estimates after every time step. This ability to learn incrementally makes TD learning more efficient and adaptable, especially in complex and dynamic environments.
### TD Learning: A Model-Free Approach
TD learning is **model-free**, meaning it doesn't require a prior model of the environment's dynamics (transition probabilities and rewards). It learns directly from the agent's interactions with the environment, making it suitable for situations where the environment is complex, unknown, or difficult to model explicitly.
### TD(0): The Foundation of TD Learning
The most basic TD learning algorithm is TD(0). It updates the value function $V(S_t)$ of the current state $S_t$ based on the immediate reward $R_{t+1}$ and the estimated value $V(S_{t+1})$ of the next state $S_{t+1}$. The TD(0) update rule can be expressed as:
$$
V(S_t) \leftarrow V(S_t) + \alpha [R_{t+1} + \gamma V(S_{t+1}) - V(S_t)]
$$
where:
* $\alpha$ is the learning rate, a small positive number that controls the step size of the updates.
* $\gamma$ is the discount factor, a value between 0 and 1 that determines the weight given to future rewards.
The term $R_{t+1} + \gamma V(S_{t+1})$ is the **TD target**, an estimate of the return based on the immediate reward and the current estimate of the next state's value. The difference between the TD target and the current state's value estimate, $R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$, is the **TD error**. It represents the "surprise" the agent experiences when its prediction differs from the actual outcome.
:::info
#### Algorithm: TD(0) Prediction
**Input:**
- `pi`: Policy to be evaluated
- `episodes`: A list of episodes, where each episode is a sequence of `(state, reward, next_state)` triples
- `gamma`: Discount factor (`0 < gamma <= 1`)
- `alpha`: Learning rate
**Output:**
- `V`: Estimated value function
**Initialization:**
- `V(s) = 0` for all states `s`
**Algorithm:**
1. For each `episode` in `episodes`:
- For each step `t` in `range(len(episode) - 1)`:
- `s_t, r_{t+1}, s_{t+1} = episode[t]`
- `delta = r_{t+1} + gamma * V(s_{t+1}) - V(s_t)`
- `V(s_t) = V(s_t) + alpha * delta`
2. Return `V`
:::
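A direct Python translation of the TD(0) prediction pseudocode, assuming the experience has been logged as `(state, reward, next_state)` triples as described above:

```python
from collections import defaultdict

def td0_prediction(episodes, gamma=0.99, alpha=0.1):
    """TD(0) policy evaluation from logged experience.

    episodes: list of episodes, each a list of (state, reward, next_state) triples
    generated by the policy being evaluated; terminal states keep their default value 0.
    """
    V = defaultdict(float)
    for episode in episodes:
        for state, reward, next_state in episode:
            td_target = reward + gamma * V[next_state]     # bootstrapped estimate of the return
            V[state] += alpha * (td_target - V[state])     # move V(s) toward the TD target
    return dict(V)
```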
### Bootstrapping: Learning from Estimates
TD learning utilizes a concept called **bootstrapping**, where it updates estimates based on other estimates. In TD(0), the value estimate of the current state is updated using the estimated value of the next state. This approach is similar to dynamic programming, but instead of relying on a complete model of the environment, TD learning leverages the agent's actual experiences.
### Connecting TD Learning to Dynamic Programming
TD learning can be viewed as a bridge between Monte Carlo methods and dynamic programming. Like Monte Carlo methods, it learns directly from experience, but like dynamic programming, it uses bootstrapping to update estimates. However, unlike dynamic programming, TD learning doesn't require a full model of the environment.
### SARSA: On-Policy TD Control
SARSA (State-Action-Reward-State-Action) is an **on-policy** TD control algorithm. It learns the value function for the current policy being followed by the agent. The SARSA update rule is:
$$
Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha [R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)]
$$
where $Q(S_t, A_t)$ is the estimated action-value function for taking action $A_t$ in state $S_t$.
:::info
**Algorithm: SARSA**
**Input:**
- `environment`: The environment to interact with
- `gamma`: Discount factor (`0 < gamma <= 1`)
- `alpha`: Learning rate
- `epsilon`: Exploration probability
- `num_episodes`: Number of episodes to run
**Output:**
- `Q`: Estimated action-value function
- `pi`: Learned policy
**Initialization:**
- `Q(s, a) = 0` for all state-action pairs `(s, a)`
- `pi(s) = a random action for all states s`
**Algorithm:**
1. For `episode` in `range(num_episodes)`:
- Observe the initial state `S`
- Choose action `A` using an epsilon-greedy policy based on `Q(S, .)`
- While `S` is not terminal:
- Take action `A`, observe reward `R` and next state `S'`
- Choose action `A'` using an epsilon-greedy policy based on `Q(S', .)`
- `Q(S, A) = Q(S, A) + alpha * (R + gamma * Q(S', A') - Q(S, A))`
- `S = S'`
- `A = A'`
2. Return `Q` and the greedy policy `pi(s) = argmax_a Q(s, a)`
:::
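A minimal SARSA sketch, again assuming a gymnasium-style environment with `Discrete` observation and action spaces (the interface is an assumption, not part of the algorithm):

```python
import numpy as np

def sarsa(env, num_episodes=5000, gamma=0.99, alpha=0.1, epsilon=0.1):
    """SARSA (on-policy TD control)."""
    n_s, n_a = env.observation_space.n, env.action_space.n
    Q = np.zeros((n_s, n_a))

    def eps_greedy(s):
        return np.random.randint(n_a) if np.random.rand() < epsilon else int(np.argmax(Q[s]))

    for _ in range(num_episodes):
        s, _ = env.reset()
        a = eps_greedy(s)
        done = False
        while not done:
            s2, r, terminated, truncated, _ = env.step(a)
            done = terminated or truncated
            a2 = eps_greedy(s2)
            # On-policy target: bootstrap from the action actually chosen in the next state.
            target = r + (0.0 if done else gamma * Q[s2, a2])
            Q[s, a] += alpha * (target - Q[s, a])
            s, a = s2, a2
    return Q
```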
### Q-learning: Off-Policy TD Control
Q-learning is an **off-policy** TD control algorithm. It learns an optimal policy's value function, even if the agent is following a different exploratory policy. The Q-learning update rule is:
$$
Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha [R_{t+1} + \gamma \max_{a'} Q(S_{t+1}, a') - Q(S_t, A_t)]
$$
The key difference between SARSA and Q-learning lies in how they estimate the value of the next state-action pair. SARSA uses the Q-value of the actual next action taken by the agent, while Q-learning uses the maximum Q-value over all possible actions in the next state.
:::info
**Algorithm: Q-Learning**
**Input:**
- `environment`: The environment to interact with
- `gamma`: Discount factor (`0 < gamma <= 1`)
- `alpha`: Learning rate
- `epsilon`: Exploration probability
**Output:**
- `Q`: Estimated action-value function
**Initialization:**
- `Q(s, a) = 0` for all state-action pairs `(s, a)`
**Algorithm:**
1. Repeat for each episode:
- Observe the initial state `S`
- Repeat for each step of episode:
- Choose action `A` using an epsilon-greedy policy based on `Q(S, .)`
- Take action `A`, observe reward `R` and next state `S'`
- `Q(S, A) = Q(S, A) + alpha * (R + gamma * max_{a'} Q(S', a') - Q(S, A))`
- `S = S'`
:::
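The corresponding Q-learning sketch differs from SARSA only in the bootstrapped target, which uses the maximum over next-state actions rather than the action actually taken (same assumed environment interface as the SARSA sketch):

```python
import numpy as np

def q_learning(env, num_episodes=5000, gamma=0.99, alpha=0.1, epsilon=0.1):
    """Q-learning (off-policy TD control)."""
    n_s, n_a = env.observation_space.n, env.action_space.n
    Q = np.zeros((n_s, n_a))
    for _ in range(num_episodes):
        s, _ = env.reset()
        done = False
        while not done:
            # Behave epsilon-greedily...
            a = np.random.randint(n_a) if np.random.rand() < epsilon else int(np.argmax(Q[s]))
            s2, r, terminated, truncated, _ = env.step(a)
            done = terminated or truncated
            # ...but bootstrap from the greedy (max) action in the next state.
            target = r + (0.0 if done else gamma * np.max(Q[s2]))
            Q[s, a] += alpha * (target - Q[s, a])
            s = s2
    return Q
```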
### TD(λ): Bridging TD and Monte Carlo
TD(λ) introduces **eligibility traces** to TD learning. Eligibility traces
$$
E_t(s) = \begin{cases}
\gamma \lambda E_{t-1}(s) + 1 &\text{if } s = S_t \\
\gamma \lambda E_{t-1}(s) &\text{if } s \neq S_t
\end{cases}
$$
act as a short-term memory, keeping track of recently visited states. This allows TD(λ) to assign credit to states visited several steps earlier via the update
$$
V(s) \leftarrow V(s) + \alpha \, \delta_t E_t(s) \quad \forall s,
$$
where $\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$ is the TD error. In this way TD(λ) bridges the gap between TD(0), which focuses on the immediate next state, and Monte Carlo methods, which consider the entire episode.
The parameter $\lambda$ in TD(λ) controls the balance between these two approaches:
* $\lambda = 0$: TD(λ) is equivalent to TD(0).
* $\lambda = 1$: TD(λ) becomes similar to a Monte Carlo method.
* $0 < \lambda < 1$: TD(λ) blends TD and Monte Carlo.
TD(λ) offers several advantages:
* **Efficiency:** Eligibility traces enable more efficient credit assignment.
* **Flexibility:** The $\lambda$ parameter allows for adjusting the learning process.
* **Generality:** TD(λ) encompasses a range of algorithms from TD(0) to Monte Carlo.
:::info
**Algorithm: TD($\lambda$) Prediction**
**Input:**
- `pi`: Policy to be evaluated
- `episodes`: A list of episodes, where each episode is a sequence of `(state, reward)` pairs
- `gamma`: Discount factor (`0 < gamma <= 1`)
- `lambda`: Trace decay parameter (`0 <= lambda <= 1`)
- `alpha`: Learning rate
**Output:**
- `V`: Estimated value function
**Initialization:**
- `V(s) = 0` for all states `s`
**Algorithm:**
1. For each `episode` in `episodes`:
- `E(s) = 0` for all states `s` (Initialize eligibility traces)
- For each step `t` in `range(len(episode) - 1)`:
- `s_t, r_{t+1} = episode[t]`
- `s_{t+1} = episode[t+1][0]` (the state at the next step)
- `delta = r_{t+1} + gamma * V(s_{t+1}) - V(s_t)`
- `E(s_t) = E(s_t) + 1`
- For each state `s`:
- `V(s) = V(s) + alpha * delta * E(s)`
- `E(s) = gamma * lambda * E(s)`
2. Return `V`
:::
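A Python sketch of TD(λ) prediction with accumulating eligibility traces, assuming each logged episode is a list of `(state, reward)` pairs in which the reward is the one received on leaving that state and the final entry is the terminal state:

```python
from collections import defaultdict

def td_lambda_prediction(episodes, gamma=0.99, lam=0.8, alpha=0.1):
    """TD(lambda) policy evaluation with accumulating eligibility traces."""
    V = defaultdict(float)
    for episode in episodes:
        E = defaultdict(float)                             # eligibility traces, reset per episode
        for t in range(len(episode) - 1):
            s_t, r_next = episode[t]
            s_next = episode[t + 1][0]
            delta = r_next + gamma * V[s_next] - V[s_t]    # TD error
            E[s_t] += 1.0                                  # accumulate trace for the visited state
            for s in list(E):
                V[s] += alpha * delta * E[s]               # credit recently visited states
                E[s] *= gamma * lam                        # decay all traces
    return dict(V)
```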
## Function Approximation
In reinforcement learning (RL), we often encounter problems with enormous state spaces. Consider a robot navigating a cluttered room or an AI agent playing a video game with millions of possible screen configurations. In these situations, traditional tabular methods, where we store a separate value for each state, become computationally intractable due to the **curse of dimensionality**. This is where function approximation becomes essential.
### Function Approximation: A Scalable Solution
Function approximation allows us to represent the value function compactly using a parameterized function, such as:
* **Linear Function Approximators:** These use a weighted sum of features to represent the value function. They are relatively simple to implement and are often effective for problems with linear or near-linear relationships between features and values.
* **Neural Networks:** These are powerful function approximators that can capture complex non-linear relationships. Their ability to learn intricate patterns has led to remarkable success in various RL domains, including game playing, robotics, and natural language processing.
By using function approximation, we gain several crucial advantages:
* **Generalization:** The function approximator can learn to generalize from a limited set of experiences to unseen states, enabling the agent to make reasonable predictions and decisions in novel situations.
* **Handling Continuous States:** Function approximation can effectively represent value functions for continuous state spaces, which are common in real-world problems where states are often described by continuous variables (e.g., position, velocity, temperature).
* **Efficiency:** A compact function approximator requires significantly less memory than a table storing a separate value for every state, making it feasible to handle large state spaces.
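As a concrete illustration of the linear case, here is a sketch of semi-gradient TD(0) with a linear value function $V_w(s) = w^\top \phi(s)$; the feature map `phi` and the logged transitions are hypothetical placeholders:

```python
import numpy as np

def semi_gradient_td0(transitions, phi, n_features, gamma=0.99, alpha=0.01):
    """Semi-gradient TD(0) with a linear value function V(s) = w . phi(s).

    transitions : iterable of (state, reward, next_state, done) tuples from the policy
    phi         : feature map, state -> np.ndarray of length n_features (placeholder)
    """
    w = np.zeros(n_features)
    for s, r, s_next, done in transitions:
        v_s = w @ phi(s)
        v_next = 0.0 if done else w @ phi(s_next)
        td_error = r + gamma * v_next - v_s
        w += alpha * td_error * phi(s)     # gradient of V_w(s) with respect to w is phi(s)
    return w
```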
### Combining Function Approximation with RL: Challenges
Integrating function approximation with RL introduces some important challenges:
* **The Deadly Triad:** Combining function approximation, bootstrapping (updating estimates based on other estimates), and off-policy learning (learning about one policy while following another) can lead to instability and divergence in the learning process. This combination is known as the "deadly triad" because together these three elements can cause the learning algorithm to fail to converge.
* **Overfitting:** Function approximators, especially complex ones like neural networks, can overfit to the training data, leading to poor generalization to new, unseen states.
### Addressing the Challenges
Several techniques can help mitigate the challenges of using function approximation in RL:
* **Gradient TD Methods:** These methods provide more stable updates for the parameters of the function approximator, reducing the instability caused by the deadly triad. They achieve this by using gradient descent to minimize the error between the estimated value function and the TD target.
* **Regularization:** Techniques like weight decay or dropout can help prevent overfitting by discouraging the function approximator from relying too heavily on any particular features in the training data.
* **Careful Parameter Tuning:** Selecting appropriate learning rates, exploration strategies, and network architectures is crucial for successful learning with function approximation.
In summary, function approximation is a vital tool for scaling RL to complex problems with large state spaces. While it presents some challenges, various techniques can help address these issues and enable effective learning with function approximators.
## Reference
- Rao, A., & Jelvis, T. (2022). Foundations of Reinforcement Learning with Applications in Finance (1st ed.). Chapman and Hall/CRC. [link](https://stanford.edu/~ashlearn/RLForFinanceBook/book.pdf)
## Further Reading
### Markov Decision Processes
* Puterman, M. L. (1994). Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc.
* Bäuerle, N., & Rieder, U. (2011). Markov Decision Processes with Applications to Finance. Springer.
### Dynamic Programming
* Bertsekas, D. P. (2017). Dynamic Programming and Optimal Control, Vol. I (4th ed.). Athena Scientific.
* Bertsekas, D. P. (2012). Dynamic Programming and Optimal Control, Vol. II (4th ed.). Athena Scientific.
* Powell, W. B. (2022). Reinforcement Learning and Stochastic Optimization: A Unified Framework for Sequential Decisions. John Wiley & Sons.
### Reinforcement Learning
* Bertsekas, D. P. (2023). A Course in Reinforcement Learning. Athena Scientific.
* Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction. MIT Press.
* Bertsekas, D. P. (2019-2024). Reinforcement Learning. [Course lectures](https://youtube.com/playlist?list=PLmH30BG15SIoXhxLldoio0BhsIY84YMDj&si=6za52g4m_wgY1hUP). Arizona State University, Tempe, AZ.
* Szepesvári, C. (2010). Algorithms for Reinforcement Learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 9. Morgan & Claypool Publishers.