## Question Setup
For these conceptual questions, you will be working with an MDP with three discrete states $S=\{s_1, s_2, s_3\}$ and two discrete actions $A=\{a_1, a_2\}$.
You are using a policy estimated by a logistic regression for each state. The parameters of these logistic regressions are $\theta = [[0.5, -0.5], [0.2, 0.2], [0.1, 0.7]]$, with one row per state and one column per action. The action probabilities are computed as $\pi(\cdot | s)=\text{softmax}(\theta_{s})$; that is, $\pi(a|s)$ is the entry for $a$ in the softmax of row $\theta_s$. For $s_1$, the probabilities of taking $a_1$ and $a_2$ come from the softmax of $[0.5, -0.5]$; for $s_2$, from the softmax of $[0.2, 0.2]$; and so on.
You are using a discount factor of $\gamma=0.9$.
You have collected three trajectories so far. Here, we'll write trajectories as lists of (state, action, reward) triples.
1. $\tau_1 = [(s_1, a_1, +2), (s_2, a_2, +1), (s_3, a_1, +3)]$
2. $\tau_2 = [(s_1, a_2, -1), (s_3, a_2, +2)]$
3. $\tau_3 = [(s_2, a_1, +1), (s_3, a_2, +2)]$
### Questions
1. Calculate the probabilities of taking each action from each state. (6 total probabilities)
2. Calculate the discounted returns $G_t$ for each timestep in each trajectory. (compute 7 $G_t$ total)
3. Calculate a value estimate of each state, $V^\pi(s)$. $V^\pi(s)$ can be estimated by averaging the discounted returns observed from each state. (One value for each state)
4. Compute the policy gradient $\nabla \log \pi(a|s, \theta)$ for each (state, action) pair in the collected trajectories. You will only compute the gradient for (state, action) pairs that appear in your trajectories, which results in 7 gradients in total.
5. Calculate the REINFORCE gradient estimate $\nabla J(\theta)$ for each trajectory. (One gradient for each trajectory)
6. Compute the mean gradient estimate across all trajectories. (For each element of the gradient, take the mean over all 3 trajectories)
7. Compute the variance of the gradient estimates across all trajectories. (For each element of the gradient, take the variance over all 3 trajectories)
8. Recompute the gradient estimate for each trajectory with REINFORCE using a baseline function $b(s) = V(s)$, which you computed in question 3.
9. Compute the mean and variance of these gradient estimates and compare your answer to the result of question 7.
10. What is the advantage of using a baseline function in REINFORCE?
<!---
# REINFORCE Problem Set - Answer Key
## 1. Calculate policy probabilities π(a|s)
For each state $s$, we compute the softmax of the corresponding row of parameters:
$\pi(a|s) = \frac{e^{\theta_{s,a}}}{\sum_{a'} e^{\theta_{s,a'}}}$
State $s_1$, parameters [0.5, -0.5]:
- $\pi(a_1|s_1) = \frac{e^{0.5}}{e^{0.5} + e^{-0.5}} = \frac{1.6487}{1.6487 + 0.6065} = \frac{1.6487}{2.2552} = 0.7310$
- $\pi(a_2|s_1) = \frac{e^{-0.5}}{e^{0.5} + e^{-0.5}} = \frac{0.6065}{2.2552} = 0.2690$
State $s_2$, parameters [0.2, 0.2]:
- $\pi(a_1|s_2) = \frac{e^{0.2}}{e^{0.2} + e^{0.2}} = \frac{1.2214}{1.2214 + 1.2214} = 0.5000$
- $\pi(a_2|s_2) = \frac{e^{0.2}}{e^{0.2} + e^{0.2}} = \frac{1.2214}{2.4428} = 0.5000$
State $s_3$, parameters [0.1, 0.7]:
- $\pi(a_1|s_3) = \frac{e^{0.1}}{e^{0.1} + e^{0.7}} = \frac{1.1052}{1.1052 + 2.0138} = \frac{1.1052}{3.1190} = 0.3543$
- $\pi(a_2|s_3) = \frac{e^{0.7}}{e^{0.1} + e^{0.7}} = \frac{2.0138}{3.1190} = 0.6457$
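As a quick numerical check, here is a minimal NumPy sketch of this computation (the array and function names are our own conventions, not part of the problem):

```python
import numpy as np

theta = np.array([[0.5, -0.5],   # parameters for s1
                  [0.2,  0.2],   # parameters for s2
                  [0.1,  0.7]])  # parameters for s3

def policy(theta):
    # Row-wise softmax: pi[s, a] = exp(theta[s, a]) / sum_a' exp(theta[s, a'])
    e = np.exp(theta)
    return e / e.sum(axis=1, keepdims=True)

pi = policy(theta)
print(pi.round(4))
# [[0.7311 0.2689]
#  [0.5    0.5   ]
#  [0.3543 0.6457]]
```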
## 2. Calculate discounted returns $G_t$
For each trajectory, we calculate returns by working backwards from the final timestep:
Trajectory 1: [(s₁,a₁,+2), (s₂,a₂,+1), (s₃,a₁,+3)]
- $G_3 = R_3 = 3$
- $G_2 = R_2 + \gamma G_3 = 1 + 0.9 \times 3 = 1 + 2.7 = 3.7$
- $G_1 = R_1 + \gamma G_2 = 2 + 0.9 \times 3.7 = 2 + 3.33 = 5.33$
Trajectory 2: [(s₁,a₂,-1), (s₃,a₂,+2)]
- $G_2 = R_2 = 2$
- $G_1 = R_1 + \gamma G_2 = -1 + 0.9 \times 2 = -1 + 1.8 = 0.8$
Trajectory 3: [(s₂,a₁,+1), (s₃,a₂,+2)]
- $G_2 = R_2 = 2$
- $G_1 = R_1 + \gamma G_2 = 1 + 0.9 \times 2 = 1 + 1.8 = 2.8$
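The backward recursion $G_t = R_t + \gamma G_{t+1}$ is easy to verify in code; a minimal sketch (the helper name is ours):

```python
def discounted_returns(rewards, gamma=0.9):
    # Work backwards: G_t = r_t + gamma * G_{t+1}, starting from G = 0 past the last step
    G, out = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G
        out.append(G)
    return out[::-1]

print(discounted_returns([2, 1, 3]))  # ≈ [5.33, 3.7, 3.0]
print(discounted_returns([-1, 2]))    # ≈ [0.8, 2.0]
print(discounted_returns([1, 2]))     # ≈ [2.8, 2.0]
```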
## 3. Calculate value estimate for each state
We calculate the average return from each state:
State $s_1$:
- Trajectory 1: $G_1 = 5.33$
- Trajectory 2: $G_1 = 0.8$
- Average: $V(s_1) = \frac{5.33 + 0.8}{2} = \frac{6.13}{2} = 3.065$
State $s_2$:
- Trajectory 1: $G_2 = 3.7$
- Trajectory 3: $G_1 = 2.8$
- Average: $V(s_2) = \frac{3.7 + 2.8}{2} = \frac{6.5}{2} = 3.25$
State $s_3$:
- Trajectory 1: $G_3 = 3$
- Trajectory 2: $G_2 = 2$
- Trajectory 3: $G_2 = 2$
- Average: $V(s_3) = \frac{3 + 2 + 2}{3} = \frac{7}{3} = 2.33$
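The same averaging can be done programmatically; a sketch that assumes the `discounted_returns` helper from the previous sketch is in scope:

```python
from collections import defaultdict

trajectories = [
    [("s1", "a1",  2), ("s2", "a2", 1), ("s3", "a1", 3)],
    [("s1", "a2", -1), ("s3", "a2", 2)],
    [("s2", "a1",  1), ("s3", "a2", 2)],
]

# Collect every return observed from each state, then average
returns_by_state = defaultdict(list)
for tau in trajectories:
    Gs = discounted_returns([r for (_, _, r) in tau])
    for (s, _, _), G in zip(tau, Gs):
        returns_by_state[s].append(G)

V = {s: sum(Gs) / len(Gs) for s, Gs in returns_by_state.items()}
print(V)  # ≈ {'s1': 3.065, 's2': 3.25, 's3': 2.333}
```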
## 4. Compute policy gradients ∇log π(a|s,θ)
For softmax parameterization with a tabular representation, the gradient of the log policy with respect to θ has a specific structure:
For a given state $s$ and action $a$, and for each parameter $\theta_{s',a'}$:
- If s' = s and a' = a: $\nabla_{\theta_{s',a'}} \log \pi(a|s,\theta) = 1 - \pi(a|s,\theta)$
- If s' = s and a' ≠ a: $\nabla_{\theta_{s',a'}} \log \pi(a|s,\theta) = -\pi(a'|s,\theta)$
- If s' ≠ s: $\nabla_{\theta_{s',a'}} \log \pi(a|s,\theta) = 0$
Let's compute these gradients for each (state, action) pair in our trajectories. We'll represent the gradient as a 3×2 matrix corresponding to the shape of θ, where each element (i,j) represents $\nabla_{\theta_{i,j}} \log \pi(a|s,\theta)$.
Trajectory 1:
1. (s₁, a₁):
- $\nabla_{\theta_{1,1}} \log \pi(a_1|s_1,\theta) = 1 - 0.731 = 0.269$
- $\nabla_{\theta_{1,2}} \log \pi(a_1|s_1,\theta) = -0.269$
- All other elements are 0
2. (s₂, a₂):
- $\nabla_{\theta_{2,1}} \log \pi(a_2|s_2,\theta) = -0.5$
- $\nabla_{\theta_{2,2}} \log \pi(a_2|s_2,\theta) = 1 - 0.5 = 0.5$
- All other elements are 0
3. (s₃, a₁):
- $\nabla_{\theta_{3,1}} \log \pi(a_1|s_3,\theta) = 1 - 0.3543 = 0.6457$
- $\nabla_{\theta_{3,2}} \log \pi(a_1|s_3,\theta) = -0.6457$
- All other elements are 0
So the complete gradients for each step in trajectory 1 are:
- Step 1: $\begin{bmatrix} 0.269 & -0.269 \\ 0 & 0 \\ 0 & 0 \end{bmatrix}$
- Step 2: $\begin{bmatrix} 0 & 0 \\ -0.5 & 0.5 \\ 0 & 0 \end{bmatrix}$
- Step 3: $\begin{bmatrix} 0 & 0 \\ 0 & 0 \\ 0.6457 & -0.6457 \end{bmatrix}$
Trajectory 2:
1. (s₁, a₂):
- $\nabla_{\theta_{1,1}} \log \pi(a_2|s_1,\theta) = -0.731$
- $\nabla_{\theta_{1,2}} \log \pi(a_2|s_1,\theta) = 1 - 0.269 = 0.731$
- All other elements are 0
2. (s₃, a₂):
- $\nabla_{\theta_{3,1}} \log \pi(a_2|s_3,\theta) = -0.3543$
- $\nabla_{\theta_{3,2}} \log \pi(a_2|s_3,\theta) = 1 - 0.6457 = 0.3543$
- All other elements are 0
So the complete gradients for each step in trajectory 2 are:
- Step 1: $\begin{bmatrix} -0.731 & 0.731 \\ 0 & 0 \\ 0 & 0 \end{bmatrix}$
- Step 2: $\begin{bmatrix} 0 & 0 \\ 0 & 0 \\ -0.3543 & 0.3543 \end{bmatrix}$
Trajectory 3:
1. (s₂, a₁):
- $\nabla_{\theta_{2,1}} \log \pi(a_1|s_2,\theta) = 1 - 0.5 = 0.5$
- $\nabla_{\theta_{2,2}} \log \pi(a_1|s_2,\theta) = -0.5$
- All other elements are 0
2. (s₃, a₂):
- $\nabla_{\theta_{3,1}} \log \pi(a_2|s_3,\theta) = -0.3543$
- $\nabla_{\theta_{3,2}} \log \pi(a_2|s_3,\theta) = 1 - 0.6457 = 0.3543$
- All other elements are 0
So the complete gradients for each step in trajectory 3 are:
- Step 1: $\begin{bmatrix} 0 & 0 \\ 0.5 & -0.5 \\ 0 & 0 \end{bmatrix}$
- Step 2: $\begin{bmatrix} 0 & 0 \\ 0 & 0 \\ -0.3543 & 0.3543 \end{bmatrix}$
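The one-hot-minus-probabilities structure translates directly to code; a sketch that assumes `theta` and `pi` from the first sketch:

```python
S = {"s1": 0, "s2": 1, "s3": 2}  # state -> row index
A = {"a1": 0, "a2": 1}           # action -> column index

def grad_log_pi(s, a):
    # Row s gets one_hot(a) - pi[s]; every other row stays zero
    g = np.zeros_like(theta)
    g[S[s]] = -pi[S[s]]
    g[S[s], A[a]] += 1.0
    return g

print(grad_log_pi("s1", "a1").round(4))
# [[ 0.2689 -0.2689]
#  [ 0.      0.    ]
#  [ 0.      0.    ]]
```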
## 5. Calculate REINFORCE gradient estimates
The REINFORCE gradient for a trajectory is:
$\nabla J(\theta) = \sum_t \nabla_\theta \log \pi(a_t|s_t,\theta) \cdot G_t$
We multiply each gradient matrix by its corresponding return and sum them to get the total gradient for each trajectory.
Trajectory 1:
- Step 1: $\begin{bmatrix} 0.269 & -0.269 \\ 0 & 0 \\ 0 & 0 \end{bmatrix} \times 5.33 = \begin{bmatrix} 1.434 & -1.434 \\ 0 & 0 \\ 0 & 0 \end{bmatrix}$
- Step 2: $\begin{bmatrix} 0 & 0 \\ -0.5 & 0.5 \\ 0 & 0 \end{bmatrix} \times 3.7 = \begin{bmatrix} 0 & 0 \\ -1.85 & 1.85 \\ 0 & 0 \end{bmatrix}$
- Step 3: $\begin{bmatrix} 0 & 0 \\ 0 & 0 \\ 0.6457 & -0.6457 \end{bmatrix} \times 3 = \begin{bmatrix} 0 & 0 \\ 0 & 0 \\ 1.937 & -1.937 \end{bmatrix}$
Total gradient for trajectory 1:
$\nabla J(\theta)_1 = \begin{bmatrix} 1.434 & -1.434 \\ -1.85 & 1.85 \\ 1.937 & -1.937 \end{bmatrix}$
Trajectory 2:
- Step 1: $\begin{bmatrix} -0.731 & 0.731 \\ 0 & 0 \\ 0 & 0 \end{bmatrix} \times 0.8 = \begin{bmatrix} -0.585 & 0.585 \\ 0 & 0 \\ 0 & 0 \end{bmatrix}$
- Step 2: $\begin{bmatrix} 0 & 0 \\ 0 & 0 \\ -0.3543 & 0.3543 \end{bmatrix} \times 2 = \begin{bmatrix} 0 & 0 \\ 0 & 0 \\ -0.709 & 0.709 \end{bmatrix}$
Total gradient for trajectory 2:
$\nabla J(\theta)_2 = \begin{bmatrix} -0.585 & 0.585 \\ 0 & 0 \\ -0.709 & 0.709 \end{bmatrix}$
Trajectory 3:
- Step 1: $\begin{bmatrix} 0 & 0 \\ 0.5 & -0.5 \\ 0 & 0 \end{bmatrix} \times 2.8 = \begin{bmatrix} 0 & 0 \\ 1.4 & -1.4 \\ 0 & 0 \end{bmatrix}$
- Step 2: $\begin{bmatrix} 0 & 0 \\ 0 & 0 \\ -0.3543 & 0.3543 \end{bmatrix} \times 2 = \begin{bmatrix} 0 & 0 \\ 0 & 0 \\ -0.709 & 0.709 \end{bmatrix}$
Total gradient for trajectory 3:
$\nabla J(\theta)_3 = \begin{bmatrix} 0 & 0 \\ 1.4 & -1.4 \\ -0.709 & 0.709 \end{bmatrix}$
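Putting the pieces together, the per-trajectory estimate is a return-weighted sum of the step gradients; a sketch that reuses the helpers defined above:

```python
def reinforce_gradient(tau, gamma=0.9):
    # sum_t grad log pi(a_t | s_t) * G_t
    Gs = discounted_returns([r for (_, _, r) in tau], gamma)
    return sum(grad_log_pi(s, a) * G for (s, a, _), G in zip(tau, Gs))

grads = [reinforce_gradient(tau) for tau in trajectories]
print(grads[0].round(3))
# [[ 1.433 -1.433]   (the hand calculation rounds pi to 0.269 first, giving 1.434)
#  [-1.85   1.85 ]
#  [ 1.937 -1.937]]
```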
## 6. Compute mean gradient estimate
The mean gradient is the element-wise average of the three gradient matrices:
$\bar{\nabla} J(\theta) = \frac{1}{3}(\nabla J(\theta)_1 + \nabla J(\theta)_2 + \nabla J(\theta)_3)$
$= \frac{1}{3} \left( \begin{bmatrix} 1.434 & -1.434 \\ -1.85 & 1.85 \\ 1.937 & -1.937 \end{bmatrix} + \begin{bmatrix} -0.585 & 0.585 \\ 0 & 0 \\ -0.709 & 0.709 \end{bmatrix} + \begin{bmatrix} 0 & 0 \\ 1.4 & -1.4 \\ -0.709 & 0.709 \end{bmatrix} \right)$
$= \frac{1}{3} \begin{bmatrix} 0.849 & -0.849 \\ -0.45 & 0.45 \\ 0.519 & -0.519 \end{bmatrix} = \begin{bmatrix} 0.283 & -0.283 \\ -0.15 & 0.15 \\ 0.173 & -0.173 \end{bmatrix}$
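With the three gradients stacked into one array, the mean is a single call; a sketch assuming `grads` from the previous sketch:

```python
stacked = np.stack(grads)         # shape (3, 3, 2): trajectory x state x action
mean_grad = stacked.mean(axis=0)
print(mean_grad.round(3))
# ≈ [[ 0.283 -0.283]
#    [-0.15   0.15 ]
#    [ 0.173 -0.173]]
```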
## 7. Compute variance of gradient estimates
The variance is computed element-wise across the three gradient matrices:
$Var[\nabla J(\theta)] = \frac{1}{3} \sum_{i=1}^3 (\nabla J(\theta)_i - \bar{\nabla} J(\theta))^2$
Let's compute this for each element:
Element (1,1):
$Var_{1,1} = \frac{1}{3}[(1.434 - 0.283)^2 + (-0.585 - 0.283)^2 + (0 - 0.283)^2]$
$= \frac{1}{3}[1.151^2 + (-0.868)^2 + (-0.283)^2]$
$= \frac{1}{3}[1.325 + 0.753 + 0.080]$
$= \frac{2.158}{3} = 0.719$
Element (1,2):
$Var_{1,2} = \frac{1}{3}[(-1.434 - (-0.283))^2 + (0.585 - (-0.283))^2 + (0 - (-0.283))^2]$
$= \frac{1}{3}[(-1.151)^2 + 0.868^2 + 0.283^2]$
$= \frac{1}{3}[1.325 + 0.753 + 0.080]$
$= \frac{2.158}{3} = 0.719$
Element (2,1):
$Var_{2,1} = \frac{1}{3}[(-1.85 - (-0.15))^2 + (0 - (-0.15))^2 + (1.4 - (-0.15))^2]$
$= \frac{1}{3}[(-1.7)^2 + 0.15^2 + 1.55^2]$
$= \frac{1}{3}[2.89 + 0.0225 + 2.4025]$
$= \frac{5.315}{3} = 1.772$
Element (2,2):
$Var_{2,2} = \frac{1}{3}[(1.85 - 0.15)^2 + (0 - 0.15)^2 + (-1.4 - 0.15)^2]$
$= \frac{1}{3}[1.7^2 + (-0.15)^2 + (-1.55)^2]$
$= \frac{1}{3}[2.89 + 0.0225 + 2.4025]$
$= \frac{5.315}{3} = 1.772$
Element (3,1):
$Var_{3,1} = \frac{1}{3}[(1.937 - 0.173)^2 + (-0.709 - 0.173)^2 + (-0.709 - 0.173)^2]$
$= \frac{1}{3}[1.764^2 + (-0.882)^2 + (-0.882)^2]$
$= \frac{1}{3}[3.112 + 0.778 + 0.778]$
$= \frac{4.668}{3} = 1.556$
Element (3,2):
$Var_{3,2} = \frac{1}{3}[(-1.937 - (-0.173))^2 + (0.709 - (-0.173))^2 + (0.709 - (-0.173))^2]$
$= \frac{1}{3}[(-1.764)^2 + 0.882^2 + 0.882^2]$
$= \frac{1}{3}[3.112 + 0.778 + 0.778]$
$= \frac{4.668}{3} = 1.556$
So the complete variance matrix is:
$Var[\nabla J(\theta)] = \begin{bmatrix} 0.719 & 0.719 \\ 1.772 & 1.772 \\ 1.556 & 1.556 \end{bmatrix}$
The total variance (sum of all element variances) is: 8.094
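The element-wise variance is the same one-liner on the stacked array (note that `np.var` uses the $1/N$ population variance by default, matching the formula above); small differences from the hand calculation are due to rounding:

```python
var_grad = stacked.var(axis=0)
print(var_grad.round(3))
# ≈ [[0.719 0.719]
#    [1.772 1.772]
#    [1.555 1.555]]
print(var_grad.sum().round(3))    # ≈ 8.092
```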
## 8. Compute baseline-adjusted gradient estimates
Using $b(s) = V(s)$ as the baseline, we adjust the returns by subtracting the baseline value for each state:
Trajectory 1:
- Step 1 (s₁, a₁): $G_1 - V(s_1) = 5.33 - 3.065 = 2.265$
- Step 2 (s₂, a₂): $G_2 - V(s_2) = 3.7 - 3.25 = 0.45$
- Step 3 (s₃, a₁): $G_3 - V(s_3) = 3 - 2.33 = 0.67$
Gradient with baseline for trajectory 1:
- Step 1: $\begin{bmatrix} 0.269 & -0.269 \\ 0 & 0 \\ 0 & 0 \end{bmatrix} \times 2.265 = \begin{bmatrix} 0.609 & -0.609 \\ 0 & 0 \\ 0 & 0 \end{bmatrix}$
- Step 2: $\begin{bmatrix} 0 & 0 \\ -0.5 & 0.5 \\ 0 & 0 \end{bmatrix} \times 0.45 = \begin{bmatrix} 0 & 0 \\ -0.225 & 0.225 \\ 0 & 0 \end{bmatrix}$
- Step 3: $\begin{bmatrix} 0 & 0 \\ 0 & 0 \\ 0.6457 & -0.6457 \end{bmatrix} \times 0.67 = \begin{bmatrix} 0 & 0 \\ 0 & 0 \\ 0.433 & -0.433 \end{bmatrix}$
Total baseline-adjusted gradient for trajectory 1:
$\nabla J_B(\theta)_1 = \begin{bmatrix} 0.609 & -0.609 \\ -0.225 & 0.225 \\ 0.433 & -0.433 \end{bmatrix}$
Trajectory 2:
- Step 1 (s₁, a₂): $G_1 - V(s_1) = 0.8 - 3.065 = -2.265$
- Step 2 (s₃, a₂): $G_2 - V(s_3) = 2 - 2.33 = -0.33$
Gradient with baseline for trajectory 2:
- Step 1: $\begin{bmatrix} -0.731 & 0.731 \\ 0 & 0 \\ 0 & 0 \end{bmatrix} \times (-2.265) = \begin{bmatrix} 1.656 & -1.656 \\ 0 & 0 \\ 0 & 0 \end{bmatrix}$
- Step 2: $\begin{bmatrix} 0 & 0 \\ 0 & 0 \\ -0.3543 & 0.3543 \end{bmatrix} \times (-0.33) = \begin{bmatrix} 0 & 0 \\ 0 & 0 \\ 0.117 & -0.117 \end{bmatrix}$
Total baseline-adjusted gradient for trajectory 2:
$\nabla J_B(\theta)_2 = \begin{bmatrix} 1.656 & -1.656 \\ 0 & 0 \\ 0.117 & -0.117 \end{bmatrix}$
Trajectory 3:
- Step 1 (s₂, a₁): $G_1 - V(s_2) = 2.8 - 3.25 = -0.45$
- Step 2 (s₃, a₂): $G_2 - V(s_3) = 2 - 2.33 = -0.33$
Gradient with baseline for trajectory 3:
- Step 1: $\begin{bmatrix} 0 & 0 \\ 0.5 & -0.5 \\ 0 & 0 \end{bmatrix} \times (-0.45) = \begin{bmatrix} 0 & 0 \\ -0.225 & 0.225 \\ 0 & 0 \end{bmatrix}$
- Step 2: $\begin{bmatrix} 0 & 0 \\ 0 & 0 \\ -0.3543 & 0.3543 \end{bmatrix} \times (-0.33) = \begin{bmatrix} 0 & 0 \\ 0 & 0 \\ 0.117 & -0.117 \end{bmatrix}$
Total baseline-adjusted gradient for trajectory 3:
$\nabla J_B(\theta)_3 = \begin{bmatrix} 0 & 0 \\ -0.225 & 0.225 \\ 0.117 & -0.117 \end{bmatrix}$
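A sketch of the baseline-adjusted estimate, assuming `V`, `discounted_returns`, `grad_log_pi`, and `trajectories` from the earlier sketches (it uses the unrounded $V(s_3) = 7/3$, so the $s_3$ entries come out as 0.430 rather than 0.433):

```python
def reinforce_gradient_baseline(tau, V, gamma=0.9):
    # sum_t grad log pi(a_t | s_t) * (G_t - V(s_t))
    Gs = discounted_returns([r for (_, _, r) in tau], gamma)
    return sum(grad_log_pi(s, a) * (G - V[s]) for (s, a, _), G in zip(tau, Gs))

grads_b = [reinforce_gradient_baseline(tau, V) for tau in trajectories]
print(grads_b[0].round(3))
# ≈ [[ 0.609 -0.609]
#    [-0.225  0.225]
#    [ 0.43  -0.43 ]]
```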
## 9. Compute variance of baseline-adjusted estimates
First, we compute the mean baseline-adjusted gradient:
$\bar{\nabla} J_B(\theta) = \frac{1}{3}(\nabla J_B(\theta)_1 + \nabla J_B(\theta)_2 + \nabla J_B(\theta)_3)$
$= \frac{1}{3} \left( \begin{bmatrix} 0.609 & -0.609 \\ -0.225 & 0.225 \\ 0.433 & -0.433 \end{bmatrix} + \begin{bmatrix} 1.656 & -1.656 \\ 0 & 0 \\ 0.117 & -0.117 \end{bmatrix} + \begin{bmatrix} 0 & 0 \\ -0.225 & 0.225 \\ 0.117 & -0.117 \end{bmatrix} \right)$
$= \frac{1}{3} \begin{bmatrix} 2.265 & -2.265 \\ -0.45 & 0.45 \\ 0.667 & -0.667 \end{bmatrix} = \begin{bmatrix} 0.755 & -0.755 \\ -0.15 & 0.15 \\ 0.222 & -0.222 \end{bmatrix}$
Now we compute the element-wise variance:
Element (1,1):
$Var_{1,1} = \frac{1}{3}[(0.609 - 0.755)^2 + (1.656 - 0.755)^2 + (0 - 0.755)^2]$
$= \frac{1}{3}[(-0.146)^2 + 0.901^2 + (-0.755)^2]$
$= \frac{1}{3}[0.021 + 0.812 + 0.57]$
$= \frac{1.403}{3} = 0.468$
Element (1,2):
$Var_{1,2} = \frac{1}{3}[(-0.609 - (-0.755))^2 + (-1.656 - (-0.755))^2 + (0 - (-0.755))^2]$
$= \frac{1}{3}[0.146^2 + (-0.901)^2 + 0.755^2]$
$= \frac{1}{3}[0.021 + 0.812 + 0.57]$
$= \frac{1.403}{3} = 0.468$
Element (2,1):
$Var_{2,1} = \frac{1}{3}[(-0.225 - (-0.15))^2 + (0 - (-0.15))^2 + (-0.225 - (-0.15))^2]$
$= \frac{1}{3}[(-0.075)^2 + 0.15^2 + (-0.075)^2]$
$= \frac{1}{3}[0.006 + 0.0225 + 0.006]$
$= \frac{0.0345}{3} = 0.0115$
Element (2,2):
$Var_{2,2} = \frac{1}{3}[(0.225 - 0.15)^2 + (0 - 0.15)^2 + (0.225 - 0.15)^2]$
$= \frac{1}{3}[0.075^2 + (-0.15)^2 + 0.075^2]$
$= \frac{1}{3}[0.006 + 0.0225 + 0.006]$
$= \frac{0.0345}{3} = 0.0115$
Element (3,1):
$Var_{3,1} = \frac{1}{3}[(0.433 - 0.222)^2 + (0.117 - 0.222)^2 + (0.117 - 0.222)^2]$
$= \frac{1}{3}[0.211^2 + (-0.105)^2 + (-0.105)^2]$
$= \frac{1}{3}[0.045 + 0.011 + 0.011]$
$= \frac{0.067}{3} = 0.022$
Element (3,2):
$Var_{3,2} = \frac{1}{3}[(-0.433 - (-0.222))^2 + (-0.117 - (-0.222))^2 + (-0.117 - (-0.222))^2]$
$= \frac{1}{3}[(-0.211)^2 + 0.105^2 + 0.105^2]$
$= \frac{1}{3}[0.045 + 0.011 + 0.011]$
$= \frac{0.067}{3} = 0.022$
So the complete variance matrix with baseline is:
$Var[\nabla J_B(\theta)] = \begin{bmatrix} 0.468 & 0.468 \\ 0.0115 & 0.0115 \\ 0.022 & 0.022 \end{bmatrix}$
The total variance with baseline (sum of all element variances) is: 1.003
Comparison of variances:
- Without baseline: 8.094
- With baseline: 1.003
- Variance reduction: approximately 87.6%
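The same stack-and-variance check confirms the drop numerically (assuming `grads_b` from the sketch above):

```python
stacked_b = np.stack(grads_b)
print(stacked_b.var(axis=0).sum().round(3))  # ≈ 1.001, vs ≈ 8.092 without the baseline
```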
The advantage of using a baseline function in REINFORCE is that it significantly reduces the variance of the gradient estimates without changing the expected gradient. This leads to more stable and efficient learning. By subtracting state-dependent values (the baseline) from the returns, we reduce the "noise" in the gradient estimates, allowing the algorithm to learn more effectively from fewer samples.
-->