---
title: 'HW7 Programming: Reinforcement Learning'
---
# HW7: Reinforcement Learning
:::danger
Programming assignment due **Friday, May 8th, 2026 at 11:59 PM EST.** **NO LATE DAYS! NO EXCEPTIONS :)**
:::

Bruno is stuck in the CIT, his thick paws hovering over a keyboard not designed for paws. He is really hungry for a snack during finals period and heard there is a competition on the Main Green for who can balance a pole on a cart the longest: the cart-pole challenge. Bruno tried hacking it manually, but the infinite permutations of the Providence wind soon shredded his notes. He decided to pivot to Deep Q-Networks, shuffling his clumsy stumbles into a replay buffer to digest past failures. When the pole wobbles, he has to figure out how to recalibrate his silicon brain, hunting for that sugary reward signal. He experiments with REINFORCE, trusting his gut, but the swings of wild variance leave him dizzy. Finally, he recruits a "critic" to whisper steadying advice, until he finds the perfect balance. Through grit and gradient descent, Bruno bridges the gap between raw instinct and a belly full of honey.
## Assignment Overview
This assignment is an introduction to reinforcement learning. Throughout this assignment, you will implement a few different deep-learning networks used within the reinforcement learning framework. That said, this handout is not a comprehensive guide, and it does not review the basics of reinforcement learning. If you need a refresher on earlier material, reference the lecture material or this very helpful textbook [here](https://web.stanford.edu/class/psych209/Readings/SuttonBartoIPRLBook2ndEd.pdf).
:::danger
**THIS ASSIGNMENT IS OPTIONAL! Completion of this assignment is not required for the Spring 2026 offering of the course. However, if you complete it, you will be awarded 2.5 points toward your assignment category grade.**
:::
## Getting Started
Please click <ins>[here](https://classroom.github.com/a/_Yg7bm4n)</ins> to get the stencil code. Reference this <ins>[guide](https://hackmd.io/gGOpcqoeTx-BOvLXQWRgQg)</ins> for more information about GitHub and GitHub Classroom.
:::danger
**Do not change the stencil except where specified**. Changing the stencil's method signatures or removing pre-defined functions could result in incompatibility with the autograder and result in a low grade.
**This assignment uses PyTorch.** Make sure your environment has PyTorch and Gymnasium installed.
:::
## Deep Q-Networks (DQN)
### Why Tables Break
For very simple environments, a Q-function can be represented as a table with one entry per $(s, a)$ pair. That works well for small, discrete grid worlds where the state space might only have a few dozen entries total.
However, consider **CartPole-v1**: the state is a 4-dimensional vector of continuous values (cart position, cart velocity, pole angle, pole angular velocity). The state space is infinite. You cannot build a table over it.
The fix is to replace the table with a **neural network** $Q(s, a;\theta)$ that maps an observation vector to Q-values for each action. This is the core idea behind DQN.
### Function Approximation
Instead of storing $Q(s,a)$ directly, we parameterize it:
$$Q(s, a;\theta) \approx Q^*(s,a)$$
where $\theta$ are the parameters of a neural network. Learning now means updating $\theta$ so the approximation improves.
The natural loss function is the squared Bellman residual:
$$\mathcal{L}(\theta) = \mathbb{E}\!\left[\left(y - Q(s,a;\theta)\right)^2\right]$$
where the target $y$ is:
$$y = r + \gamma \max_{a'} Q(s',a';\theta)$$
Here is the problem: $y$ depends on $\theta$ too. Updating $\theta$ changes both the prediction and the target simultaneously, which creates a moving target and can cause instability or divergence.
DQN addresses this with two key stabilization tricks.
### The DQN Network
The network maps an observation vector directly to Q-values for every action simultaneously. The output has one scalar per action, so during action selection the agent just takes the argmax.
A suggested architecture is 2-3 linear layers that map your `state_dim` to `n_actions`. Do not forget your activation functions!
:::info
**TODO 1.1: Implement `DQN.__init__` and `DQN.forward` in `dqn.py`**
Define your network as an `nn.Sequential` matching the architecture above. `forward` simply passes `x` through your network and returns the result.
:::
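For reference, one way this architecture might look (the hidden width of 128 is our illustrative choice; the stencil does not prescribe it):

```python
import torch
import torch.nn as nn

class DQN(nn.Module):
    """Maps a state vector to one Q-value per action."""

    def __init__(self, state_dim: int, n_actions: int, hidden: int = 128):
        super().__init__()
        # Two hidden layers with ReLU; no activation on the output layer,
        # because Q-values are unbounded real numbers.
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)
```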
### Trick 1: Experience Replay
Standard Q-learning updates on each transition as it arrives. Consecutive transitions are highly correlated because the agent is in roughly the same region of state space for many steps in a row. Training on correlated data breaks the IID assumption that gradient descent needs.
**Experience replay** stores transitions in a buffer $\mathcal{D}$ and trains on randomly sampled mini-batches:
$$\text{sample } (s, a, r, s', \text{done}) \sim \mathcal{D}$$
This breaks the correlation, stabilizes training, and lets the network learn from each transition multiple times.
The buffer has a fixed capacity. When it fills, old transitions are overwritten. This is a circular buffer, and Python's `collections.deque(maxlen=N)` handles the eviction automatically.
:::info
**TODO 1.2: Implement `ReplayBuffer.push` and `ReplayBuffer.sample` in `dqn.py`**
`push` stores one transition as a tuple `(state, action, reward, next_state, done)` in `self.buffer`.
`sample` draws `batch_size` random transitions without replacement using `random.sample`, then unpacks them with `zip(*transitions)`. Return five numpy arrays with these dtypes:
| Array | dtype |
|---|---|
| `states` | `float32` |
| `actions` | `int64` |
| `rewards` | `float32` |
| `next_states` | `float32` |
| `dones` | `float32` (1.0 = done, 0.0 = not done) |
:::warning
The datatypes really matter, so make sure you set them properly :)
:::
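A sketch of how the buffer might be implemented, assuming the `push`/`sample` behavior described above (the capacity default is illustrative):

```python
import random
from collections import deque

import numpy as np

class ReplayBuffer:
    """Fixed-capacity circular buffer of transitions."""

    def __init__(self, capacity: int = 10000):
        # deque(maxlen=...) evicts the oldest transition automatically.
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size: int):
        # Uniform sampling without replacement breaks temporal correlation.
        transitions = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*transitions)
        return (
            np.array(states, dtype=np.float32),
            np.array(actions, dtype=np.int64),
            np.array(rewards, dtype=np.float32),
            np.array(next_states, dtype=np.float32),
            np.array(dones, dtype=np.float32),
        )

    def __len__(self):
        return len(self.buffer)
```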
### Trick 2: Target Network
To stabilize the target $y$, DQN maintains a **separate copy** of the Q-network called the target network $Q(s,a;\theta^-)$. The loss becomes:
$$\mathcal{L}(\theta) = \mathbb{E}\!\left[\left(\underbrace{r + \gamma \max_{a'} Q(s',a';\theta^-)}_{\text{target}} - Q(s,a;\theta)\right)^2\right]$$
The target network parameters $\theta^-$ are held **frozen** during gradient updates. Every $C$ steps, we copy $\theta$ into $\theta^-$. Because $\theta^-$ is fixed, the target $y$ is stable across the mini-batch update.
Gradients flow only through $Q(s,a;\theta)$, not through the target.
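The freeze-and-sync pattern might look like this in isolation (the toy `nn.Linear` stands in for the real Q-network):

```python
import copy

import torch
import torch.nn as nn

# A toy Q-network standing in for the real policy network.
policy_net = nn.Linear(4, 2)

# Create the target network as an identical, independently-parameterized copy.
target_net = copy.deepcopy(policy_net)
target_net.load_state_dict(policy_net.state_dict())

# The target network is never trained directly, so freeze its gradients.
for p in target_net.parameters():
    p.requires_grad_(False)

# ...then every C steps inside the training loop:
target_net.load_state_dict(policy_net.state_dict())
```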
:::info
**TODO 1.3: Implement `compute_dqn_loss` in `dqn.py`**
This function is a little tricky because of shape issues. Follow the directions carefully. Given a batch of tensors already on the device, compute the DQN loss in three steps:
1. **Current Q-values**: Your policy network produces a tensor of the shape `(B, n_actions)`. Use `.gather(1, actions.unsqueeze(1)).squeeze(1)` to select only the Q-value for the action that was actually taken, giving shape `(B,)`.
2. **Targets**: Inside `torch.no_grad()`, call `target_net(next_states).max(dim=1)[0]` to get the max Q-value per row (shape `(B,)`). Compute `targets = rewards + gamma * next_q * (1.0 - dones)`. The `(1.0 - dones)` term zeros out the bootstrap for terminal transitions.
3. Return `nn.functional.mse_loss(q_values, targets)`.
:::
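Putting the three steps together, a minimal sketch might look like this (the argument order is illustrative; follow the stencil's actual signature):

```python
import torch
import torch.nn as nn

def compute_dqn_loss(states, actions, rewards, next_states, dones,
                     policy_net, target_net, gamma):
    """DQN loss for one mini-batch of transitions."""
    # (B, n_actions) -> (B,): keep only the Q-value of the action taken.
    q_values = policy_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # Targets use the frozen target network; no_grad ensures no gradients
    # flow through theta^-.
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1)[0]
        targets = rewards + gamma * next_q * (1.0 - dones)

    return nn.functional.mse_loss(q_values, targets)
```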
### The Full DQN Loop
Putting it together:
1. Collect transitions with $\varepsilon$-greedy action selection and store in $\mathcal{D}$.
2. Every step, sample a mini-batch from $\mathcal{D}$.
3. Compute targets using $\theta^-$.
4. Compute the MSE loss, backpropagate, update $\theta$.
5. Every $C$ steps, copy $\theta \to \theta^-$.
The outer training loop in `train_dqn` is given. You are responsible for the functions it calls.
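Step 1's $\varepsilon$-greedy selection could be sketched as follows (the function name and signature are illustrative, not from the stencil):

```python
import random

import torch

def select_action(policy_net, state, epsilon, n_actions):
    """Epsilon-greedy: explore with probability epsilon, else take argmax Q."""
    if random.random() < epsilon:
        return random.randrange(n_actions)
    with torch.no_grad():
        q = policy_net(torch.as_tensor(state, dtype=torch.float32).unsqueeze(0))
        return int(q.argmax(dim=1).item())
```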
:::info
**TODO 1.4: Implement `train_step` in `dqn.py`**
Orchestrate one gradient update:
1. Return `None` if `len(buffer) < batch_size`. We cannot form a full batch yet.
2. Call `buffer.sample(batch_size)`. Convert each numpy array to a torch tensor with the correct type and move it to `device`:
- `states`, `rewards`, `next_states`, `dones` all into torch tensors of the type `torch.FloatTensor`
- `actions` into `torch.LongTensor` (required for `.gather`)
3. Call `compute_dqn_loss` with the tensors, `policy_net`, `target_net`, and `gamma`.
4. Zero the optimizer's gradients, call `loss.backward()`, and step the optimizer.
5. Return `loss.item()` for logging.
:::
### Verifying Deep Q-Networks
Run `python main.py train --model dqn` after completing all four TODOs.
**Reward curve**: CartPole-v1 has a maximum episode length of 500 steps (reward 500). A random agent averages around 20-25. A trained DQN should reach a moving average of 80-100+ within 400-600 episodes.
**Loss curve**: The DQN loss should generally trend downward but will be noisy. Spikes are normal because they often correspond to target network syncs.
:::warning
**Common mistakes**:
- Forgetting to detach the target (passing gradients through $\theta^-$). Training will be slower or unstable.
- Applying the bootstrap when `done=True`. Q-values near episode ends will be inflated.
- Moving tensors to the wrong device. PyTorch will throw a device mismatch error.
- Using a float tensor for actions. `.gather` requires `torch.LongTensor` indices and will fail with float actions.
:::
## Running Your Code
All training and visualization is handled through a single entry point: `main.py`. You do not need to run `dqn.py`, `reinforce.py`, or `actor_critic.py` directly.
### Training
The simplest invocation is just:
```bash
python main.py train --model dqn
```
By default this trains on `CartPole-v1` for 1000 episodes with discount factor 0.99, renders the trained agent at the end, and saves the checkpoint under `checkpoints/CartPole-v1/dqn/<timestamp>/`. You do not need to change any of that unless you want to.
Checkpoints are saved to `checkpoints/<env>/<algorithm>/<run>/`.
### Visualizing
After training, the checkpoint is automatically saved to a fixed `checkpoints/<env>/<model>/best/` directory. To render it, just pass `--env` and `--model`:
```bash
# Single agent (CartPole, DQN)
python main.py visualize --env CartPole-v1 --model dqn
# Single agent (CartPole, REINFORCE)
python main.py visualize --env CartPole-v1 --model reinforce
# Side-by-side comparison of two agents
python main.py visualize \
--checkpoint-a checkpoints/CartPole-v1/reinforce/best \
--checkpoint-b checkpoints/CartPole-v1/dqn/best
```
:::warning
**These models take longer to run.** This is because we are interacting with a real environment and must wait for episodes to roll out. Do not be alarmed by the long training times. You can shorten them by decreasing the `--episodes` flag from its default of 1000.
:::
If you need to inspect a specific timestamped run instead of the best one, use `--checkpoint` with the full run path:
```bash
python main.py visualize --checkpoint checkpoints/CartPole-v1/dqn/<run>
```
The GIF and a JSON summary are saved to `outputs/visualizations/`. The defaults (1 episode, 500 max steps) are fine for this assignment's purposes.
### Available Environments
The primary environments used in this assignment are:
- **`CartPole-v1`**: A classic control problem where the agent must balance a pole on a moving cart by pushing it left or right.
- **`LunarLander-v2`**: An environment with a continuous state space where the agent must safely land a spacecraft using its main and side thrusters.
---
### Applying it to your DQN
Once your agent finishes training, try visualizing it:
```bash
# Default: CartPole-v1, 1000 episodes, renders the agent live when done
python main.py train --model dqn
# Skip the live render if you just want the checkpoint
python main.py train --model dqn --no-render
# Visualize the best checkpoint
python main.py visualize --env CartPole-v1 --model dqn
```
---
## Policy Gradient Methods (REINFORCE)
### A Different Kind of Agent
The previous sections learned a Q-function and derived a policy from it: always take $\arg\max_a Q(s,a)$. The policy was implicit.
Policy gradient methods flip this. We parameterize the policy directly as $\pi_\theta(a \mid s)$ and optimize its parameters $\theta$ to maximize expected return. There is no Q-function and no value function. Just a network that maps states to probability distributions over actions.
This matters because it handles continuous action spaces naturally, it can represent stochastic policies, and it is the basis for most modern RL algorithms.
### The Policy Network
The policy network is a **stochastic** classifier: given a state, it outputs a probability distribution over actions rather than a single deterministic action. Sampling from this distribution is what makes the policy exploratory.
In practice, the network outputs **logits** and we pass them to `torch.distributions.Categorical`. This distribution handles the softmax and sampling internally. The key thing the training loop needs is `dist.log_prob(action)`, which is differentiable with respect to the network parameters.
We suggest an architecture consisting of two linear layers with ReLU activations, with the output of the second layer being the logits.
The `forward` method should return a `Categorical` distribution, not the raw logits, so the training loop can call `dist.sample()` and `dist.log_prob(action)` directly.
:::info
**TODO 2.1: Implement `PolicyNetwork.__init__` and `PolicyNetwork.forward` in `reinforce.py`**
Define your network as an `nn.Sequential` matching the architecture above. `forward` should pass `x` through the network and return `torch.distributions.Categorical(logits=self.net(x))`.
:::
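For reference, a sketch of such a policy network (the hidden width of 128 is an illustrative choice):

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

class PolicyNetwork(nn.Module):
    """Maps a state to a Categorical distribution over actions."""

    def __init__(self, state_dim: int, n_actions: int, hidden: int = 128):
        super().__init__()
        # Two linear layers; the second layer's output are the logits,
        # so there is no activation after it.
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, x: torch.Tensor) -> Categorical:
        # Categorical applies the softmax internally.
        return Categorical(logits=self.net(x))
```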
### The Policy Gradient Theorem
We want to maximize:
$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\sum_{t=0}^T r_t\right]$$
where $\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \dots)$ is a trajectory sampled from the policy.
The **policy gradient theorem** resolves the problem of differentiating an expectation over trajectories. The key identity is:
$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\sum_{t=0}^T \nabla_\theta \log \pi_\theta(a_t \mid s_t) \cdot G_t\right]$$
where $G_t = \sum_{k=t}^T \gamma^{k-t} r_k$ is the discounted return from step $t$.
### REINFORCE
**REINFORCE** is the simplest algorithm based on this theorem. Run one full episode, collect the trajectory, compute returns for each step, and apply the gradient update:
$$\theta \leftarrow \theta + \alpha \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t) \cdot G_t$$
In PyTorch, minimizing the *negative* of $J$ is equivalent to gradient ascent on $J$:
$$\mathcal{L} = -\sum_t \log \pi_\theta(a_t \mid s_t) \cdot G_t$$
Call `.backward()` on this loss and the gradients $\nabla_\theta \log \pi_\theta(a_t \mid s_t)$ are computed automatically, because `dist.log_prob(action)` is differentiable with respect to the network parameters.
### Computing Returns
The discounted return at step $t$ is:
$$G_t = r_t + \gamma \cdot r_{t+1} + \gamma^2 \cdot r_{t+2} + \cdots = r_t + \gamma \cdot G_{t+1}$$
This recurrence means we can compute all returns in a single backward pass through the reward list. Starting with $G = 0$ after the last step:
```python
G = 0.0
returns = []
for r in reversed(rewards):
    G = r + gamma * G
    returns.insert(0, G)  # prepend so returns ends up in time order
```
The result is a list $[G_0, G_1, \dots, G_{T-1}]$ of length $T$.
:::info
**TODO 2.2: Implement `compute_returns` in `reinforce.py`**
Given a list of per-step rewards and a discount factor $\gamma$, return a float32 tensor of shape `(T,)` containing $G_0, G_1, \dots, G_{T-1}$.
Use the backward accumulation pattern above. `returns.insert(0, G)` prepends to the list, so after the loop you can call `torch.tensor(returns, dtype=torch.float32)`.
:::
:::info
**TODO 2.3: Implement `reinforce_loss` in `reinforce.py`**
Implement the REINFORCE loss:
$$\mathcal{L} = -\sum_{t=0}^T \log\pi_\theta(a_t \mid s_t) \cdot G_t$$
Stack `log_probs` into a single tensor of shape `(T,)` and multiply element-wise by `returns`, sum, and negate. Both tensors must be on the same device.
:::
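A minimal sketch of this loss, assuming `log_probs` is a list of scalar tensors as described:

```python
import torch

def reinforce_loss(log_probs, returns):
    """Negative of the policy-gradient objective for one episode.

    log_probs: list of T scalar tensors from dist.log_prob(action)
    returns:   float32 tensor of shape (T,)
    """
    log_probs = torch.stack(log_probs)        # (T,)
    return -(log_probs * returns).sum()       # scalar loss for .backward()
```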
### The Variance Problem
REINFORCE is unbiased, meaning the gradient estimate is correct in expectation. But it has high variance. A single trajectory is a noisy estimate of the true gradient. The returns $G_t$ can be large and fluctuate wildly between runs.
High variance means you need many samples to get a reliable gradient signal, which makes training slow and unstable.
#### Baselines
A **baseline** $b$ can be subtracted from the returns without introducing bias:
$$\nabla_\theta J = \mathbb{E}_\tau\!\left[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t) \cdot (G_t - b)\right]$$
This works because the expected gradient is unchanged, but the variance of $(G_t - b)$ can be much smaller than the variance of $G_t$ alone. A common practical choice is an exponential moving average of past returns.
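An exponential-moving-average baseline can be maintained in a couple of lines (the decay rate 0.95 is an illustrative choice, not a value prescribed by the stencil):

```python
def update_baseline(baseline, episode_return, decay=0.95):
    """Blend the newest episode return into the running average."""
    return decay * baseline + (1.0 - decay) * episode_return

# After each episode, fold its total return into the baseline.
baseline = 0.0
for episode_return in [20.0, 30.0, 25.0]:
    baseline = update_baseline(baseline, episode_return)
```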
:::info
**TODO 2.4: Implement `reinforce_loss_with_baseline` in `reinforce.py`**
Same as `reinforce_loss`, but subtract `baseline` (a scalar float) from each return before multiplying:
$$\mathcal{L} = -\sum_t \log\pi_\theta(a_t \mid s_t) \cdot (G_t - b)$$
:::
### The Gradient Update
After computing the loss, the gradient update follows the same pattern as supervised learning: zero the gradients, backpropagate, and step the optimizer.
:::info
**TODO 2.5: Implement `train_step` in `reinforce.py`**
Given the collected trajectory for one episode:
1. If `use_baseline` is True, call `reinforce_loss_with_baseline(log_probs, returns, baseline_val)`. Otherwise call `reinforce_loss(log_probs, returns)`.
2. Zero the optimizer's gradients with `optimizer.zero_grad()`.
3. Call `loss.backward()`.
4. Step the optimizer with `optimizer.step()`.
5. Return `loss.item()` for logging.
:::
### Verifying REINFORCE
To train both variants side by side and generate a comparison plot:
```bash
python main.py train --model reinforce --compare-baseline
```
Or train a single variant:
```bash
python main.py train --model reinforce # with baseline (default)
python main.py train --model reinforce --no-baseline
```
**Training curves**: The un-baselined REINFORCE will likely plateau around an average reward of 40-90. The baseline version should get slightly higher (100-150) and learn faster.
**Return distribution histograms**: The baseline variant should show a distribution shifted slightly higher. Without the baseline, there is more spread because the agent occasionally collapses before recovering.
:::warning
**Note on Scalar Baselines**: You will likely observe a "boom and bust" cycle where the baseline agent suddenly crashes to a low reward before recovering. This happens because subtracting a single scalar baseline from a decreasing sequence of returns inadvertently penalizes actions taken late in long episodes. This fundamental instability demonstrates why we need the state-dependent baselines introduced in the next section!
:::
:::success
**Expected timescale**: REINFORCE on CartPole is highly unstable. It typically reaches its plateau somewhere between 300 and 800 episodes. There is significant run-to-run variance, which is normal and reflects the high variance of the estimator.
:::
:::warning
**Device mismatch**: `compute_returns` returns a CPU tensor. The training loop calls `.to(device)` on it immediately after. If you get a device error in the loss function, trace it back to whether returns ended up on the right device.
:::
To visualize a trained checkpoint:
```bash
# Run both variants and get a comparison plot in one go (recommended)
python main.py train --model reinforce --compare-baseline
# Or train just one variant
python main.py train --model reinforce # with baseline (default)
python main.py train --model reinforce --no-baseline
# Visualize the best checkpoint (no path needed)
python main.py visualize --env CartPole-v1 --model reinforce
```
## Actor-Critic (A2C)
### REINFORCE's Variance Problem, Revisited
In the previous section you saw that REINFORCE can converge, but it is noisy. Even with a scalar baseline, the returns $G_t$ are high-variance estimates of how good an action was.
The root cause is that REINFORCE waits until the end of an episode to compute returns. For a 500-step CartPole episode, $G_0$ includes 500 steps of compounding variance.
We want a lower-variance signal. Instead of using the full return $G_t$, we can use the **advantage** $A(s_t, a_t)$, which tells us how much better action $a_t$ was compared to the average action in state $s_t$.
### The Advantage Function
The **advantage function** is defined as:
$$A(s,a) = Q(s,a) - V(s)$$
$Q(s,a)$ is the expected return starting from $s$, taking action $a$, then following $\pi$. $V(s)$ is the expected return starting from $s$ and following $\pi$ from the start. Their difference measures whether $a$ was better or worse than the policy's average.
$$A(s,a) > 0 \implies \text{action } a \text{ was better than average, so increase its probability}$$
$$A(s,a) < 0 \implies \text{action } a \text{ was worse than average, so decrease its probability}$$
The advantage naturally has lower variance than the raw return because $V(s)$ absorbs the contribution of the state itself.
### One-Step TD Error as an Advantage Estimate
Computing $Q(s,a)$ exactly requires knowing the full future trajectory. Instead, we use a **one-step TD approximation**:
$$Q(s,a) \approx r + \gamma\, V(s')$$
This gives us a one-step TD estimate of the advantage:
$$\delta_t = r_t + \gamma\, V(s_{t+1}) - V(s_t) \approx A(s_t, a_t)$$
This is the **TD error** $\delta_t$. It uses only the immediate reward and value estimates at the current and next state, so no full return is needed.
This is the key algorithmic shift from REINFORCE to actor-critic: instead of Monte Carlo returns, we use one-step bootstrap estimates.
### The Actor-Critic Architecture
We now need two things:
1. **Actor**: the policy $\pi_\theta(a \mid s)$
2. **Critic**: a value function $V_\phi(s)$
In practice, they often share a backbone with separate output heads. This is the **shared-trunk** design: one feature extractor, two output layers. One network learns a shared representation of the environment and the agent's actions, and two separate heads map that representation to the action space and the value function, respectively.
:::info
NOTE: The critic head outputs shape `(B, 1)`. Squeeze the last dimension to get `(B,)` so downstream code can treat it as a plain scalar per state.
:::
:::info
**TODO 3.1: Implement `ActorCritic.__init__` and `ActorCritic.forward` in `actor_critic.py`**
Define three attributes:
- `self.trunk`: an `nn.Sequential` with two Linear+ReLU layers
- `self.actor_head`: Linear layer to output to action space
- `self.critic_head`: Linear layer to output to the value function
In `forward`, pass `x` through `self.trunk` to get `h`, then return `(self.actor_head(h), self.critic_head(h).squeeze(-1))`.
:::
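For reference, a sketch of the shared-trunk design (the hidden width is an illustrative choice):

```python
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    """Shared trunk with separate actor and critic heads."""

    def __init__(self, state_dim: int, n_actions: int, hidden: int = 128):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
        )
        self.actor_head = nn.Linear(hidden, n_actions)  # logits over actions
        self.critic_head = nn.Linear(hidden, 1)         # scalar V(s)

    def forward(self, x):
        h = self.trunk(x)
        # Squeeze (B, 1) -> (B,) so the value reads as one scalar per state.
        return self.actor_head(h), self.critic_head(h).squeeze(-1)
```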
### Computing the Advantage
The one-step TD advantage is:
$$\delta_t = r_t + \gamma \cdot V(s_{t+1}) \cdot \mathbb{1}[\text{not done}] - V(s_t)$$
When the episode ends (`done=True`), there is no next state and the bootstrap term is zero.
There is an important gradient-flow consideration. `value` ($V(s_t)$) carries a gradient through the critic head. The advantage $\delta_t$ is used in the **actor loss**. If you do not detach `value` before computing the advantage, gradients from the actor loss will flow backward into the critic parameters through the advantage, creating an unintended coupling. The critic should only learn from the critic loss (computed separately via `td_target` and `value`). Detach `value` here to prevent the actor from accidentally updating the critic.
:::info
**TODO 3.2: Implement `compute_advantage` in `actor_critic.py`**
Compute the one-step TD advantage:
$$\delta = r + \gamma \cdot \text{next\_value} \cdot (1 - \text{done}) - \text{value.detach()}$$
`next_value` is already detached (computed under `torch.no_grad()` in the training loop). Cast `done` to `float` with `float(done)` before arithmetic.
:::
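A minimal sketch, assuming the scalar arguments described above:

```python
import torch

def compute_advantage(reward, next_value, value, done, gamma=0.99):
    """One-step TD advantage.

    value is detached so the actor loss cannot push gradients into the
    critic through the advantage; next_value arrives already detached.
    """
    td_target = reward + gamma * next_value * (1.0 - float(done))
    return td_target - value.detach()
```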
### The Actor and Critic Losses
#### Actor Loss
Use the TD error as a proxy for the advantage, but treat it as a constant (do not let gradients from the actor loss flow back into the critic):
$$\mathcal{L}_\text{actor} = -\log\pi_\theta(a \mid s) \cdot \delta_t.\text{detach()}$$
#### Critic Loss
The critic is trained to minimize the squared TD error:
$$\mathcal{L}_\text{critic} = \left(r + \gamma V(s') - V(s)\right)^2 = \left(\text{td\_target.detach()} - V(s)\right)^2$$
The `td_target` must be detached so the critic minimizes the error relative to a fixed target, not a moving one.
#### Combined Loss
Both networks share parameters, so we combine losses into one:
$$\mathcal{L} = \mathcal{L}_\text{actor} + c_v \cdot \mathcal{L}_\text{critic}$$
where $c_v$ (here `critic_coeff`) scales the relative contribution.
:::info
**TODO 3.3: Implement `actor_critic_loss` in `actor_critic.py`**
Return both losses as separate tensors:
- `actor_loss = -log_prob * advantage.detach()`
- `critic_loss = (td_target.detach() - value) ** 2`
The `.detach()` calls are not optional. Think through what would happen without each one before writing the code.
:::
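A sketch showing where each `.detach()` sits (argument names are illustrative):

```python
import torch

def actor_critic_loss(log_prob, advantage, td_target, value):
    """Return (actor_loss, critic_loss) for one transition."""
    # Actor: the advantage is a fixed weight, not a gradient path.
    actor_loss = -log_prob * advantage.detach()
    # Critic: regress V(s) toward a fixed TD target.
    critic_loss = (td_target.detach() - value) ** 2
    return actor_loss, critic_loss
```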
:::info
**TODO 3.4: Implement `train_step` in `actor_critic.py`**
1. Call `actor_critic_loss` to get `(actor_loss, critic_loss)`.
2. Compute `loss = actor_loss + critic_coeff * critic_loss`.
3. Zero the optimizer's gradients.
4. Call `loss.backward()`.
5. Step the optimizer.
6. Return `loss.item()` for logging.
:::
### Why This Works
The critic provides a low-variance baseline. Instead of using a scalar moving average, the critic **learns a state-dependent baseline** $V(s)$. The result is that the gradient signal the actor receives is much more informative, making convergence faster and more reliable.
### Verifying Actor-Critic
Run `python main.py train --model a2c` after completing all four TODOs.
**A2C vs REINFORCE**: Once you have trained both, compare their training curves with the `compare-baseline` mode or by plotting checkpoints side by side. A2C should reach the CartPole maximum faster and with less variance.
**Gradient flow**: If your agent never improves, add a print to check whether `advantage` has `requires_grad=True` when it arrives at `actor_critic_loss`. If it does, you are missing a `.detach()` call somewhere.
:::success
**Typical convergence on CartPole**: A2C usually reaches a reward of 350-450+ within 800-1000 episodes. On LunarLander-v2, expect 800-1500 episodes before reward exceeds 200.
:::
:::warning
**A common bug**: Computing `td_target` with grad and then using it in `critic_loss = (td_target - value)^2` without detaching it. This makes the critic chase its own tail. Always treat the TD target as a fixed number.
:::
Visualize the trained A2C agent:
```bash
# Default: CartPole-v1, 1000 episodes
python main.py train --model a2c
# If you want to run longer or on a harder environment
python main.py train --model a2c --episodes 1500 --env LunarLander-v2
# Visualize the best checkpoint (no path needed)
python main.py visualize --env CartPole-v1 --model a2c
# Or for LunarLander
python main.py visualize --env LunarLander-v2 --model a2c
```
---
## Comparing Agents
Once you have trained multiple agents, it is instructive to see how their learned behaviors differ in the same environment. The `visualize` subcommand supports a side-by-side comparison mode when you provide two checkpoints:
```bash
python main.py visualize \
--checkpoint-a checkpoints/CartPole-v1/reinforce/best \
--checkpoint-b checkpoints/CartPole-v1/a2c/best
```
This runs both agents in the same environment in parallel and renders the frames side by side. The resulting GIF and a JSON summary are saved to `outputs/visualizations`.
Both checkpoints must correspond to agents trained on the same environment. To use a different environment than the one stored in the checkpoint metadata, pass `--env`:
```bash
python main.py visualize \
--checkpoint-a checkpoints/CartPole-v1/dqn/best \
--checkpoint-b checkpoints/CartPole-v1/a2c/best \
--env CartPole-v1
```
## Submission
All we ask is that you upload all of your model files and the output visualizations. The easiest method is GitHub, but you can also upload a `.zip` of the entire repository.