# NERS Paper Walkthrough
[NERS Paper Link](https://openreview.net/forum?id=gJYlaqL8i8)
[NERS Code Link](https://github.com/youngmin0oh/NERS)
[CCLF Code](https://hackmd.io/@jeffreymo/B1yVnWBY5)
[CCLF Paper](https://hackmd.io/@jeffreymo/HJV-Zy__c)
###### tags: `Jeffrey`, `Bajaj - CVC Lab`, `CVC Lab`
# High Level Overview of NERS
## Primary Contribution of NERS
**Neural Experience Replay Sampler (NERS) is a proposed framework that increases the sample efficiency of a replay buffer by incorporating three neural networks $f_l$, $f_g$, and $f_s$ in the buffer to score the importance of EACH TRANSITION in an episode rather than the importance of an episode in the replay buffer.** The relative importance of each transition allows the agent to sample the most important transitions rather than just random ones. NERS can be plugged into any off-policy RL framework (e.g., [Soft Actor Critic (SAC)](https://arxiv.org/abs/1801.01290), [Rainbow](https://arxiv.org/abs/1710.02298), or [Twin Delayed Deep Deterministic (TD3)](https://arxiv.org/abs/1802.09477)). In the experiments listed below, the Rainbow framework is used.
### General List of Referenced Variables
| Variable | Variable Name | Corresponding Equation | Description |
| -------- | -------- | ---------------------- | ------------ |
|$s_t$|State| See defining environment|State|
|$a_t$|Action| |Action|
|$r_t$|Reward| |Reward|
|$\gamma$|Discount Factor| $\gamma \in [0,1]$|Discount Factor|
|$\pi_{\psi}$|Policy |$\pi_{\psi}(a \vert s)$|A policy (i.e., actor) with parameters $\psi$|
|$Q_{\theta}(s, a)$|Online Q-network|**Neural Network**|The online Q-function (i.e., critic) with parameters $\theta$|
|$Q_{\hat{\theta}}(s,a)$|Target Q-network|**Neural Network**|The target Q-function with parameters $\hat{\theta}$|
| $\mathcal{B}$|Replay Buffer|$\mathcal{B}_i=(s_i, a_i, r_i, s_{i+1})$| Replay buffer that stores $s, a, r, s_{i+1}$ at every timestep $i$|
|$I$|Index Set|$I \subset [\vert \mathcal{B} \vert]$|A set of sampled transition indices from the replay buffer|
|$\mathcal{P}$|Priority Set|$\mathcal{P}_{\mathcal{B}}=\{\sigma_1,...,\sigma_{\vert \mathcal{B} \vert}\}$|A set of priorities where $\sigma$ is the priority of a transition at each index of $\vert \mathcal{B}\vert$|
|$\alpha$ | Alpha | **Hyperparameter** |Prioritised experience replay exponent |
|$\beta$ |Beta |**Hyperparameter**| Initial prioritised experience replay importance sampling weight |
### List of Referenced Equations
|Eqn#| Equation | Used in |
| -- | -------- | -------- |
| 1 | $p_i= \frac{\sigma_i^{\alpha}}{\sum_{k \in [\vert \mathcal{B} \vert]} \sigma_k^{\alpha}}$ | $p_i$ denotes the probability a transition will be sampled in the training batch |
| 2 | $\delta_{k(i)}= r_{k(i)} + \gamma \max_a Q_{\hat{\theta}}(s_{k(i)+1}, a)-Q_{\theta}(s_{k(i)}, a_{k(i)})$ | $\delta_{k(i)}$ denotes the temporal difference error, a metric for how 'surprising' or 'unexpected' a transition is. |
| 3 | $\boldsymbol{D}^{cat}(I) := \{f_{l,1}\oplus f_g^{avg},..., f_{l, \vert I \vert} \oplus f_g^{avg}\}\in \mathbb{R}^{\vert I \vert \times (d_l+d_g)}$ | $\oplus$ denotes concatenation. $f_g^{avg}$ denotes the "global context," an average of the outputs of $f_g$. $\boldsymbol{D}^{cat}(I)$ is the input to $f_s$.|
| 4 | $f_s(\boldsymbol{D}^{cat}(I))=\{\sigma_i\}_{i \in I} \in \mathbb{R}^{\vert I \vert}$ | $f_s$ generates a score set. |
| 5 | $w_i=(\frac{1}{\vert \mathcal{B} \vert p(i)})^{\beta}$ | $w_i$ is the importance sampling weight which is used when updating the critic. |
| 6 | $r^{re} := \mathbb{E}_{\pi}[\sum_{t \in t_{ep}}r_t]-\mathbb{E}_{\pi'}[\sum_{t \in t_{ep}} r_t]$ | The replay reward measures how much the sampling policy's selections help the agent's learning in each episode. We try to maximize the replay reward.|
| 7 | $\nabla_\phi\mathbb{E}_{I_{train}}[r^{re}]=\mathbb{E}_{I_{train}}[r^{re} \sum_{i \in I_{train}}\nabla_{\phi}\log p_i(D(\mathcal{B}, I_{train}))]$ | This is the equation used to maximize $r^{re}$ |
## NERS Breakdown

NERS incorporates 3 neural networks into the replay buffer: the local context network $f_l$, the global context network $f_g$, and the scoring network $f_s$. From a set of transitions sampled from $\mathcal{B}$, NERS extracts an array of features consisting of $[s_t, a_t, r_t, s_{t+1}, \delta_t, r_t + \gamma \max_a Q_{\hat{\theta}}(s_{t+1}, a)]$; $f_l$ and $f_g$ both take this feature array as input.
### The Local Network
* The local network $f_l$ takes in the feature set as input.
* The purpose of $f_l$ is to reduce the dimensions of the feature data.
* The output dimension of $f_l$ is half of the input dimension of $f_s$
### The Global Network
* The global network $f_g$ takes in the feature set as input.
* The purpose of $f_g$ is to reduce the dimensions of the feature data.
* ALL outputs of $f_g$ are averaged to create a global context array that is concatenated with the output of each $f_l$
* The output dimension of $f_g$ is half of the input dimension of $f_s$
### The Scoring Network
* The scoring network takes in each of the concatenated arrays.
* The purpose of $f_s$ is to calculate the score of each transition.
* The output of $f_s$ is a 1D array with a length corresponding to the number of transitions originally sampled.
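To make the data flow of Eqs. 3 and 4 concrete, here is a minimal PyTorch sketch (the layer widths and the feature dimension are illustrative, not the repo's exact values):
```python=
import torch
import torch.nn as nn

num_transitions, feat_dim, d_l = 32, 69, 64   # |I|, per-transition feature size, d_l = d_g (illustrative)

f_l = nn.Sequential(nn.Linear(feat_dim, 128), nn.ReLU(), nn.Linear(128, d_l))            # local network
f_g = nn.Sequential(nn.Linear(feat_dim, 128), nn.ReLU(), nn.Linear(128, d_l))            # global network
f_s = nn.Sequential(nn.Linear(2 * d_l, 128), nn.ReLU(), nn.Linear(128, 1), nn.Sigmoid()) # scoring network

features = torch.randn(num_transitions, feat_dim)      # D(B, I): one feature row per sampled transition

local_out = f_l(features)                              # |I| x d_l local contexts
global_avg = f_g(features).mean(dim=0, keepdim=True)   # f_g^avg: 1 x d_g global context
d_cat = torch.cat([local_out, global_avg.expand_as(local_out)], dim=-1)  # Eq. 3: |I| x (d_l + d_g)
scores = f_s(d_cat).squeeze(-1)                        # Eq. 4: one score per sampled transition
```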
### **Training Batch**
The score of each transition is added as a label (priority) to each sampled transition in $\mathcal{B}$. The probability of selecting a transition with score $\sigma_i$ for the training batch is:
$$
p_i=\frac{\sigma_i^{\alpha}}{\sum_{k \in [\vert \mathcal{B} \vert]}\sigma_k^{\alpha}}
$$
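As a concrete illustration of Eq. 1, here is a minimal NumPy sketch (the score array `sigma` and `alpha=0.6` are illustrative values):
```python=
import numpy as np

def sampling_probabilities(sigma, alpha=0.6):
    """Turn transition scores into sampling probabilities (Eq. 1)."""
    priorities = np.asarray(sigma, dtype=np.float64) ** alpha
    return priorities / priorities.sum()

# Higher-scored transitions are sampled more often
sigma = [0.1, 0.5, 0.9, 0.3]
p = sampling_probabilities(sigma)
batch_indices = np.random.choice(len(sigma), size=2, replace=False, p=p)
```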
# Using NERS in Code

## **Algorithm 1** Training NERS: batch size $m$ and sample size $n$
1. Initialize NERS parameters $\phi$, a replay buffer $\mathcal{B} \leftarrow \varnothing$, and index set $\mathcal{I} \leftarrow \varnothing$
2. **For** each timestep $t$ **do**
3. $\quad$ Choose $a_t$ from the actor and collect a sample $(s_t, a_t, r_t, s_{t+1})$ from the environment
4. $\quad$ Update replay buffer $\mathcal{B} \leftarrow \mathcal{B} \cup \{(s_t, a_t, r_t, s_{t+1})\}$ and priority set $\mathcal{P}_{\mathcal{B}} \leftarrow \mathcal{P}_{\mathcal{B}} \cup \{1.0\}$
5. **For** each gradient step **do**
6. $\quad \quad$ Sample an index set $I$ using the priority set $\mathcal{P}_{\mathcal{B}}$ and Eq. 1, with $\vert I \vert = m$
7. $\quad \quad$ Calculate a score set $\{ \sigma_k \}_{k \in I}$ and weights $\{w_i\}_{i \in I}$ with Eq. 4 and 5, respectively
8. $\quad \quad$ Train the actor and critic using batch $\{\mathcal{B_i}\}_{i \in I} \subset \mathcal{B}$ and corresponding weights $\{w_i\}_{i \in I}$
9. $\quad \quad$ Collect $\mathcal{I} \leftarrow \mathcal{I} \bigcup I$ and update $\mathcal{P}_{\mathcal{B}}(I)$ by the score set $\{\sigma_k\}_{k \in I}$
10. $\quad$ **end**
11. $\quad$ **if** at the end of an episode **then**
12. $\quad$ $\quad$ Choose a subset $I_{train}$ from $\mathcal{I}$ uniformly at random such that $\vert I_{train} \vert = n$
13. $\quad$ $\quad$ Calculate $r^{re}$ as in Eq. 6
14. $\quad$ $\quad$ Update sampling policy $\phi$ using the gradient with respect to $I_{train}$
15. $\quad$ $\quad$ Empty $\mathcal{I}$ i.e., $\mathcal{I} \leftarrow \varnothing$
16. $\quad$ **end**
17. **end**
## Defining the Architecture of NERS in the Code
As mentioned before, NERS consists of three major networks: the local network $f_l$, the global network $f_g$, and the scoring network $f_s$.
# NERS Code Walkthrough
## Initializing the environment
First, we need to fill in the arguments/hyperparameters at the top of `main.py`. Then, we initialize the environment `env` using the following piece of code:
```python=
env = Env(args)
```
The environment **MUST** have the following defined functions (a minimal interface sketch follows this list):
* `_get_state`: Returns an `(84,84)` grayscale image of the environment in the form of a `torch.Tensor`
* `_reset_buffer`: Resets the frame buffer of `self.window` frames (**hyperparameter**) to a series of blank `(84,84)` arrays
* `reset`: Resets the environment to its native state
* `step(action)`: Applies the action to the environment, repeated `self.window` times (frame skip)
* `train`: Puts the environment in training mode, in which a lost life indicates a terminal signal
* `eval`: Puts the environment in evaluation mode, in which only game over indicates a terminal signal
* `action_space`: Returns the number of (discrete) actions
* `render`: Displays an image of the environment (optional)
* `close`: Closes the environment
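Below is a minimal sketch of this interface. It is not the repo's actual `env.py`; the mapping of `self.window` to `args.history_length`, the dummy state, and the action count are illustrative assumptions:
```python=
from collections import deque
import torch

class Env:
    """Illustrative skeleton of the environment interface used in main.py."""
    def __init__(self, args):
        self.window = args.history_length                 # number of stacked frames (hyperparameter)
        self.state_buffer = deque([], maxlen=self.window)
        self.training = True                              # train vs. eval terminal semantics

    def _get_state(self):
        # Placeholder: return the current screen as an (84, 84) grayscale tensor in [0, 1]
        return torch.zeros(84, 84)

    def _reset_buffer(self):
        for _ in range(self.window):
            self.state_buffer.append(torch.zeros(84, 84))

    def reset(self):
        self._reset_buffer()
        self.state_buffer.append(self._get_state())
        return torch.stack(list(self.state_buffer), 0)    # stacked (window, 84, 84) state

    def step(self, action):
        reward, done = 0.0, False
        for _ in range(self.window):                      # repeat the action (frame skip)
            # ... apply `action` to the underlying game, accumulate reward, detect terminal ...
            pass
        self.state_buffer.append(self._get_state())
        return torch.stack(list(self.state_buffer), 0), reward, done

    def train(self):
        self.training = True                              # losing a life counts as terminal

    def eval(self):
        self.training = False                             # only game over counts as terminal

    def action_space(self):
        return 4                                          # number of discrete actions (illustrative)

    def render(self):
        pass                                              # optional: display an image of the environment

    def close(self):
        pass
```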
After the initialization of the environment, we record the action space `action_space` of `env` using the command:
```python=
action_space = env.action_space()
```
### Environment Reward Function Calculation
The environment reward function varies per environment. To see the default Ms. Pacman environment, [see here](#Defining-the-Environment).
## Initialization of the Agent
We initialize the agent `dqn` using the following command:
```python=
dqn = Agent(args, env)
```
The line `Agent(args, env)` will instantiate the `DQN()` class from `memory.py` and will create two networks: an online network `online_net` and a target network `target_net`. The initialization and architecture of each network are shown in the following lines of code:
```python=
class DQN(nn.Module):
    def __init__(self, args, action_space):
        super().__init__()
        self.atoms = 51                   # number of atoms in the value distribution (C51)
        self.action_space = action_space
        self.convs = nn.Sequential(nn.Conv2d(args.history_length, 32, 5, stride=5, padding=0), nn.ReLU(),
                                   nn.Conv2d(32, 64, 5, stride=5, padding=0), nn.ReLU())
        self.conv_output_size = 576
        self.fc_h_v = NoisyLinear(self.conv_output_size, args.hidden_size, std_init=args.noisy_std)
        self.fc_h_a = NoisyLinear(self.conv_output_size, args.hidden_size, std_init=args.noisy_std)
        self.fc_z_v = NoisyLinear(args.hidden_size, self.atoms, std_init=args.noisy_std)
        self.fc_z_a = NoisyLinear(args.hidden_size, action_space * self.atoms, std_init=args.noisy_std)

    def forward(self, x, log=False):
        x = self.convs(x)
        x = x.view(-1, self.conv_output_size)
        v = self.fc_z_v(F.relu(self.fc_h_v(x)))  # Value stream
        a = self.fc_z_a(F.relu(self.fc_h_a(x)))  # Advantage stream
        v, a = v.view(-1, 1, self.atoms), a.view(-1, self.action_space, self.atoms)
        q = v + a - a.mean(1, keepdim=True)  # Combine streams
        if log:  # Use log softmax for numerical stability
            q = F.log_softmax(q, dim=2)  # Log probabilities with action over second dimension
        else:
            q = F.softmax(q, dim=2)  # Probabilities with action over second dimension
        return q
```
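To see how this distributional output is used, the expected Q-value of each action is the probability-weighted sum over the support. A short sketch, assuming `online_net` is an instance of the `DQN` class above and using the usual C51 support of 51 atoms between $V_{min}=-10$ and $V_{max}=10$:
```python=
support = torch.linspace(-10, 10, 51)                  # atom values z (usual C51 defaults)
state = torch.zeros(args.history_length, 84, 84)       # dummy stacked-frame state

with torch.no_grad():
    probs = online_net(state.unsqueeze(0))             # shape: (1, action_space, 51)
    q_values = (probs * support).sum(2)                # expected return per action
    action = q_values.argmax(1).item()                 # greedy action
```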
The optimizer used for `dqn` is the Adam optimizer. It is initialized using the following line of code:
```python=
self.optimiser = optim.Adam(self.online_net.parameters(), lr=0.0001, eps=1.5e-4)
```
Below is an image of the network:

## Initialization of NERS
Calling `NERS(args, 1000000)` will create **a segment tree with a capacity of 1000000, filled with zeros**. This is seen in the following command:
```python=
self.sum_tree = np.zeros((2 * 1000000 - 1, ), dtype=np.float32)
```
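For intuition, here is a bare-bones sum tree with the same `2 * capacity - 1` layout (a simplified stand-in for the repo's `SegmentTree`; the method names are illustrative). Leaves hold the priorities and every internal node holds the sum of its children, so sampling in proportion to priority becomes a root-to-leaf walk:
```python=
import numpy as np

class SumTree:
    """Simplified sum tree: leaves store priorities, internal nodes store sums of their children."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.tree = np.zeros(2 * capacity - 1, dtype=np.float32)

    def update(self, data_index, priority):
        i = data_index + self.capacity - 1     # position of the leaf in the flat array
        delta = priority - self.tree[i]
        while True:                            # propagate the change up to the root
            self.tree[i] += delta
            if i == 0:
                break
            i = (i - 1) // 2

    def total(self):
        return self.tree[0]                    # the root holds the sum of all priorities

    def find(self, value):
        """Walk from the root to the leaf whose prefix-sum interval contains `value`."""
        i = 0
        while 2 * i + 1 < len(self.tree):      # while i is not a leaf
            left = 2 * i + 1
            if value <= self.tree[left]:
                i = left
            else:
                value -= self.tree[left]
                i = left + 1
        return self.tree[i], i - (self.capacity - 1)   # (priority, data index)
```
Sampling a transition then amounts to drawing a value uniformly from `[0, total()]` and calling `find(value)`; this is the mechanism the `sample()` function shown later builds on.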
Furthermore, the `NERS()` function will also create the following networks:
### Data/State preprocessing networks:
* `self.convs`: a convolutional network that takes the raw image array as input and flattens it
* `self.feature_net2`: a series of linear layers that reduces the output of `self.convs` to an array of 32 features.
### The Local Network
In the code, $f_l$ is a neural network with a sequential architecture, consisting of 4 linear layers. It takes an input of 69 features, represented by $s_t$ (32 features), $s_{t+1}$ (32 features), $a_t$ (1 feature), $r_t$ (1 feature), $\delta_t$ (1 feature), the $Q$-value (1 feature), and $t$ (1 feature). $f_l$ aims to reduce the dimensionality of the input features from $69$ to $64$, i.e., half of the input feature size of $f_s$ (**hyperparameter**).
The initialization of the local network `local_net` is shown below in code:
```python=
self.local_net = nn.Sequential(nn.Linear(self.input_size, hiddensize), nn.ReLU(),
nn.Linear(hiddensize, hiddensize * 2), nn.ReLU(),
nn.Linear(hiddensize * 2, hiddensize), nn.ReLU(),
nn.Linear(hiddensize, hiddensize // 2)).to(device=self.device)
```
### The Global Network
In the code, $f_g$ is a neural network with a sequential architecture, consisting of 4 linear layers. It takes an input of 69 features, represented by $s_t$ (32 features), $s_{t+1}$ (32 features), $a_t$ (1 feature), $r_t$ (1 feature), $\delta_t$ (1 feature), the $Q$-value (1 feature), and $t$ (1 feature). Like $f_l$, $f_g$ aims to reduce the dimensionality of the input features from $69$ to $64$, i.e., half of the input feature size of $f_s$ (**hyperparameter**). Note that this is the exact same architecture as $f_l$.
$f_g$ aims to capture the same information as $f_l$; however, the output arrays of $f_g$ are averaged to create a global context among the sampled transitions such that $f_g^{avg}(\boldsymbol{D}(\mathcal{B},I))= \frac{\sum f_g(\boldsymbol{D}(\mathcal{B},I))}{|I|}$.
The initialization of the global network `global_net` is shown below in code:
```python=
self.global_net = nn.Sequential(nn.Linear(self.input_size, hiddensize), nn.ReLU(),
nn.Linear(hiddensize, hiddensize * 2), nn.ReLU(),
nn.Linear(hiddensize * 2, hiddensize), nn.ReLU(),
nn.Linear(hiddensize, hiddensize // 2)).to(device=self.device)
```
### The Scoring Network
In the code, $f_s$ is a network with a sequential architecture consisting of 3 linear layers. It takes an input of 128 features (the concatenated array described in Eq. 3). Considering both the local ($f_l$) and global ($f_g$) contexts, $f_s$ outputs a 1D array with a length of 32 (the number of sampled transitions, **hyperparameter**); each index of the array holds a float representing the score/priority of the corresponding transition.
The initialization of the scoring network `score_net` is shown below in code:
```python=
self.score_net = nn.Sequential(nn.Linear(2 * (hiddensize // 2), hiddensize), nn.ReLU(),
nn.Linear(hiddensize, hiddensize // 2), nn.ReLU(),
nn.Linear(hiddensize // 2, 1), nn.Sigmoid()).to(device=self.device)
```
### The Replay Buffer
In the code, $\mathcal{B}$ is stored as a segment tree whose values correspond to the priorities. The segment tree `self.transitions`, with a capacity of 1000000 transitions, is initialized with the following line of code:
```python=
self.transitions = SegmentTree(capacity)
```
### Filling the Initial Buffer
For the first **$500$** steps (`args.evaluation_size`, **hyperparameter**), the agent takes random actions and appends each transition to the validation memory `val_mem`, as shown in the lines of code below. Each transition's action and reward are logged as `None`. [Q: Why is this? Presumably because `val_mem` is only used to evaluate average Q-values on held-out states, so the actions and rewards are never read.]
```python=
while T < args.evaluation_size:
    if done:
        state, done = env.reset(), False
    next_state, _, done = env.step(np.random.randint(0, action_space))
    val_mem.append(state, None, None, done)
    state = next_state  # Advance to the next state
    T += 1              # Count the warm-up step
```
## Interactions with the Environment
To gather information, the agent must perform an action and observe the resulting reward. The following piece of code calls `dqn.act()` to return an action given a state, applies the action to the environment, and observes the returned reward.
```python=
action = dqn.act(state) # Choose an action greedily (with noisy weights)
next_state, reward, done = env.step(action) # Step
```
This code chooses an action via the `dqn.act()` function, which selects the **greedy action** (with noisy weights) returned from the DQN.
The action is then applied to the environment, which returns the next state, the reward, and a terminal bool value.
## Updating the Replay Buffer
After observing a `next_state`, `reward`, and `done` value, the agent stores the transition in the replay buffer through the following line of code:
```python=
mem.append(state, action, reward, done) # Append transition to memory
```
This calls the `append()` function of the replay memory, which stores the transition in the segment tree.
### Appending the Transition
Appending the transition stores only the last frame of the current state, discretised to `uint8` to save memory. The transition is then appended to the segment tree with the current maximum priority (1.0 at the start).
```python=
# Found in NERS.memory.ReplayMemory()
def append(self, state, action, reward, terminal):
    state = state[-1].mul(255).to(dtype=torch.uint8, device=torch.device('cpu'))  # Only store last frame and discretise to save memory
    self.td[self.transitions.index] = 1.0
    self.q[self.transitions.index] = 1.0
    self.transitions.append(Transition(self.t, state, action, reward, not terminal), self.transitions.max)  # Store new transition with maximum priority
    self.t = 0 if terminal else self.t + 1  # Start new episodes with t = 0
```
## Training NERS:
We train the `dqn` every `1` step (`args.replay_frequency`, **hyperparameter**):
```python=
if T % args.replay_frequency == 0:
    dqn.learn(mem)
```
### Sampling an Index
We sample transitions by calling the `sample()` function of `mem`.
For each of the $\text{batch} = 32$ samples, the total priority mass is divided into equal segments and a value is drawn uniformly at random within each segment, so transitions are sampled (approximately) in proportion to their priorities. **For each drawn value, the `find` function from `SegmentTree` traverses the tree to return the `prob`, index, and tree index of the matching transition.**
```python=
def sample(self, batch_size):
    p_total = self.transitions.total()  # Retrieve sum of all priorities (used to create a normalised probability distribution)
    segment = p_total / batch_size  # Batch size number of segments, based on sum over all probabilities
    batch = [self._get_sample_from_segment(segment, i) for i in range(batch_size)]  # Get batch of valid samples
    probs, idxs, tree_idxs, states, actions, returns, next_states, nonterminals = zip(*batch)
    states, next_states = torch.stack(states), torch.stack(next_states)
    actions, returns, nonterminals = torch.cat(actions), torch.cat(returns), torch.stack(nonterminals)
```
### Calculating the Score Set and Weights
After sampling the transitions, the `sample` function normalises the sampled priorities (scores) into probabilities `probs`:
```python=
probs = np.array(probs, dtype=np.float32) / p_total # Calculate normalised probabilities
```
Using `probs`, the `sample` function then calculates the importance-sampling weights of the sampled transitions:
```python=
weights = (capacity * probs) ** -self.priority_weight # Compute importance-sampling weights w
weights = torch.tensor(weights / weights.max(), dtype=torch.float32, device=self.device) # Normalise by max importance-sampling weight from batch
return tree_idxs, states, actions, returns, next_states, nonterminals, weights
```
After calculating the probabilities and weights of the transitions, the `sample()` function returns `tree_idxs, states, actions, returns, next_states, nonterminals, weights` as a tuple that is unpacked by the `learn()` function in `agent.py`.
```python=
idxs, states, actions, returns, next_states, nonterminals, weights = mem.sample(self.batch_size)
```
### Training the Actor and Critic Using Transitions and Weights
We train the actor and critic using the batch $\{\mathcal{B}_i\}_{i \in I}$ and the importance sampling weights $\{w_i\}_{i \in I}$ returned from the sampling.
The training of the actor and critic is separated into three sections:
[Q: Is this learning a possible stochastic transition function? Later used for the Bellman opperator as $p(s'|s,a)$ is needed]
#### Calculating the $n^{th}$ Next State Probability
```python=
pns = self.online_net(next_states) # Probabilities p(s_t+n, ·; θonline)
dns = self.support.expand_as(pns) * pns # Distribution d_t+n = (z, p(s_t+n, ·; θonline))
argmax_indices_ns = dns.sum(2).argmax(1) # Perform argmax action selection using online network: argmax_a[(z, p(s_t+n, a; θonline))]
self.target_net.reset_noise() # Sample new target net noise
pns = self.target_net(next_states) # Probabilities p(s_t+n, ·; θtarget)
pns_a = pns[range(self.batch_size), argmax_indices_ns] # Double-Q probabilities p(s_t+n, argmax_a[(z, p(s_t+n, a; θonline))]; θtarget)
```
#### Computing the Bellman Operator
A Bellman operator is a mapping of one value function to another. We write the Bellman operator $\mathcal{T}$ applied to the support $z$ as $\mathcal{T}z$ (written `Tz` in the code):
$$
\mathcal{T}z = R^{(n)} + \gamma^n z
$$
where $R^{(n)}$ is the batch of $n$-step returns, $\gamma$ is the discount factor (**hyperparameter**), $z$ is the fixed support (the atoms) of the value distribution, and $n$ is the number of steps used for [$n$-step returns](https://towardsdatascience.com/introduction-to-reinforcement-learning-rl-part-7-n-step-bootstrapping-6c3006a13265).
[Q: What is this equation? I haven't seen it before]
We further define
$$
b = \frac{\mathcal{T}z - V_{min}}{\Delta z}
$$
and its floor and ceiling as $l$ and $u$, respectively (the code comments write these as `b_l`, `b_u`).
Below is the code for this equation:
```python=
Tz = returns.unsqueeze(1) + nonterminals * (self.discount ** self.n) * self.support.unsqueeze(0) # Tz = R^n + (γ^n)z (accounting for terminal states)
Tz = Tz.clamp(min=self.Vmin, max=self.Vmax) # Clamp between supported values
# Compute L2 projection of Tz onto fixed support z
b = (Tz - self.Vmin) / self.delta_z # b = (Tz - Vmin) / Δz
l, u = b.floor().to(torch.int64), b.ceil().to(torch.int64) # b_l, b_u
```
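As a concrete example (using the common C51 defaults $V_{min}=-10$, $V_{max}=10$, and 51 atoms, so $\Delta z = 0.4$): a target value $\mathcal{T}z = 3.1$ gives $b = (3.1 - (-10))/0.4 = 32.75$, hence $l = 32$ and $u = 33$, and that atom's probability mass is split as $u - b = 0.25$ onto atom $32$ and $b - l = 0.75$ onto atom $33$ in the projection step below.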
#### Computing the Distribution of the Bellman Operator
This is the L2 projection of $\mathcal{T}z$ onto the fixed support: the probability mass of each target atom is split between its two nearest support atoms $l$ and $u$ (the categorical/C51 projection step used by Rainbow).
```python=
m = states.new_zeros(self.batch_size, self.atoms)
offset = torch.linspace(0, ((self.batch_size - 1) * self.atoms), self.batch_size).unsqueeze(1).expand(self.batch_size, self.atoms).to(actions)
m.view(-1).index_add_(0, (l + offset).view(-1), (pns_a * (u.float() - b)).view(-1)) # m_l = m_l + p(s_t+n, a*)(u - b)
m.view(-1).index_add_(0, (u + offset).view(-1), (pns_a * (b - l.float())).view(-1)) # m_u = m_u + p(s_t+n, a*)(b - l)
qs = self.evaluate_qs(states)
```
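The remaining step (not shown above) is the loss itself. In Rainbow-style implementations, the projected target `m` is compared against the online network's log-probabilities with a cross-entropy term, and the importance-sampling weights from Eq. 5 scale each transition's loss. The sketch below reuses the variable names from the snippets above and may differ from the exact NERS code:
```python=
log_ps = self.online_net(states, log=True)           # log p(s_t, ·; θ_online)
log_ps_a = log_ps[range(self.batch_size), actions]   # log p(s_t, a_t; θ_online)
loss = -torch.sum(m * log_ps_a, 1)                   # cross-entropy per transition
self.optimiser.zero_grad()
(weights * loss).mean().backward()                   # scale by importance-sampling weights (Eq. 5)
self.optimiser.step()
```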
### Updating the Priority Set (score set)
After training the network, we update the priority set with the newly trained information.
First, we take the [selected transitions](#Sampling-an-Index) and concatenate each respective element into its own array (e.g., $s_t, s_{t+1}, a_t,...$). The $s_t$ and $s_{t+1}$ arrays are passed into a feature net which outputs a 1D array of 32 elements.
A concatenated array consisting of $\{s_t, a_t, r_t, s_{t+1}, \delta_t, Q_t, t\}$ is then used as input into $f_l$ and $f_g$. The outputs of each respective network are then used as inputs into $f_s$.
```python=
tensor_input = torch.cat(
(curr, actions, rewards, next, tds, qs, timesteps),
-1).to(device=self.device)
tensor_input = torch.tanh(tensor_input)
global_out = self.global_net(tensor_input)
local_out = self.local_net(tensor_input)
global_out_mean = torch.mean(global_out, 0)
res_out = torch.cat([local_out, torch.unsqueeze(global_out_mean, 0).expand_as(local_out)], -1)
prob = self.score_net(res_out).reshape(-1)
```
The `update_priorities()` function creates a list of the new priorities and their respective indices.
> Note that changing the priorities also causes the importance-sampling weights to change; this is compensated for when `sample()` is next called.
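A minimal sketch of what this amounts to, reusing the simplified `SumTree` from earlier (the actual `update_priorities()` signature in the repo may differ; `alpha` is the prioritisation exponent from Eq. 1):
```python=
def update_priorities(sum_tree, data_indices, scores, alpha=0.6):
    """Write the new NERS scores back into the tree as priorities (illustrative sketch)."""
    for idx, sigma in zip(data_indices, scores):
        sum_tree.update(idx, float(sigma) ** alpha)
```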
## Reward Function and Optimization of the Sampling Policy
NERS is updated at each evaluation step. Let $\pi$ be the current policy used in the evaluation and $\pi'$ be the policy used in the previous evaluation. Moreover, let $t_{ep}$ be the timesteps in an episode. We define the replay reward $r^{re}$ as
$$
r^{re} := \mathbb{E}_{\pi}[\sum_{t \in t_{ep}}r_t]-\mathbb{E}_{\pi'}[\sum_{t \in t_{ep}} r_t]
$$
The replay reward measures how much the sampling policy's selections help the agent's learning in each episode. **To maximize the sample efficiency of learning the agent's policy, the proposed method trains the sampling policy to select transitions that maximize $r^{re}$.** Let $I_{train}$ denote a subset of the indices $I$ of the transitions sampled in an episode. We use the following gradient to maximize $r^{re}$:
$$
\nabla_\phi\mathbb{E}_{I_{train}}[r^{re}]=\mathbb{E}_{I_{train}}[r^{re} \sum_{i \in I_{train}}\nabla_{\phi}\log p_i(D(\mathcal{B}, I_{train}))]
$$
```python=
def learn(self, avg_reward, avg_Q):
    if self.prev_score == np.inf:
        self.prev_score = avg_reward
    else:
        self.curr_score = avg_reward
        self.return_val = (self.curr_score - self.prev_score)  # replay reward r^re (Eq. 6)
        idxes = self.index_for_train[np.random.choice(len(self.index_for_train), self.args.batch_size)]
        prob = self.get_prob_by_idxs(idxes) + 1e-6
        loss = -torch.mean(torch.log(prob) * self.return_val)  # REINFORCE objective (Eq. 7)
        self.optim.zero_grad()
        loss.backward()
        self.optim.step()
    self.index_for_train = np.zeros((0,), dtype=np.int64)  # np.int is deprecated; reset the sampled-index store
```
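In this snippet, `self.return_val` plays the role of the replay reward $r^{re}$ (the change in average evaluation score between consecutive evaluations), `torch.log(prob)` corresponds to the $\log p_i$ terms in Eq. 7, and minimizing `loss = -mean(log(prob) * return_val)` is therefore a REINFORCE-style update of the sampling parameters $\phi$ along the gradient in Eq. 7.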
## Defining the Environment
### Game Overview/Environment Dynamics
| Variable | Name | Description |
| -------- | ----------------- | -------- |
| $s_t$ | State at timestep $t$| An $[84,84]$ array. The first index of the array corresponds to the height of the image, the second index of the array corresponds to the width of the image. The array is returned directly from the Atari-py environment and represents each pixel's intensity at each snapshot of the screen (captured every 4 frames). The values in the array are floats divided by 255 to normalize the data between 0-1.|
| $a_t$ | Action at timestep $t$ | The given action is an integer that represents an action: left $0$, right $1$, up $2$, down $3$. An action will be repeated for 4 frames (frame skip, **hyperparameter**)|
| $r_t$ | Reward at timestep $t$ | The reward awarded to the agent for making an action. This value is returned after every timestep. The reward is calculated based off the difference in score as a result of an action. See below for scoring rubric. |
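For reference, the kind of preprocessing that produces such an $[84,84]$ state looks roughly like the following (a generic sketch using OpenCV; the repo's own `env.py` may differ in details such as the interpolation mode):
```python=
import cv2
import numpy as np
import torch

def preprocess_frame(rgb_frame):
    """Convert a raw Atari RGB frame into a normalised (84, 84) grayscale tensor."""
    gray = cv2.cvtColor(rgb_frame, cv2.COLOR_RGB2GRAY)                 # collapse colour channels
    small = cv2.resize(gray, (84, 84), interpolation=cv2.INTER_LINEAR) # downscale to 84x84
    return torch.tensor(small, dtype=torch.float32) / 255.0            # scale pixel values into [0, 1]

# Example with a dummy 210x160 RGB frame
state = preprocess_frame(np.zeros((210, 160, 3), dtype=np.uint8))
```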
The experiment was run in a game of Ms. Pacman (the default game included in the code). The objective of Ms. Pacman is to accumulate the most points (highest score, displayed at the bottom center of the screen). Points are accumulated by collecting "pac-dots," illustrated by dashes on the field. Additionally, the player may acquire "Power Pellets," which grant additional points and the ability to eat *vulnerable ghosts*, which will further accumulate points. Below is the scoring rubric:
* Pac-dot: 10 points
* Power Pellet: 50 points
* Vulnerable Ghosts in succession:
* 1: 200 points
* 2: 400 points
* 3: 800 points
* 4: 1600 points
Each game was terminated after reaching a terminal state, defined by meeting one of the following terminal conditions:
* Ran out of lives
* Exceeded maximum allowed timesteps
### Initialization of the Game
The player in each game is initialized at a set location (below the middle rectangle) with a score of 0. The player may make one of four moves (left, right, up, down); each action is repeated for 4 frames, consistent with the frame skip. If the player runs into a wall, they are not punished; however, they cannot move into or through the wall. If the player runs through the exits displayed below, they will be teleported to the exit with the corresponding color.

### Interactions with the Environment
At every timestep, the agent selects an action greedily with the `sample_action()` function found in the `CCLF_sac.py` file. The selected action is applied to the environment through the `env.step(action)` function, which returns a new observation $o_{t+1}$, reward $r_t$, and done flag $d_t$.
```python=
for step in range(num_train_steps):
    action = agent.sample_action(obs)
    ...
    next_obs, reward, done, _ = env.step(action)
    # next_obs (np.array([84, 84])): Next observation that is returned
    # as a result of the action
    # reward (float): Float expressing the reward returned from the environment.
    # This is the score from the Atari environment.
    # done (bool): Has the environment reached terminal?
```