###### tags: `Research Paper`, `Doctor Bajaj`, `Personal Notes`, `Reinforcement Learning`

# Sample-Efficient Deep Reinforcement Learning via Episodic Backward Update

[Paper Link](https://arxiv.org/pdf/1805.12375.pdf)

# Introduction

## Main problems to solve

1. Low chance of sampling a transition with a nonzero reward.
**Solved by sampling transitions in an episodic manner (in groups and in chronological order).**
2. If the Q-values are initialized to 0 at the start of training, there is no point in updating a one-step transition with zero reward while the later transitions with nonzero rewards have not been updated yet. Since those later values have not yet received a nonzero reward, every value up to that point remains 0 (e.g., if $r_1 = 0, r_2 = 0$, there is no point in updating any value until some reward becomes nonzero).
**Solved by updating the transitions in a backward manner.**
**Because the update sweeps backward from the end of the episode, the nonzero reward at the end is propagated before the earlier zero-reward transitions are updated, so those updates are no longer wasted.**

___

# Background

## Reinforcement Learning

### Markov Decision Process (MDP):

The Markov Decision Process (MDP) is a way to frame reinforcement learning problems. In an MDP, each state $s_t$ depends only on the previous state $s_{t-1}$. In the MDP framework, we define $S$ as the state space, $A$ as the action space, $P$ as the transition probability distribution, and $R$ as the reward function. Therefore, we may define an MDP as $M$ where
$$
M = (S, A, P, R)
$$
Furthermore, we may define the transition probability distribution $P$ where
$$
P(s_{t+1}|s_t,a_t)
$$
denotes the transition probability for a given state $s_t$ and action $a_t$. For each state, the environment produces a **transition probability distribution** that indicates the likelihood of moving from $s_t$ to $s_{t+1}$ given that the agent takes an action $a$. The agent then receives a reward $r$.

### Q-Learning

The goal of Q-learning is to estimate the Q-function, which may be characterized by the Bellman optimality equation
$$
Q^*(s_t, a_t) = \mathbb{E}[r_t + \gamma \max_{a_{t+1}} Q^*(s_{t+1}, a_{t+1})]
$$
Although generally effective, traditional Q-learning has a few inefficiencies, two of which are discussed in the paper:
1. Each experience is only used once to update the Q-function (experience is wasted).
2. Updating experiences in chronologically forward order is not as effective, since the value update of $s_t$ depends on the value of $s_{t+1}$.

#### Replay Buffer

In [Lin, L-J](https://link.springer.com/article/10.1007/BF00992699), experience replay was proposed to combat these inefficiencies. The paper introduced a **replay buffer** in which the agent stores a transition $(s_t, a_t, r_t, s_{t+1})$ and later samples from said buffer.

#### Deep Q-Networks (DQNs)

Due to large state spaces, it is often impractical to use a table to store Q-values. Thus, DQNs were proposed to approximate the Q-function. DQNs use experience replay so that each transition is used multiple times. There are a few flaws of DQNs:
1. Since DQNs use a function approximator, consecutive (highly correlated) states produce similar Q-values.
2. If a DQN updates transitions in backward chronological order, it often overestimates Q-values, which degrades performance.

Due to these flaws, DQNs often sample transitions uniformly at random, which breaks down the correlations between consecutive transitions and reduces the variance of the update.
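To make the background concrete, below is a minimal Python sketch of a uniform-sampling replay buffer and a one-step TD target in the spirit of the setup described above (the names `ReplayBuffer` and `one_step_target` are my own, not from the paper):

```python
import random
from collections import deque

# Minimal uniform-sampling replay buffer in the spirit of Lin (1992) and the Nature DQN.
class ReplayBuffer:
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)   # oldest transitions are dropped automatically

    def store(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform random sampling breaks the correlation between consecutive transitions.
        return random.sample(self.buffer, batch_size)


def one_step_target(reward, next_q_values, done, gamma=0.99):
    """One-step TD target r + gamma * max_a' Q(s', a'); the bootstrap term is dropped at terminal states."""
    if done:
        return reward
    return reward + gamma * max(next_q_values)
```

With sparse rewards, most transitions drawn this way have `reward = 0` and a near-zero bootstrap term, which is exactly the inefficiency EBU targets.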
___

# Proposed Methods

![Traditional Episodic Backwards Update](https://i.imgur.com/zeniF8Z.png)
> This image shows the difference between uniform sampling ($Q_U$) and episodic backwards update ($Q_E$)

Rather than sampling a limited number of individual transitions, the paper proposes sampling a whole episode from the replay memory and propagating the values sequentially through all the transitions of the sampled episode in a backward manner.

## Episodic Backwards Update for Tabular Q-learning

#### Comparing the differences

![Simple Environment](https://i.imgur.com/B4xESDe.png)
> An environment with only 4 states, with state 1 being the starting state and state 4 being the terminal state. The goal of the agent is to reach the terminal state; the reward for states 1, 2, and 3 is 0 and the reward for state 4 is 1. The optimal policy is to take the action "Right" in all states.

Assume the replay buffer/experience memory holds only one episode:
$$
s_1 \rightarrow s_2 \rightarrow s_3 \rightarrow s_2 \rightarrow s_3 \rightarrow s_4
$$
Using the **Nature DQN (ordinary DQN)**, key transitions such as $(s_1 \rightarrow s_2), (s_2 \rightarrow s_3)$, and $(s_3 \rightarrow s_4)$ may not be sampled for updates. Even if they are sampled, they may not be sampled in the correct order (e.g., the transition $s_3 \rightarrow s_4$ may be updated before the transition $s_1 \rightarrow s_2$). Using the episodic backwards update (EBU), we can speed up the reward propagation by updating all the transitions within an episode (this method is also computationally efficient).

![Experiment](https://i.imgur.com/fTcmfoi.png)
> In the environment described above, agents using EBU took 5 updates of Q-values to reach the optimal solution. Agents that used uniform sampling took more than 40 transitions to learn the optimal path with probability close to 1.

Another important note is that EBU differs from n-step Q-learning in the sense that the maximum number of steps in n-step Q-learning is fixed at *n*, whereas EBU considers $T$ future values, where $T$ is the length of the sampled episode. **N-step Q-learning takes a max operator at the *n*th step only, whereas EBU takes the max operator at every iterative backward step.** To avoid exponential decay of the Q-value, the learning rate $\alpha$ is set to 1 within a single episode update.

___

## Episodic Backwards Update for Deep Q-Learning

EBU differs from traditional deep reinforcement learning algorithms in one main feature: instead of sampling transitions uniformly at random, EBU uses all transitions within a sampled episode $E = (S, A, R, S')$. $E$ can be denoted as a set of four vectors of length $T$ such that $S = (s_1, s_2,...,s_T)$, $A=(a_1, a_2,..., a_T)$, $R=(r_1, r_2,...,r_T)$, and $S'=(s_2, s_3,...,s_{T+1})$. The temporary target Q-table $\tilde{Q}$ is an $|A| \times T$ matrix, where $A$ is the action space of the MDP.
> In other words, the $k$-th column of $\tilde{Q}$ is a column vector containing $\hat{Q}(s_{k+1}, a; \hat{\theta})$ for all valid actions $a \in A$.

## Episodic Backwards Update Algorithm

1. **Initialize** a replay memory $D$ with a capacity of $N$, an on-line action-value function $Q(\cdot;\theta)$ (regular Q-network), and a target action-value function $\hat{Q}(\cdot;\hat{\theta})$ (target Q-network)
2. **For episode in range(M):**
3. $\quad$ **for t=1 to Terminal:**
4. $\quad\quad$ Select action $a_t$: epsilon-greedy, best action according to $Q(\cdot;\theta)$
5. $\quad\quad$ Execute action $a_t$ and observe reward $r_t$ and next state $s_{t+1}$
6. $\quad\quad$ Store transition $(s_t, a_t, r_t, s_{t+1})$ in replay memory $D$
7. $\quad\quad$ Sample a random episode $E = \{S, A, R, S'\}$ from $D$, $T \leftarrow$ length($E$)
8. $\quad\quad$ Generate the temporary Q-table, $\tilde{Q} = \hat{Q}(S', \cdot\ ; \hat{\theta})$ *(creates a temporary Q-table using the target Q-network)*
9. $\quad\quad$ **Initialize** the target vector $y=$ zeros($T$), $y_T \leftarrow r_T$
10. $\quad\quad$ **for $k=T-1$ to 1:**
11. $\quad\quad\quad$ $\tilde{Q}[a_{k+1},k] \leftarrow \beta y_{k+1} + (1-\beta)\tilde{Q}[a_{k+1}, k]$
12. $\quad\quad\quad$ $y_k \leftarrow r_k + \gamma \max_a \tilde{Q}[a, k]$
13. $\quad\quad$ **end**
14. $\quad\quad$ Perform a gradient descent step on $(y-Q(S, A; \theta))^2$ with respect to $\theta$
15. $\quad$ **end**
16. **end**

## Steps of the Recursive Backwards Update

### The Algorithm

![Step 1: Lines 7-9 of the Episodic Backwards Update Algorithm](https://i.imgur.com/RkQWeqm.png)
> Table 1 (top): The table extracted from episode $E$ from replay memory $D$, where $S$ is the state, $A$ is the action, $R$ is the reward, $S'$ is the next state, and $T$ is the terminal index.
> Table 2 (bottom): The temporary Q-table ($\tilde{Q}$) constructed from $\hat{Q}(S', \cdot;\hat{\theta})$, where $S$ is the state, $A$ is the action, $R$ is the reward, and $T$ is the terminal index. $\tilde{Q}$ has $n$ rows, where $n$ is the total number of possible actions. Each column $k$ of $\tilde{Q}$ has an associated target $y_k$, computed from the maximum value of the modified $k$-th column of $\tilde{Q}$.

To begin the recursive backwards update, we first select an episode $E$ from our replay memory $D$. We then extract the MDP data from $E$ into a table similar to the first table. The terminal reward $r_T$ of $E$ is set as $y_T$, since it is the maximum return obtainable at $s_T$.

>![Step 2: Lines 10-12, first iteration](https://i.imgur.com/PAxkPkW.png)
Table 3: The first iteration of the backwards update. In the table, $\beta$ represents the diffusion factor and $\gamma$ represents the discount factor.

During the first iteration of the backwards update ($k = T-1$), one element in the $k$-th column is replaced (the entry of the action actually taken). It is recalculated as $\tilde{Q}[a_T, T-1] \leftarrow \beta y_T + (1-\beta)\tilde{Q}[a_T, T-1]$. The target $y_{T-1}$ is then computed as $y_{T-1} \leftarrow r_{T-1}+\gamma \max_a \tilde{Q}[a, T-1]$.
> In other words, $T-1$ here plays the role of $k$ in the algorithm.

>![Step 3: Lines 10-12, second iteration](https://i.imgur.com/3Sckd3n.png)

Similar to the first iteration, one element of the $k$-th column ($k = T-2$) is replaced (the entry of the action actually taken) and recalculated as $\tilde{Q}[a_{T-1}, T-2] \leftarrow \beta y_{T-1}+(1-\beta)\tilde{Q}[a_{T-1}, T-2]$. Again, the target of this column is recalculated as $y_{T-2} \leftarrow r_{T-2}+\gamma \max_a \tilde{Q}[a, T-2]$.

### The Diffusion Factor ($\beta$)

Since EBU uses a function approximator while updating correlated states in a sequence, **overestimated values can propagate and compound through the recursive max operators** ~~(in plain English: the numbers get too big when we recalculate the $y_k$)~~. By introducing the diffusion factor $\beta$, we take a weighted sum of the newly backpropagated value and the pre-existing value estimate ~~(we get a better number)~~. In essence, $\beta$ is a learning rate for the temporary Q-table, which stabilizes the learning process and decreases the overestimation error.
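For reference, here is a small NumPy sketch of this backward target generation (lines 9–12 of the algorithm above); the function name and signature are my own, and the temporary Q-table from line 8 is assumed to have already been filled from the target network:

```python
import numpy as np

def ebu_targets(rewards, actions, q_tilde, gamma=0.99, beta=0.5):
    """
    Sketch of lines 9-12 of the EBU algorithm above.

    rewards : (T,)    rewards r_1..r_T of one sampled episode
    actions : (T,)    actions a_1..a_T taken in that episode
    q_tilde : (|A|, T) temporary Q-table; column k holds Q_hat(s_{k+1}, a) for every action a
    Returns the target vector y used in the regression loss (y - Q(S, A; theta))^2.
    """
    q_tilde = q_tilde.copy()           # keep the caller's table intact
    T = len(rewards)
    y = np.zeros(T)
    y[-1] = rewards[-1]                # terminal transition: y_T <- r_T
    for k in range(T - 2, -1, -1):     # backward sweep over k = T-1, ..., 1 (0-indexed)
        # Diffuse the already-computed target y_{k+1} into the entry of the
        # action actually taken at step k+1, weighted by the diffusion factor beta.
        a_next = actions[k + 1]
        q_tilde[a_next, k] = beta * y[k + 1] + (1 - beta) * q_tilde[a_next, k]
        # Take the max over the (partially updated) column, as in ordinary Q-learning.
        y[k] = rewards[k] + gamma * q_tilde[:, k].max()
    return y
```

In the deep RL version, `q_tilde` would come from one batched forward pass of the target network on $S'$, and the resulting `y` is then regressed against $Q(S, A; \theta)$.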
## Adaptive Episodic Backward Update for Deep Q-Learning

The diffusion factor $\beta$ is an essential part of how the network is stabilized and trained. However, the best $\beta$ varies with the type of environment and with how far training has progressed. To further improve EBU, we can develop an adaptive tuning scheme for $\beta$ by creating an adaptive, single-actor, multiple-learner version of EBU. We generate $K$ learner networks with different $\beta$s and a single actor to output the policy; the actor selects one of the learner networks in a regular sequence for each episode. Each learner is trained in parallel on the same sampled episodes (from a shared experience replay). Using the same training data, each learner differs in how it interprets the data depending on the backpropagated values. After each episode, the score of each learner is recorded. After a fixed number of steps, all learners are synchronized with the parameters of the learner network with the best score.
> Basically, make $K$ learner networks with different $\beta$s and a single actor. Each learner is trained the same way (experience replay, training data, etc.). After each episode the learners' scores are recorded, and after a fixed number of episodes the learners' parameters are synchronized to the best-performing one.

# Theoretical Convergence

## Deterministic MDPs

*Theorem 1*: Given a finite, deterministic, and tabular MDP defined as $M=(S,A,P,R)$, the EBU algorithm with an adaptive diffusion factor converges to the optimal Q-function with probability 1 as long as:
* The step size is a sequence of positive values (Robbins-Monro condition)
* The sample trajectories are finite in length $l$: $\mathbb{E}[l]< \infty$
* Every (state, action) pair is visited infinitely often.

## Stochastic MDPs

*Theorem 2*: Given a finite, tabular, and stochastic MDP defined as $M=(S, A, P, R)$, we define $R_{max}^{sto}(s,a)$ as the maximal return of a trajectory that starts from state $s \in S$ and action $a \in A$. Similarly, we define $r_{min}^{sto}(s,a)$ and $r_{mean}^{sto}(s,a)$ as the minimum and mean possible reward obtained by selecting action $a$ in state $s$. Furthermore, we define $\mathcal{A}_{sub}(s)=\{a' \in \mathcal{A}\,|\,Q^*(s,a')< \max_{a \in \mathcal{A}} Q^*(s, a)\}$ as the set of suboptimal actions in state $s \in S$, and $\mathcal{A}_{opt}(s) = \mathcal{A} \setminus \mathcal{A}_{sub}(s)$ as the set of optimal actions. Under the conditions stated in Theorem 1, and with
$$
\beta \le \inf_{s \in S} \inf_{a' \in \mathcal{A}_{sub}(s)} \inf_{a \in \mathcal{A}_{opt}(s)} \frac{Q^*(s,a)-Q^*(s,a')}{R_{max}^{sto}(s, a')-Q^*(s, a')}
$$
and
$$
\beta \le \inf_{s \in S} \inf_{a' \in \mathcal{A}_{sub}(s)} \inf_{a \in \mathcal{A}_{opt}(s)} \frac{Q^*(s,a)-Q^*(s,a')}{r_{mean}^{sto}(s, a)-r_{min}^{sto}(s, a)}
$$
the EBU algorithm with an adaptive diffusion factor converges to the optimal Q-function with probability 1.
> The main intuition is that $\beta$ acts as a learning rate for the backward target and therefore mitigates the conflict between the max operator and the stochastic transitions. The extended details and full proof are found on page 17 of the [paper](https://arxiv.org/pdf/1805.12375.pdf).

___

# Experimental Results

## 2D MNIST Maze (Deterministic/Stochastic MDPs)

### Environment Introduction

>![](https://i.imgur.com/OnmD9mm.png)
> MNIST Maze

In this experiment, the agent was placed at (0,0) in a $10 \times 10$ grid with varying wall density (defined as the probability of having a wall at each position).
The agent received the coordinates of its position as two MNIST images, used as the state representation. The reward for reaching the goal at (9,9) was 1000, and the reward for bumping into a wall was -1. For each wall density, 50 random mazes with different wall locations were generated. A total of 50 agents were trained, one for each maze, over 200,000 steps.

### Performance Gauging

>![MNIST Experimental Results](https://i.imgur.com/j8SbhIX.png)

Of the 3 graphs depicted, graphs (a) & (b) depict a deterministic model for all three algorithms with wall densities of $20$% and $50$% respectively. At higher wall densities, EBU significantly outperforms both other algorithms. Graph (c) depicts a stochastic model with a wall density of $50$%.

The performance metric, relative length, was defined as $\ell_{rel} = \frac{\ell_{agent}}{\ell_{oracle}}$, the ratio between the length of the agent's path ($\ell_{agent}$) and the ground-truth shortest path ($\ell_{oracle}$). The EBU algorithm was compared to a uniform random sampling one-step DQN and an *n*-step DQN. Of these three algorithms, the one-step DQN performed the worst in all configurations, implying the inefficiency of uniform sampling updates in environments with sparse and delayed rewards. Increasing the wall density made it more important for the agent to learn the correct decisions at bottleneck positions. In environments with low wall density, the *n*-step DQN showed the best performance; however, as the wall density increased, EBU significantly outperformed the *n*-step DQN.

In addition to the deterministic models, the researchers also ran experiments with stochastic transitions, assigning a 10% probability to each side action for all four valid actions. For example, when an agent took the action 'up', there was a 10% chance of transitioning to the left state and a 10% chance of transitioning to the right state. Even in this stochastic setting, the EBU agent outperformed both the uniform random sampling one-step DQN and the *n*-step DQN.

| Wall Density | EBU ($\beta=1.0$) | One-step DQN | N-step DQN |
| ------------ | ----------------- | ------------ | ---------- |
| $20$% | $[5.44, 2.42]$ | $[14.40, 9.25]$ | $[3.26, 2.24]$ |
| $30$% | $[8.14, 3.03]$ | $[25.63, 21.03]$ | $[8.88, 3.32]$ |
| $40$% | $[8.61, 2.52]$ | $[25.45, 22.71]$ | $[8.96, 3.50]$ |
| $50$% | $[5.51, 2.34]$ | $[22.36, 16.62]$ | $[11.32, 3.12]$ |

> This table shows the relative lengths [mean, median] over the 50 deterministic MNIST mazes. Because all three algorithms achieve $\ell_{rel}=1$ by the end of training (200,000 steps), results at step 100,000 are shown.

## 49 Games of Atari 2600 Environment (Deterministic MDPs)

### Experimental set up

In this experiment, the researchers set up the 49 Atari 2600 games that were evaluated in the [Nature DQN paper](https://www.nature.com/articles/nature14236). Both the constant and the adaptive diffusion factor EBU algorithms were compared to four baseline algorithms: the [Nature DQN](https://www.nature.com/articles/nature14236), [Prioritized Experience Replay](https://arxiv.org/abs/1511.05952) (PER), [Retrace($\lambda$)](https://arxiv.org/abs/1606.02647), and [Optimality Tightening](https://arxiv.org/abs/1611.01606) (OT). The constant diffusion factor EBU and the baseline algorithms were trained for 10M frames (the adaptive EBU was additionally trained for 20M frames) on each of the 49 Atari games, using the same network structure, hyperparameters, and evaluation methods as the Nature DQN.
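Since all agents share the Nature DQN network structure, here is a PyTorch sketch of that standard published architecture for reference (the class name and the `in_channels` default are my own choices; the original stacks 4 recent grayscale frames as input channels):

```python
import torch
import torch.nn as nn

class NatureDQNNet(nn.Module):
    """Sketch of the convolutional Q-network from the Nature DQN paper,
    which all agents in this comparison share."""

    def __init__(self, n_actions, in_channels=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),   # an 84x84 input yields a 7x7x64 feature map
            nn.Linear(512, n_actions),               # one Q-value per action
        )

    def forward(self, x):                            # x: (batch, in_channels, 84, 84), values in [0, 1]
        return self.net(x)
```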
### Experimental Results and Discussion

![Results](https://i.imgur.com/mx5PSfJ.png)
> This figure shows the relative scores (in percent) of the adaptive EBU (4 random seeds) compared to the Nature DQN (8 random seeds), both trained for 10M frames. Such a small number of training steps was chosen to investigate the sample efficiency of each algorithm.

To compare the adaptive EBU to the Nature DQN at 10M frames, the following relative scoring metric was used:
$$
\frac{Score_{Agent}-Score_{Baseline}}{\max(Score_{Human}, Score_{Baseline})-Score_{Random}}
$$
In the above equation, $Score_{Agent}$ denotes the adaptive EBU agent's score, while $Score_{Baseline}$ denotes the score of the agent trained using a baseline algorithm. $Score_{Human}$ denotes the mean human score and $Score_{Random}$ denotes the score of a random agent; reported scores are averaged over 4 random seeds for the adaptive EBU algorithm and 8 random seeds for all other baselines. EBU ($\beta=0.5$) and adaptive EBU outperformed the Nature DQN in 33 and 39 of the 49 Atari games, respectively.

The most widely used metric for an "apples-to-apples" comparison is the **human-normalized score**, defined as
$$
\frac{Score_{Agent}-Score_{Random}}{|Score_{Human}-Score_{Random}|}
$$
EBU outperforms the baselines in both the mean and the median of the human-normalized scores. PER and Retrace($\lambda$) did not show significant improvements for a small number of training steps (10M frames). OT required about 3 times more training time than the Nature DQN, as it needed to calculate the Q-values of neighboring states and compare them to the penalty term. EBU, in contrast, performs iterative episodic updates using the temporary Q-table that is shared by all transitions in the episode. Therefore, EBU has almost the same computational cost as the Nature DQN.

| Algorithm (frames) | Training Time (hours) | Mean (%) | Median (%) |
| ------------------ | --------------------- | -------- | ---------- |
| EBU ($\beta=0.5$) (10M) | 152 | 253.55 | 51.55 |
| EBU (adaptive $\beta$) (10M) | 203 | 275.78 | 63.80 |
| Nature DQN (10M) | 138 | 133.95 | 40.42 |
| PER (10M) | 146 | 156.57 | 40.86 |
| Retrace($\lambda$) (10M) | 154 | 93.77 | 41.99 |
| OT (10M) | 407 | 162.66 | 49.42 |
| EBU (adaptive $\beta$) (20M) | 405 | 347.99 | 92.50 |
| Nature DQN (200M) | --- | 241.06 | 93.52 |

> A summary of training time and human-normalized performance. Training time was measured in hours and refers to the total time required to train 49 games for 10M frames using a single NVIDIA TITAN Xp on a single random seed.

The most significant experimental result is that EBU ($\beta=0.5$) achieved a higher mean human-normalized score in just 10M frames than the Nature DQN achieved with 200M frames. Although 10M frames were not enough to match the median score, the adaptive EBU reached a comparable median normalized score after 20M frames of training.

## Comparing static and adaptive diffusion factors ($\beta$)

### Environment State Representation Impact on Diffusion Factor Selection

![EBU 0.5, EBU 1, Nature DQN comparisons](https://i.imgur.com/awdCZWi.png)

In environments such as the MNIST Maze, EBU ($\beta=1$) performs exceptionally well. **This is because the MNIST images used for state representation make consecutive states exhibit little correlation. In the Atari domain, however, consecutive states often differ by only a few pixels.** Therefore, EBU ($\beta=1.0$) underperforms compared to EBU ($\beta=0.5$) in most Atari games.
This phenomenon may be further investigated by evaluating the Q-values learned at the end of each training epoch. In the Atari environments Gopher and Breakout, EBU ($\beta=1$) outputs considerably overestimated Q-values compared to EBU ($\beta=0.5$) and the Nature DQN. **Since EBU performs recursive max operations, it outputs higher Q-values than the Nature DQN. Sequentially updating correlated states with overestimated values may destabilize the learning process.** The results imply, however, that EBU ($\beta=0.5$) is relatively free from the overestimation problem.

### Efficacy of Using an Adaptive Diffusion Factor

![Adaptive Beta on Breakout](https://i.imgur.com/o0S6iaQ.png)

In this experiment, an adaptive EBU agent was trained on the Atari game "Breakout". In the early stages, EBU adopted a high $\beta$ close to 1, allowing values to be directly propagated from the reward state to the state where the agent has to bounce the ball up. As training proceeds, the agent encounters more rewards and more varied trajectories that may cause overestimation. As a result, the agent lowers the diffusion factor to a value of less than 0.5.

>![](https://i.imgur.com/PXHTf2S.gif =200x)
> Breakout: The goal of the game is to shift the bottom platform such that the ball will clear every block on top.

___

# Specifics of the Atari Breakout Environment:

![](https://i.imgur.com/FRoL1Yj.png)

The Atari Breakout environment is emulated using the Stella emulator. The goal of the game is to break all the blocks at the top of the screen by bouncing the ball off a moving platform controlled by the agent. Hitting higher blocks (worth more points) increases the ball's velocity, making the ball harder to control.

## State:

### Atari Space:

The environment from the Arcade Learning Environment (ALE) is a $210 \times 160 \times 3$ array, where $210$ corresponds to the height of the image and $160$ to the width. The last dimension encodes the RGB color of each pixel.

![Observed Atari space](https://i.imgur.com/Wa6ym5K.png)

### Input Space:

![Reshaped input space](https://i.imgur.com/wfjgbnX.png)

The input space for the neural networks is a reshaped image of the Atari state. This image has dimensions $84 \times 84 \times 1$, where $84$ corresponds to the height and width and the last dimension represents the grayscale value of each pixel (a number in the range [0, 1]). The input space image is the state representation used most widely throughout the code, appearing in the replay buffer as well as in the input for training.

## Actions:

![Example of state-action sequence](https://i.imgur.com/NxMKEyE.gif)

The action space in the Breakout environment involves 3 main actions. First, we uniformly sample $k$ from the set $\{2, 3, 4\}$. Then we perform one of the 3 actions:
1. **NOOP (No operation)**: The agent does not input an action, so the platform does not move.
2. **Player Move Right (Move Right)**: This action moves the platform to the right by $k$ pixels, unless doing so would move the platform off the screen, in which case a NOOP is performed.
3. **Player Move Left (Move Left)**: This action moves the platform to the left by $k$ pixels, unless doing so would move the platform off the screen, in which case a NOOP is performed.

## Reward:

The maximum score for each game is 432 points, achieved by hitting all of the blocks with the ball. There are a total of 18 blocks per layer.
Starting from the bottom, blocks in the blue and green layers are each worth 1 point. Above them, the yellow and dark yellow blocks are each worth 4 points. Above those, the orange and red blocks are each worth 7 points. With 18 blocks per layer, this gives the maximum score of $18 \times (1+1+4+4+7+7) = 432$ points. The reward for each step is the difference in score caused by the action taken.

## Transition Model:

The transition model is given by the Atari environment. Each time an action is taken, the environment executes the action accordingly. After the action is executed, the environment returns a new state, which in turn is reshaped into an $84 \times 84 \times 1$ matrix for processing.

## Replay Buffer:

Mentioned in more detail in the [Episodic Backwards Update (EBU) Procedure](https://hackmd.io/@lGJZLxPPTiG78bbvmGUmUw/BJw3p_Vcu), the replay buffer used consists of each state (an $84 \times 84 \times 1$ matrix), action (`int`), reward (`int`), and terminal flag (`bool`) in an episode.

## Evaluation Functions:

We define the loss function as $L_i(\theta_i)$. We then define the output Q-value of a Q-network parameterized by $\theta_i$ for a given state-action pair ($s, a$) as $Q(s, a; \theta_i)$. Similarly, we define the output Q-value of the target Q-network parameterized by $\theta^-_i$ for the next state-action pair ($s', a'$) as $Q(s',a';\theta^-_i)$. Further, we define $r$ as the reward returned as a result of action $a$ and $\gamma$ as the discount factor (a hyperparameter). The loss function used for the Q-network is then
$$
L_i(\theta_i)=\mathbb{E}_{(s, a, r, s')\sim U(D)}\left[\left(r+\gamma \max_{a'} Q(s',a';\theta^-_i)-Q(s,a;\theta_i)\right)^2\right]
$$
The final network configuration is shown below:

![Overall Network Figure](https://i.imgur.com/4g2UdUs.png)

___

# Extra Material

## Adaptive Episodic Backward Update Algorithm

### The algorithm

1. **Initialize** a replay memory $D$ with a capacity of $N$; $K$ on-line action-value functions $Q_1(\cdot;\theta_1)$, ..., $Q_K(\cdot;\theta_K)$ (regular Q-networks); $K$ target action-value functions $\hat{Q}_1(\cdot;\hat{\theta}_1)$, ..., $\hat{Q}_K(\cdot;\hat{\theta}_K)$ (target Q-networks); a training score recorder $TS=$ zeros($K$); and diffusion factors $\beta_1$, ..., $\beta_K$, one for each learner network
2. **For episode in range(M):**
3. $\quad$ $i = (episode-1)\%K +1$
4. $\quad$ Select $Q_{actor} = Q_i$ as the actor network for the current episode
5. $\quad$ **for t=1 to Terminal:**
6. $\quad\quad$ Select action $a_t$: epsilon-greedy, best action according to $Q_{actor}(s_t, a)$
7. $\quad\quad$ Execute action $a_t$ and observe reward $r_t$ and next state $s_{t+1}$
8. $\quad\quad$ Store transition $(s_t, a_t, r_t, s_{t+1})$ in replay memory $D$
9. $\quad\quad$ Add the training score for the current learner: $TS[i]+=r_t$
10. $\quad\quad$ Sample a random episode $E = \{S, A, R, S'\}$ from $D$, $T \leftarrow$ length($E$)
11. $\quad\quad$ **for j=1 to K:** (this loop is processed in parallel)
12. $\quad\quad\quad$ **Generate the temporary target Q-table**, $\tilde{Q}_j = \hat{Q}_j(S', \cdot; \hat{\theta}_j)$
13. $\quad\quad\quad$ **Initialize** the target vector $y=$ zeros($T$), $y_T \leftarrow r_T$
14. $\quad\quad\quad$ **for $k=T-1$ to 1:**
15. $\quad\quad\quad\quad$ $\tilde{Q}_j[a_{k+1},k] \leftarrow \beta_j y_{k+1} + (1-\beta_j)\tilde{Q}_j[a_{k+1}, k]$
16. $\quad\quad\quad\quad$ $y_k \leftarrow r_k + \gamma \max_a \tilde{Q}_j[a, k]$
17. $\quad\quad\quad$ **end**
18. $\quad\quad\quad$ Perform a gradient descent step on $(y-Q_j(S, A; \theta_j))^2$ with respect to $\theta_j$
19. $\quad\quad$ **end**
20. $\quad\quad$ Every $C$ steps, reset $\hat{Q}_1=Q_1, ..., \hat{Q}_K=Q_K$
21. $\quad$ **end**
22. $\quad$ Every $B$ steps, synchronize the learners with the best-scoring learner, $b=\arg\max_k TS[k]$:
23. $\quad\quad$ $Q_1(\cdot;\theta_1)=Q_b(\cdot;\theta_b), ..., Q_K(\cdot;\theta_K)=Q_b(\cdot;\theta_b)$ and $\hat{Q}_1(\cdot;\hat{\theta}_1)=\hat{Q}_b(\cdot;\hat{\theta}_b), ..., \hat{Q}_K(\cdot; \hat{\theta}_K)=\hat{Q}_b(\cdot;\hat{\theta}_b)$
24. $\quad$ Reset the training score recorder $TS=$ zeros($K$)
25. **end**

### Key differences

The adaptive EBU algorithm is largely the same as the static diffusion factor EBU, except for the addition of $K$ on-line and target action-value functions as well as a training score recorder.

**For each episode,** the actor $Q_{actor}$ changes according to the index $i$, which labels which network to use for the current episode. **For each step,** a training score is accumulated as
$$
TS_i=\sum_{t=1}^{T} r_t
$$
where $t$ is a time step, $r_t$ is the reward at each time step, and $T$ is the terminal step. Using parallel processing, each of the $K$ networks generates a temporary Q-table and performs EBU. Every $C$ steps, each target network is updated with its on-line network. **Every $B$ steps, both networks of each of the $K$ learners are updated to match the networks of the learner with the best score.** After this, $TS$ is reset.
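To summarize the bookkeeping, here is a small runnable sketch of the actor selection and synchronization logic described above (the `Learner` class, the placeholder parameters and scores, and measuring the sync period in episodes rather than steps are all simplifications of my own, not the paper's implementation):

```python
import numpy as np

# "Learner" here is a stand-in for a full on-line/target Q-network pair.
class Learner:
    def __init__(self, beta, n_params=8):
        self.beta = beta                          # each learner gets its own diffusion factor
        self.params = np.random.randn(n_params)   # placeholder for theta (and theta_hat)

K = 3
learners = [Learner(beta) for beta in (0.25, 0.5, 1.0)]
scores = np.zeros(K)                              # the training score recorder TS
B = 6                                             # synchronization period (in episodes here, for brevity)

for episode in range(12):
    i = episode % K                               # pick the actor in a regular sequence
    actor = learners[i]
    # ... run one episode with epsilon-greedy actions from `actor`, store its
    # transitions, and train every learner j on the same sampled episode with
    # its own beta_j (the EBU inner loop sketched earlier) ...
    scores[i] += np.random.rand()                 # placeholder for the episode return r_1 + ... + r_T

    if (episode + 1) % B == 0:
        best = int(np.argmax(scores))             # b = argmax_k TS[k]
        for learner in learners:                  # copy the best learner's parameters into all learners
            learner.params = learners[best].params.copy()
        scores[:] = 0                             # reset the training score recorder
```

In the full algorithm, the synchronization copies both the on-line and target network parameters of the best learner, and the period $B$ is counted in environment steps.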