# Episodic Backwards Update (EBU) Procedure

###### tags: `Code walkthroughs`, `Doctor Bajaj`

# Beginning to Run...

## run_EBU.py (beginning module):

These are the default hyperparameters for the experimental procedure. The game we are running is Atari Alien, accessed through the Arcade Learning Environment (ALE) library. These defaults are set in `run_EBU.py`, which then calls `launcher.py`.

```python
STEPS_PER_TEST = 50000
STEPS_PER_EPOCH = 62500
EPOCHS = 40
UPDATE_RULE = 'deepmind_rmsprop'
BATCH_ACCUMULATOR = 'sum'
LEARNING_RATE = .00025
DISCOUNT = .99
RMS_DECAY = .95  # (Rho)
RMS_EPSILON = .01
MOMENTUM = 0
CLIP_DELTA = 1.0
EPSILON_START = 1.0
EPSILON_MIN = .1
EPSILON_DECAY = 1000000
PHI_LENGTH = 4
UPDATE_FREQUENCY = 1
REPLAY_MEMORY_SIZE = 1000000
BATCH_SIZE = 32
NETWORK_TYPE = "nature_dnn"
FREEZE_INTERVAL = 10000
REPLAY_START_SIZE = 50000
RESIZE_METHOD = 'scale'
RESIZED_WIDTH = 84
RESIZED_HEIGHT = 84
DEATH_ENDS_EPISODE = 'true'
MAX_START_NULLOPS = 30
DETERMINISTIC = True
CUDNN_DETERMINISTIC = True
FLICKERING_BUFFER_SIZE = 2
METHOD = 'ot'
```
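As a side note on the `EPSILON_*` values above: they define the agent's exploration schedule, used by the epsilon-greedy action selection described later under "Taking a step". The sketch below is illustrative only; it assumes the conventional linear annealing used by DQN-style agents (epsilon decays from `EPSILON_START` to `EPSILON_MIN` over the first `EPSILON_DECAY` steps), and the helper names are hypothetical rather than taken from `ale_agents.py`.

```python
import numpy as np

EPSILON_START = 1.0
EPSILON_MIN = 0.1
EPSILON_DECAY = 1000000   # number of steps over which epsilon is annealed
rng = np.random.RandomState(0)

def epsilon_at(step):
    """Linearly anneal epsilon from EPSILON_START down to EPSILON_MIN (illustrative)."""
    rate = (EPSILON_START - EPSILON_MIN) / EPSILON_DECAY
    return max(EPSILON_MIN, EPSILON_START - rate * step)

def epsilon_greedy(q_values, epsilon):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if rng.rand() < epsilon:
        return rng.randint(len(q_values))
    return int(np.argmax(q_values))

# epsilon_at(0) = 1.0, epsilon_at(500000) ≈ 0.55, epsilon_at(2000000) = 0.1
```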
## Using launcher.py:

The file `launcher.py` has two primary functions:

1. Parse the arguments defined in `run_EBU.py` and assign them to the network, experiment, and agent.
2. Create the Q-network, agent, and experiment.
    * The Q-network is created from `q_network.py` using the `DeepQLearner` class. It takes the arguments `RESIZED_WIDTH`, `RESIZED_HEIGHT`, `num_actions`, `phi_length`, `discount`, `learning rate`, `rms_decay`, `rms_epsilon`, `momentum`, `clip_delta`, `freeze_interval`, `batch_size`, `network_type`, `update_rule`, `batch_accumulator`, `rng`, `double`, and `transition length`.

Running `launcher.py` will call `ale_experiment.run()` and begin the experiment.

___

# Running the Experiment

In `ale_experiment.py`, under the `ALEExperiment` class, there is a function called `run` which initializes a single run of the experiment. Each run lasts for `num_epochs (int)` epochs, which is a hyperparameter. In each epoch, the following functions run:

## Using `run_epoch()`:

The `run_epoch` function takes two parameters, `epoch` and `num_steps (int)`; however, only `num_steps` is used. The parameter `num_steps` indicates the number of steps per epoch. `run_epoch()` can be split into two phases. The first is the training phase, in which the game runs for `num_steps`. To run the game, the function `run_episode` is called with the parameters `max_steps (int)` and `testing (bool)`. The `max_steps` parameter is an integer indicating the maximum number of steps in the epoch, while `testing` is a boolean indicating whether the agent is testing or training.

## Using `run_episode()`:

The `run_episode` function runs a single episode and takes the parameters `max_steps (int)` and `testing (bool)`. By default, `testing` is set to `False`, which results in the network being trained before being tested. The `max_steps` parameter is an integer indicating the maximum number of steps the agent may take before the game is terminated.

At the start of the function, the agent initializes an ALE environment. If `testing` is set to `True`, a fresh environment is prepared by taking a random number of null actions. After doing so, the screen buffer is filled by calling the `_act()` function twice with null actions.

> The `_act(self, action)` function performs the given action by calling `self.ale.act(action)`, where `action` is an `int` passed as a parameter of `_act`. This function calls `self.ale.getScreenGrayscale()`, which **places the resulting grayscale image into a given empty array (`screen_buffer`)**. After doing so, the function returns the reward.
> * The reward is calculated as follows: `reward_t = b1 + b2 * 10 + b3 * 100 + b4 * 1000 + b5 * 10000`, where `b1 = (system, 0x8B)`, `b2 = (system, 0x89)`, `b3 = (system, 0x87)`, `b4 = (system, 0x85)`, and `b5 = (system, 0x83)`.

After initializing the environment, the agent calls the `start_episode` function, which takes the parameter `observation (height x width numpy array)`. The `observation` parameter is taken from the first screen buffer, which was updated as part of the `_act()` function. In `start_episode()`, the agent initializes a few variables: `loss_averages ([])`, as well as `step_counter`, `batch_counter`, and `episode_reward`, which are all initialized to 0. The function returns a random `action (int)` and stores the `last_action` and `last_image` in the first slot of the replay buffer.

To begin training the agent, we initialize a `while` loop which runs as long as the `terminal` condition is not met. If the agent is training, the `terminal` condition is met if either of the following is true:

1. The agent runs out of lives
2. The agent dies

In both the training and testing phases, the `terminal` condition also returns `True` if the ALE environment reports `game_over()` as `True` **OR** if the number of steps exceeds `max_steps`. **To read more about `terminal` conditions, see the next section.**

Each step of training begins with getting the reward of an action. This is achieved by calling the `_step()` function. After doing so, the agent checks for the `terminal` condition. If `terminal` is not met, the agent adds one to the current step count and chooses the next action by calling the `step()` function, which takes the arguments `reward` and `observation`. In this case, `reward` is the returned reward mentioned above and `observation` is the observed state from the screen buffer.

> The `_step()` function calls the `_act()` function `frame_skip (int, hyperparameter)` times and returns the summed reward for the `frame_skip`ped actions.

> The `step()` function (not to be confused with the `_step()` function 😒) performs a single agent step. **See the next section for details.**

Once **any** of the `terminal` conditions is met, the `run_episode()` function returns `terminal (bool)` and `num_steps (int)`. If `terminal` is returned as `True`, the agent died or the game ended. However, if `terminal` is returned as `False`, the maximum step count was reached.

___

## Taking a step:

During the `run_episode()` function, each step is controlled by the `step()` function, which is found in `ale_agents.py`. If the agent is being trained, it chooses an action using the epsilon-greedy algorithm by calling `_choose_action()`. If the current step is a training/updating step, the network calls the `_do_training()` function, which initiates the EBU algorithm.

> The `_choose_action()` function takes the arguments `data_set (also known as replay_memory)`, `epsilon (float)`, `cur_image (grayscaled numpy array of image/state data)`, and `reward (int)`. The function adds the current data to the `data_set` (also known as the `replay_memory`). The items added are: `last_image (grayscaled numpy array of image/state data)`, `last_action (int)`, `reward (int)`, `False`, and `start_index=self.start_index (int, parameter)`. The function then selects an action using the epsilon-greedy algorithm.

> The `_do_training()` function is the primary method that runs the EBU algorithm. First, the agent creates a temporary Q-table, `Q_tilde`, which is initialized as an empty numpy array. The agent then samples a random episode from the `data_set`, which yields the variables `epi_state (state)`, `epi_actions (action)`, `epi_rewards (reward)`, `batchnum (index of each episode)`, and `epi_terminals (bool, if episode ends)`.

After EBU is done, the network calls the `train()` function to train the Q-network.
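The walkthrough above names the ingredients of `_do_training()` (`Q_tilde`, `epi_actions`, `epi_rewards`, `epi_terminals`) but not the backward target computation itself. Below is a minimal NumPy sketch of that backward pass as described in the EBU paper, not the repository's exact code: it assumes `q_tilde` already holds the target network's Q-values for each state of the sampled episode (shape `num_actions x T`), `gamma` corresponds to `DISCOUNT`, and `beta` is EBU's diffusion coefficient, which is not part of the hyperparameter list above.

```python
import numpy as np

def ebu_targets(q_tilde, epi_actions, epi_rewards, gamma=0.99, beta=0.5):
    """Backward target generation for one sampled episode (illustrative sketch).

    q_tilde     : (num_actions, T) array of target-network Q-values; column k
                  holds the Q-values of the state where epi_actions[k] was taken.
                  It is treated as a temporary table and mutated in place.
    epi_actions : (T,) int array of the actions taken in the episode.
    epi_rewards : (T,) float array of the rewards received.
    Returns a (T,) array of regression targets for Q(epi_state[k], epi_actions[k]).
    """
    T = len(epi_rewards)
    y = np.zeros(T)
    y[-1] = epi_rewards[-1]  # terminal transition: no bootstrapping
    # Sweep backwards, diffusing each freshly computed target into the temporary
    # Q-table so it propagates to earlier states within the same episode.
    for k in range(T - 2, -1, -1):
        a_next = epi_actions[k + 1]
        q_tilde[a_next, k + 1] = beta * y[k + 1] + (1 - beta) * q_tilde[a_next, k + 1]
        y[k] = epi_rewards[k] + gamma * q_tilde[:, k + 1].max()
    return y
```

The resulting `y[k]` values are what `train()` would regress the online network's `Q(epi_state[k], epi_actions[k])` toward; how `Q_tilde` is actually filled and how `epi_terminals` is handled within an episode are details of the repository code.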
___

## Terminal Conditions

During the `run_episode()` function, there are a variety of `terminal` conditions. If any of these conditions are met, the agent calls the function `end_episode()`, which takes the arguments `reward (int)` and `terminal (bool)`.

### Understanding `end_episode()`

The function `end_episode()` takes the arguments `reward (int)` and `terminal (bool, defaulted to True)`. If the agent was in its training phase, the sample is added to the replay buffer once more. Since `terminal` defaults to `True`, the value of `q_return` starts at 0. Then, for each step in the episode, the agent accumulates the discounted return (the step's reward plus `discount` times the return accumulated so far) and stores that return value in the `data_set` (replay buffer); a sketch of this backup is given at the end of this note.

## Testing

The testing process is largely the same. However, the agent does not add any samples to the replay buffer and does not update the Q-network.

![](https://i.imgur.com/msJasKK.png)
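Finally, here is the promised sketch of the backward discounted-return computation performed by `end_episode()`. It is illustrative only: the helper name `episode_returns` is hypothetical, the real code writes these values directly into the replay buffer (`data_set`) rather than returning a list, and `discount` corresponds to the `DISCOUNT` hyperparameter.

```python
def episode_returns(episode_rewards, discount=0.99):
    """Compute the discounted return for every step of a finished episode,
    sweeping backwards from the terminal step (where q_return starts at 0)."""
    q_return = 0.0
    returns = []
    for reward in reversed(episode_rewards):
        q_return = reward + discount * q_return  # R_t + discount * (future return)
        returns.append(q_return)
    returns.reverse()  # chronological order: returns[t] pairs with step t
    return returns

# Example: rewards [1, 0, 10] with DISCOUNT = 0.99 give returns
# [1 + 0.99 * 9.9, 0 + 0.99 * 10, 10] = [10.801, 9.9, 10.0].
```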