# PTD video script

Hello there! Did you know that TD learning can be viewed as a way to approximate dynamic programming algorithms in Markovian environments? Have you ever wondered what happens when the Markovian assumption is violated, which is the case when a function approximator is used? It turns out that TD learning can run into problems when this assumption is not respected. Let me briefly introduce the background before we dive into the problem.

Consider the policy evaluation setting in reinforcement learning. Given an MDP M (a tuple of five elements), a policy, and features, the goal is to compute the value function. Generally, a function approximator is used to estimate this quantity, and we consider linear approximations in this work. Temporal difference learning is typically used to solve this problem. TD learning updates the value of a state whenever it is visited, in proportion to the temporal-difference error. This can lead to some issues.

To see this, consider the task shown. Here an agent starts in a fully observable state and chooses one of two available actions. Each action leads to a different long-term outcome, but the agent navigates through states that look similar before observing the outcome. This setting poses two challenges:

1. The starting state and the outcome state are temporally distant. Therefore, an efficient mechanism is required to propagate credit between them.
2. With function approximation, updating one state affects the value prediction at other states. If the generalization is poor, TD updates at partially observable states can introduce errors, which propagate to the estimates at fully observable states.

It turns out that most TD methods, including TD(lambda), fail to estimate the correct values in this task: they use heuristics for credit assignment, and they modify the target used in the update rather than the extent to which a state is updated.

To address these issues, we introduce Preferential Temporal Difference (PTD) learning. The main idea in PTD learning is to bootstrap and update according to preferences, which leads to a slightly different update rule. Here, $\beta$ is the preference function that assigns an importance to each state; we assume these preferences are given. $G_t^{\beta}$ is a multi-step return, similar to the lambda-return, which bootstraps according to the preference of the target states. We call this update rule the forward view, as the agent needs to wait until the end of the episode to compute the target. It turns out we can write this update rule in an online fashion using eligibility traces. In fact, the update rule only needs a minor modification to be compatible with a non-linear function approximator. I'll sketch these equations in a moment.

One way to understand these equations is by setting the preference function to its extreme values. If a state's preference is 1, then that state receives a complete update whenever it is visited. Moreover, other states bootstrap from its value completely, so their updates do not depend on information further into the future. When a state's preference is 0, that state is not updated even if it is visited, and other states don't bootstrap from it. A similar effect can be achieved with the Emphatic TD algorithm; however, it uses two parameters instead and constructs a trajectory-dependent quantity to reweight updates.

At this point, you might be asking whether this method converges. The short answer is yes. I will present the essence of the proof, but I encourage you to dive into the paper for the details. We start by writing the update rule in a form that is easier to analyze.
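Before that, here is a hedged sketch of the forward view and a trace-based online form for the linear case, using features $x(s)$, weights $\theta$, and step size $\alpha$ as illustrative notation; the paper gives the exact equations.

$$
G_t^{\beta} = R_{t+1} + \gamma \Big[ \beta(S_{t+1})\, \theta^{\top} x(S_{t+1}) + \big(1 - \beta(S_{t+1})\big)\, G_{t+1}^{\beta} \Big],
\qquad
\theta \leftarrow \theta + \alpha\, \beta(S_t) \big( G_t^{\beta} - \theta^{\top} x(S_t) \big)\, x(S_t).
$$

A trace-based form that matches these updates when summed over an episode (again a sketch, not necessarily the paper's exact formulation) uses an eligibility trace $e_t$ and the usual TD error $\delta_t = R_{t+1} + \gamma\, \theta_t^{\top} x(S_{t+1}) - \theta_t^{\top} x(S_t)$:

$$
e_t = \gamma \big(1 - \beta(S_t)\big)\, e_{t-1} + \beta(S_t)\, x(S_t),
\qquad
\theta_{t+1} = \theta_t + \alpha\, \delta_t\, e_t.
$$

The intuition matches the extremes discussed above: a state's own preference scales how strongly it is updated, and a successor's preference controls how much the return bootstraps on that successor instead of looking further ahead.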
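With linear features, these per-step updates can be collected into the standard stochastic-approximation form, which is the form the analysis works with (a sketch; the precise definitions of the terms are in the paper):

$$
\theta_{t+1} = \theta_t + \alpha_t \big( b_t - A_t \theta_t \big).
$$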
Here, $b$ and $A$ are used to write the update equation compactly, and they are commonly called key matrices. We then make some assumptions: we constrain the features to be linearly independent of one another, we consider rapidly mixing Markov chains, and we assume that the step-size sequence obeys the Robbins-Monro conditions. These assumptions are fairly standard in the literature. Under these assumptions, we show that the expected update rule is well defined, the expected key matrix is positive definite, and the sampling noise is controlled. These lemmas satisfy the required conditions for convergence; therefore, PTD converges.

We ran a bunch of experiments to test PTD empirically. First, we tested PTD against TD(lambda) and ETD on the 19-state random walk task in the tabular setting. Testing several fixed hyperparameter values, we observed that PTD had a favourable bias-variance trade-off compared to the other methods. PTD also behaved well for a wide range of learning rates.

We then tested PTD on two versions of chain tasks against TD(lambda) and two versions of ETD in the linear setting. The two versions of ETD differ in the way the interest function is set. We observed that the adaptive ETD version and PTD estimate the value function better than the other algorithms for various chain lengths. We attribute this success to their ability to ignore noisy states while bootstrapping and updating.

Finally, we considered a grid task, a slightly more complex environment than the chain tasks, with a neural network as the function approximator. Unsurprisingly, our observations remained the same: PTD and the adaptive ETD variant continued to thrive when the grid size was increased and when the capacity of the function approximator was decreased. If these findings interest you, do check out our paper, Preferential Temporal Difference Learning.