# [Off-Policy Actor-Critic](https://arxiv.org/abs/1205.4839) notes

## Introduction

1. The aim of off-policy methods is to learn the value of the optimal policy independently of the agent's actions.
2. The well-known off-policy algorithm, Q-learning, has convergence guarantees in the tabular setting but is sensitive to the choice of function approximation.
3. Action-value methods have their own limitations: (a) they use deterministic policies, (b) they scale poorly to large action spaces, and (c) a small change in the action-value function can cause a large change in the policy, resulting in instability.
4. Actor-critic methods, which represent the policy explicitly and independently of the value function, address the above limitations but have been restricted to the on-policy setting.
5. This paper introduces an off-policy version of the actor-critic, called Off-PAC, along with convergence proofs.

## Setting

Off-PAC has two components:

1. **Actor:** A policy function/network which outputs the action given the state.
2. **Critic:** A value function/network which learns an off-policy value estimate of the current (target) policy, which is different from the behaviour policy. This estimate is used to update the actor.

### Definitions

1. **Value function:** The value function of a state $s$ is the expected return when starting from state $s$ and thereafter acting according to the current policy $\pi$: $$V^{\pi}(s) = \mathbb{E}[r_{t + 1} + r_{t + 2} + \dots + r_{t + T} \mid s_t = s]$$ where $T$ is the horizon length.
2. **Action-value function:** The action value of a state-action pair is the expected return when starting from state $s$, executing action $a$, and thereafter acting according to the current policy $\pi$: $$Q^{\pi}(s, a) = \mathbb{E}[r_{t + 1} + r_{t + 2} + \dots + r_{t + T} \mid s_t = s, a_t = a]$$
3. **Objective function:** Let $\pi_u$ be the policy parameterized by the weights $u$. The objective is to find the $u$ that maximizes the expected return, averaged over the states visited by the behaviour policy: $$J(u) = \sum_{s \in S}d^{b}(s)V^{\pi_u}(s)$$ where $d^{b}(s) = \lim_{t \to \infty} P(s_t = s \mid s_0, b)$ is the limiting (stationary) distribution over states under the behaviour policy $b$. A small tabular sketch of these quantities follows below.
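To make the definitions concrete, here is a minimal tabular sketch. All transition probabilities, rewards, and distributions are made up for illustration, and a discounted return is used for simplicity (consistent with the $\gamma$ that appears later in these notes): for a fixed policy $\pi$ with transition matrix $P_\pi$ and expected one-step rewards $r_\pi$, the value function solves $V^{\pi} = (I - \gamma P_\pi)^{-1} r_\pi$, and the objective simply weights it by $d^{b}$.

```python
import numpy as np

gamma = 0.9

# Made-up 3-state MDP under a fixed target policy pi:
P_pi = np.array([[0.8, 0.2, 0.0],   # P_pi[s, s'] = P(s_{t+1} = s' | s_t = s, pi)
                 [0.1, 0.6, 0.3],
                 [0.0, 0.2, 0.8]])
r_pi = np.array([0.0, 1.0, 2.0])    # expected one-step reward in each state under pi

# V^pi solves the Bellman equation V = r_pi + gamma * P_pi @ V
V_pi = np.linalg.solve(np.eye(3) - gamma * P_pi, r_pi)

# Made-up limiting state distribution d^b of the behaviour policy b
d_b = np.array([0.5, 0.3, 0.2])

# Objective J(u) = sum_s d^b(s) * V^{pi_u}(s)
J = d_b @ V_pi
print("V^pi =", V_pi, " J(u) =", J)
```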
## Off-PAC Algorithm

1. **The critic: policy evaluation.** In order to evaluate the policy, we need to learn its value function $V^{\pi}(s)$. Instead of learning $V^{\pi}(s)$ directly, we learn a linear approximation $\hat{V}(s) = v^{T}x_s$, where $x_s$ is the feature vector of state $s$ and $v$ is the weight vector. In deep reinforcement learning, this can be approximated by a neural network. The value function is optimized using gradient-TD methods, which minimize the mean squared projected Bellman error $$\mathrm{MSPBE}(v) = \|\hat{V} - \Pi T_{\pi}^{\lambda, \gamma}\hat{V}\|^{2}_D$$ where $\lambda$ is the decay rate of the eligibility trace, $D$ is a diagonal matrix with the limiting state distribution $d^{b}(s)$ on its diagonal, and $\Pi$ is the projection operator.
2. **Off-policy policy-gradient theorem.** Our aim is to choose the parameters $u$ so as to maximize the objective function $J(u)$. Typically this is done with gradient-based optimization (though other optimization techniques exist), in which the parameters are moved in the direction of the gradient of $J(u)$ (or the opposite direction if we want to minimize the objective): $$u_{t+1} = u_{t} + \alpha \nabla_{u}J(u_t)$$ To take this step we need an estimate of $\nabla_{u}J(u)$, which can be written as $$\nabla_{u}J(u) = \nabla_{u}\left[\sum_{s \in S}d^{b}(s)\sum_{a \in A}\pi(a|s)\,Q^{\pi}(s,a)\right] = \sum_{s \in S}d^{b}(s)\sum_{a \in A}\left[\nabla_{u}\pi(a|s)\,Q^{\pi}(s,a) + \pi(a|s)\,\nabla_{u}Q^{\pi}(s,a)\right]$$ The second term in the summation is difficult to estimate and is therefore omitted. The resulting approximate gradient is denoted $g(u)$. The two theorems in the paper justify this approximation.
3. **Theorem 1:** If we update the policy parameters using the above gradient approximation, then in the tabular setting there is an improvement in the objective function $J(u')$, and hence in the value function. ![](https://i.imgur.com/LVRkwYy.png)
4. **Theorem 2:** The set of local maxima reached using the approximate gradient is a subset of the true local optima of the exact gradient; in the tabular setting the two sets are equal. ![](https://i.imgur.com/CukAEeV.png) In non-tabular settings, the claim is likely to hold when the return is myopic, because changes in the policy then affect the action-value function only locally. For $\gamma$ close to one, however, the approximation might not always hold; this can be mitigated by standard optimization techniques that ensure convergence to a local maximum.
5. **Algorithm:** We can rewrite the approximate gradient as $$g(u) = \mathbb{E}\left[\sum_{a \in \mathcal{A}}\nabla_u \pi(a|s)\,Q^{\pi}(s, a) \,\Bigg|\, s \sim d^{b} \right] \\ = \mathbb{E} \left[\sum_{a \in \mathcal{A}} b(a|s)\,\frac{\pi(a|s)}{b(a|s)}\, \frac{\nabla_{u}\pi(a|s)}{\pi(a|s)}\,Q^{\pi}(s, a)\,\Bigg|\, s \sim d^{b}\right] \\ = \mathbb{E}\left[\rho(s, a)\, \psi(s, a)\,Q^{\pi}(s, a) \,\Bigg|\, s \sim d^{b},\, a \sim b(\cdot | s)\right]$$ The expectation involves the term $\rho(s, a) = \frac{\pi(a|s)}{b(a|s)}$, which is the importance-sampling correction needed because states are drawn from $d^{b}$ and actions from the behaviour policy $b$, and $\psi(s, a) = \frac{\nabla_{u}\pi(a|s)}{\pi(a|s)} = \nabla_u \log \pi(a|s)$. In actor-critic methods, we use the critic to approximate the state-value function $\hat{V}$, and the action-value function is replaced by the off-policy $\lambda$-return $R^{\lambda}_t$ (which introduces further approximation). The final gradient estimate can therefore be written as $$\hat{g}(u) = \mathbb{E}\left[ \rho(s, a)\,\psi(s, a)\,\bigl(R^{\lambda}_t - \hat{V}(s)\bigr)\right]$$ The complete algorithm is described below; a simplified sketch of the per-step updates follows the image. ![](https://i.imgur.com/FmrEObh.png)
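Below is a minimal sketch of Off-PAC-style per-step updates for a discrete action space with a linear-softmax actor and a linear critic. It is not the paper's exact Algorithm 1: it sets $\lambda = 0$ and uses an importance-weighted semi-gradient TD(0) critic instead of the GTD($\lambda$) update the paper relies on for off-policy stability, and all names and hyperparameters are illustrative.

```python
import numpy as np

def softmax_policy(theta, x):
    """pi(.|s) for a linear-softmax actor; theta has shape (n_actions, n_features)."""
    prefs = theta @ x
    prefs = prefs - prefs.max()          # subtract max for numerical stability
    p = np.exp(prefs)
    return p / p.sum()

def off_pac_step(v, theta, x, a, r, x_next, b_prob,
                 gamma=0.99, alpha_v=0.01, alpha_u=0.001):
    """One update from a transition (s, a, r, s') generated by the behaviour policy b.

    x, x_next : feature vectors of s and s'
    b_prob    : b(a|s), the behaviour policy's probability of the action it took
    """
    pi = softmax_policy(theta, x)
    rho = pi[a] / b_prob                        # importance-sampling ratio rho(s, a)
    delta = r + gamma * (v @ x_next) - v @ x    # TD error with linear critic V_hat(s) = v . x_s

    # Critic: importance-weighted semi-gradient TD(0)
    # (the paper instead uses GTD(lambda), which is stable off-policy).
    v = v + alpha_v * rho * delta * x

    # Actor: psi(s, a) = grad_u log pi(a|s); for a linear-softmax policy the
    # gradient w.r.t. row a' of theta is (1{a' = a} - pi(a'|s)) * x.
    psi = -np.outer(pi, x)
    psi[a] += x
    theta = theta + alpha_u * rho * delta * psi
    return v, theta
```

With $\lambda = 0$ the term $R^{\lambda}_t - \hat{V}(s_t)$ reduces to the one-step TD error $\delta_t$, which is why the actor step above is $\alpha_u\,\rho\,\delta\,\psi$, a sample of $\hat{g}(u)$.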