# A really tough problem of off-policy training on sequence models

### Background

Let $Y = \{y_1, y_2, \ldots, y_N\}$ be a sequence. Denote by $\pi_\theta(Y)$ the policy (also, a probability) for generating sequence $Y$. The probability is computed by

$$
\pi_\theta(Y) = \prod_{i=1}^N p(y_i; \theta).
$$

Let $R_Y$ be a reward evaluated for sequence $Y$. Our objective is to maximize

$$
J(\theta) = \mathbb{E}_{Y \sim \pi_\theta(Y)} \Big[ R_Y \Big] .
$$

With importance sampling, the objective function turns into

$$
J(\theta) = \mathbb{E}_{Y \sim \pi_b(Y)} \Big[ \frac{\pi_\theta(Y)}{\pi_b(Y)} R_Y \Big] .
$$

Here $\pi_b$ is a proposal distribution. This type of off-policy training is useful when the action space of the original policy is very large and you have a stable probability distribution to use as the proposal distribution.

Computing the gradient is straightforward:

$$
\begin{align}
\nabla_\theta J(\theta) =& \mathbb{E}_{Y \sim \pi_b(Y)} \Big[ \frac{1}{\pi_b(Y)} R_Y \nabla_\theta \pi_\theta(Y) \Big] \\
=& \mathbb{E}_{Y \sim \pi_b(Y)} \Big[ \frac{\pi_\theta(Y)}{\pi_b(Y)} R_Y \nabla_\theta \log \pi_\theta(Y) \Big] .
\end{align}
$$

The second step applies the log-derivative trick.

### The tough problem

Although the objective and the gradient are easy to derive and implement, the objective is extremely difficult to optimize.

(1) When we start from random parameters $\theta$, the probability $\pi_\theta(Y)$ is extremely small for any sequence $Y$, because it is a product of many token probabilities $\prod_{i=1}^N p(y_i; \theta)$. For a 50-token sequence, even if every token probability starts at 0.5, the sequence probability is $0.5^{50} \approx \text{8.9e-16}$, which is effectively zero in PyTorch. Since $\pi_\theta(Y)$ is numerically zero for every sequence, the whole objective is zero and never gets optimized.

(2) Maybe you think removing the importance weight $\frac{\pi_\theta(Y)}{\pi_b(Y)}$ solves the problem. This won't work either: without the importance weight, the balance between the reward and the log probability is destroyed. Because the log probability lives on a log scale, keeping every token at a sufficiently high probability becomes more important than obtaining a high reward. The imbalance gets worse once some rewards are negative, which is common because we usually subtract a baseline from the reward. With a negative reward, the objective actively decreases the log probability of those samples, and because of the log scale it is much, much easier to decrease a log probability than to increase it.

### Discussion

Some things to try:

1. Scale the importance weight exponentially as $\frac{\pi_\theta(Y)^\alpha}{\pi_b(Y)^\alpha}$, so that the weight is not zero at the beginning (a code sketch follows at the end of this section).
    - This may be okay, but $\alpha$ probably needs to be annealed during training. If the annealing is not done correctly, the imbalance problem reappears.
2. Pretrain the policy with $\max \mathbb{E}_{Y \sim \pi_b(Y)} \Big[ \log \pi_\theta(Y) \Big]$.
    - There is no theoretical guarantee that this gives a better initialization.

However, both of these solutions are very weak. Even if the policy assigns every token in $Y$ a probability of 0.8, which is considerably high, a 50-token sequence still has policy probability $0.8^{50} \approx \text{1e-5}$. Still almost zero :face_with_head_bandage:.
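Below is a minimal PyTorch sketch of option 1 above, the $\alpha$-scaled importance weight. It is a sketch under assumptions, not a reference implementation: the function name, tensor shapes, and the `alpha` value are illustrative, and per-token log-probabilities under both the policy and the proposal are assumed to be available. The ratio is formed in log space and exponentiated once per sequence, so $\alpha$ directly controls how far the weight can collapse toward zero early in training.

```python
import torch


def off_policy_loss(logp_theta, logp_b, reward, alpha=0.1):
    """Alpha-scaled importance-weighted policy-gradient loss (sketch).

    logp_theta: (B, N) per-token log-probs under the current policy pi_theta
    logp_b:     (B, N) per-token log-probs under the proposal pi_b
    reward:     (B,)   sequence-level rewards R_Y (possibly baselined)
    alpha:      exponent on the importance weight; alpha=1 recovers the
                unbiased estimator, a small alpha keeps the weight close to 1
    """
    # Sequence log-probabilities: log pi(Y) = sum_i log p(y_i)
    seq_logp_theta = logp_theta.sum(dim=-1)
    seq_logp_b = logp_b.sum(dim=-1)

    # (pi_theta / pi_b)^alpha, computed in log space to avoid underflow.
    # The weight is detached: it multiplies the score function but is not
    # differentiated itself, so the gradient is w^alpha * R * grad log pi_theta.
    log_ratio = seq_logp_theta - seq_logp_b
    weight = torch.exp(alpha * log_ratio).detach()

    # REINFORCE-style surrogate; minimizing it ascends E[w * R_Y * grad log pi_theta(Y)].
    return -(weight * reward * seq_logp_theta).mean()


# Toy usage: a batch of 4 sequences, 50 tokens each, every token at p=0.5
# under the policy and p=0.9 under the proposal (the scenario from the text).
B, N = 4, 50
logp_theta = torch.full((B, N), -0.693, requires_grad=True)  # log 0.5
logp_b = torch.full((B, N), -0.105)                          # log 0.9
reward = torch.tensor([1.0, -0.5, 2.0, 0.3])                 # baselined rewards

# With alpha=1 the weight is exp(50 * (log 0.5 - log 0.9)) ~ 2e-13 (the problem);
# with alpha=0.02 it is ~ 0.56, so the gradient no longer vanishes.
loss = off_policy_loss(logp_theta, logp_b, reward, alpha=0.02)
loss.backward()
```

Annealing $\alpha$ toward 1 over training, as suggested in item 1, amounts to passing a schedule value into `alpha` at each step.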
### Appendix 1. Approximation with Jensen's inequality

We can find a lower bound by applying Jensen's inequality to the logarithm of the objective (this assumes $R_Y > 0$ so that $\log R_Y$ is defined):

$$
\begin{align}
\log J(\theta) =& \log \mathbb{E}_{Y \sim \pi_b(Y)} \Big[ \frac{\pi_\theta(Y)}{\pi_b(Y)} R_Y \Big] \\
\ge& \mathbb{E}_{Y \sim \pi_b(Y)} \Big[ \log \Big( \frac{\pi_\theta(Y)}{\pi_b(Y)} R_Y \Big) \Big] \\
=& \mathbb{E}_{Y \sim \pi_b(Y)} \Big[ \log \pi_\theta(Y) - \log \pi_b(Y) + \log R_Y \Big] \\
=& \mathbb{E}_{Y \sim \pi_b(Y)} \Big[ \log \pi_\theta(Y) \Big] + \underbrace{H(\pi_b) + \mathbb{E}_{Y \sim \pi_b(Y)} \Big[ \log R_Y \Big]}_{\text{constant}}
\end{align}
$$

In this lower bound the reward only appears inside the constant term, so optimizing the bound derived with Jensen's inequality does not improve the reward.

### Appendix 2. Token-wise loss

Suppose we can find a function `token_reward()` that correctly decomposes the sequence reward $R_Y$ into token-wise rewards $R_{y_i}$:

$$
R_{y_i} = \text{token\_reward}(R_Y, i) .
$$

Then,

$$
\begin{align}
J(\theta) =& \mathbb{E}_{Y \sim \pi_\theta(Y)} \Big[ R_{y_1} + \cdots + R_{y_N} \Big] \\
=& \mathbb{E}_{y_1 \sim \pi_\theta(y_1)} \Big[ R_{y_1} \Big] + \cdots + \mathbb{E}_{y_N \sim \pi_\theta(y_N)} \Big[ R_{y_N} \Big] \\
=& \sum_{i=1}^N \mathbb{E}_{y_i \sim \pi_\theta(y_i)} \Big[ R_{y_i} \Big] \\
=& \sum_{i=1}^N \mathbb{E}_{y_i \sim \pi_b(y_i)} \Big[ \frac{\pi_\theta(y_i)}{\pi_b(y_i)} R_{y_i} \Big]
\end{align}
$$

The gradient is

$$
\nabla_\theta J(\theta) = \sum_{i=1}^N \mathbb{E}_{y_i \sim \pi_b(y_i)} \Big[ \frac{\pi_\theta(y_i)}{\pi_b(y_i)} R_{y_i} \nabla_\theta \log \pi_\theta(y_i) \Big] .
$$

Because each importance weight now involves only a single token probability, it no longer underflows.

How to find the ==token-wise reward==? Given the following example:

```text
Y  = I went to that that park and walked dog dog,  reward = 10
Y' = I went to that city park and walked my dog,   reward = 20
Relative reward = 10
```

It is convenient to assume:

1. The relative reward is caused only by the altered tokens, i.e., `city` and `my` in the example. All other tokens have zero reward.
2. All altered tokens contribute equally to the relative reward. For this example, that means `Reward(that→city)=5` and `Reward(dog→my)=5`.

Let $c_1, \ldots, c_N$ indicate whether token $y_i$ is changed from the baseline ($c_i = 1$ if changed, $c_i = 0$ otherwise). With $R_Y$ taken as the relative reward, define `token_reward()` as

$$
\text{token\_reward}(R_Y, i) =
\begin{cases}
\frac{R_Y}{\sum_j c_j} & \text{if } c_i = 1 \\
0 & \text{if } c_i = 0
\end{cases}
$$

A short code sketch of this decomposition appears near the end of this note.

Per-token probabilities under the policy vs. the proposal (scratch note):

| position | 1 | 2 | 3 | 4 | 5 | 6 |
|----------|------|------|-----|-----|-----|-----|
| policy   | 1e-6 | 1e-6 |     |     |     |     |
| proposal | 0.9  | 0.9  | 0.9 | 0.9 | 0.9 | 0.9 |

- hierarchical action space

Power-scaling the objective (for $0 < \alpha \le 1$ and $R_Y \ge 0$, by Jensen's inequality):

$$
\begin{align}
J(\theta)^\alpha =& \ \mathbb{E}_{Y \sim \pi_b(Y)} \Big[ \frac{\pi_\theta(Y)}{\pi_b(Y)} R_Y \Big]^\alpha \\
\ge& \ \mathbb{E}_{Y \sim \pi_b(Y)} \Big[ \Big( \frac{\pi_\theta(Y)}{\pi_b(Y)} R_Y \Big)^\alpha \Big] \\
=& \ \mathbb{E}_{Y \sim \pi_b(Y)} \Big[ \Big( \frac{\pi_\theta(Y)}{\pi_b(Y)} \Big)^\alpha R_Y^\alpha \Big]
\end{align}
$$

### Board

Scoring a sequence with BERT by masking two complementary sets of positions and multiplying the per-token probabilities:

- `P_bert( <s> I saw a cat in the park </s> )`
- lm mask: `1 0 1 0 1 0 1 0`; rm mask: `0 1 0 1 0 1 0 1`
- `P_bert(<s> <mask> saw <mask> cat <mask> the <mask> </s>)` → `1 0.6 1 0.8 1 0.5 1`
- × `P_bert(<mask> I <mask> a <mask> in <mask> park <mask>)` → `0.9 1 0.8 1 0.6 ...`
- = `0.9 0.6 0.8 0.6 0.5`

Filling a partially masked sequence, e.g. `<s> I saw a <mask> in the <mask> </s>`, with mask pattern $M$:

$$
p(Y' \mid Y, M), \qquad p_{\text{rf}}(Y' \mid Y) = \sum_M p(Y' \mid Y, M) \, p(M)
$$
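The board notes above sketch scoring a sequence with BERT by running it on two complementary mask patterns and multiplying the probabilities the model assigns to the masked-out tokens. Here is a minimal sketch of that two-pass scoring, assuming a HuggingFace masked-LM checkpoint (`bert-base-uncased` is my stand-in; the notes only say `P_bert`, and the numbers in them are illustrative). It does not implement the final marginalization $p_{\text{rf}}(Y' \mid Y) = \sum_M p(Y' \mid Y, M)\,p(M)$.

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()


def masked_log_score(sentence: str) -> float:
    """Log of the product of token probabilities over two complementary masks.

    Pass 1 masks the even content positions, pass 2 masks the odd ones, so
    every content token is predicted exactly once given the other half.
    """
    ids = tokenizer(sentence, return_tensors="pt")["input_ids"][0]
    content = range(1, len(ids) - 1)  # skip the [CLS] / [SEP] special tokens
    total_logp = 0.0
    for parity in (0, 1):
        positions = [i for i in content if i % 2 == parity]
        masked = ids.clone()
        masked[positions] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits[0]
        logp = torch.log_softmax(logits, dim=-1)
        total_logp += sum(logp[i, ids[i]].item() for i in positions)
    return total_logp


print(masked_log_score("I saw a cat in the park"))
```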
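Returning to the token-wise reward of Appendix 2, here is a minimal Python sketch of the `token_reward()` decomposition. It assumes the sequence and its baseline are already tokenized to the same length, so changed positions can be found by position-wise comparison (the helper signature and the position-wise diff are my simplifications):

```python
def token_reward(y, y_baseline, relative_reward):
    """Split a sequence-level relative reward equally over the changed tokens.

    y, y_baseline:   token lists of the same length
    relative_reward: reward of `y` minus the reward of the baseline sequence
    Returns the per-token rewards R_{y_i}.
    """
    assert len(y) == len(y_baseline)
    # c_i = 1 if token i differs from the baseline, else 0
    changed = [int(a != b) for a, b in zip(y, y_baseline)]
    n_changed = sum(changed)
    if n_changed == 0:
        return [0.0] * len(y)
    return [relative_reward / n_changed if c else 0.0 for c in changed]


# The example from Appendix 2: the relative reward of 10 is split over `city` and `my`.
y_base = "I went to that that park and walked dog dog".split()
y_new = "I went to that city park and walked my dog".split()
print(token_reward(y_new, y_base, relative_reward=10))
# -> [0.0, 0.0, 0.0, 0.0, 5.0, 0.0, 0.0, 0.0, 5.0, 0.0]
```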