<style> img { display: block; margin-left: auto; margin-right: auto; } </style>

> [Paper link](https://arxiv.org/abs/2310.10735) | [Code link](https://github.com/ryanshea10/personachat_offline_rl) | EMNLP 2023

:::success
**Thoughts**
This study uses offline RL to improve the quality and utility of an open-domain dialogue system.
:::

## Abstract
In open-domain dialogue systems, maintaining a consistent persona is crucial. Unlike previous methods that rely on supervised learning or online RL, this study uses offline RL, which helps reduce the variance of importance weights during training.

## Background
In recent years, large language models have been trained on vast amounts of unlabeled text data and then fine-tuned for dialogue tasks. However, persona consistency in social dialogue remains a challenge. Previous approaches have addressed this by:
- Conditioning dialogue generation on persona information.
- Combining supervised learning with online RL.

:::info
[BlenderBot3](https://ai.meta.com/blog/blenderbot-3-a-175b-parameter-publicly-available-chatbot-that-improves-its-skills-and-safety-over-time/)
BlenderBot 3 (BB3) is an advanced open-domain dialogue system developed by Meta (formerly Facebook). It is designed to engage in more natural, human-like conversations by combining techniques such as large-scale pretraining, fine-tuning, and reinforcement learning. BB3 is known for maintaining a **consistent persona** and **generating contextually relevant responses** across a wide range of topics, making it one of the more sophisticated models in conversational AI.
:::

## Method
![image](https://hackmd.io/_uploads/HkZJ3Kw5C.png)

### Offline RL
Their offline RL training approach uses a policy gradient method to optimize the RL objective:

$$
J(\theta) = \mathbb{E}_{\tau \sim p(\pi_\theta(\tau))} \left[ \sum_{t=0}^T \gamma^t r(s_t, a_t) \right]
$$

where $\tau$ is a trajectory of states $s_t$ and actions $a_t$, and $\gamma$ is the discount factor.

### VaRMI Importance Sampling
The biggest challenge with policy-gradient-based offline RL methods is the high variance of the gradient estimator, which stems from the importance weights used to correct for the mismatch between the policy being trained and the behavior policy that generated the logged data. This study addresses the problem with VaRMI (Variance-Reducing MLE-Initialized) importance sampling, which reduces the variance of the importance weights during policy gradient offline RL training.

Below are two example dialogues:

![image](https://hackmd.io/_uploads/B1PW2tP9R.png)
![image](https://hackmd.io/_uploads/SyQM2twcC.png)

## Experiment
This study uses the DNLI (Dialogue Natural Language Inference) dataset to evaluate the effectiveness of the approach. The evaluation metrics are:
- **Hits@1**: Measures the accuracy of selecting the correct response from a set of candidates.
- **Entail@1**: The percentage of selected responses that are logically entailed by the given context.
- **Contradict@1**: The percentage of selected responses that contradict the context.
- **Rand@1**: Measures performance against a random baseline, where candidate responses are chosen at random.

Below are the results comparing their importance sampling techniques to the BB3 and BB3+RL baselines.

![image](https://hackmd.io/_uploads/r1wCDiw50.png)

Below are human evaluation results (on a scale of 1 to 5) of their two importance sampling techniques vs. the BB3-3B baseline.

![image](https://hackmd.io/_uploads/SkVkOjw5R.png)
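To make the offline policy-gradient training described in the Method section more concrete, below is a minimal PyTorch-style sketch of a single update on logged dialogue data. It is an illustration under simplifying assumptions rather than the authors' implementation: the function `offline_pg_step`, the `batch` fields (`input_ids`, `labels`, `mask`, `reward`, `behavior_logp`), and the HuggingFace-style `policy(...).logits` call are hypothetical names, and the `use_varmi_style` branch only gestures at the idea that an MLE-initialized policy lets the importance weights of gold (positive-reward) utterances be treated as 1, not the paper's exact VaRMI scheme.

```python
import torch
import torch.nn.functional as F

def offline_pg_step(policy, optimizer, batch, use_varmi_style=True):
    """One REINFORCE-style update on logged (offline) dialogue data."""
    logits = policy(batch["input_ids"]).logits                # (B, T, V); HF-style model assumed
    logp = F.log_softmax(logits, dim=-1)
    # Log-probability of each logged action (next token) under the current policy.
    action_logp = logp.gather(-1, batch["labels"].unsqueeze(-1)).squeeze(-1)  # (B, T)
    seq_logp = (action_logp * batch["mask"]).sum(-1)           # (B,) utterance log-prob

    if use_varmi_style:
        # Simplified variance reduction in the spirit of VaRMI: because the policy
        # is MLE-initialized on the gold utterances, treat the importance weight
        # of positive-reward (gold) samples as 1.
        weights = torch.ones_like(seq_logp)
    else:
        # Standard off-policy importance weight pi_theta / pi_behavior, which can
        # have very high variance over long utterances.
        weights = torch.exp(seq_logp.detach() - batch["behavior_logp"])

    # Policy-gradient loss: maximize the importance-weighted, reward-scaled log-likelihood.
    loss = -(weights * batch["reward"] * seq_logp).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Here `batch["reward"]` would hold an utterance-level reward (e.g., positive for persona-consistent utterances and negative for contradictory ones, as an assumed reward design); the discount factor from the objective above is omitted because each logged utterance is treated as a single action in this sketch.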
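Similarly, here is a small self-contained sketch of how the top-1 metrics listed in the Experiment section could be computed once each model's top-ranked candidate has been labeled. The labeling scheme (`"gold"`, `"entail"`, `"contradict"`, `"random"`) and the function name `dnli_style_metrics` are assumptions made for illustration, not the official DNLI evaluation tooling.

```python
from collections import Counter

def dnli_style_metrics(top1_labels):
    """Fraction of dialogue turns whose top-ranked candidate falls in each category.

    `top1_labels` holds one label per turn for the candidate the model ranked
    first: "gold" (the true next utterance), "entail" (consistent with the
    persona), "contradict", or "random".
    """
    counts = Counter(top1_labels)
    n = len(top1_labels)
    return {
        "Hits@1": counts["gold"] / n,
        "Entail@1": counts["entail"] / n,
        "Contradict@1": counts["contradict"] / n,
        "Rand@1": counts["random"] / n,
    }

# Toy example with four dialogue turns.
print(dnli_style_metrics(["gold", "entail", "contradict", "gold"]))
# {'Hits@1': 0.5, 'Entail@1': 0.25, 'Contradict@1': 0.25, 'Rand@1': 0.0}
```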