<style>
img {
display: block;
margin-left: auto;
margin-right: auto;
}
</style>
> [Paper link](https://arxiv.org/abs/2310.14503) | [Code link](https://github.com/gouqi666/RAST) | EMNLP 2023
:::success
**Thoughts**
The new task of generating questions in different styles is addressed with a two-stage training procedure (supervised learning followed by reinforcement learning) that produces style-transferred questions.
This approach balances consistency with the given context and answer against diversity of expression.
:::
## Abstract
Recent studies primarily focus on the given passage or the semantic word space for diverse content planning.
However, the potential of external knowledge for enhancing expression diversity remains underexplored.
This study on Retrieval-Augmented Style Transfer (RAST) aims to generate questions using diverse templates, leveraging external knowledge for greater expression variability.
## Background
Question Generation (QG) is the task of generating questions based on a given answer and a grounding paragraph.

However, it faces two main issues:
1. **Inconsistency**: Generated questions may be irrelevant to the given context or answer.
2. **Lack of diversity**: QG systems often fail to generate multiple questions from the same pair of context and answer.
## Method
RAST is a framework for Retrieval-Augmented Style Transfer that retrieves question style templates from an external set and utilizes them to generate questions with diverse expressions.
It contains three main components:
1. **Vanilla Generator**: Responsible for initial template planning.
2. **Style Retriever**: Filters related style templates based on the initial template.
3. **Style-Based Generator**: Robustly combines a style template and the internal context to generate the final question.
The retriever and the style-based generator in RAST can be jointly trained using a reinforcement learning (RL) based method.
This RL approach directly maximizes a balanced combination of consistency and diversity rewards, effectively addressing the issues of inconsistency and lack of diversity.
### Overview
In question generation, a paragraph $c$ and an answer $a$ are given, and the goal is to generate a question $y$.
We treat $x = \{ c, a \}$ as the input.
Previous work directly models $p(y \mid x)$; RAST instead decomposes the distribution using the template corpus $\mathbf{Z}$:
$$
\begin{align}
p(y \mid x, \mathbf{Z}) & = \mathbb{E}_{z_0, z \in \mathbf{Z}} \big[ \underbrace{p(z_0 \mid x)}_{\text{vanilla QG}} \times \underbrace{p(z \mid z_0) \, p(y \mid z, x)}_{\text{RAST}} \big] \nonumber
\end{align}
$$
where $\mathbf{Z}$ is the external corpus of question style templates, and $z_0$ is the initial question template, which can be predicted from the context $x$.
---

The system selects style templates from external knowledge $(z \in \text{top-}k \text{ from } \mathbf{Z})$ that are similar but not identical to the initial template $z_0$. These style templates are then used to generate questions.
During the training procedure, with a given context $x$, the initial template $z_0$ is extracted from the ground truth question $y$ by masking context-sensitive information.
In the inference procedure, the system uses vanilla question generation $p(y \mid x)$ to generate the best candidate question $y_0$, from which the initial template $z_0$ is extracted.
### Question Style Templates
> Masking
The question templates are then obtained from the collected questions by masking context-sensitive information.
> Duplication removal
Near-duplicate templates are removed by measuring pairwise Jaccard similarities.
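A minimal sketch of how template extraction and near-duplicate removal might be implemented, assuming spaCy NER is used to mask context-sensitive spans and templates are compared as token sets (the masking rules and the 0.8 threshold are assumptions, not the paper's exact settings):
```python
import spacy

nlp = spacy.load("en_core_web_sm")

def extract_template(question: str) -> str:
    """Mask context-sensitive spans (named entities here) with [MASK]."""
    doc = nlp(question)
    template = question
    for ent in reversed(doc.ents):  # iterate in reverse so character offsets stay valid
        template = template[:ent.start_char] + "[MASK]" + template[ent.end_char:]
    return template

def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity between two templates."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / max(len(sa | sb), 1)

def deduplicate(templates: list[str], threshold: float = 0.8) -> list[str]:
    """Keep a template only if it is not a near-duplicate of one already kept."""
    kept = []
    for t in templates:
        if all(jaccard(t, k) < threshold for k in kept):
            kept.append(t)
    return kept
```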
### Retrieval-Augmented Style Transfer
> Style Retrieval Model
The query (initial template) and candidate styles are encoded as follows:
$$
\begin{align}
q(z) & = \mathrm{BERT}_1(z) \nonumber \\
q(z_0) & = \mathrm{BERT}_2(z_0) \nonumber \\
p_\phi (z \mid z_0) & \propto \exp[q(z)^\top q(z_0)] \nonumber
\end{align}
$$
where $\phi$ denotes the parameters of the two encoders.
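A hedged sketch of the dual-encoder retrieval scoring with Hugging Face `transformers`; the `bert-base-uncased` checkpoints and [CLS] pooling are assumptions used for illustration:
```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
style_encoder = BertModel.from_pretrained("bert-base-uncased")  # BERT_1: encodes candidate styles z
query_encoder = BertModel.from_pretrained("bert-base-uncased")  # BERT_2: encodes the initial template z_0

def encode(encoder: BertModel, texts: list[str]) -> torch.Tensor:
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    return encoder(**batch).last_hidden_state[:, 0]  # [CLS] embeddings, shape (n, d)

@torch.no_grad()
def retrieval_distribution(z_0: str, candidates: list[str]) -> torch.Tensor:
    """p_phi(z | z_0) over a candidate set, via dot-product similarity."""
    q_z0 = encode(query_encoder, [z_0])       # (1, d)
    q_z = encode(style_encoder, candidates)   # (k, d)
    logits = (q_z @ q_z0.T).squeeze(-1)       # (k,)
    return torch.softmax(logits, dim=0)
```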
> Style Transfer Model
They use T5 as their style transfer model:
$$
p_\theta(y \mid z, x) = \prod_{t=1}^T p_\theta (y_t \mid x, z, y_{1: t-1})
$$
where $T$ indicates the question length, and $\theta$ is the model parameters.
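A minimal sketch of running the style-based generator with T5, concatenating the style template, answer, and context into a single source sequence; the input format and the `t5-base` checkpoint are assumptions, not the paper's exact setup:
```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

def generate_question(template: str, context: str, answer: str, num_return: int = 1) -> list[str]:
    # Hypothetical input format: style template + answer + context in one sequence.
    src = f"style: {template} answer: {answer} context: {context}"
    inputs = tokenizer(src, return_tensors="pt", truncation=True)
    outputs = model.generate(
        **inputs,
        max_length=64,
        num_beams=4,               # num_return must not exceed num_beams
        num_return_sequences=num_return,
    )
    return [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]
```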
### Two-Stage Training
1. **Supervised Learning**: Used to initialize the style transfer model.
2. **Reinforcement Learning (RL)**: Applied to avoid exposure bias and to address the evaluation discrepancy between training and testing.
#### Supervised Learning
Since the original training procedure on $\{ (x, y, z_0) \}$ may suffer from overfitting (the template $z_0$ is extracted from $y$ itself), they introduce noise into the template $z_0$ to create a noisy template $\tilde{z}_0$ by:
1. Replacing [MASK] with a random entity.
2. Adding some nouns.
3. Deleting [MASK].
4. Randomly choosing another template.
They then train the model using cross-entropy loss:
$$
L_\theta^{CE} = - \sum_i \log p_\theta(y_i \mid x, \tilde{z}_0, y_{<i})
$$
where $y_i$ is the ground-truth token at time step $i$.
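A hedged sketch of the four noising operations, assuming `[MASK]` appears literally in the template string and that the entity/noun pools are simple placeholders:
```python
import random

ENTITY_POOL = ["Paris", "Einstein", "1998"]  # hypothetical replacement entities
NOUN_POOL = ["city", "person", "year"]       # hypothetical nouns to insert

def noise_template(z_0: str, template_corpus: list[str]) -> str:
    """Apply one randomly chosen noising operation to z_0."""
    op = random.choice(["replace_mask", "add_noun", "delete_mask", "swap_template"])
    tokens = z_0.split()
    if op == "replace_mask" and "[MASK]" in tokens:
        tokens[tokens.index("[MASK]")] = random.choice(ENTITY_POOL)
    elif op == "add_noun":
        tokens.insert(random.randrange(len(tokens) + 1), random.choice(NOUN_POOL))
    elif op == "delete_mask":
        tokens = [t for t in tokens if t != "[MASK]"]
    else:  # swap_template: use a randomly chosen template from the corpus instead
        return random.choice(template_corpus)
    return " ".join(tokens)
```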
#### Reinforcement Learning
##### RL for Style Retrieval and Transfer
In RAST, the system is viewed as an agent interacting with an external environment composed of words and question templates. The combined policy either selects a style template or predicts the next word. The RL loss to be minimized is the negative expected reward:
$$
L^{RL}(\theta, \phi) = - \mathbb{E}_{y^s \sim p_\theta, z^s \sim p_\phi} [r(y^s, z^s)]
$$
where $y^s = (y_1^s, \dots, y_T^s)$ represents the words sampled from the style transfer model $p_\theta$, and $z^s$ denotes the template sampled from the style retrieval model $p_\phi$.
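One standard way to optimize this objective is a REINFORCE-style estimator, where the sampled reward weights the log-probabilities of the sampled template and question. A minimal sketch (the paper may use additional variance reduction, e.g. a baseline) could look like:
```python
import torch

def rl_loss(log_p_y: torch.Tensor, log_p_z: torch.Tensor, reward: torch.Tensor) -> torch.Tensor:
    """
    REINFORCE surrogate for L^RL(theta, phi).
    log_p_y: sum of token log-probs of the sampled question under p_theta
    log_p_z: log-prob of the sampled template under p_phi
    reward:  r(y^s, z^s), treated as a constant (no gradient flows through it)
    """
    return -(reward.detach() * (log_p_y + log_p_z)).mean()
```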
##### Reward Model
> **Consistency Reward**
To address the first issue (inconsistency), the consistency reward is inspired by prior work on question answering. They use a generative QA model based on T5 and measure consistency as follows:
$$
r_{cons}(y^s, z^s) = \exp(- L_{qa})
$$
where
$$
L_{qa} = - \frac{1}{T_a} \sum_{i=1}^{T_a} \log p(a_i \mid c, y^s, a_{<i})
$$
Here, $T_a$ is the answer length, and $y^s$ is the sampled question from $p_\theta(y \mid z, x)$.
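A hedged sketch of computing the consistency reward with a generative T5 QA model: score the gold answer under the model conditioned on the context and the sampled question, then exponentiate the negative loss (the QA prompt format and checkpoint are assumptions):
```python
import math
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

qa_tokenizer = T5Tokenizer.from_pretrained("t5-base")
qa_model = T5ForConditionalGeneration.from_pretrained("t5-base")

@torch.no_grad()
def consistency_reward(context: str, sampled_question: str, answer: str) -> float:
    src = f"question: {sampled_question} context: {context}"  # hypothetical QA input format
    inputs = qa_tokenizer(src, return_tensors="pt", truncation=True)
    labels = qa_tokenizer(answer, return_tensors="pt").input_ids
    l_qa = qa_model(**inputs, labels=labels).loss.item()  # mean negative log-likelihood, i.e. L_qa
    return math.exp(-l_qa)
```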
> **Diversity Reward**
To address the second issue (lack of diversity), the diversity reward uses the Jaccard similarity between the token sets of the sampled question and the sampled template:
$$
r_{divs}(y^s, z^s) = \frac{|z^s \cap y^s|}{|z^s \cup y^s|}
$$
Finally, the total reward is calculated as:
$$
r(y^s, z^s) = r_{cons} + \lambda r_{divs}
$$
where $\lambda \in [0, 1]$ is a balancing parameter.
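A minimal sketch of the diversity reward and the combined reward, treating the sampled question and template as lowercase token sets (the default λ = 0.5 is a placeholder, not the paper's tuned value):
```python
def diversity_reward(sampled_question: str, sampled_template: str) -> float:
    """Jaccard similarity between the token sets of y^s and z^s."""
    y = set(sampled_question.lower().split())
    z = set(sampled_template.lower().split())
    return len(y & z) / max(len(y | z), 1)

def total_reward(r_cons: float, r_divs: float, lam: float = 0.5) -> float:
    """r(y^s, z^s) = r_cons + lambda * r_divs, with lambda in [0, 1]."""
    return r_cons + lam * r_divs
```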

## Experiment
In this study, two public datasets were used:
1. **SQuAD**
2. **NewsQA**
For evaluation, the following metrics were employed:
1. **Top-1 BLEU**: Measures the BLEU score of the top-generated question.
2. **Oracle BLEU**: Measures the highest BLEU score among all generated questions.
3. **Pairwise BLEU**: Measures the BLEU score between pairs of generated questions to assess diversity.
4. **Overall BLEU**: Calculated as $\text{Top-1} \times \frac{\text{Oracle}}{\text{Pairwise}}$.
Note that all BLEU scores reported are BLEU-4.
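A small sketch of how the four metrics relate, assuming a sentence-level BLEU-4 function `bleu(hypothesis, reference)` is available (e.g. from sacrebleu or nltk) and at least two candidate questions are generated per example:
```python
from itertools import combinations
from typing import Callable

def diversity_metrics(bleu: Callable[[str, str], float],
                      candidates: list[str], reference: str) -> dict:
    top1 = bleu(candidates[0], reference)                      # Top-1 BLEU
    oracle = max(bleu(c, reference) for c in candidates)       # Oracle BLEU
    pairs = list(combinations(candidates, 2))
    pairwise = sum(bleu(a, b) for a, b in pairs) / len(pairs)  # Pairwise BLEU (lower = more diverse)
    overall = top1 * oracle / pairwise                         # Overall BLEU
    return {"top1": top1, "oracle": oracle, "pairwise": pairwise, "overall": overall}
```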
The comparison of different techniques on question generation for NewsQA and two splits of SQuAD is shown in the image below:

Below are three RAST outputs, each demonstrating different question types. In each example, the given answer is highlighted in red within the source context:
