# Collaborative Question Answering
In the Collaborative QA task, two agents $A_1$ and $A_2$ collaborate to answer a question through natural language conversation. Each agent has its own knowledge, which we call $K_1$ and $K_2$ respectively. To simplify the task, we only consider questions that can be answered in a one-turn conversation. Specifically, $A_1$ receives the original question $Q$ and generates a new question $Q'$ for $A_2$. $A_2$ then responds with an answer $A$. In our synthetic dataset built from bAbI templates, $A_2$ gives the answer by selecting a node from $K_2$.
Since there is no ground truth for $Q'$, $A_1$ must learn to generate it in an unsupervised manner. Borrowing ideas from [Li et al. 2017](https://arxiv.org/abs/1701.06547) and [Yu et al. 2017](https://arxiv.org/abs/1609.05473), $A_1$ can learn to generate $Q'$ through adversarial learning. In this framework, a discriminator $D$ learns to distinguish whether an input question is real or fake. By combining the signals from $A_2$ and $D$, $A_1$ can be taught to generate the 'right' question.
However, the discrete nature of human language makes the framework non-differentiable, so we use reinforcement learning to train our model. [Li et al. 2017](https://arxiv.org/abs/1701.06547) tried two methods:
1. Monte Carlo (MC) Search
2. Training $D$ to assign a reward to a partially decoded sequence
Though the second method is much more time-efficient, it is not suitable for this task, because $A_2$ cannot answer a partially decoded question.
## Reinforcement Learning for Collaborative QA
### Component descriptions
**Action** The action space of $A_1$ consists of the token vocabulary $V$ plus all nodes in $K_1$ (for copying a node value). At timestep $t$, $A_1$ takes an action $Q'_t$, which becomes the $t$-th token of $Q'$. Let the maximum length of $Q'$ be $T$. We then sample the remaining tokens $Q'_{t+1:T}$ from a roll-out policy $A_1^r$ using MC search. At each timestep, we sample $N$ different completions of $Q'$:
$$ \{ Q'^{1},..., Q'^{N} \}= MC^{A_1^r}(Q'_{1:t}; N) $$
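To make the MC search step concrete, here is a minimal Python sketch; `rollout_policy` and its `sample_next_token` method are hypothetical placeholders standing in for the roll-out policy $A_1^r$:

```python
def mc_rollouts(prefix_tokens, rollout_policy, max_len, n_rollouts):
    """Complete a partial question Q'_{1:t} into N full questions
    via Monte Carlo search with the roll-out policy A_1^r."""
    completions = []
    for _ in range(n_rollouts):
        tokens = list(prefix_tokens)  # keep the already-decoded prefix Q'_{1:t}
        while len(tokens) < max_len:
            # sample_next_token is a hypothetical method: given the partial
            # question, it samples the next token from the roll-out policy.
            next_token = rollout_policy.sample_next_token(tokens)
            if next_token == "<eos>":
                break
            tokens.append(next_token)
        completions.append(tokens)
    return completions
```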
**State** At timestep $t$, the state of $A_1$ consists of the original question $Q$, its knowledge $K_1$, and the previously decoded tokens $Q'_{1:t-1}$, i.e. $s_t = [ Q, K_1, Q'_{1:t-1} ]$.
**Policy** With a slight abuse of notation, we denote the policy of $A_1$ by the agent's name, $A_1(Q'_t|s_t)$. The policy is stochastic.
**Environment** The discriminator $D$ and the second agent $A_2$ form the environment for $A_1$:
- **Discriminator** The discriminator $D$ takes a natural language question as input and outputs a binary classification indicating whether the question is real or fake. We denote the probability that a question $Q'$ is real as $p_D (True|Q')$.
- **$A_2$** The answer given by $A_2$ is a node $v$ selected from its knowledge graph $K_2$. Specifically, $A_2$ outputs a probability distribution over the nodes of $K_2$. We denote the *ground-truth* node by $v_{gt}$, and the probability of selecting it by $p_{A_2}(v_{gt}|Q', K_2)$.
**Reward** There are two sources of reward: 1) the score given by $D$, and 2) the answering correctness of $A_2$:
$$
\begin{aligned}
&r_t = \frac{1}{N} \sum_{n=1}^{N} \left( r_{A_2}(Q'^{n}) + \beta \cdot r_{D}(Q'^{n}) \right), Q'^n \in MC^{A_1^r}(Q'_{1:t}; N) \\
&r_{A_2}(Q'^{n}) = p_{A_2}(v_{gt}|Q'^{n}, K_2) \\
&r_{D}(Q'^{n}) = p_D (True|Q'^n) \\
\end{aligned}
$$
where $\beta$ is a hyper-parameter.
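For illustration, the reward $r_t$ can then be computed from the $N$ completions as follows; `answer_prob` and `real_prob` are hypothetical callables standing in for $p_{A_2}(v_{gt}|\cdot, K_2)$ and $p_D(True|\cdot)$:

```python
def reward_at_t(rollouts, answer_prob, real_prob, beta):
    """Average reward over the N Monte Carlo completions of Q'_{1:t}.

    answer_prob(q) -> p_{A_2}(v_gt | q, K_2), the answering correctness of A_2
    real_prob(q)   -> p_D(True | q), the discriminator score
    """
    total = sum(answer_prob(q) + beta * real_prob(q) for q in rollouts)
    return total / len(rollouts)
```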
### Training $A_1$ via Policy Gradient
Let $\theta$ be the parameters of $A_1$. We train $A_1$ with the REINFORCE algorithm: $A_1$ must learn to generate $Q'$ so as to maximize the expected reward. Let the expected reward at timestep $t$ be $\bar{r}_t$, which we approximate with the Monte Carlo estimate $r_t$ defined above:
$$
\begin{aligned}
J(\theta) &= \sum_{t=1}^{T} \mathbb{E}_{Q'_{1:t-1} \sim A_1} \left[ \sum_{Q'_t} \bar{r}_t \cdot A_1(Q'_t|s_t) \right] \\
&\simeq \sum_{t=1}^{T} \mathbb{E}_{Q'_{1:t-1} \sim A_1} \left[ \sum_{Q'_t} r_t \cdot A_1(Q'_t|s_t) \right] \\
\end{aligned}
$$
We can then estimate the gradient of $J(\theta)$ using the likelihood-ratio trick:
$$
\begin{aligned}
\bigtriangledown_{\theta} J(\theta) &\simeq \sum_{t=1}^{T} \mathbb{E}_{Q'_{1:t-1} \sim A_1} \left[ \sum_{Q'_t} r_t \cdot \bigtriangledown_{\theta}A_1(Q'_t|s_t) \right] \\
&= \sum_{t=1}^{T} \mathbb{E}_{Q'_{1:t-1} \sim A_1} \left[ \sum_{Q'_t} A_1(Q'_t|s_t) \cdot r_t \cdot \bigtriangledown_{\theta}\log A_1(Q'_t|s_t) \right] \\
&= \sum_{t=1}^{T} \mathbb{E}_{Q'_t \sim A_1(Q'_t|s_t)} [ r_t \cdot \bigtriangledown_{\theta}\log A_1(Q'_t|s_t)] \\
\end{aligned}
$$
Then we use gradient ascent to update $\theta$:
$$ \theta \gets \theta + \alpha \cdot \bigtriangledown_{\theta}J(\theta) $$
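A minimal PyTorch-style sketch of this update is shown below, assuming the sampled per-token log-probabilities $\log A_1(Q'_t|s_t)$ and the Monte Carlo reward estimates $r_t$ have already been collected; all names are illustrative:

```python
import torch

def reinforce_update(log_probs, rewards, optimizer):
    """One REINFORCE step on J(theta).

    log_probs: list of scalar tensors, log A_1(Q'_t | s_t) for the sampled tokens
    rewards:   list of floats, the Monte Carlo reward estimates r_t
    """
    # Minimizing the negated objective with the optimizer is equivalent to
    # gradient ascent on J(theta).
    loss = -torch.stack([lp * r for lp, r in zip(log_probs, rewards)]).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```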
### Training the discriminator $D$
Let $\phi$ be the parameters of $D$. The objective function of $D$ is:
$$ J(\phi) = - \mathbb{E}_{Q' \sim p_{data}}[\log D(Q')] - \mathbb{E}_{Q' \sim A_1}[\log (1-D(Q'))] $$
Then we use gradient descent to update $\phi$:
$$ \phi \gets \phi - \alpha \cdot \bigtriangledown_{\phi}J(\phi) $$
The true data for training $D$ are synthesized using templates from the bAbI dataset. For example:
- Where is John ?
- Where is John located?
- What is the location of John?
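As a sketch of one discriminator update under this objective, assuming `D` maps a batch of already-encoded questions to probabilities $p_D(True|Q')$ (the encoding step is omitted):

```python
import torch
import torch.nn.functional as F

def discriminator_step(D, real_questions, fake_questions, optimizer):
    """One gradient-descent step on J(phi).

    real_questions: batch of template-synthesized questions (already encoded)
    fake_questions: batch of questions sampled from A_1 (already encoded)
    D(x) is assumed to return p_D(True | x) for each question in the batch.
    """
    p_real = D(real_questions)   # should be pushed towards 1
    p_fake = D(fake_questions)   # should be pushed towards 0
    loss = (F.binary_cross_entropy(p_real, torch.ones_like(p_real))
            + F.binary_cross_entropy(p_fake, torch.zeros_like(p_fake)))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```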
### Training $A_2$
Let $\eta$ be the parameters of $A_2$, and let $V_2$ be the set of nodes in $K_2$. $A_2$ assigns a probability distribution $P_{\eta}(V_2|Q', K_2)$ over the nodes in $K_2$. The objective function for training $A_2$ is the *cross-entropy* loss:
$$ J(\eta) = CrossEnt(P_{\eta}(V_2|Q', K_2), \mathbb{1}_{v_{gt}}) $$
$\mathbb{1}_{v_{gt}}$ is a one-hot vector whose entry at the index of $v_{gt}$ is 1 and whose other entries are all zero. We use gradient descent to update $\eta$:
$$ \eta \gets \eta -\alpha \cdot \bigtriangledown_{\eta}J(\eta) $$
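A corresponding sketch for the $A_2$ update, assuming $A_2$ produces unnormalized scores (logits) over the nodes $V_2$ and `gt_index` is the index of $v_{gt}$:

```python
import torch
import torch.nn.functional as F

def a2_step(node_logits, gt_index, optimizer):
    """One gradient-descent step on J(eta).

    node_logits: tensor of shape (|V_2|,), unnormalized scores over nodes of K_2
    gt_index:    integer index of the ground-truth node v_gt
    """
    # F.cross_entropy applies the softmax internally, so this matches
    # CrossEnt(P_eta(V_2 | Q', K_2), 1_{v_gt}).
    loss = F.cross_entropy(node_logits.unsqueeze(0), torch.tensor([gt_index]))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```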
### The Training Procedure
