# Collaborative Question Answering
In the Collaborative QA task, two agents $A_1$ and $A_2$ collaborate to answer a question through natural language conversation. Each agent has its own knowledge, which we call $K_1$ and $K_2$ respectively. To simplify the task, we only consider questions that can be answered in a one-turn conversation. Specifically, $A_1$ receives the original question $Q$ and generates a new question $Q'$ for $A_2$. $A_2$ then responds with an answer $A$. In our synthetic dataset built from bAbI templates, $A_2$ gives the answer by selecting a node from $K_2$.
Since there is no ground truth for $Q'$, $A_1$ must learn to generate it in an unsupervised manner. Borrowing ideas from [Li et al. 2017](https://arxiv.org/abs/1701.06547) and [Yu et al. 2017](https://arxiv.org/abs/1609.05473), $A_1$ can learn to generate $Q'$ through adversarial learning. In this framework, a discriminator $D$ learns to distinguish whether an input question is real or fake. By combining the signals from $A_2$ and $D$, $A_1$ can be taught to generate the 'right' question.
However, the discrete nature of human language makes the framework non-differentiable, so we use reinforcement learning to train our model. [Li et al. 2017](https://arxiv.org/abs/1701.06547) tried two methods:
1. Monte Carlo (MC) Search
2. Training $D$ to assign a reward to a partially decoded sequence
Though the second method is much more time-efficient, it is not suitable for this task, because $A_2$ cannot answer a partially decoded question.
## Reinforcement Learning for Collaborative QA
### Component descriptions
**Action** The action space of $A_1$ consists of the token vocabulary $V$ plus all nodes in $K_1$ (for copying a node value). At timestep $t$, $A_1$ takes an action $Q'_t$, which becomes the $t$-th token of $Q'$. Let the maximum length of $Q'$ be $T$. We then sample the remaining tokens $Q'_{t+1:T}$ from a roll-out policy $A_1^r$ using MC search. At each timestep, we sample $N$ different completions of $Q'$:
$$ \{ Q'^{1},..., Q'^{N} \}= MC^{A_1^r}(Q'_{1:t}; N) $$
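To make the MC search step concrete, here is a minimal Python sketch; `rollout_policy` and its `sample_next_token` method are hypothetical placeholders standing in for the roll-out policy $A_1^r$:

```python
def mc_rollouts(prefix_tokens, rollout_policy, max_len, n_rollouts):
    """Complete a partial question Q'_{1:t} into N full questions
    via Monte Carlo search with the roll-out policy A_1^r."""
    completions = []
    for _ in range(n_rollouts):
        tokens = list(prefix_tokens)  # keep the already-decoded prefix Q'_{1:t}
        while len(tokens) < max_len:
            # sample_next_token is a hypothetical method: given the partial
            # question, it samples the next token from the roll-out policy.
            next_token = rollout_policy.sample_next_token(tokens)
            if next_token == "<eos>":
                break
            tokens.append(next_token)
        completions.append(tokens)
    return completions
```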
**State** At timestep $t$, the state of $A_1$ consists of the original question $Q$, its knowledge $K_1$, and the previously decoded tokens $Q'_{1:t-1}$, i.e. $s_t = [ Q, K_1, Q'_{1:t-1} ]$.
**Policy** With a slight abuse of notation, we denote the policy of $A_1$ by the agent's name, $A_1(Q'_t|s_t)$. The policy is stochastic.
**Environment** The discriminator $D$ and the second agent $A_2$ form the environment for $A_1$:
- **Discriminator** The discriminator $D$ takes a natural language question as input and outputs a binary classification indicating whether the question is real or fake. We denote the probability that a question $Q'$ is real as $p_D (True|Q')$.
- **$A_2$** The answer given by $A_2$ is a node $v$ selected from its knowledge graph $K_2$. Specifically, $A_2$ outputs a probability distribution over the nodes of $K_2$. We denote the *ground-truth* node by $v_{gt}$, and the probability of selecting it by $p_{A_2}(v_{gt}|Q', K_2)$.
**Reward** There are two sources of reward: 1) the score given by $D$, and 2) the answering correctness of $A_2$:
$$
\begin{aligned}
&r_t = \frac{1}{N} \sum_{n=1}^{N} \left( r_{A_2}(Q'^{n}) + \beta \cdot r_{D}(Q'^{n}) \right), Q'^n \in MC^{A_1^r}(Q'_{1:t}; N) \\
&r_{A_2}(Q'^{n}) = p_{A_2}(v_{gt}|Q'^{n}, K_2) \\
&r_{D}(Q'^{n}) = p_D (True|Q'^n) \\
\end{aligned}
$$
where $\beta$ is a hyper-parameter.
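For illustration, the reward $r_t$ can then be computed from the $N$ completions as follows; `answer_prob` and `real_prob` are hypothetical callables standing in for $p_{A_2}(v_{gt}|\cdot, K_2)$ and $p_D(True|\cdot)$:

```python
def reward_at_t(rollouts, answer_prob, real_prob, beta):
    """Average reward over the N Monte Carlo completions of Q'_{1:t}.

    answer_prob(q) -> p_{A_2}(v_gt | q, K_2), the answering correctness of A_2
    real_prob(q)   -> p_D(True | q), the discriminator score
    """
    total = sum(answer_prob(q) + beta * real_prob(q) for q in rollouts)
    return total / len(rollouts)
```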
### Training $A_1$ via Policy Gradient
Let $\theta$ be the parameters of $A_1$. We train $A_1$ with the REINFORCE algorithm: $A_1$ must learn to generate $Q'$ so as to maximize the expected reward. Let the expected reward at timestep $t$ be $\bar{r}_t$, which we approximate with the Monte Carlo estimate $r_t$ defined above:
$$
\begin{aligned}
J(\theta) &= \sum_{t=1}^{T} \mathbb{E}_{Q'_{1:t-1} \sim A_1} \left[ \sum_{Q'_t} \bar{r}_t \cdot A_1(Q'_t|s_t) \right] \\
&\simeq \sum_{t=1}^{T} \mathbb{E}_{Q'_{1:t-1} \sim A_1} \left[ \sum_{Q'_t} r_t \cdot A_1(Q'_t|s_t) \right] \\
\end{aligned}
$$
We can then estimate the gradient of $J(\theta)$ using the likelihood-ratio trick:
$$
\begin{aligned}
\bigtriangledown_{\theta} J(\theta) &\simeq \sum_{t=1}^{T} \mathbb{E}_{Q'_{1:t-1} \sim A_1} \left[ \sum_{Q'_t} r_t \cdot \bigtriangledown_{\theta}A_1(Q'_t|s_t) \right] \\
&= \sum_{t=1}^{T} \mathbb{E}_{Q'_{1:t-1} \sim A_1} \left[ \sum_{Q'_t} A_1(Q'_t|s_t) \cdot r_t \cdot \bigtriangledown_{\theta}\log A_1(Q'_t|s_t) \right] \\
&= \sum_{t=1}^{T} \mathbb{E}_{Q'_t \sim A_1(Q'_t|s_t)} [ r_t \cdot \bigtriangledown_{\theta}\log A_1(Q'_t|s_t)] \\
\end{aligned}
$$
Then we use gradient ascent to update $\theta$:
$$ \theta \gets \theta + \alpha \cdot \bigtriangledown_{\theta}J(\theta) $$
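A minimal PyTorch-style sketch of this update is shown below, assuming the sampled per-token log-probabilities $\log A_1(Q'_t|s_t)$ and the Monte Carlo reward estimates $r_t$ have already been collected; all names are illustrative:

```python
import torch

def reinforce_update(log_probs, rewards, optimizer):
    """One REINFORCE step on J(theta).

    log_probs: list of scalar tensors, log A_1(Q'_t | s_t) for the sampled tokens
    rewards:   list of floats, the Monte Carlo reward estimates r_t
    """
    # Minimizing the negated objective with the optimizer is equivalent to
    # gradient ascent on J(theta).
    loss = -torch.stack([lp * r for lp, r in zip(log_probs, rewards)]).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```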
### Training the discriminator $D$
Let $\phi$ be the parameters of $D$. The objective function of $D$ is:
$$ J(\phi) = - \mathbb{E}_{Q' \sim p_{data}}[\log D(Q')] - \mathbb{E}_{Q' \sim A_1}[\log (1-D(Q'))] $$
Then we use gradient descent to update $\phi$:
$$ \phi \gets \phi - \alpha \cdot \bigtriangledown_{\phi}J(\phi) $$
The true data for training $D$ are synthesized using templates from the bAbI dataset. For example:
- Where is John ?
- Where is John located?
- What is the location of John?
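As a sketch of one discriminator update under this objective, assuming `D` maps a batch of already-encoded questions to probabilities $p_D(True|Q')$ (the encoding step is omitted):

```python
import torch
import torch.nn.functional as F

def discriminator_step(D, real_questions, fake_questions, optimizer):
    """One gradient-descent step on J(phi).

    real_questions: batch of template-synthesized questions (already encoded)
    fake_questions: batch of questions sampled from A_1 (already encoded)
    D(x) is assumed to return p_D(True | x) for each question in the batch.
    """
    p_real = D(real_questions)   # should be pushed towards 1
    p_fake = D(fake_questions)   # should be pushed towards 0
    loss = (F.binary_cross_entropy(p_real, torch.ones_like(p_real))
            + F.binary_cross_entropy(p_fake, torch.zeros_like(p_fake)))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```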
### Training $A_2$
Let $\eta$ be the parameters of $A_2$, and let $V_2$ be the set of nodes in $K_2$. $A_2$ assigns a probability distribution $P_{\eta}(V_2|Q', K_2)$ over the nodes in $K_2$. The objective function for training $A_2$ is the *cross-entropy* loss:
$$ J(\eta) = CrossEnt(P_{\eta}(V_2|Q', K_2), \mathbb{1}_{v_{gt}}) $$
$\mathbb{1}_{v_{gt}}$ is a one-hot vector whose entry at the index of $v_{gt}$ is 1 and whose other entries are all zero. We use gradient descent to update $\eta$:
$$ \eta \gets \eta -\alpha \cdot \bigtriangledown_{\eta}J(\eta) $$
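A corresponding sketch for the $A_2$ update, assuming $A_2$ produces unnormalized scores (logits) over the nodes $V_2$ and `gt_index` is the index of $v_{gt}$:

```python
import torch
import torch.nn.functional as F

def a2_step(node_logits, gt_index, optimizer):
    """One gradient-descent step on J(eta).

    node_logits: tensor of shape (|V_2|,), unnormalized scores over nodes of K_2
    gt_index:    integer index of the ground-truth node v_gt
    """
    # F.cross_entropy applies the softmax internally, so this matches
    # CrossEnt(P_eta(V_2 | Q', K_2), 1_{v_gt}).
    loss = F.cross_entropy(node_logits.unsqueeze(0), torch.tensor([gt_index]))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```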
### The Training Procedure
