Structure Prediction as Sequence Generation with Arbitrary Order
=
## Task Definition
Treating structure prediction as a sequence generation problem is widely used in machine learning, for example, generating an action sequence that builds a tree or graph structure. A typical use case is converting a natural language sentence into structured data such as semantic tuples or a parse tree.
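As a concrete illustration of why the target is order-invariant, consider a hypothetical relation-extraction example (the sentence and tuples below are invented for illustration, not taken from the text): the target structure is a *set* of tuples, so any serialization order is an equally valid generation target.

```python
# Hypothetical example: the extracted relation tuples form a set, so any
# ordering of them is a valid target "action sequence".
tuples = {
    ("Marie Curie", "born_in", "Warsaw"),
    ("Marie Curie", "worked_in", "Paris"),
}

# Two different generation orders that decode to the same structure.
order_a = [("Marie Curie", "born_in", "Warsaw"),
           ("Marie Curie", "worked_in", "Paris")]
order_b = list(reversed(order_a))

# Both sequences are correct ground truths for the same target set.
assert set(order_a) == set(order_b) == tuples
```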
## Motivation
Although the action sequence has a clear sequential structure, the ground-truth action sequence may not be unique: multiple action sequences can encode the same target structure, and all of them should be considered ground truth. This raises a critical issue for training, since common loss functions assume a single reference sequence and cannot handle such cases. To address this, we need a loss that is invariant to the generation order while still preserving the sequential dependencies of the generation process. A straightforward approach, following the idea of multi-instance learning, is to compute one sub-loss per valid order and keep the best one as the loss. However, this computation is very heavy, since the number of valid orders grows exponentially in many real scenarios. We therefore propose an efficient approximation for computing the multi-instance learning loss over different generation orders. As a result, we can define a new structure prediction paradigm that combines arbitrary-order sequence generation with a multi-instance learning loss.
## Method
Given an input sequence $X = x_1, x_2, ..., x_n$ and a target structure $Y = y_1, y_2, ..., y_m$, the order of the target structure does not matter: all of its permutations are valid targets. Let $\pi_i(Y)$ denote the $i$-th permutation of the sequence $Y$. We can then define the multi-instance learning loss over all permutations as
\begin{gather}
S_M = \max_i S(f(X), \pi_i(Y))
\end{gather}
where $S$ is a similarity function between the model prediction $f(X)$ and one ordering of the target.
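The exact form of this loss can be sketched in a few lines. The example below is a minimal illustration, not the paper's implementation: `score` is a hypothetical toy similarity (fraction of matching positions), and the full enumeration over $m!$ permutations is exactly why this exact computation becomes infeasible for larger targets.

```python
# Exact multi-instance loss: enumerate every permutation of the target and
# keep the best-scoring one, i.e. S_M = max_i S(f(X), pi_i(Y)).
# Feasible only for small m, since the number of permutations grows as m!.
from itertools import permutations

def score(pred, order):
    # toy similarity: fraction of positions where the prediction matches
    return sum(p == t for p, t in zip(pred, order)) / len(order)

def multi_instance_loss(pred, target):
    # the loss is the negated best similarity over all orderings
    best = max(score(pred, list(perm)) for perm in permutations(target))
    return -best

pred = ["B", "A", "C"]
target = ["A", "B", "C"]                  # order-invariant ground truth
print(multi_instance_loss(pred, target))  # -1.0: one permutation matches exactly
```

Because some permutation of the target equals the prediction exactly, the loss reaches its minimum even though the prediction disagrees with the canonical ordering.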
But it's hard to enumerate $\pi$ in many real scenarios, so we introduce a latent variable to encode the order information and apply the variational inference to form a new loss,
\begin{gather}
S_M = \max_i S(f(X), \pi_i(Y)) \\
S_M \geq \sum_i P(i|X) \, S(f(X), \pi_i(Y)) \\
S_M \geq \sum_i P(i|X) \, S(f(X), \pi_i(Y)) \, \frac{q(i|X,Y)}{q(i|X,Y)} \\
S_M \geq \sum_i q(i|X,Y) \, S(f(X), \pi_i(Y)) \, \frac{P(i|X)}{q(i|X,Y)} \\
\log S_M \geq \mathbb{E}_{q(i|X,Y)} \log S(f(X), \pi_i(Y)) - \mathrm{KL}(q(i|X,Y) \,\|\, P(i|X))
\end{gather}
The last step applies Jensen's inequality ($\log$ is concave), and since $\log$ is also monotonic, it does not change the optimization direction. Both $q(i|X,Y;\theta_1)$ and $P(i|X;\theta_2)$ are learnable Gaussians, so we can train them in an end-to-end fashion. During training we sample from $q(i|X,Y)$ several times (the number of samples is a hyper-parameter), which avoids enumerating the permutations $\pi_i$; at test time we use $P(i|X)$ to decide the best generation order.
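The training objective above can be estimated by Monte Carlo sampling. The sketch below is a simplified illustration under stated assumptions: `q_probs` and `p_probs` are categorical distributions over permutation indices (standing in for the learnable Gaussian parameterization in the text), and `score` is the same hypothetical toy similarity, floored at a small epsilon so the logarithm is defined.

```python
# Monte Carlo sketch of the variational bound
#   log S_M >= E_{q(i|X,Y)}[log S(f(X), pi_i(Y))] - KL(q || P).
# Simplification: q and P are categorical over permutation indices here,
# whereas the text parameterizes them as learnable Gaussians.
import math
import random
from itertools import permutations

def score(pred, order):
    # toy similarity in (0, 1]: fraction of matching positions,
    # floored at a small epsilon so log is defined
    return max(1e-6, sum(p == t for p, t in zip(pred, order)) / len(order))

def bound_estimate(pred, target, q_probs, p_probs, num_samples=4, seed=0):
    rng = random.Random(seed)
    perms = [list(p) for p in permutations(target)]
    # sample orders from q instead of enumerating all m! permutations
    samples = rng.choices(range(len(perms)), weights=q_probs, k=num_samples)
    exp_log_s = sum(math.log(score(pred, perms[i])) for i in samples) / num_samples
    # KL(q || P) between the two categorical distributions
    kl = sum(q * math.log(q / p) for q, p in zip(q_probs, p_probs) if q > 0)
    return exp_log_s - kl

# q puts all mass on the one ordering that matches the prediction exactly;
# P is uniform over the 6 permutations, so KL(q || P) = log 6 and the
# bound evaluates to 0 - log 6.
pred = ["B", "A", "C"]
target = ["A", "B", "C"]
q = [0, 0, 1, 0, 0, 0]   # permutation index 2 is ("B", "A", "C")
p = [1 / 6] * 6
print(bound_estimate(pred, target, q, p))  # -log 6 ≈ -1.7918
```

In a real model, `q_probs` and `p_probs` would be produced by the posterior and prior networks and the whole expression backpropagated through; here the point is only the shape of the estimator: sampled expectation of the log-similarity minus the KL regularizer.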
## Related Work
- https://openreview.net/pdf?id=US-TP-xnXI
- https://arxiv.org/pdf/1606.05908.pdf
- DocRED