# Considering generation orders in sequential structure prediction

Structure prediction asks a model to generate some structured output, and *sequential* means that the generation process is step-by-step. It is natural to decompose generation into several small steps, since each step is usually easier. Sequential generation has many advantages in estimation and inference, and it also fits the structure of natural language. However, the choice of decomposition may affect overall performance because it can break dependencies between elements. We want the model to stay sequential while the decomposition breaks as few dependencies as possible.

The simplest way is to teach the model to pick good orders, i.e., orders that lead to high-quality structures, using order annotations; previous works have reported positive results with this approach. But in many cases we do not have such annotations, so we want the model to learn the order with only the structured outputs as supervision. In this case, we can introduce the generation order as a latent variable with some constraints. We do not care about the model's performance under incorrect generation orders (e.g., random orders). So we have
\begin{gather}
\max \max_\pi P(x|\pi),
\end{gather}
where $x$ is the structured output and $\pi$ is the generation order. Note that we omit the parameter $\theta$ in the first $\max$ and in $P(\cdot)$.

Let $S = \max_\pi P(x|\pi)$; then
\begin{align}
S \ge & \sum_\pi P(\pi) P(x|\pi) \\
= & \sum_\pi P(x,\pi) \frac{Q(\pi|x)}{Q(\pi|x)} \\
\log S \ge & \ \mathbb{E}_{\pi \sim Q(\pi|x)} \log P(x|\pi) - KL\big(Q(\pi|x) \,\|\, P(\pi)\big),
\end{align}
where the last step uses Jensen's inequality, and $P$, $Q$ are learnable diagonal Gaussian distributions (a code sketch of this objective is given below). Note that we need to sample different orders (uniformly) for $P(x|\pi)$ because the order affects the sequence $x=\{x_1,\cdots,x_n\}$ and the loss.

So far, we have a way to train the model without annotations of generation orders. The remaining issues are that the latent variable may not actually correspond to the generation order, and that the teaching signal from $P(x|\pi)$ is inefficient.

[gqp: I will clear this section and add some figures later.]

Defining the direction or order of a structure is difficult in many scenarios, but it is easy for many NLP applications. For example, in most information extraction tasks, the goal is to extract some text spans from the input sequence, and we do not care about the extraction order. Therefore, we can define the extraction order by the extracted spans' positions in the input text sequence. We can then introduce an additional model that encodes this extraction-order sequence into the latent space and force the latent variable to recover it (a reconstruction loss). We can also use $P(\pi)$ to sample new generation orders for the next training epoch instead of picking random orders, by taking the decoder of the autoencoder below and decoding the latent variable into a discrete sequence. There are multiple solutions; here is one:
\begin{gather}
\hat{\pi} = \{ \hat{\pi}_1, \cdots, \hat{\pi}_m \}, \quad \hat{\pi}_i \in [1, \mathrm{Len}(x)] \\
L_{ae} = CE\big(P(\hat{\pi} \mid dec(enc(\hat{\pi}))), \hat{\pi}\big) \\
L_{D} = \max \ \log P(D=1 \mid enc(\hat{\pi})) + \log P(D=0 \mid \pi) \\
L_{adv} = CE\big(P(D \mid \pi), 1\big)
\end{gather}
where $CE$ is the cross-entropy loss, $D$ is a discriminator, and $\hat{\pi}$ is a position sequence over the input text sequence. The main idea is to train an autoencoder of $\hat{\pi}$ and use the adversarial loss to link its latent space with the latent variable $\pi$.
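Going back to the ELBO above, here is a minimal PyTorch sketch of the training objective. It assumes the structure decoder is available as a callable `decoder_logp` that returns $\log P(x|\pi)$ for a batch of latent codes; `q_net`, the learnable prior parameters, and all shapes are illustrative placeholders rather than the actual model.

```python
import torch
import torch.nn as nn

class OrderELBO(nn.Module):
    """Minimal ELBO for a latent generation order modelled as a diagonal Gaussian:

        log S >= E_{pi ~ Q(pi|x)} log P(x|pi) - KL(Q(pi|x) || P(pi))
    """

    def __init__(self, enc_dim: int, latent_dim: int):
        super().__init__()
        # Q(pi|x): amortized posterior over a continuous code standing in for the order.
        self.q_net = nn.Linear(enc_dim, 2 * latent_dim)
        # P(pi): learnable diagonal Gaussian prior.
        self.prior_mu = nn.Parameter(torch.zeros(latent_dim))
        self.prior_logvar = nn.Parameter(torch.zeros(latent_dim))

    def forward(self, x_enc: torch.Tensor, decoder_logp) -> torch.Tensor:
        # Posterior parameters from the encoded input x.
        mu, logvar = self.q_net(x_enc).chunk(2, dim=-1)

        # Reparameterized sample: one order per example (more samples reduce variance).
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)

        # Reconstruction term: log P(x | pi), supplied by the structure decoder.
        recon = decoder_logp(z)                              # shape: (batch,)

        # Closed-form KL between the diagonal Gaussians Q(pi|x) and P(pi).
        var_q, var_p = logvar.exp(), self.prior_logvar.exp()
        kl = 0.5 * (
            self.prior_logvar - logvar
            + (var_q + (mu - self.prior_mu) ** 2) / var_p
            - 1.0
        ).sum(dim=-1)

        # Negative ELBO as the loss to minimize.
        return -(recon - kl).mean()
```

As noted above, the sampled order determines how $x$ is serialized, so the reordering of $x$ would happen inside `decoder_logp` before the sequential decoder scores it.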
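The three losses of the order autoencoder could look roughly like the sketch below, assuming PyTorch and hypothetical `enc`, `dec`, and `disc` modules. Positions are 0-indexed here, and the discriminator is assumed to compare the autoencoder's codes against the model's latent $\pi$ directly; none of this is the author's actual implementation.

```python
import torch
import torch.nn.functional as F

def order_autoencoder_losses(pi_hat, pi_latent, enc, dec, disc):
    """Losses linking the latent order pi to explicit position sequences pi_hat.

    pi_hat:    LongTensor (batch, m) of positions in [0, Len(x)).
    pi_latent: FloatTensor (batch, latent_dim), the model's latent order variable.
    enc/dec:   order autoencoder; dec returns logits over positions, (batch, m, Len(x)).
    disc:      discriminator; returns one logit per latent code, (batch,).
    """
    # Reconstruction: L_ae = CE(P(pi_hat | dec(enc(pi_hat))), pi_hat)
    code = enc(pi_hat)                                   # (batch, latent_dim)
    logits = dec(code)                                   # (batch, m, Len(x))
    l_ae = F.cross_entropy(logits.flatten(0, 1), pi_hat.flatten())

    # Discriminator: real = autoencoder codes, fake = the model's latent pi.
    real_logit = disc(code.detach())
    fake_logit = disc(pi_latent.detach())
    l_disc = (
        F.binary_cross_entropy_with_logits(real_logit, torch.ones_like(real_logit))
        + F.binary_cross_entropy_with_logits(fake_logit, torch.zeros_like(fake_logit))
    )

    # Adversarial: L_adv = CE(P(D|pi), 1), pushing pi into the autoencoder's latent space.
    adv_logit = disc(pi_latent)                          # gradient flows into pi_latent
    l_adv = F.binary_cross_entropy_with_logits(adv_logit, torch.ones_like(adv_logit))

    return l_ae, l_disc, l_adv
```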
After that, we can use the autoencoder's decoder to decode the latent variable $\pi$ and obtain new orders. Another solution is to infer the latent variable from $\hat{\pi}$,
\begin{gather}
\log S \ge \mathbb{E}_{\pi \sim Q(\pi|\hat{\pi})} \log P(x|\pi) - KL\big(Q(\pi|\hat{\pi}) \,\|\, P(\pi)\big),
\end{gather}
and let $Q$ be an invertible network (treating $\hat{\pi}$ as a continuous variable); then we can decode $\hat{\pi}$ from $\pi$ by picking its nearest neighbor among the valid positions.

## Related works

- [Order Matters: Probabilistic Modeling of Node Sequence for Graph Generation](https://arxiv.org/abs/2106.06189)
- [Order Matters: Sequence to sequence for sets](https://arxiv.org/abs/1511.06391)
- [Set Cross Entropy: Likelihood-based Permutation Invariant Loss Function for Probability Distributions](https://arxiv.org/abs/1812.01217)
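As a supplement to the invertible-$Q$ variant above, the decoding step reduces to a nearest-neighbor projection of the recovered continuous $\hat{\pi}$ onto valid positions. A small sketch, with `q_inverse` as a hypothetical inverse pass of the invertible network:

```python
import torch

def decode_order(pi_latent: torch.Tensor, q_inverse, seq_len: int) -> torch.Tensor:
    """Map a latent order back to a discrete position sequence.

    pi_latent: (batch, m) latent codes.
    q_inverse: inverse pass of the invertible network Q, (batch, m) -> (batch, m).
    Returns a LongTensor (batch, m) of positions in [0, seq_len).
    """
    # Continuous "positions" recovered by inverting Q.
    pi_hat_cont = q_inverse(pi_latent)                        # (batch, m)

    # Nearest neighbor among the valid positions 0 .. seq_len-1.
    candidates = torch.arange(seq_len, dtype=pi_hat_cont.dtype,
                              device=pi_hat_cont.device)      # (seq_len,)
    dist = (pi_hat_cont.unsqueeze(-1) - candidates) ** 2      # (batch, m, seq_len)
    return dist.argmin(dim=-1)                                # (batch, m)
```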