# DiffusER: Diffusion via Edit-based Reconstruction 12/19

###### tags: `Yang`

Under review as a conference paper at ICLR 2023 - Rating: 8.0, 8.0, 6.0

### 1. Introduction

Revision and editing are central to how humans produce content; we write and revise emails and papers, gradually produce works of art, and iterate on plans for a project. Despite this, the dominant paradigm in text generation is **purely autoregressive**, producing text left-to-right in a single pass. Although models employing this single-pass form of generation are highly performant, they are limited by their inability to refine existing text. To address this, we propose **DIFFUSER**: Diffusion via Edit-based Reconstruction, a flexible method for applying edit-based generative processes to arbitrary text generation tasks.

Prior work on text generation either focuses on improving the performance of standard **autoregressive (AR) models** through larger models and datasets, or on proposing new **non-autoregressive approaches** to improve general modes of text generation. DIFFUSER unifies these two perspectives by enabling edit processes to be applied to general-purpose text generation without compromising performance or requiring external supervised data. This design enables it to both generate and edit text, including externally produced content, a natural extension of the text generation paradigm. Overall, we demonstrate the potential of edit-based generative models to offer:
1) more performant generation;
2) greater interactivity between different models (as we can now perform edits in the discrete space on model-generated output);
3) more flexible/controllable generation.

![](https://i.imgur.com/6CDfn7Y.png)

### 2. DiffusER

DIFFUSER, being a diffusion-based method, has two main procedures: corruption and denoising. Both are based on Levenshtein operations, allowing the model to take advantage of the flexibility of text editing when generating.

#### 2.1 EDIT OPERATIONS

- **INSERT**: The insertion operation adds new text to a sequence. For example, in Figure 1, “uses editing processes” is added by DiffusER at timestep $x_{T-2}$.
- **DELETE**: The deletion operation erases existing text. In Figure 1, this is shown when “These” gets deleted at timestep $x_{T-2} \rightarrow x_{T-3}$.
- **REPLACE**: The replacement operation overwrites existing text with new text. This is shown in Figure 1 at step $x_{T} \rightarrow x_{T-1}$, where “filter Toronto guilty trough feel” is replaced by “These model guilty named DiffusER”.
- **KEEP**: The keep operation ensures that a portion of the text remains unchanged in the next iteration. This is illustrated at timestep $x_{T-2} \rightarrow x_{T-3}$, where “model named DiffusER” is kept.

#### 2.2 EDIT-BASED CORRUPTION

The four Levenshtein edit operations described above allow us to transform any arbitrary sequence of tokens into another. For every timestep $i$, the corruption process $q(x_i \mid x_{i-1}; \varepsilon_t, \varepsilon_l)$ is parameterized by two distributions: the distribution over **edit types** $\varepsilon_t$ (e.g. 60% keep, 20% replace, 10% delete, 10% insert) and the distribution over **edit length** $\varepsilon_l$. The latter can be parameterized by any distribution over non-negative integers, such as a uniform distribution or a Poisson distribution. For instance, to teach the model a deletion operation during reconstruction, the corruption inserts randomly sampled distractor tokens, whereas to teach an insertion operation, it deletes a subset of tokens from the sequence.
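To make the corruption step concrete, here is a minimal Python sketch of $q(x_i \mid x_{i-1}; \varepsilon_t, \varepsilon_l)$. The edit-type probabilities follow the example above, but the toy vocabulary, the span-length sampler, and the function interface are illustrative assumptions, not the paper's implementation.

```python
import random

# Minimal sketch of one corruption step q(x_i | x_{i-1}; eps_t, eps_l).
# The toy vocabulary and the span-length sampler are assumptions.
VOCAB = ["the", "a", "model", "text", "edit", "token"]

# eps_t: distribution over edit types (the example values from Sec. 2.2).
EDIT_TYPES = ["keep", "replace", "delete", "insert"]
EDIT_PROBS = [0.6, 0.2, 0.1, 0.1]

def sample_span_length():
    """eps_l: any distribution over non-negative integers works; a small
    uniform choice stands in for e.g. a Poisson distribution."""
    return random.randint(1, 3)

def corrupt_step(tokens):
    out, pos = [], 0
    while pos < len(tokens):
        op = random.choices(EDIT_TYPES, weights=EDIT_PROBS)[0]
        n = sample_span_length()
        if op == "keep":        # copy a span unchanged
            out.extend(tokens[pos:pos + n])
            pos += n
        elif op == "delete":    # drop a span, so the model must learn INSERT
            pos += n
        elif op == "replace":   # overwrite a span with random tokens
            out.extend(random.choices(VOCAB, k=n))
            pos += n
        else:                   # insert distractors, so the model must learn DELETE
            out.extend(random.choices(VOCAB, k=n))
    return out

x = "a model named DiffusER uses editing processes".split()
for _ in range(3):              # a short forward chain x_0 -> x_1 -> x_2 -> x_3
    x = corrupt_step(x)
print(x)
```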
#### 2.3 EDIT-BASED RECONSTRUCTION

Our generative process is trained via the Edit-based Reconstruction (ER) process. ER can be thought of as the reverse of the corruption process: we need to find the appropriate edit operations to transform $\mathbf{x}_T$ into $\mathbf{x}_0$, by way of $\mathbf{x}_{T-1}, \ldots, \mathbf{x}_1$. That is, given a corrupted sequence $\mathbf{x}_T$, we aim to learn the process by which we can reverse the corruption in the following form:

$$
P_\theta\left(\mathbf{x}_0\right)=\prod_{t=1}^T p_\theta\left(\mathbf{x}_{t-1} \mid \mathbf{x}_t\right)
$$

Given that we model the likelihood of each timestep $\mathbf{x}_t$, this can also be referred to as an edit process. Because the model uses Levenshtein tags for editing, one can think of ER as two distinct steps: identifying which edits should take place (the tagging process) and deciding which tokens should go in those positions (the generative process). This decomposition is shown here:

$$
p_\theta\left(\mathbf{x}_{t-1} \mid \mathbf{x}_t\right)=p_\theta^{\mathrm{tag}}\left(\mathbf{e}_t \mid \mathbf{x}_t\right)\, p_\theta^{\mathrm{gen}}\left(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{e}_t\right)
$$

where $p_\theta^{\mathrm{tag}}$ parameterises the tagging model, which estimates the likelihood of producing a given set of Levenshtein edit operations $\{$INSERT, DELETE, KEEP, REPLACE$\}$ given $\mathbf{x}_t$, and $p_\theta^{\mathrm{gen}}$ parameterises the generator model given the sequence $\mathbf{x}_t$ and the edit operations $\mathbf{e}_t$.

#### 2.4 IMPLEMENTING DIFFUSER WITH TRANSFORMERS

When implemented with Transformers, DIFFUSER consists of two components: a **tagger and a generator**. The tagger, a Transformer network, is trained with cross-entropy loss over the ground-truth tag types to predict the edit operations that should be applied to the sequence in preparation for the next generation step. Then, in the generation step, after removing the tokens selected for deletion, we add a learned embedding to the positions tagged insert or replace and generate the inserted and replaced spans autoregressively.
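This tagger/generator decomposition maps naturally to code. Below is a minimal sketch of one denoising step $p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)$, assuming hypothetical `tagger` and `generator` callables in place of the two Transformer networks; the literal `<edit>` placeholder stands in for the learned embedding, and none of this reflects the authors' actual code.

```python
# Hypothetical sketch of one denoising step, decomposed as in Sec. 2.3-2.4:
# a tagging process p_tag(e_t | x_t) followed by a generative process
# p_gen(x_{t-1} | x_t, e_t).
OPS = ("KEEP", "DELETE", "REPLACE", "INSERT")

def denoise_step(tagger, generator, x_t):
    # Tagging process: one Levenshtein op per token, trained with
    # cross-entropy against the ground-truth tags.
    e_t = tagger(x_t)                         # e.g. ["REPLACE", "KEEP", ...]

    # Build the generation template: drop DELETEd tokens, mark slots to fill.
    # In the paper, marked positions receive a learned embedding; here a
    # literal "<edit>" placeholder stands in for it.
    template = []
    for tok, op in zip(x_t, e_t):
        if op == "DELETE":
            continue                          # token is removed outright
        if op == "REPLACE":
            template.append("<edit>")         # slot overwrites the token
        elif op == "INSERT":
            template.extend([tok, "<edit>"])  # new slot opens after the token
        else:                                 # KEEP
            template.append(tok)

    # Generative process: the real model fills the marked slots
    # autoregressively; our stand-in fills them all at once.
    return generator(template)

# Toy stand-ins so the sketch runs end to end:
x_t = ["filter", "model", "guilty", "named", "DiffusER"]
tagger = lambda toks: ["REPLACE", "KEEP", "DELETE", "KEEP", "KEEP"]
generator = lambda tpl: ["These" if t == "<edit>" else t for t in tpl]
print(denoise_step(tagger, generator, x_t))
# -> ['These', 'model', 'named', 'DiffusER']
```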
#### 2.5 DECODING METHODS

DIFFUSER has an inherently different generation process from a standard autoregressive language generation model: in addition to operating on a **sequence/token level**, it also operates on a **revision level**. This allows us to experiment with decoding methods at both the intra-revision level (single sequence) and the inter-revision level (multiple versions), explained below. A toy sketch of the 2D variant appears at the end of this note.

- **Beam Search**: One method is to perform beam search over $b$ hypotheses at every step on the output of the autoregressive generator (intra-revision level), while decoding greedily at the inter-revision level. Although conceptually straightforward, this method does not search over the inter-revision space, despite revisions being a key component of our approach.
- **2D Beam Search**: We propose 2D beam search, which extends beam search as applied to token-level autoregressive generative models by searching with both an intra-revision beam width of $b$ and an inter-revision beam width of $r$. This allows us to search at the inter-revision level, which we find results in better downstream performance, but increases the beam count to $r \times b$ beams.
- **Nucleus Sampling**: To improve the diversity of generations, we also consider a nucleus-sampling-based approach: at every timestep $x_t$, we use nucleus sampling with $p = 0.6$ to sample each token autoregressively at the intra-revision level, and decode greedily at the inter-revision level.

#### 2.6 DECODER INITIALIZATION TECHNIQUES

![](https://i.imgur.com/6l8SdRS.png)

- **Null Sequence**: We initialize DIFFUSER with a null string, in which case the first edit is constrained to be an insertion.
- **Random Tokens**: We initialize DIFFUSER with a sequence of random tokens, which the model then learns to edit.
- **AR Bootstrap**: We bootstrap the reverse diffusion process by taking the output of DIFFUSER constrained to generate autoregressively (essentially mimicking a standard autoregressive generator). We then use DIFFUSER to further edit this output.
- **Source Bootstrap**: In a sequence-to-sequence setting, we can also bootstrap from the source text by setting $x_T$ to be equal to the source $s$. As we show in later sections, this is particularly useful in tasks such as summarization, in which the output can be naturally formulated as an edited version of the input.

### 3. EXPERIMENTS

#### 3.1 MODELS

- **DIFFUSER**: We instantiate DIFFUSER with two separate Transformer models for the tagger and the generator. We use the Transformer-base encoder-decoder architecture (Vaswani et al., 2017) with 6 layers, a hidden dimension of 512, a feedforward dimension of 2048, 8 attention heads, and dropout $p = 0.3$.
- **Baselines (MT & Summ)**: We use several Transformer baselines from prior literature across our tasks. We include a conventional 6-layer encoder-decoder Transformer (Vaswani et al., 2017), as well as models from the non-autoregressive generation literature: Levenshtein Transformer (Gu et al., 2019), CMLM (Ghazvininejad et al., 2019), DisCo (Kasai et al., 2020a), Imputer (Saharia et al., 2020), and SUNDAE (Savinov et al., 2022).

#### 3.2 RESULTS

![](https://i.imgur.com/glQ54FT.png)

#### 3.3 ANALYSIS

- Decoding method ablation
![](https://i.imgur.com/DzcubkB.png)
- Time comparison between decoding methods / number of edit steps versus performance
![](https://i.imgur.com/03WGvYC.png)
- How does the text change at every step?
![](https://i.imgur.com/PZ8oAu7.png)

### 4. CONCLUSIONS

We proposed DIFFUSER, a diffusion-based generative model for text that operates via edits. DIFFUSER shows improvements across the tasks considered (machine translation, summarization, style transfer), offers greater generative flexibility via incremental text improvement, and is compatible with standard autoregressive models. We hope that DIFFUSER will spur research on edit-based generative models, with further potential directions including leveraging edits to ensemble models (regardless of parameter count) in the discrete space.
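As referenced in Section 2.5, here is a toy sketch of 2D beam search. The `expand` and `score` callables are hypothetical stand-ins for the generator's candidate revisions (produced by token-level beam search in the real model) and the model's scoring function; only the two-level search mechanics are faithful to the description above.

```python
import heapq

def beam_search_2d(x_T, expand, score, steps, r=2, b=3):
    """Keep r revision-level hypotheses; expand each into b candidate
    revisions per step, scoring all r * b beams (Sec. 2.5)."""
    revisions = [x_T]                           # inter-revision beam
    for _ in range(steps):
        candidates = []
        for seq in revisions:
            candidates.extend(expand(seq, b))   # intra-revision candidates
        # Inter-revision level: keep the r best revisions overall.
        revisions = heapq.nlargest(r, candidates, key=score)
    return max(revisions, key=score)

# Toy demo: "revisions" are strings, expansion appends one character, and the
# score just counts the letter "a" -- purely to show the search mechanics.
expand = lambda s, b: [s + c for c in "abc"[:b]]
score = lambda s: s.count("a")
print(beam_search_2d("", expand, score, steps=4))  # -> "aaaa"
```

The $r \times b$ beam count noted in Section 2.5 appears directly here as the `r * b` candidates scored at each step.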