# High-Resolution Image Synthesis with Latent Diffusion Models
###### tags:`論文翻譯` `deeplearning`
[TOC]
## 說明
排版的順序為先原文,再繁體中文,並且圖片與表格都會出現在第一次出現的段落下面
原文
繁體中文
照片或表格
:::warning
1. 個人註解,任何的翻譯不通暢部份都請留言指導
2. 為了加速閱讀,實驗與結論的相關部份會直接採用自建的反思翻譯(Phi4-14B模型所翻譯)的結果呈現,然後快速看過,語意對了就頭過身就過,多多包涵。
:::
:::danger
* [paper hyperlink](https://arxiv.org/abs/2112.10752)
* [github](https://github.com/CompVis/latent-diffusion)
:::
## Abstract
By decomposing the image formation process into a sequential application of denoising autoencoders, diffusion models (DMs) achieve state-of-the-art synthesis results on image data and beyond. Additionally, their formulation allows for a guiding mechanism to control the image generation process without retraining. However, since these models typically operate directly in pixel space, optimization of powerful DMs often consumes hundreds of GPU days and inference is expensive due to sequential evaluations. To enable DM training on limited computational resources while retaining their quality and flexibility, we apply them in the latent space of powerful pretrained autoencoders. In contrast to previous work, training diffusion models on such a representation allows for the first time to reach a near-optimal point between complexity reduction and detail preservation, greatly boosting visual fidelity. By introducing cross-attention layers into the model architecture, we turn diffusion models into powerful and flexible generators for general conditioning inputs such as text or bounding boxes and high-resolution synthesis becomes possible in a convolutional manner. Our latent diffusion models (LDMs) achieve new state-of-the-art scores for image inpainting and class-conditional image synthesis and highly competitive performance on various tasks, including text-to-image synthesis, unconditional image generation and super-resolution, while significantly reducing computational requirements compared to pixel-based DMs.
透過將影像生成過程分解為[降雜訊自動編碼器](https://terms.naer.edu.tw/detail/f3d28a3c934f812dbf1a5fc5efcc2eb8/)的連續應用,擴散模型(Diffusion Models,DMs)在影像資料及其它領域實現了最佳的合成結果。此外,這種方法允許在不重新訓練的情況下,通過引導機制來控制影像生成過程。然而,由於這些模型通常直接在像素空間中處理,強大的DMs的最佳化通常需要耗費數百個GPU days的計算資源,而且由於需要連續的計算,推理過程的成本非常高昂。為了在有限的計算資源下進行DM的訓練,同時維持其品質和靈活性,我們將它們應用於強大的預訓練自編碼器的潛在空間(latent space)中。與先前的研究相比,在這種表示上訓練擴散模型首次能夠在降低複雜度和保留細節之間達到接近最佳的平衡點上,從而大大提高了視覺[保真度](https://terms.naer.edu.tw/detail/70e7be60dcd0f879ef8cb8e0023d96f7/)。通過在模型架構中引入交叉注意力層(cross-attention layers),我們將擴散模型轉變為強大且靈活的生成器,可用於文字或邊界框等一般條件輸入,並以卷積方式實現高解析度的合成。我們的潛在擴散模型(Latent Diffusion Models,LDMs)在影像修補(image inpainting)和類別條件影像合成(class-conditional image synthesis)方面達到了新的最先進水平,並在各種任務上展現了極具競爭力的效能,包括文本到影像的合成(text-to-image synthesis)、非條件式影像生成(unconditional image generation)和超解析度(super-resolution),同時與基於像素的DMs相比,明顯降低計算需求。
## 1. Introduction
Image synthesis is one of the computer vision fields with the most spectacular recent development, but also among those with the greatest computational demands. Especially high-resolution synthesis of complex, natural scenes is presently dominated by scaling up likelihood-based models, potentially containing billions of parameters in autoregressive (AR) transformers [66,67]. In contrast, the promising results of GANs [3, 27, 40] have been revealed to be mostly confined to data with comparably limited variability as their adversarial learning procedure does not easily scale to modeling complex, multi-modal distributions. Recently, diffusion models [82], which are built from a hierarchy of denoising autoencoders, have shown to achieve impressive results in image synthesis [30,85] and beyond [7,45,48,57], and define the state-of-the-art in class-conditional image synthesis [15,31] and super-resolution [72]. Moreover, even unconditional DMs can readily be applied to tasks such as inpainting and colorization [85] or stroke-based synthesis [53], in contrast to other types of generative models [19,46,69]. Being likelihood-based models, they do not exhibit mode-collapse and training instabilities as GANs and, by heavily exploiting parameter sharing, they can model highly complex distributions of natural images without involving billions of parameters as in AR models [67].
影像合成是近年來發展最引人注目的電腦視覺領域之一,同時也是計算資源需求最高的領域之一。特別是複雜的自然場景的高解析度合成,目前主要由scaling up likelihood-based models(擴大基於似然估計的模型)主導,這些模型在autoregressive (AR) transformers [66,67]中可能包含數十億個參數。相較之下,生成對抗網絡(GAN)[3, 27, 40] 的優異成果已被證實主要侷限於變異性相對有限的數據,因為其對抗性學習過程不易擴展至建構複雜的多模態分佈。近來,由降雜訊自動編碼器層次構建的擴散模型 [82],在影像合成 [30,85] 及其它領域 [7,45,48,57] 中展現了令人印象深刻的成果,並在類別條件式影像合成 [15,31] 和超解析度 [72] 領域達到最新的技術水準。此外,對比其它類型的生成模型 [19,46,69],即使是非條件式擴散模型(unconditional DMs)也能輕鬆應用於修補和上色 [85] 或stroke-based的合成 [53] 等任務上。作為基於似然估計的模型,擴散模型並不會如GAN那般出現模式崩潰和訓練不穩定的問題,並且透過大量利用參數共享,不需要像AR模型 [67] 那樣涉及數十億個參數,就可以對自然影像的高度複雜分佈進行建模。
**Democratizing High-Resolution Image Synthesis** DMs belong to the class of likelihood-based models, whose mode-covering behavior makes them prone to spend excessive amounts of capacity (and thus compute resources) on modeling imperceptible details of the data [16, 73]. Although the reweighted variational objective [30] aims to address this by undersampling the initial denoising steps, DMs are still computationally demanding, since training and evaluating such a model requires repeated function evaluations (and gradient computations) in the high-dimensional space of RGB images. As an example, training the most powerful DMs often takes hundreds of GPU days (e.g. 150 - 1000 V100 days in [15]) and repeated evaluations on a noisy version of the input space render also inference expensive, so that producing 50k samples takes approximately 5 days [15] on a single A100 GPU. This has two consequences for the research community and users in general: Firstly, training such a model requires massive computational resources only available to a small fraction of the field, and leaves a huge carbon footprint [65, 86]. Secondly, evaluating an already trained model is also expensive in time and memory, since the same model architecture must run sequentially for a large number of steps (e.g. 25 - 1000 steps in [15]).
**Democratizing High-Resolution Image Synthesis** 擴散模型(Diffusion Models,DMs)屬於基於似然的模型類別,其模式涵蓋的行為使其容易花費過多的容量(從而增加計算資源的需求)來對資料中難以察覺的細節進行建模 [16, 73]。儘管重新加權的變分目標 [30] 旨在透過對初始去噪步驟進行[低抽樣](https://terms.naer.edu.tw/detail/427d1e8a4a621e7698839ba13c1ff9c2/)(undersampling)來解決這一問題,但DMs在計算上的要求仍然很高,因為訓練和評估這樣的模型需要在RGB影像的高維空間中重複的進行函數評估(以及梯度計算)。舉例來說,訓練最強大的DMs通常需要數百個GPU days(像是 [15] 中的150至1000個V100 days),在輸入空間的雜訊版本(noisy version)上重複評估的推理成本也是很高,以致於產生5萬個樣本大約需5天(在單一張A100 GPU上) [15]。這對研究界和一般使用者會有兩個後果:首先,訓練這樣的模型需要大量的計算資源,而這些資源只有該領域的一小部分人才能獲得,並且會留下巨大的碳足跡 [65, 86]。其次,評估已經訓練過的模型在時間和記憶體上也會非常昂貴,因為相同的模型架構必須連續執行大量的步驟(像是 [15] 中就需要25 - 1000個步驟)。
To increase the accessibility of this powerful model class and at the same time reduce its significant resource consumption, a method is needed that reduces the computational complexity for both training and sampling. Reducing the computational demands of DMs without impairing their performance is, therefore, key to enhance their accessibility.
為了提高這種強大模型類別的[可達性](https://terms.naer.edu.tw/detail/0d6069b0286e331f35041dbdffb739ca/),同時減少其大量的資源消耗,我們需要一種方法來降低訓練與採樣的計算複雜度。因此,在不損害擴散模型(DMs)效能的情況下減少其運算需求,是提高其可達性的關鍵。
**Departure to Latent Space** Our approach starts with the analysis of already trained diffusion models in pixel space: Fig. 2 shows the rate-distortion trade-off of a trained model. As with any likelihood-based model, learning can be roughly divided into two stages: First is a perceptual compression stage which removes high-frequency details but still learns little semantic variation. In the second stage, the actual generative model learns the semantic and conceptual composition of the data (semantic compression). We thus aim to first find a perceptually equivalent, but computationally more suitable space, in which we will train diffusion models for high-resolution image synthesis.
**Departure to Latent Space** 我們的方法從分析像素空間中已經訓練過的擴散模型開始:Fig. 2顯示了訓練模型的資料率失真(rate-distortion)的權衡。與任何基於似然的模型一樣,學習大致可以分為兩個階段:第一階段是感知壓縮階段,它會去除高頻細節,但仍然學習很少的語義變化。第二階段的話,實際的生成模型學習資料的語意和脈絡的組成(語意壓縮)。因此,我們的目標是先找到一個感知上等價但計算上更合適的空間,在其中我們將訓練用於高解析度影像合成的擴散模型。

Figure 2. Illustrating perceptual and semantic compression: Most bits of a digital image correspond to imperceptible details. While DMs allow to suppress this semantically meaningless information by minimizing the responsible loss term, gradients (during training) and the neural network backbone (training and inference) still need to be evaluated on all pixels, leading to superfluous computations and unnecessarily expensive optimization and inference. We propose latent diffusion models (LDMs) as an effective generative model and a separate mild compression stage that only eliminates imperceptible details. Data and images from [30].
:::warning
>[name=Felo.ai]
影像中的高頻細節是指圖像中灰度或顏色變化劇烈的部分,通常與圖像的邊緣、紋理和細節有關。這些高頻成分在圖像的頻域表示中對應於高頻信號,反映了圖像中像素值的快速變化。
高頻細節的特徵:
* 邊緣和輪廓:高頻細節通常出現在物體的邊緣和輪廓處,這些地方的灰度值變化非常明顯。例如,人的臉部輪廓、物體的邊緣等都是高頻細節的典型例子。
* 紋理和細節:高頻部分還包括圖像中的細微紋理和其他細節,如皮膚的質感、衣物的紋理等。這些細節使得圖像看起來更加清晰和真實。
* 噪聲:在許多情況下,圖像中的噪聲也屬於高頻成分,因為噪聲通常會導致像素值的快速變化,這使得它們在頻域中表現為高頻信號。
:::
Following common practice [11, 23, 66, 67, 96], we separate training into two distinct phases: First, we train an autoencoder which provides a lower-dimensional (and thereby efficient) representational space which is perceptually equivalent to the data space. Importantly, and in contrast to previous work [23,66], we do not need to rely on excessive spatial compression, as we train DMs in the learned latent space, which exhibits better scaling properties with respect to the spatial dimensionality. The reduced complexity also provides efficient image generation from the latent space with a single network pass. We dub the resulting model class Latent Diffusion Models (LDMs).
依照慣例[11、23、66、67、96],我們將訓練分為兩個不同的階段:首先,我們訓練一個自動編碼器(autoencoder),它提供一個低維度(從而更有效率)的表示空間,其於感知上等價於資料空間。重要的是,相較於先前的研究[23,66],我們不需要依賴過度的空間壓縮,因為我們在學習到的潛在空間中訓練DMs,這在空間維數方面表現出更好的縮放特性。降低的複雜度還可以透過單次的網路傳播從潛在空間中高效地生成影像。我們將得到的模型類稱為潛在擴散模型(Latent Diffusion Models (LDMs))。
:::warning
字面意思來看,似乎指的是利用autoencoder學習到的Bottleneck Layers來做為擴散模型的學習依據。
:::
A notable advantage of this approach is that we need to train the universal autoencoding stage only once and can therefore reuse it for multiple DM trainings or to explore possibly completely different tasks [81]. This enables efficient exploration of a large number of diffusion models for various image-to-image and text-to-image tasks. For the latter, we design an architecture that connects transformers to the DM’s UNet backbone [71] and enables arbitrary types of token-based conditioning mechanisms, see Sec. 3.3.
這種方法的一個明顯優點就是,我們只需要訓練一次通用的自動編碼階段,因此可以將其重複用於多次的DM訓練,或用來探索可能完全不同的任務 [81]。這使得能夠有效地探索各種image-to-image和text-to-image任務的擴散模型。對於後者,我們設計了一個架構,將transformers連接到DM的UNet主幹 [71],並支援任意類型的基於token的條件機制,詳見Sec. 3.3。
In sum, our work makes the following **contributions**:
1. In contrast to purely transformer-based approaches [23, 66], our method scales more gracefully to higher dimensional data and can thus (a) work on a compression level which provides more faithful and detailed reconstructions than previous work (see Fig. 1) and (b) can be efficiently applied to high-resolution synthesis of megapixel images.
1. 與單純基於Transformer的方法[23, 66]相比,我們的方法可以更優雅地擴展到高維度的資料空間中,因此可以(a)在[壓縮等級](https://terms.naer.edu.tw/detail/59479d621a87f270a64e2fd7d7e64986/)上工作,提供比以往研究更真實且更詳細的重建(見Fig. 1),且(b)有效地應用於百萬像素影像的高解析度合成。

Figure 1. Boosting the upper bound on achievable quality with less aggressive downsampling. Since diffusion models offer excellent inductive biases for spatial data, we do not need the heavy spatial downsampling of related generative models in latent space, but can still greatly reduce the dimensionality of the data via suitable autoencoding models, see Sec. 3. Images are from the DIV2K [1] validation set, evaluated at 512^2^ px. We denote the spatial downsampling factor by $f$. Reconstruction FIDs [29] and PSNR are calculated on ImageNet-val. [12]; see also Tab. 8.
2. We achieve competitive performance on multiple tasks (unconditional image synthesis, inpainting, stochastic super-resolution) and datasets while significantly lowering computational costs. Compared to pixel-based diffusion approaches, we also significantly decrease inference costs.
2. 我們在多項任務(非條件式影像合成、修補、隨機超解析度)和資料集上實現了具有競爭力的效能,同時顯著地降低了計算成本。與基於像素的擴散方法相比,我們也明顯降低推論成本。
3. We show that, in contrast to previous work [93] which learns both an encoder/decoder architecture and a score-based prior simultaneously, our approach does not require a delicate weighting of reconstruction and generative abilities. This ensures extremely faithful reconstructions and requires very little regularization of the latent space.
3. 我們的結果說明著,相較於先前的研究[93],[93]同時學習編碼器/解碼器架構和基於分數的先驗,我們的方法不需要對重建和生成能力進行精細的加權。這確保了極其忠實的重建,並且幾乎不需要對潛在空間進行正規化。
4. We find that for densely conditioned tasks such as super-resolution, inpainting and semantic synthesis, our model can be applied in a convolutional fashion and render large, consistent images of ∼ 1024^2^ px
4. 我們發現,對於超解析度、修補和語義合成等密集條件任務,我們的模型可以以卷積方式來應用,並渲染約1024^2^像素且大而一致的影像
5. Moreover, we design a general-purpose conditioning mechanism based on cross-attention, enabling multi-modal training. We use it to train class-conditional, text-to-image and layout-to-image models
5. 此外,我們設計了一種基於交叉注意力(cross-attention)的通用條件機制,實現多模態模型的訓練。我們用它來訓練類別條件、text-to-image和layout-to-image的模型
6. Finally, we release pretrained latent diffusion and autoencoding models at https://github.com/CompVis/latent-diffusion which might be reusable for various tasks besides training of DMs [81].
6. 最後,我們在 https://github.com/CompVis/latent-diffusion 上發布預訓練的潛在擴散與自動編碼模型,這些模型除了可以用於擴散模型的訓練之外,還可重複用於各種任務[81]。
## 2. Related Work
**Generative Models for Image Synthesis** The high dimensional nature of images presents distinct challenges to generative modeling. Generative Adversarial Networks (GAN) [27] allow for efficient sampling of high resolution images with good perceptual quality [3, 42], but are difficult to optimize [2, 28, 54] and struggle to capture the full data distribution [55]. In contrast, likelihood-based methods emphasize good density estimation which renders optimization more well-behaved. Variational autoencoders (VAE) [46] and flow-based models [18, 19] enable efficient synthesis of high resolution images [9, 44, 92], but sample quality is not on par with GANs. While autoregressive models (ARM) [6, 10, 94, 95] achieve strong performance in density estimation, computationally demanding architectures [97] and a sequential sampling process limit them to low resolution images. Because pixel based representations of images contain barely perceptible, high-frequency details [16,73], maximum-likelihood training spends a disproportionate amount of capacity on modeling them, resulting in long training times. To scale to higher resolutions, several two-stage approaches [23,67,101,103] use ARMs to model a compressed latent image space instead of raw pixels.
**Generative Models for Image Synthesis** 影像的高維度特性為生成模型帶來了獨特的挑戰。生成對抗網路(GAN)[27] 能夠有效地生成具有良好的感知品質的高解析度影像 [3, 42],但在最佳化方面有著一定的困難 [2, 28, 54],而且難以全面捕捉到資料的分佈 [55]。相較之下,基於似然的方法強調良好的密度估計,使其最佳化過程更為良態。變分自編碼器(Variational autoencoders (VAE))[46] 和基於流(flow-based)的模型 [18, 19] 能夠高效地合成高解析度影像 [9, 44, 92],但其樣本品質沒辦法跟GAN相比。雖然自回歸模型(autoregressive models (ARM))[6, 10, 94, 95] 在密度估計方面表現出色,但其計算架構複雜 [97],且序列採樣過程將其限制在低解析度影像上。由於基於像素的影像表示包含幾乎無法察覺的高頻細節 [16, 73],最大似然的訓練方式會耗費大量的資源來建構這些細節,導致訓練時間過長。為了擴展至更高的解析度,多種兩階段的方法 [23, 67, 101, 103] 就使用 ARMs(Autoregressive Models) 來針對壓縮的潛在影像空間建模,而非直接處理原始像素。
Recently, **Diffusion Probabilistic Models (DM)** [82], have achieved state-of-the-art results in density estimation [45] as well as in sample quality [15]. The generative power of these models stems from a natural fit to the inductive biases of image-like data when their underlying neural backbone is implemented as a UNet [15, 30, 71, 85]. The best synthesis quality is usually achieved when a reweighted objective [30] is used for training. In this case, the DM corresponds to a lossy compressor and allows trading image quality for compression capabilities. Evaluating and optimizing these models in pixel space, however, has the downside of low inference speed and very high training costs. While the former can be partially addressed by advanced sampling strategies [47, 75, 84] and hierarchical approaches [31, 93], training on high-resolution image data always requires calculating expensive gradients. We address both drawbacks with our proposed LDMs, which work on a compressed latent space of lower dimensionality. This renders training computationally cheaper and speeds up inference with almost no reduction in synthesis quality (see Fig. 1).
近來,擴散機率模型(Diffusion Probabilistic Models, DM)[82] 在密度估計 [45] 和樣本品質 [15] 方面達到了最先進的成果。當這些模型的底層神經網路骨幹採用 UNet [15, 30, 71, 85] 時,其模型的生成能力就源於與影像類型資料的歸納偏差的自然契合。通常,使用reweighted objective [30] 進行訓練時,可以得到最佳的合成品質。在這種情況下,DM就相當於是一種有損的壓縮器,允許在影像品質與壓縮能力之間進行權衡。然而,在像素空間中評估和最佳化這些模型,存在推理速度慢且訓練成本高的缺點。儘管先進的採樣策略 [47, 75, 84] 和分層方法 [31, 93] 能部分緩解推理速度問題,不過,在高解析度影像資料上的訓練仍需計算昂貴的梯度。我們所提出的LDMs(Latent Diffusion Models)能夠解決這兩個缺點,因為它是在低維度壓縮過的潛在空間中運作。這使得訓練在計算上更為便宜,並加快推論的速度,而且合成品質幾乎沒有降低(見Fig. 1)。
:::warning
> [name=GPT]
U-Net 的架構設計特別適合處理具有空間結構的資料,因其結合了編碼器和解碼器的對稱結構,以及跳躍連接(skip connections)。編碼器逐步提取影像的抽象特徵,解碼器則逐步恢復影像的空間解析度。跳躍連接將編碼器中相應層的特徵直接傳遞給解碼器,確保高解析度的特徵在重建過程中得以保留,從而有效捕捉影像的空間結構和細節資訊。
:::
**Two-Stage Image Synthesis** To mitigate the shortcomings of individual generative approaches, a lot of research [11, 23, 67, 70, 101, 103] has gone into combining the strengths of different methods into more efficient and performant models via a two stage approach. VQ-VAEs [67, 101] use autoregressive models to learn an expressive prior over a discretized latent space. [66] extend this approach to text-to-image generation by learning a joint distribution over discretized image and text representations. More generally, [70] uses conditionally invertible networks to provide a generic transfer between latent spaces of diverse domains. Different from VQ-VAEs, VQGANs [23, 103] employ a first stage with an adversarial and perceptual objective to scale autoregressive transformers to larger images. However, the high compression rates required for feasible ARM training, which introduces billions of trainable parameters [23, 66], limit the overall performance of such approaches and less compression comes at the price of high computational cost [23, 66]. Our work prevents such tradeoffs, as our proposed LDMs scale more gently to higher dimensional latent spaces due to their convolutional backbone. Thus, we are free to choose the level of compression which optimally mediates between learning a powerful first stage, without leaving too much perceptual compression up to the generative diffusion model while guaranteeing high-fidelity reconstructions (see Fig. 1).
**Two-Stage Image Synthesis** 為了彌補個別生成方法的缺點,許多研究[11, 23, 67, 70, 101, 103]致力於透過兩階段方法,結合不同方法的優點來形成更高效且效能更好的模型。VQ-VAEs[67, 101]使用自回歸模型在離散的潛在空間中學習具表達力的先驗(expressive prior)。[66]將這個方法擴展至文本到影像(text-to-image)的生成,透過在離散的影像和文本表示空間中學習聯合分佈。更一般地,[70]使用條件式可逆的網路在不同領域的潛在空間之間提供通用的轉換。與VQ-VAE不同,VQGAN[23, 103]在第一階段採用對抗性與感知目標,將autoregressive transformers擴展至更大的影像。然而,為了進行可行的自回歸模型訓練,需要高壓縮率,這引入了數十億個可訓練參數[23, 66],限制了這類方法的整體效能,而且降低壓縮率就會增加計算成本[23, 66]。我們的研究避開了這種權衡,因為我們所提出的LDMs由於其卷積主幹能更平滑地擴展至更高維度的潛在空間。因此,我們可以自由選擇壓縮的等級(level),最佳地平衡第一階段的學習,避免將過多的感知壓縮留給生成擴散模型,同時保證高保真度的重建(見Fig. 1)。
While approaches to jointly [93] or separately [80] learn an encoding/decoding model together with a score-based prior exist, the former still require a difficult weighting between reconstruction and generative capabilities [11] and are outperformed by our approach (Sec. 4), and the latter focus on highly structured images such as human faces.
雖然已有方法嘗試聯合[93]或分別[80]學習編碼/解碼模型(與score-based prior),不過前者仍然需要在重建與生成能力之間進行艱難的權衡[11],且其表現不及我們的方法(見Sec. 4);後者則是專注於如人臉等高度結構化的影像。
## 3. Method
To lower the computational demands of training diffusion models towards high-resolution image synthesis, we observe that although diffusion models allow to ignore perceptually irrelevant details by undersampling the corresponding loss terms [30], they still require costly function evaluations in pixel space, which causes huge demands in computation time and energy resources.
為了降低訓練擴散模型以實現高解析度影像合成的計算需求,我們觀察到,儘管擴散模型允許透過對相應的損失項進行[低抽樣](https://terms.naer.edu.tw/detail/427d1e8a4a621e7698839ba13c1ff9c2/)(undersampling)來忽略感知上不相關的細節[30],但它們仍然需要在像素空間中進行昂貴的函數評估,這導致了對計算時間和能源資源的需求龐大。
We propose to circumvent this drawback by introducing an explicit separation of the compressive from the generative learning phase (see Fig. 2). To achieve this, we utilize an autoencoding model which learns a space that is perceptually equivalent to the image space, but offers significantly reduced computational complexity.
我們提出透過將壓縮學習階段與生成學習階段明確分開來規避這個缺點(見Fig. 2)。為了實現這一點,我們利用一個自編碼模型來學習一個在感知上等價於影像空間、但計算複雜度明顯降低的空間。
Such an approach offers several advantages: (i) By leaving the high-dimensional image space, we obtain DMs which are computationally much more efficient because sampling is performed on a low-dimensional space. (ii) We exploit the inductive bias of DMs inherited from their UNet architecture [71], which makes them particularly effective for data with spatial structure and therefore alleviates the need for aggressive, quality-reducing compression levels as required by previous approaches [23, 66]. (iii) Finally, we obtain general-purpose compression models whose latent space can be used to train multiple generative models and which can also be utilized for other downstream applications such as single-image CLIP-guided synthesis [25].
這類方法有幾個優點:(i)透過離開高維度影像空間,我們獲得了計算效率更高的擴散模型,因為採樣是在低維度空間上進行的。(ii)我們利用繼承自UNet架構[71]的擴散模型的歸納偏差,這使得它們對於具有空間結構的資料特別有效,從而減輕了先前的方法對於激進(aggressive)、降低品質的壓縮等級的需求[23, 66]。(iii)最後,我們獲得通用的壓縮模型,其潛在空間可以用於訓練多個生成模型,也可用於其它下游應用,如single-image CLIP-guided synthesis[25]。
### 3.1. Perceptual Image Compression
Our perceptual compression model is based on previous work [23] and consists of an autoencoder trained by a combination of a perceptual loss [106] and a patch-based [33] adversarial objective [20, 23, 103]. This ensures that the reconstructions are confined to the image manifold by enforcing local realism and avoids blurriness introduced by relying solely on pixel-space losses such as $L_2$ or $L_1$ objectives.
我們的感知壓縮模型基於先前的研究[23],由一個透過結合感知損失[106]和patch-based的對抗性目標[20, 23, 103]所訓練的自編碼器。這確保了重建結果被限制在影像流形(image manifold)內(透過強化局部真實感),避免單純依賴像素空間損失(如$L_2$或$L_1$目標式)所引入的模糊。
:::warning
>[name=Felo.ai]
Image manifold 是一個數學概念,指的是在高維空間中,圖像數據的內在結構可以被視為一個低維流形。這意味著,儘管圖像的像素數據在高維空間中可能非常複雜,但實際上這些圖像可以被映射到一個較低維度的空間中,並且在這個低維空間中,圖像之間的關係和變化是連續的。這種流形結構使得我們能夠更有效地進行圖像處理和分析,因為它捕捉了圖像的本質特徵,而不僅僅是像素值的變化。
:::
More precisely, given an image $x \in \mathbb{R}^{H\times W \times 3}$ in RGB space, the encoder $\mathcal{E}$ encodes $x$ into a latent representation $z = \mathcal{E}(x)$, and the decoder $\mathcal{D}$ reconstructs the image from the latent, giving $\tilde{x}=\mathcal{D}(z)=\mathcal{D}(\mathcal{E}(x))$, where $z \in \mathbb{R}^{h\times w\times c}$ . Importantly, the encoder downsamples the image by a factor $f=H/h=W/w$, and we investigate different downsampling factors $f = 2^m$, with $m \in \mathbb{N}$.
更精確地說,給定RGB空間中的影像$x \in \mathbb{R}^{H\times W \times 3}$,編碼器 $\mathcal{E}$ 將 $x$ 編碼為潛在表示(latent representation)$z = \mathcal{E}(x)$,解碼器 $\mathcal{D}$ 則從該潛在表示重建影像,得到 $\tilde{x}=\mathcal{D}(z)=\mathcal{D}(\mathcal{E}(x))$,其中 $z \in \mathbb{R}^{h\times w\times c}$。重要的是,編碼器以因子 $f=H/h=W/w$ 對影像進行[降低取樣](https://terms.naer.edu.tw/detail/44b16bf53d61d6109057b56e3dfad517/)(downsamples),我們探討了不同的降低取樣因子 $f = 2^m$,其中 $m \in \mathbb{N}$。
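:::warning
下方是一個最小化、純屬示意的sketch(並非論文的實際架構),用來說明降低取樣因子 $f = 2^m$ 如何把影像形狀 $H \times W \times 3$ 對應到潛在形狀 $h \times w \times c$:以 $m$ 個stride-2卷積實現,通道數 $c$ 與層寬皆為假設的示意值。
```python=
import torch
import torch.nn as nn

# Toy encoder: m stride-2 convolutions give a spatial downsampling factor f = 2**m,
# so z = E(x) has shape (c, H/f, W/f). Widths/channels are illustrative assumptions.
def make_encoder(m, c=4):
    layers, ch = [], 3
    for _ in range(m):
        layers += [nn.Conv2d(ch, 64, kernel_size=3, stride=2, padding=1), nn.SiLU()]
        ch = 64
    layers += [nn.Conv2d(ch, c, kernel_size=3, padding=1)]
    return nn.Sequential(*layers)

x = torch.randn(1, 3, 256, 256)   # H = W = 256
z = make_encoder(m=3)(x)          # f = 2**3 = 8
print(z.shape)                    # torch.Size([1, 4, 32, 32]), i.e. h = w = 256 / 8
```
:::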
In order to avoid arbitrarily high-variance latent spaces, we experiment with two different kinds of regularizations. The first variant, KL-reg., imposes a slight KL-penalty towards a standard normal on the learned latent, similar to a VAE [46, 69], whereas VQ-reg. uses a vector quantization layer [96] within the decoder. This model can be interpreted as a VQGAN [23] but with the quantization layer absorbed by the decoder. Because our subsequent DM is designed to work with the two-dimensional structure of our learned latent space $z = \mathcal{E}(x)$, we can use relatively mild compression rates and achieve very good reconstructions. This is in contrast to previous works [23, 66], which relied on an arbitrary 1D ordering of the learned space $z$ to model its distribution autoregressively and thereby ignored much of the inherent structure of $z$. Hence, our compression model preserves details of $x$ better (see Tab. 8). The full objective and training details can be found in the supplement.
為了避免任意高變異數的潛在空間,我們以兩種不同的正規化方法做實驗。第一種,KL-reg,在學習到的潛在表示上施加輕微的KL-penalty,使其趨向於標準正態分佈,這類似於VAE[46, 69];而 VQ-reg 則是在解碼器中使用向量量化層(vector quantization layer)[96]。此模型可被視為一種 VQGAN[23],不過其量化層被整合進解碼器就是。由於我們後續的擴散模型(DM)設計為與所學習到的二維結構潛在空間$z = \mathcal{E}(x)$協同工作,我們可以使用相對溫和的壓縮率,並實現不錯的重建結果。這與先前的研究[23, 66]形成對比,先前的研究依賴於對學習到的空間 $z$ 進行任意的一維排序,以自回歸方式建構其分佈,因而忽略了 $z$ 的許多內在結構。因此,我們的壓縮模型能夠更好地保留 $x$ 的細節(見Tab. 8)。完整的目標和訓練細節可在補充資料中找到。
:::warning
>[name=GPT + Felo]
KL-reg
```python=
import torch

# KL(q(z|x) || N(0, I)) for a diagonal Gaussian q with mean mu and log-variance logvar
def kl_divergence(mu, logvar):
    return -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
```
$$
\text{KL}(q(z|x) \parallel p(z)) = -\frac{1}{2} \sum_{i=1}^{d} \left(1 + \log(\sigma_i^2) - \mu_i^2 - \sigma_i^2\right)
$$
:::
:::warning
>[name=GPT + Felo]
>
VQ-reg
```python=
import torch
import torch.nn as nn

# Simplified sketch: omits the straight-through gradient / commitment losses of a full VQ layer
class VQLayer(nn.Module):
    def __init__(self, num_embeddings, embedding_dim):
        super().__init__()
        self.embeddings = nn.Embedding(num_embeddings, embedding_dim)

    def forward(self, x):
        # map each vector in x to its nearest codebook embedding
        x_flat = x.view(-1, x.size(-1))
        distances = (x_flat.unsqueeze(1) - self.embeddings.weight.unsqueeze(0)).pow(2).sum(-1)
        indices = distances.argmin(1)
        return self.embeddings(indices).view_as(x)   # quantized output, same shape as x
```
:::

Table 8. Complete autoencoder zoo trained on OpenImages, evaluated on ImageNet-Val. † denotes an attention-free autoencoder.
### 3.2. Latent Diffusion Models
**Diffusion Models** [82] are probabilistic models designed to learn a data distribution $p(x)$ by gradually denoising a normally distributed variable, which corresponds to learning the reverse process of a fixed Markov Chain of length $T$. For image synthesis, the most successful models [15,30,72] rely on a reweighted variant of the variational lower bound on $p(x)$, which mirrors denoising score-matching [85]. These models can be interpreted as an equally weighted sequence of denoising autoencoders $\epsilon_\theta(x_t, t);t=1...T$, which are trained to predict a denoised variant of their input $x_t$, where $x_t$ is a noisy version of the input $x$. The corresponding objective can be simplified to (Sec. B)
$$
L_{DM}=\mathbb{E}_{x,\epsilon\sim \mathcal{N}(0,1),t}[\Vert \epsilon -\epsilon_\theta(x_t, t)\Vert_2^2] \tag{1}
$$
with $t$ uniformly sampled from $\{1, \dots, T\}$.
**Diffusion Models** [82] 是一種機率模型,主要是透過逐步從一個正態分佈的隨機變數中去除噪點的方式來學習資料分佈 $p(x)$,這相當於學習一個固定長度為$T$的馬可夫鏈的逆過程。在影像合成方面,最成功的模型 [15,30,72] 依賴於$p(x)$上[變分](https://terms.naer.edu.tw/detail/de677b0dff9665b142192413a04abe1a/)下界的重新加權變體,這與denoising score-matching [85] 相似。這些模型可以被視為是一個等權重的去噪自編碼器序列 $\epsilon_\theta(x_t, t); t=1...T$,它們被訓練來預測其輸入 $x_t$ 的去噪後版本(denoised variant),其中 $x_t$ 是輸入 $x$ 的加噪版本。對應的目標函數可簡化為(見附錄 B):
$$
L_{DM}=\mathbb{E}_{x,\epsilon\sim \mathcal{N}(0,1),t}[\Vert \epsilon -\epsilon_\theta(x_t, t)\Vert_2^2 ] \tag{1}
$$
其中,$t$ 從 $\{1, \dots, T\}$ 中均勻採樣。
:::warning
符號說明:
* $x$:原始輸入
* $\epsilon\sim \mathcal{N}(0,1)$:從正態分佈中取樣的噪點
* $t$:time step(時步)
* $x_t$:在時步$t$的時候,加入一堆噪點之後的$x$
* $\epsilon_\theta(x_t, t)$:參數為$\theta$的神經網路所預估出來在時步$t$時候,其$x_t$的噪點
* $\Vert \cdot\Vert^2_2$,計算採樣的與預估的噪點之間的均方誤差
:::
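:::warning
為了讓公式(1)更具體,以下是單一訓練步驟的最小sketch:假設標準DDPM的前向過程 $x_t=\sqrt{\bar\alpha_t}\,x+\sqrt{1-\bar\alpha_t}\,\epsilon$(本節未明寫)與線性 $\beta$ 排程;`model` 代表任意 $\epsilon_\theta(x_t, t)$ 網路(例如UNet),所有名稱皆為示意用的假設。
```python=
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)               # assumed linear noise schedule
alpha_bar = torch.cumprod(1.0 - betas, dim=0)       # \bar{alpha}_t

def dm_loss(model, x):
    b = x.shape[0]
    t = torch.randint(0, T, (b,), device=x.device)  # t ~ Uniform{1..T}
    eps = torch.randn_like(x)                       # eps ~ N(0, I)
    ab = alpha_bar[t].view(b, 1, 1, 1)
    x_t = ab.sqrt() * x + (1.0 - ab).sqrt() * eps   # noisy version of x
    return ((eps - model(x_t, t)) ** 2).mean()      # || eps - eps_theta(x_t, t) ||_2^2
```
:::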
**Generative Modeling of Latent Representations** With our trained perceptual compression models consisting of $\mathcal{E}$ and $\mathcal{D}$, we now have access to an efficient, low-dimensional latent space in which high-frequency, imperceptible details are abstracted away. Compared to the high-dimensional pixel space, this space is more suitable for likelihood-based generative models, as they can now (i) focus on the important, semantic bits of the data and (ii) train in a lower dimensional, computationally much more efficient space.
**Generative Modeling of Latent Representations** 透過我們所訓練的感知壓縮模型,$\mathcal{E}$(encoder)和$\mathcal{D}$(decoder),現在我們可以存取一個高效、低維度的潛在空間,其中高頻、難以察覺的細節將會被抽象化。相較於高維度的像素空間,這個空間更適合基於似然的生成模型,因為它們現在可以(i)專注於資料中重要的語義部分,以及(ii)在較低維度、計算上更高效的空間中進行訓練。
Unlike previous work that relied on autoregressive, attention-based transformer models in a highly compressed, discrete latent space [23,66,103], we can take advantage of image-specific inductive biases that our model offers. This includes the ability to build the underlying UNet primarily from 2D convolutional layers, and further focusing the objective on the perceptually most relevant bits using the reweighted bound, which now reads
$$
L_{LDM}:=\mathbb{E}_{\mathcal{E}(x),\epsilon\sim\mathcal{N}(0,1),t}[\Vert\epsilon-\epsilon_\theta(z_t, t)\Vert^2_2]\tag{2}
$$
不同於先前依賴於自回歸、基於注意力機制、在高度壓縮且離散化的潛在空間中運作的Transformer模型的研究 [23,66,103],我們可以利用我們模型所提供的影像特有的歸納偏差。這包含主要使用二維卷積層構建底層UNet的能力,並進一步使用重新加權的邊界將目標集中在感知上最相關的部份,現在表示為:
$$
L_{LDM}:=\mathbb{E}_{\mathcal{E}(x),\epsilon\sim\mathcal{N}(0,1),t}[\Vert\epsilon-\epsilon_\theta(z_t, t)\Vert^2_2]\tag{2}
$$
:::warning
與公式(1)最大的差別在於目標調整成是encoder之後的output
:::
The neural backbone $\epsilon_\theta(\circ,t)$ of our model is realized as a time-conditional UNet [71]. Since the forward process is fixed, $z_t$ can be efficiently obtained from $\mathcal{E}$ during training, and samples from $p(z)$ can be decoded to image space with a single pass through $\mathcal{D}$.
我們模型的神經網路主幹 $\epsilon_\theta(\circ,t)$ 被實現為time-conditional UNet [71]。由於前向過程是固定的,$z_t$ 可以在訓練期間從 $\mathcal{E}$ 高效地獲取,並且從 $p(z)$ 採樣的樣本可以通過一次傳遞經過 $\mathcal{D}$ 解碼回影像空間。
:::warning
這裡的壓縮是由自動編碼器的encoder $\mathcal{E}$ 完成(而非UNet的encoder);$p(z)$ 指的是擴散模型在潛在空間中學到的分佈,採樣時從一個正態分佈的 $z_T$ 出發,經由學到的逆向去噪過程得到 $z$,再透過 $\mathcal{D}$ 一次解碼回影像空間。
前向過程是固定的,指的是一個事先設計好的馬可夫鏈:從 $t=0$ 到 $T$ 逐步加噪,讓潛在表示逐漸趨近於(近似)標準正態分佈。
:::
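:::warning
以下是公式(2)的sketch,假設與上方公式(1)的sketch相同:唯一的差別是擴散改在 $z=\mathcal{E}(x)$ 上進行,且預訓練的自編碼器保持凍結。`frozen_encoder` / `decoder` 為 $\mathcal{E}$ / $\mathcal{D}$ 的示意名稱。
```python=
import torch

T = 1000
alpha_bar = torch.cumprod(1.0 - torch.linspace(1e-4, 0.02, T), dim=0)

def ldm_loss(model, frozen_encoder, x):
    with torch.no_grad():
        z = frozen_encoder(x)                       # z = E(x), e.g. shape (B, c, H/f, W/f)
    b = z.shape[0]
    t = torch.randint(0, T, (b,), device=z.device)
    eps = torch.randn_like(z)
    ab = alpha_bar[t].view(b, 1, 1, 1)
    z_t = ab.sqrt() * z + (1.0 - ab).sqrt() * eps
    return ((eps - model(z_t, t)) ** 2).mean()

# At sampling time, a latent produced by the reverse process is decoded back to
# pixel space with a single pass: x_sample = decoder(z_0).
```
:::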
### 3.3. Conditioning Mechanisms
Similar to other types of generative models [56, 83], diffusion models are in principle capable of modeling conditional distributions of the form $p(z|y)$. This can be implemented with a conditional denoising autoencoder $\epsilon_\theta(z_t,t,y)$ and paves the way to controlling the synthesis process through inputs $y$ such as text [68], semantic maps [33, 61] or other image-to-image translation tasks [34].
類似於其它類型的生成模型 [56, 83] ,擴散模型原則上能夠對形式為 $p(z|y)$ 的條件分佈建模。這可以透過條件去噪自編碼器(Conditional Denoising Autoencoder)$\epsilon_\theta(z_t,t,y)$ 來實現,並為透過輸入 $y$(如文本 [68]、語義圖 [33, 61] 或其他圖像到圖像的轉換任務 [34])控制合成過程創造條件。
In the context of image synthesis, however, combining the generative power of DMs with other types of conditionings beyond class-labels [15] or blurred variants of the input image [72] is so far an under-explored area of research.
然而,在影像合成的上下文中,將 DMs 的生成能力與超出類別標籤或輸入影像模糊變體等其它類型的條件相結合,仍是一個待深入探索的研究領域。
We turn DMs into more flexible conditional image generators by augmenting their underlying UNet backbone with the cross-attention mechanism [97], which is effective for learning attention-based models of various input modalities [35,36]. To pre-process $y$ from various modalities (such as language prompts) we introduce a domain specific encoder $\tau_\theta$ that projects $y$ to an intermediate representation $\tau_\theta(y)\in\mathbb{R}^{M\times d_\tau}$ , which is then mapped to the intermediate layers of the UNet via a cross-attention layer implementing $\text{Attention}(Q,K,V)=\text{softmax}(\dfrac{QK^T}{\sqrt{d}})\cdot V$, with
$$
Q=W_Q^{(i)}\cdot\varphi_i(z_t),K=W_K^{(i)}\cdot \tau_\theta(y),V=W_V^{(i)}\cdot\tau_\theta(y)
$$
我們透過在其底層的 UNet 主幹中引入交叉注意力機制 [97],將 DMs 轉變為更靈活的條件式影像生成器,交叉注意力機制對於學習各種輸入模態的基於注意力的模型非常有效 [35, 36]。為了對來自各種模態(例如語言提示)的 $y$ 進行預處理,我們引入了一個特定領域的編碼器 $\tau_\theta$,它能夠將 $y$ (條件式)投射到中間表示 $\tau_\theta(y) \in \mathbb{R}^{M \times d_\tau}$,然後透過交叉注意力層將其映射到 UNet 的中間層,該交叉注意力層實現了 $\text{Attention}(Q,K,V)=\text{softmax}(\dfrac{QK^T}{\sqrt{d}})\cdot V$,其中
$$
Q=W_Q^{(i)}\cdot\varphi_i(z_t),K=W_K^{(i)}\cdot \tau_\theta(y),V=W_V^{(i)}\cdot\tau_\theta(y)
$$
Here, $\varphi_i(z_t)\in\mathbb{R}^{N\times d^i_\epsilon}$ denotes a (flattened) intermediate representation of the UNet implementing $\epsilon_\theta$ and $W_V^{(i)}\in\mathbb{R}^{d\times d_\epsilon^i}$, $W_Q^{(i)}\in\mathbb{R}^{d\times d_\tau}$ and $W_K^{(i)}\in\mathbb{R}^{d\times d_\tau}$ are learnable projection matrices [36, 97]. See Fig. 3 for a visual depiction.
這邊,$\varphi_i(z_t) \in \mathbb{R}^{N \times d^i_\epsilon}$ 表示實現 $\epsilon_\theta$ 的 UNet 的中間表示(平展的),$W_V^{(i)} \in \mathbb{R}^{d \times d_\epsilon^i}$、$W_Q^{(i)} \in \mathbb{R}^{d \times d_\tau}$ 和 $W_K^{(i)} \in \mathbb{R}^{d \times d_\tau}$ 是可學習的投影矩陣 [36, 97]。視覺說明可見Fig. 3。

Figure 3. We condition LDMs either via concatenation or by a more general cross-attention mechanism. See Sec. 3.3
:::warning
上部說的是,$x$經過encoder-$\mathcal{E}$,得到壓縮後的結果$z$,然後經過$T$個時步之後的加噪得到$z_T$
右邊說的是,條件式$y$的部份經過$\tau_\theta$的轉換之後加入去噪的逆向工程中
下部說的是,加噪後的$z_T$跟轉換後的條件式一起進入去噪的逆向工程中,這個去噪的神經網路也是一個UNet,並且也有使用skip-connection,很明顯的在每個階段都會把條件式一起加入
:::
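:::warning
以下是Sec. 3.3所述交叉注意力層的最小sketch:$Q$ 來自(平展後的)UNet特徵圖 $\varphi_i(z_t)$,而 $K$、$V$ 來自條件表示 $\tau_\theta(y)$。維度(`d_unet`、`d_tau`、`d`)與單頭(single-head)形式皆為示意上的簡化;論文實際使用multi-head attention。
```python=
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    def __init__(self, d_unet, d_tau, d=64):
        super().__init__()
        self.scale = d ** -0.5
        self.W_Q = nn.Linear(d_unet, d, bias=False)   # W_Q^(i)
        self.W_K = nn.Linear(d_tau, d, bias=False)    # W_K^(i)
        self.W_V = nn.Linear(d_tau, d, bias=False)    # W_V^(i)

    def forward(self, phi, tau_y):                    # phi: (B, N, d_unet), tau_y: (B, M, d_tau)
        Q, K, V = self.W_Q(phi), self.W_K(tau_y), self.W_V(tau_y)
        attn = torch.softmax(Q @ K.transpose(1, 2) * self.scale, dim=-1)  # (B, N, M)
        return attn @ V                               # (B, N, d), projected back into the UNet afterwards
```
:::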
Based on image-conditioning pairs, we then learn the conditional LDM via
$$
L_{LDM}:=\mathbb{E}_{\mathcal{E}(x),y,\epsilon\sim\mathcal{N}(0,1), t}[\Vert \epsilon - \epsilon_\theta(z_t, t, \tau_\theta(y)) \Vert^2_2]\tag{3}
$$
基於image-conditioning pairs,我們通過以下方式學習conditional LDM:
$$
L_{LDM}:=\mathbb{E}_{\mathcal{E}(x),y,\epsilon\sim\mathcal{N}(0,1), t}[\Vert \epsilon - \epsilon_\theta(z_t, t, \tau_\theta(y)) \Vert^2_2]\tag{3}
$$
where both $\tau_\theta$ and $\epsilon_\theta$ are jointly optimized via Eq. 3. This conditioning mechanism is flexible as $\tau_\theta$ can be parameterized with domain-specific experts, e.g. (unmasked) transformers [97] when $y$ are text prompts (see Sec. 4.3.1)
其中,$\tau_\theta$ 和 $\epsilon_\theta$ 會通過公式 (3) 一起最佳化。這個條件機制是具靈活性的,因為 $\tau_\theta$ 可以由特定領域的專家參數化,舉例來說,當 $y$ 是文本提示時,使用(unmasked) transformers [97](見Sec. 4.3.1)。
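:::warning
以下是公式(3)最佳化方式的sketch:$\tau_\theta$(此處為 `cond_encoder`)與 $\epsilon_\theta$(此處為 `unet`)由同一個損失取得梯度、共用同一個optimizer,而第一階段的encoder保持凍結。噪點排程與命名沿用前面的sketch,皆為假設,並非論文的實際實作。
```python=
import itertools
import torch

T = 1000
alpha_bar = torch.cumprod(1.0 - torch.linspace(1e-4, 0.02, T), dim=0)

def conditional_ldm_loss(unet, cond_encoder, frozen_encoder, x, y):
    with torch.no_grad():
        z = frozen_encoder(x)                          # z = E(x), E stays fixed
    t = torch.randint(0, T, (z.shape[0],), device=z.device)
    eps = torch.randn_like(z)
    ab = alpha_bar[t].view(-1, 1, 1, 1)
    z_t = ab.sqrt() * z + (1.0 - ab).sqrt() * eps
    return ((eps - unet(z_t, t, cond_encoder(y))) ** 2).mean()

# One optimizer over both modules, e.g.:
# opt = torch.optim.AdamW(itertools.chain(unet.parameters(), cond_encoder.parameters()), lr=1e-4)
```
:::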
## 4. Experiments
LDMs provide means to flexible and computationally tractable diffusion based image synthesis of various image modalities, which we empirically show in the following. Firstly, however, we analyze the gains of our models compared to pixel-based diffusion models in both training and inference. Interestingly, we find that LDMs trained in VQ-regularized latent spaces sometimes achieve better sample quality, even though the reconstruction capabilities of VQ-regularized first stage models slightly fall behind those of their continuous counterparts, $cf$ . Tab. 8. A visual comparison between the effects of first stage regularization schemes on LDM training and their generalization abilities to resolutions > 256^2^ can be found in Appendix D.1. In E.2 we list details on architecture, implementation, training and evaluation for all results presented in this section.
LDMs提供了在各種影像模式下進行靈活且計算可行的基於擴散的影像合成方法,以下我們將以實驗結果說明這一點。然而,首先,我們分析了與基於像素的擴散模型相比,我們的模型在訓練和推論過程中的優勢。有趣的是,我們發現,以VQ-regularized latent spaces訓練的模型在某些時候會有較好的樣本品質,儘管VQ-regularized在第一階段模型的重建能力略微落後於連續對應模型,相關比較請參考Tab. 8。關於第一階段正則化方案對 LDM 訓練的影響及其在解析度 > 256^2^ 影像上的泛化能力的視覺比較,可以在Appendix D.1找到。附錄 E.2 的部份,我們提供了本節中所有結果的架構、實現、訓練和評估細節。
### 4.1. On Perceptual Compression Tradeoffs
This section analyzes the behavior of our LDMs with different downsampling factors $f \in \left\{1, 2, 4, 8, 16, 32\right\}$ (abbreviated as $LDM-f$, where $LDM-1$ corresponds to pixel-based DMs). To obtain a comparable test-field, we fix the computational resources to a single NVIDIA A100 for all experiments in this section and train all models for the same number of steps and with the same number of parameters.
本章節分析我們的 LDMs在不同降低取樣因子 $f \in \left\{1, 2, 4, 8, 16, 32 \right\}$ 下的行為(簡稱為 $LDM-f$,其中 $LDM-1$ 對應於基於像素的擴散模型 DMs)。為了取得可比較的測試場景,我們將本章節中的所有實驗的計算資源固定為單一 NVIDIA A100,並且所有的模型訓練都是以相同數量的步驟以及相同的參數量。
Tab. 8 shows hyperparameters and reconstruction performance of the first stage models used for the LDMs compared in this section. Fig. 6 shows sample quality as a function of training progress for 2M steps of class-conditional models on the ImageNet [12] dataset. We see that, i) small downsampling factors for $LDM-\{1,2\}$ result in slow training progress, whereas ii) overly large values of $f$ cause stagnating fidelity after comparably few training steps. Revisiting the analysis above (Fig. 1 and 2) we attribute this to i) leaving most of perceptual compression to the diffusion model and ii) too strong first stage compression resulting in information loss and thus limiting the achievable quality. $LDM-\{4-16\}$ strike a good balance between efficiency and perceptually faithful results, which manifests in a significant FID [29] gap of 38 between pixel-based diffusion (LDM-1) and LDM-8 after 2M training steps.
Tab. 8說明了本章節中用於比較的 LDMs 第一階段模型所使用的超參數和重建效能。Fig. 6說明了在 ImageNet [12] 資料集上對分類條件式模型進行 2M 步驟訓練後,其樣本品質隨著訓練進度變化的狀態。我們可以看到:i) 較小的降低取樣因子 $LDM-\{1,2\}$ 會導致訓練進展緩慢;而 ii) 過大的 $f$ 值則在相對較少的訓練步驟後造成保真度停滯不前。回顧上述分析(Fig. 1和2),我們將之歸因於 i) 大部分感知壓縮由 diffusion model完成,以及 ii) 第一階段過強的壓縮導致信息的損失,從而限制了可達到的質量。$LDM-\{4-16\}$ 在效率和感知保真度之間取得了良好平衡,這一點體現在pixel-based diffusion (LDM-1)與 LDM-8 在 2M 訓練步驟後的 FID [29] 指標上,其差距高達 38。

Figure 6. Analyzing the training of class-conditional LDMs with different downsampling factors $f$ over 2M train steps on the ImageNet dataset. Pixel-based LDM-1 requires substantially larger train times compared to models with larger downsampling factors (LDM-{4-16}). Too much perceptual compression as in LDM-32 limits the overall sample quality. All models are trained on a single NVIDIA A100 with the same computational budget. Results obtained with 100 DDIM steps [84] and $κ = 0$.
In Fig. 7, we compare models trained on CelebAHQ [39] and ImageNet in terms of sampling speed for different numbers of denoising steps with the DDIM sampler [84] and plot it against FID-scores [29]. $LDM-\{4-8\}$ outperform models with unsuitable ratios of perceptual and conceptual compression. Especially compared to pixel-based $LDM-1$, they achieve much lower FID scores while simultaneously significantly increasing sample throughput. Complex datasets such as ImageNet require reduced compression rates to avoid reducing quality. In summary, $LDM-4$ and $-8$ offer the best conditions for achieving high-quality synthesis results.
在Fig. 7中,我們比較了在 CelebAHQ [39] 和 ImageNet 上訓練的模型,使用 DDIM sampler [84] 進行不同去噪步數的採樣速度,並將結果繪製與 FID 分數 [29] 進行比較。結果顯示,$LDM-\{4-8\}$ 的表現優於那些在感知和概念壓縮比例不適當的模型。特別是與基於像素的 $LDM-1$ 相比,它們在顯著降低 FID 分數的同時,大幅提升了採樣吞吐量。對於如 ImageNet這類複雜的資料集,則是需要降低壓縮率以避免品質的下降。總的來說, $LDM-4$ 和 $LDM-8$ 提供了實現高品質圖像合成的最佳條件。
:::warning
DDIM:Denoising Diffusion Implicit Models
:::
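:::warning
作為參考,以下是deterministic DDIM更新($\eta = 0$)在時步子序列上的最小sketch,這正是上面能以少量步數(10-200)採樣的原因;排程與命名沿用前面sketch的假設,並非官方repo的實際程式碼。
```python=
import torch

T = 1000
alpha_bar = torch.cumprod(1.0 - torch.linspace(1e-4, 0.02, T), dim=0)

@torch.no_grad()
def ddim_sample(model, shape, num_steps=100, device="cpu"):
    ts = torch.linspace(T - 1, 0, num_steps).long()
    z = torch.randn(shape, device=device)                       # start from z_T ~ N(0, I)
    for i, t in enumerate(ts):
        ab_t = alpha_bar[t]
        ab_prev = alpha_bar[ts[i + 1]] if i + 1 < len(ts) else torch.tensor(1.0)
        t_batch = torch.full((shape[0],), int(t), dtype=torch.long, device=device)
        eps = model(z, t_batch)
        z0_hat = (z - (1.0 - ab_t).sqrt() * eps) / ab_t.sqrt()  # predicted clean latent
        z = ab_prev.sqrt() * z0_hat + (1.0 - ab_prev).sqrt() * eps
    return z                                                    # decode with D afterwards
```
:::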

Figure 7. Comparing LDMs with varying compression on the CelebA-HQ (left) and ImageNet (right) datasets. Different markers indicate {10, 20, 50, 100, 200} sampling steps using DDIM, from right to left along each line. The dashed line shows the FID scores for 200 steps, indicating the strong performance of LDM-{4-8}. FID scores assessed on 5000 samples. All models were trained for 500k (CelebA) / 2M (ImageNet) steps on an A100.
### 4.2. Image Generation with Latent Diffusion
We train unconditional models of 256^2^ images on CelebA-HQ [39], FFHQ [41], LSUN-Churches and -Bedrooms [102] and evaluate i) sample quality and ii) their coverage of the data manifold using FID [29] and Precision-and-Recall [50]. Tab. 1 summarizes our results. On CelebA-HQ, we report a new state-of-the-art FID of 5.11, outperforming previous likelihood-based models as well as GANs. We also outperform LSGM [93] where a latent diffusion model is trained jointly together with the first stage. In contrast, we train diffusion models in a fixed space and avoid the difficulty of weighing reconstruction quality against learning the prior over the latent space, see Fig. 1-2.
我們在 CelebA-HQ [39]、FFHQ [41]、LSUN-Churches 和 LSUN-Bedrooms [102] 上訓練了 256^2^ 像素大小的非條件式的模型,並評估其 i) 樣本品質和 ii) 對資料流形(data manifold)的覆蓋度(使用 FID [29] 和 Precision-and-Recall [50]計算)。Tab. 1 總結了我們的成果。在 CelebA-HQ 資料集中,我們報告了一個新的最先進的 FID 分數 5.11,超越了以前基於對數似然的模型以及GANs。我們同時也超越了 LSGM [93],該方法是在第一階段聯合訓練潛在擴散模型 (latent diffusion model)。相比之下,我們是在一個固定的空間中訓練擴散模型,從而避免了在重建品質與學習對潛在空間先驗之間取得平衡的難題(參見圖 1-2)。

Table 1. Evaluation metrics for unconditional image synthesis. CelebA-HQ results reproduced from [43, 63, 100], FFHQ from [42, 43]. †: $N-s$ refers to $N$ sampling steps with the DDIM [84] sampler. ∗: trained in KL-regularized latent space. Additional results can be found in the supplementary.
We outperform prior diffusion based approaches on all but the LSUN-Bedrooms dataset, where our score is close to ADM [15], despite utilizing half its parameters and requiring 4-times less train resources (see Appendix E.3.5).
我們在除了 LSUN-Bedrooms 資料集之外的所有情況下都超越了過去基於擴散模型的方法。儘管在 LSUN-Bedrooms 資料集上,我們的得分與 ADM [15] 相近,但我們僅使用其一半的參數,且所需的訓練資源僅為其四分之一(請參見附錄 E.3.5)。
Moreover, LDMs consistently improve upon GAN-based methods in Precision and Recall, thus confirming the advantages of their mode-covering likelihood-based training objective over adversarial approaches. In Fig. 4 we also show qualitative results on each dataset.
此外,LDMs 在精確度(Precision)和召回率(Recall)方面穩定地優於基於 GAN 的方法,這證實了其以涵蓋模式(mode-covering)的似然函數(likelihood-based)作為訓練目標,相較於對抗式方法具有一定的優勢。此外,我們在Fig. 4中說明了每個資料集的定性結果(qualitative results)。

Figure 4. Samples from LDMs trained on CelebAHQ [39], FFHQ [41], LSUN-Churches [102], LSUN-Bedrooms [102] and classconditional ImageNet [12], each with a resolution of 256 × 256. Best viewed when zoomed in. For more samples cf . the supplement
### 4.3. Conditional Latent Diffusion
#### 4.3.1 Transformer Encoders for LDMs
By introducing cross-attention based conditioning into LDMs we open them up for various conditioning modalities previously unexplored for diffusion models. For text-to-image modeling, we train a 1.45B parameter KL-regularized LDM conditioned on language prompts on LAION-400M [78]. We employ the BERT-tokenizer [14] and implement $\tau_\theta$ as a transformer [97] to infer a latent code which is mapped into the UNet via (multi-head) cross-attention (Sec. 3.3). This combination of domain specific experts for learning a language representation and visual synthesis results in a powerful model, which generalizes well to complex, user-defined text prompts, $cf$ . Fig. 8 and 5. For quantitative analysis, we follow prior work and evaluate text-to-image generation on the MS-COCO [51] validation set, where our model improves upon powerful AR [17, 66] and GAN-based [109] methods, cf . Tab. 2. We note that applying classifier-free diffusion guidance [32] greatly boosts sample quality, such that the guided LDM-KL-8-G is on par with the recent state-of-the-art AR [26] and diffusion models [59] for text-to-image synthesis, while substantially reducing parameter count. To further analyze the flexibility of the cross-attention based conditioning mechanism we also train models to synthesize images based on semantic layouts on OpenImages [49], and finetune on COCO [4], see Fig. 8. See Sec. D.3 for the quantitative evaluation and implementation details.
透過將基於交叉注意(cross-attention)的條件機制引入LDMs,我們為其開拓了在擴散模型中未曾探索的多種條件模式。針對text-to-image的影像建模,我們在 LAION-400M [78] 資料集上,訓練了一個具有 1.45B 參數量且經 KL-regularized 的 LDM,其以語言提示(language prompts)作為條件。我們採用 BERT-tokenizer [14],並將 $\tau_\theta$ 實作為一個transformer[97]用於推論潛在編碼(latent code),該編碼透過(multi-head) cross-attention 映射到 UNet 模型中(Sec. 3.3)。這種結合語言表示學習(language representation learning)和視覺合成(visual synthesis)的專門技術,形成了一個功能強大的模型,該模型能夠很好地適應複雜且由使用者定義的文本提示(如Fig. 8和Fig. 5所示)。在量化分析方面,我們遵循先前的研究方法,並在 MS-COCO [51] 驗證集上對text-to-image的生成進行評估。我們的模型在性能上超越了強大的自回歸模型(autoregressive, AR)[17, 66] 與基於生成式對抗網絡(GAN)[109] 的方法(Tab. 2)。我們注意到,採用classifier-free diffusion guidance(無分類器引導)[32] 能顯著提高樣本品質,從而使guided LDM-KL-8-G 在text-to-image的合成方面與最近的最先進的自回歸模型(AR)[26] 和擴散模型 [59] 表現相當,同時大幅減少參數量。為了進一步分析基於交叉注意的條件機制的靈活性,我們還訓練了模型,以在 OpenImages [49] 資料集上基於語意佈局(semantic layouts)來生成影像,並在 COCO [4] 資料集上進行微調。如Fig. 8所示,有關量化評估和實現細節,請參見Sec. D.3。
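:::warning
以下是classifier-free diffusion guidance [32] 在採樣時的sketch:模型分別以條件表示與「空條件」(例如空白提示的embedding,此處為示意用的placeholder)各評估一次,再以guidance scale $s$ 組合兩個噪點預測;採用常見的慣例 $\hat\epsilon = \epsilon_{\text{uncond}} + s\,(\epsilon_{\text{cond}} - \epsilon_{\text{uncond}})$,此為假設而非論文原文所列的公式。
```python=
import torch

@torch.no_grad()
def guided_eps(unet, z_t, t, cond, null_cond, s=10.0):
    eps_cond = unet(z_t, t, cond)          # eps_theta(z_t, t, tau_theta(y))
    eps_uncond = unet(z_t, t, null_cond)   # eps_theta(z_t, t, "empty" conditioning)
    return eps_uncond + s * (eps_cond - eps_uncond)
```
:::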

Figure 5. Samples for user-defined text prompts from our model for text-to-image synthesis, LDM-8 (KL), which was trained on the LAION [78] database. Samples generated with 200 DDIM steps and $\eta = 1.0$. We use unconditional guidance [32] with $s = 10.0$.

Figure 8. Layout-to-image synthesis with an LDM on COCO [4], see Sec. 4.3.1. Quantitative evaluation in the supplement D.3.

Table 2. Evaluation of text-conditional image synthesis on the 256 × 256-sized MS-COCO [51] dataset: with 250 DDIM [84] steps our model is on par with the most recent diffusion [59] and autoregressive [26] methods despite using significantly less parameters. †/∗:Numbers from [109]/ [26]
Lastly, following prior work [3, 15, 21, 23], we evaluate our best-performing class-conditional ImageNet models with $f \in \left\{4, 8\right\}$ from Sec. 4.1 in Tab. 3, Fig. 4 and Sec. D.4. Here we outperform the state of the art diffusion model ADM [15] while significantly reducing computational requirements and parameter count, $cf$ . Tab 18.
最後,參考先前的研究 [3, 15, 21, 23],我們選取效能最佳的類別條件式的 ImageNet 模型($f \in \left\{4, 8\right\}$),如第 4.1 節所述進行評估。這些評估結果分別在Tab. 3、Fig. 4 和Sec. D.4中說明。我們的模型在性能上超越了當前最先進的擴散模型 ADM [15],同時明顯降低計算需求與參數數量(詳見Tab 18)。

Table 3. Comparison of a class-conditional ImageNet LDM with recent state-of-the-art methods for class-conditional image generation on ImageNet [12]. A more detailed comparison with additional baselines can be found in D.4, Tab. 10 and F. c.f.g. denotes classifier-free guidance with a scale s as proposed in [32].

Table 18. Comparing compute requirements during training and inference throughput with state-of-the-art generative models. Compute during training in V100-days, numbers of competing methods taken from [15] unless stated differently;∗ : Throughput measured in samples/sec on a single NVIDIA A100;† : Numbers taken from [15] ;‡ : Assumed to be trained on 25M train examples; ††: R-FID vs. ImageNet validation set
#### 4.3.2 Convolutional Sampling Beyond 256^2^
By concatenating spatially aligned conditioning information to the input of $\epsilon_\theta$, LDMs can serve as efficient general purpose image-to-image translation models. We use this to train models for semantic synthesis, super-resolution (Sec. 4.4) and inpainting (Sec. 4.5). For semantic synthesis, we use images of landscapes paired with semantic maps [23, 61] and concatenate downsampled versions of the semantic maps with the latent image representation of a $f = 4$ model (VQ-reg., see Tab. 8). We train on an input resolution of 256^2^ (crops from 384^2^ ) but find that our model generalizes to larger resolutions and can generate images up to the megapixel regime when evaluated in a convolutional manner (see Fig. 9). We exploit this behavior to also apply the super-resolution models in Sec. 4.4 and the inpainting models in Sec. 4.5 to generate large images between 512^2^ and 1024^2^ . For this application, the signal-to-noise ratio (induced by the scale of the latent space) significantly affects the results. In Sec. D.1 we illustrate this when learning an LDM on (i) the latent space as provided by a $f = 4$ model (KL-reg., see Tab. 8), and (ii) a rescaled version, scaled by the component-wise standard deviation.
透過將空間對齊的條件資訊串接到 $\epsilon_\theta$ 的輸入,LDMs可以作為高效通用型的image-to-image的轉換模型。我們利用這個方法來訓練語義合成(semantic synthesis)、超解析度(Sec. 4.4)和修補(Sec. 4.5)等任務的模型。在語義合成方面,我們使用與語義圖(semantic maps)配對的風景照片 [23, 61],並將語義圖的下採樣版本與 $f = 4$ 模型的潛在影像表示(VQ-reg.,見Tab. 8)串接起來。我們在 256^2^的輸入解析度上進行訓練(從 384^2^的影像裁剪而來),不過我們有發現到模型可以泛化到更高的解析度,並且在以卷積方式進行評估時,可以生成高達百萬像素的影像(見Fig. 9)。我們利用這個特性,將Sec. 4.4的超解析度模型和Sec. 4.5的修補模型應用於生成 512^2^至1024^2^ 的高解析度影像。在這個應用場景下,由潛在空間尺度所引起的[信噪比](https://terms.naer.edu.tw/detail/06a09b8b863311f6c08a642f2a6b1795/)(signal-to-noise ratio)對結果有顯著地影響。在Sec. D.1中,我們在學習LDM時說明了這一點:(i) 使用由 $f = 4$ 模型(KL-reg.,見Tab. 8)所提供的潛在空間,以及(ii) 依component-wise standard deviation所重新縮放的版本。
The latter, in combination with classifier-free guidance [32], also enables the direct synthesis of > 256^2^ images for the text-conditional LDM-KL-8-G as in Fig. 13.
後者(重新縮放的版本)與無分類器指導(classifier-free guidance)[32] 結合,還能直接為文本條件的 LDM-KL-8-G 合成超過 256^2^解析度的影像,如Fig. 13所示。
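:::warning
以下是本節所述「串接式條件」在形狀層面的sketch:將空間對齊的條件訊號(例如語義圖)縮放到潛在解析度,再與 $z_t$ 沿通道維度串接後送入UNet(UNet的輸入通道數需相應加寬)。具體形狀與nearest-neighbor縮放皆為示意上的假設;由於整體為卷積式運作,較大的輸入會得到相應較大的輸出。
```python=
import torch
import torch.nn.functional as F

z_t = torch.randn(1, 3, 64, 64)                  # e.g. latent of an f = 4 model for a 256^2 image
sem_map = torch.randn(1, 1, 256, 256)            # per-pixel conditioning in image space
sem_small = F.interpolate(sem_map, size=z_t.shape[-2:], mode="nearest")
unet_input = torch.cat([z_t, sem_small], dim=1)  # (1, 3 + 1, 64, 64), fed to eps_theta
```
:::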
### 4.4. Super-Resolution with Latent Diffusion
LDMs can be efficiently trained for super-resolution by directly conditioning on low-resolution images via concatenation ($cf$ . Sec. 3.3). In a first experiment, we follow SR3 [72] and fix the image degradation to a bicubic interpolation with 4×-downsampling and train on ImageNet following SR3’s data processing pipeline. We use the $f = 4$ autoencoding model pretrained on OpenImages (VQ-reg., $cf$ . Tab. 8) and concatenate the low-resolution conditioning $y$ and the inputs to the UNet, i.e. $\tau_\theta$ is the identity. Our qualitative and quantitative results (see Fig. 10 and Tab. 5) show competitive performance and LDM-SR outperforms SR3 in FID while SR3 has a better IS. A simple image regression model achieves the highest PSNR and SSIM scores; however these metrics do not align well with human perception [106] and favor blurriness over imperfectly aligned high frequency details [72]. Further, we conduct a user study comparing the pixel-baseline with LDM-SR. We follow SR3 [72] where human subjects were shown a low-res image in between two high-res images and asked for preference. The results in Tab. 4 affirm the good performance of LDM-SR. PSNR and SSIM can be pushed by using a post-hoc guiding mechanism [15] and we implement this image-based guider via a perceptual loss, see Sec. D.6.
透過直接連接基於低解析度影像進行條件設置(參見Sec. 3.3),可以高效地訓練 LDMs 用於超解析度。在第一個實驗中,我們依循 SR3 [72],將影像退化固定為採用 4 倍下採樣的[雙三次內挿值](https://terms.naer.edu.tw/detail/620bf21b9342124acb5694f440e4a966/),並按照 SR3 的資料處理流程在 ImageNet 上進行訓練。我們使用在 OpenImages 上預訓練的 $f = 4$ 自動編碼模型(VQ-reg.,參見Tab. 8),並將低解析度條件 $y$ 與 UNet 的輸入進行連接,也就是 $\tau_\theta$ 為恆等映射。我們的定性與定量結果(見Fig. 10和Tab. 5)顯示了良好的競爭效能,其中 LDM-SR 在 FID 指標上優於 SR3,而 SR3 在 IS 指標上表現更佳。一個簡單的影像回歸模型取得了最高的 PSNR 和 SSIM 分數;然而,這些指標與人類的感知並不完全一致[106],更傾向於支持模糊影像而非不完美對齊的高頻細節[72]。此外,我們還進行了一項使用者研究,將像素基線與 LDM-SR 進行比較。我們依循 SR3 [72] 的實驗設置,向受試者展示了一張低解析度影像,位於兩張高解析度影像之間,並詢問其偏好。Tab. 4的結果進一步肯定了 LDM-SR 良好的效能。可以透過使用後處理的引導機制(post-hoc guiding mechanism) [15] 來提升 PSNR 和 SSIM,我們通過感知損失(perceptual loss)實現了一種基於影像的引導方法,詳情參見第 D.6。

Figure 10. ImageNet 64→256 super-resolution on ImageNet-Val. LDM-SR has advantages at rendering realistic textures but SR3 can synthesize more coherent fine structures. See appendix for additional samples and cropouts. SR3 results from [72].

Table 5. ×4 upscaling results on ImageNet-Val. (256^2^); †: FID features computed on validation split, ‡ : FID features computed on train split; ∗ : Assessed on a NVIDIA A100
### 4.5. Inpainting with Latent Diffusion
Inpainting is the task of filling masked regions of an image with new content either because parts of the image are corrupted or to replace existing but undesired content within the image. We evaluate how our general approach for conditional image generation compares to more specialized, state-of-the-art approaches for this task. Our evaluation follows the protocol of LaMa [88], a recent inpainting model that introduces a specialized architecture relying on Fast Fourier Convolutions [8]. The exact training & evaluation protocol on Places [108] is described in Sec. E.2.2.
影像修補是一項將影像中被遮罩的區域以新內容填補的任務,這可能是因為影像部分受損或者想要替換掉現有但不希望保留的內容。我們評估了我們的條件式影像生成的方法在此任務中的表現,並將之與更專門的、最先進的技術進行比較。我們的評估依循著LaMa [88] 的協議,LaMa 是一個近期提出的修補模型,其專門架構依賴於快速傅立葉卷積(Fast Fourier Convolutions, FFC)[8]。在 Places [108] 資料集上的具體訓練與評估協議詳見Sec. E.2.2。
We first analyze the effect of different design choices for the first stage. In particular, we compare the inpainting efficiency of LDM-1 (i.e. a pixel-based conditional DM) with LDM-4, for both KL and VQ regularizations, as well as VQLDM-4 without any attention in the first stage (see Tab. 8), where the latter reduces GPU memory for decoding at high resolutions. For comparability, we fix the number of parameters for all models. Tab. 6 reports the training and sampling throughput at resolution 256^2^ and 512^2^ , the total training time in hours per epoch and the FID score on the validation split after six epochs. Overall, we observe a speed-up of at least 2.7× between pixel- and latent-based diffusion models while improving FID scores by a factor of at least 1.6×.
我們首先分析第一階段中不同設計選擇的影響。具體來說,我們比較了 LDM-1(即基於像素的條件式擴散模型,pixel-based conditional DM)與 LDM-4 在影像修補(inpainting)效率上的表現,包括 KL 與 VQ regularizations,以及第一階段中沒有加入任何注意力的 VQLDM-4(參見Tab. 8),後者降低了在高解析度解碼的情況下的GPU記憶體使用量。為了保證可比性,我們固定了所有模型的參數量。Tab. 6給出了在解析度 256^2^ 和 512^2^ 下的訓練與採樣吞吐量(throughput)、每個epoch總訓練時間(以小時計算),以及六個epochs後驗證集的 FID 分數。總體來說,我們觀察到,基於潛在空間(latent space)的擴散模型與基於像素的擴散模型相比,速度提升了至少 2.7 倍,同時 FID 分數改善了至少 1.6 倍。

Table 6. Assessing inpainting efficiency. †: Deviations from Fig. 7 due to varying GPU settings/batch sizes cf . the supplement.

The comparison with other inpainting approaches in Tab. 7 shows that our model with attention improves the overall image quality as measured by FID over that of [88]. LPIPS between the unmasked images and our samples is slightly higher than that of [88]. We attribute this to [88] only producing a single result which tends to recover more of an average image compared to the diverse results produced by our LDM $cf$ . Fig. 21. Additionally in a user study (Tab. 4) human subjects favor our results over those of [88].
Tab. 7說明了與其它修補方法的比較,我們帶有注意力機制的模型在 FID 指標上顯著提升了整體的影像品質,超越了 [88]。然而,在未被遮罩的影像與我們生成樣本之間的 LPIPS(感知損失)略高於 [88]。我們認為這是由於 [88] 僅生成一個結果,其傾向於恢復更接近平均影像的結果,而我們的 LDM 則能生成更多樣化的結果(參見Fig. 21)。此外,根據一項使用者研究(Tab. 4),受試者普遍更偏好我們的結果而非 [88] 的結果。

Table 7. Comparison of inpainting performance on 30k crops of size 512 × 512 from test images of Places [108]. The column 40-50% reports metrics computed over hard examples where 40-50% of the image region have to be inpainted. ^†^recomputed on our test set, since the original test set used in [88] was not available.
Based on these initial results, we also trained a larger diffusion model ($big$ in Tab. 7) in the latent space of the VQ-regularized first stage without attention. Following [15], the UNet of this diffusion model uses attention layers on three levels of its feature hierarchy, the BigGAN [3] residual block for up- and down-sampling and has 387M parameters instead of 215M. After training, we noticed a discrepancy in the quality of samples produced at resolutions 256^2^ and 512^2^ , which we hypothesize to be caused by the additional attention modules. However, fine-tuning the model for half an epoch at resolution 512^2^ allows the model to adjust to the new feature statistics and sets a new state of the art FID on image inpainting ($big$, $w/o \space attn$, $w/ ft$ in Tab. 7, Fig. 11.).
根據這些初步結果,我們在VQ-regularized的第一階段潛空間未使用注意力機制的情況下訓練了一個更大的擴散模型(Tab. 7中的 $big$)。根據 [15] 的方法,這個擴散模型的 UNet 在其特徵層級(feature hierarchy)的三個層級上加入注意力層,並採用了 BigGAN [3] 的residual block來處理上取樣和下取樣,並且該模型有著 387M 個參數量(原先的參數量為215M)。在訓練完成之後,我們注意到在解析度256^2^和512^2^下生成的樣本品質存在差異,我們推測這是由於額外的注意力模組所引起。然而,在解析度為512^2^的情況下對該模型進行半個 epoch 的微調後,模型得以適應新的特徵統計數據,並且在影像修補任務上創下了新的最先進(state-of-the-art)FID(參見Tab. 7中的 $big$, $w/o \space attn$, $w/ ft$,以及Fig. 11)。

Figure 11. Qualitative results on object removal with our $big, w/ft$ inpainting model. For more results, see Fig. 22.
## 5. Limitations & Societal Impact
**Limitations** While LDMs significantly reduce computational requirements compared to pixel-based approaches, their sequential sampling process is still slower than that of GANs. Moreover, the use of LDMs can be questionable when high precision is required: although the loss of image quality is very small in our $f = 4$ autoencoding models (see Fig. 1), their reconstruction capability can become a bottleneck for tasks that require fine-grained accuracy in pixel space. We assume that our super-resolution models (Sec. 4.4) are already somewhat limited in this respect.
**Limitations** 雖然 LDMs 相較於pixel-based的方法大幅減少計算需求,不過其逐步採樣的過程(sequential sampling process)仍然比 GANs 慢。此外,在需要高精度 (high precision) 的應用場景下,LDMs 的適用性還是會受到質疑:雖然我們在 $f = 4$ 自動編碼模型中(見Fig. 1)的影像品質損失很小,但其重建能力可能成為在像素空間需要細粒度準確性 (fine-grained accuracy) 任務的瓶頸。我們推測Sec. 4.4中所述的超解析度模型在這方面已經受到一定程度的限制。
**Societal Impact** Generative models for media like imagery are a double-edged sword: On the one hand, they enable various creative applications, and in particular approaches like ours that reduce the cost of training and inference have the potential to facilitate access to this technology and democratize its exploration. On the other hand, it also means that it becomes easier to create and disseminate manipulated data or spread misinformation and spam. In particular, the deliberate manipulation of images (“deep fakes”) is a common problem in this context, and women in particular are disproportionately affected by it [13, 24].
**Societal Impact** 生成式模型在媒體內容(如影像)的應用上是一把雙面刃:一方面,它們能實現多種創造性應用,特別是像我們這樣的降低訓練和推理成本的方法,有可能促進對該技術的使用,進而推動其普及與探索。另一方面,這也意味著更容易生成和散播被操縱的資料,或傳播錯誤資訊與垃圾內容。具體而言,故意操縱影像(deep fakes)是這種情境下的常見問題,尤其女性在這方面受到的影響尤為嚴重 [13, 24]。
Generative models can also reveal their training data [5, 90], which is of great concern when the data contain sensitive or personal information and were collected without explicit consent. However, the extent to which this also applies to DMs of images is not yet fully understood.
生成模型還可能會洩露其訓練資料 [5, 90],當這些資料包含敏感或個人資訊,且在未經明確同意的情況下被收集時,這將引發嚴重的擔憂。然而,這種洩露風險在影像的擴散模型中是否同樣適用,目前尚未完全理解。
Finally, deep learning modules tend to reproduce or exacerbate biases that are already present in the data [22, 38, 91]. While diffusion models achieve better coverage of the data distribution than e.g. GAN-based approaches, the extent to which our two-stage approach that combines adversarial training and a likelihood-based objective misrepresents the data remains an important research question.
最後,深度學習模組往往會重現或加劇訓練資料中已經存在的偏差 [22, 38, 91]。儘管擴散模型相較於基於生成對抗網路的方法,能更全面地覆蓋資料分佈,不過我們的兩階段方法結合了對抗訓練與基於似然的目標函數,對資料的潛在錯誤表示(misrepresents)程度仍是一個重要的研究課題。
For a more general, detailed discussion of the ethical considerations of deep generative models, see e.g. [13].
關於深度生成模型的倫理考量更廣泛和更詳細的討論,可參見文獻 [13]。
## 6. Conclusion
We have presented latent diffusion models, a simple and efficient way to significantly improve both the training and sampling efficiency of denoising diffusion models without degrading their quality. Based on this and our cross-attention conditioning mechanism, our experiments could demonstrate favorable results compared to state-of-the-art methods across a wide range of conditional image synthesis tasks without task-specific architectures.
我們提出了潛在擴散模型,這是一種簡單且高效的方法,能在不降低品質的情況下,顯著提升denoising diffusion models在訓練和採樣過程中的效率。基於這一點及我們的交叉注意力條件機制,我們的實驗結果顯示,在廣泛的條件式影像合成任務中,即使不使用特定任務的架構,與最先進的方法相比仍能展現出優異的結果。