# Global rebuttal
We thank the reviewers for their valuable feedback. Here, we compile the points we would like to share with all reviewers. Please see our responses addressing the specific concerns below:
### 1. Improving Fig. 1 for clarity
#### Reviewers *6jtZ* and *iEgM* suggested modifying Fig. 1 because it is too dense with information.
To enhance the comprehensibility of Fig. 1, we divide it into two figures. Please refer to Fig. R1 and R2 in the attached PDF file. **Figure R1** gives a conceptual visualization of the local basis derived from the pullback metric and succinctly outlines the procedural steps involved in obtaining it. **Figure R2** provides an overview of the image editing method using the discovered local basis and concisely presents its outcomes.
### 2. Additional subsection giving an overview of the whole image editing process
#### Reviewers *cKEc*, *6jtZ*, and *iEgM* asked for a clearer explanation of the editing process.
We create a new subsection in Section 3 that summarizes the entire image editing procedure. It provides clear explanations and visually illustrates the method via Fig. R2. The detailed contents are as follows:
---
### Section 3.4 Overall process of image editing
In this section, we summarize the entire editing process in five steps: 1) The input image is inverted into the initial noise $\mathbf{x}_T$ using DDIM inversion. 2) $\mathbf{x}_T$ is gradually denoised until timestep $t$ through DDIM generation. 3) The local latent basis $\{ \mathbf{v}_1, \cdots, \mathbf{v}_n\}$ is identified using the pullback metric at $t$. 4) $\mathbf{x}_t$ is manipulated along one of the basis vectors using $\mathbf{x}$-space guidance. 5) DDIM generation is then completed with the modified latent variable $\mathbf{x}'_t$. Figure 2 illustrates the entire editing process.
For a text-to-image model such as Stable Diffusion, a text condition can be incorporated when deriving the local basis vectors. This ensures that all local basis vectors align with the provided text; a detailed explanation is given in Section 4.1.
It is noteworthy that although our approach moves the latent variable within a single timestep, it achieves semantically meaningful image editing. Moreover, to the best of our knowledge, directly manipulating the latent variable $\mathbf{x}_t$ of diffusion models is itself a novel approach.
---
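For the reviewers' convenience, below is a minimal, self-contained sketch of step 4, i.e., $\mathbf{x}$-space guidance (Eq. 4), $\tilde{\mathbf{x}} = \mathbf{x} + \gamma\,[\epsilon_{\theta}(\mathbf{x}+\mathbf{v}) - \epsilon_{\theta}(\mathbf{x})]$. The tiny `eps_theta` network and the value of `gamma` are placeholders for illustration only; in practice, $\epsilon_{\theta}$ is the frozen pretrained denoiser and $\mathbf{v}$ is a local basis vector identified in step 3.

```python
import torch
import torch.nn as nn

# Toy stand-in for the pretrained denoiser eps_theta (the real one is the frozen U-Net).
eps_theta = nn.Sequential(nn.Linear(64, 128), nn.SiLU(), nn.Linear(128, 64))

def x_space_guidance(x_t, v, gamma=1.0):
    """Eq. 4: x_tilde = x + gamma * [eps_theta(x + v) - eps_theta(x)]."""
    with torch.no_grad():
        return x_t + gamma * (eps_theta(x_t + v) - eps_theta(x_t))

x_t = torch.randn(1, 64)          # latent variable at timestep t (step 2)
v = torch.randn(1, 64)            # a discovered local basis vector (step 3)
v = v / v.norm()
x_t_edited = x_space_guidance(x_t, v, gamma=2.0)
# x_t_edited replaces x_t, and DDIM generation continues as usual (step 5).
```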
### 3. Comparative experiments with other state-of-the-art (SoTA) editing methods
#### Reviewers *iEgM* and *6jtZ* suggested including a comparative analysis covering both qualitative results and runtime.
We conduct qualitative comparisons with text-guided image editing methods. Our state-of-the-art baseline methods include: (i) [SDEdit], (ii) [Pix2Pix-zero], (iii) [PnP], and (iv) [Instruct Pix2Pix]. All comparisons were performed using the official code. **Please refer to Fig. R3 in the PDF for the qualitative results.**
We also compare the runtime of each method. For a fair comparison, we identify only the first singular vector $\mathbf{v}_1$ (i.e., $n=1$) and set the number of DDIM steps to 50. All experiments were conducted on an Nvidia RTX 3090 with 24GB of VRAM. The runtime for each method is as follows:
| Image Edit Method | running time | Preprocessing |
|:-----------------:|:-------------:|:-------------:|
| Ours | 11 sec |N/A |
| SDEdit | 4 sec |N/A |
| Pix2Pix-zero | 25 sec |4 min |
| PnP | 10 sec |40 sec |
| Instruct Pix2Pix | 11 sec |N/A |
The computational cost of our method remains comparable to that of other approaches, although the Jacobian approximation takes around 2.5 seconds for $n=1$. This is because we only need to identify the latent basis vector once at a specific timestep. Furthermore, our approach does not require additional preprocessing steps such as generating 100 prompts with GPT and obtaining their embedding vectors (as in Pix2Pix-zero) or storing feature vectors, queries, and key values (as in PnP). Our method also does not require finetuning (as in Instruct Pix2Pix). This leads to a significantly shorter total editing time compared to other methods.
#### References
[SDEdit] : SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations, Chenlin Meng et al., 2021
[Pix2Pix-zero] : Zero-shot Image-to-Image Translation, Gaurav Parmar et al., 2023
[PnP] : Plug-and-Play Diffusion Features for Text-Driven Image-to-Image Translation, Narek Tumanyan et al., 2022
[Instruct Pix2Pix] : InstructPix2Pix: Learning to Follow Image Editing Instructions, Tim Brooks et al., 2022
# Reviewer cKEc
Thank you for acknowledging the strengths of our paper: being the first to study the behavior of the latent space of diffusion models (e.g., the coarse-to-fine behavior and the divergence of tangent spaces across different samples) and achieving meaningful image editing with the pullback metric.
---
> [W1] it is not really clear to me how the editing process works (sections 3.3 and 3.4)
[W1] A new subsection has been added to succinctly summarize the complete image editing procedure. Accompanied by Fig. R2, it offers a visual overview along with concise explanations, helping readers grasp how the elements introduced in each section come together in the overall editing process. Further information is available in global rebuttal paragraph 2.
> + [W1-a] In 3.3, the letter v is used to indicate elements of both T_x and T_h, so it is not always clear to which space they are referring.
[W1-a] Thank you for identifying the typo $\mathbf{v} \in \mathcal{T}_\mathbf{h}$ in Section 3.3, Line 148; it should read $\mathbf{v} \in \mathcal{T}_\mathbf{x}$. The notations $\mathbf{v}$ and $\mathbf{u}$ are reserved for elements of $\mathcal{T}_\mathbf{x}$ and $\mathcal{T}_\mathbf{h}$, respectively.
> + [W1-b] In general, it is not clear why the idea expressed in 3.3 is useful and where it was adopted.
[W1-b] Parallel transport (P.T.) is a concept in differential geometry for transporting a geometric object, such as a vector, between tangent spaces while maintaining its direction relative to the space. We use P.T. for two purposes (a short code sketch is provided at the end of this response):
* First, we use it for image editing. Due to the unsupervised nature of our editing method, we must manually confirm the semantics of each latent basis vector from the edited results. For example, suppose we want to edit 10 images of men into women. Without P.T., we would need to manually check the semantics of each latent basis vector in $\{\mathbf{v}_1, \cdots, \mathbf{v}_n\}$ to find the 'woman' vector for every image. With P.T., once we find the 'woman' basis vector in one image, we can directly edit the other images, without needing to locate the 'woman' vector in their own local latent bases.
* Second, we employ it to verify the similarity of local geometrical structures across samples, as shown in Fig. 7. This empirically demonstrates that the geodesic distance between local geometrical structures, depicted in Fig. 6, is not merely a conceptual measure: it tangibly influences the behavior of the diffusion model.
We revise the manuscript to better showcase the advantages of using P.T.:
>> (L144) ... the basis vector obtained cannot be used for another sample $\mathbf{x}'$ because $\mathbf{v}\notin\mathcal{T}_{\mathbf{x}'}$. This means that even if we have discovered a vector that edits a man's image into a woman for one image, we would still need to re-identify the 'woman' vector within the latent basis $\{\mathbf{v}_1, \cdots, \mathbf{v}_n\}$ of each new image. Reusing the basis vector obtained from another image allows us to avoid finding the 'woman' vector again.
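Below is the short code sketch mentioned above. It is a minimal illustration (not an excerpt from our implementation): assuming $\mathcal{H}$ is approximately Euclidean, the $\mathbf{h}$-space direction $\mathbf{u}$ found for one sample is reused for a new sample and pulled back into its $\mathcal{X}$-tangent space through a vector-Jacobian product; the toy `feature_map` merely stands in for the frozen encoder $\mathbf{x}_t \mapsto \mathbf{h}_t$.

```python
import torch

torch.manual_seed(0)
d, m = 64, 16
W1, W2 = torch.randn(d, 32), torch.randn(32, m)

def feature_map(x):            # toy stand-in for the frozen encoder x_t -> h_t
    return torch.tanh(x @ W1) @ W2

u = torch.randn(m)             # semantic ('woman') direction found in T_h for sample 1
u = u / u.norm()
x_prime = torch.randn(d)       # a different sample at the same timestep

# Pull the shared h-space direction back through the new sample's Jacobian:
# v' proportional to J'^T u, computed via a vector-Jacobian product.
_, v_prime = torch.autograd.functional.vjp(feature_map, x_prime, u)
v_prime = v_prime / v_prime.norm()
# v_prime can now drive x-space guidance for x_prime, without re-identifying
# the 'woman' direction within x_prime's own local latent basis.
```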
> + [W1-c-i] In eq 4 what is the epsilon function? [W1-c-ii] In general, isn’t the vector toward which to shift selected from T_H (so it should be u) and then transferred to T_x?
[W1-c-i] Thank you for pointing out the notation issue. Here $\epsilon$ is the denoising function of the pretrained diffusion model. For clarity, we will explicitly state this as $\epsilon_{\theta}$ in the revised manuscript:
>> (L166) ... where $\epsilon_{\theta}$ is the denoising function of the pretrained diffusion model, and $\gamma$ is ...
[W1-c-ii] Through the decomposition of the Jacobian, we obtain both $\mathbf{v}$ and its corresponding $\mathbf{u}$ simultaneously. However, only $\mathbf{v}$ is employed during the editing procedure. We accomplish semantically meaningful image editing by adjusting the latent variable $\mathbf{x}_t$ under the guidance of $\mathbf{v}$. For your reference, $\mathbf{u}$ is used specifically for parallel transport.
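To illustrate this point, here is a small sketch with a toy map standing in for the frozen encoder: one SVD of the Jacobian yields the right singular vectors $\mathbf{v}_i \in \mathcal{T}_\mathbf{x}$ (used for editing) and the left singular vectors $\mathbf{u}_i \in \mathcal{T}_\mathbf{h}$ (used for parallel transport) at the same time. The dimensions and the map are illustrative only.

```python
import torch

torch.manual_seed(0)
d, m = 8, 4
A, B = torch.randn(d, 6), torch.randn(6, m)

def feature_map(x):                                          # toy stand-in for x_t -> h_t
    return torch.tanh(x @ A) @ B

x_t = torch.randn(d)
J = torch.autograd.functional.jacobian(feature_map, x_t)     # shape (m, d)

U, S, Vh = torch.linalg.svd(J, full_matrices=False)
v1 = Vh[0]       # top right singular vector: lives in T_x, used for editing
u1 = U[:, 0]     # top left singular vector: lives in T_h, used for parallel transport
print(torch.allclose(J @ v1, S[0] * u1, atol=1e-5))          # J v_1 = sigma_1 u_1
```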
> [W2] Another concern is about the generalization of the proposed method to other diffusion techniques, or with other score models (i.e. not UNet). I think that this point needs more discussion.
> [L1] A discussion on how the proposed study may generalize to other domains and architectures would be of value.
[W2, L1] Thank you for raising this important issue. Our method is potentially applicable whenever a feature space with a Euclidean metric exists within the DM, analogous to the $h$-space in [Asyrp]. This has been demonstrated for the U-Net in [Asyrp]. However, whether other architectures, such as those akin to [DiT] or [MotionDiffusion], also exhibit Euclidean metric properties is an intriguing question for future research. We add this discussion to the revised manuscript as follows:
>> Our approach exhibits broad applicability in cases where a feature space within the diffusion model possesses a Euclidean metric, as exemplified by $\mathcal{H}$. This characteristic has been observed in the context of U-Net within [Asyrp]. The question of whether alternative architectures, such as those resembling the structures of [DiT] or [MotionDiffusion], could also manifest a Euclidean metric, presents an intriguing avenue for future investigation.
> [Q1] I was wondering what would happen if, instead of moving along one of the principal axis of T_H, you use directly the principal axis of T_x.
[Q1] As previously mentioned in [W1-c-ii], we directly manipulate the latent variable $\mathbf{x}_t$ with the discovered latent basis vector $\mathbf{v}$. This approach offers several advantages over editing $\mathcal{H}$ as described in [Asyrp]. First, direct control over $\mathbf{h}_t$ disrupts the coherence between inner features and skip connections within the U-Net, which can lead to artifacts, particularly when making substantial adjustments [DiffStyle]. As a result, [Asyrp] implements gradual changes across multiple timesteps. In contrast, manipulating $\mathbf{x}_t$ enables more substantial alterations within a single timestep. Second, unlike $\mathbf{u}$, which exclusively affects the deepest feature map, $\mathbf{v}$ influences not only the latent variable $\mathbf{x}_t$ but also the feature maps that depend on $\mathbf{x}_t$, as well as all skip connections.
> [Q2] I would also discuss a bit more method [25] in the related work since it seems related to the proposed method.
[Q2] To better distinguish our paper from [Asyrp], we modify the related work section as follows:
>> ... Kwon et al. [25] demonstrated that the bottleneck of the U-Net, $\mathcal{H}$, can be used as a semantic latent space. Specifically, they used CLIP to identify directions within $\mathcal{H}$ that facilitate genuine image editing. ... In contrast to their method, our editing approach directly manipulates latent variables within the latent space. Furthermore, we identify editing directions in an unsupervised manner.
> [Q3] Check the sentence at rows 74-75
> Row 152: we aims
> general grammar check
[Q3] Thank you for the careful comments. We will address and incorporate these revisions in the upcoming manuscript.
[Asyrp] : Diffusion Models already have a Semantic Latent Space, Mingi Kwon et al., 2022
[DiT] : Scalable Diffusion Models with Transformers, William Peebles et al., 2022
[MotionDiffusion] : Human Motion Diffusion Model, Guy Tevet et al., 2022
[DiffStyle] : Training-free Style Transfer Emerges from h-space in Diffusion models, Jaeseok Jeong et al., 2023
# Reviewer vuQy
> [W1] Some of the statements and claims are not clear as pointed in the Questions.
> [Q1] It is stated that “To investigate the geometry of the tangent basis, we employ a metric on the Grassmannian manifold.” However, the space could be identified by another manifold as well. Why and how did you define the space by the Grassmannian manifold?
[Q1] The tangent spaces $\mathcal{T}_{\mathbf{x}}$ and $\mathcal{T}_{\mathbf{h}}$ are subspaces attached to the points $\mathbf{x} \in \mathcal{X}$ and $\mathbf{h} \in \mathcal{H}$ of diffusion models. Given that the Grassmannian manifold is defined as a manifold of subspaces, it is well-suited for representing the manifold of the $\mathcal{T}_{\mathbf{h}}$s. Additionally, the geodesic metric on this manifold quantifies the separation between two subspaces via their principal angles. Hence, we consider this geodesic metric a robust means of evaluating the similarity between different $\mathcal{T}_{\mathbf{h}}$s.
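For reference, the standard geodesic distance on the Grassmannian is $d(\mathcal{T}, \mathcal{T}') = \big(\sum_i \theta_i^2\big)^{1/2}$, where $\theta_i$ are the principal angles between the two subspaces. The following is a minimal sketch of this computation; the spanning matrices and dimensions are illustrative.

```python
import torch

def grassmann_geodesic_distance(B1, B2):
    """Geodesic distance between the subspaces spanned by the columns of B1 and B2,
    computed from the principal angles theta_i as sqrt(sum_i theta_i^2)."""
    Q1, _ = torch.linalg.qr(B1)                       # orthonormal basis of span(B1)
    Q2, _ = torch.linalg.qr(B2)
    cosines = torch.linalg.svdvals(Q1.T @ Q2).clamp(-1.0, 1.0)
    theta = torch.arccos(cosines)                     # principal angles
    return theta.norm()

# Toy example: two 3-dimensional subspaces of R^10.
torch.manual_seed(0)
print(grassmann_geodesic_distance(torch.randn(10, 3), torch.randn(10, 3)))
B = torch.randn(10, 3)
print(grassmann_geodesic_distance(B, B))              # identical subspaces -> ~0
```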
> [Q2-a] It is claimed that “the similarity across tangent spaces allows us to effectively transfer the latent basis from one sample to another through parallel transport”. How this improves effectiveness was not analyzed. In general, [Q2-b] how do the proposed methods improve training and inference time? Indeed, the additional transformations can increase training and inference time. Could you please provide an analysis of the footprints?
[Q2-a] Figure 7 shows the impact of employing parallel transport (P.T.). The transported vectors can edit distinct samples while manipulating the same attributes. We will modify Line 227 as follows to better reflect our intent:
>> [Line 227-229] ... us to effectively transfer the latent basis from one sample to another through parallel transport as shown in Figure 7.
>> $\rightarrow$ ... us to successfully transfer the latent basis vector from one sample to another through parallel transport. Figure 7 shows that the transported vectors can edit distinct samples while manipulating the same attributes.
[Q2-b] We would like to emphasize that our approach does not require any form of training. Furthermore, P.T. is not a default step of the editing process; it is applied only in the specific case where local basis vectors from other samples are reused. The footprint of typical editing without P.T. is detailed in global rebuttal paragraph 3; here, for reference, we present only the result table:
| Image Edit Method | running time | Preprocessing |
|:-----------------:|:-------------:|:-------------:|
| Ours | 11 sec |N/A |
| [SDEdit] | 4 sec |N/A |
| [Pix2Pix-zero] | 25 sec |4 min |
| [PnP] | 10 sec |40 sec |
| [Instruct Pix2Pix] | 11 sec |N/A |
The runtime of editing with P.T. can be broken down into three components: 1) DDIM inversion and generation, 2) identification of the local basis, and 3) parallel transport itself. When employing P.T., it is advisable to use a sufficiently large $n$ to mitigate distortion, which in turn requires more computation time to obtain the local basis. Here, we use $n=50$, $100$ DDIM steps, and an unconditional diffusion model trained on CelebA-HQ.
|DDIM inversion + generation| Identification of local basis| Parallel transport |
|:--:|:--:|:--:|
| 10 sec | 100 sec | 0.002 sec |
> [W2] The results are given considering the Riemannian geometry of the latent spaces and utilizing the related transformations (e.g. PT) among tangent spaces on the manifolds. However, vanilla DMs do not employ these transformations. Therefore, it is not clear whether these results are for vanilla DMs or the DMs utilizing the proposed transformations.
[W2] We wish to emphasize that our method works on frozen vanilla DMs without necessitating any fine-tuning or architectural modifications. It offers unsupervised image editing capabilities that are applicable to both unconditional DMs and conditional DMs. Furthermore, as explained in [Q2], it is important to note that the inclusion of P.T. is not a default step in our editing process. This technique is selectively employed in particular cases where local basis vectors are transferred for editing other samples.
> [W3] A major claim is that the proposed methods improve effectiveness of the DMs. However, the employed transformations can increase the footprints of DMs.
[W3] To clarify, our primary focus is image editing using DMs; we do not aim to improve the performance of DMs themselves.
When performing image editing with a single latent basis vector (i.e., $n=1$), our editing process takes approximately 11 seconds, which is roughly 15% of the time required for vanilla inversion and reconstruction. Furthermore, even with the Jacobian approximation at $n=1$, the computational overhead of our approach remains comparable to other state-of-the-art editing methods, as shown in the table above.
> [L1] Some of the limitations were addressed but potential impacts were not addressed.
[L1] We appreciate your observation and would like to address this concern by incorporating a societal impact and ethics statement into the revised manuscript, which is presented below:
>> **Societal Impact / Ethics Statement.** Our research endeavors to unravel the geometric structures of the diffusion model and facilitate high-quality image editing within its framework. While our primary application resides within the creative realm, it is important to acknowledge that image manipulation techniques, such as the one proposed in our method, hold the potential for misuse, including the dissemination of misinformation or potential privacy implications. Therefore, the continuous advancement of technologies aimed at thwarting [Mist, Immunize] or identifying [Detect, Defake] manipulations rooted in generative models remains of utmost significance.
#### References
[SDEdit] : SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations, Chenlin Meng et al., 2021
[Pix2Pix-zero] : Zero-shot Image-to-Image Translation, Gaurav Parmar et al., 2023
[PnP] : Plug-and-Play Diffusion Features for Text-Driven Image-to-Image Translation, Narek Tumanyan et al., 2022
[Instruct Pix2Pix] : InstructPix2Pix: Learning to Follow Image Editing Instructions, Tim Brooks et al., 2022
[Mist] : Raising the cost of malicious ai-powered image editing, Salman et al., 2023
[Immunize] : Adversarial Example Does Good: Preventing Painting Imitation from Diffusion Models via Adversarial Examples, Liang et al., 2023
[Detect] : On the detection of synthetic images generated by diffusion models, Riccardo Corvi et al., 2022.
[Defake] : Defake: Detection and attribution of fake images generated by text-to-image diffusion models, Zeyang Sha et al., 2022
# Reviewer 6jtZ
Thank you for acknowledging the strengths of our paper: a distinctive idea for editing in diffusion models (DMs), and an improved understanding of latent space dynamics, e.g., the evolution of the geometric structure over time and the influence of various text conditions.
> [W1] (...), the method and notation used in this work can lead to some confusion. For instance, [W1-a] Section 3, in its current form, may not be as accessible to all readers as it could be, and it could benefit from being revised for clearer communication of the ideas contained therein. [W1-b] Additionally, Fig. 1, which is presumably intended to illustrate key concepts, is perhaps too dense with information. Dividing Fig. 1 into two separate figures could make it easier to digest, enabling a clearer explanation of the approach.
[W1] Thank you for the constructive suggestion. Here, we restate the global rebuttal regarding Fig. 1 (paragraph 1) and Section 3 (paragraph 2) for your convenience:
In order to make Fig. 1 more comprehensible, we divide it into two separate figures. Please refer to Fig. R1 and R2 in the global rebuttal PDF. First, **Figure R1** conceptually visualizes the local basis found through the pullback metric and summarizes the process for obtaining it. Second, **Figure R2** provides an overview of the image editing method using the discovered local basis and briefly presents its results.
To clarify the image editing process (Section 3), we added a new subsection summarizing the overall procedure. This includes a visual overview in Fig. R2 and clear textual explanations.
For more details on the improvements to Figure 1 and Section 3, please refer to global rebuttal paragraphs 1 and 2, respectively.
> [W2] A second aspect that could be improved upon is the overall presentation and proofreading of the paper. While the approach is relatively simple, its translation into the written form has not been as clear as one would hope. The text could benefit from a thorough proofreading to ensure that it is not just grammatically correct, but also that it conveys the authors' ideas in a way that is accessible to the wider machine learning community. As it stands, the paper's usefulness to this community may be hindered by its presentation.
[W2] Thank you for the advice on improving clarity. We have made the following improvements to help readers understand more easily:
+ We move parallel transport (P.T.) from Section 3.3 to the last subsection of Section 3. Previously, P.T. was introduced amid the explanation of the image editing process, potentially leading to the misconception that P.T. is a default part of our editing method, whereas it is only utilized in the specific scenario of editing with a basis vector from another sample. In its new placement, P.T. is clearly introduced as a specialized approach for importing latent basis vectors from other samples, after the image editing process has been fully presented.
+ We consolidate all components pertaining to the image editing method within Section 3. For instance, the process of generating the initial noise $\mathbf{x}_T$ through DDIM inversion, previously outlined in Section 4 (L178-182), is now relocated to Section 3. This reorganization ensures that readers can comprehend the complete editing process by referring to Section 3 alone.
+ To make Fig. 8 more digestible, we move Fig. 8-(b) and the content of L243-L248 to the appendix. Given that Figure 8-(b) serves to validate the outcomes in (a) and (c), we deemed this information more suitable for the appendix.
We will also carefully identify any remaining shortcomings and revise them in the camera-ready version.
> [W3] Lastly, the paper could do more to address the computational implications of its approach. The authors use the power method to approximate the Jacobian, which, while effective, can be computationally costly. It would be beneficial if the authors were more transparent about this fact, allowing readers to fully understand the computational demands of the approach and evaluate whether or not it would be feasible in their own applications. (...)
> [L1] I would like to see a deeper analysis of the computational complexity and runtime of the approach.
[W3, L1] In Appendix A, we report the computation time required to approximate the Singular Value Decomposition (SVD) of the Jacobian using the power method. The runtime of the power method depends on the parameter $n$, i.e., the rank of the low-rank approximation of the original tangent space. Smaller values of $n$ result in shorter computation times. For instance, $n=3$ takes approximately 10 seconds, whereas $n=50$ takes around 3 to 4 minutes for Stable Diffusion. In particular, when only a single basis vector is required, as in text-conditional editing, approximating the SVD of the Jacobian is remarkably fast (around 2.5 seconds).
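To make the procedure concrete, below is a minimal sketch of the power iteration for the top singular vector of the Jacobian, using only Jacobian-vector and vector-Jacobian products so the Jacobian is never materialized. The toy `feature_map` is only a stand-in for the frozen U-Net encoder evaluated at timestep $t$; for $n > 1$, one would orthogonalize against previously found vectors (block/subspace iteration).

```python
import torch

torch.manual_seed(0)
d, m = 64, 16
W1, W2 = torch.randn(d, 32), torch.randn(32, m)

def feature_map(x):                  # toy stand-in for the encoder x_t -> h_t
    return torch.tanh(x @ W1) @ W2

x_t = torch.randn(d)

# Power iteration on J^T J (J = dh/dx), using only jvp/vjp.
v = torch.randn(d)
v = v / v.norm()
for _ in range(30):
    _, Jv = torch.autograd.functional.jvp(feature_map, x_t, v)       # J v
    _, JtJv = torch.autograd.functional.vjp(feature_map, x_t, Jv)    # J^T (J v)
    v = JtJv / JtJv.norm()

_, Jv = torch.autograd.functional.jvp(feature_map, x_t, v)
sigma = Jv.norm()                    # top singular value
u = Jv / sigma                       # corresponding direction in T_h
# v approximates the first local latent basis vector v_1 (the n = 1 case).
```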
> [Q1] What is the complexity of computing the Jacobian? What is the the actual runtime (in seconds) of the approach compared to other editing methods? I think answering these questions can provide a good context for readers.
[Q1] Thank you for your valuable question. We conducted all comparisons on an Nvidia RTX 3090 with 24GB of VRAM. To ensure a fair comparison, we set $n=1$ and used 50 DDIM steps. The time taken by each method is as follows:
| Image Edit Method | running time |
|:-----------------:|:-------------:|
| Ours | 11 sec |
| [SDEdit] | 4 sec |
| [Pix2Pix-zero] | 25 sec |
| [PnP] | 10 sec |
| [Instruct Pix2Pix] | 11 sec |
For a more detailed discussion, please refer to global rebuttal paragraph 3.
#### References
[SDEdit] : SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations, Chenlin Meng et al., 2021
[Pix2Pix-zero] : Zero-shot Image-to-Image Translation, Gaurav Parmar et al., 2023
[PnP] : Plug-and-Play Diffusion Features for Text-Driven Image-to-Image Translation, Narek Tumanyan et al., 2022
[Instruct Pix2Pix] : InstructPix2Pix: Learning to Follow Image Editing Instructions, Tim Brooks et al., 2022
# Reviewer iEgM
Thank you for acknowledging our strengths:
- uncovering the effect of the text prompt and dataset complexity on the latent space.
- confirming the coarse-to-fine behaviour of diffusion models (DMs).
- enhancing credibility through experimental validation.
> [W1] The paper lacks comparisons with other diffusion-based image editing techniques, like [7,18].
> [Q1] To make the paper applicable to real-world scenarios right away, the authors should include a comparative analysis with other image editing techniques.
[W1, Q1] Thank you for the great suggestion. We conduct qualitative comparisons with text-guided image editing methods. Our state-of-the-art baseline methods include: (i) [SDEdit], (ii) [Pix2Pix-zero], (iii) [PnP], and (iv) [Instruct Pix2Pix]. All comparisons were performed using the official code. Please refer to global rebuttal paragraph 3 and Fig. R3 in the global rebuttal PDF for the results.
> [W2] The presentation and clarity of the paper could be improved. [W2-a] For example, the abstract contains too much detail, making it challenging to understand upon initial reading. [W2-b] Also the explanation of the image editing technique could be improved: [W2-b-i] what is DDIM (section 4)? [W2-b-ii] what is epsilon in Equation 4? [W2-b-iii] Finally, Figure 1 is not sufficiently clear to me, it may hampers the reader's comprehension.
[W2-a] Thank you for the constructive suggestion. We revise the abstract to be more concise and digestible, as follows:
>> Despite the success of diffusion models (DMs), we still lack a thorough understanding of their latent space. To understand the latent space $\mathbf{x}_t \in \mathcal{X}$, we analyze it from a geometrical perspective. Our approach derives the local latent basis within $\mathcal{X}$ by leveraging the pullback metric associated with the encoding feature maps. Remarkably, the discovered local latent basis enables image editing by moving $\mathbf{x}_t$, the latent variable of DMs, along a basis direction at a specific timestep.
>>
>> We further analyze how the geometric structure of DMs evolves over diffusion timesteps and differs across text conditions. This confirms the known phenomenon of coarse-to-fine generation and reveals novel insights, such as the discrepancy between $\mathbf{x}_t$ across timesteps, the effect of dataset complexity, and the time-varying influence of text prompts. To the best of our knowledge, this paper is the first to present image editing through $\mathbf{x}$-space traversal, editing only once at a specific timestep $t$ without any additional training, and to provide thorough analyses of the latent structure of DMs.
[W2-b] Thank you for pointing out the clarity issue.
[W2-b-i] To make the image editing process easier to understand, we created a summary subsection and an overview figure for the whole process. Please refer to global rebuttal paragraph 2 and Fig. R2 in the global rebuttal PDF for more details. In our method, [DDIM] is used to invert the image into the initial noise $\mathbf{x}_T$, and again for denoising to generate the image.
[W2-b-ii] $\epsilon$ is the denoising function of the pretrained diffusion model. For clarity, we will write it as $\epsilon_{\theta}$ in the revised manuscript and state it explicitly as below:
>> (L166) ... where $\epsilon_{\theta}$ is the denoising function of the pretrained diffusion model, and $\gamma$ is ...
[W2-b-iii] Thank you for the valuable suggestion. Please refer to global rebuttal paragraph 1 and Fig. R1, R2 in the PDF for improvements to Fig. 1.
Below is an excerpt from global rebuttal paragraph 1, restated here for the reviewer's convenience:
In order to make Fig. 1 more comprehensible, we divide it into two separate figures. Please refer to Fig. R1 and R2 in the global rebuttal PDF. First, **Figure R1** conceptually visualizes the local basis found through the pullback metric and summarizes the process for obtaining it. Second, **Figure R2** provides an overview of the image editing method using the discovered local basis and briefly presents its results.
> Some potentially interesting references
Thank you for the valuable references. We will add the suggested references to the revised manuscript.
> Typos
Thank you for the careful comments. We will address and incorporate these revisions in the upcoming manuscript.
#### References
[SDEdit] : SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations, Chenlin Meng et al., 2021
[Pix2Pix-zero] : Zero-shot Image-to-Image Translation, Gaurav Parmar et al., 2023
[PnP] : Plug-and-Play Diffusion Features for Text-Driven Image-to-Image Translation, Narek Tumanyan et al., 2022
[Instruct Pix2Pix] : InstructPix2Pix: Learning to Follow Image Editing Instructions, Tim Brooks et al., 2022
[DDIM] : Denoising Diffusion Implicit Models, Jiaming Song et al., 2020