# Global rebuttal
We thank the reviewers for their valuable feedback. Here, we compile the points we would like to share with all reviewers. Please see our responses addressing the specific concerns below:
### 1. Improving Fig. 1 for clarity
#### Reviewers *6jtZ* and *iEgM* suggested modifying Fig. 1 because it is too dense with information.
To enhance the comprehensibility of Fig. 1, we divide it into two figures. Please refer to Fig. R1 and R2 in the attached PDF file. **Figure R1** gives a conceptual visualization of the local basis derived from the pullback metric and succinctly outlines the procedural steps involved in obtaining it. **Figure R2** provides an overview of the image editing method using the discovered local basis and concisely presents its outcomes.
### 2. Additional subsection giving an overview of the whole image editing process
#### Reviewers *cKEc*, *6jtZ*, and *iEgM* asked for a clearer explanation of the editing process.
We create a new subsection in Section 3 that summarizes the entire image editing procedure. It provides clear explanations and visually illustrates the method via Fig. R2. The detailed contents are as follows:
---
### Section 3.4 Overall process of image editing
In this section, we summarize the entire editing process in five steps: 1) The input image is inverted into the initial noise $\mathbf{x}_T$ using DDIM inversion. 2) $\mathbf{x}_T$ is gradually denoised until timestep $t$ through DDIM generation. 3) The local latent basis $\{ \mathbf{v}_1, \cdots, \mathbf{v}_n\}$ is identified using the pullback metric at $t$. 4) $\mathbf{x}_t$ is manipulated along one of the basis vectors using $\mathbf{x}$-space guidance. 5) DDIM generation is then completed with the modified latent variable $\mathbf{x}'_t$. Figure 2 illustrates the entire editing process.
For a text-to-image model such as Stable Diffusion, a text condition can be incorporated when deriving the local basis vectors. This ensures that all local basis vectors align with the provided text; a detailed explanation is given in Section 4.1.
It is noteworthy that although our approach moves the latent variable within a single timestep, it achieves semantically meaningful image editing. Moreover, to the best of our knowledge, directly manipulating the latent variable $\mathbf{x}_t$ of diffusion models is itself a novel approach.
---
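For the reviewers' convenience, below is a minimal, self-contained sketch of step 4, i.e., $\mathbf{x}$-space guidance (Eq. 4), $\tilde{\mathbf{x}} = \mathbf{x} + \gamma\,[\epsilon_{\theta}(\mathbf{x}+\mathbf{v}) - \epsilon_{\theta}(\mathbf{x})]$. The tiny `eps_theta` network and the value of `gamma` are placeholders for illustration only; in practice, $\epsilon_{\theta}$ is the frozen pretrained denoiser and $\mathbf{v}$ is a local basis vector identified in step 3.

```python
import torch
import torch.nn as nn

# Toy stand-in for the pretrained denoiser eps_theta (the real one is the frozen U-Net).
eps_theta = nn.Sequential(nn.Linear(64, 128), nn.SiLU(), nn.Linear(128, 64))

def x_space_guidance(x_t, v, gamma=1.0):
    """Eq. 4: x_tilde = x + gamma * [eps_theta(x + v) - eps_theta(x)]."""
    with torch.no_grad():
        return x_t + gamma * (eps_theta(x_t + v) - eps_theta(x_t))

x_t = torch.randn(1, 64)          # latent variable at timestep t (step 2)
v = torch.randn(1, 64)            # a discovered local basis vector (step 3)
v = v / v.norm()
x_t_edited = x_space_guidance(x_t, v, gamma=2.0)
# x_t_edited replaces x_t, and DDIM generation continues as usual (step 5).
```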
### 3. Comparative experiments with other state-of-the-art (SoTA) editing methods
#### Reviewers *iEgM* and *6jtZ* suggested including a comparative analysis covering both qualitative results and runtime.
We conduct qualitative comparisons with text-guided image editing methods. Our state-of-the-art baseline methods include: (i) [SDEdit], (ii) [Pix2Pix-zero], (iii) [PnP], and (iv) [Instruct Pix2Pix]. All comparisons were performed using the official code. **Please refer to Fig. R3 in the PDF for the qualitative results.**
We also compare the runtime of each method. For a fair comparison, we identify only the first singular vector $\mathbf{v}_1$ (i.e., $n=1$) and set the number of DDIM steps to 50. All experiments were conducted on an Nvidia RTX 3090 with 24GB of VRAM. The runtime for each method is as follows:
| Image Edit Method | running time | Preprocessing |
|:-----------------:|:-------------:|:-------------:|
| Ours | 11 sec |N/A |
| SDEdit | 4 sec |N/A |
| Pix2Pix-zero | 25 sec |4 min |
| PnP | 10 sec |40 sec |
| Instruct Pix2Pix | 11 sec |N/A |
The computational cost of our method remains comparable to that of other approaches, although the Jacobian approximation takes around 2.5 seconds for $n=1$. This is because we only need to identify the latent basis vector once at a specific timestep. Furthermore, our approach does not require additional preprocessing steps such as generating 100 prompts with GPT and obtaining their embedding vectors (as in Pix2Pix-zero) or storing feature vectors, queries, and key values (as in PnP). Our method also does not require finetuning (as in Instruct Pix2Pix). This leads to a significantly shorter total editing time compared to other methods.
#### References
[SDEdit] : SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations, Chenlin Meng et al., 2021
[Pix2Pix-zero] : Zero-shot Image-to-Image Translation, Gaurav Parmar et al., 2023
[PnP] : Plug-and-Play Diffusion Features for Text-Driven Image-to-Image Translation, Narek Tumanyan et al., 2022
[Instruct Pix2Pix] : InstructPix2Pix: Learning to Follow Image Editing Instructions, Tim Brooks et al., 2022
# Reviewer cKEc
Thank you for acknowledging the strengths of our paper: being the first to study the behavior of the latent space of diffusion models (e.g., the coarse-to-fine behavior and the divergence of tangent spaces across different samples) and achieving meaningful image editing with the pullback metric.
---
> [W1] it is not really clear to me how the editing process works (sections 3.3 and 3.4)
[W1] A new subsection has been added to succinctly summarize the complete image editing procedure. Accompanied by Fig. R2, it offers a visual overview along with concise explanations, helping readers grasp how the elements introduced in each section come together in the overall editing process. Further information is available in global rebuttal paragraph 2.
> + [W1-a] In 3.3, the letter v is used to indicate elements of both T_x and T_h, so it is not always clear to which space they are referring.
[W1-a] Thank you for identifying the typo $\mathbf{v} \in \mathcal{T}_\mathbf{h}$ in Section 3.3, Line 148; it should read $\mathbf{v} \in \mathcal{T}_\mathbf{x}$. The notations $\mathbf{v}$ and $\mathbf{u}$ are reserved for elements of $\mathcal{T}_\mathbf{x}$ and $\mathcal{T}_\mathbf{h}$, respectively.
> + [W1-b] In general, it is not clear why the idea expressed in 3.3 is useful and where it was adopted.
[W1-b] Parallel transport (P.T.) is a concept in differential geometry for transporting a geometric object, such as a vector, between tangent spaces while maintaining its direction relative to the space. We use P.T. for two purposes (a short code sketch is provided at the end of this response):
* First, we use it for image editing. Due to the unsupervised nature of our editing method, we must manually confirm the semantics of each latent basis vector from the edited results. For example, suppose we want to edit 10 images of men into women. Without P.T., we would need to manually check the semantics of each latent basis vector in $\{\mathbf{v}_1, \cdots, \mathbf{v}_n\}$ to find the 'woman' vector for every image. With P.T., once we find the 'woman' basis vector in one image, we can directly edit the other images, without needing to locate the 'woman' vector in their own local latent bases.
* Second, we employ it to verify the similarity of local geometrical structures across samples, as shown in Fig. 7. This empirically demonstrates that the geodesic distance between local geometrical structures, depicted in Fig. 6, is not merely a conceptual measure: it tangibly influences the behavior of the diffusion model.
We revise the manuscript to better showcase the advantages of using P.T.:
>> (L144) ... the basis vector obtained cannot be used for another sample $\mathbf{x}'$ because $\mathbf{v}\notin\mathcal{T}_{\mathbf{x}'}$. This means that even if we have discovered a vector that edits a man's image into a woman for one image, we would still need to re-identify the 'woman' vector within the latent basis $\{\mathbf{v}_1, \cdots, \mathbf{v}_n\}$ of each new image. Reusing the basis vector obtained from another image allows us to avoid finding the 'woman' vector again.
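Below is the short code sketch mentioned above. It is a minimal illustration (not an excerpt from our implementation): assuming $\mathcal{H}$ is approximately Euclidean, the $\mathbf{h}$-space direction $\mathbf{u}$ found for one sample is reused for a new sample and pulled back into its $\mathcal{X}$-tangent space through a vector-Jacobian product; the toy `feature_map` merely stands in for the frozen encoder $\mathbf{x}_t \mapsto \mathbf{h}_t$.

```python
import torch

torch.manual_seed(0)
d, m = 64, 16
W1, W2 = torch.randn(d, 32), torch.randn(32, m)

def feature_map(x):            # toy stand-in for the frozen encoder x_t -> h_t
    return torch.tanh(x @ W1) @ W2

u = torch.randn(m)             # semantic ('woman') direction found in T_h for sample 1
u = u / u.norm()
x_prime = torch.randn(d)       # a different sample at the same timestep

# Pull the shared h-space direction back through the new sample's Jacobian:
# v' proportional to J'^T u, computed via a vector-Jacobian product.
_, v_prime = torch.autograd.functional.vjp(feature_map, x_prime, u)
v_prime = v_prime / v_prime.norm()
# v_prime can now drive x-space guidance for x_prime, without re-identifying
# the 'woman' direction within x_prime's own local latent basis.
```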
> + [W1-c-i] In eq 4 what is the epsilon function? [W1-c-ii] In general, isn’t the vector toward which to shift selected from T_H (so it should be u) and then transferred to T_x?
[W1-c-i] Thank you for pointing out the notation issue. Here $\epsilon$ is the denoising function of the pretrained diffusion model. For clarity, we will explicitly state this as $\epsilon_{\theta}$ in the revised manuscript:
>> (L166) ... where $\epsilon_{\theta}$ is the denoising function of the pretrained diffusion model, and $\gamma$ is ...
[W1-c-ii] Through the decomposition of the Jacobian, we obtain both $\mathbf{v}$ and its corresponding $\mathbf{u}$ simultaneously. However, only $\mathbf{v}$ is employed during the editing procedure. We accomplish semantically meaningful image editing by adjusting the latent variable $\mathbf{x}_t$ under the guidance of $\mathbf{v}$. For your reference, $\mathbf{u}$ is used specifically for parallel transport.
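To illustrate this point, here is a small sketch with a toy map standing in for the frozen encoder: one SVD of the Jacobian yields the right singular vectors $\mathbf{v}_i \in \mathcal{T}_\mathbf{x}$ (used for editing) and the left singular vectors $\mathbf{u}_i \in \mathcal{T}_\mathbf{h}$ (used for parallel transport) at the same time. The dimensions and the map are illustrative only.

```python
import torch

torch.manual_seed(0)
d, m = 8, 4
A, B = torch.randn(d, 6), torch.randn(6, m)

def feature_map(x):                                          # toy stand-in for x_t -> h_t
    return torch.tanh(x @ A) @ B

x_t = torch.randn(d)
J = torch.autograd.functional.jacobian(feature_map, x_t)     # shape (m, d)

U, S, Vh = torch.linalg.svd(J, full_matrices=False)
v1 = Vh[0]       # top right singular vector: lives in T_x, used for editing
u1 = U[:, 0]     # top left singular vector: lives in T_h, used for parallel transport
print(torch.allclose(J @ v1, S[0] * u1, atol=1e-5))          # J v_1 = sigma_1 u_1
```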
> [W2] Another concern is about the generalization of the proposed method to other diffusion techniques, or with other score models (i.e. not UNet). I think that this point needs more discussion.
> [L1] A discussion on how the proposed study may generalize to other domains and architectures would be of value.
[W2, L1] Thank you for raising this important issue. Our method is potentially applicable whenever a feature space with a Euclidean metric exists within the DM, analogous to the $h$-space in [Asyrp]. This has been demonstrated for the U-Net in [Asyrp]. However, whether other architectures, such as those akin to [DiT] or [MotionDiffusion], also exhibit Euclidean metric properties is an intriguing question for future research. We add this discussion to the revised manuscript as follows:
>> Our approach exhibits broad applicability in cases where a feature space within the diffusion model possesses a Euclidean metric, as exemplified by $\mathcal{H}$. This characteristic has been observed in the context of U-Net within [Asyrp]. The question of whether alternative architectures, such as those resembling the structures of [DiT] or [MotionDiffusion], could also manifest a Euclidean metric, presents an intriguing avenue for future investigation.
> [Q1] I was wondering what would happen if, instead of moving along one of the principal axis of T_H, you use directly the principal axis of T_x.
[Q1] As previously mentioned in [W1-c-ii], we directly manipulate the latent variable $\mathbf{x}_t$ with the discovered latent basis vector $\mathbf{v}$. This approach offers several advantages over editing $\mathcal{H}$ as described in [Asyrp]. First, direct control over $\mathbf{h}_t$ disrupts the coherence between inner features and skip connections within the U-Net, which can lead to artifacts, particularly when making substantial adjustments [DiffStyle]. As a result, [Asyrp] implements gradual changes across multiple timesteps. In contrast, manipulating $\mathbf{x}_t$ enables more substantial alterations within a single timestep. Second, unlike $\mathbf{u}$, which exclusively affects the deepest feature map, $\mathbf{v}$ influences not only the latent variable $\mathbf{x}_t$ but also the feature maps that depend on $\mathbf{x}_t$, as well as all skip connections.
> [Q2] I would also discuss a bit more method [25] in the related work since it seems related to the proposed method.
[Q2] To better distinguish our paper from [Asyrp], we modify the related work section as follows:
>> ... Kwon et al. [25] demonstrated that the bottleneck of the U-Net, $\mathcal{H}$, can be used as a semantic latent space. Specifically, they used CLIP to identify directions within $\mathcal{H}$ that facilitate genuine image editing. ... In contrast to their method, our editing approach directly manipulates latent variables within the latent space. Furthermore, we identify editing directions in an unsupervised manner.
> [Q3] Check the sentence at rows 74-75
> Row 152: we aims
> general grammar check
[Q3] Thank you for the careful comments. We will address and incorporate these revisions in the upcoming manuscript.
[Asyrp] : Diffusion Models already have a Semantic Latent Space, Mingi Kwon et al., 2022
[DiT] : Scalable Diffusion Models with Transformers, William Peebles et al., 2022
[MotionDiffusion] : Human Motion Diffusion Model, Guy Tevet et al., 2022
[DiffStyle] : Training-free Style Transfer Emerges from h-space in Diffusion models, Jaeseok Jeong et al., 2023
# Reviewer vuQy
> [W1] Some of the statements and claims are not clear as pointed in the Questions.
> [Q1] It is stated that “To investigate the geometry of the tangent basis, we employ a metric on the Grassmannian manifold.” However, the space could be identified by another manifold as well. Why and how did you define the space by the Grassmannian manifold?
[Q1] The tangent spaces $\mathcal{T}_{\mathbf{x}}$ and $\mathcal{T}_{\mathbf{h}}$ are subspaces attached to the points $\mathbf{x} \in \mathcal{X}$ and $\mathbf{h} \in \mathcal{H}$ of diffusion models. Given that the Grassmannian manifold is defined as a manifold of subspaces, it is well-suited for representing the manifold of the $\mathcal{T}_{\mathbf{h}}$s. Additionally, the geodesic metric on this manifold quantifies the separation between two subspaces via their principal angles. Hence, we consider this geodesic metric a robust means of evaluating the similarity between different $\mathcal{T}_{\mathbf{h}}$s.
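For reference, the standard geodesic distance on the Grassmannian is $d(\mathcal{T}, \mathcal{T}') = \big(\sum_i \theta_i^2\big)^{1/2}$, where $\theta_i$ are the principal angles between the two subspaces. The following is a minimal sketch of this computation; the spanning matrices and dimensions are illustrative.

```python
import torch

def grassmann_geodesic_distance(B1, B2):
    """Geodesic distance between the subspaces spanned by the columns of B1 and B2,
    computed from the principal angles theta_i as sqrt(sum_i theta_i^2)."""
    Q1, _ = torch.linalg.qr(B1)                       # orthonormal basis of span(B1)
    Q2, _ = torch.linalg.qr(B2)
    cosines = torch.linalg.svdvals(Q1.T @ Q2).clamp(-1.0, 1.0)
    theta = torch.arccos(cosines)                     # principal angles
    return theta.norm()

# Toy example: two 3-dimensional subspaces of R^10.
torch.manual_seed(0)
print(grassmann_geodesic_distance(torch.randn(10, 3), torch.randn(10, 3)))
B = torch.randn(10, 3)
print(grassmann_geodesic_distance(B, B))              # identical subspaces -> ~0
```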
> [Q2-a] It is claimed that “the similarity across tangent spaces allows us to effectively transfer the latent basis from one sample to another through parallel transport”. How this improves effectiveness was not analyzed. In general, [Q2-b] how do the proposed methods improve training and inference time? Indeed, the additional transformations can increase training and inference time. Could you please provide an analysis of the footprints?
[Q2-a] Figure 7 shows the impact of employing parallel transport (P.T.). The transported vectors can edit distinct samples while manipulating the same attributes. We will modify Line 227 as follows to better reflect our intent:
>> [Line 227-229] ... us to effectively transfer the latent basis from one sample to another through parallel transport as shown in Figure 7.
>> $\rightarrow$ ... us to successfully transfer the latent basis vector from one sample to another through parallel transport. Figure 7 shows that the transported vectors can edit distinct samples while manipulating the same attributes.
[Q2-b] We would like to emphasize that our approach does not require any form of training. Furthermore, P.T. is not a default step of the editing process; it is applied only in the specific case where local basis vectors from other samples are reused. The footprint of typical editing without P.T. is detailed in global rebuttal paragraph 3; here, for reference, we present only the result table:
| Image Edit Method | running time | Preprocessing |
|:-----------------:|:-------------:|:-------------:|
| Ours | 11 sec |N/A |
| [SDEdit] | 4 sec |N/A |
| [Pix2Pix-zero] | 25 sec |4 min |
| [PnP] | 10 sec |40 sec |
| [Instruct Pix2Pix] | 11 sec |N/A |
The runtime of editing with P.T. can be broken down into three components: 1) DDIM inversion and generation, 2) identification of the local basis, and 3) parallel transport itself. When employing P.T., it is advisable to use a sufficiently large $n$ to mitigate distortion, which in turn requires more computation time to obtain the local basis. Here, we use $n=50$, $100$ DDIM steps, and an unconditional diffusion model trained on CelebA-HQ.
|DDIM inversion + generation| Identification of local basis| Parallel transport |
|:--:|:--:|:--:|
| 10 sec | 100 sec | 0.002 sec |
> [W2] The results are given considering the Riemannian geometry of the latent spaces and utilizing the related transformations (e.g. PT) among tangent spaces on the manifolds. However, vanilla DMs do not employ these transformations. Therefore, it is not clear whether these results are for vanilla DMs or the DMs utilizing the proposed transformations.
[W2] We wish to emphasize that our method works on frozen vanilla DMs without necessitating any fine-tuning or architectural modifications. It offers unsupervised image editing capabilities that are applicable to both unconditional DMs and conditional DMs. Furthermore, as explained in [Q2], it is important to note that the inclusion of P.T. is not a default step in our editing process. This technique is selectively employed in particular cases where local basis vectors are transferred for editing other samples.
> [W3] A major claim is that the proposed methods improve effectiveness of the DMs. However, the employed transformations can increase the footprints of DMs.
[W3] To clarify, our primary focus is image editing using DMs; we do not aim to improve the performance of DMs themselves.
When performing image editing with a single latent basis vector (i.e., $n=1$), our editing process takes approximately 11 seconds, which is roughly 15% of the time required for vanilla inversion and reconstruction. Furthermore, even with the Jacobian approximation at $n=1$, the computational overhead of our approach remains comparable to other state-of-the-art editing methods, as shown in the table above.
> [L1] Some of the limitations were addressed but potential impacts were not addressed.
[L1] We appreciate your observation and would like to address this concern by incorporating a societal impact and ethics statement into the revised manuscript, which is presented below:
>> **Societal Impact / Ethics Statement.** Our research endeavors to unravel the geometric structures of the diffusion model and facilitate high-quality image editing within its framework. While our primary application resides within the creative realm, it is important to acknowledge that image manipulation techniques, such as the one proposed in our method, hold the potential for misuse, including the dissemination of misinformation or potential privacy implications. Therefore, the continuous advancement of technologies aimed at thwarting [Mist, Immunize] or identifying [Detect, Defake] manipulations rooted in generative models remains of utmost significance.
#### References
[SDEdit] : SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations, Chenlin Meng et al., 2021
[Pix2Pix-zero] : Zero-shot Image-to-Image Translation, Gaurav Parmar et al., 2023
[PnP] : Plug-and-Play Diffusion Features for Text-Driven Image-to-Image Translation, Narek Tumanyan et al., 2022
[Instruct Pix2Pix] : InstructPix2Pix: Learning to Follow Image Editing Instructions, Tim Brooks et al., 2022
[Mist] : Raising the cost of malicious ai-powered image editing, Salman et al., 2023
[Immunize] : Adversarial Example Does Good: Preventing Painting Imitation from Diffusion Models via Adversarial Examples, Liang et al., 2023
[Detect] : On the detection of synthetic images generated by diffusion models, Riccardo Corvi et al., 2022.
[Defake] : Defake: Detection and attribution of fake images generated by text-to-image diffusion models, Zeyang Sha et al., 2022
# Reviewer 6jtZ
Thank you for acknowledging the strengths of our paper: a distinctive idea for editing in diffusion models (DMs), and an improved understanding of latent space dynamics, e.g., the evolution of the geometric structure over time and the influence of various text conditions.
> [W1] (...), the method and notation used in this work can lead to some confusion. For instance, [W1-a] Section 3, in its current form, may not be as accessible to all readers as it could be, and it could benefit from being revised for clearer communication of the ideas contained therein. [W1-b] Additionally, Fig. 1, which is presumably intended to illustrate key concepts, is perhaps too dense with information. Dividing Fig. 1 into two separate figures could make it easier to digest, enabling a clearer explanation of the approach.
[W1] Thank you for the constructive suggestion. Here, we restate the global rebuttal regarding Fig. 1 (paragraph 1) and Section 3 (paragraph 2) for your convenience:
In order to make Fig. 1 more comprehensible, we divide it into two separate figures. Please refer to Fig. R1 and R2 in the global rebuttal PDF. First, **Figure R1** conceptually visualizes the local basis found through the pullback metric and summarizes the process for obtaining it. Second, **Figure R2** provides an overview of the image editing method using the discovered local basis and briefly presents its results.
To clarify the image editing process (Section 3), we added a new subsection summarizing the overall procedure. This includes a visual overview in Fig. R2 and clear textual explanations.
For more details on the improvements to Figure 1 and Section 3, please refer to global rebuttal paragraphs 1 and 2, respectively.
> [W2] A second aspect that could be improved upon is the overall presentation and proofreading of the paper. While the approach is relatively simple, its translation into the written form has not been as clear as one would hope. The text could benefit from a thorough proofreading to ensure that it is not just grammatically correct, but also that it conveys the authors' ideas in a way that is accessible to the wider machine learning community. As it stands, the paper's usefulness to this community may be hindered by its presentation.
[W2] Thank you for the advice on improving clarity. We have made the following improvements to help readers understand more easily:
+ We move parallel transport (P.T.) from Section 3.3 to the last subsection of Section 3. Previously, P.T. was introduced amid the explanation of the image editing process, potentially leading to the misconception that P.T. is a default part of our editing method, whereas it is only utilized in the specific scenario of editing with a basis vector from another sample. In its new placement, P.T. is clearly introduced as a specialized approach for importing latent basis vectors from other samples, after the image editing process has been fully presented.
+ We consolidate all components pertaining to the image editing method within Section 3. For instance, the process of generating the initial noise $\mathbf{x}_T$ through DDIM inversion, previously outlined in Section 4 (L178-182), is now relocated to Section 3. This reorganization ensures that readers can comprehend the complete editing process by referring to Section 3 alone.
+ To make Fig. 8 more digestible, we move Fig. 8-(b) and the content of L243-L248 to the appendix. Given that Figure 8-(b) serves to validate the outcomes in (a) and (c), we deemed this information more suitable for the appendix.
We will also carefully identify any remaining shortcomings and revise them in the camera-ready version.
> [W3] Lastly, the paper could do more to address the computational implications of its approach. The authors use the power method to approximate the Jacobian, which, while effective, can be computationally costly. It would be beneficial if the authors were more transparent about this fact, allowing readers to fully understand the computational demands of the approach and evaluate whether or not it would be feasible in their own applications. (...)
> [L1] I would like to see a deeper analysis of the computational complexity and runtime of the approach.
[W3, L1] In Appendix A, we report the computation time required to approximate the Singular Value Decomposition (SVD) of the Jacobian using the power method. The runtime of the power method depends on the parameter $n$, i.e., the rank of the low-rank approximation of the original tangent space. Smaller values of $n$ result in shorter computation times. For instance, $n=3$ takes approximately 10 seconds, whereas $n=50$ takes around 3 to 4 minutes for Stable Diffusion. In particular, when only a single basis vector is required, as in text-conditional editing, approximating the SVD of the Jacobian is remarkably fast (around 2.5 seconds).
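To make the procedure concrete, below is a minimal sketch of the power iteration for the top singular vector of the Jacobian, using only Jacobian-vector and vector-Jacobian products so the Jacobian is never materialized. The toy `feature_map` is only a stand-in for the frozen U-Net encoder evaluated at timestep $t$; for $n > 1$, one would orthogonalize against previously found vectors (block/subspace iteration).

```python
import torch

torch.manual_seed(0)
d, m = 64, 16
W1, W2 = torch.randn(d, 32), torch.randn(32, m)

def feature_map(x):                  # toy stand-in for the encoder x_t -> h_t
    return torch.tanh(x @ W1) @ W2

x_t = torch.randn(d)

# Power iteration on J^T J (J = dh/dx), using only jvp/vjp.
v = torch.randn(d)
v = v / v.norm()
for _ in range(30):
    _, Jv = torch.autograd.functional.jvp(feature_map, x_t, v)       # J v
    _, JtJv = torch.autograd.functional.vjp(feature_map, x_t, Jv)    # J^T (J v)
    v = JtJv / JtJv.norm()

_, Jv = torch.autograd.functional.jvp(feature_map, x_t, v)
sigma = Jv.norm()                    # top singular value
u = Jv / sigma                       # corresponding direction in T_h
# v approximates the first local latent basis vector v_1 (the n = 1 case).
```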
> [Q1] What is the complexity of computing the Jacobian? What is the the actual runtime (in seconds) of the approach compared to other editing methods? I think answering these questions can provide a good context for readers.
[Q1] Thank you for your valuable question. We conducted all comparisons on an Nvidia RTX 3090 with 24GB of VRAM. To ensure a fair comparison, we set $n=1$ and used 50 DDIM steps. The time taken by each method is as follows:
| Image Edit Method | running time |
|:-----------------:|:-------------:|
| Ours | 11 sec |
| [SDEdit] | 4 sec |
| [Pix2Pix-zero] | 25 sec |
| [PnP] | 10 sec |
| [Instruct Pix2Pix] | 11 sec |
For a more detailed discussion, please refer to global rebuttal paragraph 3.
#### References
[SDEdit] : SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations, Chenlin Meng et al., 2021
[Pix2Pix-zero] : Zero-shot Image-to-Image Translation, Gaurav Parmar et al., 2023
[PnP] : Plug-and-Play Diffusion Features for Text-Driven Image-to-Image Translation, Narek Tumanyan et al., 2022
[Instruct Pix2Pix] : InstructPix2Pix: Learning to Follow Image Editing Instructions, Tim Brooks et al., 2022
# Reviewer iEgM
Thank you for acknowledging our strengths:
- uncovering the effect of the text prompt and dataset complexity on the latent space.
- confirming the coarse-to-fine behaviour of diffusion models (DMs).
- enhancing credibility through experimental validation.
> [W1] The paper lacks comparisons with other diffusion-based image editing techniques, like [7,18].
> [Q1] To make the paper applicable to real-world scenarios right away, the authors should include a comparative analysis with other image editing techniques.
[W1, Q1] Thank you for the great suggestion. We conduct qualitative comparisons with text-guided image editing methods. Our state-of-the-art baseline methods include: (i) [SDEdit], (ii) [Pix2Pix-zero], (iii) [PnP], and (iv) [Instruct Pix2Pix]. All comparisons were performed using the official code. Please refer to global rebuttal paragraph 3 and Fig. R3 in the global rebuttal PDF for the results.
> [W2] The presentation and clarity of the paper could be improved. [W2-a] For example, the abstract contains too much detail, making it challenging to understand upon initial reading. [W2-b] Also the explanation of the image editing technique could be improved: [W2-b-i] what is DDIM (section 4)? [W2-b-ii] what is epsilon in Equation 4? [W2-b-iii] Finally, Figure 1 is not sufficiently clear to me, it may hampers the reader's comprehension.
[W2-a] Thank you for the constructive suggestion. We revise the abstract to be more concise and digestible, as follows:
>> Despite the success of diffusion models (DMs), we still lack a thorough understanding of their latent space. To understand the latent space $\mathbf{x}_t \in \mathcal{X}$, we analyze it from a geometrical perspective. Our approach derives the local latent basis within $\mathcal{X}$ by leveraging the pullback metric associated with the encoding feature maps. Remarkably, the discovered local latent basis enables image editing by moving $\mathbf{x}_t$, the latent variable of DMs, along a basis direction at a specific timestep.
>>
>> We further analyze how the geometric structure of DMs evolves over diffusion timesteps and differs across text conditions. This confirms the known phenomenon of coarse-to-fine generation and reveals novel insights, such as the discrepancy between $\mathbf{x}_t$ across timesteps, the effect of dataset complexity, and the time-varying influence of text prompts. To the best of our knowledge, this paper is the first to present image editing through $\mathbf{x}$-space traversal, editing only once at a specific timestep $t$ without any additional training, and to provide thorough analyses of the latent structure of DMs.
[W2-b] Thank you for pointing out the clarity issue.
[W2-b-i] To make the image editing process easier to understand, we created a summary subsection and an overview figure for the whole process. Please refer to global rebuttal paragraph 2 and Fig. R2 in the global rebuttal PDF for more details. In our method, [DDIM] is used to invert the image into the initial noise $\mathbf{x}_T$, and again for denoising to generate the image.
[W2-b-ii] $\epsilon$ is the denoising function of the pretrained diffusion model. For clarity, we will write it as $\epsilon_{\theta}$ in the revised manuscript and state it explicitly as below:
>> (L166) ... where $\epsilon_{\theta}$ is the denoising function of the pretrained diffusion model, and $\gamma$ is ...
[W2-b-iii] Thank you for the valuable suggestion. Please refer to global rebuttal paragraph 1 and Fig. R1, R2 in the PDF for improvements to Fig. 1.
Below is an excerpt from global rebuttal paragraph 1, restated here for the reviewer's convenience:
In order to make Fig. 1 more comprehensible, we divide it into two separate figures. Please refer to Fig. R1 and R2 in the global rebuttal PDF. First, **Figure R1** conceptually visualizes the local basis found through the pullback metric and summarizes the process for obtaining it. Second, **Figure R2** provides an overview of the image editing method using the discovered local basis and briefly presents its results.
> Some potentially interesting references
Thank you for the valuable references. We will add the suggested references to the revised manuscript.
> Typos
Thank you for the careful comments. We will address and incorporate these revisions in the upcoming manuscript.
#### References
[SDEdit] : SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations, Chenlin Meng et al., 2021
[Pix2Pix-zero] : Zero-shot Image-to-Image Translation, Gaurav Parmar et al., 2023
[PnP] : Plug-and-Play Diffusion Features for Text-Driven Image-to-Image Translation, Narek Tumanyan et al., 2022
[Instruct Pix2Pix] : InstructPix2Pix: Learning to Follow Image Editing Instructions, Tim Brooks et al., 2022
[DDIM] : Denoising Diffusion Implicit Models, Jiaming Song et al., 2020