# Discussion period
# Reviewer TZ1p
Comment:
Thank you for the authors' rebuttal, which has answered some of my questions. However, I still have some questions:
1. Why does transferring style through attention work? The current paper lacks theoretical analysis, and the method is too empirical, seemingly obtained through trial and error. The current version may be more suitable for computer-vision-type conferences.
2. The authors have introduced many hyperparameters in the paper, such as the position and number of blocks to be replaced. This is too many parameters for users to adjust. Can other methods achieve the same effect with a hyperparameter search?
3. Is the method sensitive to the base model, for example when replacing SDXL with community base models? What is the effect of direct application? Do the chosen parameters need to be modified? Can more results of inserting LoRA and ControlNet be released?
4. The computational complexity of the method.
## 1. Why does style transfer work with attention?
Style transfer aims to synthesize arbitrary content in a given style. We view style transfer as constructing content by rearranging the visual elements of a style reference according to that content, which aligns naturally with the attention mechanism between the content (query) and the style (key and value). The attention mechanism first obtains an attention map from the similarity between query features Q and key features K, then aggregates value features V using the attention map as weights. In self-attention, features with spatial dimensions serve as the query, key, and value tokens. The proposed swapping self-attention uses this mechanism to reassemble the visual features of a style image through the swapped value V’ according to the attention map between the content query Q and the swapped key K’. Figure 6 shows the attention map representing the style correspondence between a reference and a content image in a self-attention block. Please note that this is a well-established principle supported by a number of papers [aams, sanet, mast, adaattn, styletr2]; a minimal sketch of the swapped attention is given after the reference list below.
- [aams] Attention-aware multi-stroke style transfer, Yao+, CVPR 2019
- [sanet] Arbitrary style transfer with style-attentional networks, Park and Lee, CVPR 2019
- [mast] Arbitrary style transfer via multi-adaptation network, Deng+, ACM MM 2020
- [adaattn] AdaAttN: Revisit attention mechanism in arbitrary neural style transfer, Liu+, ICCV 2021
- [styletr2] StyTr2: Image style transfer with transformers, Deng+, CVPR 2022
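To make the mechanism concrete, here is a minimal, illustrative sketch of swapping self-attention (not our exact implementation; tensor shapes and head handling are simplified assumptions). The query comes from the generative (content) path, while the key and value are swapped in from the style-reference path.

```python
import torch

def swapping_self_attention(q_content, k_ref, v_ref, num_heads=8):
    """Illustrative sketch: Q from the content/generation path,
    K', V' swapped in from the style-reference path.
    All inputs have shape (batch, tokens, dim)."""
    b, n_q, d = q_content.shape
    head_dim = d // num_heads

    def split_heads(x):
        # (batch, tokens, dim) -> (batch, heads, tokens, head_dim)
        return x.view(b, -1, num_heads, head_dim).transpose(1, 2)

    q, k, v = map(split_heads, (q_content, k_ref, v_ref))
    # Attention map: similarity between content queries and reference keys.
    attn = torch.softmax(q @ k.transpose(-2, -1) / head_dim ** 0.5, dim=-1)
    # Reassemble the reference's visual features (values) using the attention map.
    out = attn @ v
    return out.transpose(1, 2).reshape(b, n_q, d)
```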
## 2. Many hyperparameters?
We respectfully correct the comment: our method has **only one hyperparameter**, the starting block in the U-Net. The position of the starting block determines the number of blocks to be replaced, so there are hardly “many” hyperparameters. Our method works with an identical hyperparameter across all style references; users do not have to adjust it and can simply use the value we provide.
The competitors cannot perform visual style prompting with any combination of their hyperparameters. We conducted thorough hyperparameter sweeps for the competitors and reported their best-performing configurations.
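As a purely illustrative sketch (the block indexing below is a hypothetical assumption, not our exact layer naming), the single hyperparameter, i.e., the starting block, directly determines which self-attention blocks use swapping self-attention:

```python
def swapped_block_indices(num_attention_blocks, start_block):
    """All self-attention blocks from `start_block` onward are swapped,
    so choosing the starting block fixes the number of replaced blocks."""
    return list(range(start_block, num_attention_blocks))

# Hypothetical example: with 9 candidate blocks and start_block=6,
# blocks 6, 7, and 8 use swapping self-attention.
print(swapped_block_indices(9, 6))  # [6, 7, 8]
```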
## 3. Sensitivity to base models
Let us respond to two interpretations of “base models”. i) If base models means backbones, our method works with different backbones: SDXL (main paper) and SDv1.5 (Appendix). Just as one fixed hyperparameter works for any reference on SDXL, one fixed hyperparameter provided by us suffices on SDv1.5. Adopting any other backbone requires only one hyperparameter search per model. ii) If base models means LoRA and ControlNet variants of one backbone, the hyperparameter chosen for the base model carries over to its LoRA and ControlNet variants, e.g., the hyperparameter for SDXL works with SDXL-LoRA and SDXL-ControlNet. We will mention this in the camera-ready version. Although a change of architecture does not allow direct application of the optimal hyperparameter, it can be selected with the proposed quantitative measurement.
We would be pleased to include more results of inserting LoRA and ControlNet in the camera-ready version. However, this is currently not permitted by ICML policy.
## 4. Computational complexity
We provide the time cost of each method as follows (a rough timing sketch is included after the notes below):
| Method | inference time (sec) |
| ------ | -------- |
|StyleDrop|11.21|
| SDXL (baseline) | 14.41 |
| DB-LoRA |14.41 |
| IP-Adapter |14.43 |
| Ours | 32.75 |
| StyleAlign | 33.01 |
Note that
- DB-LoRA requires additional training time (~30 min).
- For measurement, all methods except StyleDrop are based on SDXL.
- StyleDrop is based on Amused (an unofficial implementation of Muse [muse]), which is a vision transformer model.
[muse] Muse: Text-to-image generation via masked generative transformers, Chang+, ICML 2023
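For context, a rough, illustrative way to measure such per-image inference times with a diffusers-style text-to-image pipeline might look like the sketch below (this is an assumption about the measurement setup, not our exact benchmarking script):

```python
import time

import torch

def average_inference_time(pipe, prompt, n_runs=5):
    """Average wall-clock time of one text-to-image call.
    `pipe` is assumed to be a diffusers-style pipeline running on a GPU."""
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(n_runs):
        pipe(prompt)  # generate one batch of images for the prompt
    torch.cuda.synchronize()
    return (time.time() - start) / n_runs
```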
# Rebuttal period
# Reviewer 1zDe
Thank you for your detailed and supportive review. We will proceed to answer the questions sequentially.
## 1. What is the contribution of the paper compared to the existing literature? + Comparison with [ref-1]
(1) Thanks for acknowledging our competitive results. While _existing literature_ has explored self-attention in _image editing_ scenarios, _our work_ focuses on _image generation_ scenarios where _the generated results reflect the styles of reference images_. This is very different from editing because _generation starts from random noise_ to synthesize various images that match an input prompt. Visual style prompting affects only the style of the resulting images _without affecting the original content specified by the text_.
(2) We respectfully correct the second sentence of the review: we replace the key and value features of the self-attention layers in the generative process with the ones from a reference process. We do not touch the object texts. In contrast, [ref-1] swaps CLIP embedding features, which affects the results through cross-attention. We will include [ref-1] in the related work section. Thank you for enriching our paper.
## 2. Comparison with the image content editing task
We respectfully remind the reviewer that, in this task, the image content is specified by the input prompt. The content of the reference image does not play any role in the results, as opposed to image content editing, which synthesizes edited versions of the same content as the reference image. Thank you for pointing out this source of confusion. We will add this explanation right next to the definition of visual style prompting in the introduction section.
## 3. Gram matrix as an evaluation of style similarity
Thank you for recommending the Gram matrix. The table below confirms the superiority of our method over previous methods in terms of Gram-matrix style similarity (a sketch of the computation follows the table). While we also find DINO unreliable, we use it following previous methods [1, 2, 3]. We will report the Gram-matrix score next to DINO.

| Method | Gram-matrix score |
| ------ | ----------------- |
| Ours | 0.791 |
| StyleAlign | 0.759 |
| IP-Adapter | 0.768 |
| DB-LoRA | 0.759 |
| StyleDrop | 0.659 |
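For reference, one common way to compute a Gram-matrix-based style similarity is sketched below; our exact protocol may differ (e.g., in the feature extractor and the similarity function), so treat this as an illustrative assumption rather than our evaluation code.

```python
import torch
import torch.nn.functional as F

def gram_matrix(feat):
    """Gram matrix of a single feature map, feat: (channels, H, W)."""
    c, h, w = feat.shape
    f = feat.view(c, h * w)
    return (f @ f.t()) / (c * h * w)

def gram_style_similarity(feats_a, feats_b):
    """Average cosine similarity between Gram matrices of corresponding
    feature maps (e.g., from several VGG layers) of two images."""
    sims = [
        F.cosine_similarity(gram_matrix(fa).flatten(),
                            gram_matrix(fb).flatten(), dim=0)
        for fa, fb in zip(feats_a, feats_b)
    ]
    return torch.stack(sims).mean()
```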
## 4. The disparity between the qualitative and quantitative comparisons of DB-LoRA and ours
We respectfully note that the CLIP score is not a perfect evaluation metric: it is not perfectly disentangled from style variations. In our method, faithfully reflecting the style reference decreases the score while the object shape specified by the text prompt is preserved; in contrast, DB-LoRA does not reflect the style as faithfully.
Moreover, DB-LoRA is sensitive to its hyperparameters, resulting in a serious performance gap across reference images. Through a grid search, we found its optimal parameters for our reference images and prompts.
The text alignment (CLIP score) of the T2I baseline is 0.3001. Compared to the baseline, the text alignment of all methods inevitably decreases when reflecting style reference images. To compensate for the imperfection of the metric, we included a user study. Thank you for pointing out this source of confusion; we will update the explanation in Section 3.1.
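For clarity, a typical CLIP text-image alignment score is the cosine similarity between CLIP image and text embeddings. The sketch below illustrates this with a hypothetical model choice; our evaluation protocol may use a different CLIP variant.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

# Hypothetical model choice for illustration only.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_text_alignment(image, text):
    """Cosine similarity between the CLIP embeddings of an image and a prompt."""
    inputs = processor(text=[text], images=[image],
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return (img * txt).sum(dim=-1).item()
```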
## 5. Consistency across object texts and references
Yes, our proposed method consistently succeeds across various object texts and references, i.e., cherry-picking was not needed, except when the text and the reference are contradictory (for example, [example description]). This is an anticipated phenomenon because swapping self-attention forces the result to be composed of the reference.
[1] Ruiz, N., Li, Y., Jampani, V., Pritch, Y., Rubinstein, M., and Aberman, K. DreamBooth: Fine tuning text-to-image diffusion models for subject-driven generation. In CVPR, 2023.
[2] Hertz, A., Voynov, A., Fruchter, S., and Cohen-Or, D. Style aligned image generation via shared attention. arXiv, 2023.
[3] Voynov, A., Chu, Q., Cohen-Or, D., and Aberman, K. P+: Extended textual conditioning in text-to-image generation. arXiv, 2023.
# Reviewer TZ1p
## 1. Comparison with MasaCtrl
While MasaCtrl starts by duplicating the latent noise DDIM-inverted from a given image in order to keep most of the image unchanged except for the editing target, our method starts from random latent noise that may lead to different content in the style specified by a given image. If we adopt the initial noise proposed in MasaCtrl in our method, the resulting images suffer from quality issues, content leakage, and a lack of diversity within a given prompt. Furthermore, MasaCtrl requires the user to specify the target object in the prompt to indicate the editing region, whereas ours maintains the vanilla text-to-image procedure except for visual style prompting. Ours allows synthesizing different objects, specified by text prompts, that do not exist in the given image. Figure 12 shows that our method synthesizes images of identical objects in the given style for an identical latent noise.
We will add an ablation study for the above comparison in the camera-ready version. Lastly, MasaCtrl does not explain how the self-attention layer can transfer visual elements from reference images or which elements are delivered. We provide the intuition of our method in Sections 2.1 and 2.2 with respect to the self-attention mechanism in the style transfer literature: the self-attention layer reassembles the visual features stored in the value tokens from a reference image according to the similarity between the key tokens from the reference and the query tokens from the generative process.
## 2. Why is swapping self-attention effective within the style transfer literature?
Swapping self-attention has a strong connection to the style transfer literature [aams, sanet, mast, adaattn, styletr2], where the attention mechanism reassembles the visual features of a style image (key, value) on a content image (query). Instead of a content image, our method has random noise and a text prompt for specifying the content. This is briefly explained in Sections 2.1-2.2, but we will make it stand out better by elaborating on the above explanation.
- [aams] Attention-aware multi-stroke style transfer, Yao+, CVPR 2019
- [sanet] Arbitrary style transfer with style-attentional networks, Park and Lee, CVPR 2019
- [mast] Arbitrary style transfer via multi-adaptation network, Deng+, ACM MM 2020
- [adaattn] AdaAttN: Revisit attention mechanism in arbitrary neural style transfer, Liu+, ICCV 2021
- [styletr2] StyTr2: Image style transfer with transformers, Deng+, CVPR 2022
## 3. Investigation of the optimal layers for a different model & Is the method sensitive to the sampler?
We provide the investigation of Stable Diffusion v1.5 in the table below. Through the proposed quantitative measurement, we can find the optimal layer among the up-blocks (a selection sketch is given after the table). We will include the results in the camera-ready version.
Sampling methods do not influence how the style reference images are reflected. Any change induced by the sampling method, such as a change in quality, occurs equally with and without our method.
| Metric | Early upblock | ← | → | Late upblock |
| ---------------- | ---------- | ---------- | ---------- | ---------- |
| Style similarity | 0.869 | 0.878 | 0.738 | 0.410 |
| Text alignment | 0.217 | 0.219 | 0.259 | 0.262 |
| Content leakage | 0.478 | 0.463 | 0.412 | 0.376 |
| Diversity | 0.439 | 0.567 | 0.639 | 0.646 |
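To illustrate how the optimal starting block could be chosen with the proposed quantitative measurements, here is a hypothetical sketch. The helper functions `generate_with_start_block`, `style_similarity`, and `text_alignment`, as well as the way scores are combined, are assumptions for illustration, not our exact selection protocol.

```python
def select_starting_block(candidate_blocks, generate_with_start_block,
                          style_similarity, text_alignment):
    """Pick the starting up-block that best balances style similarity
    and text alignment on a validation set of references and prompts."""
    best_block, best_score = None, float("-inf")
    for block in candidate_blocks:
        images = generate_with_start_block(block)  # run visual style prompting
        score = style_similarity(images) + text_alignment(images)
        if score > best_score:
            best_block, best_score = block, score
    return best_block
```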
## 4. StyleDrop results are not reliable
Unfortunately, no official code or demo is available. The unofficial StyleDrop implementation loses significant performance when generating complex images, and we tried to find better parameters based on it. Please let us know if any official source of StyleDrop becomes available.
# Reviewer ETwC
## 1. Audience
We believe that the ICML audience would be interested in the relationship between self-attention in diffusion models and the attention mechanism in style transfer.
## 2. Extension to image editing
Our paper is focused on style transfer, not on changing the subject. Style transfer is a well-established area with many studies exploring how it works [1,2,3,4,5,6].
* [1] Dmitry Ulyanov, Vadim Lebedev, Andrea Vedaldi, and Victor S. Lempitsky. Texture networks: Feed-forward synthesis of textures and stylized images. In ICML, 2016.
* [2] Yingying Deng, Fan Tang, Weiming Dong, Chongyang Ma, Xingjia Pan, Lei Wang, and Changsheng Xu. StyTr2: Image style transfer with transformers. In CVPR, pages 11326–11336, 2022.
* [3] Leon A. Gatys, Alexander S. Ecker, and Matthias Bethge. A neural algorithm of artistic style. arXiv preprint arXiv:1508.06576, 2015.
* [4] Yongcheng Jing, Yezhou Yang, Zunlei Feng, Jingwen Ye, Yizhou Yu, and Mingli Song. Neural style transfer: A review. IEEE Transactions on Visualization and Computer Graphics, 26(11):3365–3385, 2019.
* [5] Yanghao Li, Naiyan Wang, Jiaying Liu, and Xiaodi Hou. Demystifying neural style transfer. arXiv preprint arXiv:1701.01036, 2017.
* [6] Siyu Huang, Jie An, Donglai Wei, Jiebo Luo, and Hanspeter Pfister. QuantArt: Quantizing image style transfer towards high visual fidelity.
## 3. Typos
Thank you for pointing out the typos in our paper. Your attention to detail has greatly improved the quality of our work. We have corrected all the mistakes you highlighted.
## Q 1.
Our focus is on style transfer, not subject transfer. Style transfer is a well-established application with a large body of research dedicated to understanding its principles.