Rebuttal - HackMD

## Final global response Dear reviewers and AC, We sincerely thank all the reviewers for their valuable comments and suggestions, which have helped improve our paper. Although some reviewers did not engage in our response, we have incorporated all comments into both the main paper and the appendix. We are happy that our key contributions are properly recognized: * $\textit{Directional}$ CLIPScore allows users to grasp the presence of attributes in a given image, while CLIPScore does not. * Single and paired attribute alignment (SaKLD and PaKLD) allow users to choose a model that captures desired attributes in the training distribution. Sincerely, authors $\textit{Directional}$ | | | 10k | 20k | 30k | 40k | 50k | 60k | 70k | |-------|-----------|--------------|--------------|--------------|--------------|--------------|--------------|--------------| | | seed1 | 8.27 | 7.99 | 7.83 | 7.88 | 7.77 | 7.84 | 7.80 | | | seed2 | 8.34 | 8.20 | 8.09 | 8.15 | 8.10 | 8.06 | 8.03 | | SaKLD | seed3 | 9.13 | 8.43 | 8.04 | 8.04 | 7.88 | 7.84 | 7.80 | | | seed4 | 8.66 | 8.20 | 7.89 | 7.98 | 7.85 | 7.89 | 7.87 | | | seed5 | 8.40 | 7.93 | 7.61 | 7.79 | 7.66 | 7.74 | 7.68 | | | mean(var) | 8.56(0.123) | 8.15(0.039) | 7.89(0.036) | 7.96(0.019) | 7.85(0.026) | 7.87(0.013) | 7.83(0.016) | | | | | | | | | | | | | seed1 | 23.47 | 21.18 | 20.29 | 20.08 | 19.69 | 19.69 | 19.45 | | | seed2 | 23.62 | 21.54 | 20.60 | 20.43 | 20.17 | 19.95 | 19.76 | | | seed3 | 24.90 | 21.98 | 20.48 | 20.15 | 19.67 | 19.45 | 19.30 | | PaKLD | seed4 | 24.29 | 21.70 | 20.44 | 20.23 | 19.80 | 19.67 | 19.48 | | | seed5 | 23.82 | 21.03 | 19.82 | 19.88 | 19.48 | 19.44 | 19.19 | | | mean(var) | 24.02(0.337) | 21.48(0.148) | 20.32(0.092) | 20.15(0.040) | 19.76(0.065) | 19.64(0.043) | 19.43(0.046) | 작게 작성한 부분 == 한국어 원문 파란색 == 넣을지 말지 고민하고 있는 부분. 빨간색 == 해야할 일. 수정하고 싶은 부분. 보라색 == 빨간색 수정해본 것(재석), 괜찮으면 반영 ㄱㄱ # Private message to AC Dear AC, We genuinely appreciate the reviewers' efforts in assessing our work. However, we find the reviews from [ch3Y] to be somewhat inspecific and challenging to fully comprehend. Despite this, we have put our best effort in formulating a thorough response. We kindly request that the lack of specificity in his/her reviews be taken into consideration during the decision process. Thank you for your consideration. Best regards, Authors # Global rebuttal We thank the reviewers for their valuable advice. Here, we compile reviews we want to share with all the reviewers. Please see our responses addressing the specific concerns below: #### 1. Additional experiments for providing more quantity of convincing information #### Reviewers T7Dd, 1cWg, and ch3Y suggested to add more experiments to be more convince. We provide more experiments which support the argument of our paper. #### 1-1. Biased image injection experiment We conduct additional experiment with restrict desired conditions which support Sec 5.1; Correlated Image Injection Experiment. We conducted the existing experiment more densely and showed it as a graph. Please refer to **Fig. 2** in the PDF. We compare SaKLD and PaKLD with FIDs by injecting certain attributes - ("man"-"makeup" and "man"-"bangs"). Fig. 2 shows that while SaKLD and PaKLD captures distribution changes, FIDs remain about the same value. The results show that our proposed method can catch the attribute distribution change well, unlike conventional metrics.  #### 1-2. Additional model performance descriptions We provide more results of models - LDM, StyleSwin and ProjectedGAN in the following table. These results show that our metric is similar to the tendency of FIDs. Especially ProjectedGAN, which was criticized [FID-clip] for not focusing on fidelity, only focusing on getting good FID score by training to match training set's imagenet encoder output stats, showed inferior results in SaKLD and especially PaKLD on our metric. It indicates that even if model can mimic few of attributes contained in imagenet labels, it is hard to catch the correlation of attributes on the training set. **Fig. 4** in the PDF shows some examples of ProjectedGAN; a baby with beard and some unusual samples. [FID-clip] Kynkäänniemi et al., The Role of ImageNet Classes in Fréchet Inception Distance, ICLR, 2023 #### Table 1 (*The scale of SaKLD,PaKLD are 1e-7, using 50k FFHQ) | | StyleGAN1 | StyleGAN2 | StyleGAN3 | iDDPM | LDM | StyleSwin | ProjectedGAN | |:----------:|:-----------:|:-----------:|:-----------:|:-------:|:-------:|:-----------:|:--------------:| | SaKLD | 9.82 | 6.70 | 6.00 | 10.70 | 18.32 | 16.93 | 11.45 | | PaKLD | 23.50 | 16.75 | 15.50 | 25.00 | 39.71 | 40.25 | 28.18 | | FID | 4.74 | 3.17 | 3.20 | 7.31 | 11.86 | 5.45 | 4.45 | | FID_CLIP | 3.17 | 1.47 | 1.66 | 2.39 | 3.57 | 3.63 | 2.45 | ### 2. Improving the way to obtain $C_{\mathcal{T}}$, the center of images in text respect. #### Reviewer T7Dd provided constructive discussion about using the attributes themselves for obtaining $C_{\mathcal{T}}$. Thanks to reviwer T7Dd, we improve the way to obtain $C_{\mathcal{T}}$ using the attributes themselves. We conduct the same experiment, reporting classifying accuracy by using CelebA annotation, for comparing between using captions and attribute themselves. | | CS | DCS w/ captions | DCS w/ attribute themselves | | :--------: |:------: | :--------: | :--------: | | accuracy | 0.395 | 0.409 | 0.442 | Using attributes themselves for obtaining $C_{\mathcal{T}}$ shows superior accuracy than others. We have conducted all the experiments in the paper using the attributes themselves, and we confirm that the trends are all the same including Table 1 obtained with attributes themselves. Thanks for the great constructive suggestion, and we are confident that we will update all the results for the camera-ready version and that there are no changes in the aspect of tendency. ### 3. Moving interpretable analysis from appendix to Section 5. Reviewers o9jM and E51D suggested showing a more fine-grained analysis of the scores. Thanks to the great suggestions of reviewers, we decided to move the analysis in the appendix into the main paper. We agree that this analysis can greatly highlight the interpretable advantages of our method. The following is what has been transferred. Please refer to Fig. S13 and Tab. S7 for details. SaKLD SaKLD directly measures the differences in attribute distributions, indicating the challenge for models to match the density of the highest-scoring attributes to that of the training dataset. Examining the top-scoring attributes, all three StyleGAN models have similar high scores in terms of scale. However, there are slight differences, particularly in StyleGAN3, where the distribution of larger accessories such as eyeglasses or hats differs. In contrast, iDDPM demonstrates notable scores, with attributes ‘makeup’ and ‘woman’ showing scores over two times higher than GANs. Particularly, apart from these two attributes, the remaining attributes are similar to GANs, highlighting significant differences in the density of ‘woman’ and ‘makeup’. Investigating how the generation process of diffusion models, which involve computing gradients for each pixel, affects attributes such as ‘makeup’ and ‘woman’ would be an intriguing avenue for future research. PaKLD PaKLD provides a quantitative measure of the appropriateness of relationships between attributes. Thus, if a model generates an excessive or insufficient number of specific attributes, it affects not only SaKLD but also PaKLD. Therefore, it is natural to expect that attribute pairs with high PaKLD scores will often include top-ranking attributes in SaKLD. Nevertheless, PaKLD reveals interesting findings. Firstly, it is noteworthy that attributes related to ‘beard’ consistently receive high scores across all StyleGAN 1, 2, and 3 models. Figure S16 confirm that ‘beard’ significantly contributes to the overall PaKLD scores. This indicates that GANs generally fail to learn the relationship between beards and other attributes, making it an intriguing research topic to explore the extent of this mislearning and its underlying reasons. LDM also exhibits interesting attributes, particularly ‘bald head’. Despite not having a high score in SaKLD, it consistently receives high scores in PaKLD. This implies that individuals with bald heads have a scarcity of many attributes that are commonly found, and analyzing this phenomenon would be a promising avenue for future research.   ------    # Reviewer T7Dd ### Presentation of paper can be improved  [W1-1] We modify the writing and figures for readability. Please refer to attached PDF file for figure. As mentioned in global announce 1, we also add more quantity of information. Please refer to global rebuttal. [W1-2] Evaluating different models allows users to choose a model that meets their needs. FID, precision, recall, density, and coverage measure quality and diversity so that users can choose a model with the best quality or compromise quality with diversity. SaKLD and PaKLD measure how much the generated images align with the training images regarding semantic attributes. For that, directional CLIP score interprets an image as amounts of attributes. Hence, users can choose a model with the desired attributes. [W1-3] We provide our motivations and improve the presentation as followings: ex 5.1 Validation of our metric's effectiveness: To check whether our metric operates properly, we intentionally injected "man with makeup" images only into an experimental set. We found deteriorations on both SaKLD and PaKLD regarding "makeup" since they are rare in the training set. Total SaKLD and PaKLD scores have worsed also, which reflect both individual attributes/attribute affected to total score. Similarly, SaKLD regarding a specific attribute linearly decreases by excluding images having the attribute from one of the sets. ex 5.2 Ablational validation of PaKLD For thorough validation, we established a particular scenario where setA consists of images with more smiling men than unsmiling men and more unsmiling women than smiling women, and in setB, it is vice versa. We set both groups to have the same number of images for each attribute ("smiling", "man" and "woman"). While SaKLD scores regarding each attribute were similar, PaKLD regarding "smiling man" and "smiling woman" showed significnat differences compared to the other PaKLD scores. It demonstrates that PaKLD accurately understand the difference between the correlation of attributes. We used celebA datasets for the experiment, since it has ground truth attribute labels.        -->            ### Concept based representation   Concept Whitening extracts human-understandable concepts from black-box models through the CW layer. In certain contexts, the motivation behind interpreting black box models closely aligns with our metric. However, our interpretation approach diverges significantly as it is posthoc. We recognize the significance of associating our problem set with a concept-based representation, going beyond mere posthoc methods, in order to derive more valuable insights. We organized it and put it in the related work.  ### Further exploration of attribute selection  In addition to language-text models like BLIP and GPT, the incorporation of methods such as classifier model's heatmap for obtaining attributes and ranks can introduce new attributes, potentially leading to novel perspectives in understanding generative models. We concur with the utilization of a variety of methods for attribute extraction. Beyond this, our primary contribution lies in our pioneering effort to interpret generative models. While attribute extraction methodologies may yield broader feedback, they are not as central as our primary objective: interpreting models through attribute quantification and comparative analysis. Additionally, the reason we set the number of attributes up to 40 attributes because celebA has 40 attributes. If a user is interested in hair styles, one can enrich the target attributes. We have added this topic to Discussion.   ### More explanation for toy experiments   We add more detailed explaination which contains the content in global rebuttal and [W1-3]. We designed the toy experiments to destroy the distribution in the way we want. The outcomes of this experiment, particularly the injection of "man&makeup" images that are rarely found in the training set, leads to a noticeable linear increase in both SaKLD and PaKLD scores. However, there were no distinctive patterns observed when we injected another ground-truth subset.    ### Text mean and Image mean #### Dividing text mean and image mean It is necessary to center text and images differently. Appendix B.1. provides an ablation experiment: setting the text origin as the text mean vs image mean. We additionally provide the entire mean setting in the table below. Using the text mean shows the best accuracy. The image origin is set to the image mean. | | different text/image mean | same text/image mean | entire mean | | -------- | -------- | -------- | ------| | accuracy | 0.409 | 0.228 | 0.313 | We suppose the image embedding and the attribute embedding are slightly unaligned because CLIP is trained with sentences and we use a simplified form of “a photo of {attribute}”. In other words, a single attribute does not fully describe one image. This is our motivation for using different centers for images and attributes. #### Text mean as attribute themselves Thank you for constructive suggestion. We verified using attribute themselves for obtaining $C_{\mathcal{T}}$ is more accurate than previous method in CelebA experiment. We conducted all experiment in our paper and confirm that there is no change in tendency. Please refer to global rebuttal. Thanks again for great suggestion.  | | captions | attribute themselves | | -------- | -------- | -------- | | accuracy | 0.409 | 0.442 | ### Relying on external models such as BLIP  We respectfully suppose a misunderstanding. We do not object using external models but object using uninterpretable embeddings. Our solution is to design an interpretable embedding using DCS. These explanations are in L47 and L66. One of the advantage of our metric is its flexibility; one can proceed for the desired task and analysis. Using external models for embedding is a smart approach that removes costly manual annotation. # Reviewer 1cWg ### GPT may bring bias We tone down our claim regarding the use of GPT in the paper as follows. Instead of being a vital network for obtaining text attributes, GPT is positioned as a recommendation module to assist users in selecting attributes. As users cannot manually apprehend all the visual attributes in training datasets, external models including GPT can be helpful but should be treated as a tool, rather than the auxiliary modules in the attribute selection process. ### Correlation with human judgement  We respectfully note that collecting human evaluation for generated attributes is impractical due to the number of samples and subjectiveness. While image quality and diversity can be roughly evaluated by humans with maybe 100 images, attribute distribution evaluated with 100 images is prone to hasty generalization. Furthermore, evaluating 50K images for attributes is impractical. As a remedy, we offer a novel evaluation protocol for broad and customizable attributes using interpretable embedding and divergence of a generated distribution from the real distribution. ### Euclidean CLIPScore  >can the problem exhibited by CLIPScore be solved by directly calculating the Euclidean distance between E(x) and E(a)? > No, it cannot be resolved by Euclidean distance because CLIP is trained with cosine distance. To confirm it, we provide additional experiments in the table below. As expected, Euclidean cases are inferior to cosine cases in CelebA attribute classification. | | Euclidean | Cosine | | -------- | -------- | -------- | | As-is embedding | 0.222 | 0.395 | |Directional embedding| 0.225 | 0.409 | #### Notation error  We replaced N used in Eq. (5) to another notation to avoid overload in Eq. (2). Thanks for finding. # Reviewer o9jM ### Image quantity's impact on result stability  We use gaussian Kernel Density Estimation(gaussian KDE) to make Probability Density Function(PDF) for each attribute from dataset's DCS. If the number of images is too small, the subset's PDF may not describe the full dataset's PDF, and it could deliver inaccurate interpretation of full dataset (or generative model's performance). Fig. 5 (a) describes the impact of sample size and 50k images are recommended to get accurate result. ### First explainable metric  >Regarding the for explainable evaluation ... it is necessary to determine whether this paper is the first to come up with it. > While our evaluation protocol and FID$_\text{CLIP}$[24] both use CLIP embedding, they are different as follows. FID$_\text{CLIP}$ directly computes FID on CLIP embedding space. The CLIP embedding is still an uninterpretable feature vector of 512 dimensions, and each channel of the embedding does not have its explicit meaning. On the other hand, we compute similarity between an image and a set of given attributes in CLIP embedding space. It projects the uninterpretable CLIP embedding of the image to an interpretable embedding space, and each channel of our final embedding conveys similarity of the image to its respective attribute. This paper is the first to come up with an explainable evaluation which is applicable to various settings. We will add a closely related work [GANseeing] as follows.  [GANseeing] Bau et al., Seeing What a GAN Cannot Generate, ICCV, 2019 ### What if CLIP is biased?  Including CLIP model, every model may be biased. However we utilize biased pretrained models as a feature extractor for evaluation metric because we usually calculate the distance of distribution of the features. Even though the model is biased, the distance is meaningful since the features from real dataset are also extracted by that biased model. Nevertheless, it is ideal that there is no bias as much as possible. We suppose that if there are biases in the CLIP space itself, CS and DCS both are not accurate but DCS is better than CS since we move the origin into the middle of attributes.  ### Do we need all training images?  Obviously, it is recommended to use all the training images as the same as in other evaluation metrics. For efficiency, as FID does, one can store and reuse CLIP embeddeings of all images. Also, one can use the well-designed subset of training dataset or evaluation dataset. ### Is the captioning model worth the trade-off? >Also, considering the introduction of the auxiliary image captioning model called BLIP, the improvement in performance seems minimal. Do the authors believe it is worth the trade-off? > One of the advantages of our metric is its flexibility; one can proceed with the desired task and analysis. Using the auxiliary image captioning model is a smart approach that removes costly manual annotation. We kindly mention that improving the performance is not the purpose of using the captioning model. We believe it is worthwhile not only for the efforts to annotate but also for a standardized evaluation way.  ### Explanation of FID$_\text{CLIP}$   FID is Fr\’echet distance between real embeddings and generated embeddings on Inception-V3’s penultimate feature space. FID$_\text{CLIP}$ computes the same Fr\’echet distance but on CLIP image embedding feature space. We included FID and FID$_\text{CLIP}$ in Table 2 to show that they somewhat correlate (negatively) with the number of injected images. Still, they do not provide any clue for what attributes are under-/over-represented. While the original CLIP embedding space is not interpreted, our proposed attribute embedding is designed for interpretable by exploiting the superior text-image model.  ### More than 2 attributes with PaKLD  > Can PaKLD be proposed for more than two attributes? > Of course, our metric can calculate more than two attributes. We conduct the Triple-attribute KLD between GT FFHQ images and generated images from iDDPM. And we observed the probability P("a person" & "glasses" & "a cell phone") was the most significant difference in 3D joint probability between GT and generated images. We made judgments using diverse combinations of attributes and added it into Appendix. However, we kindly mention that intuitive interpretation is difficult in more than three attributes.  ### Interpreting numerical values of metric   We can interpret numerical values by looking in the top-scoring attributes. We briefly share our insights in Appendix. C, and also decide to move some information into main paper. Please refer to global rebuttal. In Appendix.C, we computed major model’s score, and StyleGAN3 was ranked as the best model among all when using BLIP extracted text attributes. Additionally, while comparing each generative model’s score is helpful, a more thorough interpretation of each model’s performance via SaKLD or PaKLD would be beneficial. For example, StyleGAN3 failed to capture training images “eyeglass” or “hat” distribution, possibly due to its training approach of alias-free modeling between fine attributes. # Reviewer ch3Y ### Experiments should be more convincing > Although, the motivation for developing this metric is valid, the overall methodology and the experimental results are not convincing for the use of this metric. We appreciate recognizing the validity of our motivation for evaluating attributes of generated images. We could not understand what is missing in our experiments from the review. If it is meant to be how our metrics vary across different image distortions, we do not focus on image quality and diversity because they can be measured by existing metrics. In order to strengthen our ground, we add one more experiment with omitting eyeglasses where SaKLD decreases according to number of omitted images. Could you elaborate if there are more to add? | | SaKLD | PaKLD | most influencing attribute for SaKLD | |-----------------------------------------------------------------------------------|-------|--------|--------------------------------------| | setA(eyeglasses 3325/total 50000) vs setB(eyeglasses 3260/ total 50000) | 0.632 | 3.421 | "beard" | | setA(eyeglasses 3325/total 50000) vs setB(eyeglasses 2000/total 50000) | 0.892 | 4.050 | "eyeglasses" | | setA(eyeglasses 3325/total 50000) vs setB(eyeglasses 1000/total 50000) | 1.545 | 5.668 | "eyeglasses" | | setA(eyeglasses 3325/total 50000) vs setB(eyeglasses 0/total 50000) | 3.257 | 11.595 | "eyeglasses" | ### Comparative experiments to previous metrics We first remind that the main advantage of our metrics is the interpretability where users can observe which attributes are properly modeled or messing up. Previous metrics cannot measure this aspect. On the other hand, our metrics measure how well per-/pair-attribute distribution of generated images align with the training images (Figure 4, S2, S3, S4, and S18). Furthermore, Table 2 compares how FID and FID$_\text{CLIP}$ changes over different number of injected images. We will add precision and recall in the table during the discussion period. ### Relying on textual attributes We would love representing attributes in the image modality but it is prohibitive because an image cannot hold exactly *one* attribute because an image is composition of various attributes. Textual attributes are common and plausible modality for representing attributes in images[LLaVA][ESPER][Visual ChatGPT][BLIP-2]. [LLaVa] Liu et al. Visual Instruction Tuning, arXiv, 2023 [ESPER] Ye et al. Fusing Pre-trained Language Models with Multimodal Prompts through Reinforcement Learning, CVPR, 2023 [Visual ChatGPT] Wu et al. Visual chatgpt: Talking, drawing and editing with visual foundation models arXiv, 2023 [BLIP-2] Li et al. Blip-2: Bootstrapping languageimage pre-training with frozen image encoders and large language models, arXiv, 2023 ### CLIPScore and KLD are not novel We propose directional CLIP score (DCS) instead of CLIPScore. DCS is superior to CLIPScore in attribute classification as shown in Table 1. Furthermore, DCS leads to signed scalar over binary attributes as shown in Figure 2. These improvements make SaKLD and PaKLD more sensitive to attributes in the images. KLD is a common measure for computing divergence of one distribution from another. We respectfully remind that we focus our novelty on defining the distributions of attribute strengths from a set of images rather than using KLD. ### Some important parts are a bit ambiguous ##### Center of attributes Eq. (2): $C_{\mathcal{T}} = \frac {1}{N}\sum_{i=1}^{N}\textbf{E}_{\textbf{T}}(\text{BLIP}(x_i))$, where $\textbf{E}_{\textbf{T}}$ is CLIP text encoder and {$x_i$} are training images. BLIP produces a caption for a given image $x_i$. $\textbf{E}_{\textbf{T}}$ produces a CLIP text embedding of the caption. The rest computes their mean. We are not sure which part is ambiguous. Could you elaborate? ##### CelebA attribute classification accuracy > results from table 1 (what does accuracy stand for) etc. are a bit ambiguous. > [Q2] Table 1 represents accuracy results for CS vs DCS. What is meant by evaluating ”positive samples” in line 164 and 165? And what is meant by ”how well they are aligned”? Also, what exactly is the accuracy calculated for? We perform binary classification over all attributes in CelebA according to DCS and compare it with the groundtruth attribute labels. Positive means having an attribute. Higher accuracy means DCS agrees with the annotation. Thank you for the constructive comment. We will clarify this as follows. >> [before] Furthermore, DCS exhibits superior accuracy compared to CS, as demonstrated in Table 1. The table presents the accuracy results of CS and DCS for annotated attributes in CelebA [20]. By evaluating how well positive samples with the highest score align with positive samples for a given attribute, DCS consistently outperforms CS in accuracy. >> [after] Furthermore, Table 1 demonstrates superiority of DCS over CS in measuring attribute strength. If we consider a sample with high DCS as possessing an attribute, DCS achieves higher accuracy than CS in classifying samples with or without attributes annotated in CelebA [20]. ##### Method is not sound > There is a lot of scope to improve the technical soundness of the paper. > The proposed methodology is not very sound and ... We could not understand which part is not sound. Could you elaborate? ##### Missing literature > Although some of the popular metrics 1are mentioned, there has been a lot of work in the generative modelling domain which the literature survey must cover. We include commonly used metrics for in the related work section: FID, FID$_\text{CLIP}$, precision & recall, improved precision & recall, density & coverage, rarity score. We will add perceptual path length, Fr\'echet segmentation distance as below. Inception score is omitted because it is barely used currently. We also omit the metrics for measuring alignment between condition (input) and generation (output) such as mIoU because our goal is to measure the divergence of generated images from training images. Could you mention any other relavent works that would be helpful to include? >> Perceptual path length [PPL] measures sum of perceptual distances between samples along latent interpolation to indicate smoothness of the latent space. Fr\'echet segmentation distance [FSD] compute Fr\'echet distance between the number of pixels of segmented categories in fake images and real images. [PPL] Karras et al., A Style-Based Generator Architecture for Generative Adversarial Networks, CVPR 2019 [FSD] Bau et al., Seeing What a GAN Cannot Generate, ICCV 2019 ##### Method is not supported by experiments > The proposed methodology is ... not well supported with the experimental setup. > Based on the current state of the experiments, the contributions don’t seem to be significant as there is not enough validation to support the claims. * Superiority of DCS over CLIPScore is supported by Table 1. * Misaligning attributes are captured by SaKLD as shown in Figure 4a. * How SaKLD reflects injecting irrelevant attribute is shown in Table 2. * Misalinging pairs of attributes are captured by PaKLD as shown in Figure 4b. Could you elaborate which part is not supported? ##### The results are also not sufficiently explained We could not catch what results are not sufficiently explained. Could you elaborate? ##### Diversity is the main motivation > Diversity is the main motivation for the paper as it is mentioned in the abstract but any theoretical or empirical work to support it is completely missing. We respectfully remind that our main motivation is interpretable evaluation rather than diversity as written in L39-L51. ### typos Thank you for helping us improve our paper. We have reflected them in the paper and ran another proofreading. ### Image and text embeddings > [Q1] If I understood correctly, the text and the image encoders are different? If yes, then the embedding space of both models is going to be very different and will be influenced by the downstream tasks that they are trained on. Would this metric always work for embeddings from different models? Does it make sense to calculate similarity between embeddings from different models? We use pretrained CLIP image and text encoders for producing image and text embeddings, respectively. CLIP encoders are trained to maximize cosine similarity between an image and its matching text while minimizing cosine similarity between non-matching pairs of image and text. Hence, they live in the same embedding space. It does not make sense to calculate similarity between embeddings from *different models*. It is why we use CLIP embeddings. ### Meaning of positive sample and alignment It is merged in a previous item marked with [Q2]. ### Meaning of each channel of the embedding > [Q3] In Section 3.1, there is a mention of ”each channel of the embedding being utilized”, how exactly does this happen? The procedure for encoding an image into an interpretable embedding is as follows: 1) compute similarities between an image and $N_A$ attributes using DCS. 2) form an $N_A$-dimensional embedding with the similarities. We will add this at the very beginning of Section 3.1. Thank you for the constructive comment. ### The same number of samples with specific attribute? > [Q4] Line 196-197 claims that the generated distribution is desirable only if it has exactly same number of samples as in the training set that has a particular attribute. Why should it be? Please consider an extreme example. If we have a dataset containing dogs and cats, a generative model that produces only dogs is clearly underfit because it assigns no probability to cats. [DeepLearningBook] [Page 715.](https://www.deeplearningbook.org/contents/generative_models.html) We will tone down our phrase as below. >> [before] We design SaKLD to distinguish a good generative model which produces the same quantity of each attribute present in the training data. For example, if 50,000 training data contains 3,000 images with eyeglasses, the model should generate exactly 3,000 images with eyeglasses. Any deviation from this ideal distribution is considered undesirable. We introduce a new metric that quantifies ... >> [after] If we have a dataset containing dogs and cats, a generative model that produces only dogs is clearly underfit because it assigns no probability to cats [DeepLearningBook]. In this context, we consider a generative model better than another if it produces similar number of samples along attributes. As we do not have an access to the groundtruth real and fake distributions, we design a new alternative metric, SaKLD, that quantifies ... [DeepLearningBook] Goodfellow et al., Deep Learning, MIT Press 2016 # Reviewer E51D  ### Failure modes of metrics Here are 2 failure modes we anticipate. We add these into More Discussion. #### Bias in CLIP may deliver inaccurate result. As review o9jM referred, if some attributes are highly correlated in CLIP embedding space, SaKLD and PaKLD will not resemble human judgement. Embeddings from biased encoder will be plotted as distorted distribution causing unreliable results. Using biased model is inevitable since all models are biased including CLIP. However, we do not directly use the features but utilize them to calculate the distance between distributions. Even though the model is biased, the distance is meaningful since the features from real dataset are also extracted by that biased model. #### Need enough samples We use gaussian Kernel Density Estimation(gaussian KDE) to make Probability Density Function(PDF) for each attribute from dataset's DCS. If the number of images is too small, the subset's PDF may not describe the full dataset's PDF, and it could deliver inaccurate interpretation of full dataset (or generative model's performance). Fig. 5 (a) describes the impact of sample size and 50k images are recommended to get accurate result. ### Drastic failure in single attribute v.s. balanced failure among all attributes It is difficult to determine model A (drastic failure in single image) or B (balanced failure among attributes) is worse, since each user has different demand. So, instead of being buried only on total scores of SaKLD or PaKLD, we recommand to inspect SaKLD (or PaKLD) to get thorough insight of model. One can easily look in the top-scoring attributes from SaKLD or PaKLD. Furthermore, the main strength of our metric is flexibility, which contains users can drop some careless attribute (even shown significant SaKLD), or add some attribute which one mainly concentrates.  ### Dominance of several attributes We carefully mention that we set up an extreme experiment(man-smile correlation=1) to show some dramatically poor attribute SaKLD(PaKLD) in experiment 4. Therefore, dominated by few attributes of attribute pairs are our desired result. In other word, if the dominating phenomenon is observed, it can be considered to have as much impact as the extreme experiment we designed. And in practice, a similar phenomenon is observed among current major models. As we described in global rebuttal 3., each model has shown insufficient performance mimicking some attribute distributions. In the same context, we can find some of the model's total SaKLD/PaKLD drop when the number of attributes increased, as Figure 5. (b). We can understand this phenomenon as some attribute they lack is inserted.  ### Exploring major models with our metrics  We added exploration result of major models in global rebuttal 3, and also Appendix C. We reported each model's characteristics, such as iDDPM has a particularly high SaKLD for "women" and "makeup". As interpreting the model is one of our main contribution, we moved interpretable analysis part from appendix to Section 5. ### DCS symmetric about 0?  DCS values for opposite attribute are heading opposite direction in common, but not same as the absolute value. Considering DCS embedding space, an angle between embedding for "long hair" and embedding for "short hair" may not be absolute $180^\circ$ since DCS embedding depends on text mean. We get text mean via considering not only "long hair", and "short hair", but for other attributes, the DCS for opposite attribute cannot be symmetric as ideally. Nevertheless, signs are always observed to be opposite, which brings ease of interpretation. We attached figures of CS, DCS for opposite attributes for 2 images in .pdf file. Please refer to Figure 3 in the global rebuttal PDF file. ### Understanding accuracy We perform binary classification over all attributes in CelebA according to DCS and compare it with the groundtruth attribute labels. Positive means having an attribute. Higher accuracy means DCS agrees with the annotation. Thank you for the constructive comment. We will clarify this as follows.   Here are the first 19 attribute accuracy among 40 attributes. Attributes such as "Double_Chin"'s accuracy is <0.2 for both CS and DCS, while explicit attrbute such as "blonde hair"' accuracy is almost 0.7 for DCS. We attached whole 40 attribute's accuracy in Appendix. | | Mean accuracy | 5_o_Clock_Shadow | Arched_Eyebrows | Attractive | Bags_Under_Eyes | Bald | Bangs | Big_Lips | Big_Nose | Black_Hair | Blond_Hair | Blurry | Brown_Hair | Bushy_Eyebrows | Chubby | Double_Chin | Eyeglasses | Goatee | Gray_Hair | Heavy_Makeup | |-----|---------------|------------------|-----------------|------------|-----------------|-------|-------|----------|----------|------------|------------|--------|------------|----------------|--------|-------------|------------|--------|-----------|--------------| | DCS | 0.442 | 0.188 | 0.419 | 0.549 | 0.192 | 0.644 | 0.634 | 0.247 | 0.432 | 0.566 | 0.696 | 0.199 | 0.448 | 0.313 | 0.327 | 0.191 | 0.731 | 0.525 | 0.382 | 0.563 | | CS | 0.395 | 0.276 | 0.369 | 0.578 | 0.162 | 0.508 | 0.595 | 0.235 | 0.339 | 0.560 | 0.582 | 0.218 | 0.407 | 0.345 | 0.202 | 0.136 | 0.628 | 0.477 | 0.237 | 0.510 | ### More informations about BLIP attribute extraction  The list below is top 20 frequent attributes in BLIP caption sorted as descending order. BLIP provides a standardized attribute extraction for all datasets,but also has a shortage. As BLIP do not describe an image in great detail, for example 'a cat sitting on a table in a room' was commonly seen in LSUN cat captions, we recommand users to use USER annotation mainly.  FFHQ : ['a woman', 'a man', 'a person', 'glasses', 'a suit', 'a little girl', 'a young boy', 'a cell phone', 'a microphone', 'a necklace', 'a hat', 'tie', 'a young girl', 'blonde hair', 'long hair', 'a blue shirt', 'a beard', 'a white shirt', 'her head', 'a tie'] LSUN cat : ['a cat', 'a black cat', 'the floor', 'cats', 'a couch', 'a black and white cat', 'a white cat', 'a couple', 'a woman', 'a table', 'the ground', 'a dog', 'a man', 'black and white cat', 'a small kitten', 'a person', 'blue eyes', 'a blanket', 'a chair', 'a kitten'] Metfaces : ['a painting', 'a man', 'a woman', 'a portrait', 'a black and white photo', 'curly hair','a suit', 'a statue', 'a beard', 'a person', 'a drawing', 'tie', 'a hat', 'a black hat',"a woman's head", 'white hair', 'a young girl', 'her head', 'a white dress', 'long hair'] LSUN church : ['a clock tower', 'a clock', 'a steeple', 'people', 'an old church', 'a building', 'the background', 'a cross', 'a city', 'a white church', 'a large cathedral', 'a large church','a tall tower', 'the middle', 'two towers', 'a very tall building', 'a view', 'a statue', 'trees', 'a street'] #### Minor notes Thank you for helping us improve our paper. And for nP2 in L222, it meant a permutation of two out of N. We have reflected them in the paper and ran another proofreading.