## Final global response

Dear reviewers and AC,

We sincerely thank all the reviewers for their valuable comments and suggestions, which have helped improve our paper. Although some reviewers did not engage with our responses, we have incorporated all comments into both the main paper and the appendix. We are happy that our key contributions are properly recognized:

* $\textit{Directional}$ CLIPScore allows users to grasp the presence of attributes in a given image, while CLIPScore does not.
* Single and paired attribute alignment (SaKLD and PaKLD) allow users to choose a model that captures desired attributes in the training distribution.

Sincerely, authors

SaKLD and PaKLD across different numbers of samples (10k–70k) over five random seeds:

|       |           | 10k          | 20k          | 30k          | 40k          | 50k          | 60k          | 70k          |
|-------|-----------|--------------|--------------|--------------|--------------|--------------|--------------|--------------|
| SaKLD | seed1     | 8.27         | 7.99         | 7.83         | 7.88         | 7.77         | 7.84         | 7.80         |
|       | seed2     | 8.34         | 8.20         | 8.09         | 8.15         | 8.10         | 8.06         | 8.03         |
|       | seed3     | 9.13         | 8.43         | 8.04         | 8.04         | 7.88         | 7.84         | 7.80         |
|       | seed4     | 8.66         | 8.20         | 7.89         | 7.98         | 7.85         | 7.89         | 7.87         |
|       | seed5     | 8.40         | 7.93         | 7.61         | 7.79         | 7.66         | 7.74         | 7.68         |
|       | mean(var) | 8.56(0.123)  | 8.15(0.039)  | 7.89(0.036)  | 7.96(0.019)  | 7.85(0.026)  | 7.87(0.013)  | 7.83(0.016)  |
| PaKLD | seed1     | 23.47        | 21.18        | 20.29        | 20.08        | 19.69        | 19.69        | 19.45        |
|       | seed2     | 23.62        | 21.54        | 20.60        | 20.43        | 20.17        | 19.95        | 19.76        |
|       | seed3     | 24.90        | 21.98        | 20.48        | 20.15        | 19.67        | 19.45        | 19.30        |
|       | seed4     | 24.29        | 21.70        | 20.44        | 20.23        | 19.80        | 19.67        | 19.48        |
|       | seed5     | 23.82        | 21.03        | 19.82        | 19.88        | 19.48        | 19.44        | 19.19        |
|       | mean(var) | 24.02(0.337) | 21.48(0.148) | 20.32(0.092) | 20.15(0.040) | 19.76(0.065) | 19.64(0.043) | 19.43(0.046) |

# Private message to AC

Dear AC,

We genuinely appreciate the reviewers' efforts in assessing our work. However, we find the reviews from [ch3Y] to be somewhat lacking in specificity and challenging to fully comprehend. Despite this, we have put our best effort into formulating a thorough response. We kindly request that the lack of specificity in this review be taken into consideration during the decision process. Thank you for your consideration.

Best regards,
Authors

# Global rebuttal

We thank the reviewers for their valuable advice. Here, we compile responses that we would like to share with all the reviewers. Please see our responses addressing the specific concerns below.

### 1. Additional experiments providing more convincing evidence

Reviewers T7Dd, 1cWg, and ch3Y suggested adding more experiments to make the paper more convincing. We provide additional experiments that support the arguments of our paper.

#### 1-1. Biased image injection experiment

We conducted an additional experiment under stricter conditions that supports Sec. 5.1 (Correlated Image Injection Experiment). We ran the existing experiment with a denser injection schedule and present the results as a graph. Please refer to **Fig. 2** in the PDF. We compare SaKLD and PaKLD with FIDs by injecting images with certain attribute pairs ("man"-"makeup" and "man"-"bangs"). Fig. 2 shows that while SaKLD and PaKLD capture the distribution changes, FIDs remain almost unchanged. The results show that our proposed method captures attribute distribution changes well, unlike conventional metrics.
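For reference, the sketch below shows one way a single- or paired-attribute KLD can be estimated from per-image DCS values with Gaussian KDE, which is the kind of quantity compared in the injection experiment above. The synthetic Gaussian data and the helper name `attribute_kld` are illustrative placeholders, not our exact implementation.

```python
import numpy as np
from scipy.stats import gaussian_kde

def attribute_kld(dcs_real, dcs_fake, grid_size=100, eps=1e-12):
    """KL(P_real || P_fake) for one or two attributes, estimated with Gaussian KDE.
    dcs_* has shape (n_images,) for a single attribute or (2, n_images) for a pair."""
    dcs_real, dcs_fake = np.atleast_2d(dcs_real), np.atleast_2d(dcs_fake)
    lo = np.minimum(dcs_real.min(axis=1), dcs_fake.min(axis=1))
    hi = np.maximum(dcs_real.max(axis=1), dcs_fake.max(axis=1))
    axes = [np.linspace(l, h, grid_size) for l, h in zip(lo, hi)]
    grid = np.stack([g.ravel() for g in np.meshgrid(*axes)], axis=0)
    p = gaussian_kde(dcs_real)(grid) + eps
    q = gaussian_kde(dcs_fake)(grid) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

# Toy demo of the injection idea: correlating two attributes in the fake set
# (e.g., "man" and "makeup") shifts the joint distribution and raises the paired score.
rng = np.random.default_rng(0)
real = rng.multivariate_normal([0, 0], [[1.0, 0.0], [0.0, 1.0]], 5_000).T  # DCS("man"), DCS("makeup")
fake = rng.multivariate_normal([0, 0], [[1.0, 0.6], [0.6, 1.0]], 5_000).T  # injected correlation
print("single-attribute KLD:", attribute_kld(real[0], fake[0]))
print("paired-attribute KLD:", attribute_kld(real, fake))
```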
#### 1-2. Additional model performance descriptions

We provide results for more models - LDM, StyleSwin, and ProjectedGAN - in the following table. These results show that our metric largely follows the tendency of FIDs. Notably, ProjectedGAN, which was criticized [FID-clip] for not focusing on fidelity and instead obtaining a good FID score by matching the training set's ImageNet-encoder output statistics, shows inferior results on our metric, in SaKLD and especially PaKLD. This indicates that even if a model can mimic a few of the attributes contained in ImageNet labels, it is hard to capture the correlation of attributes in the training set. **Fig. 4** in the PDF shows some examples from ProjectedGAN: a baby with a beard and other unusual samples.

[FID-clip] Kynkäänniemi et al., The Role of ImageNet Classes in Fréchet Inception Distance, ICLR, 2023

#### Table 1 (*the scale of SaKLD and PaKLD is 1e-7; computed using 50k FFHQ images)

|            | StyleGAN1 | StyleGAN2 | StyleGAN3 | iDDPM | LDM   | StyleSwin | ProjectedGAN |
|:----------:|:---------:|:---------:|:---------:|:-----:|:-----:|:---------:|:------------:|
| SaKLD      | 9.82      | 6.70      | 6.00      | 10.70 | 18.32 | 16.93     | 11.45        |
| PaKLD      | 23.50     | 16.75     | 15.50     | 25.00 | 39.71 | 40.25     | 28.18        |
| FID        | 4.74      | 3.17      | 3.20      | 7.31  | 11.86 | 5.45      | 4.45         |
| FID_CLIP   | 3.17      | 1.47      | 1.66      | 2.39  | 3.57  | 3.63      | 2.45         |

### 2. Improving the way to obtain $C_{\mathcal{T}}$, the center of the images in the text space

Reviewer T7Dd provided a constructive discussion about using the attributes themselves for obtaining $C_{\mathcal{T}}$. Thanks to reviewer T7Dd, we improve the way to obtain $C_{\mathcal{T}}$ by using the attributes themselves. We conduct the same experiment, reporting classification accuracy against the CelebA annotations, to compare using captions with using the attributes themselves.

| | CS | DCS w/ captions | DCS w/ attributes themselves |
| :--------: |:------: | :--------: | :--------: |
| accuracy | 0.395 | 0.409 | 0.442 |

Using the attributes themselves for obtaining $C_{\mathcal{T}}$ yields higher accuracy than the alternatives. We have re-run all the experiments in the paper using the attributes themselves and confirm that the trends are all the same, including Table 1 above, which was obtained with the attributes themselves. Thanks for the great constructive suggestion; we will update all the results for the camera-ready version, and there are no changes in the overall tendency.

### 3. Moving the interpretable analysis from the appendix to Section 5

Reviewers o9jM and E51D suggested showing a more fine-grained analysis of the scores. Thanks to these great suggestions, we decided to move the analysis in the appendix into the main paper. We agree that this analysis can greatly highlight the interpretability advantages of our method. The following is what has been transferred. Please refer to Fig. S13 and Tab. S7 for details.

#### SaKLD

SaKLD directly measures the differences in attribute distributions, indicating the challenge for models to match the density of the highest-scoring attributes to that of the training dataset. Examining the top-scoring attributes, all three StyleGAN models have similar high scores in terms of scale. However, there are slight differences, particularly in StyleGAN3, where the distribution of larger accessories such as eyeglasses or hats differs.
In contrast, iDDPM demonstrates notable scores, with the attributes 'makeup' and 'woman' showing scores over two times higher than the GANs. Apart from these two attributes, the remaining attributes are similar to the GANs, highlighting significant differences in the density of 'woman' and 'makeup'. Investigating how the generation process of diffusion models, which involves computing gradients for each pixel, affects attributes such as 'makeup' and 'woman' would be an intriguing avenue for future research.

#### PaKLD

PaKLD provides a quantitative measure of the appropriateness of relationships between attributes. Thus, if a model generates an excessive or insufficient number of specific attributes, it affects not only SaKLD but also PaKLD. Therefore, it is natural to expect that attribute pairs with high PaKLD scores will often include top-ranking attributes in SaKLD. Nevertheless, PaKLD reveals interesting findings. Firstly, it is noteworthy that attributes related to 'beard' consistently receive high scores across all of StyleGAN 1, 2, and 3. Figure S16 confirms that 'beard' significantly contributes to the overall PaKLD scores. This indicates that GANs generally fail to learn the relationship between beards and other attributes, making it an intriguing research topic to explore the extent of this mislearning and its underlying reasons.

LDM also exhibits interesting attributes, particularly 'bald head'. Despite not having a high score in SaKLD, it consistently receives high scores in PaKLD. This implies that generated individuals with bald heads lack many attributes that are otherwise commonly found, and analyzing this phenomenon would be a promising avenue for future research.
------

# Reviewer T7Dd

### Presentation of paper can be improved

<!-- > The presentation of the paper could be improved upon. [W1-1] This include the writing, and the readability of the figures, as well as the quantity of information provided. This last points concerns [W1-2] mainly the presentation and framing of the problem of interpretability of representations, as well as [W1-3] the presentation and motivation of the experimental settings. -->

[W1-1] We revised the writing and the figures for readability. Please refer to the attached PDF file for the figures. As mentioned in item 1 of the global rebuttal, we also add more information; please refer to the global rebuttal.

[W1-2] Evaluating different models allows users to choose a model that meets their needs. FID, precision, recall, density, and coverage measure quality and diversity so that users can choose a model with the best quality or trade quality off against diversity. SaKLD and PaKLD measure how much the generated images align with the training images regarding semantic attributes.
For that, the directional CLIP score interprets an image as the amounts of its attributes. Hence, users can choose a model with the desired attributes.

[W1-3] We provide our motivations and improve the presentation as follows:

ex 5.1 Validation of our metric's effectiveness: To check whether our metric operates properly, we intentionally injected "man with makeup" images only into an experimental set. We found deteriorations in both SaKLD and PaKLD regarding "makeup" since such images are rare in the training set. The total SaKLD and PaKLD scores also worsened, reflecting how individual attributes and attribute pairs affect the total score. Similarly, SaKLD regarding a specific attribute worsens as we exclude images having that attribute from one of the sets.

ex 5.2 Ablational validation of PaKLD: For thorough validation, we established a particular scenario where setA consists of images with more smiling men than unsmiling men and more unsmiling women than smiling women, and in setB it is vice versa. We set both groups to have the same number of images for each attribute ("smiling", "man", and "woman"). While SaKLD scores regarding each attribute were similar, PaKLD regarding "smiling man" and "smiling woman" showed significant differences compared to the other PaKLD scores. This demonstrates that PaKLD accurately captures differences between attribute correlations. We used the CelebA dataset for this experiment since it has ground-truth attribute labels.
### Concept based representation

<!-- >[W2] In particular, the paper lacks related work on concept-based representation for interpretability of images. While building metrics dedicated to generative model seems new to me, there are based on an idea which has been explored extensively before. See for example "Concept Whitening for Interpretable Image Recognition, Chen et al, 2020". -->

Concept Whitening extracts human-understandable concepts from black-box models through the CW layer. In certain contexts, the motivation behind interpreting black-box models closely aligns with ours. However, our interpretation approach diverges significantly, as it is post hoc. We recognize the significance of associating our problem with concept-based representations, going beyond purely post hoc methods, in order to derive more valuable insights. We have organized this discussion and added it to the related work.

### Further exploration of attribute selection

<!-- >[W3] The paper focuses on a narrow choice of methods to generate attributes, which, to me, should be one of the key experimental investigation of the paper. Notably, the previous literature explores using different kind of attributes, coming from existing data (for example, "Interpretable Basis Decomposition for Visual Explanation, Zhou et al, 2018") or to be learnt ("A Framework to Learn with Interpretation", Parekh et al, 2021). The authors only (very shortly) argue about the number of needed attributes. -->

In addition to language-text models like BLIP and GPT, incorporating methods such as a classifier's heatmaps for obtaining attributes and their ranks can introduce new attributes, potentially leading to novel perspectives for understanding generative models. We concur with utilizing a variety of methods for attribute extraction.
Beyond this, our primary contribution lies in our pioneering effort to interpret generative models. While attribute extraction methodologies may yield broader feedback, they are not as central as our primary objective: interpreting models through attribute quantification and comparative analysis. Additionally, the reason we set the number of attributes to up to 40 is that CelebA has 40 attributes. If a user is interested in hair styles, one can enrich the target attributes. We have added this topic to the Discussion.

### More explanation for toy experiments

<!-- >[W4] The toy experiments seem relevant but are very fastly presented and should be expanded upon. The remaining experiments are too short to be convincing and only focus on a handful of models. -->

We add a more detailed explanation, which includes the content in the global rebuttal and [W1-3]. We designed the toy experiments to perturb the distribution in a controlled way. The outcomes of this experiment, particularly the injection of "man&makeup" images that are rarely found in the training set, lead to a noticeable linear increase in both SaKLD and PaKLD scores. However, there were no distinctive patterns observed when we injected another ground-truth subset.
### Text mean and Image mean

#### Dividing text mean and image mean

It is necessary to center texts and images differently. Appendix B.1 provides an ablation experiment: setting the text origin as the text mean vs. the image mean. We additionally provide the entire-mean setting in the table below. Using the text mean shows the best accuracy. The image origin is set to the image mean.

| | different text/image mean | same text/image mean | entire mean |
| -------- | -------- | -------- | ------|
| accuracy | 0.409 | 0.228 | 0.313 |

We suppose the image embedding and the attribute embedding are slightly misaligned because CLIP is trained with sentences while we use the simplified form "a photo of {attribute}". In other words, a single attribute does not fully describe one image. This is our motivation for using different centers for images and attributes.

#### Text mean from the attributes themselves

Thank you for the constructive suggestion. We verified that using the attributes themselves for obtaining $C_{\mathcal{T}}$ is more accurate than the previous method in the CelebA experiment. We re-ran all experiments in our paper and confirm that there is no change in the tendency. Please refer to the global rebuttal. Thanks again for the great suggestion.

| | captions | attributes themselves |
| -------- | -------- | -------- |
| accuracy | 0.409 | 0.442 |

### Relying on external models such as BLIP

<!-- > [L1] Previous distributional metrics are several times referred as relying on external models in your paper. However, the attributes that you use also rely on external models (except the USER one, of course - but in this case, the computing of DCS still relies on a captioning model). How would you address this issue ? -->

We respectfully suspect a misunderstanding. We do not object to using external models; we object to using uninterpretable embeddings. Our solution is to design an interpretable embedding using DCS. These explanations are in L47 and L66. One of the advantages of our metric is its flexibility; one can proceed with the desired task and analysis. Using external models for embedding is a smart approach that removes costly manual annotation.

# Reviewer 1cWg

### GPT may bring bias

We tone down our claim regarding the use of GPT in the paper as follows. Instead of being a vital network for obtaining text attributes, GPT is positioned as a recommendation module to assist users in selecting attributes. As users cannot manually apprehend all the visual attributes in training datasets, external models including GPT can be helpful, but they should be treated as optional tools rather than required modules in the attribute selection process.

### Correlation with human judgement

<!-- > [W2] The experimental part in Sections 5.1-5.3 does not seem very convincing. The authors mainly verify that the proposed method can indeed be consistent with some expected experimental designs, but there is a lack of more convincing quantitative indicators to show that their proposed evaluation metrics are better than those proposed by others previous research works. I would prefer to see the authors analyze the correlation coefficient between their proposed evaluation metrics and human evaluation, as well as whether their evaluation metrics have improved in terms of correlation coefficient compared to previous evaluation indicators. This is my biggest concern for this work. -->
We respectfully note that collecting human evaluations of generated attributes is impractical due to the number of samples and the subjectivity involved. While image quality and diversity can be roughly evaluated by humans with perhaps 100 images, an attribute distribution evaluated with 100 images is prone to hasty generalization. Furthermore, evaluating 50K images for attributes is impractical. As a remedy, we offer a novel evaluation protocol for broad and customizable attributes using an interpretable embedding and the divergence of the generated distribution from the real distribution.

### Euclidean CLIPScore

<!-- > [Q1] In Section 3.1, can the problem exhibited by CLIPScore be solved by directly calculating the Euclidean distance between E(x) and E(a) without relying on Directional CLIPScore? -->

> Can the problem exhibited by CLIPScore be solved by directly calculating the Euclidean distance between E(x) and E(a)?

No, it cannot be resolved by Euclidean distance because CLIP is trained with cosine distance. To confirm this, we provide additional experiments in the table below. As expected, the Euclidean cases are inferior to the cosine cases in CelebA attribute classification.

| | Euclidean | Cosine |
| -------- | -------- | -------- |
| As-is embedding | 0.222 | 0.395 |
| Directional embedding | 0.225 | 0.409 |

#### Notation error

<!-- > [Q2] Notation overload. The meanings of N used in Eq. (5) and Eq. (2) are different. -->

We replaced the N used in Eq. (5) with another notation to avoid overloading the N in Eq. (2). Thanks for catching this.

# Reviewer o9jM

### Image quantity's impact on result stability

<!-- >[W1] It appears to be an intrinsic limitation that a significant number of samples (50k) are still required to obtain stable results. Nonetheless, thanks to the reported findings, we can gain more insights, and I appreciate that. Remaining concerns are written in [Questions] section. -->

We use Gaussian kernel density estimation (Gaussian KDE) to build a probability density function (PDF) for each attribute from the dataset's DCS values. If the number of images is too small, the subset's PDF may not describe the full dataset's PDF, which could deliver an inaccurate interpretation of the full dataset (or of the generative model's performance). Fig. 5 (a) describes the impact of sample size; 50k images are recommended to get accurate results.

### First explainable metric

<!-- >[Q1] In lines 96-97, it seems that the authors of the referenced paper already propose using the CLIP embedding space for evaluating generative model? If not, it would be helpful to clarify the differences compared to previously proposed methods. Regarding the for explainable evaluation in the related work section, it is necessary to determine whether this paper is the first to come up with it. If there are any additional previous works, it would be helpful to share them. -->
> Regarding the explainable evaluation ... it is necessary to determine whether this paper is the first to come up with it.

While our evaluation protocol and FID$_\text{CLIP}$ [24] both use CLIP embeddings, they differ as follows. FID$_\text{CLIP}$ directly computes FID in the CLIP embedding space. The CLIP embedding is still an uninterpretable feature vector of 512 dimensions, and each channel of the embedding does not have an explicit meaning. On the other hand, we compute the similarity between an image and a set of given attributes in the CLIP embedding space. This projects the uninterpretable CLIP embedding of the image into an interpretable embedding space, and each channel of our final embedding conveys the similarity of the image to its respective attribute. This paper is the first to come up with an explainable evaluation that is applicable to various settings. We will add a closely related work [GANseeing] as follows.

>> Our evaluation protocol is similar to GANseeing in that 1) both consider semantics and 2) both compute the divergence of the generated images from the real images. The differences are: 1) ours considers attributes while GANseeing considers object categories, 2) ours computes KLD for a more detailed comparison while GANseeing simply computes the Fréchet distance, and 3) ours is flexible in choosing target attributes while GANseeing is limited to the predefined classes of a semantic segmentation network.

[GANseeing] Bau et al., Seeing What a GAN Cannot Generate, ICCV, 2019

### What if CLIP is biased?

<!-- >[Q2] The major concern raised is the reliance on the embedding space. What if there are biases in the CLIP space itself? For example, if being female and wearing makeup are highly correlated, even a woman who does not actually wear makeup may show high similarity to makeup attribute. In such cases, it becomes difficult to consider CLIP score or CLIP direction score as accurate measures of similarity. While a more disentangled multimodal latent space may help alleviate this problem, I'm curious about the authors' perspective on this issue. -->

Every model, including CLIP, may be biased. However, we utilize a biased pretrained model as a feature extractor for an evaluation metric because we calculate the distance between distributions of the features. Even though the model is biased, the distance is meaningful since the features of the real dataset are also extracted by that same biased model. Nevertheless, it is ideal to have as little bias as possible. We suppose that if there are biases in the CLIP space itself, neither CS nor DCS is accurate, but DCS is better than CS since we move the origin into the middle of the attributes.

### Do we need all training images?

<!-- >[Q3] Directional CLIP Score adopts a method of computing the center of training images and normalizing them. Does this mean that anyone who wants to evaluate a generative model needs to share all these training images? Also, considering the introduction of the auxiliary image captioning model called BLIP, the improvement in performance seems minimal. Do the authors believe it is worth the trade-off? -->
Obviously, it is recommended to use all the training images, the same as in other evaluation metrics. For efficiency, as is done for FID, one can store and reuse the CLIP embeddings of all images. Alternatively, one can use a well-designed subset of the training dataset or an evaluation dataset.

### Is the captioning model worth the trade-off?

> Also, considering the introduction of the auxiliary image captioning model called BLIP, the improvement in performance seems minimal. Do the authors believe it is worth the trade-off?

One of the advantages of our metric is its flexibility; one can proceed with the desired task and analysis. Using the auxiliary image captioning model is a smart approach that removes costly manual annotation. We kindly mention that improving the performance is not the purpose of using the captioning model. We believe it is worthwhile not only for saving annotation effort but also for providing a standardized way of evaluation.

### Explanation of FID$_\text{CLIP}$

<!-- >[Q4] Question about Table 5: I may have missed it, but was there an explanation for the last column of FID_clip in Table 5? What does it represent? Also, in Table 5, it seems that SaKLD is calculated for a single attribute. How is it computed? -->

FID is the Fréchet distance between real embeddings and generated embeddings in Inception-V3's penultimate feature space. FID$_\text{CLIP}$ computes the same Fréchet distance but in the CLIP image embedding space. We included FID and FID$_\text{CLIP}$ in Table 2 to show that they somewhat correlate (negatively) with the number of injected images. Still, they do not provide any clue as to which attributes are under- or over-represented. While the original CLIP embedding space is not interpretable, our proposed attribute embedding is designed to be interpretable by exploiting the superior text-image model.

### More than 2 attributes with PaKLD

<!-- >[Q5] Can PaKLD be proposed for more than two attributes? It would be great if judgments could be made for a wide range of attributes as shown in Figure 1 (b). -->

> Can PaKLD be proposed for more than two attributes?
Of course, our metric can be calculated for more than two attributes. We computed a triple-attribute KLD between the ground-truth FFHQ images and images generated by iDDPM, and we observed that P("a person" & "glasses" & "a cell phone") showed the most significant difference in the 3D joint probability between the GT and generated images. We made judgments using diverse combinations of attributes and added them to the Appendix. However, we kindly mention that intuitive interpretation becomes difficult with more than three attributes.

### Interpreting numerical values of the metric

<!-- >[Q6] How should we interpret numerical values itself of the scores? Although it may not be as interpretable as accuracy I'm still curious about the authors' insights based on their experience. -->

We can interpret the numerical values by looking at the top-scoring attributes. We briefly share our insights in Appendix C, and we have also decided to move some of this information into the main paper; please refer to the global rebuttal. In Appendix C, we computed the major models' scores, and StyleGAN3 was ranked the best among all models when using BLIP-extracted text attributes. Additionally, while comparing each generative model's total score is helpful, a more thorough interpretation of each model's performance via SaKLD or PaKLD would be beneficial. For example, StyleGAN3 failed to capture the training images' "eyeglasses" or "hat" distributions, possibly due to its alias-free modeling approach.

# Reviewer ch3Y

### Experiments should be more convincing

> Although, the motivation for developing this metric is valid, the overall methodology and the experimental results are not convincing for the use of this metric.

We appreciate the recognition that our motivation for evaluating attributes of generated images is valid. We could not determine from the review what is missing in our experiments. If it refers to how our metrics vary across different image distortions, we do not focus on image quality and diversity because they can be measured by existing metrics. To strengthen our case, we add one more experiment omitting eyeglasses, where SaKLD grows with the number of omitted images. Could you elaborate if there is more to add?

| | SaKLD | PaKLD | most influential attribute for SaKLD |
|-----------------------------------------------------------------------------------|-------|--------|--------------------------------------|
| setA (eyeglasses 3325/total 50000) <br> vs <br> setB (eyeglasses 3260/total 50000) | 0.632 | 3.421 | "beard" |
| setA (eyeglasses 3325/total 50000) <br> vs <br> setB (eyeglasses 2000/total 50000) | 0.892 | 4.050 | "eyeglasses" |
| setA (eyeglasses 3325/total 50000) <br> vs <br> setB (eyeglasses 1000/total 50000) | 1.545 | 5.668 | "eyeglasses" |
| setA (eyeglasses 3325/total 50000) <br> vs <br> setB (eyeglasses 0/total 50000) | 3.257 | 11.595 | "eyeglasses" |

### Comparative experiments with previous metrics

We first note that the main advantage of our metrics is interpretability: users can observe which attributes are modeled properly and which are not. Previous metrics cannot measure this aspect. In addition, our metrics measure how well the per-attribute and pair-attribute distributions of generated images align with those of the training images (Figures 4, S2, S3, S4, and S18).
Furthermore, Table 2 compares how FID and FID$_\text{CLIP}$ change over different numbers of injected images. We will add precision and recall to the table during the discussion period.

### Relying on textual attributes

We would love to represent attributes in the image modality, but it is prohibitive because an image cannot hold exactly *one* attribute; an image is a composition of various attributes. Textual attributes are a common and plausible modality for representing attributes in images [LLaVA][ESPER][Visual ChatGPT][BLIP-2].

[LLaVA] Liu et al., Visual Instruction Tuning, arXiv, 2023
[ESPER] Ye et al., Fusing Pre-trained Language Models with Multimodal Prompts through Reinforcement Learning, CVPR, 2023
[Visual ChatGPT] Wu et al., Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models, arXiv, 2023
[BLIP-2] Li et al., BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models, arXiv, 2023

### CLIPScore and KLD are not novel

We propose the directional CLIP score (DCS) instead of CLIPScore. DCS is superior to CLIPScore in attribute classification, as shown in Table 1. Furthermore, DCS yields a signed scalar rather than a binary attribute, as shown in Figure 2. These improvements make SaKLD and PaKLD more sensitive to attributes in the images. KLD is a common measure for computing the divergence of one distribution from another. We respectfully remind the reviewer that our novelty lies in defining the distributions of attribute strengths from a set of images rather than in using KLD.

### Some important parts are a bit ambiguous

##### Center of attributes

Eq. (2): $C_{\mathcal{T}} = \frac {1}{N}\sum_{i=1}^{N}\textbf{E}_{\textbf{T}}(\text{BLIP}(x_i))$, where $\textbf{E}_{\textbf{T}}$ is the CLIP text encoder and {$x_i$} are the training images. BLIP produces a caption for a given image $x_i$, $\textbf{E}_{\textbf{T}}$ produces a CLIP text embedding of the caption, and the rest computes their mean; an illustrative sketch is provided further below in this response. We are not sure which part is ambiguous. Could you elaborate?

##### CelebA attribute classification accuracy

> results from table 1 (what does accuracy stand for) etc. are a bit ambiguous.
> [Q2] Table 1 represents accuracy results for CS vs DCS. What is meant by evaluating "positive samples" in line 164 and 165? And what is meant by "how well they are aligned"? Also, what exactly is the accuracy calculated for?

We perform binary classification over all attributes in CelebA according to DCS and compare it with the ground-truth attribute labels. "Positive" means having an attribute. Higher accuracy means DCS agrees with the annotation. Thank you for the constructive comment. We will clarify this as follows.

>> [before] Furthermore, DCS exhibits superior accuracy compared to CS, as demonstrated in Table 1. The table presents the accuracy results of CS and DCS for annotated attributes in CelebA [20]. By evaluating how well positive samples with the highest score align with positive samples for a given attribute, DCS consistently outperforms CS in accuracy.

>> [after] Furthermore, Table 1 demonstrates the superiority of DCS over CS in measuring attribute strength. If we consider a sample with a high DCS as possessing an attribute, DCS achieves higher accuracy than CS in classifying samples with or without the attributes annotated in CelebA [20].

##### Method is not sound

> There is a lot of scope to improve the technical soundness of the paper.
> The proposed methodology is not very sound and ...

We could not understand which part is not sound. Could you elaborate?
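Returning to the "Center of attributes" item above: the following minimal sketch illustrates one way to compute $C_{\mathcal{T}}$ from Eq. (2), the image-side center $C_{\mathcal{I}}$, the re-centered cosine similarity (our reading of DCS), and the resulting per-attribute interpretable embedding. The random arrays stand in for CLIP/BLIP outputs, and the helper name `directional_clip_score` is illustrative rather than our exact implementation.

```python
import numpy as np

def directional_clip_score(img_emb, attr_emb, img_center, txt_center, eps=1e-12):
    """Cosine similarity after re-centering the image and text embeddings."""
    v_img = img_emb - img_center
    v_txt = attr_emb - txt_center
    return float(v_img @ v_txt / (np.linalg.norm(v_img) * np.linalg.norm(v_txt) + eps))

rng = np.random.default_rng(0)
caption_embs = rng.normal(size=(1000, 512))   # stand-ins for E_T(BLIP(x_i)) over training images
image_embs   = rng.normal(size=(1000, 512))   # stand-ins for E_I(x_i)
attr_embs    = rng.normal(size=(40, 512))     # stand-ins for E_T("a photo of {attribute}"), N_A = 40

C_T = caption_embs.mean(axis=0)               # Eq. (2): text-side center
C_I = image_embs.mean(axis=0)                 # image-side center

# Interpretable embedding of one image: one DCS value per attribute (N_A channels).
x = image_embs[0]
embedding = np.array([directional_clip_score(x, a, C_I, C_T) for a in attr_embs])
print(embedding.shape)                        # (40,) -- each channel is the strength of one attribute
```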
##### Missing literature

> Although some of the popular metrics are mentioned, there has been a lot of work in the generative modelling domain which the literature survey must cover.

We include the commonly used metrics in the related work section: FID, FID$_\text{CLIP}$, precision & recall, improved precision & recall, density & coverage, and rarity score. We will add perceptual path length and Fréchet segmentation distance as below. Inception score is omitted because it is barely used currently. We also omit metrics measuring the alignment between condition (input) and generation (output), such as mIoU, because our goal is to measure the divergence of generated images from training images. Could you mention any other relevant works that would be helpful to include?

>> Perceptual path length [PPL] measures the sum of perceptual distances between samples along a latent interpolation to indicate the smoothness of the latent space. Fréchet segmentation distance [FSD] computes the Fréchet distance between the numbers of pixels of segmented categories in fake images and real images.

[PPL] Karras et al., A Style-Based Generator Architecture for Generative Adversarial Networks, CVPR 2019
[FSD] Bau et al., Seeing What a GAN Cannot Generate, ICCV 2019

##### Method is not supported by experiments

> The proposed methodology is ... not well supported with the experimental setup.
> Based on the current state of the experiments, the contributions don't seem to be significant as there is not enough validation to support the claims.

* Superiority of DCS over CLIPScore is supported by Table 1.
* Misaligned attributes are captured by SaKLD, as shown in Figure 4a.
* How SaKLD reflects injecting an irrelevant attribute is shown in Table 2.
* Misaligned pairs of attributes are captured by PaKLD, as shown in Figure 4b.

Could you elaborate on which part is not supported?

##### The results are also not sufficiently explained

We could not determine which results are not sufficiently explained. Could you elaborate?

##### Diversity is the main motivation

> Diversity is the main motivation for the paper as it is mentioned in the abstract but any theoretical or empirical work to support it is completely missing.

We respectfully note that our main motivation is interpretable evaluation rather than diversity, as written in L39-L51.

### Typos

Thank you for helping us improve our paper. We have fixed them in the paper and ran another round of proofreading.

### Image and text embeddings

> [Q1] If I understood correctly, the text and the image encoders are different? If yes, then the embedding space of both models is going to be very different and will be influenced by the downstream tasks that they are trained on. Would this metric always work for embeddings from different models? Does it make sense to calculate similarity between embeddings from different models?

We use the pretrained CLIP image and text encoders to produce image and text embeddings, respectively. The CLIP encoders are trained to maximize the cosine similarity between an image and its matching text while minimizing the cosine similarity between non-matching image-text pairs. Hence, the two encoders live in the same embedding space. It does not make sense to calculate similarity between embeddings from *different models*; that is why we use CLIP embeddings.

### Meaning of positive sample and alignment

This is merged into the previous item marked with [Q2].
### Meaning of each channel of the embedding

> [Q3] In Section 3.1, there is a mention of "each channel of the embedding being utilized", how exactly does this happen?

The procedure for encoding an image into an interpretable embedding is as follows: 1) compute the similarities between an image and the $N_A$ attributes using DCS; 2) form an $N_A$-dimensional embedding from these similarities. We will add this at the very beginning of Section 3.1. Thank you for the constructive comment.

### The same number of samples with a specific attribute?

> [Q4] Line 196-197 claims that the generated distribution is desirable only if it has exactly same number of samples as in the training set that has a particular attribute. Why should it be?

Please consider an extreme example. If we have a dataset containing dogs and cats, a generative model that produces only dogs is clearly underfit because it assigns no probability to cats. [DeepLearningBook] [Page 715.](https://www.deeplearningbook.org/contents/generative_models.html) We will tone down our phrasing as below.

>> [before] We design SaKLD to distinguish a good generative model which produces the same quantity of each attribute present in the training data. For example, if 50,000 training data contains 3,000 images with eyeglasses, the model should generate exactly 3,000 images with eyeglasses. Any deviation from this ideal distribution is considered undesirable. We introduce a new metric that quantifies ...

>> [after] If we have a dataset containing dogs and cats, a generative model that produces only dogs is clearly underfit because it assigns no probability to cats [DeepLearningBook]. In this context, we consider a generative model better than another if it produces a similar number of samples for each attribute. As we do not have access to the ground-truth real and fake distributions, we design a new alternative metric, SaKLD, that quantifies ...

[DeepLearningBook] Goodfellow et al., Deep Learning, MIT Press 2016

# Reviewer E51D

### Failure modes of the metrics

<!-- >[W1] Failure modes of the metric are not discussed (or limitations of language-image models and how they affect the metric). -->

Here are two failure modes we anticipate. We have added them to the More Discussion section.

#### Bias in CLIP may deliver inaccurate results

As reviewer o9jM noted, if some attributes are highly correlated in the CLIP embedding space, SaKLD and PaKLD will not resemble human judgment. Embeddings from a biased encoder form a distorted distribution, causing unreliable results. Using a biased model is inevitable since all models, including CLIP, are biased. However, we do not use the features directly but utilize them to calculate the distance between distributions. Even though the model is biased, the distance is meaningful since the features of the real dataset are also extracted by the same biased model.

#### Need enough samples

We use Gaussian kernel density estimation (Gaussian KDE) to build a probability density function (PDF) for each attribute from the dataset's DCS values. If the number of images is too small, the subset's PDF may not describe the full dataset's PDF, which could deliver an inaccurate interpretation of the full dataset (or of the generative model's performance). Fig. 5 (a) describes the impact of sample size; 50k images are recommended to get accurate results.

### Drastic failure in a single attribute vs. balanced failure among all attributes

It is difficult to determine whether model A (drastic failure in a single attribute) or model B (balanced failure among attributes) is worse, since each user has different demands.
So, instead of relying only on the total SaKLD or PaKLD score, we recommend inspecting the per-attribute SaKLD (or PaKLD) to get a thorough insight into the model. One can easily look at the top-scoring attributes from SaKLD or PaKLD. Furthermore, the main strength of our metric is its flexibility: users can drop attributes they do not care about (even ones showing significant SaKLD) or add attributes they mainly focus on.

<!-- >[W2] Consider a scenario where there are two models, A and B, with the same SaKLD. Model A produces essentially perfect alignment on all attributes other than one which fails dramatically, causing a large spike in SaKLD histogram, and this attribute is the only contributing to the final score. Model B on the other hand performs poorly across all attributes but averaging over attributes yields the same SaKLD score as the model A. In this scenario SaKLD would potentially not agree with human judgment, since failing in a single attribute might not be visible when inspecting large image grids. Can this kind of scenario occur in practice, and if it can, what would be your recommendation for the user of the metric in that case? -->

### Dominance of several attributes

We carefully mention that we set up an extreme experiment (man-smile correlation = 1) to show a dramatically poor per-attribute SaKLD (PaKLD) in experiment 4. Therefore, being dominated by a few attributes or attribute pairs is our intended result. In other words, if this dominating phenomenon is observed, it can be considered to have as much impact as the extreme experiment we designed. In practice, a similar phenomenon is observed among the current major models. As we described in item 3 of the global rebuttal, each model has shown insufficient performance in mimicking some attribute distributions. In the same context, we can see some models' total SaKLD/PaKLD drop when the number of attributes increases, as in Figure 5 (b). We can understand this phenomenon as reflecting the insertion of attributes that the model lacks.

<!-- >[W3] Fig. 4 shows that SaKLD and PaKLD are dominated by few attributes of attribute pairs. Is this usually the case in practice? Fig. 5 (b) also indicates that adding new attributes contribute to the metric with diminishing strength. This might be misleading for the user of the metric. Intuitively, adding a large set of attributes should correspond to more thorough evaluation of the model, however, this might not be the case if few attributes are dominating the final value of the metric. -->

### Exploring major models with our metrics

<!-- >[W4] The empirical effectiveness of the attribute based metric is not fully demonstrated. The authors advocated for an interpretable metric but unfortunately end up comparing modern generative models using single scalar numbers (as the existing metrics do), instead of taking advantage of the interpretability of the metric and showing a more fine-grained analysis of the models. -->

We added exploration results for the major models in item 3 of the global rebuttal and also in Appendix C. We report each model's characteristics; for example, iDDPM has a particularly high SaKLD for "woman" and "makeup". As interpreting models is one of our main contributions, we moved the interpretable analysis from the appendix to Section 5.

### DCS symmetric about 0?

<!-- >[Q1] Fig. 2 shows CS and DCS values for distinct attributes, how would these scores look like for pairs of "opposite attributes", e.g., "long hair - short hair" or "makeup - no makeup"? Would DCS values be symmetric around zero or something else? -->
### Drastic failure in a single attribute vs. balanced failure among all attributes

> [W2] Consider a scenario where there are two models, A and B, with the same SaKLD. Model A produces essentially perfect alignment on all attributes other than one which fails dramatically, causing a large spike in SaKLD histogram, and this attribute is the only contributing to the final score. Model B on the other hand performs poorly across all attributes but averaging over attributes yields the same SaKLD score as the model A. In this scenario SaKLD would potentially not agree with human judgment, since failing in a single attribute might not be visible when inspecting large image grids. Can this kind of scenario occur in practice, and if it can, what would be your recommendation for the user of the metric in that case?

It is difficult to decide whether model A (a drastic failure in a single attribute) or model B (a balanced failure across all attributes) is worse, since each user has different demands. Therefore, instead of relying only on the total SaKLD or PaKLD score, we recommend inspecting the per-attribute SaKLD (or PaKLD) values to get a thorough view of the model; one can easily look at the top-scoring attributes. Furthermore, a main strength of our metric is its flexibility: users can drop attributes they do not care about (even ones with a high SaKLD) or add attributes they are mainly interested in.

### Dominance of several attributes

> [W3] Fig. 4 shows that SaKLD and PaKLD are dominated by few attributes of attribute pairs. Is this usually the case in practice? Fig. 5 (b) also indicates that adding new attributes contribute to the metric with diminishing strength. This might be misleading for the user of the metric. Intuitively, adding a large set of attributes should correspond to more thorough evaluation of the model, however, this might not be the case if few attributes are dominating the final value of the metric.

We note that Experiment 4 deliberately sets up an extreme case (man-smile correlation = 1) to produce a few dramatically poor per-attribute SaKLD (PaKLD) values, so domination by a few attributes or attribute pairs is the intended outcome there. In other words, if such a dominating pattern is observed, the model can be considered to deviate as strongly as in the extreme setting we designed. A similar phenomenon is also observed in practice among current major models: as described in item 3 of the global rebuttal, each model fails to mimic some attribute distributions. In the same context, some models' total SaKLD/PaKLD changes noticeably when the number of attributes increases, as shown in Figure 5 (b); we understand this as the effect of inserting attributes that the model lacks.

### Exploring major models with our metrics

> [W4] The empirical effectiveness of the attribute based metric is not fully demonstrated. The authors advocated for an interpretable metric but unfortunately end up comparing modern generative models using single scalar numbers (as the existing metrics do), instead of taking advantage of the interpretability of the metric and showing a more fine-grained analysis of the models.

We added exploration results for major models in item 3 of the global rebuttal and in Appendix C, reporting each model's characteristics; for example, iDDPM has a particularly high SaKLD for "women" and "makeup". As interpreting models is one of our main contributions, we have moved the interpretable-analysis part from the appendix into Section 5.

### Is DCS symmetric about 0?

> [Q1] Fig. 2 shows CS and DCS values for distinct attributes, how would these scores look like for pairs of "opposite attributes", e.g., "long hair - short hair" or "makeup - no makeup"? Would DCS values be symmetric around zero or something else?

DCS values for opposite attributes generally point in opposite directions, but their absolute values are not equal. In the DCS embedding space, the angle between the embedding for "long hair" and the embedding for "short hair" need not be exactly $180^\circ$, because DCS depends on the text mean, which is computed not only from "long hair" and "short hair" but also from all the other attributes; hence the DCS values of opposite attributes cannot be perfectly symmetric. Nevertheless, the signs are always observed to be opposite, which makes interpretation easy. We attached CS and DCS values of opposite attributes for two images; please refer to Figure 3 in the global rebuttal PDF file.

### Understanding accuracy

We perform binary classification over all attributes in CelebA according to DCS and compare the predictions with the ground-truth attribute labels; positive means having the attribute, and higher accuracy means DCS agrees with the annotation. Thank you for the constructive comment; we will clarify this as follows.

We evaluated the accuracy on all 202,599 CelebA images: we computed CS and DCS for all 40 attributes and sorted the images in descending order per attribute. For example, the attribute "Arched_Eyebrows" has 54,090 positive samples in the CelebA annotation, so we compare the top 54,090 images sorted by CS (or DCS) with the annotated positives; if 19,946 of them match, the accuracy for "Arched_Eyebrows" is 19,946/54,090 = 0.369. In this setting, a difference of 0.01 corresponds to about 540 images, which is substantial. We follow the same implementation for Directional CLIPScore (DCS) and compare the mean accuracy obtained by averaging over all 40 attributes.

Below are the accuracies for the first 19 of the 40 attributes. Attributes such as "Double_Chin" have accuracy below 0.2 for both CS and DCS, while explicit attributes such as "Blond_Hair" reach almost 0.7 with DCS. The accuracies of all 40 attributes are attached in the Appendix.

| | Mean accuracy | 5_o_Clock_Shadow | Arched_Eyebrows | Attractive | Bags_Under_Eyes | Bald | Bangs | Big_Lips | Big_Nose | Black_Hair | Blond_Hair | Blurry | Brown_Hair | Bushy_Eyebrows | Chubby | Double_Chin | Eyeglasses | Goatee | Gray_Hair | Heavy_Makeup |
|-----|---------------|------------------|-----------------|------------|-----------------|-------|-------|----------|----------|------------|------------|--------|------------|----------------|--------|-------------|------------|--------|-----------|--------------|
| DCS | 0.442 | 0.188 | 0.419 | 0.549 | 0.192 | 0.644 | 0.634 | 0.247 | 0.432 | 0.566 | 0.696 | 0.199 | 0.448 | 0.313 | 0.327 | 0.191 | 0.731 | 0.525 | 0.382 | 0.563 |
| CS | 0.395 | 0.276 | 0.369 | 0.578 | 0.162 | 0.508 | 0.595 | 0.235 | 0.339 | 0.560 | 0.582 | 0.218 | 0.407 | 0.345 | 0.202 | 0.136 | 0.628 | 0.477 | 0.237 | 0.510 |
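A minimal sketch of this per-attribute accuracy computation is shown below (the function and variable names are ours); the mean accuracy in the table is this quantity averaged over all 40 attributes.

```python
import numpy as np

def attribute_accuracy(scores, labels):
    """Accuracy of one attribute from per-image scores and binary annotations.

    scores: (N,) CS or DCS values of all images for this attribute.
    labels: (N,) CelebA annotations, 1 if the image has the attribute.
    Take the top-k images by score, where k is the number of annotated
    positives, and report the fraction of them that are truly positive.
    """
    k = int(labels.sum())                 # e.g., 54,090 for Arched_Eyebrows
    top_k = np.argsort(scores)[::-1][:k]  # indices of the k highest scores
    return labels[top_k].sum() / k        # e.g., 19,946 / 54,090 = 0.369
```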
### More information about BLIP attribute extraction

> [Q3] Automatic extractions of text attributes is an interesting idea, how robust BLIP is with various datasets? What are the most frequently extracted attributes?

The lists below show the 20 most frequent attributes in the BLIP captions for each dataset, in descending order of frequency. BLIP provides a standardized attribute extraction across datasets, but it also has a limitation: its captions depend on the image-caption pairs it was trained on and do not describe images in great detail. For example, sentences like 'a cat sitting on a table in a room' are common in the LSUN Cat captions, rather than descriptions of the breed of the cat. Although we offer BLIP as one type of extractor, we therefore recommend that users mainly rely on USER-provided attributes.

FFHQ: ['a woman', 'a man', 'a person', 'glasses', 'a suit', 'a little girl', 'a young boy', 'a cell phone', 'a microphone', 'a necklace', 'a hat', 'tie', 'a young girl', 'blonde hair', 'long hair', 'a blue shirt', 'a beard', 'a white shirt', 'her head', 'a tie']

LSUN cat: ['a cat', 'a black cat', 'the floor', 'cats', 'a couch', 'a black and white cat', 'a white cat', 'a couple', 'a woman', 'a table', 'the ground', 'a dog', 'a man', 'black and white cat', 'a small kitten', 'a person', 'blue eyes', 'a blanket', 'a chair', 'a kitten']

MetFaces: ['a painting', 'a man', 'a woman', 'a portrait', 'a black and white photo', 'curly hair', 'a suit', 'a statue', 'a beard', 'a person', 'a drawing', 'tie', 'a hat', 'a black hat', "a woman's head", 'white hair', 'a young girl', 'her head', 'a white dress', 'long hair']

LSUN church: ['a clock tower', 'a clock', 'a steeple', 'people', 'an old church', 'a building', 'the background', 'a cross', 'a city', 'a white church', 'a large cathedral', 'a large church', 'a tall tower', 'the middle', 'two towers', 'a very tall building', 'a view', 'a statue', 'trees', 'a street']

#### Minor notes

> [Minor notes]
> Sec. 3.3 a capital letter is missing
> Line 154: Needs to be "the center of the images"
> Line 220: n and P_2 are not defined
> Tab. 3 would benefit from also having a column for FID to see how SaKLD and PaKLD differ from it

Thank you for helping us improve our paper. Regarding $nP_2$ in L222, it denotes the number of permutations of two out of $N$, i.e., ${}_{N}P_{2} = N(N-1)$; we now define it explicitly. We have reflected all the minor notes in the paper and ran another round of proofreading.