## Reviewer U69q -- Reject
**Q1: Comparison with other models**
A1: Thank you for your suggestions. The works you mention are valuable, and our response is as follows:
* The dataset we use is "widely used" in earlier fixed-format advertising comprehension tasks, but we propose a **novel** generic ad comprehension task that enables a comprehensive and complete understanding of ad content and is no longer limited to a fixed format. Since advertisement understanding in this form is a novel task, we could not find a published baseline for direct comparison. However, we have compared AdGPT with the latest and widely influential work (as of our submission), HuggingGPT, which should demonstrate the validity of our approach.
* As for the models you mention, we would like to clarify that we have already compared our model with state-of-the-art large vision-language models (MiniGPT-4, LLaVA, which you encourage us to compare with, and mPLUG-owl) in our supplementary materials; we encourage you to review those carefully. In addition, it is important to note that our work differs from BLIP and BLIP2: they focus on image captioning, whereas our model is designed to understand images, which requires a detailed descriptive paragraph. Regarding InstructBLIP, it was submitted to *arXiv* after our submission deadline, so we were unable to compare our model with it, and GPT-4 was still inaccessible as of our submission.
* Thank you for your suggestion about transferring our method to other LLMs. We transferred our method to Flan-T5 and found it to be equally effective: Flan-T5 with our method achieves better performance in 58.9% of cases and the same performance in 39.2% of cases.
* In fact, there is no ground truth in the new ad comprehension task, so the models cannot be directly fine-tuned; this is also why we do not report results with and without fine-tuning.
**Q2: Why did the authors need to introduce user study for understanding tasks, when the Pitts Ads dataset already contains many evaluation tasks and corresponding metrics?**
A2: Our reasons for not using the previous evaluation tasks and metrics are as follows:
1. The previous evaluation metrics were based on understanding advertisements through fixed-format sentences. With the development of LLMs, the linguistic expressiveness of current models is greatly enhanced compared to previous works, so new evaluation tasks and metrics are strongly needed. We therefore propose a more general and meaningful advertisement understanding task that provides a more comprehensive summary of advertisements. For instance, our model already generates natural, free-form ad summaries; abstracting them into unnatural fixed-format templates solely to apply the previous metrics would contradict the motivation of the ad understanding task.
2. For the novel task of ad understanding, in the absence of ground truth, we first considered manual evaluation of the model-generated ad understanding, i.e., a user study. We further proposed a more convenient evaluation metric, the Generative Similarity Score (GSS), and conducted extensive experiments to verify the consistency between GSS and the user study, which demonstrates the reliability of GSS and provides an alternative way to evaluate a model's ability to understand advertising (a simplified illustration of such a consistency check is sketched below).
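For illustration only, the snippet below sketches what a consistency check between an automatic similarity score and a user study could look like. It substitutes a TF-IDF cosine similarity for the GSS defined in the paper and uses made-up texts and preference rates, so every value and helper name here is a placeholder rather than our actual metric or data.

```python
# Illustration only -- NOT the paper's GSS implementation. A TF-IDF cosine
# similarity stands in for the generative similarity computation, and its
# agreement with (made-up) user-study preference rates is checked via rank
# correlation.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from scipy.stats import spearmanr


def stand_in_similarity(generated: str, reference: str) -> float:
    """Cosine similarity between TF-IDF vectors of two ad-understanding texts."""
    vectors = TfidfVectorizer().fit_transform([generated, reference])
    return float(cosine_similarity(vectors[0], vectors[1])[0, 0])


# Hypothetical per-ad data: model-generated understandings, reference texts,
# and the fraction of annotators who preferred the model output in a user study.
generated = [
    "This ad promotes an electric car by stressing zero emissions and low cost.",
    "The ad urges viewers to stop texting while driving to prevent accidents.",
    "A perfume ad that associates the fragrance with luxury and confidence.",
]
references = [
    "An advertisement for an electric vehicle highlighting its eco-friendliness.",
    "A public service ad calling on drivers not to use their phones on the road.",
    "An ad selling a perfume by linking it to an elegant, high-end lifestyle.",
]
user_preference = [0.72, 0.55, 0.61]  # placeholder user-study results

scores = [stand_in_similarity(g, r) for g, r in zip(generated, references)]
rho, p_value = spearmanr(scores, user_preference)
print(f"stand-in GSS per ad: {[round(s, 3) for s in scores]}")
print(f"Spearman correlation with user study: {rho:.3f} (p = {p_value:.3f})")
```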
## Reviewer 6hCq -- Borderline reject
Q1: The motivation behind understanding advertisements
A1: It is true that unclear messaging in an advertisement could be attributed to a design failure rather than a failure of the reader's comprehension. However, the motivation for ad understanding is, on the one hand, to understand well-designed ads and perform downstream tasks based on these understandings, and on the other hand, to analyze ads with design failures so that feedback can be given to the ad agency in a timely manner. Our work aims to make the process of understanding advertising more efficient and concise by automating it.
Q2: How to ensure the reliability and accuracy of the visual model
A2: The accuracy of the visual model is indeed a bottleneck. We cannot guarantee that the output of the vision expert model is always correct. However, our contribution lies in improving the effectiveness of understanding image content when using the output of the vision expert model. Our approach can be seen as a plug-and-play module that can be integrated with more powerful visual expert models. We believe that our work can be genuinely helpful for companies or individuals who cannot fine-tune their own large vision-language models.
Q3: What is the effect of shared memory?
A3: We appreciate your valuable advice. To explore the effect of shared memory, we randomly sampled 300 examples and conducted a user study, as shown in the table below. The shared-memory variant is preferred in 27.8% of cases versus 24.8% for the unshared one, a 3% improvement, and the two achieve the same performance in 47.4% of cases. The results of this experiment will be included in the official version.
| Model |User Study | Generative Similarity Score |
| -------- | -------- | -------- |
| AdGPT w/ shared memory | 27.8% | 0.5697 |
| AdGPT w/o shared memory | 24.8% | 0.5844 |
## Reviewer TvRf -- Borderline accept
Q1: Overclaiming of the multi-modality ability of our work
A1: Our work extends the multimodal capabilities of ChatGPT through plugins. Inspired by HuggingGPT [1], which combines images and text and focuses mainly on planning the use of visual expert models, we propose AdGPT, which focuses more on how to better reason over the results once the visual information has been acquired. On the other hand, the spirit of ACM MM encourages researchers to build meaningful applications with cutting-edge technology, and our work follows this spirit.
Q2: User study should be conducted on a broader population
A2: Thank you for your valuable advice. We conducted an additional user study with five undergraduate students who have no background in Computer Vision. This additional experiment shows that our model beats mPLUG-owl in 56.01% of cases and achieves the same performance in 24.48% of cases. We will include this new experimental result in the latest version of our paper.
| Model |User Study | Generative Similarity Score |
| -------- | -------- | -------- |
| mPLUG-owl | 24.48% | 0.5699 |
| Our model | 56.01% | 0.5844 |
[1] Shen, Yongliang, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. "HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face." arXiv preprint arXiv:2303.17580 (2023).
## Reviewer dKD2 -- Borderline accept
Q1: Additional explanation of the unclear details mentioned in the Further Comments
A1: Additional explanations of the unclear details are as follows:
1. The fourth step of AdGPT requires both observation information and an adaptive chain of thought. The Classification step is utilized to generate adaptive inference chains.
2. AdGPT can be viewed as a system in which obtaining observations (step 1) is an integral component.
3. Classification influences the adaptive chain of thought, so ChatGPT produces different chains of thought for different categories, each focusing on distinct priorities. For example, product advertisements may emphasize the selling points of a product, while social advertisements may aim to encourage people to engage in a specific behavior.
4. For HuggingGPT, we use images as input instead of observation information, so incorrect visual information can be regarded as a limitation of HuggingGPT's own capabilities. In Figure 1, we feed the observation information to ChatGPT without our prompts, rather than to HuggingGPT, to better illustrate our method.
5. The final result carries the features corresponding to the classification. For example, for a product ad, AdGPT tends to go over the advantages and selling points of the product, while for a public service ad, AdGPT tends to highlight the actions the ad calls on people to take (see the sketch after this list for a simplified illustration of this category-adaptive flow).
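To make points 3 and 5 concrete, the snippet below is a minimal, hypothetical sketch of how a predicted ad category could select an adaptive chain-of-thought template that is then combined with the visual observations before being handed to the LLM. The templates, the `build_prompt` helper, and the example observations are illustrative placeholders, not AdGPT's actual prompts or interface.

```python
# Minimal, hypothetical sketch of category-adaptive prompting; the real AdGPT
# prompts and interfaces differ. The Classification step supplies `category`,
# and the vision expert models supply `observations`.
from typing import List

# Placeholder category-specific reasoning chains (adaptive chain of thought).
COT_TEMPLATES = {
    "product": (
        "First identify the product, then list its advantages and selling points, "
        "and finally summarize the persuasive message."
    ),
    "public_service": (
        "First identify the social issue, then the action the ad calls on people "
        "to take, and finally who the ad is addressing."
    ),
}


def build_prompt(observations: List[str], category: str) -> str:
    """Combine visual observations with the category-adapted chain of thought."""
    chain = COT_TEMPLATES.get(
        category, "Describe what the ad shows and what it is trying to achieve."
    )
    obs_block = "\n".join(f"- {o}" for o in observations)
    return (
        "You are analyzing an advertisement.\n"
        f"Visual observations from expert models:\n{obs_block}\n"
        f"Reasoning steps: {chain}\n"
        "Write a detailed paragraph explaining the advertisement."
    )


# Example usage with made-up observations; in the full system the resulting
# prompt would be sent to a ChatGPT-style model, e.g. response = chat(prompt).
observations = [
    "a car driving along a mountain road",
    "text in the image: 'Zero emissions, zero compromises'",
]
print(build_prompt(observations, "product"))
```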
We really appreciate your valuable advice. In the new version of the paper, we will make the article more specific and clear according to your comments.
Q2: More qualitative and quantitative results to verify the effectiveness of AdGPT
A2: Thanks for your valuable advice. We conducted additional experiments, both qualitative and quantitative. In keeping with the rebuttal policy, we submitted an anonymous GitHub page with the qualitative results to the Area Chair, who will decide whether to present them in the open discussion. We also ran an additional quantitative experiment on 300 images: the generative similarity score of mPLUG-owl is 0.5699, compared to 0.5844 for our AdGPT, and our model does better than mPLUG-owl in 56.01% of cases and achieves the same performance in 24.48% of cases.
| Model |User Study | Generative Similarity Score |
| -------- | -------- | -------- |
| mPLUG-owl | 24.48% | 0.5699 |
| Our model | 56.01% | 0.5844 |
Q3: Relationship between the 3000 images and meaningful advertisements
A3: The Pitt Ads dataset contains many meaningful advertisements. To our knowledge, MetaCLUE [1], a set of vision tasks on visual metaphor, collects its images from the Pitt Ads dataset, so most advertisements in the Pitt Ads dataset can be considered "meaningful advertisements".
[1] Akula, Arjun R., Brendan Driscoll, Pradyumna Narayana, Soravit Changpinyo, Zhiwei Jia, Suyash Damle, Garima Pruthi, et al. "MetaCLUE: Towards Comprehensive Visual Metaphors Research." In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 23201-23211. 2023.
## To AC
Dear Area Chairs,
We are writing this letter to bring to your attention the inappropriate rejection reasons from *Reviewer U69q*.
First, the reviewer listed a number of models that we were asked to compare against. We have carefully examined our paper and clarify as follows:
1. Regarding the comparison with "competitive methods which essentially do the same thing", we have already compared against MiniGPT-4, mPLUG-owl, and LLaVA (mentioned by *Reviewer U69q*) in our supplementary submission.
2. BLIP and BLIP2 do not do the same thing as our work; they cannot accomplish the ad understanding task we propose. Additionally, InstructBLIP and LLaMA are not officially published works but only arXiv papers, so according to ACM MM's policy a comparison with them is not required.
Second, his/her comment that "Lack of evaluation is the biggest weakness in the current work" reflects a misunderstanding. We have valid reasons for not using the previous evaluation tasks and metrics:
1. Due to the significant advancement of LLMs, they are now more linguistically expressive than before. Therefore, new evaluation tasks and metrics are strongly needed, and we propose a more comprehensive and meaningful advertising comprehension task that captures the essence of advertising.
2. We not only adopt a user study but also propose a more convenient evaluation metric, the **Generative Similarity Score (GSS)**. We conducted extensive experiments to verify the consistency between GSS and the user study as well as the reliability of GSS; other reviewers agreed with this metric (*Reviewers TvRf and dKD2*) and called it **creative** (*Reviewer TvRf*).
Therefore, we believe that our work has appropriate evaluation. *Reviewer 6hCq* also mentioned that we have done **thorough metric evaluations**.
It seems that *Reviewer U69q* did not spend much time reviewing this work, and rejecting it with inaccurate comments is not acceptable; he/she does not appear to have understood our work. We sincerely request that you take a look at our rebuttal and the main paper, consider our appeal, and render a more convincing decision.
In addition, for the additional examples requested by *Reviewer dKD2*, we have put our extra qualitative results on an anonymous GitHub page, AdGPT1.github.io; you can decide whether to share these results with the reviewers during the open discussion. We double-checked the link to make sure it is anonymous, and we sincerely hope the reviewers will see these results.
Thanks for your time!
Authors
## To all
We thank the reviewers for their constructive comments.
It is inspiring to see that the reviewers acknowledge the **motivation and significance** of the paper's approach in leveraging text generation for advertisement understanding (TvRf), and found it **interesting** (dKD2, 6hCq) and **tactful** (TvRf). They also appreciate the **effectiveness of AdGPT** and its improvement over HuggingGPT (TvRf, dKD2). In addition, they endorsed our proposed **new evaluation metric**, the Generative Similarity Score (TvRf, dKD2), and found it to be **creative** (TvRf). The paper's **well-written content** (U69q), **thorough metric evaluations** (6hCq), and **inclusion of multiple case studies** (6hCq) are recognized as strengths.
We believe the remaining issues can be fully addressed, and we respond to each reviewer in detail under their respective comments.