# General Comment
We extend our sincere gratitude to all reviewers for their thorough evaluations and constructive feedback on our work.
## General Feedback and Appreciation:
- We are heartened to note that all reviewers acknowledged the relevance of our subject matter and the value of the proposed benchmark:
- The FunQA dataset is **high-quality** and the **first** to delve into the popular, significant, and intricate genre of **counter-intuitive/surprising** videos (4.3k videos, 312k QA pairs in total).
- The FunQA dataset introduces innovative tasks that not only probe human **commonsense understanding** but also advance video comprehension research within the community.
- The FunQA benchmark fills gaps in the existing VideoQA field and comprehensively assesses current SOTA **caption-based** and **instruction-based** VLMs.
- Furthermore, we are pleased that the conclusions drawn from the FunQA paper have received recognition and have provided a novel and meaningful direction for the development of recent advanced models:
- **None of the current VLMs can solve the timestamp localization task.**
Since they have no sense of time, our goal is to highlight the need for VLMs to address queries about specific video time ranges. Knowing which segment is intriguing is vital, and we believe that improving this capability in VLMs is of paramount importance.
- **Current VLMs struggle to understand long, complex videos and to reason about counter-intuitiveness:**
**Caption-based** models excel at providing detailed descriptions but **struggle with tasks that require reasoning**, resulting in a notable performance gap between description tasks (e.g., H2) and reasoning tasks (e.g., H3). On the other hand, **instruction-based** models demonstrate stronger reasoning capabilities but tend to **underperform on description tasks.** In other words, there is **no clear winner across all tasks.**
- **Insufficient evaluation metrics for free-text tasks.**
Traditional metrics yield **near-zero** scores for free-text questions, while GPT-4 shows preliminary capability in assessing free-text answers for deep understanding. However, GPT-4's responses suffer from **instability**, which should be improved in the future.
- In the rebuttal, we discussed the special challenges and future directions that FunQA provides:
- We believe this dataset's demanding nature will catalyze advancements in video comprehension models. We invite the community to take this as an opportunity to redefine what's achievable, including but not limited to the following:
- **Accurate understanding of the videos**
- **Logic Reasoning**
- **Extra Knowledge - Sense of Humor**
- FunQA provides a platform to explore **Potential Solutions:**
- **Model Size**
- **Data Quality**
- **Training Strategy**
- **Model Collaboration.**
---
## Addressing Specific Reviewer Feedback:
- Addressing the feedback from reviewer "eLTe":
- **Related Work Missing:**
- Updated Table 1 to include missing classical benchmarks.
- **Table 2 Confusing:**
- We explained that none of the current VLMs can solve the timestamp localization task.
- **Translation Limitation in Semantic Variability:**
- We adopted the following strategy: **Special Design on Translation Prompt** & **Manual Screening.**
- In response to reviewer "jbLY":
- **Original FunQA Information Missing:**
- We updated the annotation statistics before the GPT-3.5 expansion in #Line 54-55 and Table A2 in the supplementary material.
- **Whether GPT-3.5 is Faithful:**
- Previous datasets like MIMIC-IT and LLaVA-Instruct also used similar GPT-3.5 generation techniques.
- We implemented a manual consensus evaluation to demonstrate GPT-3.5 is faithful.
- **Fine-tuning Experiments are Important:**
- In Table 2, the last row provides insights into our fine-tuning experiments.
- We have continuously integrated results from the latest models, which shows that fine-tuning has indeed led to substantial improvements across different metrics.
- **GPT-3.5 Translation is Limited:**
- We adopted the following strategy: **Special Design on Translation Prompt** & **Manual Screening.**
- **Whether Temporal Dynamics is Emphasized in FunQA:**
- We provided a few examples and elaborated on them.
- We conducted further manual experiments following the reviewer's response, which confirmed this emphasis.
- Pursuant to suggestions from reviewer "jKvF":
- **FunQA Task Overly Complex:**
- We invite the community to take solving FunQA tasks as an opportunity to redefine: **Accurate understanding of the videos, Logic Reasoning, and Extra Knowledge - Sense of Humor.**
- FunQA provides a platform to explore Potential Solutions: **Model Size, Data Quality, Training Strategy, and Model Collaboration.**
- **Human Performance:**
- We conducted an experiment showing that human responses under the FunQA benchmark significantly outstrip model performance.
- **Annotation Subjectiveness and Language Diversity:**
- We employed GPT-3.5 to generate multiple answers, followed by a secondary annotation to heighten objectivity and linguistic diversity.
- **Evaluation is Limited by Language Generation:**
- We devised a GPT-4-based evaluation method.
- We carefully crafted prompts designed to highlight the model's authentic humor comprehension, going beyond simple text generation.
- **Table 2 Confusing:**
- We explained that none of the current VLMs can solve the timestamp localization task.
- **Audio Influence in FunQA:**
- FunQA is visual-centric.
- We conducted ablation experiments, which demonstrate that the inclusion of audio doesn't notably diminish the dataset's challenges.
- In line with the feedback provided by reviewer "sSpc":
- **Requirement for Commonsense Reasoning:**
- It’s essential to realize that these discrepancies are not mere mislabelings but highlight the core challenges in model reasoning.
- We tested HumorQA task on the videoChat model. These examples, juxtaposed with human interpretations, reaffirm the need for models to possess a deeper commonsense understanding.
- **The Uniqueness of FunQA:**
- Deep Temporal-Spatial Reasoning
- Rich Annotations
- Exploring the Sense of Humor
- **Humor Offended Content:**
- Humor in FunQA arises from the incongruity between reality and expectations, flourishing with the skillful juxtaposition and transformation of events.
- Each VQA pair in FunQA undergoes a two-step annotation and consensus evaluation process to eliminate subjectivity.
In conclusion, we believe our revisions and the additional experiments bring greater clarity to our work and underscore its significance. We hope the reviewers find these enhancements satisfactory and reflective of our commitment to advancing this field.
# Updates on the manuscript
We thank all reviewers for giving us valuable and constructive feedback. Below we summarize the added/updated contents; they are all marked **blue** in the revised manuscript.
- In response to reviewer **eLTe**, we have updated **Table 1 and #Line 257-263 in the paper**.
- In response to reviewer **jbLY**, we have updated **Table 2 in the paper, Table A2 and #Line 54-55 in the supplementary material**.
- In response to reviewer **jKvF**, we have updated **#Line 257-263 in the paper and #Line 149-160 and Table A3 in the supplementary material**.
- In response to reviewer **sSpc**, we have updated **Table 1 and #Line 257-263**.
# eLTe 8 / 4
Official Rebuttal
We sincerely appreciate your constructive comments to improve our paper. We detail our response below point by point. Please kindly let us know if our response addresses the questions you had for this paper.
**1. In Tab. 1, some classic VideoQA benchmarks are missed, e.g., MSVD-QA[1], MSRVTT-QA[1], ActivityNet-QA[2], and SUTD-TrafficQA[3].**
We appreciate your valuable feedback. We have now updated **Table 1** to comprehensively cover key benchmarks in the VideoQA domain, incorporating benchmarks such as MSVD-QA[1], MSRVTT-QA[1], ActivityNet-QA[2], and SUTD-TrafficQA[3].
| Dataset | Domain | Type | Videos | # Clips (K) | Input |# QA pairs (K) | VC | Desc. | Expl. | STR | QA Task |
|--------------------|---------------|------|--------|-------|------------|-----------|-------|-----|------|--------|------|
| MSVD-QA | Social Media | A | 10 | 2.0 | V, A | 50.5 | Yes | No | No | No | OE |
| MSRVTT-QA | Social Media | A | 15 | 10.0 | V, A | 243.6 | Yes | No | No | Yes | OE |
| YouTube2Text | Social Media | A | 10 | 10.0 | V, A | 122.7 | Yes | No | No | Yes | OE |
| ActivityNet-QA | Social Media | M | 180 | 5.8 | V, A | 58 | Yes | No | No | Yes | OE |
| SUTD-TrafficQA | Traffic | M | 3.5 | 10.1 | V, A | 62.5 | Yes | No | No | Yes | OC |
**2. Table 2 is confusing. The model performance on Timestamp localization (i.e., H1, C1, and M1) did not provide in the paper and without explanation. The authors are expected to give more discussions.**
Thank you for raising this point. We understand that our initial explanation might not have been sufficiently detailed. We have updated the result "Timestamp localization task is the most challenging" in **#Line 257-263**. To clarify: modern VLMs typically process a series of frames as video input, and given their current model structure, they cannot handle timestamp localization effectively, hence the lack of scores. In other words, none of the current VLMs can solve the H1, C1, and M1 tasks since they have no sense of time. To be more specific:
- Caption-based models focus mainly on captioning and often omit temporal information. Due to this, they're not equipped to deal with our task that emphasizes temporal aspects, so they aren't scored.
- Instruction-based models, such as Otter, take visual information from specific frames without temporal context. Their outputs are thus confined to individual frames, making them ineffective at addressing temporal localization.
Our goal with the Timestamp localization task is to highlight the need for VLMs to address queries about specific video time ranges. Knowing which segment is intriguing is vital, and we believe that improving this capability in VLMs is of paramount importance.
**3. The strategy of translating the Chinese annotations into English by GPT-3.5 is limited since the same sentence may have different semantics in different scenarios.**
You are absolutely right. Direct word-for-word translations can often lose the essence of the original meaning, and we were also aware of this problem. To ensure accurate translations that account for this variability, we adopted the following strategy:
- **Special Design on Translation Prompt:** In the preliminary stage, we favored GPT-3.5 over traditional translation platforms, such as Google Translate. Our rationale for this choice was GPT-3.5's ability to harness its inherent knowledge to comprehend sentences, thus preserving more context post-translation. At this juncture, video information was not incorporated. Generating five synonymous sentences aimed to counterbalance potential semantic shifts during translation, heightening the probability of capturing the intended meaning while maintaining the diversity intrinsic to the annotations.
- **Manual Screening:** The subsequent stage pivoted to manual scrutiny. We delegated this task to bilingual experts, well-versed in both English and Chinese, and with prior experience in video annotations. Their modus operandi involved viewing the video, understanding the Chinese annotations, followed by a meticulous review and refinement of the English translations produced by GPT-3.5. The evaluation metrics centered on:
- Ensuring fidelity to the original video and Chinese annotations.
- Ensuring no factual or logical discrepancies.
- Guaranteeing the absence of any derogatory or sensitive content.
Our holistic strategy is geared to ensure that translations retain the original semantic and emotional nuances, even with the introduction of diverse linguistic expressions.
For tangible examples elucidating how our approach bridges potential semantic and emotional disparities during translation, kindly refer to Appendix A.2 Figure 8.
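To make the first stage more concrete, below is a minimal sketch of how such a translation-plus-paraphrase prompt can be driven through GPT-3.5; the prompt wording, the helper name, and the use of the legacy `openai` Python interface are illustrative assumptions rather than our exact production code.

```python
# Minimal sketch of the translation stage (assumes the legacy openai-python
# (<1.0) ChatCompletion interface; the prompt text is illustrative only).
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

TRANSLATION_PROMPT = (
    "Translate the following Chinese video annotation into English. "
    "Preserve the meaning, tone, and humor rather than translating word by word. "
    "Return five different English renderings, one per line:\n\n{annotation}"
)

def translate_with_variants(chinese_annotation: str, n_variants: int = 5) -> list:
    """Ask GPT-3.5 for several synonymous English translations of one annotation."""
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{
            "role": "user",
            "content": TRANSLATION_PROMPT.format(annotation=chinese_annotation),
        }],
        temperature=0.7,  # allow some diversity across the five renderings
    )
    lines = response["choices"][0]["message"]["content"].splitlines()
    variants = [line.strip() for line in lines if line.strip()]
    return variants[:n_variants]  # these candidates then go to manual screening
```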
---
We hope that our responses address your concerns and we are happy to discuss any lingering issues.
# jbLY 5 / 3
We sincerely appreciate your constructive comments to improve our paper. We detail our response below point by point. Please kindly let us know if our response addresses the questions you had for this paper.
> **1. In #Line 201-206, the author said they used GPT3.5 to expand the dataset to 312K. Can the author list the dataset size before using GPT3.5 to expand.**
We have updated the annotation statistics before the GPT-3.5 expansion in **#Line 54-55 and Table A2 in the supplementary material**. The dataset size before the GPT-3.5 expansion is listed below:
| Datasets | Avg length (s) | # Clips (K) | # QA pairs (K) | # QA per clip |
|--------------|----------------|-------------|----------------|---------------|
| HumorQA | 7 | 1.8 | 7.2 | 4.0 |
| CreativeQA | 48 | 0.9 | 4.5 | 5.0 |
| MagicQA | 10 | 1.6 | 4.8 | 3.0 |
| FunQA | 19 | 4.3 | 16.5 | 3.8 |
**For the original annotation:**
Each video in HumorQA is annotated for H1, H2, H3, and H4 tasks; each video in CreativeQA is annotated for C1, C2, C3, C4, and C5 tasks; each video in MagicQA is annotated for M1, M2 and M3 tasks.
Task 1 (H1, C1 and M1) is the *Counter-intuitive Timestamp Localization Task*.
Task 2 (H2, C2, and M2) is the *Detailed Description Task*.
Task 3 (H3, C3, and M3) is the *Counter-intuitiveness Reasoning Task*.
Task 4 (H4 and C4) is the *Title Task*, and Task 5 (C5) is the *Creative Scoring Task*; these are the *Higher-Level Tasks*.
**For annotation expansion:**
The number of both the instructions and the answers was expanded fivefold (synonym generation by GPT-3.5).
For instance:
Original Annotation:
```
"A child and two dogs are lying on the beach. The child kicks one of the dogs with their foot, and the dog gets up and retaliates by digging sand back at the child."
```
The expansions using GPT-3.5 are:
```
Expansion_with_Eng_1: A young child and two canines are reclining on the sandy beach. The child delivers a kick to one of the dogs, prompting the dog to rise and retaliate by digging sand back at the child.
Expansion_with_Eng_2: On the sandy shore, there is a small child accompanied by two dogs. The child uses their foot to kick one of the dogs, and in response, the dog stands up and counterattacks by digging sand towards the child.
Expansion_with_Eng_3: Resting on the beach are a young child and a pair of dogs. The child decides to give a kick to one of the dogs, resulting in the dog getting up and retaliating by digging sand back at the child.
Expansion_with_Eng_4: A scene unfolds on the shoreline where a child and two dogs are sprawled out. The child employs a foot to kick one of the dogs, provoking the dog to rise and retaliate by excavating sand towards the child.
Expansion_with_Eng_5: At the beach, there lies a small child alongside two dogs. Using their foot, the child kicks one of the dogs, and in response, the dog stands up and counterattacks by scooping sand and launching it back at the child."
```
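As a rough illustration of how the fivefold expansion is organized, the sketch below loops over the annotations and pairs the expanded instructions with the expanded answers. The JSON layout and the `paraphrase` callable (e.g., backed by a GPT-3.5 prompt like the one sketched in our response to reviewer eLTe) are assumptions, not our exact pipeline code.

```python
# Sketch of the fivefold expansion of instructions and answers.
# Assumptions: annotations are a JSON list of {"instruction": ..., "answer": ...}
# records, and `paraphrase(text, n)` is any callable returning n synonymous
# rewrites of a string (e.g., a GPT-3.5 prompt).
import json
from typing import Callable, List


def expand_annotations(in_path: str, out_path: str,
                       paraphrase: Callable[[str, int], List[str]],
                       n: int = 5) -> None:
    with open(in_path, encoding="utf-8") as f:
        records = json.load(f)

    expanded = []
    for rec in records:
        instruction_variants = paraphrase(rec["instruction"], n)
        answer_variants = paraphrase(rec["answer"], n)
        # Pair the i-th instruction variant with the i-th answer variant,
        # so each original record yields n expanded QA pairs.
        for instr, ans in zip(instruction_variants, answer_variants):
            expanded.append({**rec, "instruction": instr, "answer": ans})

    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(expanded, f, ensure_ascii=False, indent=2)
```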
**2. Also, are there some evidence/quantitative measurement to support the GPT3.5 can produce the faithful answers to the original ideas?**
Using GPT-3.5 to translate and generate QA pairs is a relatively new method for data expansion. Prior to FunQA, other datasets, such as [MIMIC-IT](https://arxiv.org/abs/2306.05425), and [LLaVA-Instruct](https://arxiv.org/abs/2304.08485), have also employed similar techniques.
To validate the accuracy of our approach, we implemented a manual consensus evaluation, which is a quantitative method, to demonstrate that GPT-3.5 produces erroneous answers with only a minimal probability. As shown in Figure 2, part (i), only 0.9% of the answers were found to be incorrect.
**3. I am unsure if the poor results presented by those video-models are resulted by domain-shift. Can the author finetune 1 or 2 baselines on the subset of the proposed dataset and tested on the remain data to check how performance changes? The comparisons between un-finetuned and finetuned models should be insightful.**
Certainly, we appreciate your observation and inquiry. Below we discuss the fine-tuning experiments, solutions to domain shift, and other potentially more essential factors.
**Finetuning Experiments:** In Table 2, the last row provides insights into our fine-tuning experiments. Initially, the Otter model did not yield state-of-the-art results. "Otter (D.C)" denotes the outcomes prior to fine-tuning, while "Otter (FunQA)" pertains to post-fine-tuning results using our FunQA dataset. Our team has continuously integrated the latest versions of models available on the official website, and the most up-to-date comparative results can be found in the revised paper's **Table 2**. It's evident that fine-tuning with the latest Otter model has indeed led to substantial improvements across different metrics.
**Regarding the Domain-Shift:** We concur with your insightful point regarding domain shift as a potential contributor to suboptimal performance among some models, which is supported by the fine-tuning experiment above. However, we believe there are more critical problems to be solved beyond mitigating the domain gap, because:
- The fine-tuned performance is still not good enough.
- While the exact sources of training for vision models like CLIP and other language models remain a bit ambiguous to us, we anticipate that their extensive training sets might mitigate domain discrepancies in each modality to some extent.
For a more nuanced understanding, please refer to Appendix C.4 Figure 16 to check the model performance (failure case). While these models can pinpoint certain familiar elements within videos, their capacity to perceive the entire video narrative holistically and discern its comedic essence appears limited.
**Discussion on (Potential) More Essential Factors:**
- **Accurate understanding of the videos:** Through our analysis of failure cases, we've observed that many models struggle with accurately describing videos. While they might be adept at detecting objects within the videos, they often falter in comprehending the contextual relationship between sequential events. Such misinterpretations indicate that there's a need for further exploration in this domain. The videos we've used can indeed serve as an invaluable dataset for probing video descriptions in depth.
- **Logic Reasoning:** The primary nature of our videos encompasses content that is counterintuitive and contradicts common sense. For models to understand these, it's imperative they grasp the concept of "common sense." They must deduce what would typically transpire under normal circumstances and then use that perspective to humorously interpret the video. This necessitates the model to possess strong reasoning capabilities, especially when it comes to common sense reasoning. It remains a critical research point.
- **Extra Knowledge - Sense of Humor:** To decipher the humor in a video, it's plausible that understanding the fundamental principles of humor is crucial. This type of knowledge, along with many other tidbits of common sense and additional information, might enhance the model's performance. Determining how to integrate valuable knowledge and discerning what counts as "valuable" are topics that warrant further exploration.
Here we also provide some potential solutions that we suggest the community could work on to make valuable and broad contributions.
- **Model Size:** increasing the number of parameters is a natural method to enhance the model's performance. However, this approach comes with its own set of engineering challenges, requiring improvements in model optimization and deployment. We're also curious about the relationship between the number of parameters and the performance on FunQA. This is an intriguing research point in itself, and our dataset can serve as an excellent test bed to further this exploration.
- **Data Quality:** We believe the emphasis for this task should be on data collection. Current trends with large and dynamic models have shown that having vast amounts of low-quality data isn't as effective as a smaller quantity of high-quality data. Thus, we hope the community can discover the type of data that genuinely assists in understanding counterintuitive videos. This is a crucial research direction.
- **Training Strategy:** Studying training strategies is also essential. For instance, determining which type of data to start learning from, and understanding the significance of curriculum learning, among others.
- **Model Collaboration:** Ultimately, we might not need to solely focus on a single model to solve this problem. Perhaps multiple models collaboratively working on examples in an elegant manner could be a method to enhance performance. However, this approach might necessitate paying more attention to the overall efficiency of model implementation.
Simultaneously, we have organized a competition based on the FunQA dataset with a prize of 1 million RMB — the "[See and Speak](https://iacc.pazhoulab-huangpu.com/contestdetail?id=64af50154a0ed647faca623a&award=1,000,000)" contest. We believe that with the driving force of the community, there will be even more impressive results achieved.
**4. The author discussed some limitations in Section 5. Annotate with Chinese and translate them into English is suboptimal.**
Indeed, challenges in translation arise due to cultural differences between Chinese and English texts. Hence, during the translation process, we make every effort to avoid literal word-for-word translation. Addressing your concern about different sentences evoking different emotions, our solution is as follows:
- **Special Design on Translation Prompt:** In the preliminary stage, we favored GPT-3.5 over traditional translation platforms, such as Google Translate. Our rationale for this choice was GPT-3.5's ability to harness its inherent knowledge to comprehend sentences, thus preserving more context post-translation. At this juncture, video information was not incorporated. Generating five synonymous sentences aimed to counterbalance potential semantic shifts during translation, heightening the probability of capturing the intended meaning while maintaining the diversity intrinsic to the annotations.
- **Manual Screening:** The subsequent stage pivoted to manual scrutiny. We delegated this task to bilingual experts, well-versed in both English and Chinese, and with prior experience in video annotations. Their modus operandi involved viewing the video, understanding the Chinese annotations, followed by a meticulous review and refinement of the English translations produced by GPT-3.5. The evaluation metrics centered on:
- Ensuring fidelity to the original video and Chinese annotations.
- Ensuring no factual or logical discrepancies.
- Guaranteeing the absence of any derogatory or sensitive content.
Our holistic strategy is geared to ensure that translations retain the original semantic and emotional nuances, even with the introduction of diverse linguistic expressions.
For tangible examples elucidating how our approach bridges potential semantic and emotional disparities during translation, kindly refer to Appendix A.2 Figure 8.
**5. Also, I am unsure if deep learning models can hack the proposed dataset and tasks, i.e., *if the model can solve the problem rely on strong image-based text generation ability, while do not use temporal reasoning.* The proposed QA dataset, Humor, Creative, and Magic, seem need high-level semantics/reasoning, but there lack of evidence to prove this point. Especially, Table 2 shows the image-based methods achieved highest performance for some metrics.**
We sincerely appreciate your observation and the crucial point you raised. When designing the FunQA dataset, one of our guiding principles was to ensure that all tasks necessitated reasoning to arrive at a satisfactory answer. This design philosophy means that even a model endowed with robust image-based capabilities would find it challenging to aptly answer FunQA questions without truly understanding the temporal dynamics.
To better illustrate the emphasis on temporal dynamics in FunQA, here we provide a few examples:
**HumorQA Example:** [link](https://drive.google.com/file/d/1zi72M2FtcHVwJ8PwYcSfMcJf_Wrfpsq5/view?usp=drive_link)
- H2: An individual slipped on the staircase filled with ice and tumbled down to the very bottom, followed by a second person who also fell after witnessing the first person's fall.
- H3: The first person falling down the stairs step by step was already very funny, and the second person repeating the same mistake and falling down made it even more hilarious.
As the answer to H2 shows, the main element of humor in this video is two people slipping down an icy staircase one after the other; if the model cannot make sense of the temporal information, it cannot capture the sequential logic of the two falls. The answer to H3 likewise makes clear that the order of the falls is one of the sources of humor: the second person fails to learn from the first and slips in a similar way, and this repetition of the same mistake is what makes the scene funny.
**CreativeQA Example:** [link](https://drive.google.com/file/d/18GRpCZbo8JuJf2YfCWi4TvoDnLm3aRfg/view?usp=drive_link)
- C2: At the center of the stage is a blue rectangular box. Following the little girl's watering of the box, four individuals lying inside gradually lift their hands and legs, each at different paces and heights, until they stop and reveal green painted leaf-shaped objects on their limbs.
- C3: The main creative element of this video is when the four individuals in a box, with varying heights and movement speeds, gradually raise their hands and feet to mimic the growing process of the carrot seeds planted by the little girl. The green leaf-shaped objects, tied and opened in advance with their hands and feet, are used to simulate the true sprouting of carrot sprouts, resulting in a lively and imaginative scene.
The performance in this video mimics growing bean sprouts, i.e., the process of beans growing into sprouts as the soil is continually watered. Describing the idea behind this video and explaining where it comes from requires an understanding of temporal information. Combining it with the visual information, the model has to understand the sequence of watering, sprouts emerging from the soil, and sprouts growing taller and greener in order to answer the question accurately. This difference in understanding shows up in the C3 task: without analyzing the temporal information, the best answer a model could give is that sprouts grow from the soil, and cause-and-effect relationships such as watering leading to growth would be ignored.
---
We hope that our responses address your concerns and we are happy to discuss any lingering issues.
## Reviewer Response
**1. Could you please provide how many data are involved for this quantitative evaluation: "To validate the accuracy of our approach, we implemented a manual consensus evaluation, which is a quantitative method, to demonstrate that GPT-3.5 produces erroneous answers with only a minimal probability. As shown in Figure 2, part (i), only 0.9% of the answers were found to be incorrect."?**
We performed the manual consensus evaluation on the original FunQA dataset, randomly sampling 10% of the data from each of the three sub-datasets (HumorQA, CreativeQA, and MagicQA). The specific numbers are as follows:
| | HumorQA | CreativeQA | MagicQA | Total|
| --------|---------|------------|---------|-|
| **# Videos**| 209 | 118 | 213 | 540|
| **# Low Consensus** | 1 | 1 | 2 | 4 (Max: **0.9%** in MagicQA) |
| **# Medium Consensus**| 9 | 6| 17| 32|
| **# High Consensus**| 199| 111|194|504|
**2. It seems there are some emphasis of temporal dynamics in FunQA. However, do u have some way to measure how many videos in ur datasets contain such temporal dynamics requirements? For example, ask workers to go through 100 or more cases.**
We appreciate your kind suggestion, and we conducted an experiment to measure whether temporal dynamics are emphasized in FunQA. We randomly selected 40 videos from each of the three sub-datasets, 120 videos in total. For each video, we sampled 8 frames evenly. You can check these data (videos & frames) [here](https://drive.google.com/drive/folders/1zo_YIbaJ5i4B1gfF5PwViyrsqejXhtyG?usp=drive_link). We enlisted 10 individuals who had not seen any FunQA videos before. They first viewed the sequence of 8 consecutive frames and then watched the original video along with its annotations. They were asked to determine whether they could understand and answer the counter-intuitive content of the original video solely based on the frames. The experimental results are as follows:
**Could workers understand the video based only on the frames?**

| | # True (avg. over 10 workers) | # False (avg. over 10 workers) |
|---|---|---|
| HumorQA | 7.8 | 32.2 |
| CreativeQA | 6.1 | 33.9 |
| MagicQA | 2.6 | 37.4 |
| # Total | 16.5 | 103.5 |
On average, participants judged 103.5 of the 120 videos (86.25%) as impossible to understand and answer based on the frames alone. This experiment demonstrates that the FunQA dataset cannot be effectively addressed using image-based methods alone, highlighting the dataset's focus on temporal reasoning.
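For reference, the even frame sampling used in this small study amounts to a few lines of OpenCV; the sketch below is a minimal version under the assumption that each clip is a standard video file, not our exact extraction script.

```python
# Minimal sketch: sample 8 evenly spaced frames from a video clip with OpenCV.
import cv2


def sample_frames(video_path: str, num_frames: int = 8):
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    if total < num_frames:
        cap.release()
        return []  # clip too short for the requested sampling

    # Evenly spaced frame indices across the whole clip.
    indices = [int(i * (total - 1) / (num_frames - 1)) for i in range(num_frames)]
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    return frames
```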
**3. Do you have any response regarding the ethical concerns?**
We highly respect individual privacy protection, and the teaser image we provided in this submission is solely for demonstration purposes. When it comes to the actual dataset, we handle all identifiable information carefully.
An example of privacy protection in a real dataset: [HumorQA example](https://drive.google.com/file/d/1ZUHC-ulqVxGDwAaXzK4Uy0znD9TvhKej/view?usp=drive_link). In this video, there are unfamiliar faces, and in order to safeguard their personal privacy, we have applied mosaic effects to their facial features.
# jKvF 5 / 3
We sincerely appreciate your constructive comments to improve our paper. We detail our response below point by point. Please kindly let us know if our response addresses the questions you had for this paper.
**1. The task setting seems overly complex and I am concerned whether a single model can address all of these tasks. (It's unclear what direction the community should take to address the defined tasks. Is developing larger models the solution?)**
We appreciate these valid concerns, and we believe this dataset's demanding nature will catalyze advancements in video comprehension models. We invite the community to take this as an opportunity to redefine what's achievable, including but not limited to the following:
- **Accurate understanding of the videos:** Through our analysis of failure cases, we've observed that many models struggle with accurately describing videos. While they might be adept at detecting objects within the videos, they often falter in comprehending the contextual relationship between sequential events. Such misinterpretations indicate that there's a need for further exploration in this domain. The videos we've used can indeed serve as an invaluable dataset for probing video descriptions in depth. Therefore, new techniques for precise video perception are expected to enhance task performance.
- **Logic Reasoning:** The primary nature of our videos encompasses content that is counterintuitive and contradicts common sense. For models to understand these, it's imperative they grasp the concept of "common sense." They must deduce what would typically transpire under normal circumstances and then use that perspective to humorously interpret the video. This necessitates the model to possess strong reasoning capabilities, especially when it comes to common sense reasoning. It remains a critical research point on how to inject commonsense into the model.
- **Extra Knowledge - Sense of Humor:** To decipher the humor in a video, it's plausible that understanding the fundamental principles of humor is crucial. This type of knowledge, along with many other tidbits of common sense and additional information, might enhance the model's performance. Determining how to integrate valuable knowledge and discerning what counts as "valuable" are topics that warrant further exploration.
**Potential Solutions:**
- **Model Size:** increasing the number of parameters is a natural method to enhance the model's performance. However, this approach comes with its own set of engineering challenges, requiring improvements in model optimization and deployment. We're also curious about the relationship between the number of parameters and the performance on FunQA. This is an intriguing research point in itself, and our dataset can serve as an excellent test bed to further this exploration.
- **Data Quality:** We believe the emphasis for this task should be on data collection. Current trends with large and dynamic models have shown that having vast amounts of low-quality data isn't as effective as a smaller quantity of high-quality data. Thus, we hope the community can discover the type of data that genuinely assists in understanding counterintuitive videos. This is a crucial research direction.
- **Training Strategy:** Studying training strategies is also essential. For instance, determining which type of data to start learning from, and understanding the significance of curriculum learning, among others.
- **Model Collaboration:** Ultimately, we might not need to solely focus on a single model to solve this problem. Perhaps multiple models collaboratively working on examples in an elegant manner could be a method to enhance performance. However, this approach might necessitate paying more attention to the overall efficiency of model implementation.
Simultaneously, we have organized a competition based on the FunQA dataset with a prize of 1 million RMB — the "[See and Speak](https://iacc.pazhoulab-huangpu.com/contestdetail?id=64af50154a0ed647faca623a&award=1,000,000)" contest. We believe that with the driving force of the community, there will be even more impressive results achieved.
**2. It is necessary to analyze the performance of human on this benchmark, to understand the significance of the baseline results.**
We added a new set of experiments to compare machine and human performance in answering FunQA. In this experiment, questions (H2, H3, H4) for randomly selected HumorQA videos were answered by five different people.
**The specific experimental setup is as follows:**
We randomly selected five videos from HumorQA and had five individuals who had not watched any FunQA videos provide answers to the questions (H2, H3, H4) for each video. We then collected 25 human responses for each task. Subsequently, we calculated the average score for each individual under the FunQA benchmark and the overall average score of the human responses.
**The results are as follows:**
| Task | H2 | H3 | H4 |
|--------------|--------------|--------------|-------------|
| FunQA benchmark </br>SOTA score| 1.5 / 18.1 / 1.0 </br> 44.0 / 37.9 | 1.3 / 15.4 / 0.5 </br> 25.9 / 61.7 | 1.1 / 8.8 / 0.7 </br> 25.9 / 61.7 |
| Human (Avg) | 10.1 / 31.0 / 9.9 </br> 60.0 / <font color="red">83.6</font> | 5.8 / 24.6 / 5.0 </br> 53.1 / <font color="red">84.4</font> | 1.1 / 11.46 / 2.1 </br> 27.3 / <font color="red">79.1</font> |
| Human 1 | 6.0 / 28.8 / 6.5 </br> 46.6 / 82.4 | 2.2 / 20.8 / 1.1 </br> 45.0 / 85.2 | 0.8 / 9.2 / 1.1 </br> 20.5 / 76.6 |
| Human 2 | 13.8 / 42.5 / 11.1 </br> 55.4 / 90.5 | 13.2 / 34.1 / 10.0 </br> 54.4 / 87.5 | 1.1 / 12.8 / 2.4 </br> 33.5 / 80.6 |
| Human 3 | 0.7 / 16.5 / 0.5 </br> 36.2 / 77.5 | 0.7 / 14.7 / 0.2 </br> 44.1 / 69.0 | 0.2 / 3.1 / 0.3 </br> 15.4 / 76.8 |
| Human 4 | 26.4 / 50.5 / 27.7 </br> 75.8 / 80.3 | 10.3 / 36.8 / 11.5 </br> 74.3 / 90.0 | 2.8 / 21.7 / 6.0 </br> 43.4 / 73.4 |
| Human 5 | 3.5 / 16.7 / 3.7 </br> 48.0 / 86.6 | 2.4 / 16.4 / 1.8 </br> 47.9 / 90.3 | 0.8 / 10.5 / 0.8 </br> 23.9 / 88.2 |
It can be seen that the performance of human responses under the FunQA benchmark is much higher than that of the models, which means that the models still have a long way to go.
**3. It would be better to provide multiple reference answers for each question. This could alleviate the issues of annotation subjectiveness and language diversity. Moreover, the evaluation can be focused more on understanding of ‘fun’ instead of language generation.**
Thank you for your insightful feedback and the suggestion of incorporating multiple reference answers for each question to tackle potential subjectivity and foster language diversity. We concur that evaluation should emphasize understanding the essence of 'fun' over pure language generation.
**Multiple Annotations:** To address the concerns pertaining to annotations, we took comprehensive measures during the FunQA annotation process. We employed GPT-3.5 to generate multiple answers, followed by a secondary annotation to heighten objectivity and linguistic diversity. The specifics are as follows:
- **1. Generating Multiple Answers with GPT-3.5:**
Based on the original responses, GPT-3.5 was used to create five synonymous sentences, greatly enhancing language diversity. For instance:
Original Annotation:
```
"A child and two dogs are lying on the beach. The child kicks one of the dogs with their foot, and the dog gets up and retaliates by digging sand back at the child."
```
The expansions using GPT-3.5 are:
```
Expansion_with_Eng_1: A young child and two canines are reclining on the sandy beach. The child delivers a kick to one of the dogs, prompting the dog to rise and retaliate by digging sand back at the child.
Expansion_with_Eng_2: On the sandy shore, there is a small child accompanied by two dogs. The child uses their foot to kick one of the dogs, and in response, the dog stands up and counterattacks by digging sand towards the child.
Expansion_with_Eng_3: Resting on the beach are a young child and a pair of dogs. The child decides to give a kick to one of the dogs, resulting in the dog getting up and retaliating by digging sand back at the child.
Expansion_with_Eng_4: A scene unfolds on the shoreline where a child and two dogs are sprawled out. The child employs a foot to kick one of the dogs, provoking the dog to rise and retaliate by excavating sand towards the child.
Expansion_with_Eng_5: At the beach, there lies a small child alongside two dogs. Using their foot, the child kicks one of the dogs, and in response, the dog stands up and counterattacks by scooping sand and launching it back at the child."
```
- **2. Secondary Annotation for Enhanced Objectivity:**
Separate individuals were asked to independently watch the videos and offer their answers. These were compared with our initial annotations in a consensus evaluation. The findings revealed that only an average of 0.5% of individuals harbored biases about whether a video in the FunQA dataset was “funny” or not. This underscores the minimal subjectivity of our dataset.
**Content-Oriented Evaluation:** Regarding your concerns on evaluation, our four primary metrics (BLEU, ROUGE-L, CIDEr, and BLEURT) are not exclusively centered on the 'understanding of fun'. Consequently, we crafted an evaluation method leveraging GPT-4. Our meticulously designed prompts (refer to Appendix C) ensure the evaluation accentuates the model's genuine comprehension of humor, transcending mere text generation.
In detail, GPT-4’s assessment of two text passages considers five dimensions: linguistic fluency, logical consistency, common sense, understanding of details, and alignment in comprehending the humor. These dimensions carry weights of 5%, 10%, 10%, 35%, and 40% respectively. Through this methodology, we've striven to emphasize evaluation metrics that resonate with "the understanding of humor".
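To make the weighting concrete, the final rating reduces to a weighted sum over the five dimensions; the small sketch below uses the weights stated above, while the per-dimension sub-scores are made-up example values for illustration only.

```python
# Illustrative aggregation of the five GPT-4 rating dimensions into one score.
# The weights follow the description above; the example sub-scores are invented.
WEIGHTS = {
    "linguistic_fluency": 0.05,
    "logical_consistency": 0.10,
    "common_sense": 0.10,
    "understanding_of_details": 0.35,
    "humor_alignment": 0.40,
}


def weighted_gpt4_score(sub_scores: dict) -> float:
    """Combine per-dimension ratings (e.g., on a 0-100 scale) into one weighted score."""
    return sum(WEIGHTS[dim] * sub_scores[dim] for dim in WEIGHTS)


example = {
    "linguistic_fluency": 90,
    "logical_consistency": 70,
    "common_sense": 60,
    "understanding_of_details": 40,
    "humor_alignment": 30,
}
print(weighted_gpt4_score(example))  # 0.05*90 + 0.10*70 + 0.10*60 + 0.35*40 + 0.40*30 = 43.5
```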
**4. I cannot find the IoU results for task H1/C1/M1 in Tab.2, or do I miss something? Also, how the baseline methods achieve temporal localization? There is no related introduction.**
Our explanation for the missing IOU scores in Timestamp localization can be found in the first point of our results section. However, due to word limitations, our discussion of the results might not have been sufficiently comprehensive. We have included examples of each model's inability to address such problems in Appendix C.4 Figure 17. We have updated the result "Timestamp localization task is the most challenging" in **#Line 257-263** and we intend to provide a more detailed explanation as follows:
In general, current Video-Language Models (VLMs) typically process a segment of video by reading specific frames of image information. Thus, from a structural perspective, present-day VLMs lack the capacity to address the Timestamp localization problem. Consequently, we are unable to assign scores for this task.
For caption-based models, their primary focus is on captioning tasks, and they do not necessarily require temporal information when handling captioning-related problems. Consequently, their training data does not extensively involve temporal aspects such as time point localization, relationships between objects over time, and inferential reasoning about outcomes. As a result, when responding to our Timestamp localization task, which specifically emphasizes temporal information, these models might readily provide answers related to captioning information. Hence, we are unable to assign scores for this scenario.
Regarding instruction-based models, taking Otter as an example, their input mainly consists of image frames from a video and does not incorporate temporal information. Consequently, their responses tend to be confined to specific frames, preventing them from accurately localizing time periods. Therefore, we are similarly unable to assign scores for this type of model.
The reason we emphasize the design of the Timestamp localization task is that we consider it an important and currently lacking capability in VLMs. For a given video, answering questions about which time interval is interesting is a common and meaningful inquiry.
**5. The dataset emphasizes the understanding of ‘fun’ from the visual aspect of videos. I am thus curious if audio could ease the challenge of the benchmark, or some specific sub tasks.**
We appreciate your insightful comments on audio. The role of audio in video understanding, especially in the context of detecting elements of 'fun', is undeniably an intriguing aspect to explore. We have updated Section C.4 (Ablation Experiment on Audio) in **#Line 149-160 and Table A3 in the supplementary material**. In this response, we address the current utilization of audio in models, the intrinsic design philosophy of FunQA, and findings from our ablation studies to shed light on this matter.
**Audio in Current Models:** While audio is an intrinsic component of videos, its utilization in current models has been limited. Several high-performing models, like Otter, mPLUG, video_chatGPT, and GIT as mentioned in our paper, operate mainly on visual data. Their performance remains relatively unaffected by audio. Some models, like VideoChat in our paper, opt to transcribe the audio to text, converting it into captions for input.
**FunQA's Emphasis on Visuals:** The essence of the FunQA dataset is to challenge models to recognize 'fun' predominantly from the visual content of videos. During our data collection, we put a premium on the visual elements. However, it's worth noting that during our experiments, we preserved the audio in the videos when they were presented to models.
**Ablation Experiments:** To probe the impact of audio, we performed ablation studies by muting the videos during both the training and testing phases. The resulting performance metrics are tabulated below:
| Model | H1 | H2 | H3 | H4 | C1 | C2 | C3 | C4 | C5 | M1 | M2 | M3 |
| ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
| VideoChat | - | 0.5 / 13.7 / 0.0 </br> 44.0 / 37.9 | 0.5 / 13.5 / 0.0 </br> 45.4 / 31.9 | 0.8 / 5.1 / 0.5 </br> 20.2 / 61.7 | - | 0.3 / 7.5 / 0.0 </br> 21.7 / 10.9 | 0.3 / 7.7 / 0.0 </br> 22.8 / 27.7 | 0.2 / 1.2 / 0.2 </br> 7.3 / 51.1 | 67.5 | - | 0.6 / 15.5 / 0.0 </br> 47.4 / 14.2 | 0.3 / 9.2 / 0.0 </br> 43.1 / 24.6 |
| VideoChat (MUTED) | - | 0.5 / 13.5 / 0.0 </br> 40.0 / 34.5 | 0.5 / 12.5 / 0.0 </br> 43.7 / 31.9 | 0.8 / 5.1 / 0.5 </br> 22.4 / 57.8 | - | 0.3 / 7.4 / 0.0 </br> 24.0 / 10.0 | 0.4 / 7.7 / 0.0 </br> 20.2 / 30.5 | 0.2 / 1.2 / 0.2 </br> 6.8 / 45.3 | 67.5 | - | 0.6 / 15.0 / 0.0 </br> 48.0 / 12.2 | 0.3 / 9.2 / 0.0 </br> 44.0 / 23.4 |
From the table, it's evident that when comparing the performance on the original videos to that on the muted ones, there's negligible difference. Particularly, GPT-4's scores only experience minimal fluctuations, which can be attributed to evaluation instability issues, as discussed in the "Insufficient evaluation metrics for free-text tasks" section and illustrated in Appendix C.3, Figure 15.
For a more intuitive understanding, we offer a selection of samples from the FunQA dataset: ([HumorQA Example](https://drive.google.com/file/d/11pLuTe6g1hPFLzwlU7-sw1dFkuk8ffq8/view?usp=drive_link), [CreativeQA Example](https://drive.google.com/file/d/1L4k1gGgIkLpXsSd1kfaFr0Oi0iwTJQ8s/view?usp=drive_link), and [MagicQA Example](https://drive.google.com/file/d/10jQBxWHO9XZETGV-YCGpZUO6EyyB45qr/view?usp=drive_link)). In these examples, the perception of 'fun' is evidently consistent, independent of the audio's presence or absence.
---
We hope that our responses address your concerns and we are happy to discuss any lingering issues.
# sSpc 6 / 4
We sincerely appreciate your constructive comments to improve our paper. We detail our response below point by point. Please kindly let us know if our response addresses the questions you had for this paper.
**1. The authors motivate their problem by citing that their tasks require commonsense knowledge. While this makes some intuitive sense, it is not backed up by their experiments (e.g., showing that commonsense reasoning improves performance). One could argue that the dataset simply addresses a new set of labels that are different/possibly more challenging than existing datasets**
It's true that illustrating the requirement for commonsense reasoning within the dataset is a daunting task. Yet, the inherent challenges models face when interpreting the FunQA dataset elucidate the existing gaps in commonsense understanding. To offer clarity, we use this case: [HumorQA Example](https://drive.google.com/file/d/18GRpCZbo8JuJf2YfCWi4TvoDnLm3aRfg/view?usp=drive_link) and test it on the videoChat model.
**Discussing model performance on H2:**
```
H2: Description of the videos humorous moment.
```
```
videoChat:
In the video, a group of people dressed in red and blue clothing are on the street. The first frame shows a man holding a long stick as they walk on the street. In the second frame, there is a group of people on the street. The third frame shows a group of people walking down the street. The fourth frame shows a group of young boys in blue and red clothing. In the fifth frame, a man in red clothing is walking down the street. In the sixth frame, there is a man in blue pants walking on the street. The seventh frame shows a group of boys wearing blue and red clothing walking on the street. In the eighth frame, a man in red clothing is holding a red flag. There are also men dressed in blue, and one of them is carrying a long stick. The final frame shows a man in blue clothes walking down the street. Overall, the video depicts a comedic scene of people in red and blue clothing walking down the street. The presence of a long stick and a red flag adds to the humor. However, it is unclear why the group is wearing those specific colors and carrying the long stick.
```
The model's answer overly decomposes the video frame by frame, describes people wearing red and blue clothing rather than the single color stated in the reference answer, and mistakes the street lamp post for a long stick held in the man's hand. It is clear that the model's common sense is not sufficient to accurately identify the objects in the video.
Here we show the response of video-chatGPT:
```
video_chatGPT:
The comedic scene in the video shows a man standing on a pole and swinging a stick, while the people in the line are swinging their sticks as well. The scene is humorous because it is not a typical activity one would expect to see in a public setting, and the people in the line are not only swinging their sticks but also seemingly imitating the man on the pole. This creates a light-hearted and amusing atmosphere, as the people are engaged in a playful and unconventional activity.
```
The answer describes non-existent actions such as swinging sticks and fails to mention the fact that the man hit his head on the pole.
**Discussing model performance on H3:**
When tasked with explaining humor, FunQA’s demand on a model's commonsense ability becomes increasingly evident. Both model responses, in their unique ways, fail to align with human commonsense reasoning about what constitutes humor.
```
H3: Why is the whole video humorous.
```
```
videoChat:
The video as a whole is comedic because of the humorous scenes depicted in the different frames. The fact that a group of people dressed in red and blue clothing are walking down the street is funny in itself, but the presence of a man holding a long stick and a man in blue clothes carrying a long stick adds to the comedic effect. Additionally, the presence of a group of young boys in blue and red clothing walking down the street and a man in red clothing holding a red flag also adds to the humor. Overall, the different scenes depicted in the video are funny in themselves, and the combination of the different groups dressed in different colors and carrying various objects adds to the comedic effect.
```
In this example, the model considers a group of people dressed in red and blue walking down the road, and someone waving a stick, to be funny (although these descriptions are themselves inaccurate). In human common sense, these events are not funny; the model lacks the commonsense knowledge needed to produce a proper explanation.
```
video_chatGPT:
The video is comedic because it features a man standing on a pole and swinging a stick, while the people in the line are swinging their sticks as well. This unusual and unexpected activity creates a sense of humor and amusement, as it is not something one would typically see in a public setting. The fact that the people in the line are imitating the man on the pole adds to the comedic effect, as it shows a shared sense of playfulness and camaraderie among the individuals involved. The video captures a light-hearted moment that can be enjoyed by viewers, making it a fun and entertaining watch.
```
There is also a commonsense problem with this answer: a group of people waving poles is treated as a very unusual, unexpected, and humorous act, but this is not the case. Martial arts or other artistic performances are likely to feature such a scene, and the scene itself is not humorous. This demonstrates FunQA's high demand for common sense and sets it apart from datasets that focus on descriptive questions.
**Conclusion:**
It's essential to realize that these discrepancies are not mere mislabelings but highlight the core challenges in model reasoning. These examples from our dataset, juxtaposed with human interpretations, reaffirm the need for models to possess a deeper commonsense understanding.
While our emphasis is on commonsense reasoning, we also recognize that models face multifaceted challenges. Our goal is to provide a comprehensive dataset to nurture research in this direction, aspiring to drive collective advancement and collaborative efforts in the AI community.
**2. It isn't clear that performance on the proposed task is not directly correlated with improvements on other datasets. As such, it isn't completely clear what new topics can be explored beyond the differences in domains that the authors did demonstrate.**
Thank you for bringing up the very valuable discussion, and pointing out the potential correlation with other datasets. You're right; often, a performance trend on one benchmark may resemble that on another, such as the noted correlation between VQA and MSCOCO. While there may be some overlap in the skills tested by various datasets, each has its unique emphasis. Our dataset not only offers evaluation in new domains but also challenges models in ways other datasets might not. The uniqueness of FunQA is multi-dimensional:
- **Deep Temporal-Spatial Reasoning:** FunQA focuses on counterintuitive content, demanding models to first understand typical scenarios (common sense) and then discern the humorous deviation. This type of deep reasoning, especially in the context of humor and common sense, remains a challenging yet under-explored area.
- **Rich Annotations:** Unlike many datasets that lean on multi-choice or open-ended short answers, FunQA boasts free-text annotations with an average length of 34 words (prior to this, the most richly annotated dataset in the VideoQA field was NExT-QA [1], with an average answer length of 11.6 words). This detailed annotation approach allows for richer model responses and tests their ability to generate more nuanced answers.
- **Exploring the Sense of Humor:** A nuanced understanding of humor principles might be essential for models to truly grasp the content of some videos. (No prior dataset in the VideoQA field has touched on this; only novel VisualQA datasets such as *The New Yorker Caption Contest* [2] have.) Determining how to equip models with this knowledge, and deciding what other types of knowledge might be "valuable", are exciting avenues for research.
[1] Junbin Xiao, Xindi Shang, Angela Yao, Tat-Seng Chua. NExT-QA: Next Phase of Question-Answering to Explaining Temporal Actions. In CVPR, 2021.
[2] Jack Hessel, Ana Marasović, Jena D. Hwang, et al. Do Androids Laugh at Electric Sheep? Humor "Understanding" Benchmarks from The New Yorker Caption Contest. In ACL, 2023.
These unique aspects raise several interesting challenges that were missing or unclear in previous tasks, along with potential solutions that might greatly support the community.
**Challenges:**
- **Accurate understanding of informative and long videos:** Through our analysis of failure cases, we've observed that many models struggle with accurately describing videos. While they might be adept at detecting objects within the videos, they often falter in comprehending the contextual relationship between sequential events. Such misinterpretations indicate that there's a need for further exploration in this domain. The videos we've used can indeed serve as an invaluable dataset for probing video descriptions in depth. Therefore, new techniques for precise video perception are expected to enhance task performance.
- **Logic Reasoning:** The primary nature of our videos encompasses content that is counterintuitive and contradicts common sense. For models to understand these, it's imperative they grasp the concept of "common sense." They must deduce what would typically transpire under normal circumstances and then use that perspective to humorously interpret the video. This necessitates the model to possess strong reasoning capabilities, especially when it comes to common sense reasoning. It remains a critical research point on how to inject commonsense into the model.
- **Extra Knowledge - Sense of Humor:** To decipher the humor in a video, it's plausible that understanding the fundamental principles of humor is crucial. This type of knowledge, along with many other tidbits of common sense and additional information, might enhance the model's performance. Determining how to integrate valuable knowledge and discerning what counts as "valuable" are topics that warrant further exploration.
**Potential Solutions:**
- **Model Size:** increasing the number of parameters is a natural method to enhance the model's performance. However, this approach comes with its own set of engineering challenges, requiring improvements in model optimization and deployment. We're also curious about the relationship between the number of parameters and the performance on FunQA. This is an intriguing research point in itself, and our dataset can serve as an excellent test bed to further this exploration.
- **Data Quality:** We believe the emphasis for this task should be on data collection. Current trends with large and dynamic models have shown that having vast amounts of low-quality data isn't as effective as a smaller quantity of high-quality data. Thus, we hope the community can discover the type of data that genuinely assists in understanding counterintuitive videos. This is a crucial research direction.
- **Training Strategy:** Studying training strategies is also essential. For instance, determining which type of data to start learning from, and understanding the significance of curriculum learning, among others.
- **Model Collaboration:** Ultimately, we might not need to solely focus on a single model to solve this problem. Perhaps multiple models collaboratively working on examples in an elegant manner could be a method to enhance performance. However, this approach might necessitate paying more attention to the overall efficiency of model implementation.
**3. The authors did well describe some of the limitations of their work, but a potentially missing element is the somewhat subjective nature of tasks like humor, where some people may find it funny and others offended.**
The concept of "Humor" is indeed profound, and humor can sometimes be intertwined with potentially offensive elements. To address this, we have taken rigorous measures to avoid controversial content within our FunQA dataset, as outlined below (details can be found in Appendix A.1):
- **Defining Humor in FunQA:** At a macro level, we have defined what constitutes humor in our FunQA dataset: humor arises from the incongruity between reality and expectations, flourishing with the skillful juxtaposition and transformation of events. We adopt this definition to align with our assessment of models' spatial (juxtaposition) and temporal (transformation) capabilities.
- **Annotation Process for FunQA:** During the annotation phase, we carefully screen out videos that contain potentially offensive elements. Each video-question-answer pair in FunQA undergoes a two-step annotation and consensus evaluation process to eliminate subjectivity. The results indicate that only an average of 0.5% of individuals exhibit bias when determining whether videos in the FunQA dataset are "funny."
For instance, consider a video where the humor originates from a person's lack of understanding of physics, causing a bucket to hit their head. During the annotation and screening phase, if the person's head is depicted as seriously injured or bleeding, we would exclude it. Additionally, during the two-step annotation and consensus evaluation, if only a minority find the video funny, we would consider excluding it.
Through these strategies, we aim to minimize the potential issue of differing perceptions of humor, ensuring that the dataset avoids content that could be perceived as funny by some while offensive by others.
---
We hope that our responses address your concerns and we are happy to discuss any lingering issues.