# LAVENDER Rebuttal

**Note to AC CVPR 23**

Dear AC,

We sincerely appreciate all the efforts during the review phase. However, despite the valuable feedback we received from the reviewers, we have some concerns about the comments from Reviewer WXGF:
- Reviewer WXGF described our paper as having `unclear technical insight; uninteresting findings; missing key principle; unconvincing evaluation without qualitative figures in main paper`, while the other reviewers clearly recognized our contributions. As nicely summarized by Reviewer LZwc, the key principle of our work is the unified formulation with a unique MLM head to tackle various video-language tasks. LAVENDER `gives new point of view on problem formulation` and `is inspiring in other related fields` (Reviewer LZwc). Other encouraging comments on the design of LAVENDER note that it leads to a `simple` (Reviewer LZwc and oX2P), `lightweight` (Reviewer LZwc), and `elegant` (Reviewer oX2P) solution, with promise of `strong capability` (Reviewer LZwc) and `competitive performance` (Reviewer oX2P and RMMu) from just `a single set of weights` (Reviewer RMMu).

We have tried our best to further articulate our technical contributions within the one-page limit. However, we found the comments from Reviewer WXGF subjective; they may not reflect a full understanding of the paper and do not include detailed questions. We hope the AC can take this issue into consideration and re-weight the importance of each review. Thank you for reading.

**Note to AC**

Dear AC,

We sincerely appreciate all the efforts during the review and discussion phase. However, despite the valuable feedback we received from the reviewers, we have several concerns:
- Request to `compare with image-text models on image-text tasks` (by Reviewer JJFb). As stated throughout the paper, we focus on video-language modeling, and we have conducted extensive experiments and adequate analyses (noted by Reviewer Kx7Q) to compare with SOTA video-language models. We believe this request is unreasonable and should not be a reason for rejection.
- Comments about `lacking fundamental experiments on the difference between task-specific models and the proposed one, especially with frozen encoder` (by Reviewer MDML). While we have conducted the additional experiments, it is worth noting that freezing the encoder is not standard practice in the literature. We argue that finetuning with a frozen encoder should not be regarded as a fundamental experiment, as it does not bring new insights.

We believe our rebuttal has successfully resolved the concerns from both reviewers, which is also confirmed by Reviewer XMkF, and we were eager to discuss further with them. However, throughout the discussion phase, we only received a confirmation response from Reviewer XMkF and a question from Reviewer MDML without follow-up. We hope the AC can take this issue into consideration and re-weight the importance of each review, as we have tried our best to provide an extensive rebuttal addressing all concerns. Thank you for reading.

## General response

We thank the reviewers for their valuable feedback. We are encouraged that they found the study of a unified framework in our work to be interesting (Reviewer XMkF) and our model design to be neat (Reviewer Kx7Q). We are glad that they acknowledge the impressive (Reviewer XMkF) and competitive (Reviewer Kx7Q and MDML) performance over many video-language tasks (Reviewer JJFb) achieved by LAVENDER.
We are also pleased that they found our ablation experiments to be adequate (Reviewer Kx7Q) and clear (Reviewer JJFb), and the analysis of experimental results to be convincing (Reviewer Kx7Q). Below, we address common questions raised by multiple reviewers and answer the remaining questions in the individual responses. Due to the 9-page limit on the updated paper, we will incorporate all feedback and include all promised revisions and new results in the final version.

**GQ1: Comparison to other models in the number of parameters (Reviewer Kx7Q and MDML).**
Thanks for the suggestion; we will add another column to Tables 5 and 6 to directly compare the number of parameters of LAVENDER and prior arts. Below, we list several previous models with their parameter counts (as reported in the original papers or calculated by follow-up work) and compare them with LAVENDER.

| Model | ActBERT | TACo | ClipBERT | VIOLET | Frozen | Bridge-Former | LAVENDER |
| -------- | -------- | -------- | -------- | -------- | -------- | -------- | -------- |
| \# parameters | 275M | 212M | 137M | 198M | 232M | 152M | 198M |

Compared with other models in the literature, LAVENDER is of comparable model size and requires less data to achieve better performance.

**GQ2: Ablation on video backbone (Reviewer Kx7Q and XMkF).**
We base our model on Video-Swin due to its strong performance on video action recognition, on video captioning in SwinBERT, and on video QA/retrieval in VIOLET. In addition, compared to object detection backbones, Video-Swin is easier to train end-to-end, which helps build a strong baseline for LAVENDER-TS. For a strictly fair comparison, we design LAVENDER to differ from LAVENDER-TS **only** in the shared MLM head. Hence, the results in Table 2 provide evidence that the performance gain largely comes from the unified architecture. Note that LAVENDER-TS is also our own re-implementation of the VIOLET architecture [1]. Even when compared with VIOLET in Tables 5 and 6, LAVENDER outperforms VIOLET across all tasks with much less pre-training data (16M videos + 14M images vs. 183M videos + 3M images).

To further address this concern, we conduct additional experiments with a shared MLM head and objective on top of the ClipBERT [2] architecture (named ClipBERT-MLM), where a ResNet-50 with mean pooling is used as the video encoder, and the text encoder + fusion encoder are initialized with BERT-base. We pre-train ClipBERT-MLM with VTM (modeled as MLM) + MLM on COCO+VG data and perform single-task finetuning on downstream tasks. The results in the table below further validate the gain from our unified architecture.

| Model | TGIF-Action | MSRVTT-QA | DiDeMo-Ret |
| -------- | -------- | -------- | -------- |
| ClipBERT | 82.9 | 37.4 | 43.1 |
| ClipBERT-MLM (ours) | **88.9** | **40.2** | **43.8** |

[1] VIOLET: End-to-End Video-Language Transformers with Masked Visual-token Modeling
[2] Less is more: Clipbert for video-and-language learning via sparse sampling

**GQ3: Freeze multi-modal encoder during finetuning (Reviewer JJFb and MDML).**
We follow the standard practice and popular trend in the literature to train LAVENDER end-to-end, during both the pre-training and finetuning stages. As requested by Reviewer MDML, we compare frozen-encoder and end-to-end finetuning for both LAVENDER and LAVENDER-TS below. All results are reported with single-task finetuning, based on weights pre-trained on 2.5M videos + 3M images.

| Model | Frozen Encoder | Meta-Ave. | TGIF-Action | MSVD-QA | MSRVTT-Caption | DiDeMo-Retrieval |
| -------- | -------- | -------- | -------- | -------- | -------- | -------- |
| LAVENDER (as in L7, Table 2) | N | **68.9** | **95.8** | **54.4** | **57.3** | **68.2** |
| LAVENDER | Y | 36.1 | 28.1 | 37.9 | 33.4 | 44.8 |
| LAVENDER-TS (as in L5, Table 2) | N | 64.0 | 94.5 | 46.7 | 59.0 | 55.7 |
| LAVENDER-TS | Y | 30.0 | 21.7 | 19.2 | 34.8 | 44.3 |

Freezing the encoder parameters results in a severe performance drop for both models (-32.8 for LAVENDER and -34.0 for LAVENDER-TS).
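For completeness, the snippet below is a minimal sketch of how the frozen-encoder setting above can be configured, assuming a standard PyTorch implementation; the submodule names and the learning rate are illustrative placeholders rather than excerpts from our training code.

```python
# Minimal sketch of the frozen-encoder baselines in GQ3 (illustrative only).
# The submodule names (`video_encoder`, `text_encoder`, `fusion_encoder`)
# are hypothetical, not verbatim from our codebase.
import torch

def build_optimizer(model, freeze_encoder: bool, lr: float = 5e-5):  # lr is a placeholder value
    if freeze_encoder:
        for module in (model.video_encoder, model.text_encoder, model.fusion_encoder):
            for p in module.parameters():
                p.requires_grad = False  # encoders receive no gradient updates
    # Only the remaining trainable parameters (e.g., the shared MLM head for
    # LAVENDER, or the task-specific heads for LAVENDER-TS) are optimized.
    trainable_params = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.AdamW(trainable_params, lr=lr)
```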
## R1 (Reviewer Kx7Q)

**W1 & Q1: Comparison of parameter volumes.** Please refer to GQ1 in the general response.

**W2: Details about ST and MT.** Sorry about the confusion; we will reorganize the text to make this clearer in revision. ST for LAVENDER is explained in L156-177, and ST for LAVENDER-TS in the first paragraph of Section 4.2 (L214-228). For MT, LAVENDER-TS is finetuned with one head per task (as indicated by the number of parameters in Table 2, 4 tasks add 4H parameters to the backbone with P parameters), while LAVENDER shares the same MLM head (P+H parameters in total). All results in Table 2 are based on the default pre-training setting with 2.5M videos and 3M images. For the implementation and training configurations (number of epochs, learning rate, etc.), we refer the reviewer to Appendix C.

**Q2: Ablation of video backbone.** Please refer to GQ2 in the general response.

**Q3: Clarification of few-shot setting.** Sorry about the confusion. Both models (LAVENDER and LAVENDER-TS) are pre-trained on WebVid2.5M+CC3M, **without** multi-task finetuning. Therefore, both models learn from the exact same amount of data, that is, 2.5M videos + 3M images during pre-training and 5, 7, 10, 20, …, 90% of the downstream data during finetuning.

**Q4: Objectives and vocabulary size in finetuning and pre-training.** LAVENDER adopts the same objective (Masked Language Modeling) and vocabulary size (30,522, the same as BERT-base) for both pre-training and all finetuning experiments (an illustrative sketch is included at the end of this response block).

**Limitation: Computationally expensive with large-scale training.** We agree that the large-scale pre-training in our experiments can be computationally expensive; we will open-source our code and release the pre-trained models for easier reproduction, to benefit the research community. We also want to note that, compared with other models in the literature, LAVENDER is of comparable model size (see the detailed comparison in GQ1 of the general response) and requires less data to achieve better performance. To give specific examples, LAVENDER achieves SOTA performance with < 30M pre-training images/videos, while recent models such as MERLOT are pre-trained on 180M videos, All-in-One [1] on 103M - 283M videos, and CLIP-based models (BridgeFormer, QB-Norm and CAMoE) on 400M images.

[1] All in One: Exploring Unified Video-Language Pre-training
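To make Q4 concrete, the snippet below sketches how different downstream tasks reuse the same MLM interface over the shared 30,522-word vocabulary. The `model` call signature and the HuggingFace-style tokenizer are simplifying assumptions for illustration, not our exact implementation.

```python
# Illustrative sketch: every task reuses the same MLM head over the full
# 30,522-word BERT vocabulary. `model` is assumed to return per-token
# vocabulary logits of shape [batch, seq_len, vocab].
import torch

def scores_at_mask(model, tokenizer, video, text_with_mask):
    enc = tokenizer(text_with_mask, return_tensors="pt")
    logits = model(video, enc["input_ids"], enc["attention_mask"])  # [1, T, 30522]
    mask_pos = (enc["input_ids"][0] == tokenizer.mask_token_id).nonzero()[0]
    return logits[0, mask_pos.item()]  # scores over the shared vocabulary

# Open-ended video QA: the answer word is read off at the appended [MASK].
#   scores = scores_at_mask(model, tokenizer, video, "what is the man doing? [MASK]")
#   answer = tokenizer.convert_ids_to_tokens(int(scores.argmax()))
# Video-text matching / retrieval: the same head predicts "true" or "false".
#   scores = scores_at_mask(model, tokenizer, video, caption + " [MASK]")
#   is_match = scores[tokenizer.convert_tokens_to_ids("true")] > \
#              scores[tokenizer.convert_tokens_to_ids("false")]
```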
## R2 (Reviewer XMkF)

**W1: Ablation on video backbone.** Please refer to GQ2 in the general response.

**W2: Taxonomy.** We follow previous work [1] to perform multi-task finetuning with tasks from the same video source (MT by video domain) and with tasks of the same task type (MT by task type). The tables below show some initial experimental results, and we directly compare with the multi-task setting (MT with mixed video domain and task type) in Table 2 and the all-task finetuning (AT) in Table 3. Overall, combining all tasks for multi-task finetuning (AT in Table 3) empirically strikes a good balance: it avoids sophisticated heuristic designs of the multi-task groupings while maintaining good model performance.

| | MSRVTT-QA | MSRVTT-Caption | MSRVTT-Retrieval |
| -------- | -------- | -------- | -------- |
| ST | **44.2** | 57.3 | **58.9** |
| **MT by video domain** | 44.1 | 56.8 | 55.3 |
| MT with mixed video domain and task type (as in Table 2) | N/A | **57.4** | N/A |
| AT (best) (as in Table 3) | **44.2** | 57.2 | 56.4 |

*Note: Due to differences in data splits for tasks on MSRVTT videos, we strictly filter out testing videos from all training splits for all multi-task finetuning (a more detailed discussion is in Appendix C). Hence, on the retrieval task, the ST model is finetuned with more data than all MT models.*

| | MSRVTT-QA | MSVD-QA | TGIF-Frame | LSMDC-FiB |
| -------- | -------- | -------- | -------- | -------- |
| ST | **44.2** | 54.4 | **72.2** | **56.9** |
| **MT by task type** | 43.2 | 54.3 | 70.4 | 55.93 |
| MT with mixed video domain and task type (as in Table 2) | N/A | 53.5 | N/A | N/A |
| AT (best) (as in Table 3) | **44.2** | **55.4** | 71.6 | 56.7 |

| | MSRVTT-Caption | MSVD-Caption |
| -------- | -------- | -------- |
| ST | 57.3 | 139.4 |
| **MT by task type** | 56.9 | 141.4 |
| MT with mixed video domain and task type (as in Table 2) | **57.4** | N/A |
| AT (best) (as in Table 3) | 57.2 | **141.6** |

[1] VALUE: A multi-task benchmark for video-and-language understanding evaluation.

**W3: Ablation on pre-training data.** We conduct ablations on using image-text only (CC3M) or video-text only (WebVid2.5M) data for pre-training, and compare them with what we reported in Table 2, where the model is pre-trained on both WebVid2.5M and CC3M. All results in the table below are reported under single-task finetuning.

| Pre-train Data | Meta-Ave. | TGIF-Action | MSVD-QA | MSRVTT-Caption | DiDeMo-Retrieval |
| -------- | -------- | -------- | -------- | -------- | -------- |
| N/A (as in L1, Table 2) | 45.5 | 93.5 | 40.8 | 47.7 | 0.0 |
| WebVid2.5M (video-text only) | 65.1 | 94.3 | 53.0 | 54.7 | 58.2 |
| CC3M (image-text only) | 65.4 | 92.9 | 52.2 | 55.5 | 61.1 |
| WebVid2.5M+CC3M (video+image, as in L7, Table 2) | **68.9** | **95.8** | **54.4** | **57.3** | **68.2** |

Compared to no pre-training, pre-training on image-text pairs alone improves three tasks and performs comparably on TGIF-Action. Pre-training on video-text pairs alone improves all tasks, and combining image-text and video-text data achieves the best results. Our observation that this combined pre-training recipe is beneficial for video-text tasks is consistent with what was reported in [2]. We would like to clarify that evaluating image-language models on video-text tasks is out of the scope of this paper. However, previous work [3] has shown that image-language pre-training, especially at a large scale (e.g., CLIP pre-trained on 400M image-text pairs), can greatly improve model performance on video-text retrieval [4] and video captioning [5]. We have also directly compared LAVENDER with [2,3] in Tables 5 and 6, and show that LAVENDER can achieve better performance with less pre-training data.
[2] Frozen in time: A joint video and image encoder for end-to-end retrieval
[3] Less is more: Clipbert for video-and-language learning via sparse sampling
[4] Clip4clip: An empirical study of clip for end to end video clip retrieval
[5] Clip4caption: Clip for video caption

**W4: Performance drop in large-scale pre-training setting.** Based on our understanding, this question concerns the zero-shot performance in Table 4, where a performance degradation on MSRVTT-QA and MSVD-QA is observed when pre-training with a larger amount of data. For open-ended QA tasks, the model is not exposed to similar data (i.e., the question-answer pairs of the downstream tasks) during pre-training. Under zero-shot evaluation, we hypothesize that the model prediction depends heavily on the word distribution of the pre-training text. Therefore, a larger amount of pre-training data that is less carefully filtered may render worse zero-shot predictions. However, as shown in Table 5, larger-scale pre-training achieves better performance when finetuned.

**W5: Comparison to related work in NLP.** Thanks for sharing this related work in NLP; we will cite this paper in revision. Different from that work, the goal of this paper is to provide a unified framework across all the video-text tasks considered. Our unification focuses not only on the downstream tasks, but also on the design of the pre-training tasks. The enabled zero-shot evaluation (which is more relevant to the multi-null-prompt work) is a useful byproduct of our unified framework.

**W6: Model performance on STAR.** We run additional experiments, and LAVENDER achieves a new SOTA on STAR. All LAVENDER results below are based on single-task finetuning.

| Method | Pre-train Data | val | test (mean) | test (interaction) | test (sequence) | test (prediction) | test (feasibility) |
| -------- | -------- | -------- | -------- | -------- | -------- | -------- | -------- |
| Current SOTA on Leaderboard | Not revealed | - | 53.98 | 58.05 | 60.27 | 53.77 | 43.83 |
| LAVENDER | 2.5M + 3M | 57.76 | 57.18 | 51.67 | 60.87 | 60.34 | 55.83 |
| LAVENDER | 16M+14M | **58.57** | **58.77** | **52.88** | **61.04** | **61.87** | **59.30** |

**Q1: Details about caption generation.** The captions are generated auto-regressively during inference, while the training objective remains the same masked language modeling (a minimal code sketch of the decoding loop is included further below).
* During training, we randomly mask 15% of the tokens in the captions and let the model predict the masked tokens.
* During inference, at each generation step, a [MASK] token is appended to the previously generated tokens, and the model predicts the current token from the learned embedding at the [MASK] position.

Note that the attention mask used for caption generation is a causal attention mask; that is, a given word only attends to the words before it, not to the ones coming after it.

**Q2 & Q3: Clarification about MT/ST (Table 4 & Figure 3) and LAVENDER/LAVENDER-TS (Tables 2 and 3).** Thanks for the suggestion and sorry about the confusion; we will make this clearer in revision. To clarify, all results in Figure 3 are based on single-task finetuning, **without** multi-task finetuning. Models in Table 4 are directly evaluated under the zero-shot setting after pre-training; **no** multi-task finetuning is involved. In Table 2, the models with task-specific heads are LAVENDER-TS, and all results in Table 3 are based on LAVENDER.

**Q4: L253, Most similar work maybe UNIT.** Thanks for the suggestion; we will add a citation to UNIT at L253.
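To complement the answer to Q1, the snippet below gives a minimal sketch of the inference-time decoding loop (greedy decoding for simplicity); the model interface and the `causal_text_mask` flag are illustrative assumptions rather than our exact implementation.

```python
# Minimal sketch of the auto-regressive decoding loop described in Q1.
import torch

@torch.no_grad()
def generate_caption(model, tokenizer, video, max_len=20):
    ids = [tokenizer.cls_token_id]                       # [CLS] starts the sentence
    for _ in range(max_len):
        # Append a [MASK] token after the tokens generated so far.
        input_ids = torch.tensor([ids + [tokenizer.mask_token_id]])
        # Text tokens use a causal attention mask: each word only attends to
        # the words before it, never to the ones after it.
        logits = model(video, input_ids, causal_text_mask=True)  # [1, T, vocab]
        next_id = int(logits[0, -1].argmax())            # prediction at the [MASK] slot
        if next_id == tokenizer.sep_token_id:            # end-of-sentence
            break
        ids.append(next_id)
    return tokenizer.decode(ids[1:])                     # drop the leading [CLS]
```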
**Q5: L258, Why multi-task + pre-training worse than pre-training?** Similar observations have been made in [1] with a task-specific model design, where the performance drop is even more severe (-3.0 on average) when comparing multi-task finetuning combined with pre-training against pre-training alone. While the performance drop observed with LAVENDER is less severe (-0.5 on average), we hypothesize that it is partly due to differences in dataset size (e.g., the best-performing epoch differs across tasks) and the reduced number of training examples caused by the strict data filtering (as mentioned in the response to W2).

[1] VALUE: A multi-task benchmark for video-and-language understanding evaluation.

## R3 (Reviewer JJFb)

**W1 & Q1: Applying LAVENDER to other video-text tasks (e.g., video spatial/temporal grounding).** Thanks for the great suggestions about potential applications of LAVENDER. In this work, we follow most prior works in the video-language literature (e.g., ClipBERT, JustAsk, Frozen, MERLOT, VIOLET, SwinBERT) and evaluate LAVENDER on the popular video-language tasks. Different from previous works that focus on a single type of video-language task (e.g., JustAsk for QA, SwinBERT for captioning and Frozen for retrieval), we have shown that LAVENDER achieves strong performance on 14 video-text tasks across video retrieval, QA, and captioning, and we have conducted extensive experiments across multiple settings. The application to video spatial/temporal grounding, along with the other potential directions discussed in L342-344, are all great directions to explore in follow-up work. Here, we briefly discuss a possible way to extend LAVENDER to video spatial/temporal grounding (an illustrative sketch of the coordinate-to-token mapping is included further below). Theoretically, LAVENDER can be extended to video spatial/temporal grounding: one can first project the continuous outputs (e.g., spatial coordinates and timestamps) into discrete tokens, following the practice of Pix2seq [1] for object detection, and then follow the caption-generation finetuning pipeline to generate the bounding boxes or temporal window proposals in an autoregressive manner. However, video spatial/temporal grounding usually requires long or high-resolution videos, which poses other challenges; for example, such videos are hard to fit into memory for end-to-end learning. We will add a more detailed discussion in revision.

[1] Pix2seq: A language modeling framework for object detection.

**W2 & Q2: Results on image-text tasks.** The design of LAVENDER is not specific to video-text inputs and should be extendable to image-text tasks. However, in this work we focus on video-text tasks and hope to inspire more work on building unified video-text models. That said, we show some initial results by evaluating the pre-trained LAVENDER on Visual Commonsense Reasoning (a challenging image-text task), following [2,3].

| Model | Pre-train Data | VCR Q->A |
| -------- | -------- | -------- |
| MERLOT [2] | ~3M images | 58.9 |
| MERLOT | 100M videos | 66.3 |
| MERLOT | 180M videos | ++75.2++ |
| VIOLET [3] | 183M videos + 3M images | **76.3** |
| LAVENDER | 2.5M videos + 3M images | 71.6 |
| MERLOT | 180M videos (8 x longer pre-training) | 80.6 |

LAVENDER still performs competitively on this image-text task, much better than MERLOT pre-trained on 3M images or 100M videos (71.6 vs. 58.9/66.3). Note that LAVENDER is pre-trained and finetuned with lower-resolution inputs (224x224), while MERLOT is based on videos/images of much higher resolution (384x704).

[2] Merlot: Multimodal neural script knowledge models
[3] VIOLET: End-to-End Video-Language Transformers with Masked Visual-token Modeling
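To illustrate the grounding extension discussed in W1 & Q1, the snippet below sketches a possible Pix2seq-style mapping from continuous temporal boundaries to discrete tokens; this is a hypothetical extension for illustration only and not part of the submitted model.

```python
# Hypothetical Pix2seq-style discretization for temporal grounding: continuous
# boundaries are quantized into bin tokens, which the same MLM head could then
# generate autoregressively, exactly like caption words.
def timestamp_to_token(t_sec: float, video_len_sec: float, num_bins: int = 100) -> str:
    """Map a timestamp in [0, video_len_sec] to one of `num_bins` bin tokens."""
    bin_id = min(int(t_sec / video_len_sec * num_bins), num_bins - 1)
    return f"[TIME_{bin_id}]"  # bin tokens would be added to the vocabulary

def window_to_tokens(start_sec: float, end_sec: float, video_len_sec: float) -> list:
    """A temporal grounding window becomes a short, generable token sequence."""
    return [timestamp_to_token(start_sec, video_len_sec),
            timestamp_to_token(end_sec, video_len_sec)]

# Example: a 60-second video with a grounded window of 12.3s-30.0s
# window_to_tokens(12.3, 30.0, 60.0)  ->  ['[TIME_20]', '[TIME_50]']
```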
**Q3-1: Multi-task finetuning only vs. pre-training only.** Multi-task finetuning can also be regarded as a form of pre-training. In Table 2, the `VTM (modeled as MLM) + MLM` pre-training-only results are reported in L7, and the multi-task-finetuning-only results in L2; we copy the results into the table below for easier access. Comparing the two, pre-training only gives better performance (+10.4 in Meta-Ave.) than multi-task finetuning only.

| Setting | Meta-Ave. | TGIF-Action | MSVD-QA | MSRVTT-Caption | DiDeMo-Retrieval |
| -------- | -------- | -------- | -------- | -------- | -------- |
| Multi-task finetuning only (as in L2, Table 2) | 58.5 | 95.9 | 47.4 | 41.2 | 50.0 |
| `VTM (modeled as MLM) + MLM` pre-training only (as in L7, Table 2) | **68.9** | **95.8** | **54.4** | **57.3** | **68.2** |

**Q3-2: Full comparison between LAVENDER and LAVENDER-TS in Table 3.** We include LAVENDER-TS results under `MT (all-in-one)` below and compare them with the LAVENDER results reported in Table 3. Both models are pre-trained with 2.5M videos + 3M images. LAVENDER performs better (+4.2 on average) than LAVENDER-TS across all tasks.

| Method | Meta-Ave. | TGIF-Act. | TGIF-Trans. | TGIF-Frame | MSRVTT-MC | MSRVTT-QA | MSRVTT-Ret. | MSRVTT-Cap. | LSMDC-MC | LSMDC-FiB | LSMDC-Ret. | MSVD-QA | MSVD-Ret. | MSVD-Cap. | DiDeMo-Ret. |
| -------- | -------- | -------- | -------- | -------- | -------- | -------- | -------- | -------- | -------- | -------- | -------- | -------- | -------- | -------- | -------- |
| LAVENDER (as in Table 3) | **73.4** | **95.8** | **98.0** | **70.7** | **93.9** | **44.1** | **56.3** | **57.1** | **85.3** | **56.5** | **39.4** | **53.4** | **69.2** | **141.1** | **66.1** |
| LAVENDER-TS | 69.2 | 93.8 | 97.2 | 65.4 | 92.2 | 41.7 | 52.7 | 54.2 | 83.0 | 49.5 | 34.7 | 49.2 | 65.6 | 133.7 | 56.5 |

**Q4: Freezing encoder during finetuning?** Following the standard practice in video-language research, all model parameters are trained during finetuning. We have also included results with a frozen encoder in GQ3 of the general response.

**Q5: Contribution list in introduction.** Thanks for the suggestion. We will add a contribution list in revision, which is also shown below.
* To the best of our knowledge, we introduce the first unified video-language (VidL) framework, LAVENDER, that can tackle various VidL tasks with a unified Masked Language Modeling objective.
* Without any task-specific architectures, LAVENDER outperforms the prior state of the art on 12 out of 14 benchmarks considered, even when pre-trained with much less data.
* Extensive experiments and analyses show that LAVENDER is better suited for multi-task learning, few-shot generalization, and zero-shot evaluation on video question answering tasks.

## R4 (Reviewer MDML)

**Q1: Zero-shot can also be enabled by decoder.** We agree that using a decoder also enables zero-shot prediction. Our claim in the abstract and introduction is that, compared to *task-specific VidL models* in the literature, our unified model LAVENDER enables zero-shot ability via MLM. We did not attempt to claim that the zero-shot ability is limited to the use of MLM, and we would like to note that the enabled zero-shot evaluation is not our key contribution, but a useful byproduct of our unified framework.

**Q2: What is the role of the [CLS] token in pre-training?** [CLS] is not used explicitly in pre-training.
We keep the format consistent with downstream finetuning and with the pre-trained BERT-base weights used to initialize the text encoder and fusion encoder, especially for the captioning task, where [CLS] serves as the beginning-of-sentence token. We show results without the [CLS] token during pre-training or finetuning in the response to Q3.

**Q3 & Q8: Ablation on the position to insert [MASK].** We ablate the position at which the [MASK] token is inserted during both the pre-training stage (rows of each table) and the finetuning stage (columns of each table). Following your suggestion, we experiment with the following strategies (a small illustrative helper is included further below):
* **Replace [CLS]** with [MASK].
* Insert [MASK] at the **beginning** of the sentence, before [CLS].
* Insert [MASK] in the **middle** of the sentence. For simplicity, we insert the [MASK] token at a fixed position, as the 10th token.
* Insert [MASK] at the **end** of the sentence. This is the original setting in the paper.

For faster iteration, we pre-train LAVENDER with the varying [MASK] positions on CC3M data and perform single-task finetuning on each task.

- **MSVD-QA**

| [MASK] position | replace [CLS] | begin | middle (10th) | end |
| -------- | -------- | -------- | -------- | -------- |
| replace [CLS] | 50.4 | 50.2 | 50.0 | 50.6 |
| begin | 51.1 | 51.2 | 50.7 | 51.8 |
| middle (10th) | 48.7 | 48.6 | 50.3 | 50.6 |
| end | 50.9 | 51.2 | 51.7 | **52.2** |

- **MSRVTT-Cap**

| [MASK] position | end |
| -------- | -------- |
| replace [CLS] | 54.6 |
| begin | 54.6 |
| middle (10th) | 53.8 |
| end | **55.5** |

*Note: For auto-regressive caption generation, the [MASK] token is always appended to the previously generated tokens.*

- **TGIF-Action**

| [MASK] position | replace [CLS] | begin | middle (10th) | end |
| -------- | -------- | -------- | -------- | -------- |
| replace [CLS] | 71.6 | 78.7 | 91.9 | 91.7 |
| begin | 83.0 | 80.8 | 92.9 | 92.4 |
| middle (10th) | 91.5 | 90.7 | **93.3** | 91.3 |
| end | 90.7 | 91.6 | 91.8 | 92.9 |

The results above show that *inserting [MASK] at the end of the sentence* consistently brings competitive performance across different tasks.

**Q4: Encoder-decoder architecture for Table 2.** We follow the popular encoder-only architecture adopted in the video-language literature (e.g., VideoBERT, ClipBERT, TACo, MERLOT, VIOLET). Compared to an encoder-decoder architecture, the MLM head is more lightweight, as shown in Figure 1. That being said, we report the performance of an encoder-decoder model on MSRVTT captioning as a comparison. All results are based on single-task finetuning without pre-training.

| Model Architecture | # Layers in Decoder | MSRVTT-Cap |
| ----- | ----- | ----- |
| Encoder-only (LAVENDER, as in L1, Table 2) | N/A | **47.7** |
| Encoder-Decoder | 12 | 42.8 |
| Encoder-Decoder | 6 | 43.8 |
| Encoder-Decoder | 4 | 45.0 |
| Encoder-Decoder | 2 | 42.6 |

The results above show some interesting findings: (i) reducing the number of decoder layers can improve captioning performance (CIDEr score), although the performance drops again with only 2 decoder layers; (ii) encoder-only achieves better performance than the encoder-decoder variants, which may be due to the larger number of randomly initialized parameters added in the encoder-decoder architecture. Full encoder-decoder model pre-training is out of the scope of this paper.
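To make the four strategies in Q3 & Q8 fully explicit, the snippet below provides a small illustrative helper (names are hypothetical) that reproduces the ablated [MASK]-insertion positions on an already-tokenized sequence.

```python
# Illustrative helper for the four [MASK]-insertion strategies ablated in Q3 & Q8,
# operating on a tokenized id list that is assumed to start with [CLS].
def insert_mask(ids, mask_id, position="end", middle_index=10):
    ids = list(ids)
    if position == "replace_cls":
        ids[0] = mask_id                    # replace [CLS] with [MASK]
    elif position == "begin":
        ids.insert(0, mask_id)              # insert [MASK] before [CLS]
    elif position == "middle":
        ids.insert(middle_index, mask_id)   # fixed position: the 10th token
    else:                                   # "end": the default setting in the paper
        ids.append(mask_id)
    return ids
```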
**Q5: Zero-performance on DiDeMo in L1, Table 2.** We empirically observe that finetuning LAVENDER on DiDeMo retrieval without pre-training does not converge. This result indicates that, in order to model the retrieval task as MLM, where the answer is limited to two words (`true` or `false`) instead of 30,522 words, the model has to learn from more data (for example, via pre-training or multi-task finetuning). We will add this discussion as a footnote to the table.

**Q6 & Q8: Freezing encoder during finetuning. This probably will improve performance and data efficiency of these baselines.** Performing full-model finetuning is the standard practice in vision-language research. As shown in GQ3 of the general response, freezing the encoder during finetuning is not beneficial to downstream performance for either LAVENDER or LAVENDER-TS. For data efficiency, we follow the same experimental setting as in Figure 3 and gradually reduce the number of training examples used for downstream finetuning. The results are summarized in the tables below, and we will include a figure in revision (similar to Figure 3, but with a frozen encoder). All results are reported under single-task finetuning with pre-training on 2.5M videos + 3M images.

First, with a frozen encoder, both models suffer greatly when finetuned (low performance on all tasks below). Furthermore, the percentage of training data required to achieve 90% of the full-finetuning performance, while still producing somewhat meaningful results, increases across all tasks for both LAVENDER-TS and LAVENDER. For example, on MSRVTT-Cap, both LAVENDER and LAVENDER-TS require more than 4x more data to achieve 90% of the full-finetuning performance (40% with a frozen encoder vs. < 10% with end-to-end finetuning).

- **MSVD-QA**

| Method | Frozen Encoder | 0.9 x 100% perf. (% data needed) | 10% | 20% | 30% | 40% | 50% | 60% | 70% | 80% | 90% | 100% |
| -------- | -------- | -------- | -------- | -------- | -------- | -------- | -------- | -------- | -------- | -------- | -------- | -------- |
| LAVENDER | N | 48.9 (**<20%**) | 48.2 | 50.2 | 51.4 | 51.9 | 52.7 | 53.1 | 53.1 | 53.6 | 54.0 | 54.3 |
| LAVENDER | Y | 34.1 (~40%) | 26.0 | 30.5 | 32.6 | 34.0 | 34.8 | 35.9 | 36.6 | 37.1 | 37.6 | 37.87 |
| LAVENDER-TS | N | 42.1 (**<60%**) | 24.7 | 32.5 | 36.9 | 41.0 | 41.3 | 42.6 | 43.5 | 44.9 | 46.7 | 45.9 |
| LAVENDER-TS | Y | 17.3 (~70%) | 14.3 | 14.5 | 15.0 | 15.0 | 16.3 | 16.7 | 17.3 | 17.9 | 18.4 | 19.2 |

- **TGIF-Action**

| Method | Frozen Encoder | 0.9 x 100% perf. (% data needed) | 10% | 20% | 30% | 40% | 50% | 60% | 70% | 80% | 90% | 100% |
| -------- | -------- | -------- | -------- | -------- | -------- | -------- | -------- | -------- | -------- | -------- | -------- | -------- |
| Random Guess | - | 20.00 (0%) | - | - | - | - | - | - | - | - | - | - |
| LAVENDER | N | 85.8 (**<40%**) | 24.2 | 67.8 | 80.7 | 86.8 | 89.6 | 91.7 | 92.9 | 93.9 | 94.7 | 95.4 |
| LAVENDER | Y | 25.3 (<30%, but close to random guess) | 23.7 | 24.3 | 26.3 | 26.0 | 25.9 | 27.0 | 26.6 | 26.9 | 27.3 | 28.1 |
| LAVENDER-TS | N | 85.1 (**<60%**) | 26.8 | 37.7 | 47.2 | 69.0 | 79.3 | 85.8 | 89.0 | 92.1 | 93.4 | 94.5 |
| LAVENDER-TS | Y | 19.5 (<10%, but worse than random guess) | 21.1 | 21.9 | 21.6 | 21.5 | 21.7 | 21.2 | 21.4 | 21.3 | 21.3 | 21.7 |

- **MSRVTT-Cap**
| Method | Frozen Encoder | 0.9 x 100% perf. (% data needed) | 10% | 20% | 30% | 40% | 50% | 60% | 70% | 80% | 90% | 100% |
| -------- | -------- | -------- | -------- | -------- | -------- | -------- | -------- | -------- | -------- | -------- | -------- | -------- |
| LAVENDER | N | 52.8 (**<10%**) | 54.8 | 56.8 | 57.3 | 57.9 | 58.0 | 58.3 | 58.7 | 59.2 | 58.9 | 58.7 |
| LAVENDER | Y | 30.0 (~40%) | 21.1 | 25.8 | 28.2 | 29.9 | 30.6 | 31.3 | 31.9 | 32.5 | 32.9 | 33.4 |
| LAVENDER-TS | N | 51.9 (**<10%**) | 52.2 | 54.7 | 55.4 | 55.7 | 55.3 | 57.1 | 57.0 | 57.7 | 57.0 | 57.7 |
| LAVENDER-TS | Y | 31.3 (~40%) | 20.8 | 27.7 | 30.0 | 31.3 | 32.3 | 32.8 | 33.2 | 33.5 | 33.9 | 34.8 |

**Q7: Table 5: the number of parameters for previous arts are not shown for comparison.** Please refer to GQ1 in the general response.

**Weakness and Limitations: The paper lacks some fundamental experiments to shed a light on the difference between task-specific models and the proposed one. For example, there is no performance report for task-specific models, when the encoder is frozen during finetuning.** Thanks for your suggestions. We have conducted experiments with (1) finetuning with a frozen encoder (GQ3 in the general response, Q6 & Q8) and (2) ablations on how to add the [MASK] token (Q3 & Q8). In addition, we would like to summarize all the ablation studies included in the main text, where we have shown direct and strictly fair comparisons between the task-specific baseline and LAVENDER under multiple settings: (1) with/without pre-training; (2) single-task and multi-task finetuning; and (3) evaluation of zero-shot and few-shot generalizability. Let us know if you have any other questions or concerns. We are open to suggestions on additional experiments to highlight more differences between task-specific baselines and LAVENDER.