<style>
.red {
color: red;
}
</style>
# [Fine-grained Audio-visual Joint Representations for Multimodal Large Language Models](https://arxiv.org/abs/2310.05863)
:::danger
**Github:** https://github.com/the-anonymous-bs/FAVOR
:::
## 1. Introduction
1. Text-based large language models (LLMs) have demonstrated remarkable performance in various natural language processing tasks, especially achieving human-level capabilities in reasoning and comprehension.
:::info
**Instruction fine-tuning**, where data is organised as pairs of user instruction (or prompt) and reference response, has emerged as a training paradigm that enables LLMs to perform various tasks by following open-ended natural language instructions from non-expert users.
:::
2. Recently, there has been a burgeoning research interest in equipping LLMs with visual and auditory perception abilities.
3. These investigations often employ a trained **modality alignment module** that aligns the representation space of the input modality with the text one.
:::info
This sentence means that, in multimodal language model research, a pre-trained modality alignment module is typically employed to align the representation space of the input modality (e.g. images or video) with the text representation space.
More specifically:
1. The **modality alignment module** is a pre-trained neural network used to align the representation spaces of different modalities.
2. The multimodal inputs in these studies usually include images, video and audio, and each modality has its own feature representation space.
3. To fuse features from different modalities, they first need to be mapped or aligned into a common space, namely the text (word-embedding) representation space.
4. The **modality alignment module** performs this alignment from each modality's features into the text representation space; once aligned, the features can be concatenated or fused.
5. The fused multimodal features can then be fed into the language model for understanding and generation tasks.
In short, the role of the pre-trained **modality alignment module** is to <span class='red'>align different modalities to the text representation space</span>, which is the foundation that allows multimodal language models to concatenate and fuse features effectively.
:::
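As a rough, non-authoritative illustration (not the paper's actual module), a modality alignment step can be thought of as a trainable projection that maps frozen encoder features into the LLM's token-embedding space; the class name and dimensions below are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class ModalityAligner(nn.Module):
    """Minimal sketch of a modality alignment module (illustrative only).

    Projects frozen encoder features (e.g. visual or audio) into the LLM
    token-embedding space so they can be concatenated with text tokens.
    """
    def __init__(self, encoder_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Linear(encoder_dim, llm_dim)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # features: (batch, seq_len, encoder_dim) -> (batch, seq_len, llm_dim)
        return self.proj(features)

# Hypothetical dimensions: 768-dim encoder features aligned to a 5120-dim LLM space.
aligner = ModalityAligner(encoder_dim=768, llm_dim=5120)
aligned = aligner(torch.randn(1, 32, 768))  # -> (1, 32, 5120)
```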
4. Despite the sequential nature of video and audio inputs, most aforementioned work treated video as a sampled subset of individual images and audio as a fixed-length spectrogram.
:::info
Although video and audio inputs are inherently sequential temporal signals, most of this work still treats video as a sampled subset of individual images and audio as a fixed-length spectrogram, which essentially <span class='red'>ignores the temporal continuity of video and audio</span> and <span class='red'>loses part of the fine-grained information</span>.
:::
:::success
**Summary**
- These models tend to ignore information and causal relations when the input sequence length increases.
- Speech, as a crucial aspect of auditory input in videos that in particular relies on fine-grained information extraction, is considerably under-explored in multimodal LLM research.
:::
## 2. Methodology
In this section, we present the proposed **FAVOR learning framework**, which is designed to handle audio and visual input sequences synchronously at high temporal resolution for LLMs.

Key components that realise the fine-grained audio-visual representation learning are the **temporal synchronisation module** and the **causal Q-Former**.
:::info
This sentence means that the key components for achieving fine-grained audio-visual joint representation learning are the temporal synchronisation module and the causal Q-Former.
More specifically:
1. Fine-grained audio-visual joint representation learning means extracting and understanding audio and video at high temporal resolution.
2. Two key components make this possible:
(1) **temporal synchronisation module:** synchronises audio and video frames along the time axis, enabling fine-grained cross-modal interaction
(2) **causal Q-Former:** extracts audio-visual features while modelling the causal relations between audio-visual frames
3. By synchronising the audio and video data at every frame, the **temporal synchronisation module** allows fine-grained audio-visual interaction to be computed in the time domain.
4. The self-attention mechanism in the **causal Q-Former** captures the causal dependencies between audio and video frames within the same window.
5. Used together, they extract audio and video frame features at high temporal resolution and model the causal relations among them.
In short, these two components are the key to fine-grained joint representation learning: synchronisation and causal modelling let the representations combine high temporal resolution with cross-modal correlation.
:::
### 2.1 Model Architecture
1. Visual and audio inputs are encoded using the corresponding **pre-trained encoders**:
- The **visual encoder** in FAVOR converts the input image into a certain number of vectors via the image encoder in InstructBLIP.
- The **audio encoder** used is the Whisper ASR model encoder, which converts the input speech and audio events into a sequence of vectors at a 50 Hz frame rate.
*When video input is given, the visual encoder encodes each video frame separately as a sequence of images at a 2 Hz frame rate, and the output image features are concatenated along the temporal dimension to form a sequence of visual frames.*
::: info
This sentence means that, when a video is given as input, the visual encoder encodes each video frame separately, outputs the encoded image features, and then concatenates these features along the temporal dimension to form the visual feature sequence of the video.
More specifically:
1. Video frames are sampled at 2 Hz, i.e. two frames per second.
2. Each sampled frame is encoded individually as an image, producing its image feature representation.
3. The image features extracted at all time steps are then concatenated.
4. The concatenation is performed along the temporal dimension, so the per-frame image features form the final video feature sequence.
5. In this way, frame-by-frame encoding followed by temporal concatenation yields the visual feature sequence of the whole video.
In short, this allows a complete video to be taken as input and its visual representation to be output as a sequence, ready for subsequent audio-visual fusion.
:::
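A minimal sketch of the frame-wise encoding described above, assuming a stand-in `image_encoder` callable (not InstructBLIP's actual ViT/Q-Former) that maps one image to 32 feature vectors:

```python
import torch

def encode_video_frames(frames, image_encoder):
    """Illustrative frame-wise video encoding.

    frames: tensor of shape (num_frames, 3, H, W), sampled from the video at 2 Hz.
    image_encoder: stand-in callable mapping one image (3, H, W) to (32, dim).
    Returns the visual frame sequence of shape (num_frames * 32, dim).
    """
    per_frame = [image_encoder(f) for f in frames]   # encode each frame separately
    return torch.cat(per_frame, dim=0)               # concatenate along the temporal dimension
```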
2. When both audio and visual inputs are present, the two encoded feature sequences are sent to the temporal synchronisation module to obtain the time-synchronised feature sequences.

*Since video is sampled at a lower frame rate than audio, the audio and visual frames are synchronised at each video frame (i.e. every 0.5 seconds), with **zero padding** to make both sequences have equal lengths.*
::: info
This sentence means that, because video is sampled at a lower rate than audio, the audio and visual features are synchronised at each video frame (i.e. every 0.5 seconds), and zero padding is used to make the audio and visual sequences equal in length.
More specifically:
1. Video is sampled at a lower rate, so there are fewer video frames per second than audio frames.
2. For audio and video to be temporally aligned and to interact, they need to be synchronised.
3. Synchronisation is performed at the granularity of video frames, i.e. the current audio and visual features are aligned every 0.5 seconds.
4. Because the audio and visual feature sequences differ in length, zero padding is applied during synchronisation so that both sequences have the same length.
5. Audio and video are thus aligned in time at the same length, which enables the subsequent fine-grained cross-modal interaction.
In short, this procedure aligns audio and video at a fine temporal granularity, so they can be jointly understood and reasoned about at high resolution.
:::
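A hedged sketch of what this synchronisation could look like, assuming the audio and visual features have already been projected to a common dimension and that every 0.5 s of audio (25 vectors at 50 Hz) is paired with one visual frame (32 vectors), with zero padding on the shorter side; the function and argument names are illustrative, not the paper's code:

```python
import torch

def synchronise(visual, audio, feats_per_visual_frame=32, audio_per_half_sec=25):
    """Pair each 0.5 s visual frame with the corresponding 0.5 s of audio,
    zero-padding the audio chunk so both halves of each pair are equally long."""
    dim = visual.shape[-1]
    num_frames = visual.shape[0] // feats_per_visual_frame
    pairs = []
    for t in range(num_frames):
        v = visual[t * feats_per_visual_frame:(t + 1) * feats_per_visual_frame]
        a = audio[t * audio_per_half_sec:(t + 1) * audio_per_half_sec]
        pad = torch.zeros(feats_per_visual_frame - a.shape[0], dim)
        pairs.append(torch.cat([v, a, pad], dim=0))
    return torch.stack(pairs)  # (num_frames, 2 * feats_per_visual_frame, dim)
```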
**When only one input modality is present, the other modality is filled with a sequence of zero padding of the same sequence length.**
1. An image alone is treated as a single frame.
2. When paired audio input exists, such as images with spoken captions, each image is duplicated as if it were a video input with a length matched to the audio input.
::: info
1. When only an image is given, it is treated as a single frame.
2. When paired audio input exists, e.g. an image with a spoken caption, the image is duplicated multiple times, as if it were a video whose length matches the audio input.
In other words:
1. A lone image is one frame.
2. Image + spoken caption: the image is repeated to form a pseudo-video whose length matches the audio sequence.
The purpose of this is:
1. The existing video-processing pipeline can be reused directly, without adding a separate image branch.
2. It simplifies the subsequent synchronisation and audio-visual alignment operations.
In this way, a single image can be aligned with and interact with audio at the frame level, so the model can learn the relation between an image and its spoken description and perform tasks such as audio-visual matching and image spoken question answering.
:::
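For the image-plus-spoken-caption case, the duplication described above could be as simple as repeating the image features once per 0.5 s audio block; a tiny illustrative snippet (names are hypothetical):

```python
import torch

def image_as_pseudo_video(image_feats: torch.Tensor, num_half_sec_blocks: int) -> torch.Tensor:
    """Repeat a single image's features so it behaves like a video whose number
    of 0.5 s frames matches the paired audio (illustrative only)."""
    # image_feats: (feats_per_frame, dim) -> (num_half_sec_blocks * feats_per_frame, dim)
    return image_feats.repeat(num_half_sec_blocks, 1)
```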
3. In order to handle variable-length inputs, the combined feature sequences are first divided into fixed-length windows spanning, e.g., 5 or 10 seconds each. Then, a causal Q-Former based on the same N trainable input query tokens q1, ..., qN is applied to each sliding window to generate N output query vectors carrying the audio-visual information.
:::info
To handle variable-length audio and visual inputs, the combined audio-visual feature sequence is first divided into fixed-length windows, e.g. 5 or 10 seconds each.
Then a causal Q-Former based on the same N trainable input query tokens q1, ..., qN is applied to each sliding window, producing N output query vectors that carry the audio-visual information of that window.
Step by step:
1. The audio and video inputs have variable lengths.
2. The audio and visual feature sequences are combined into a joint sequence.
3. The joint sequence is divided into fixed-length windows (e.g. 5 seconds).
4. The causal Q-Former, which contains N trainable query vectors, is applied to each window.
5. Through self-attention and cross-attention, the causal Q-Former outputs N vectors per window.
6. These N vectors carry the audio and visual information of the current window.
In short, with sliding windows and the Q-Former, variable-length audio and video sequences are converted into a bounded sequence of vectors that the LLM can process, realising audio-visual representation learning.
:::
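A hedged sketch of the windowing step, where `causal_qformer` is a placeholder callable standing in for the actual causal Q-Former; it only illustrates how the synchronised sequence is chunked into windows of k frames and how each window is mapped to N output queries:

```python
import math
import torch

def windows_to_queries(sync_frames, causal_qformer, k=10):
    """Split the synchronised frame sequence into windows of k video frames and
    let a (placeholder) causal Q-Former map each window to N output queries."""
    # sync_frames: (num_frames, feats_per_frame, dim)
    num_windows = math.ceil(sync_frames.shape[0] / k)      # W = ceil(T / k)
    outputs = []
    for w in range(num_windows):
        window = sync_frames[w * k:(w + 1) * k]             # one window of <= k frames
        outputs.append(causal_qformer(window))              # (N, dim) output queries
    return torch.cat(outputs, dim=0)                        # (W * N, dim) for the LLM
```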
::: warning
About the **sliding window**:
The purpose of the sliding window is to convert variable-length audio and visual sequences into multiple fixed-length segments that are easier for the model to process. It works as follows:
1. Set the window length, e.g. 5 seconds.
2. Starting from the beginning of the input sequence, extract a 5-second segment as the first window.
3. Slide the window forward by a fixed stride (e.g. 1 second; windows may overlap) and extract the next window.
4. Keep sliding until the end of the sequence is reached.
5. This yields multiple fixed-length window segments that together cover the whole audio-visual sequence.
6. The model (causal Q-Former) processes each window separately, and the outputs of all windows are concatenated, so the whole variable-length sequence can be handled.
In short, the sliding window exploits the contextual information of the input sequence and keeps the processing flexible; the window length and stride also provide a knob for adjusting the computational cost.
:::
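The per-window formula that the next sentence refers to appears to have been omitted here (it was likely an image in the original note). A plausible reconstruction from the surrounding definitions, writing $\mathbf{h}_t$ for the synchronised audio-visual frame at step $t$ (the paper's exact notation may differ), is:

$$
\hat{\mathbf{H}}_w = \text{Q-Former}_{\text{causal}}\big(\mathbf{q}_1, \ldots, \mathbf{q}_N;\ \mathbf{h}_{(w-1)k+1}, \ldots, \mathbf{h}_{wk}\big)
$$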

where **w** is the window index, **k** is the number of video frames in that window, and Q-Former_causal(·) denotes the causal Q-Former computation.
*If the input sequence length of the causal Q-Former is T, the number of sliding windows W becomes ⌈T/k⌉, and the overall output sequence length from the causal Q-Former will be W × N.*
::: info
If the input sequence length of the causal Q-Former is T, the number of sliding windows is W = ⌈T/k⌉ (⌈ ⌉ denotes rounding up), where k is the number of frames in each window.
Since the causal Q-Former outputs N vectors per window, the total output length over all windows is the number of windows W multiplied by the number of output vectors per window N, i.e. W × N.
For example:
- input sequence length T = 25 seconds
- window length k = 5 seconds
- number of sliding windows W = ⌈25/5⌉ = 5
- if each window outputs 10 feature vectors (N = 10),
- the causal Q-Former outputs 5 × 10 = 50 feature vectors in total.
In this way, through sliding windows and per-window query outputs, variable-length audio and video sequences are converted into a fixed number of vectors per window, convenient for the subsequent language model.
:::
::: success
1. Through end-to-end training, the output audio-visual representations of **causal Q-Former are trained to align with the LLM input token space**.
2. Therefore, the use of sliding windows enables the LLM input token sequence length W × N to vary based on T, and achieves a good trade-off between the amount of information retained and the computation and storage costs.
:::
4. Finally, the instruction prompt, such as a question or task description, is appended to the concatenated output queries of all windows to form the input to the LLM.

### 2.2 Q-Former with Causal Self-Attention (can be viewed as a form of feature dimensionality reduction)

:::info
The causal attention module in the causal Q-Former with a block-wise triangular causal mask (grey cells are masked). The number of features per frame here is 2.
:::
- Feed-Forward Network: usually two fully-connected layers, used to **apply a non-linear mapping to the input and learn higher-level feature representations**.
- Cross-Attention: the input query vectors attend over a set of key-value vectors to obtain representations related to the queries; **this captures cross-modal contextual relations**.
- Self-Attention: the input attends over itself to obtain new representations that integrate the context; **this models contextual relations within a sequence**.
- <span class='red'>Causal Self-Attention</span>: adds a triangular causal mask to self-attention, so that **the model can only attend to the current and previous frames, realising causal modelling suited to temporal signals**.
:::info
The so-called block-wise triangular causal mask means:
the whole sequence is explicitly divided into blocks (one block per frame), and a triangular mask is applied over the blocks, so that causal self-attention can only use information from the current and previous frames. This records the dependencies within the sequence more fully and strengthens the model's causal modelling ability.
:::
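A minimal sketch of how such a block-wise triangular mask could be built, matching the figure's description (all features of frame i may attend to every feature of frames up to and including i); the helper name is hypothetical:

```python
import torch

def block_causal_mask(num_frames: int, feats_per_frame: int) -> torch.Tensor:
    """Block-wise triangular causal mask (True = attention allowed)."""
    # Frame-level lower-triangular mask: frame i may attend to frames 0..i.
    frame_mask = torch.tril(torch.ones(num_frames, num_frames, dtype=torch.bool))
    # Expand every frame-level entry into a feats_per_frame x feats_per_frame block.
    return frame_mask.repeat_interleave(feats_per_frame, dim=0) \
                     .repeat_interleave(feats_per_frame, dim=1)

# With 3 frames and 2 features per frame (as in the figure), the mask is a
# 6x6 lower block-triangular matrix.
print(block_causal_mask(3, 2).int())
```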
*With the causal attention module, the encoding of one specific frame also includes the information of all previous frames carried in an **auto-regressive** way. This is particularly beneficial for causal reasoning questions, such as the “what happens next” questions. Such questions are sometimes difficult to learn using only the positional embeddings.*
:::info
This sentence means that, with the causal attention module, the encoding of one specific frame also contains the information of all previous frames, passed along in an auto-regressive way. This is especially helpful for causal-reasoning questions such as “what happens next”, which models relying only on positional embeddings sometimes find hard to learn.
More specifically:
1. The causal attention module makes the encoding at each time step depend not only on that step's input but also on the encodings of previous time steps.
2. This realises an auto-regressive encoding process, where each time step's representation contains the context of all previous time steps.
3. This helps causal-reasoning questions that rely on temporal context, such as “what happens next”.
4. Models that rely only on positional embeddings sometimes struggle with such questions, because positional information alone can hardly express strong logical dependencies.
5. With auto-regressive encoding, the model can learn and exploit the causal relations in the temporal context.
In short, the causal attention mechanism propagates temporal context in an auto-regressive way, which allows the model to learn and reason about causal questions better.
:::
### 2.3 System Training and Diversity Loss
1. The training data of video tasks, such as video question-answering (QA), usually only requires one or two keyframes, and the output queries tend to repeatedly capture the same information.
2. A novel diversity loss is proposed to encourage the causal Q-Former to extract more diverse aspects of the input sequence.
:::info
This sentence means that the training data of video tasks (e.g. video QA) usually only requires one or two keyframes, so the output queries tend to repeatedly capture the same information. A new diversity loss is therefore proposed to encourage the causal Q-Former to extract more diverse information from the input sequence.
More specifically:
1. Video datasets usually only require 1-2 keyframes of each video to answer.
2. When extracting video features, the output queries easily fixate on those 1-2 frames, **lacking an understanding of the whole video sequence**.
3. To push the model to extract richer and more diverse information from the entire video, the diversity loss is proposed.
4. The diversity loss encourages the output queries of the causal Q-Former to be as dissimilar to each other as possible, i.e. to extract different aspects of the video.
5. This improves the model's understanding of the whole video rather than only the annotated keyframes.
In short, by encouraging different queries to extract different information, the diversity loss improves the model's understanding of video sequences.
:::
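The diversity-loss formula that the next block refers to appears to be missing here (it was likely an image in the original note). One plausible form, consistent with the description of averaged cosine similarities between the output queries of each window, writing $\mathbf{z}_{w,i}$ for the $i$-th output query of window $w$ (the paper's exact normalisation may differ), is:

$$
\mathcal{L}_{\text{div}} = \frac{1}{W N^2} \sum_{w=1}^{W} \sum_{i=1}^{N} \sum_{j=1}^{N} \operatorname{sim}\!\left(\mathbf{z}_{w,i}, \mathbf{z}_{w,j}\right)
$$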

:::info
- where W and N are the total number of windows and the number of output queries of each window respectively, and sim(·) is the cosine similarity between two vectors.
- Cosine similarity is adopted since it is widely used for semantic similarity measurements.
:::
*In FAVOR, **the output queries are aligned with a semantic space of the LLM input token representations**. This choice is also supported by the fact that the moduli of the output query tokens are very similar (<span class='red'>the vector norms are nearly identical across output query tokens</span>) due to the layer normalisation operation of the causal Q-Former.*
:::info
What layer normalisation does:
- It automatically rescales the outputs of a layer so that the modulus (vector norm) of the output vectors stays stable, close to 1.
- Cosine similarity measures the similarity of two vectors' directions rather than their lengths.
- Precisely because layer normalisation makes the output vector norms roughly equal,
- cosine similarity can well reflect the angular, i.e. semantic, similarity between vectors,
- without being affected by vector length. This is the rationale for choosing cosine similarity:
it allows the diversity constraint and loss to be computed properly in the semantic space.
:::
*By encouraging the audio-visual frames to be orthogonal to each other, the diversity loss forces the output query representations to be more spread in the text representation space.*
:::info
This sentence means that, by encouraging the audio-visual frames to be <span class='red'>as orthogonal to each other as possible, the diversity loss forces the output query representations to be more spread out in the text representation space</span>.
More specifically:
1. Orthogonality between audio-visual frame representations means their features are as dissimilar as possible and capture different information.
2. The diversity loss encourages this orthogonality by penalising the similarity between the frame representations.
3. Since the output queries are aligned with the text representation space, this orthogonality carries over to the output query space.
4. Orthogonal output queries are more spread out when mapped into the text representation space and therefore cover richer semantics.
5. This lets more diverse audio and visual information be extracted and fed into the subsequent language model.
In short, the diversity loss encourages the output queries to be orthogonal and hence more spread out in the text space, so the audio-visual information is represented more richly.
:::
Overall, the system is trained in an end-to-end fashion using the cross-entropy (CE) loss and the diversity loss, as shown below:
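The combined objective itself appears to have been an image in the original note; a plausible reconstruction consistent with the surrounding text is:

$$
\mathcal{L} = \mathcal{L}_{\text{CE}} + \lambda\, \mathcal{L}_{\text{div}}
$$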

where λ is the factor controlling the importance of the diversity loss, and the CE loss is calculated using the reference answer as the target.
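As a rough, non-authoritative sketch of how the two terms could be combined in code (the tensor shapes, function names and the value of λ below are illustrative assumptions, not the paper's implementation):

```python
import torch
import torch.nn.functional as F

def diversity_loss(queries: torch.Tensor) -> torch.Tensor:
    """queries: (W, N, dim) output queries per window.
    Average pairwise cosine similarity within each window."""
    q = F.normalize(queries, dim=-1)
    sims = torch.matmul(q, q.transpose(1, 2))   # (W, N, N) cosine similarities
    return sims.mean()

def total_loss(llm_logits, target_ids, queries, lam=0.1):
    """Cross-entropy against the reference answer plus the weighted diversity term."""
    ce = F.cross_entropy(llm_logits.view(-1, llm_logits.size(-1)), target_ids.view(-1))
    return ce + lam * diversity_loss(queries)
```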
## 3. Experimental Setup
### 3.1 Audio-visual Evaluation Benchmark (AVEB)
1. In this paper, we propose the AVEB benchmark for audio-visual LLM evaluation, which evaluates single-modal perception ability via selected representative tasks while particularly focusing on multi-modal inference.
2. AVEB contains 6 single-modal tasks, together with 5 audio-visual tasks.

:::info
**Audio tasks:**
- ASR: Automatic Speech Recognition
- AC: Audio Captioning
**Visual tasks:**
- IC: Image Captioning
- OCR: Optical Character Recognition
- VQA: Visual Question Answering
**Video task:**
- Video QA: Video Question Answering
**Audio-visual tasks:**
- AVSR: Audio-Visual Speech Recognition
- AVSD: Audio-Visual Scene-aware Dialogue
- ISQA: Image Spoken Question Answering
- AVM: Audio-Visual Matching
- AVSSD: Audio-Visual Sound Source Detection
:::
3. This paper particularly proposes **ISQA and AVM tasks where audio-visual interaction is necessary**.
- **ISQA** is the task where the <span class='red'>question is in the audio and the answer can be found in the image</span>.
- **AVM** is the task of <span class='red'>determining whether the given spoken description</span> in the SpokenCOCO dataset <span class='red'>matches the image</span>, or <span class='red'>whether the given audio clip is compatible with the given video</span> chosen from the VGGSS dataset.
### 3.2 Model Configurations
1.
- **LLM**: Vicuna model, and adapted using the low-rank adaptation (LoRA) method with a rank of 32
- **Audio encoder**: Whisper large-v2 encoder
- **Visual encoder**: InstructBLIP vision Transformer (ViT) plus Q-Former
2.
- The **visual encoder** outputs 32 feature vectors for each video frame (every 0.5 seconds).
- The **audio encoder** outputs 50 feature vectors per second.
3.
- The **causal Q-Former** has two Transformer blocks with 768-dim hidden states. The output query representations are projected to 5120-dim before being sent to the LLM.
- Only the parameters of the attention query, key and value projections and feed-forward network weights are updated, which comprised 0.4% of the total number of LLM parameters.
::: info
This sentence means that, during training, only the query, key and value projections of the attention mechanism and the feed-forward network weights are updated, and these parameters account for only 0.4% of all the parameters of the language model.
More specifically:
1. The attention mechanism has three projections: query, key and value.
2. The feed-forward network is the forward-propagation sub-network of each block.
3. When fine-tuning a large language model, the vast majority of its parameters are usually frozen, and only a small fraction is updated for adaptation.
4. Here, only the parameters associated with self-attention and the feed-forward network are updated.
5. These account for only 0.4% of all LLM parameters, i.e. 99.6% of the parameters are not updated.
In short, only a tiny fraction of the parameters is updated for adaptation, which preserves the stability and reasoning ability of the language model itself.
:::
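Section 3.2 says the LLM is adapted with LoRA (rank 32) and that only the attention query/key/value projections and feed-forward weights carry trainable parameters. The paper does not state which library was used; the following is only a hedged sketch of how such a setup could look with Hugging Face `peft`, where the checkpoint path and the `target_modules` names (assuming a LLaMA/Vicuna-style architecture) are assumptions:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Hypothetical setup: LoRA (rank 32) on the attention q/k/v projections and the
# feed-forward layers of a Vicuna-style LLM; all other LLM weights stay frozen.
base_llm = AutoModelForCausalLM.from_pretrained("path/to/vicuna-13b")  # placeholder path
lora_cfg = LoraConfig(
    r=32,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj",          # attention projections
                    "gate_proj", "up_proj", "down_proj"],  # feed-forward weights
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_llm, lora_cfg)
model.print_trainable_parameters()  # only a small fraction of parameters is trainable
```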
*For each video clip, five equally-spaced frames were used, resulting in 160 output queries. This is the same as the number of output queries used for 25-second videos in FAVOR. Video-LLaMA was used as the multimodal baseline, where only the Vicuna-7B checkpoint was released for audio-visual input.*
:::info
For each video clip, five equally-spaced frames are used, resulting in 160 output query vectors. This is the same number of output queries as FAVOR uses for a 25-second video.
Video-LLaMA is used as the multimodal baseline. Because Video-LLaMA only released the Vicuna-7B checkpoint for audio-visual input, that 7B model is used for comparison.
In more detail:
1. The Video-LLaMA paper uses 5 uniformly sampled video frames.
2. Each frame yields 32 vectors, so 5 frames give 160 vectors in total.
3. In FAVOR, a 25-second video is also represented with 160 vectors.
4. The comparison is therefore fair in terms of representation size.
5. Video-LLaMA is only available at the 7B scale for audio-visual input, so the 7B model is used for evaluation.
This setup keeps the video representation size and the model scale comparable, so the differences between the two systems are reflected more accurately.
:::
### 3.3 Train Data and Specifications
FAVOR directly **uses multi-task instruction fine-tuning to train the model parameters of the causal Q-Former and LoRA**. The training data contains both single-modal and audio-visual paired data.
:::info
This sentence means that FAVOR directly uses multi-task instruction fine-tuning to train the parameters of the causal Q-Former and LoRA. The training data contains both single-modal and audio-visual paired data.
More specifically:
1. FAVOR is trained with a multi-task instruction fine-tuning strategy.
2. The trained parameters are those of the causal Q-Former and LoRA.
3. The training data includes audio-only or visual-only data as well as paired audio-visual data.
4. Single-modal data trains single-modal understanding; paired data trains audio-visual multimodal understanding.
5. Multi-task fine-tuning trains both abilities at the same time.
In short, through multi-task instruction fine-tuning on a mixture of single-modal and audio-visual data, FAVOR learns joint audio-visual understanding and reasoning.
:::
1. *In addition to all the training datasets mentioned above, in order to explicitly encourage the model to generically combine both modalities, a **storytelling fine-tuning set** is designed. The dataset is gathered by prompting GPT-3.5 with reference audio caption or transcription, together with video caption, and asking GPT-3.5 to generate a coherent story combining both information. The model is fine-tuned on this data for only 100 steps with a very small learning rate without causing any loss in the benchmark performance.*
:::info
This passage describes the storytelling fine-tuning set designed to better encourage the model to understand video and audio jointly. Specifically:
1. The authors prompt GPT-3.5 with the reference audio caption or transcription together with the video caption, and ask GPT-3.5 to generate a coherent story that combines both pieces of information.
2. The stories generated by GPT-3.5 form the fine-tuning set. The model is fine-tuned on it for only 100 steps with a very small learning rate, which clearly encourages the model to combine audio and visual information without causing any loss in benchmark performance.
3. The storytelling fine-tuning set teaches the model to actively combine audio and visual information when understanding and generating descriptions, further strengthening its cross-modal understanding and reasoning ability.
In short, by letting the model learn to tell a story that combines audio and visual information, the storytelling fine-tuning set improves the model's joint understanding of audio and video.
:::
2.
- Flickr30k for IC, TextVQA for OCR and GQA for VQA in the benchmark are not included in the training, and hence the model performed zero-shot learning on them.
- Since ISQA uses synthesised speech, it is also not a trained task, and the model performed zero-shot learning on it.
::: info
The ISQA task uses synthesised speech, i.e. computer-generated rather than real human speech, and ISQA is not included in the model's training tasks. The model's performance on it is therefore zero-shot: it was not trained on a dataset for this specific task.
More specifically:
1. ISQA requires joint reasoning over image and speech: a spoken question is asked about the content of an image, and the model must answer it based on the image.
2. The questions are synthesised speech rather than real recordings, and the model was never specifically trained on such data.
3. Because the task is untrained, the model's performance on ISQA is zero-shot: it has to rely on its general abilities to understand the question and answer it.
4. This demonstrates strong transfer and generalisation: the model can complete joint image-and-synthesised-speech reasoning zero-shot, without task-specific training.
:::
## 4. Experimental Result
### 4.1 Main Result
1. **AVEB single-modal task results**

2. **AVEB audio-visual task results**

:::success
- FAVOR effectively achieves **audio-visual co-reasoning** (joint reasoning over what is heard and what is seen), which is reflected by the performance on the ISQA, AVSSD and AVM tasks.
- FAVOR obtains a similar WER compared to Whisper large-v2 and mixed results compared to the audio-only FAVOR.
- Further, with the aid of visual information, FAVOR achieves a lower WER on AVSR than both models.
- On visual tasks, FAVOR demonstrates the best results on IC, OCR and Video QA, and on-par results on VQA with InstructBLIP fine-tuned on the same training set.
- In particular, the fine-grained causal modelling of video in FAVOR yields over 20% improvements compared to InstructBLIP even though the latter is finetuned on the same set of video data.
- <span class='red'>FAVOR demonstrated a strong audio-visual co-reasoning ability based on the audio-visual matching (AVM) dataset results.</span>
- <span class='red'>and is the only system to our knowledge that can perform speech-image co-reasoning based on image-spoken QA (ISQA).</span>
:::
:::warning
**Audio-visual co-reasoning** (including speech-image co-reasoning) is an important yet challenging ability which <span class='red'>requires the model to comprehend the visual content as well as both speech and non-speech sounds in the audio, and to capture the correlation between what it “hears” and “sees”</span>. Such tasks were almost infeasible for any other audio-visual models so far.
:::
### 4.2 Ablation Studies

:::success
**Summary**
1. The **causal attention module** both boosts the temporal causality modelling and provides a better audio-visual fusion before the cross-attention in the Q-Former is applied.
2.
- **Without the sliding window**, a fixed number of output queries are used no matter how long the audio is, which results in more deletion errors.
- Besides, **using sliding windows** also benefits the video QA task, as it encourages localised causal relationships to be captured.
3.
- **Without synchronisation**, modality alignment is done rather independently and the correlation between audio and video is only modelled among high-level features that are aligned in the text space.
- **The use of synchronisation** is crucial for audio-visual co-reasoning to work as supported in particular by the ISQA and AVM results.
:::
### 4.3 Analysis on The Sliding Window Size

- Although using shorter windows benefits ASR, since the output tokens of each window only need to encapsulate the information within a shorter span, performance on video QA is degraded.
:::info
The smaller the sliding window, the less information each window contains. For ASR, a small amount of information per window means all the speech in the window can be encapsulated by relatively few output tokens, which improves ASR performance. For video QA, however, windows that are too short cannot keep enough context and detail, which degrades performance. The window length therefore has to trade off ASR against video QA to keep the best overall balance.
:::
:::success
1. Larger windows heavily reduce the ASR performance as the monotonic alignment in ASR is especially difficult to learn with 10% of the training data.
2. While low accuracy is observed when the frame rate is low, increasing the FPS beyond 1.0 only yields marginal improvements at the cost of sending many more output queries to the LLM. A frame rate of 2.0 FPS is chosen as it makes the audio and visual sequences have the most similar lengths, and hence is easier for synchronisation.
:::
### 4.4 Analysis of The Diversity Loss

:::success
- For **ASR**, the model is trained to include all the speech information in the audio sequence and the cosine similarity varies according to the length of the speech.
- For **videos**, the cosine similarities are close to each other and do not vary much across different video lengths, and hence the diversity loss effectively acts as a way to encourage more diversified information to be captured.
- However, when a high λ is employed, diverse information causes confusion in the model and results in a more severe hallucination problem (e.g. high insertion rate in WER) with heavily degraded model performance.
:::
### 4.5 Discussion on Incorporating Speech and Speech-video Interactions
- Speech is an important source of information for video that should always be considered for audiovisual LLM to perform a comprehensive understanding.
- Unlike audio events, the speech content can hardly be inferred from the visual modality, making it particularly indispensable to comprehend any videos involving people talking.
- Moreover, the co-occurrence of speech and video events, which is modelled by the fine-grained temporal synchronisation in FAVOR, is required to understand audio-visual temporal relations.
## 5. Conclusion
- This paper proposes the **FAVOR learning framework** for multimodal LLMs. To the best of our knowledge, FAVOR is the first approach that is capable of performing cross-modal cognitive tasks involving audio, speech, image and video inputs with high temporal resolution.
- This paper proposes the **causal Q-Former structure** which comprises a causal encoder module. Further with a novel **diversity training loss**, causal Q-Former is capable of handling audio-visual sequence input efficiently with a small number of training examples.
- This paper introduces the **AVEB benchmark** comprising single-modal and cross-modal tasks <span class='red'>to quantitatively evaluate</span> the performance of audio-visual LLMs.