# ON DECODER-ONLY ARCHITECTURE FOR SPEECH-TO-TEXT AND LARGE LANGUAGE MODEL INTEGRATION
[paper](https://arxiv.org/pdf/2307.03917.pdf)
## Introduction
### Issue of previous method
- **Cascaded Approach**
- Sequential processing through **automatic speech recognition** and **LLMs** introduces **latency** and **potential errors** in recognizing **spoken words**.
- **Alignment Challenges**
- Aligning **speech** and **text** modalities is challenging due to the **different sequence lengths** of speech signals and text.
- **Integration Cost**
- Training LLMs is **costly**, and **integrating them with speech models** may further increase the computational expense.
- **Tokenization Challenges**
- Converting speech into **discrete tokens** may not capture the **continuous nature** of speech representation accurately.
### Improvements with Speech-LLaMA
- **End-to-End Integration**
- Speech-LLaMA proposes an efficient **end-to-end integration**, eliminating the need for separate **ASR** and **LLM** processing.
- This reduces **latency** and **potential errors**.
- **Alignment Enhancement**
- The **acoustic compressor** in Speech-LLaMA helps align speech and text modalities by reducing the **sequence length** of the speech signal, facilitating compatibility with text sequences.
- **Cost Reduction**
- By incorporating a pre-existing LLM and introducing a minimal number of **free parameters**, Speech-LLaMA aims to **minimize the overall integration cost** while maintaining **exceptional performance**.
- **Continuous Representation**
- Speech-LLaMA directly maps **continuous speech** representation into the **semantic space** of the LLM, avoiding the need for **discretized tokens** and capturing the **nuanced nature** of speech.
- **Decoder-Only Architecture**
- Speech-LLaMA demonstrates the potential of a **decoder-only** architecture, showcasing competitive performance with **encoder-decoder** models and achieving **better parameter efficiency**.
Overall, Speech-LLaMA aims to address the shortcomings of past methods by providing an **end-to-end integration** approach that **enhances alignment**, **reduces cost**, and leverages the advantages of **continuous speech representation**.
## Related Work
### Large language models
- **General Features of LLMs**
- LLMs are typically **pre-trained** on extensive textual data covering **various domains** and **languages**.
- They usually consist of **a stack of Transformer layers**, employing an **auto-regressive decoder-only** architecture.
- In this architecture, each **output token** serves as the **input** to predict the next token in the sequence (see the factorization sketched just after this list).
- **Selected Background Language Model**
- The study opts for **LLaMA-7B** as the foundational language model.
- LLaMA-7B comprises **32** Transformer decoder layers, each with **32** attention heads and an attention dimension of **4096**.
- The LLaMA tokenizer has a vocabulary size of **32,000**, encompassing **multiple languages**.
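To make the auto-regressive factorization described above concrete, a decoder-only LM models a text sequence $y$ of length $Y$ as follows (this restatement mirrors the decoder-only equation given later in these notes and is not copied from the paper):
\begin{equation} p(y; \Theta_{LLM}) = \prod_{n=0}^{Y-1} p(y_n \mid y_{<n}; \Theta_{LLM}) \end{equation}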
### CTC compressor
- **The use of CTC compressor**
- The CTC compressor is a method aimed at **shortening sequences** by **removing redundant information** in features.
- **Application in Speech Translation Task**
- The method was applied in a speech translation task by adding a **linear CTC branch** in the **middle layer of the encoder**.
- This branch is **jointly optimized** with the primary cross-entropy criteria.
- The **hidden representations of the CTC branch** are compressed based on the distributions of CTC posteriors and **passed to subsequent layers**.
- **Variations in Sequence Length Compression**
- The authors explored variations within this **sequence length compression** method.
- It was found that **averaging consecutive hidden representations** (corresponding to consecutive CTC predictions belonging to the same class) yielded the **best performance**.
### LoRA
- **Purpose of LoRA**
- LoRA is employed for **adapting large models** to new datasets or tasks.
- **Introduction of Additional Parameters**
- LoRA introduces a **small number of free parameters** to each Transformer layer of the original large model.
- **Parameter Freezing**
- All the parameters of the original model are **frozen** during the adaptation process.
- **Low-Rank Approximation**
- For each weight matrix $W$ in a Transformer layer ($W \in \mathbb{R}^{d \times k}$), two new matrices $W_a$ $(W_a \in \mathbb{R}^{d \times r})$ and $W_b$ $(W_b \in \mathbb{R}^{r \times k})$ are introduced, where $r \ll \min(d, k)$.
- During training, each matrix multiplication involves the input $x$ being multiplied with **both** the **original weight $W$** and its **introduced low-rank approximation $W_a$, $W_b$**.
- The outputs from these multiplications are then **summed** to form the **final output** for subsequent computations.
- **Fine-Tuning and Memory Reduction**
- Only the introduced low-rank matrices **$W_a$** and **$W_b$** are **updated** during fine-tuning, while the original weight **$W$** remains **frozen**.
- This selective updating significantly **reduces** the **memory footprint** during the training process.
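A minimal PyTorch sketch of this low-rank update, wrapping a frozen `nn.Linear` (the class name, rank, and initialization below are illustrative assumptions, not the paper's code):

```python
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Frozen weight W plus a trainable low-rank update W_a @ W_b (illustrative)."""

    def __init__(self, base: nn.Linear, rank: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                    # freeze the original weight W
        d, k = base.in_features, base.out_features
        self.W_a = nn.Parameter(torch.randn(d, rank) * 0.01)  # d x r
        self.W_b = nn.Parameter(torch.zeros(rank, k))          # r x k, zero-initialized

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Sum of the frozen path and the low-rank path forms the final output.
        return self.base(x) + (x @ self.W_a) @ self.W_b


# Usage: wrap one attention projection of a Transformer layer.
proj = nn.Linear(4096, 4096)
lora_proj = LoRALinear(proj, rank=8)
y = lora_proj(torch.randn(2, 10, 4096))
```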
## Method
### Overview
- **Pre-trained Text Neural LLM**
- The model incorporates a **pre-trained text neural LLM**, specifically **LLaMA-7B**.
- It's emphasized that this method can be extended to LLMs of **varying scales**, indicating flexibility in the choice of language model.
- **CTC Compressor**
- A CTC compressor is utilized to **reduce the sequence length** of the input speech filter-bank, aligning it with the length of the text.
- The compressed speech signal generated by the CTC compressor is then further processed by the audio encoder and integrated into the semantic space of the LLM.
- **Audio Encoder**
- The architecture includes an audio encoder responsible for transforming the **compressed speech signal** into **continuous vectors**.
- These continuous vectors exist within the **semantic space of the pre-trained text neural LLM**.
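A high-level sketch of how these three components could fit together, assuming a HuggingFace-style causal-LM interface (`get_input_embeddings`, `inputs_embeds`) and hypothetical `ctc_compressor` / `audio_encoder` modules (the composition is illustrative, not the paper's exact code):

```python
import torch

def speech_llama_forward(fbank, prompt_ids, text_ids,
                         ctc_compressor, audio_encoder, llama):
    """Filter-bank -> CTC compressor -> audio encoder -> LLaMA semantic space."""
    compressed = ctc_compressor(fbank)              # (B, T', D): shortened acoustic states
    audio_emb = audio_encoder(compressed)           # (B, T', 4096): mapped to LLM space
    embed = llama.get_input_embeddings()            # LLM token-embedding lookup
    prompt_emb = embed(prompt_ids)                  # (B, P, 4096): text prompt
    text_emb = embed(text_ids)                      # (B, N, 4096): target-side tokens
    inputs = torch.cat([prompt_emb, audio_emb, text_emb], dim=1)
    return llama(inputs_embeds=inputs).logits       # next-token logits over the vocabulary
```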
### CTC compressor
- **Pre-training Objective**
- The pre-training objective of the CTC compressor is to align the durations of **audio** and **text** to the **same scale**.
- This is achieved by selecting **representative frames** from the audio signal.
- **Sequence Length Reduction Techniques**
- Two techniques are explored to reduce the sequence length of the acoustic features within the CTC compressor: **blank-removal** and **frame-averaging**.
- **Blank-Removal Technique**
- In "blank-removal," frames that predict the blank **symbol**, based on the distribution of the **CTC posteriors**, are simply **discarded**.
- This method involves removing frames associated with blank predictions to achieve **sequence length reduction**.
- **Frame-Averaging Technique**
- In "frame-averaging," the **hidden states** of consecutive frames are **averaged without removing frames** associated with blank predictions.
- This technique involves averaging the hidden states of frames belonging to the **same class** according to **CTC predictions**.
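A rough PyTorch sketch of the two reduction strategies for a single utterance, assuming `hidden` holds the encoder states, `log_probs` the CTC posteriors, and blank index 0 (shapes and the blank index are assumptions for illustration):

```python
import torch

def blank_removal(hidden, log_probs, blank_id=0):
    """Drop frames whose most likely CTC label is the blank symbol."""
    # hidden: (T, D) encoder states; log_probs: (T, V) CTC posteriors
    preds = log_probs.argmax(dim=-1)             # (T,) frame-wise CTC predictions
    return hidden[preds != blank_id]              # keep only non-blank frames

def frame_averaging(hidden, log_probs):
    """Average consecutive frames that share the same CTC prediction (blanks kept)."""
    preds = log_probs.argmax(dim=-1)
    # Mark positions where the predicted class changes to form segments.
    change = torch.ones_like(preds, dtype=torch.bool)
    change[1:] = preds[1:] != preds[:-1]
    segment_id = change.cumsum(dim=0) - 1         # (T,) segment index per frame
    num_seg = int(segment_id.max()) + 1
    out = torch.zeros(num_seg, hidden.size(-1), dtype=hidden.dtype)
    counts = torch.zeros(num_seg, 1, dtype=hidden.dtype)
    out.index_add_(0, segment_id, hidden)
    counts.index_add_(0, segment_id, torch.ones(hidden.size(0), 1, dtype=hidden.dtype))
    return out / counts                            # (num_segments, D)
```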
### Audio encoder
- **Role of the Audio Encoder**
- The audio encoder serves as a **bridge**, transforming representations generated from the **CTC compressor** into **text embeddings** for the text-LLM.
- **Design Features**
- This module is designed to be relatively **small** in size and is **initialized with random weights**.
- During the **fine-tuning** process, the optimization goal for the audio encoder is to effectively integrate **audio information** into the **text-LLM**, thereby enhancing the **overall system performance**.
- **Distinguishing Characteristics from Other Methods**
- In contrast to other approaches, where the audio encoder is trained to initially map speech signals into discrete tokens consumed by the LLM, the proposed audio encoder takes a different approach.
- Instead, it is directly optimized to map the **compressed acoustic signal** to the **continuous semantic space of the LLM**, allowing for a **deep integration** between the audio encoder and the language model.
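A hedged sketch of such an audio encoder, assumed here to be a few randomly initialized Transformer layers followed by a linear projection into the LLM's 4096-dimensional embedding space (layer counts and dimensions are illustrative):

```python
import torch
import torch.nn as nn

class AudioEncoder(nn.Module):
    """Maps compressed acoustic representations into the LLM embedding space."""

    def __init__(self, in_dim=512, llm_dim=4096, num_layers=4, num_heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=in_dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.proj = nn.Linear(in_dim, llm_dim)      # project into the LLM semantic space

    def forward(self, compressed: torch.Tensor) -> torch.Tensor:
        # compressed: (B, T', in_dim) output of the CTC compressor
        return self.proj(self.encoder(compressed))  # (B, T', llm_dim)
```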
### Instruct learning
- **Training Phase**
- For each training sample, a **text prompt** is prepended to briefly describe the **task**, for example, "audio ⇒ English" or "transcribe the audio into English."
- **Selection of Text Prompts**
- The text prompts are sampled from a **pre-defined** list, where some prompts contain the **source language ID**, following the format “translate [source] audio into English”.
- **Evaluation Phase**
- During evaluation, the **text prompt is fixed** as “**translate the audio into English**” for all testing samples.
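As a small illustration (the sampling logic and list structure below are assumptions; the prompt strings come from the description above), prompt selection might look like:

```python
import random

PROMPTS = [
    "audio => English",
    "transcribe the audio into English",
    "translate {src} audio into English",   # prompts carrying the source language ID
]

def training_prompt(src_lang: str) -> str:
    """Sample a text prompt from the pre-defined list for one training sample."""
    prompt = random.choice(PROMPTS)
    return prompt.format(src=src_lang) if "{src}" in prompt else prompt

EVAL_PROMPT = "translate the audio into English"   # fixed for all test samples

print(training_prompt("German"))
```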
### LoRA fine-tuning
- **LoRA Application**
- On top of the proposed model, LoRA is applied to the four **attention matrices** in **each layer of the LLaMA Transformer**.
- **Two-Stage Training Scheme**
- To **stabilize** the training process, a **two-stage training scheme** is adopted.
- In the first stage, the **audio encoder is trained** with the **CTC compressor and the LLaMA model frozen**.
- In the second stage, **LoRA** is introduced to the well-trained model, and further optimization is performed.
- **Loss Calculation**
- The entire system is still trained with **cross-entropy loss**.
- The loss is computed between the **LLM output** and the **reference transcription sequence** on the same training data.
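A hedged sketch of the two-stage scheme, assuming the model exposes `audio_encoder` and names its LoRA matrices `W_a` / `W_b` (all module and parameter names are illustrative):

```python
import torch

def set_trainable(model, stage: int):
    """Stage 1: train only the audio encoder. Stage 2: also train the LoRA adapters."""
    for p in model.parameters():
        p.requires_grad = False                       # freeze everything first
    for p in model.audio_encoder.parameters():
        p.requires_grad = True                        # audio encoder is always trained
    if stage == 2:
        for name, p in model.named_parameters():
            if "W_a" in name or "W_b" in name:        # LoRA adapters in the LLaMA layers
                p.requires_grad = True

def loss_fn(logits, labels, ignore_index=-100):
    """Cross-entropy between LLM outputs and the reference transcription.
    Prompt and audio positions are expected to carry ignore_index in `labels`."""
    return torch.nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)), labels.reshape(-1),
        ignore_index=ignore_index)
```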
### From-scratch training
- **Training Objective**
- Explore the potential of **decoder-only** architecture as a foundational architecture for **speech modeling**.
- **Training Approach**
- Replace the **text prompt**, **audio encoder**, and **CTC compressor** with a **randomly initialized convolutional 2D encoder**.
- Replace the **pretrained LLaMA network** with a **much smaller randomly initialized autoregressive network**.
- An **⟨SOS⟩ token** is added at the **end of the acoustic sequence** to indicate the **start of generation**.
- **Generation of the text sequence**
\begin{equation} p(y|x; \Theta_{DEC}) = \prod_{n=0}^{Y-1} p(y_n | y_{<n}, x; \Theta_{DEC}) \end{equation}
- In this case, the generation of the **text sequence $y$** with a **decoder-only** model is conditioned purely on **audio signal $x$** and **previously generated text sequence $y_{<n}$**.
- $\Theta_{DEC}$ refers to the parameters of the decoder model.
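A minimal sketch of this from-scratch decoder-only setup, using a 2D convolutional front-end and a small causal Transformer over the concatenated audio-then-text sequence (layer sizes, kernel settings, and the vocabulary size are assumptions):

```python
import torch
import torch.nn as nn

class DecoderOnlySpeechModel(nn.Module):
    def __init__(self, vocab=32000, d_model=512, num_layers=12, num_heads=8, fbank_dim=80):
        super().__init__()
        # Conv2D front-end subsamples the filter-bank by 4x in time and frequency.
        self.subsample = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, stride=2, padding=1), nn.ReLU())
        self.audio_proj = nn.Linear(32 * ((fbank_dim + 3) // 4), d_model)
        self.embed = nn.Embedding(vocab, d_model)     # text embeddings (<sos> shares this table)
        layer = nn.TransformerEncoderLayer(d_model, num_heads, batch_first=True)
        self.decoder = nn.TransformerEncoder(layer, num_layers)  # used with a causal mask
        self.lm_head = nn.Linear(d_model, vocab)

    def forward(self, fbank, text_ids):
        # fbank: (B, T, F); text_ids: (B, N) beginning with the <sos> token id,
        # so that <sos> sits directly after the acoustic frames.
        x = self.subsample(fbank.unsqueeze(1))                     # (B, 32, T/4, F/4)
        B, C, T, F = x.shape
        audio = self.audio_proj(x.permute(0, 2, 1, 3).reshape(B, T, C * F))
        seq = torch.cat([audio, self.embed(text_ids)], dim=1)      # audio then text
        L = seq.size(1)
        mask = torch.triu(torch.full((L, L), float("-inf")), diagonal=1)
        h = self.decoder(seq, mask=mask)
        # Logits at each text position predict the next text token.
        return self.lm_head(h[:, audio.size(1):])
```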
## Experiment
The **speech translation task** has been chosen as the primary evaluation benchmark for assessing the proposed methods. In this task, the goal is to develop a system that can accurately translate **spoken language from 13 source languages to English**.
### Baselines
- **Baseline Systems**
- **B1** : Seq2seq model
- **B2** : B1 with **LLaMA** n-best rescoring
- **Performance Comparison**
- Results indicate that the B2 system performs **better** than B1.
- This suggests that **shallow integration with LLM** can still bring **benefits** to the **speech models**.
### Deeper integration with LLaMA
- **Speech-LLaMA Configurations**
- Systems **E1 ∼ E6** describe Speech-LLaMA models in various configurations.
- **Performance Improvement**
- All Speech-LLaMA configurations significantly **outperform** the baselines despite having a **limited number of learnable parameters**.
- **System Efficacy and Integration Depth**
- These results show the **efficacy** of the proposed system and also suggest the need for **deeper integration between speech models and text-LLMs**.
### CTC compressor
- **Comparison of CTC Compressor Performance**
- **E1** consistently outperforms **E0**, indicating the effectiveness of the **CTC compressor** over the **convolution-based compressor**.
- **Comparison of Strategies within CTC Compressor**
- Within the CTC compressor, comparing **E3** to **E1**, the "**frame-averaging**" strategy yields a 1.5-point **higher** average BLEU score than the "**blank-removal**" strategy.
- The authors attribute this difference to the fact that the CTC compressor may not reliably distill **all relevant information** into non-blank representations; the averaging strategy is more **robust** to this compression error.
### Effect of non-causal attention mask
- **Expected Effect of Full Attention Mask**
- It is anticipated that applying a **full attention mask** over both the **text prompts** and the **acoustic representations** would generally lead to **improved** speech representations and, consequently, better overall results.
- **Effect of Non-Causal Attention Mask**
- For each type of CTC compression strategy, experiments demonstrate that using a **non-causal attention mask** compared to a causal mask results in gains.
- Comparing system **E2** to **E1**, switching to a non-causal mask brings an additional gain of 1.5 average BLEU score when using the "**blank-removal**" strategy within the CTC compressor.
- Similarly, comparing systems **E5** to **E3**, a gain of 0.7 average BLEU score is observed when using the "**frame-averaging**" strategy within the CTC compressor.
- Even in LoRA fine-tuning systems (e.g., comparing **E6** and **E4**), a gain of 0.8 average BLEU score is observed with a non-causal mask applied.
- **Explanation for More Pronounced Effect of Non-Causal Mask**
- In the "**blank-removal**" strategy, the gain with a non-causal mask is understandably larger, as **future acoustic information** can compensate for **potential information loss** caused by removing frames corresponding to the blank symbol in the CTC loss.
### LoRA fine-tuning
- **Comparison of Systems E4 and E3**
- E4 represents a system with **LoRA fine-tuning** using a **causal attention mask**.
- Comparing E4 over E3 demonstrates the gains achieved through **LoRA fine-tuning** with a causal attention mask.
- There is an additional **increase** of 1.5 average BLEU score.
- **Comparison of Systems E6 and E5**
- E6 represents a system with **LoRA fine-tuning** using a **non-causal attention mask**.
- Comparing E6 over E5 shows the corresponding gains achieved through **LoRA fine-tuning** with a non-causal attention mask.
- There is an additional **increase** of 1.6 average BLEU score.
- **Parameter Addition**
- Despite the notable performance gains, only **2.1 million** additional parameters are added as adaptors during LoRA fine-tuning.
### Decoder-only vs Encoder-Decoder
- **Decoder-Only Model Performance**
- The decoder-only model achieves **slightly lower performance** compared to the seq2seq baseline.
- **Parameter Efficiency Comparison**
- Despite the slight performance decrease, the **total parameters** for the **decoder-only model** are significantly **lower** than the seq2seq baseline.
## Conclusion
### Objective
- The authors propose a method to **infuse** an off-the-shelf **large language model** with **acoustic information**.
### Model Integration
- The proposed model achieves a **deep integration between audio and the LLM** by directly **mapping acoustic representations into the semantic space of the LLM**.
### Practical Aspects Explored
- The study explores several practical aspects of the proposed model aimed at improving performance. These include:
- **Compression of the acoustic feature**
- **Design of the attention mask**
- **Fine-tuning with adapters**
### Performance Evaluation
- On a **speech translation task** involving 13 source languages translated to English, the proposed model **significantly outperforms** a strong sequence-to-sequence baseline model.
### Decoder-Only Architecture
- The study also highlights the effectiveness of a **decoder-only architecture**, trained from scratch, which achieves comparable performance with around **40% fewer parameters**.
- This emphasizes the potential of decoder-only models for general **speech-to-text modeling**.