# ON DECODER-ONLY ARCHITECTURE FOR SPEECH-TO-TEXT AND LARGE LANGUAGE MODEL INTEGRATION
[paper](https://arxiv.org/pdf/2307.03917.pdf)
## Introduction
### Issue of previous method
- **Cascaded Approach**
- Sequential processing through **automatic speech recognition** and **LLMs** introduces **latency** and **potential errors** in recognizing **spoken words**.
- **Alignment Challenges**
- Aligning **speech** and **text** modalities is challenging due to the **different sequence lengths** of speech signals and text.
- **Integration Cost**
- Training LLMs is **costly**, and **integrating them with speech models** may further increase the computational expense.
- **Tokenization Challenges**
- Converting speech into **discrete tokens** may not capture the **continuous nature** of speech representation accurately.
### Improvements with Speech-LLaMA
- **End-to-End Integration**
- Speech-LLaMA proposes an efficient **end-to-end integration**, eliminating the need for separate **ASR** and **LLM** processing.
- This reduces **latency** and **potential errors**.
- **Alignment Enhancement**
- The **acoustic compressor** in Speech-LLaMA helps align speech and text modalities by reducing the **sequence length** of the speech signal, facilitating compatibility with text sequences.
- **Cost Reduction**
- By incorporating a pre-existing LLM and introducing a minimal number of **free parameters**, Speech-LLaMA aims to **minimize the overall integration cost** while maintaining **exceptional performance**.
- **Continuous Representation**
- Speech-LLaMA directly maps **continuous speech** representation into the **semantic space** of the LLM, avoiding the need for **discretized tokens** and capturing the **nuanced nature** of speech.
- **Decoder-Only Architecture**
- Speech-LLaMA demonstrates the potential of a **decoder-only** architecture, showcasing competitive performance with **encoder-decoder** models and achieving **better parameter efficiency**.
Overall, Speech-LLaMA aims to address the shortcomings of past methods by providing an **end-to-end integration** approach that **enhances alignment**, **reduces cost**, and leverages the advantages of **continuous speech representation**.
## Related Work
### Large language models
- **General Features of LLMs**
- LLMs are typically **pre-trained** on extensive textual data covering **various domains** and **languages**.
- They usually consist of **a stack of Transformer layers**, employing an **auto-regressive decoder-only** architecture.
- In this architecture, each **output token** serves as the **input** to predict the next token in the sequence (see the factorization sketched just after this list).
- **Selected Background Language Model**
- The study opts for **LLaMA-7B** as the foundational language model.
- LLaMA-7B comprises **32** Transformer decoder layers, each with **32** attention heads and an attention dimension of **4096**.
- The LLaMA tokenizer has a vocabulary size of **32,000**, encompassing **multiple languages**.
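To make the auto-regressive factorization described above concrete, a decoder-only LM models a text sequence $y$ of length $Y$ as follows (this restatement mirrors the decoder-only equation given later in these notes and is not copied from the paper):
\begin{equation} p(y; \Theta_{LLM}) = \prod_{n=0}^{Y-1} p(y_n \mid y_{<n}; \Theta_{LLM}) \end{equation}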
### CTC compressor
- **The use of CTC compressor**
- The CTC compressor is a method aimed at **shortening sequences** by **removing redundant information** in features.
- **Application in Speech Translation Task**
- The method was applied in a speech translation task by adding a **linear CTC branch** in the **middle layer of the encoder**.
- This branch is **jointly optimized** with the primary cross-entropy criteria.
- The **hidden representations of the CTC branch** are compressed based on the distributions of CTC posteriors and **passed to subsequent layers**.
- **Variations in Sequence Length Compression**
- The authors explored variations within this **sequence length compression** method.
- It was found that **averaging consecutive hidden representations** (corresponding to consecutive CTC predictions belonging to the same class) yielded the **best performance**.
### LoRA
- **Purpose of LoRA**
- LoRA is employed for **adapting large models** to new datasets or tasks.
- **Introduction of Additional Parameters**
- LoRA introduces a **small number of free parameters** to each Transformer layer of the original large model.
- **Parameter Freezing**
- All the parameters of the original model are **frozen** during the adaptation process.
- **Low-Rank Approximation**
- For each weight matrix $W$ in a Transformer layer ($W \in \mathbb{R}^{d \times k}$), two new matrices $W_a$ $(W_a \in \mathbb{R}^{d \times r})$ and $W_b$ $(W_b \in \mathbb{R}^{r \times k})$ are introduced, where $r \ll \min(d, k)$.
- During training, each matrix multiplication involves the input $x$ being multiplied with **both** the **original weight $W$** and its **introduced low-rank approximation $W_a$, $W_b$**.
- The outputs from these multiplications are then **summed** to form the **final output** for subsequent computations.
- **Fine-Tuning and Memory Reduction**
- Only the introduced low-rank matrices **$W_a$** and **$W_b$** are **updated** during fine-tuning, while the original weight **$W$** remains **frozen**.
- This selective updating significantly **reduces** the **memory footprint** during the training process.
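A minimal PyTorch sketch of this low-rank update, wrapping a frozen `nn.Linear` (the class name, rank, and initialization below are illustrative assumptions, not the paper's code):

```python
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Frozen weight W plus a trainable low-rank update W_a @ W_b (illustrative)."""

    def __init__(self, base: nn.Linear, rank: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                    # freeze the original weight W
        d, k = base.in_features, base.out_features
        self.W_a = nn.Parameter(torch.randn(d, rank) * 0.01)  # d x r
        self.W_b = nn.Parameter(torch.zeros(rank, k))          # r x k, zero-initialized

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Sum of the frozen path and the low-rank path forms the final output.
        return self.base(x) + (x @ self.W_a) @ self.W_b


# Usage: wrap one attention projection of a Transformer layer.
proj = nn.Linear(4096, 4096)
lora_proj = LoRALinear(proj, rank=8)
y = lora_proj(torch.randn(2, 10, 4096))
```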
## Method
### Overview
- **Pre-trained Text Neural LLM**
- The model incorporates a **pre-trained text neural LLM**, specifically **LLaMA-7B**.
- It's emphasized that this method can be extended to LLMs of **varying scales**, indicating flexibility in the choice of language model.
- **CTC Compressor**
- A CTC compressor is utilized to **reduce the sequence length** of the input speech filter-bank, aligning it with the length of the text.
- The compressed speech signal generated by the CTC compressor is then further processed by the audio encoder and integrated into the semantic space of the LLM.
- **Audio Encoder**
- The architecture includes an audio encoder responsible for transforming the **compressed speech signal** into **continuous vectors**.
- These continuous vectors exist within the **semantic space of the pre-trained text neural LLM**.
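A high-level sketch of how these three components could fit together, assuming a HuggingFace-style causal-LM interface (`get_input_embeddings`, `inputs_embeds`) and hypothetical `ctc_compressor` / `audio_encoder` modules (the composition is illustrative, not the paper's exact code):

```python
import torch

def speech_llama_forward(fbank, prompt_ids, text_ids,
                         ctc_compressor, audio_encoder, llama):
    """Filter-bank -> CTC compressor -> audio encoder -> LLaMA semantic space."""
    compressed = ctc_compressor(fbank)              # (B, T', D): shortened acoustic states
    audio_emb = audio_encoder(compressed)           # (B, T', 4096): mapped to LLM space
    embed = llama.get_input_embeddings()            # LLM token-embedding lookup
    prompt_emb = embed(prompt_ids)                  # (B, P, 4096): text prompt
    text_emb = embed(text_ids)                      # (B, N, 4096): target-side tokens
    inputs = torch.cat([prompt_emb, audio_emb, text_emb], dim=1)
    return llama(inputs_embeds=inputs).logits       # next-token logits over the vocabulary
```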
### CTC compressor
- **Pre-training Objective**
- The pre-training objective of the CTC compressor is to align the durations of **audio** and **text** to the **same scale**.
- This is achieved by selecting **representative frames** from the audio signal.
- **Sequence Length Reduction Techniques**
- Two techniques are explored to reduce the sequence length of the acoustic features within the CTC compressor: **blank-removal** and **frame-averaging**.
- **Blank-Removal Technique**
- In "blank-removal," frames that predict the blank **symbol**, based on the distribution of the **CTC posteriors**, are simply **discarded**.
- This method involves removing frames associated with blank predictions to achieve **sequence length reduction**.
- **Frame-Averaging Technique**
- In "frame-averaging," the **hidden states** of consecutive frames are **averaged without removing frames** associated with blank predictions.
- This technique involves averaging the hidden states of frames belonging to the **same class** according to **CTC predictions**.
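A rough PyTorch sketch of the two reduction strategies for a single utterance, assuming `hidden` holds the encoder states, `log_probs` the CTC posteriors, and blank index 0 (shapes and the blank index are assumptions for illustration):

```python
import torch

def blank_removal(hidden, log_probs, blank_id=0):
    """Drop frames whose most likely CTC label is the blank symbol."""
    # hidden: (T, D) encoder states; log_probs: (T, V) CTC posteriors
    preds = log_probs.argmax(dim=-1)             # (T,) frame-wise CTC predictions
    return hidden[preds != blank_id]              # keep only non-blank frames

def frame_averaging(hidden, log_probs):
    """Average consecutive frames that share the same CTC prediction (blanks kept)."""
    preds = log_probs.argmax(dim=-1)
    # Mark positions where the predicted class changes to form segments.
    change = torch.ones_like(preds, dtype=torch.bool)
    change[1:] = preds[1:] != preds[:-1]
    segment_id = change.cumsum(dim=0) - 1         # (T,) segment index per frame
    num_seg = int(segment_id.max()) + 1
    out = torch.zeros(num_seg, hidden.size(-1), dtype=hidden.dtype)
    counts = torch.zeros(num_seg, 1, dtype=hidden.dtype)
    out.index_add_(0, segment_id, hidden)
    counts.index_add_(0, segment_id, torch.ones(hidden.size(0), 1, dtype=hidden.dtype))
    return out / counts                            # (num_segments, D)
```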
### Audio encoder
- **Role of the Audio Encoder**
- The audio encoder serves as a **bridge**, transforming representations generated from the **CTC compressor** into **text embeddings** for the text-LLM.
- **Design Features**
- This module is designed to be relatively **small** in size and is **initialized with random weights**.
- During the **fine-tuning** process, the optimization goal for the audio encoder is to effectively integrate **audio information** into the **text-LLM**, thereby enhancing the **overall system performance**.
- **Distinguishing Characteristics from Other Methods**
- In contrast to other approaches, where the audio encoder is trained to initially map speech signals into discrete tokens consumed by the LLM, the proposed audio encoder takes a different approach.
- Instead, it is directly optimized to map the **compressed acoustic signal** to the **continuous semantic space of the LLM**, allowing for a **deep integration** between the audio encoder and the language model.
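A hedged sketch of such an audio encoder, assumed here to be a few randomly initialized Transformer layers followed by a linear projection into the LLM's 4096-dimensional embedding space (layer counts and dimensions are illustrative):

```python
import torch
import torch.nn as nn

class AudioEncoder(nn.Module):
    """Maps compressed acoustic representations into the LLM embedding space."""

    def __init__(self, in_dim=512, llm_dim=4096, num_layers=4, num_heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=in_dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.proj = nn.Linear(in_dim, llm_dim)      # project into the LLM semantic space

    def forward(self, compressed: torch.Tensor) -> torch.Tensor:
        # compressed: (B, T', in_dim) output of the CTC compressor
        return self.proj(self.encoder(compressed))  # (B, T', llm_dim)
```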
### Instruct learning
- **Training Phase**
- For each training sample, a **text prompt** is prepended to briefly describe the **task**, for example, "audio ⇒ English" or "transcribe the audio into English."
- **Selection of Text Prompts**
- The text prompts are sampled from a **pre-defined** list, where some prompts contain the **source language ID**, following the format “translate [source] audio into English”.
- **Evaluation Phase**
- During evaluation, the **text prompt is fixed** as “**translate the audio into English**” for all testing samples.
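As a small illustration (the sampling logic and list structure below are assumptions; the prompt strings come from the description above), prompt selection might look like:

```python
import random

PROMPTS = [
    "audio => English",
    "transcribe the audio into English",
    "translate {src} audio into English",   # prompts carrying the source language ID
]

def training_prompt(src_lang: str) -> str:
    """Sample a text prompt from the pre-defined list for one training sample."""
    prompt = random.choice(PROMPTS)
    return prompt.format(src=src_lang) if "{src}" in prompt else prompt

EVAL_PROMPT = "translate the audio into English"   # fixed for all test samples

print(training_prompt("German"))
```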
### LoRA fine-tuning
- **LoRA Application**
- On top of the proposed model, LoRA is applied to the four **attention matrices** in **each layer of the LLaMA Transformer**.
- **Two-Stage Training Scheme**
- To **stabilize** the training process, a **two-stage training scheme** is adopted.
- In the first stage, the **audio encoder is trained** with the **CTC compressor and the LLaMA model frozen**.
- In the second stage, **LoRA** is introduced to the well-trained model, and further optimization is performed.
- **Loss Calculation**
- The entire system is still trained with **cross-entropy loss**.
- The loss is computed between the **LLM output** and the **reference transcription sequence** on the same training data.
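A hedged sketch of the two-stage scheme, assuming the model exposes `audio_encoder` and names its LoRA matrices `W_a` / `W_b` (all module and parameter names are illustrative):

```python
import torch

def set_trainable(model, stage: int):
    """Stage 1: train only the audio encoder. Stage 2: also train the LoRA adapters."""
    for p in model.parameters():
        p.requires_grad = False                       # freeze everything first
    for p in model.audio_encoder.parameters():
        p.requires_grad = True                        # audio encoder is always trained
    if stage == 2:
        for name, p in model.named_parameters():
            if "W_a" in name or "W_b" in name:        # LoRA adapters in the LLaMA layers
                p.requires_grad = True

def loss_fn(logits, labels, ignore_index=-100):
    """Cross-entropy between LLM outputs and the reference transcription.
    Prompt and audio positions are expected to carry ignore_index in `labels`."""
    return torch.nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)), labels.reshape(-1),
        ignore_index=ignore_index)
```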
### From-scratch training
- **Training Objective**
- Explore the potential of **decoder-only** architecture as a foundational architecture for **speech modeling**.
- **Training Approach**
- Replace the **text prompt**, **audio encoder**, and **CTC compressor** with a **randomly initialized convolutional 2D encoder**.
- Replace the **pretrained LLaMA network** with a **much smaller randomly initialized autoregressive network**.
- An **⟨SOS⟩ token** is added at the **end of the acoustic sequence** to indicate the **start of generation**.
- **Generation of the text sequence**
\begin{equation} p(y|x; \Theta_{DEC}) = \prod_{n=0}^{Y-1} p(y_n | y_{<n}, x; \Theta_{DEC}) \end{equation}
- In this case, the generation of the **text sequence $y$** with a **decoder-only** model is conditioned purely on **audio signal $x$** and **previously generated text sequence $y_{<n}$**.
- $\Theta_{DEC}$ refers to the parameters of the decoder model.
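A minimal sketch of this from-scratch decoder-only setup, using a 2D convolutional front-end and a small causal Transformer over the concatenated audio-then-text sequence (layer sizes, kernel settings, and the vocabulary size are assumptions):

```python
import torch
import torch.nn as nn

class DecoderOnlySpeechModel(nn.Module):
    def __init__(self, vocab=32000, d_model=512, num_layers=12, num_heads=8, fbank_dim=80):
        super().__init__()
        # Conv2D front-end subsamples the filter-bank by 4x in time and frequency.
        self.subsample = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, stride=2, padding=1), nn.ReLU())
        self.audio_proj = nn.Linear(32 * ((fbank_dim + 3) // 4), d_model)
        self.embed = nn.Embedding(vocab, d_model)     # text embeddings (<sos> shares this table)
        layer = nn.TransformerEncoderLayer(d_model, num_heads, batch_first=True)
        self.decoder = nn.TransformerEncoder(layer, num_layers)  # used with a causal mask
        self.lm_head = nn.Linear(d_model, vocab)

    def forward(self, fbank, text_ids):
        # fbank: (B, T, F); text_ids: (B, N) beginning with the <sos> token id,
        # so that <sos> sits directly after the acoustic frames.
        x = self.subsample(fbank.unsqueeze(1))                     # (B, 32, T/4, F/4)
        B, C, T, F = x.shape
        audio = self.audio_proj(x.permute(0, 2, 1, 3).reshape(B, T, C * F))
        seq = torch.cat([audio, self.embed(text_ids)], dim=1)      # audio then text
        L = seq.size(1)
        mask = torch.triu(torch.full((L, L), float("-inf")), diagonal=1)
        h = self.decoder(seq, mask=mask)
        # Logits at each text position predict the next text token.
        return self.lm_head(h[:, audio.size(1):])
```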
## Experiment
The **speech translation task** has been chosen as the primary evaluation benchmark for assessing the proposed methods. In this task, the goal is to develop a system that can accurately translate **spoken language from 13 source languages to English**.
### Baselines
- **Baseline Systems**
- **B1** : Seq2seq model
- **B2** : B1 with **LLaMA** n-best rescoring
- **Performance Comparison**
- Results indicate that the B2 system performs **better** than B1.
- This suggests that **shallow integration with LLM** can still bring **benefits** to the **speech models**.
### Deeper integration with LLaMA
- **Speech-LLaMA Configurations**
- Systems **E1 ∼ E6** describe Speech-LLaMA models in various configurations.
- **Performance Improvement**
- All Speech-LLaMA configurations significantly **outperform** the baselines despite having a **limited number of learnable parameters**.
- **System Efficacy and Integration Depth**
- These results show the **efficacy** of the proposed system and also suggest the need for **deeper integration between speech models and text-LLMs**.
### CTC compressor
- **Comparison of CTC Compressor Performance**
- **E1** consistently outperforms **E0**, indicating the effectiveness of the **CTC compressor** over the **convolution-based compressor**.
- **Comparison of Strategies within CTC Compressor**
- Within the CTC compressor, comparing **E3** to **E1**, the "**frame-averaging**" strategy yields a 1.5-point **higher** average BLEU score than the "**blank-removal**" strategy.
- The authors attribute this difference to the fact that the CTC compressor may not reliably distill **all relevant information** into non-blank representations; the averaging strategy is more **robust** to this compression error.
### Effect of non-causal attention mask
- **Expected Effect of Full Attention Mask**
- It is anticipated that applying a **full attention mask** over both the **text prompts** and the **acoustic representations** would generally lead to **improved** speech representations and, consequently, better overall results.
- **Effect of Non-Causal Attention Mask**
- For each type of CTC compression strategy, experiments demonstrate that using a **non-causal attention mask** compared to a causal mask results in gains.
- Comparing system **E2** to **E1**, switching to a non-causal mask brings an additional gain of 1.5 average BLEU score when using the "**blank-removal**" strategy within the CTC compressor.
- Similarly, comparing systems **E5** to **E3**, a gain of 0.7 average BLEU score is observed when using the "**frame-averaging**" strategy within the CTC compressor.
- Even in LoRA fine-tuning systems (e.g., comparing **E6** and **E4**), a gain of 0.8 average BLEU score is observed with a non-causal mask applied.
- **Explanation for More Pronounced Effect of Non-Causal Mask**
- In the "**blank-removal**" strategy, the gain with a non-causal mask is understandably larger, as **future acoustic information** can compensate for **potential information loss** caused by removing frames corresponding to the blank symbol in the CTC loss.
### LoRA fine-tuning
- **Comparison of Systems E4 and E3**
- E4 represents a system with **LoRA fine-tuning** using a **causal attention mask**.
- Comparing E4 over E3 demonstrates the gains achieved through **LoRA fine-tuning** with a causal attention mask.
- There is an additional **increase** of 1.5 average BLEU score.
- **Comparison of Systems E6 and E5**
- E6 represents a system with **LoRA fine-tuning** using a **non-causal attention mask**.
- Comparing E6 over E5 shows the corresponding gains achieved through **LoRA fine-tuning** with a non-causal attention mask.
- There is an additional **increase** of 1.6 average BLEU score.
- **Parameter Addition**
- Despite the notable performance gains, only **2.1 million** additional parameters are added as adaptors during LoRA fine-tuning.
### Decoder-only vs Encoder-Decoder
- **Decoder-Only Model Performance**
- The decoder-only model achieves **slightly lower performance** compared to the seq2seq baseline.
- **Parameter Efficiency Comparison**
- Despite the slight performance decrease, the **total parameters** for the **decoder-only model** are significantly **lower** than the seq2seq baseline.
## Conclusion
### Objective
- The authors propose a method to **infuse** an off-the-shelf **large language model** with **acoustic information**.
### Model Integration
- The proposed model achieves a **deep integration between audio and the LLM** by directly **mapping acoustic representations into the semantic space of the LLM**.
### Practical Aspects Explored
- The study explores several practical aspects of the proposed model aimed at improving performance. These include:
- **Compression of the acoustic feature**
- **Design of the attention mask**
- **Fine-tuning with adapters**
### Performance Evaluation
- On a **speech translation task** involving 13 source languages translated to English, the proposed model **significantly outperforms** a strong sequence-to-sequence baseline model.
### Decoder-Only Architecture
- The study also highlights the effectiveness of a **decoder-only architecture**, trained from scratch, which achieves comparable performance with around **40% fewer parameters**.
- This emphasizes the potential of decoder-only models for general **speech-to-text modeling**.