<style>
.red {
color: red;
}
.blue{
color: blue;
}
.green{
color: green;
}
</style>
# [LOCOST: State-Space Models for Long Document Abstractive Summarization](https://arxiv.org/abs/2401.17919)
:::danger
**Comments:** Accepted at EACL 2024 conference
**Github:** https://github.com/flbbb/locost-summarization
**Keywords:** State-space Model, Model with Unlimited Context Length.
**Python:** Python 3.10
:::
## 1. Introduction
- The introduction of transformer architectures brought a major leap in performance and scalability for text generation.
- However, the **quadratic complexity** in the input length still restricts the application of large pre-trained models to long texts.
- For instance, BERT and BART are limited to a context size of 512 and 1024 tokens respectively, which amounts to 2-3 paragraphs of standard text.
- To mitigate this issue, a straightforward approach is to leverage <span class='red'>sparse-attention patterns</span> to better cope with long texts.
- However, in practice, <span class='red'>even the best sparse-transformers need heavy computational resources to handle sequences of length larger than 8K tokens</span>.
- Deep **state-space models (SSMs)** have been proposed for sequence processing, **with complexity O(L log L)**, initially for computer vision and audio and more recently for text.
- <span class='red'>Their recurrent architectures are designed for capturing long-range dependencies.</span>
- Up to now, their applications have been restricted to either unconditional autoregressive generation (i.e., decoder-only models) or sequence classification (i.e., encoder-only models). Tackling conditional text generation with SSMs, as required e.g. for summarization, remains unexplored.
:::info
This means that, so far, applications of state-space models (SSMs) have been limited to two settings:
1. Unconditional autoregressive generation, i.e., using only a decoder.
2. Sequence classification, i.e., using only an encoder.
Conditional text generation with SSMs, such as summarization, remains unexplored.
In other words, SSMs have not yet been applied to sequence-to-sequence tasks that need both an encoder and a decoder, such as summarization or machine translation, which leaves this as a promising direction for further research.
:::
- We propose **LOCOST**, an **encoder-decoder architecture**, to explore the performance of SSMs for conditional text generation tasks, through the lens of abstractive summarization.
:::success
We demonstrate that SSMs can be competitive with transformer-based models while drastically reducing their memory requirements.
:::
## 2. Related Work
### Memory efficiency for transformers
- <span class='red'>Reducing the memory consumption of transformers is an active research field</span>.
- Optimization at the hardware level helped to improve the scaling of the attention computation on recent GPUs.
- A line of work considers retrieval-augmented transformers, which use additional modules to enhance the language-modeling backbone.
- While crucial in developing memory-efficient architectures, we consider these last two topics as being orthogonal to our work that focuses on the models’ architecture.
- Profuse literature focuses on tailoring the models’ architecture for long inputs.
:::info
A large body of existing work focuses on tailoring model architectures specifically for long inputs, so that models can process much longer input sequences effectively.
:::
- <span class='red'>Since the computational complexity of attention comes from the computation of the self-attention matrix, a straightforward way to reduce its cost is to approximate it using sparse-attention patterns.</span>
- These patterns typically incorporate a combination of local attention and a set of carefully selected tokens:
1. **BigBird** considers **random tokens**.
2. **LSG** considers **sparse tokens obtained through various sparsification strategies**.
3. LongT5 chunks the sequence into blocks and averages their representations, which gives a number of global tokens equal to the number of blocks.

:::success
Computational complexity per encoder layer as a function of the input length L, the local window size w (typically set to 256 tokens), the number of global tokens g, random tokens r, sparse tokens s and the chunk size c. LOCOST has a much lower complexity than other sparse-attention baselines.
:::
:::info
This describes sparse-attention patterns for long inputs, which usually combine two components:
1. Local attention: each token attends only to a limited window of neighboring tokens rather than the whole sequence, which reduces the computational cost.
2. A set of carefully selected tokens: besides local tokens, the model also attends to a few important tokens picked from the whole sequence. These can be global tokens, randomly sampled tokens, or sparse tokens chosen by other strategies.
Combining the two lets the model capture local detail (through local attention) while still accessing globally important information (through the selected tokens).
This design balances computational complexity and model quality, allowing much longer inputs to be processed without the prohibitive cost of standard self-attention.
:::
### Attention-free transformers
- Some variants of transformers already avoid the standard attention mechanism.
- For example, Katharopoulos et al. (2020) and Hua et al. (2022) approximate the softmax similarity in the attention with a more efficient computation.
- Mixing architectures were introduced in (Liu et al., 2021). They are the main component of the FNet model, an encoder that replaces self-attention with a Discrete Fourier Transform (DFT).
- FNet has a complexity of O(L log L) and is an encoder-only model, thus restricted to classification and regression tasks.
- <span class='red'>Our proposed model also bypasses attention in the encoder, reaching the same computational complexity as encoders such as FNet</span>, while being a much more versatile model, specifically designed for conditional text generation.
### State-space models (SSMs)
- SSM architectures are particularly appealing for processing long sequences due to their reduced complexity compared to transformers, and their stronger theoretical guarantees compared to RNNs.
:::info
This highlights two advantages of SSM architectures for long sequences:
1. Lower computational complexity than transformers: self-attention computes a score between every pair of tokens, which costs O(L$^2$), whereas an SSM is computed as a convolution in only O(L log L), making it much more efficient on long sequences.
2. Stronger theoretical guarantees than RNNs: classical RNNs suffer from vanishing/exploding gradients and struggle to capture long-range dependencies, while SSMs are provably better behaved and can encode arbitrarily long sequences into their state, giving more reliable long-range modeling.
Together, these properties make SSMs an appealing choice for long sequences: lower complexity makes them more efficient, and stronger guarantees make their long-range modeling more reliable.
:::
- In practical applications, SSMs have found success in both classification and unconditional autoregressive generation for language modeling.
1. Gu et al. (2022b) proposed a classification model that significantly improved the **Long-Range Arena benchmark**, which includes classification tasks involving images, synthetic sequences, and texts.
2. Other studies have applied SSMs to **video classification and text classification**.
3. Regarding language modeling, many researchers have leveraged the natural causal formulation of SSMs, employing a decoder-only architecture for tasks like **audio generation** and, more recently, autoregressive language modeling.
:::info
This says that, in language modeling, many researchers exploit the naturally causal formulation of SSMs and adopt decoder-only architectures for tasks such as audio generation and, more recently, autoregressive language modeling.
In more detail:
1. "leveraged the natural causal formulation of SSMs": the SSM state recurrence is causal by construction, since each state depends only on the past.
2. "employing a decoder-only architecture": these works use a decoder without an encoder, i.e., the model generates its output sequence autoregressively from the input sequence alone.
3. "for tasks like audio generation": for example, applying SSMs to audio generation.
4. "more recently, autoregressive language modeling": given the first tokens of a text, the model generates the continuation.
Overall, this describes recent applications of SSMs in decoder-only sequence generation, showing their potential for such tasks.
:::
## 3. Background
- For contextualization, we leverage state-space models instead of self-attention.
:::info
This says that, in this paper, state-space models are used for contextualization instead of the self-attention mechanism.
In transformers, self-attention captures long-range dependencies between tokens to produce contextualized token representations, but its cost is roughly O(L$^2$), so efficiency drops sharply for long sequences.
State-space models provide an alternative way of contextualizing: a recurrent state equation captures dependencies between tokens and produces contextualized token representations, at a cost of only O(L log L).
Choosing the computationally efficient state-space model over standard self-attention reduces the cost of modeling long inputs, which is one of the main motivations for the proposed architecture.
:::
:::info
In NLP, contextualization means using the surrounding words in a sentence or document to enrich the semantic representation of a target word. More concretely, it involves:
1. **Gathering context:** collecting relevant contextual information about the target word from its sentence, paragraph, or the whole document (neighboring words, sentence structure, topic, etc.).
2. **Fusing context:** combining this contextual information with the word's original representation (e.g., its word embedding) to produce a new, contextualized representation with richer semantics.
3. **Contextualized word representations:** the resulting representations are sensitive to the usage context, unlike static word vectors.
4. **Propagating context:** contextualized representations carry information from the context in addition to the word itself, and can be passed on to downstream task models.
Contextualization is usually performed automatically by a neural network; in BERT, for instance, the bidirectional transformer encoder contextualizes the input token sequence. This lets the model resolve the precise meaning of words and obtain stronger semantic representations, leading to better downstream performance.
In short, contextualization enriches word representations with contextual information and is a key step in natural language understanding.
:::
- Throughout the paper, <span class='red'>**L** denotes the sequence length, **H** the embedding dimension and **N** the dimension of the state-space hidden state</span>.
### State-space Models
- For unidimensional inputs u = (u$_0$, ..., u$_{L−1}$) ∈ R$^L$, deep SSMs are based on the recurrent equation:

$$x_{j+1} = A x_j + b\, u_{j+1}, \qquad y_{j+1} = c^\top x_{j+1} + d\, u_{j+1} \tag{1}$$
:::warning
- where x$_j$ is the SSM hidden state
- y$_j$ the output of the SSM.
- The state matrix A ∈ R$^{N×N}$ carries and transforms the hidden state through the iterations, along with b ∈ R$^N$, c ∈ R$^N$, and d ∈ R, which are learned parameters.
:::
:::info
This describes the basic mathematical form of a state-space model (SSM):
- x$_j$ is the hidden-state vector at step j
- y$_j$ is the SSM output at step j
The hidden state is propagated and updated through the recurrence:
x$_{j+1}$ = Ax$_j$ + bu$_{j+1}$
y$_{j+1}$ = c$^T$x$_{j+1}$ + du$_{j+1}$
where:
- A ∈ R$^{N×N}$ is the state matrix, which carries and transforms the hidden state x$_j$ at every step
- b ∈ R$^N$ is a learned vector controlling how the input u$_{j+1}$ affects the hidden state
- c ∈ R$^N$ is a learned vector mapping the hidden state x$_{j+1}$ to the output y$_{j+1}$
- d ∈ R is a learned scalar that connects the input u$_{j+1}$ directly to the output y$_{j+1}$
A, b, c and d are all learned during training so as to capture the patterns and dependencies in the input sequence.
In short, the SSM updates the hidden state x$_j$ recurrently through A and b, and maps it to the output y$_j$ through c and d. This avoids the heavy computation of self-attention while, in theory, still capturing long-range dependencies.
:::
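A minimal NumPy sketch of this recurrence, with toy sizes and randomly initialized parameters (illustrative only, not taken from the official repository):

```python
import numpy as np

# Toy sizes for illustration.
L, N = 16, 4                      # sequence length, state dimension
rng = np.random.default_rng(0)

# Learned parameters of a single scalar-input SSM.
A = 0.9 * np.eye(N) + 0.01 * rng.standard_normal((N, N))   # state matrix
b = rng.standard_normal(N)                                  # input vector
c = rng.standard_normal(N)                                  # output vector
d = float(rng.standard_normal())                            # skip (feedthrough) scalar

u = rng.standard_normal(L)                                   # scalar input sequence

# Recurrence: x_{j+1} = A x_j + b u_{j+1} ; y_{j+1} = c^T x_{j+1} + d u_{j+1}
x = np.zeros(N)
y_rec = np.zeros(L)
for j in range(L):
    x = A @ x + b * u[j]
    y_rec[j] = c @ x + d * u[j]
print(y_rec)
```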
### State-space convolution
- By **unrolling the recurrence above**, the output sequence y ∈ R$^L$ can be expressed as:

$$y_l = d\, u_l + \sum_{j \le l} c^\top A^{l-j} b\, u_j, \qquad \forall l \in \{0, \ldots, L-1\}$$
- Let ∗ denote the causal convolution operator. Then, we can define a convolution kernel κ ∈ R$^L$ that depends on A, b, c:

$$\kappa = \left(c^\top b,\; c^\top A b,\; \ldots,\; c^\top A^{L-1} b\right)$$

- A SSM layer is therefore parametrized by A, b, c, d through κ, and its output y is defined as:

$$y = \kappa * u + d\, u \tag{2}$$
- For multidimensional u ∈ R$^{L×H}$, we simply compute H convolutions with one kernel κ$_h$ for each dimension.
:::info
This explains how the state-space model is combined with a convolution:
1. "Let ∗ denote the causal convolution operator": the causal convolution operator ∗ is introduced.
2. "we can define a convolution kernel κ ∈ R$^L$ that depends on A, b, c": from the SSM parameters A, b, c we can build a convolution kernel κ of length L.
3. "A SSM layer is therefore parametrized by A, b, c, d through κ": an SSM layer is thus parametrized by A, b, c, d through this kernel.
4. "its output is defined by y = κ ∗ u + du": the output y is the causal convolution of κ with the input u, plus the linear skip term du.
5. "For multidimensional u ∈ R$^{L×H}$, we simply compute H convolutions with one kernel κ$_h$ for each dimension": for an L×H input, one independent kernel κ$_h$ is used per dimension, i.e., H independent convolutions in total.
In short, the state-space model is rewritten as a convolution: a kernel κ built from the SSM parameters is convolved with the input u to produce the output y. This lets the SSM layer model long inputs efficiently, with O(L log L) complexity; for multidimensional inputs, one independent convolution is computed per dimension.
:::
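To make the convolution view concrete, here is a small self-contained NumPy sketch that builds the kernel κ from A, b, c and checks that the causal convolution y = κ ∗ u + d·u reproduces the recurrent computation (toy parameters, illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
L, N = 16, 4
A = 0.9 * np.eye(N) + 0.01 * rng.standard_normal((N, N))
b, c = rng.standard_normal(N), rng.standard_normal(N)
d = float(rng.standard_normal())
u = rng.standard_normal(L)

# Recurrent computation (Equation (1)).
x, y_rec = np.zeros(N), np.zeros(L)
for j in range(L):
    x = A @ x + b * u[j]
    y_rec[j] = c @ x + d * u[j]

# Kernel kappa = (c^T b, c^T A b, ..., c^T A^{L-1} b).
kappa, Ak_b = np.zeros(L), b.copy()
for k in range(L):
    kappa[k] = c @ Ak_b
    Ak_b = A @ Ak_b

# Causal convolution: y_l = sum_{j <= l} kappa[l-j] u_j, plus the skip term d*u (Equation (2)).
y_conv = np.array([np.dot(kappa[:l + 1][::-1], u[:l + 1]) for l in range(L)]) + d * u

assert np.allclose(y_rec, y_conv)   # both views give the same output
```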
### SSMs efficiency
- Due to the linear time-dependency between hidden states, as shown in Equation (1), we can compute the whole output y directly as a convolution, without iteration over the time dimension, as opposed to RNNs.
- A naive implementation of (2) would incur a quadratic complexity in the input length L, matching the complexity of transformers and thus be prohibitive for long sequences.
- However, thanks to the FFT, this computation can be performed in O(L log L).
:::info
This explains the computational complexity of implementing Equation (2):
1. "A naive implementation of (2) would incur a quadratic complexity in the input length L": computing the convolution directly costs O(L$^2$).
2. "matching the complexity of transformers": this quadratic cost is on par with self-attention in transformers.
3. "and thus be prohibitive for long sequences": such a cost is unacceptable for long inputs.
4. "However, thanks to the FFT, this computation can be performed in O(L log L)": using the Fast Fourier Transform, the convolution can be computed in O(L log L).
In short, a naive convolution would be as expensive as a transformer and too costly for long sequences, but the FFT brings the cost down to a near-linear O(L log L), which is the key optimization that gives the model its long-sequence capability.
:::
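A short NumPy illustration of the FFT trick: the same causal convolution computed naively (O(L$^2$)) and via FFT with zero-padding (O(L log L)) gives identical results (toy kernel, illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
L = 4096
kappa = 0.95 ** np.arange(L) * rng.standard_normal(L)   # a toy decaying kernel
u = rng.standard_normal(L)

def causal_conv_naive(kappa, u):
    """Direct computation of y_l = sum_{j <= l} kappa[l-j] u[j]: O(L^2)."""
    return np.array([np.dot(kappa[:l + 1][::-1], u[:l + 1]) for l in range(len(u))])

def causal_conv_fft(kappa, u):
    """Same convolution through the FFT: O(L log L). Zero-padding to 2L makes the
    circular convolution of the FFT equal to the linear (causal) convolution."""
    n = 2 * len(u)
    full = np.fft.irfft(np.fft.rfft(kappa, n) * np.fft.rfft(u, n), n)
    return full[:len(u)]

assert np.allclose(causal_conv_naive(kappa, u), causal_conv_fft(kappa, u))
```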
## 4. Model
1. We first introduce a bidirectional deep state-space model.
2. We then show how to use it to enable global contextualization of the tokens.
3. Finally, we present the architecture of the LOCOST layer, with an efficient contextualization that can be used as a drop-in replacement for the self-attention mechanism in transformers.
### 4.1 Capturing local and global contexts
- In deep SSMs, information from previous tokens flows up to the current token through the hidden states x.
- The convolution view provides another angle: each output yj is a weighted sum of the previous tokens u0, . . . ,uj , whose weights are given by κ.
:::info
In deep SSMs, information from previous tokens flows to the current token through the hidden states x. The convolution view offers another perspective: each output y$_j$ is a weighted sum of the previous tokens (u$_0$, ..., u$_j$), with weights given by κ.
In more detail:
1. In a state-space model, each hidden state is determined by the previous hidden state and the current input token, forming a recurrence.
2. Unrolling this recurrence, the output y$_j$ can be written as a function of all previous input tokens, i.e., a weighted sum.
3. The weights are determined by the model parameters A, b, c and form the convolution kernel κ.
4. From the convolution viewpoint, y$_j$ is therefore the weighted sum of all previous tokens, with weights taken from κ.
5. This convolutional form lets the state-space model capture long-range dependencies in the input while keeping the computational cost low.
In summary, this describes how hidden states propagate information in an SSM, and how the convolution view relates the output to the history of inputs.
:::
### Bidirectional contextualization
- <span class='red'>To aggregate information from both directions, we consider bidirectional convolutions.</span>
- A first kernel ←−κ performs the regular causal convolution ←−κ ∗ u. A second kernel −→κ is used to compute the cross-correlation with u.
- The results of these two operations are summed (similar to a bidirectional recurrent encoder).
:::info
Two kernels are used, ←−κ and −→κ:
1. ←−κ performs a standard causal convolution, ←−κ ∗ u, so each output y$_j$ depends only on the current and previous inputs (u$_0$, ..., u$_j$).
2. −→κ computes the cross-correlation with the input u. Cross-correlation is similar to convolution, but lets the output depend on future inputs.
Simply put, ←−κ captures the flow of information from the past to the present, and −→κ captures the flow from the future to the present.
Summing the results of the two kernels yields a bidirectional context, where each output y$_j$ fuses past and future information:
y$_j$ = ∑$_{l≤j}$ ←−κ$_{j-l}$ ⊙ u$_l$ + ∑$_{l≥j}$ −→κ$_{l-j}$ ⊙ u$_l$
where ⊙ denotes the element-wise product.
With these bidirectional kernels, the model can exploit context from the entire input sequence and better model long-range dependencies, much like the bidirectional structures commonly used in NLP to obtain better representations.
:::
- The overall operation is described by the following equation:

$$y_j = \sum_{l \le j} \overleftarrow{\kappa}_{j-l} \odot u_l \;+\; \sum_{l \ge j} \overrightarrow{\kappa}_{l-j} \odot u_l$$
:::warning
In this equation:
- U ∈ R$^{L×H}$ is the embedding matrix of the input text: (u0, . . . ,uL−1).
- The kernels −→κ,←−κ are computed as in Equation (2), with their respective parameters (−→A,−→c ,−→b ) and (←−A,←−c ,←−b ).
- The element-wise product is denoted by ⊙ and we consider multidimensional inputs, with one kernel per dimension.
:::
:::info
This explains the ⊙ symbol (element-wise product) and how multidimensional inputs are handled:
1. ⊙ denotes the element-wise product: two vectors or matrices of the same shape are multiplied entry by entry, producing an output of the same shape.
- For example: [1, 2] ⊙ [3, 4] = [1·3, 2·4] = [3, 8]
2. The inputs considered here are multidimensional, e.g., the word-embedding vectors common in NLP.
3. For such inputs, one independent convolution kernel is assigned to each dimension.
4. That is, if the input U has shape (L×H) with H embedding dimensions, the model uses H independent kernels κ$_h$ (1 ≤ h ≤ H).
5. Each kernel acts on its own dimension, convolving the sequence of values in that dimension; the final output stacks the H convolution results along the dimension axis.
With this design, the model can assign different kernel parameters to different dimensions and better model the signal in each, which is a common choice when handling high-dimensional data.
:::
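A small NumPy sketch of this bidirectional contextualization for a single dimension, with toy hand-picked kernels (hypothetical values; the real kernels are computed from the learned SSM parameters):

```python
import numpy as np

rng = np.random.default_rng(0)
L = 32
u = rng.standard_normal(L)

# Toy kernels for one dimension: one looking backward (past), one looking forward (future).
k_bwd = 0.8 ** np.arange(L)      # <-kappa: weights over past inputs (causal convolution)
k_fwd = 0.5 ** np.arange(L)      # ->kappa: weights over future inputs (cross-correlation)

# y_j = sum_{l <= j} k_bwd[j-l] * u_l  +  sum_{l >= j} k_fwd[l-j] * u_l
# (the l = j term appears in both sums, exactly as in the equation above)
y = np.array([
    sum(k_bwd[j - l] * u[l] for l in range(j + 1)) +
    sum(k_fwd[l - j] * u[l] for l in range(j, L))
    for j in range(L)
])
print(y[:5])
```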
- The output yj is now contextualized as a weighted sum of previous u$_{≤j}$ and subsequent u$_{≥j}$ inputs.
- For scalar inputs, more insight into how far in the future or in the past a scalar input u$_l$ contributes to the scalar output y$_j$ is given by the spectral radii ρ(−→A) and ρ(←−A).
:::info
This discusses how far, for scalar inputs, an input u$_l$ can influence the output y$_j$: the range of influence is governed by the spectral radii ρ(−→A) and ρ(←−A) of the state matrices −→A and ←−A.
In more detail:
1. For a scalar input u$_l$, the output y$_j$ is a weighted sum over u$_l$ and its neighbors.
2. The weights are determined by the two kernels ←−κ and −→κ, which correspond to the state matrices ←−A and −→A respectively.
3. The **spectral radius** ρ(A) is the largest absolute value of the eigenvalues of A, and reflects the numerical stability and propagation ability of the matrix.
4. For ←−κ, if l < j, the contribution of u$_l$ to y$_j$ is bounded by ρ(←−A)$^{j-l}$: the larger ρ(←−A), the further past inputs can influence y$_j$.
5. For −→κ, if l > j, the contribution of u$_l$ to y$_j$ is bounded by ρ(−→A)$^{l-j}$: the larger ρ(−→A), the further future inputs can influence the current y$_j$.
6. Adjusting the spectral radii of ←−A and −→A therefore controls how far into the past and the future the model captures information.
7. For multidimensional inputs, each dimension has its own state matrix A, so the long-range modeling capacity can be controlled per dimension.
In short, the spectral radius ρ(A) tells us how strongly and how far a scalar input can influence the outputs, and is one of the key quantities determining whether a state-space model can capture long-range dependencies.
:::
- Indeed, the sensitivity of an output y$_j$ with respect to an input u$_l$ is bounded by the following quantity:

$$\left|\frac{\partial y_j}{\partial u_l}\right| \;\lesssim\; \begin{cases} \rho(\overleftarrow{A})^{\,j-l} & \text{if } l \le j \\[4pt] \rho(\overrightarrow{A})^{\,l-j} & \text{if } l \ge j \end{cases}$$
- For multidimensional inputs, using a state-space kernel for each dimension enables a fine-grained adjustment of the spectral radii independently for each of them.
:::info
This discusses the multidimensional case: using one independent state-space kernel per dimension makes it possible to fine-tune the spectral radius of each dimension separately.
Specifically:
1. For multidimensional inputs (such as word-embedding vectors), each dimension is assigned its own state-space kernel κ$_h$.
2. Each kernel κ$_h$ corresponds to an independent state matrix A$_h$.
3. The spectral radius ρ(A$_h$) of A$_h$ determines how far inputs in that dimension can influence the outputs.
4. Since every dimension h has its own state matrix A$_h$, each spectral radius ρ(A$_h$) can be adjusted separately and independently.
5. This gives fine-grained control: some dimensions can capture short-range dependencies (small ρ(A$_h$)) while others capture long-range dependencies (large ρ(A$_h$)).
6. This flexible design lets the model exploit dependencies at different scales contained in multidimensional inputs.
7. Compared with sharing a single state matrix A across all dimensions, independent per-dimension kernels are more expressive and better suited to modeling complex multidimensional sequences.
In short, independent per-dimension state-space kernels give the model the flexibility to tune its long-range modeling capacity separately for each dimension, making full use of multidimensional inputs.
:::
- A small value corresponds to modeling local contexts, while a large value captures global ones.
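A quick NumPy check of this intuition: rescaling a random state matrix to a target spectral radius and inspecting the kernel magnitudes |c$^T$A$^k$b| shows that a small radius gives a fast-decaying (local) kernel, while a radius close to 1 gives a slowly decaying (global) kernel (toy matrices, illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
N, L = 8, 64

def kernel_magnitudes(target_rho):
    """Random state matrix rescaled to spectral radius target_rho;
    returns |c^T A^k b| for k = 0..L-1."""
    A = rng.standard_normal((N, N))
    A *= target_rho / np.max(np.abs(np.linalg.eigvals(A)))   # enforce rho(A) = target_rho
    b, c = rng.standard_normal(N), rng.standard_normal(N)
    Ak_b, mags = b.copy(), []
    for _ in range(L):
        mags.append(abs(c @ Ak_b))
        Ak_b = A @ Ak_b
    return np.array(mags)

local_kernel = kernel_magnitudes(0.5)    # small spectral radius -> local context
global_kernel = kernel_magnitudes(0.99)  # spectral radius close to 1 -> global context
print(local_kernel[::8].round(4))        # decays quickly
print(global_kernel[::8].round(4))       # decays slowly
```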

:::info
- Visualization of the kernels corresponding to the first dimension for several layers of the pre-trained model. Bins show the average decay of the forward and backward kernels.
- This illustrates the different scales of each kernel. Layers 1 and 10 capture short and extra-short range contextualizations, while Layers 4 and 7 model extra-long and long contexts, respectively.
:::
### 4.2 Architecture
### Encoder
- Our encoder consists of a stack of LOCOST layers.
 
- It is computed as follows:
1. Embedding matrix U ∈ R$^{L×H}$ is first projected onto Q,V ∈ C$^{L×H}$.
2. V is contextualized through a BiSSM.
3. A pointwise multiplication Q ⊙ BiSSM(V) acts as a first gate before passing the output through a feedforward layer.
4. This feedforward layer employs a second gating mechanism (see Figure b). For this component, we use the gated GeLU, which Shazeer (2020) showed to be effective.
- The architecture of the LOCOST layer (Figure a) resembles that of a transformer layer except that the self-attention mechanism is replaced by a gated bidirectional state-space model.
:::info
This describes the architecture of the LOCOST layer and compares it with a transformer layer:
1. The LOCOST layer has a structure similar to that of a transformer layer.
2. The difference is that the self-attention mechanism of the transformer layer is replaced by a gated bidirectional state-space model.
3. In the LOCOST layer (Figure a), the input embeddings first go through a projection, yielding Q and V.
4. V is then contextualized by a BiSSM, i.e., a bidirectional convolution combining forward and backward state-space kernels.
5. The contextualized result is gated through the pointwise multiplication Q ⊙ BiSSM(V) (the first gate).
6. It then passes through a feedforward network with a second gating mechanism (a gated feedforward layer).
7. This gated feedforward uses the gated GeLU activation, which performs well in practice.
8. In summary, apart from replacing self-attention with a gated bidirectional state-space model, the remaining components of the LOCOST layer (projection, feedforward network, normalization, etc.) are very similar to those of a transformer layer.
With this substitution, the LOCOST layer avoids the high computational complexity of self-attention while exploiting the strong long-range modeling ability of state-space models, keeping the structural advantages of a transformer layer at a lower computational cost.
:::
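A schematic NumPy sketch of the gated LOCOST layer described above (helper names such as `bissm` and `locost_layer` are made up for illustration; normalization, residual connections and the complex-valued parametrization of the real model are omitted):

```python
import numpy as np

rng = np.random.default_rng(0)
L, H = 64, 16                                 # toy sequence length and embedding dimension

def causal_conv(kernel, x):
    """FFT-based causal convolution of a length-L kernel with a length-L signal."""
    n = 2 * len(x)
    return np.fft.irfft(np.fft.rfft(kernel, n) * np.fft.rfft(x, n), n)[:len(x)]

def bissm(V, k_bwd, k_fwd):
    """Bidirectional SSM: causal convolution over the past plus cross-correlation over
    the future, with one (backward, forward) kernel pair per embedding dimension."""
    past = np.stack([causal_conv(k_bwd[:, h], V[:, h]) for h in range(V.shape[1])], axis=1)
    future = np.stack([causal_conv(k_fwd[:, h], V[::-1, h])[::-1] for h in range(V.shape[1])], axis=1)
    return past + future

def gelu(x):
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x ** 3)))

def locost_layer(U, Wq, Wv, W1, W2, W3, k_bwd, k_fwd):
    Q, V = U @ Wq, U @ Wv                     # 1. project the embeddings onto Q and V
    C = bissm(V, k_bwd, k_fwd)                # 2. contextualize V with the BiSSM
    G = Q * C                                 # 3. first gate: pointwise product Q ⊙ BiSSM(V)
    return (gelu(G @ W1) * (G @ W2)) @ W3     # 4. gated-GeLU feedforward (second gate)

U = rng.standard_normal((L, H))
Wq, Wv = rng.standard_normal((H, H)), rng.standard_normal((H, H))
W1, W2, W3 = (rng.standard_normal((H, 4 * H)), rng.standard_normal((H, 4 * H)),
              rng.standard_normal((4 * H, H)))
k_bwd = 0.9 ** np.arange(L)[:, None] * np.ones((L, H))   # toy backward kernels (one per dim)
k_fwd = 0.5 ** np.arange(L)[:, None] * np.ones((L, H))   # toy forward kernels (one per dim)
print(locost_layer(U, Wq, Wv, W1, W2, W3, k_bwd, k_fwd).shape)   # -> (L, H)
```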
- We follow Gu et al. (2022a) for the parametrization and initialization of the state-space models.
### Decoder
- <span class='red'>Since our focus is on long input summarization, the generation output length is very short compared to the input.</span>
- For decoding, we follow the practice of other efficient architectures and use a vanilla transformer decoder equipped with dense self- and cross-attention.
### Complexity
- The LOCOST layer takes **O(H$^2$L + HNL + HL log L) time** and **O(HNL) space** to compute.
## 5. Experiments
- We focus on the <span class='red'>long document abstractive summarization task as it represents a typical conditional generation problem with long input requirements</span>.
### 5.1 Experimental setup
### Approach
- We evaluate LOCOST following a classical **pre-training then fine-tuning** approach.
- For fine-tuning, we used the official train, validation and test splits of each dataset.
- We <span class='red'>train all models until convergence and select the best model based on the validation Mean ROUGE (mean of ROUGE-1/2/LSum) for test evaluation</span>.
### Metrics
- We evaluate LOCOST both with **reference-based** and **reference-free** metrics:
- **Reference-Based:**
1. For reference-based summarization evaluation, we use the traditional n-gram overlap summarization metrics **ROUGE-1/2/Lsum**. We average them into a single score (Mean ROUGE) to compare with other baselines; a small computation sketch is given after this list.
2. We also report **BERTScore** (BS), a model-based metric.
- **reference-free:**
1. we report the **BLANC** (BL) score, a metric that has been shown to <span class='red'>correlate well with human evaluations</span>.
2. We also assess the **throughput** (samples per second) and the **memory usage** (MiB of GPU RAM) of LOCOST compared with other state-of-the-art sparse transformers.
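A small sketch of the Mean ROUGE aggregate used above, based on the `rouge_score` package (an assumption on tooling; the paper's exact evaluation scripts may differ):

```python
# pip install rouge-score
from rouge_score import rouge_scorer

def mean_rouge(prediction: str, reference: str) -> float:
    """Mean of the ROUGE-1, ROUGE-2 and ROUGE-Lsum F1 scores."""
    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeLsum"], use_stemmer=True)
    scores = scorer.score(reference, prediction)
    return sum(s.fmeasure for s in scores.values()) / 3

print(mean_rouge("state-space models summarize long documents",
                 "state-space models can summarize very long documents"))
```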
### Inference
- In all of our experiments, we intentionally favored simplicity and opted for greedy decoding.
:::info
In all experiments, the authors intentionally favor simplicity and use greedy decoding. Greedy decoding is a simple decoding strategy commonly used in natural language generation: unlike more complex strategies such as beam search or sampling, it only picks the highest-probability token of the current distribution at each decoding step, without considering other candidates. This simple approach has inherent drawbacks, such as not fully exploring the solution space, but it is computationally efficient and can still produce good results in many cases. Choosing greedy decoding rather than a more complex strategy favors simplicity, presumably **to improve the model's efficiency** rather than to obtain the best possible results.
:::
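A minimal, framework-agnostic sketch of what greedy decoding does (`step_logits` is a hypothetical stand-in for the decoder's next-token distribution):

```python
import numpy as np

def greedy_decode(step_logits, bos_id, eos_id, max_len):
    """At each step, append the single most likely next token (no beams, no sampling)."""
    tokens = [bos_id]
    for _ in range(max_len):
        next_id = int(np.argmax(step_logits(tokens)))
        tokens.append(next_id)
        if next_id == eos_id:
            break
    return tokens

# Toy usage with a random "decoder" over a 10-token vocabulary.
rng = np.random.default_rng(0)
print(greedy_decode(lambda prefix: rng.standard_normal(10), bos_id=0, eos_id=1, max_len=8))
```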
### 5.2 Pre-training
### Pre-training objective
- To pre-train the model, we leverage the **gap-sentences generation** (GSG) unsupervised pre-training objective, which was introduced by PEGASUS and is well-suited for sequence-to-sequence generation.
- Unlike BART or T5 pre-training objectives, GSG endows the model with zero-shot summarization capabilities.
- GSG was successfully applied by subsequent generation models such as LongT5 and PEGASUS-X.
- <span class='red'>Namely, a document D is split into its M sentences: D = {s$_1$,...,s$_M$}. Given a ratio α, GSG then identifies K = ⌊αM⌋ sentences from D that maximize the ROUGE-1 with the rest of the document</span>:

$$U = \underset{U \subseteq \{1,\ldots,M\},\ |U| = K}{\arg\max} \; \sum_{i \in U} \mathrm{ROUGE\text{-}1}\!\left(s_i,\ D \setminus \{s_i\}\right)$$
:::success
- The resulting subset U ⊆ {1, . . . , M} splits the document into a pseudo-summary Ŷ = {s$_i$}$_{i∈U}$ and a pseudo-source D̂ = {s$_i$}$_{i∉U}$, which are used for pre-training with the standard cross-entropy loss.
:::
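A simplified sketch of the GSG split, using a plain unigram-overlap F1 as a stand-in for ROUGE-1 and scoring sentences independently (a common approximation; the helper names are hypothetical):

```python
from collections import Counter

def rouge1_like_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1, a simple stand-in for ROUGE-1."""
    c, r = Counter(candidate.lower().split()), Counter(reference.lower().split())
    overlap = sum((c & r).values())
    if overlap == 0:
        return 0.0
    p, rec = overlap / sum(c.values()), overlap / sum(r.values())
    return 2 * p * rec / (p + rec)

def gsg_split(sentences, alpha=0.2):
    """Select K = floor(alpha * M) sentences that best match the rest of the document:
    they become the pseudo-summary, the remaining sentences the pseudo-source."""
    M = len(sentences)
    K = max(1, int(alpha * M))
    scores = [rouge1_like_f1(s, " ".join(sentences[:i] + sentences[i + 1:]))
              for i, s in enumerate(sentences)]
    selected = set(sorted(range(M), key=lambda i: scores[i], reverse=True)[:K])
    pseudo_summary = [sentences[i] for i in sorted(selected)]
    pseudo_source = [sentences[i] for i in range(M) if i not in selected]
    return pseudo_source, pseudo_summary

doc = ["State-space models scale to long inputs.",
       "They replace self-attention with convolutions.",
       "The weather was nice that day.",
       "LOCOST applies them to long-document abstractive summarization."]
print(gsg_split(doc, alpha=0.25))
```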
### Pre-training data
- We pre-train the model **exclusively on the C4 dataset**, in **BF16** for **1M steps**, using an **input sequence length of 4,096** and an **output sequence length of 910**.
### Pre-training optimization
- The learning-rate scheduler is identical to T5's: an inverse square root function with the warm-up steps set to 10,000 (a small sketch of this schedule is given at the end of this section).
- We set the **GSG-ratio α = 0.2** and do not employ dropout during this phase.
- <span class='red'>We follow closely the same pre-training as LongT5.</span>
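A small sketch of the T5-style inverse square root schedule with 10,000 warm-up steps described above (the overall scaling constant is an assumption and is usually folded into the base learning rate):

```python
import math

def inverse_sqrt_lr(step: int, warmup_steps: int = 10_000) -> float:
    """T5-style schedule: constant during warm-up, then decays as 1/sqrt(step)."""
    return 1.0 / math.sqrt(max(step, warmup_steps))

for s in (1, 10_000, 40_000, 1_000_000):
    print(s, round(inverse_sqrt_lr(s), 6))
```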
### 5.3 Fine-tuning
### Fine-tuning datasets
*We evaluate LOCOST on a series of long-input abstractive summarization tasks.*
- **arXiv**
- **PubMed**
- **GovReport**
- **SummScreenFD**
- **BookSum (-Chapter & -Book)**
### Fine-tuning optimization
- We fine-tune in BF16 using a constant learning rate of 5 × 10$^{-4}$ and a dropout rate of 0.1 for all datasets.
- We experiment with lengths ranging from 4,096 to 32,768 for the input and 512 for the output, except for GovReport and BookSum-Book where we use 1024.
### Baselines
- We consider competitive <span class='red'>sparse transformers</span>, including **LED**, **BigBird**, **LongT5** and **LSG**,
- as well as <span class='red'>dense encoder-decoders</span> like **BART**, **T5** and **PEGASUS**.
- For a fair comparison, <span class='red'>we only compare to sparse-transformer architectures of equivalent size (roughly 250M parameters)</span>.
### 5.4 Results
### Long-input summarization

:::info
- Results on arXiv, PubMed and BookSum-Chapter with input lengths of 4K, 4K and 8K tokens respectively.
- % denotes the relative performance on the Mean ROUGE score w.r.t. LongT5, the best performing sparse-transformer at the given size, which is indicated as 100%.
- BS stands for BERTScore and BL for BLANC.
:::

:::info
- Results on the test set of SCROLLS for GovReport and SummScreenFD.
- L denotes the considered input length.
- % denotes the relative performance on the Mean ROUGE score w.r.t. the reference LongT5.
- We reported baselines’ results from the official SCROLLS test leaderboard.
- GovReport and SummScreenFD exhibit challenging context lengths even for sparse transformers, as shown by the memory usage during training (MEM$_{train}$) and inference (MEM$_{inf}$) of the different architectures on 16K inputs.
- ✗ means out-of-memory.
:::
- Across all datasets, LOCOST reaches up to 96% of state-of-the-art Mean ROUGE while being up to 3 times more memory-efficient than the best model LongT5 during both training and inference for 16K long inputs, e.g. on GovReport or SummScreenFD.
- The model is also twice as efficient as the local-attention transformer LED and up to 17 times more efficient than dense transformer BART at inference time.
- LOCOST significantly improves Mean ROUGE over LED and BigBird on all datasets while performing competitively with respect to LSG.
- On all datasets, the results for **LongT5** and **LED** have been obtained by fine-tuning from pre-trained checkpoints, following recommended configurations in (Guo et al., 2022) and (Beltagy et al., 2020) respectively.
- The results for **BigBird** have been reported from the original paper.
- **LSG** results are obtained from evaluating the publicly fine-tuned checkpoints on arXiv and PubMed and from our fine-tuning on BookSum-Chapter.
- GovReport and SummScreenFD results are reported from the SCROLLS test leaderboard.
### Throughput and Memory usage
- We measure the memory consumption of T5, LED, LongT5 and LOCOST on input lengths ranging from 1K to 500K tokens, at training and inference time.

:::info
Memory consumption during a typical training (forward + backward) (left) and inference iteration (only forward) (right). Batch size = 1. Ending cross means out-of-memory or architectural limitations after this point.
:::
- Compared to LongT5, the best-performing baseline, LOCOST is able to process up to 2×longer sequences during training and 16×longer at inference time.
- This correlates also with a higher throughput during both training and inference.
### Qualitative evaluation: GPT-3.5 preference
- Since our input texts are very long, performing a full human-based evaluation would be very costly and time consuming.
- Instead, <span class='red'>we perform a mock human evaluation using GPT-3.5.</span>
- This practice has been used and has shown success in summary evaluation.
- We ask the model to rate the generated summary on four dimensions: relevance, consistency, fluency, and coherence.
- We perform evaluation on 500 samples randomly taken from PubMed.

:::success
LOCOST produces summaries at a competitive level with respect to LongT5 (93-97%).
:::
### 5.5 Extrapolating to longer sequences
- Because the lengths of the inputs considered during training are often limited due to complexity issues, a desirable property for a model would be to extrapolate at inference time to sequences much longer than the ones used during training.
:::info
The article notes that, because the input lengths considered during training are often limited by computational complexity, a desirable property for a model is to extrapolate at inference time to sequences much longer than those used during training.
In other words, even though the inputs a model sees during training are limited in length, a model that can generalize to longer inputs at inference time is not constrained by the training-time length limit.
Overall, for a sequence generation model, this ability to extrapolate at inference time to sequences longer than those seen during training is a highly desirable property.
:::
- We train LOCOST on a maximum input length of 4,096 and evaluate it on the test set of arXiv with a maximum input length of 8,192 tokens.

:::info
- Extrapolating to longer sequences experiments.
- L is the training sequence size.
- Gain represents the relative Mean ROUGE (Mean-R) improvement from evaluating on 4K to 8K maximum input length.
- The ROUGE increase asserts that both models are able to generalize to input lengths unseen during training.
:::
:::success
1. LOCOST is indeed able to extrapolate to longer sequences than those employed in training.
2. **LongT5** leverages relative positional encodings, enabling extrapolation capability. However, as previously mentioned, this comes at the expense of an increased complexity compared to LOCOST.
:::
### 5.6 Extra-long sequences: towards full-book summarization
### Effect of increasing contexts during training
- LOCOST exhibits a strong capability to generalize well on sequences longer than the ones seen during training.
- Due to the reduced memory usage at both training and inference time, we analyze in this section its performance on extremely long texts, e.g., summarizing entire books.
- We consider the book-level setting of BookSum.
- We train multiple instances of LOCOST for 100 epochs on truncated books with a context length ranging from 1K to 32K and select the best model on Mean ROUGE on the validation set.
- We evaluate these models on the test set with untruncated books.

:::info
LOCOST trained on increasing sequence lengths evaluated on BookSum-Book dataset without truncation, with texts reaching up to 600K tokens.
:::
:::success
1. We found that <span class='red'>increasing the input length during training leads to an overall increase in the test Mean ROUGE scores as more contexts are being considered for optimization</span>.
2. Once more, this confirms the generalization capability of LOCOST on extra-long sequence lengths.
:::
### Results on full-book summarization
- Based on the observations above, we put our best model LOCOST-32K to the test and compare it with LongT5 and current state-of-the-art models on BookSum-Book.
- For **LongT5**, we fine-tune the available checkpoint on the maximum possible input length during training (16K) and report its performance on the longest possible input length at inference time (32K).
- For the other models, the results come from the original papers, in which the models initially produce individual summaries for each paragraph of the book and then rank them according to the model’s level of confidence.

:::info
Results on BookSum-Book. While being the smallest model, LOCOST achieves state-of-the-art on Mean ROUGE when summarizing entire books.
:::
:::success
- Despite being the model with the least number of parameters, LOCOST achieves state-of-the-art Mean ROUGE compared to LongT5 and even large variants of BART, T5 and PEGASUS.
- LOCOST is also the only model capable of processing the full documents without truncation, handling sequence lengths of up to 600K tokens.
- This reveals that <span class='red'>effectively processing full contexts without truncation can lead to strong performance enhancement</span>.
:::
## 6. Conclusion
- Our contributions are threefold:
1. We propose a new encoder-decoder architecture based on state-space models. By bypassing the self-attention mechanism used in transformers, the model enjoys a complexity of O(L log L) instead of O(L$^2$) as in traditional transformers.
2. Compared with the best-performing sparse transformers of the same size, the model achieves 93-96% of the best performance on various long document abstractive summarization while being up to 50% more memory-efficient during training and up to 87% at inference time.
3. The model is able to process entire input sequences of up to 600K tokens, a length far out of reach for sparse transformers. This allows the model to achieve a new state-of-the-art on a challenging full-book summarization task.
## 7. Limitations
- Though we investigated lightweight models for computational reasons, scaling the architecture to a larger size could be studied.
- We focused on long-document abstractive summarization; we leave the study of SSMs on other long-input abstractive tasks for future work.
- Although replacing self-attention with state-space encoders drastically reduces the computational complexity, the use of dense cross-attention in the decoder still limits the output sequence length in terms of computation during training.