# Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention
[TOC]
## 說明
排版的順序為先原文,再繁體中文,並且圖片與表格都會出現在第一次出現的段落下面
1. 原文
2. 繁體中文
3. 照片或表格
:::warning
1. 個人註解,任何的翻譯不通暢部份都請留言指導
2. 為了加速閱讀,直接採用自建的反思翻譯(Phi4-14B模型所翻譯)的結果呈現,然後快速看過,語意對了頭過身就過,多多包涵
3. 這篇論文測試利用docling將論文從pdf取出轉成markdown格式,再利用正則式取片段至dify建置的反思翻譯api取得譯文再做優化調整
4. 解釋的部份也是直接用Phi4-14B模型來針對片段做理解說明
5. 機器說明的部份我會把一些不必要的冗文刪除,這可能可以從提示詞中再來做回應的優化
:::
:::danger
* [paper hyperlink](https://arxiv.org/abs/2502.11089)
:::
## Abstract
Long-context modeling is crucial for next-generation language models, yet the high computational cost of standard attention mechanisms poses significant computational challenges. Sparse attention offers a promising direction for improving efficiency while maintaining model capabilities. We present NSA, a Natively trainable Sparse Attention mechanism that integrates algorithmic innovations with hardware-aligned optimizations to achieve efficient long-context modeling. NSA employs a dynamic hierarchical sparse strategy, combining coarse-grained token compression with fine-grained token selection to preserve both global context awareness and local precision. Our approach advances sparse attention design with two key innovations: (1) We achieve substantial speedups through arithmetic intensity-balanced algorithm design, with implementation optimizations for modern hardware. (2) We enable end-to-end training, reducing pretraining computation without sacrificing model performance. As shown in Figure 1, experiments show the model pretrained with NSA maintains or exceeds Full Attention models across general benchmarks, long-context tasks, and instruction-based reasoning. Meanwhile, NSA achieves substantial speedups over Full Attention on 64k-length sequences across decoding, forward propagation, and backward propagation, validating its efficiency throughout the model lifecycle.
長上下文建模對於下一代語言模型來說是非常重要的,然而標準注意力機制的高計算成本給效率帶來巨大挑戰。稀疏注意力(Sparse attention)提供了一個有希望的方向,可以在保持模型能力的同時提高效率。我們提出 NSA(原生訓練稀疏注意力)機制 (Natively trainable Sparse Attention mechanism),它將演算法創新與硬體最佳化相結合,實現高效的長上下文建模。NSA 採用動態分層稀疏策略 (dynamic hierarchical sparse strategy),將 coarse-grained token 壓縮與 fine-grained token 選擇相結合,以保持全域上下文感知力和局部精度。我們的方法通過兩個關鍵創新的作法推動稀疏注意力的設計:(1)我們通過算術強度平衡演算法的設計來實現顯著的加速,並對現代硬體進行最佳化。(2)我們允許端到端 (end-to-end) 的訓練,在不犧牲模型效能的情況下減少預訓練的計算。如Figure 1 所示,實驗結果表明,使用 NSA 預訓練的模型在通用的基准測試、長上下文任務和指令型推理等方面表現至少與 Full Attention models 相等或超越它。同時,NSA 在 64k 長序列的解碼、前向傳播和反向傳播方面都比 Full Attention models 實現了顯著加速,驗證了其於整個模型生命週期中的效率。

Figure 1 | Comparison of performance and efficiency between Full Attention model and our NSA. Left: Despite being sparse, NSA surpasses Full Attention baseline on average across general benchmarks, long-context tasks, and reasoning evaluation. Right: For 64k-length sequence processing, NSA achieves substantial computational speedup compared to Full Attention in all stages: decoding, forward propagation, and backward propagation.
Figure 1 | Full Attention models 與我們提出的 NSA 在準確性和效率方面的比較。左側:儘管稀疏,但 NSA 在一般基准測試、長文本任務和推理評估中均超越了 Full Attention models 的平均水平。右側:對於長度為 64k 的序列處理,NSA 在解碼、前向傳播和反向傳播的所有階段都比 Full Attention models 實現了顯著的計算效率的提升。
## 1. Introduction
The research community increasingly recognizes long-context modeling as a crucial capability for next-generation large language models, driven by diverse real-world applications ranging from in-depth reasoning (DeepSeek-AI, 2025; Zelikman et al., 2022) and repository-level code generation (Zhang et al., 2023a; Zhang et al.) to multi-turn autonomous agent systems (Park et al., 2023). Recent breakthroughs, including OpenAI's o-series models, DeepSeek-R1 (DeepSeek-AI, 2025), and Gemini 1.5 Pro (Google et al., 2024), enable models to process entire codebases and lengthy documents, maintain coherent multi-turn conversations over thousands of tokens, and perform complex reasoning across long-range dependencies. However, the high complexity (Zaheer et al., 2020) of vanilla Attention (Vaswani et al., 2017) mechanisms emerges as a critical latency bottleneck as sequence length increases. Theoretical estimates indicate that attention computation with softmax architectures accounts for 70-80% of total latency when decoding 64k-length contexts, underscoring the urgent need for more efficient attention mechanisms.
研究社群日益認識到長上下文處理能力是下一代大型語言模型(LLM)的核心,這受到各種實踐應用推動,涵蓋深度推理(DeepSeek-AI, 2025; Zelikman et al., 2022)、程式碼庫層級的程式碼生成(Zhang et al., 2023a; Zhang et al.)以及多回合自主代理系統(Park et al., 2023)。近期的技術突破,例如 OpenAI 的 o 系列模型、DeepSeek-R1(DeepSeek-AI,2025)和 Gemini 1.5 Pro(Google et al., 2024),使模型能夠處理完整的程式碼庫、長篇文件,並在數千個 token 長度的情況下維持連貫的多回合對話,同時在長期依賴性中執行複雜的推理。然而,原始注意力機制(Vaswani et al., 2017)的複雜度隨著序列長度的增加(Zaheer et al., 2020),而成為關鍵的延遲瓶頸。理論估計顯示,在解碼 64k 長度的上下文時,softmax 架構的注意力計算佔據了總處理時間的 70-80%,這更加突顯出了開發更有效率的注意力機制的重要性。
A natural approach to efficient long-context modeling is to take advantage of the inherent sparsity of softmax attention (Ge et al., 2023; Jiang et al., 2023), where selectively computing critical query-key pairs can significantly reduce computational overhead while preserving performance. Recent advances demonstrate this potential through diverse strategies: KV-cache eviction methods (Li et al., 2024; Zhang et al., 2023b; Zhou et al., 2024), blockwise KV-cache selection methods (Tang et al., 2024; Xiao et al., 2024), and sampling, clustering or hashing-based selection methods (Chen et al., 2024; Desai et al., 2024; Liu et al., 2024). Despite these promising strategies, existing sparse attention methods often fall short in practical deployments. Many approaches fail to achieve speedups comparable to their theoretical gains; also, most methods mainly focus on inference stage, lacking effective training-time support to fully exploit the sparsity patterns of attention.
一種有效處理長文本建模的自然方法,就是充分利用 softmax 注意力的內在稀疏性(inherent sparsity)(Ge et al., 2023; Jiang et al., 2023),透過選擇性計算關鍵的查詢與鍵(query-key)組合,可以顯著降低運算負擔,同時維持模型效能。近期的發展已經透過多種策略來證明這種潛力:KV-cache eviction methods (Li et al., 2024; Zhang et al., 2023b; Zhou et al., 2024)、blockwise KV-cache selection methods (Tang et al., 2024; Xiao et al., 2024)以及基於採樣、聚類或雜湊的注意力選取策略(Chen et al., 2024; Desai et al., 2024; Liu et al., 2024)。儘管這些方法具備潛力,但現有的稀疏注意力方法在實際部署中常常無法達到預期的效果。許多方法未能實現與理論收益相媲美的加速;此外,大部分方法主要集中在推理階段,缺乏訓練時期的有效支持,難以充分發揮注意力稀疏性的優勢。
To address these limitations, the deployment of effective sparse attention must tackle two key challenges: (1) Hardware-aligned inference speedup : Converting theoretical computation reductions into actual speed improvements requires hardware-friendly algorithm design during both prefilling and decoding stages to mitigate memory access and hardware scheduling bottlenecks; (2) Training-aware algorithm design : Enabling end-to-end computation with trainable operators to reduce training costs while maintaining model performance. These requirements are crucial for real-world applications to achieve fast long-context inference or training. When considering both aspects, existing methods still exhibit a noticeable gap.
為了克服這些限制,有效稀疏注意力機制的部署必須解決兩個關鍵挑戰:(1) 硬體對齊的推理加速:要將理論上的運算減少真正轉化為實際速度提升,必須在預填充與解碼階段設計符合硬體友善的演算法,以緩解記憶體存取與硬體排程的瓶頸;(2) 訓練感知的演算法設計:啟用可訓練操作以進行端到端計算,同時降低訓練成本,並同時維持模型效能。 這些要求對於實踐應用來說至關重要,才能實現快速長上下文的推理或訓練。當考慮兩個方面時,現有的方法仍存在明顯的差距。
To achieve more effective and efficient sparse attention, we present NSA, a Natively trainable Sparse Attention architecture that integrates hierarchical token modeling. As shown in Figure 2, NSA reduces per-query computation by organizing keys and values into temporal blocks and processing them through three attention paths: compressed coarse-grained tokens, selectively retained fine-grained tokens, and sliding windows for local contextual information. Then we implement specialized kernels to maximize its practical efficiency. NSA introduces two core innovations corresponding to the key requirements above: (1) Hardware-aligned system: Optimize blockwise sparse attention for Tensor Core utilization and memory access, ensuring balanced arithmetic intensity. (2) Training-aware design: Enable stable end-to-end training through efficient algorithms and backward operators. This optimization enables NSA to support both efficient deployment and end-to-end training.
為了實現更有效率和高效的稀疏注意力,我們提出了一個名為 NSA 的新模型,其整合了分層 token 模型。如Figure 2 所示,NSA 通過將 Key 與 Value 組織成時間區塊(temporal blocks)並通過三個注意力路徑進行處理來減少每個查詢的計算量:coarse-grained token、選擇性保留 fine-grained token 和滑動窗口以獲取局部上下文信息。然後我們實現專用的 kernels以最大化提高其實際效率。NSA 引入兩個核心創新,滿足上述關鍵需求:(1) 硬體對齊的系統設計:針對 Tensor Core 的使用率與記憶體存取進行區塊式稀疏注意力的優化,確保計算密度的平衡。(2) 訓練感知設計:透過高效的演算法和反向運算器實現穩定的端到端訓練。這種最佳化使 NSA 可以支持高效部署和端到端的訓練。

Figure 2 | Overview of NSA's architecture. Left: The framework processes input sequences through three parallel attention branches: For a given query, preceding keys and values are processed into compressed attention for coarse-grained patterns, selected attention for important token blocks, and sliding attention for local context. Right: Visualization of different attention patterns produced by each branch. Green areas indicate regions where attention scores need to be computed, while white areas represent regions that can be skipped.
Figure 2 | NSA 架構概覽。左側:框架以三個並行注意力分支處理輸入序列:對於給定查詢,前序鍵值會被處理成壓縮注意力擷取 coarse-grained 模式、選擇性注意力關注重要詞元區塊,以及滑動注意力捕捉局部上下文。右側:各分支產生的不同注意力模式可視化圖。綠色區域表示需要計算注意力分數的區域,白色區域代表可以跳過的區域。
We evaluate NSA through comprehensive experiments on real-world language corpora. Pretraining on a 27B-parameter transformer backbone with 260B tokens, we assess NSA's performance across general language evaluations, long-context evaluations, and chain-of-thought reasoning evaluation. We further compare the kernel speed on A100 GPUs with optimized Triton (Tillet et al., 2019) implementations. Experimental results demonstrate that NSA achieves comparable or superior performance to full attention baseline, while outperforming existing sparse attention approaches. Additionally, NSA delivers substantial speedups across decoding, forward, and backward stages compared to Full Attention, with the speedup ratio increasing for longer sequences. These results validate that our hierarchical sparse attention design effectively balances model capability and computational efficiency.
我們在真實世界語言語料庫上全方位實驗評估 NSA 的效能。在使用 27B 個參數的 Transformer 主幹模型(訓練資料量為 260B 個 tokens)進行預訓練後,我們在通用語言任務、長上下文任務及鏈式思維推理任務等方面評估了 NSA 的表現。我們進一步比較在 A100 GPU 上與最佳化後的 Triton(Tillet et al., 2019)實作之間的效能差異。實驗結果顯示,NSA 在效能上可與全注意力基線相媲美甚至超越,同時勝過現有的稀疏注意力方法。此外,與 Full Attention models 相比,NSA 在解碼、前向傳播和反向傳播階段均取得顯著的加速效果,序列愈長,其加速比愈高。這些結果驗證了我們的分層稀疏注意力設計有效地平衡了模型效能與計算效率。
## 2. Rethinking Sparse Attention Methods
Modern sparse attention methods have made significant strides in reducing the theoretical computational complexity of transformer models. However, most approaches predominantly apply sparsity during inference while retaining a pretrained Full Attention backbone, potentially introducing architectural bias that limits their ability to fully exploit sparse attention's advantages. Before introducing our native sparse architecture, we systematically analyze these limitations through two critical lenses.
現代稀疏注意力機制在降低 transformer models 的理論計算複雜度方面取得了重大的進展。然而,多數方法主要是在推理階段應用稀疏性,同時保留預訓練的全注意力主幹架構,這可能會引入架構上的偏差,限制其充分利用稀疏注意力的優勢。在介紹我們的原生稀疏架構之前,我們將從兩個關鍵視角系統性地分析這些侷限性。
## 2.1. The Illusion of Efficient Inference
Despite achieving sparsity in attention computation, many methods fail to achieve corresponding reductions in inference latency, primarily due to two challenges:
儘管有不少方法可以在注意力計算中實現稀疏性,但由於下面兩個挑戰造成無法有效減少推理延遲:
**Phase-Restricted Sparsity.** Methods such as H2O (Zhang et al., 2023b) apply sparsity during autoregressive decoding while requiring computationally intensive pre-processing (e.g. attention map calculation, index building) during prefilling. In contrast, approaches like MInference (Jiang et al., 2024) focus solely on prefilling sparsity. These methods fail to achieve acceleration across all inference stages, as at least one phase retains computational costs comparable to Full Attention. This phase specialization limits the speedup these methods can deliver in prefilling-dominated workloads such as book summarization and code completion, or in decoding-dominated workloads such as long chain-of-thought (Wei et al., 2022) reasoning.
**Phase-Restricted Sparsity.** 諸如 H2O (Zhang et al., 2023b) 等方法在自回歸解碼期間應用稀疏性,但在預填充階段仍需執行運算密集的前處理操作(如注意力映射計算與索引構建)。相比之下,MInference(Jiang et al., 2024)等方法僅針對預填充階段實施稀疏性。這類方法無法涵蓋推理過程中的所有階段,導致至少有一個階段的計算成本與完整注意力機制相當。由於稀疏性侷限於單一階段,這些方法在以預填充為主(如書籍摘要、程式碼補全)或以解碼為主(如長鏈式思維(Wei et al., 2022))的工作負載中,其加速效益將受到明顯限制。
**Incompatibility with Advanced Attention Architecture.** Some sparse attention methods fail to adapt to modern decoding-efficient architectures like Multi-Query Attention (MQA) (Shazeer, 2019) and Grouped-Query Attention (GQA) (Ainslie et al., 2023), which significantly reduce the memory access bottleneck during decoding by sharing KV across multiple query heads. For instance, in approaches like Quest (Tang et al., 2024), each attention head independently selects its KV-cache subset. Although it demonstrates consistent computation sparsity and memory access sparsity in Multi-Head Attention (MHA) models, it presents a different scenario in models based on architectures like GQA, where the memory access volume of KV-cache corresponds to the union of selections from all query heads within the same GQA group. This architectural characteristic means that while these methods can reduce computation operations, the required KV-cache memory access remains relatively high. This limitation forces a critical choice: while some sparse attention methods reduce computation, their scattered memory access pattern conflicts with efficient memory access design from advanced architectures.
**Incompatibility with Advanced Attention Architecture.** 部分稀疏注意力方法難以相容現代高效解碼的架構,例如 Multi-Query Attention (MQA) (Shazeer, 2019) 和 Grouped-Query Attention (GQA) (Ainslie et al., 2023),這些架構通過在多個 query head 間共享 KV 的方式來顯著減少解碼過程中的記憶體存取瓶頸。例如,Quest (Tang et al., 2024) 這樣的演算法,每個注意力頭都獨立地選擇其 KV-cache subset。儘管它在多頭注意力 (MHA) 模型中展現一致的計算稀疏性和記憶體存取稀疏性,但在 GQA 等架構基礎的模型中表現不同,因為 KV-cache 的記憶體存取量等於同一組內所有 query head 選擇的併集。這種架構特性意味著,儘管這些方法可以減少運算量,但所需的 KV-cache 記憶體存取仍相對較高。這個限制迫使我們做出重要抉擇:雖然一些稀疏注意力方法可以減小計算成本,但其分散的記憶體存取模式與先進架構高效的記憶體存取設計相衝突。
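To make the union effect above concrete, here is a toy sketch (with assumed numbers for heads per group, blocks selected per head, and total candidate blocks; none of these figures come from the paper) that counts how many distinct KV blocks a GQA group must load when each head selects blocks independently versus when the selection is shared across the group.

```python
import numpy as np

# Assumed toy numbers: 16 query heads per GQA group, each selecting 16 blocks
# out of 1024 candidate KV blocks. None of these values come from the paper.
rng = np.random.default_rng(0)
H, n, num_blocks = 16, 16, 1024

per_head = [set(rng.choice(num_blocks, size=n, replace=False)) for _ in range(H)]
union = set().union(*per_head)      # blocks the whole group must actually load

print(f"independent selection loads ~{len(union)} blocks (upper bound {H * n})")
print(f"shared selection loads exactly {n} blocks")
```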
These limitations arise because many existing sparse attention methods focus on KV-cache reduction or theoretical computation reduction, but struggle to achieve significant latency reduction in advanced frameworks or backends. This motivates us to develop algorithms that combine both advanced architectural and hardware-efficient implementation to fully leverage sparsity for improving model efficiency.
這些限制源於許多現有的稀疏注意力方法專注於 KV-cache 的減少或理論計算的減少,但在先進框架或後端卻是難以實現顯著的延遲降低。這促使我們開發結合先進架構與硬體高效實作的演算法,以充分利用稀疏性來提升模型效率。
## 2.2. The Myth of Trainable Sparsity
Our pursuit of native trainable sparse attention is motivated by two key insights from analyzing inference-only approaches: (1) Performance Degradation : Applying sparsity post-hoc forces models to deviate from their pretrained optimization trajectory. As demonstrated by Chen et al. (2024), top 20% attention can only cover 70% of the total attention scores, rendering structures like retrieval heads in pretrained models vulnerable to pruning during inference. (2) Training Efficiency Demands : Efficient handling of long-sequence training is crucial for modern LLM development. This includes both pretraining on longer documents to enhance model capacity, and subsequent adaptation phases such as long-context fine-tuning and reinforcement learning. However, existing sparse attention methods primarily target inference, leaving the computational challenges in training largely unaddressed. This limitation hinders the development of more capable long-context models through efficient training. Additionally, efforts to adapt existing sparse attention for training also expose challenges:
我們追求原生可訓練稀疏注意力的目標源於分析推理型方法時獲得的兩個關鍵洞察: (1) 效能退化:事後應用稀疏性迫使模型偏離其預先訓練的最佳化軌跡。如同陳等人的研究(2024)所示,排名前 20% 的注意力僅能覆蓋總注意力分數的 70%,這使得預訓練模型中的檢索頭結構在推理階段易受剪枝影響。(2)訓練效率需求:高效處理長序列訓練對於現代大型語言模型 (LLM) 的發展至關重要。這包括在更長的文獻上進行預訓練以增強模型能力,以及後續的調適階段,例如長上下文微調和強化學習。然而,現有的稀疏注意力方法主要針對推理,並沒有有效解決訓練過程中的計算挑戰。這個限制阻礙了通過高效訓練開發更具能力的長上下文模型。此外,將現有稀疏注意力應用於訓練也引發了一些新的挑戰:
**Non-Trainable Components.** Discrete operations in methods like ClusterKV (Liu et al., 2024) (includes k-means clustering) and MagicPIG (Chen et al., 2024) (includes SimHash-based selecting) create discontinuities in the computational graph. These non-trainable components prevent gradient flow through the token selection process, limiting the model's ability to learn optimal sparse patterns.
**Non-Trainable Components.** 類似 ClusterKV(Liu et al., 2024,包含 k-means clustering)與 MagicPIG(Chen et al., 2024,包含 SimHash-based selecting)等方法中的離散操作,會在計算圖中造成不連續。這些無法訓練的組件阻斷了梯度在 token 選擇過程中的流動,從而限制了模型學習最適稀疏模式的能力。
**Inefficient Back-propagation.** Some theoretically trainable sparse attention methods suffer from practical training inefficiencies. Token-granular selection strategy used in approaches like HashAttention (Desai et al., 2024) leads to the need to load a large number of individual tokens from the KV cache during attention computation. This non-contiguous memory access prevents efficient adaptation of fast attention techniques like FlashAttention, which rely on contiguous memory access and blockwise computation to achieve high throughput. As a result, implementations are forced to fall back to low hardware utilization, significantly degrading training efficiency.
**Inefficient Back-propagation.** 某些在理論上可訓練的稀疏注意力方法,在實際訓練中仍面臨效率低落的問題。例如 HashAttention(Desai et al., 2024)所採用的 Token-granular selection strategy,會在注意力計算期間導致需要從 KV cache 中載入大量不連續的個別的 tokens。這種非連續性的記憶體存取模式,使得像 FlashAttention 這類仰賴連續記憶體與區塊式運算來實現高吞吐量的快速注意力演算法,難以被有效套用。因此,實作上被迫退回到硬體資源利用率較低的替代方式,進一步導致訓練效率大幅下降。
## 2.3. Native Sparsity as an Imperative
These limitations in inference efficiency and training viability motivate our fundamental redesign of sparse attention mechanisms. We propose NSA, a natively sparse attention framework that addresses both computational efficiency and training requirements. In the following sections, we detail the algorithmic design and operator implementation of NSA.
這些推論效率和訓練可行性上的限制,促使我們對稀疏注意力機制進行根本性的重新設計。我們提出 NSA,一個原生稀疏的注意力框架,能夠同時兼顧計算效率和訓練需求。接下來的各節將詳細介紹 NSA 的演算法設計及運算元實作。
## 3. Methodology
Our technical approach spans algorithm design and kernel optimization. In the following subsections, we first introduce the background of our methodology. Then we present the overall framework of NSA, followed by its key algorithmic components. Finally, we detail our hardware-optimized kernel design that maximizes practical efficiency.
我們的技術策略涵蓋演算法設計和核心的最佳化。在以下子節中,我們將首先介紹我們方法的背景。接著,我們會說明 NSA 的整體架構,並闡述其關鍵演算法組成部分。最後,我們詳細介紹了我們的硬體最佳化的核心設計,以最大效率提升實際效能。
## 3.1. Background
**Attention Mechanism** is widely used in language modeling where each query token $\mathbf{q}_t$ computes relevance scores against all preceding keys $\mathbf{k}_{:t}$ to generate a weighted sum of values $\mathbf{v}_{:t}$ . Formally, for an input sequence of length $t$ , the attention operation is defined as:
$$
\mathbf{o}_t = \text{Attn}(\mathbf{q}_t, \mathbf{k}_{:t}, \mathbf{v}_{:t}) \tag{1}
$$
**注意力機制** 廣泛應用於語言建模中,其中每個 query token $\mathbf{q}_t$ 會計算與所有前序的 keys $\mathbf{k}_{:t}$ 的關聯性分數,進而生成 values $\mathbf{v}_{:t}$ 的加權和。正式地,對於長度為 $t$ 的輸入序列,注意力計算定義如下:
$$
\mathbf{o}_t = \text{Attn}(\mathbf{q}_t, \mathbf{k}_{:t}, \mathbf{v}_{:t}) \tag{1}
$$
GPT4o補充說明:
| 元素 | 說明 |
|------------------|----------------------------------------------------------------------|
| $\mathbf{o}_t$ | 第 $t$ 個位置的輸出向量(output vector),通常為 attention 機制的輸出。 |
| $\text{Attn}$ | 注意力函數(Attention function),例如 softmax attention 或 sparse attention。 |
| $\mathbf{q}_t$ | 第 $t$ 個位置的查詢向量(query vector),代表查詢目前所需的資訊。 |
| $\mathbf{k}_{:t}$ | 從第 $1$ 到第 $t$ 個位置的鍵向量(key vectors),冒號 `:` 表示切片範圍。 |
| $\mathbf{v}_{:t}$ | 與 $\mathbf{k}_{:t}$ 對應的值向量(value vectors),用於生成加權輸出。 |
> 此式代表 **自回歸(causal)self-attention**,即第 $t$ 個輸出僅依賴於時間步 $1$ 到 $t$ 的資訊,不包含未來($t+1$ 以後)的資訊。
where Attn denotes the attention function:
$$
\text{Attn}(\mathbf{q}_t, \mathbf{k}_{:t}, \mathbf{v}_{:t}) = \sum^t_{i=1}\dfrac{\alpha_{t,i}\mathbf{v}_i}{\sum^t_{j=1}\alpha_{t,j}}, \quad \alpha_{t,i}=e^{\dfrac{\mathbf{q}^\top_t\mathbf{k}_i}{\sqrt{d_k}}} \tag{2}
$$
其中,Attn 表示注意力機制:
$$
\text{Attn}(\mathbf{q}_t, \mathbf{k}_{:t}, \mathbf{v}_{:t}) = \sum^t_{i=1}\dfrac{\alpha_{t,i}\mathbf{v}_i}{\sum^t_{j=1}\alpha_{t,j}}, \quad \alpha_{t,i}=e^{\dfrac{\mathbf{q}^\top_t\mathbf{k}_i}{\sqrt{d_k}}} \tag{2}
$$
GPT4o補充說明:
| 元素 | 說明 |
|-----------------------------|------------------------------------------------------------------------------------------|
| $\text{Attn}(\cdot)$ | 表示注意力函數,其輸入為 query、key、value 三個向量組。 |
| $\mathbf{q}_t$ | 第 $t$ 個位置的查詢向量(query vector)。 |
| $\mathbf{k}_{:t}, \mathbf{v}_{:t}$ | 到第 $t$ 為止的所有 key 和 value 向量,符合自回歸注意力的因果性(causal masking)。 |
| $\alpha_{t,i}$ | 表示 query $\mathbf{q}_t$ 與第 $i$ 個 key 向量 $\mathbf{k}_i$ 的相似度分數,透過點積加上縮放。 |
| $\mathbf{q}_t^\top \mathbf{k}_i$ | 表示 query 和 key 的 dot product(點積)結果,代表相似程度。 |
| $\sqrt{d_k}$ | $d_k$ 是 key 向量的維度,做平方根後用於縮放,防止點積值過大造成 softmax 梯度消失或爆炸。 |
| $\sum_{j=1}^t \alpha_{t,j}$ | 對所有權重 $\alpha$ 做歸一化處理,使注意力權重為 softmax-like 分佈。 |
| $\sum_{i=1}^t \dfrac{\alpha_{t,i} \mathbf{v}_i}{\sum \alpha}$ | 表示根據注意力權重對所有 value 向量做加權平均,即為最終的 attention 輸出。 |
這表示在第 $t$ 個位置,模型會計算當前查詢向量 $\mathbf{q}_t$ 與所有過去(從 $1$ 到 $t$)鍵向量 $\mathbf{k}_i$ 的相似度分數 $\alpha_{t,i}$,並對對應的值向量 $\mathbf{v}_i$ 做加權平均。權重是透過將點積結果 $\mathbf{q}_t^\top \mathbf{k}_i$ 除以 $\sqrt{d_k}$ 後取指數,再經過歸一化而得。
Here, $\alpha_{t,i}$ , represents the attention weight between $\mathbf{q}_t$ and $\mathbf{k}_i$ , and $d_k$ is the feature dimension of keys. As sequence length increases, attention computation becomes increasingly dominant in the overall computational cost, presenting significant challenges for long-context processing.
這裡,$\alpha_{t,i}$ 代表 $\mathbf{q}_t$ 和 $\mathbf{k}_i$ 之間的注意力權重,而 $d_k$ 是 keys 的特徵維度。隨著序列長度的增加,注意力計算在整體運算成本中占據越來越大的比重,這對處理長上下文任務帶來了顯著挑戰。
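As a minimal illustration of Equations (1)-(2), the NumPy sketch below computes causal attention for a single query position $t$; the shapes, the max-subtraction for numerical stability, and the toy data are illustrative choices rather than the paper's implementation.

```python
import numpy as np

def causal_attention_at_t(q_t, K, V):
    """q_t: (d_k,), K: (t, d_k), V: (t, d_v) -> output o_t: (d_v,)."""
    d_k = q_t.shape[-1]
    scores = K @ q_t / np.sqrt(d_k)           # q_t^T k_i / sqrt(d_k), i = 1..t
    alpha = np.exp(scores - scores.max())     # numerically stable exponentials
    weights = alpha / alpha.sum()             # softmax normalization over i
    return weights @ V                        # weighted sum of values, Eq. (2)

# toy usage
rng = np.random.default_rng(0)
t, d_k, d_v = 8, 4, 4
o_t = causal_attention_at_t(rng.normal(size=d_k),
                            rng.normal(size=(t, d_k)),
                            rng.normal(size=(t, d_v)))
```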
**Arithmetic Intensity** is the ratio of compute operations to memory accesses. It intrinsically shapes algorithm optimization on hardware. Each GPU has a critical arithmetic intensity determined by its peak compute capability and memory bandwidth, calculated as the ratio of these two hardware limits. For computation tasks, arithmetic intensity above this critical threshold becomes compute-bound (limited by GPU FLOPS), while below it becomes memory-bound (limited by memory bandwidth).
算術強度(Arithmetic Intensity)是指計算操作與記憶存取的比率。它本質上影響著演算法在硬體上的最佳化。每個 GPU 都有一個由其峰值計算能力與記憶體頻寬所決定的臨界算術強度,計算公式為這兩個硬體極限的比值。對於計算任務來說,如果算術強度超過此臨界閾值,則為[受計算界限](https://terms.naer.edu.tw/detail/ceec1461ce15ed40d7609fbc7af733aa/)(受 GPU FLOPS 限制);反之,若低於此閾值,則為[記憶體界限](https://terms.naer.edu.tw/detail/7d81fec708dbeb607d46d8d804fa9a1f/)(受到記憶體頻寬的限制)。
Specifically for causal self-attention mechanism, during training and prefilling phases, batched matrix multiplications and attention computations exhibit high arithmetic intensity, making these stages compute-bound on modern accelerators. In contrast, auto-regressive decoding becomes memory-bandwidth constrained because it generates one token per forward pass while requiring loading the entire key-value cache, resulting in low arithmetic intensity. This leads to different optimization goals - reducing computation cost during training and prefilling, while reducing memory access during decoding.
針對因果自注意力機制而言,在訓練和預填充階段,批量矩陣乘法與注意力計算呈現高算術強度,使得這些階段在現代加速器上成為計算瓶頸。相對地,自迴歸解碼因為每一次的 forward pass 僅生成一個符號,同時需要載入整個 key-value cache,導致低算術強度,進而造成記憶體頻寬受限。因此,不同階段的最佳化目標也有所不同:訓練和預填充階段降低計算成本,而解碼階段則降低記憶體存取。
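A back-of-the-envelope sketch of the roofline argument above. The peak-FLOPS and bandwidth figures are assumptions (roughly nominal A100 BF16 Tensor Core throughput and HBM bandwidth), used only to show how the critical arithmetic intensity separates compute-bound from memory-bound kernels.

```python
# Assumed device numbers, not measurements from the paper.
PEAK_FLOPS = 312e12      # ~312 TFLOPS (BF16 Tensor Core), assumed
MEM_BW     = 2.0e12      # ~2.0 TB/s HBM bandwidth, assumed

critical_intensity = PEAK_FLOPS / MEM_BW   # ~156 FLOPs per byte

def bound_regime(flops, bytes_moved):
    """Classify a kernel as compute- or memory-bound on this (assumed) device."""
    intensity = flops / bytes_moved
    return "compute-bound" if intensity >= critical_intensity else "memory-bound"

# Decoding one token against a long KV cache does O(t*d) FLOPs while also moving
# O(t*d) bytes, so its intensity is a small constant -> memory-bound, whereas
# batched prefilling/training matmuls reuse each loaded byte many times -> compute-bound.
```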
## 3.2. Overall Framework
To leverage the potential of attention with natural sparse pattern, we propose replacing the original key-value pairs $\mathbf{k}_{:t},\mathbf{v}_{:t}$ in Equation (1) with a more compact and information-dense set of representation key-value pairs $\tilde{K}_{t},\tilde{V}_t$ given each query $\mathbf{q}_t$ . Specifically, we formally define the optimized attention output as follows:
$$
\tilde{K}_t=f_K(\mathbf{q}_t,\mathbf{k}_{:t},\mathbf{v}_{:t}),\quad\tilde{V}_t=f_V(\mathbf{q}_t,\mathbf{k}_{:t},\mathbf{v}_{:t})\tag{3}
$$
$$
\mathbf{o}^*_t=\text{Attn}(\mathbf{q}_t, \tilde{K}_t,\tilde{V}_t)\tag{4}
$$
為了發揮注意力機制在原生稀疏模式下的潛力,我們建議將方程式 (1) 中原始的 key-value pairs $\mathbf{k}_{:t},\mathbf{v}_{:t}$ 替換為針對每個查詢向量 $\mathbf{q}_t$ 所對應的、更緊湊且資訊密度更高的表示性 key-value 對 $\tilde{K}_t,\tilde{V}_t$。具體而言,我們將優化後的注意力輸出正式定義如下:
$$
\tilde{K}_t=f_K(\mathbf{q}_t,\mathbf{k}_{:t},\mathbf{v}_{:t}),\quad\tilde{V}_t=f_V(\mathbf{q}_t,\mathbf{k}_{:t},\mathbf{v}_{:t})\tag{3}
$$
$$
\mathbf{o}^*_t=\text{Attn}(\mathbf{q}_t, \tilde{K}_t,\tilde{V}_t)\tag{4}
$$
:::warning
快速回憶方程式(1)
$$
\mathbf{o}_t = \text{Attn}(\mathbf{q}_t, \mathbf{k}_{:t}, \mathbf{v}_{:t}) \tag{1}
$$
:::
where $\tilde{K}_t, \tilde{V}_t$ are dynamically constructed based on the current query $\bf{q}_t$ and the contextual memory $\bf{k}_{:t},\bf{v}_{:t}$ . We can design various mapping strategies to get different categories of $\tilde{K}^c_t,\tilde{V}^c_t$, and combine them as follows:
$$
\bf{o}^*_t=\sum_{c\in C}g^c_t\cdot\text{Attn}(\bf{q}_t,\tilde{K}^c_t,\tilde{V}^c_t)\tag{5}
$$
其中 $\tilde{K}_t, \tilde{V}_t$ 是基於當前查詢 $\bf{q}_t$ 和上下文記憶 $\bf{k}_{:t},\bf{v}_{:t}$ 動態構造的。我們可以設計各種映射策略來獲得不同類別的 $\tilde{K}^c_t,\tilde{V}^c_t$,並將它們結合起來:
$$
\bf{o}^*_t=\sum_{c\in C}g^c_t\cdot\text{Attn}(\bf{q}_t,\tilde{K}^c_t,\tilde{V}^c_t)\tag{5}
$$
:::warning
* $C$:映射策略集合(多種稀疏 attention 類型)
* $c \in C$:第 $c$ 種注意力策略
* $g^c_t$:對第 $c$ 種策略的加權係數,通常由 gating 機制(如 softmax)決定
* $\text{Attn}(\mathbf{q}_t, \tilde{K}^c_t, \tilde{V}^c_t)$:使用第 $c$ 組壓縮 key-value 所得到的注意力輸出
:::
As illustrated in Figure 2, NSA has three mapping strategies $C = \left\{ \text{cmp} , \text{slc}, \text{win} \right\}$, representing compression, selection, and sliding window for keys and values. $g^c_t\in[0,1]$ is the gate score for the corresponding strategy $c$, derived from input features via an MLP and sigmoid activation. Let $N_t$ denote the total number of remapped keys/values:
$$
N_t=\sum_{c\in C}\text{size}[\tilde{K}^c_t]\tag{6}
$$
如 Figure 2 所示,NSA 使用三種映射策略 $C = \left\{ \text{cmp} , \text{slc}, \text{win} \right\}$,分別代表對 keys、values 做壓縮、選擇和滑動窗口的處理。 其中, $g^c_t\in[0,1]$ 表示每個策略 $c$ 的門控分數,由輸入特徵經多層感知機 (MLP) 和 sigmoid activation 計算而得。 假設 $N_t$ 表示remapped keys/values 的總數:
$$
N_t=\sum_{c\in C}\text{size}[\tilde{K}^c_t]\tag{6}
$$
:::warning
* $\sum_{c\in C}$:所有映射策略進行總和計算
* $\text{size}[\tilde{K}^c_t]$:所有映射策略通道$c$中,針對查詢 $\bf{q}_t$ 所生成的壓縮大小
在時間步 $t$ 時,總共參與注意力運算的壓縮 key 數量 $N_t$,定義為所有通道 $c \in C$ 中對應壓縮 key 表示 $\tilde{K}^c_t$ 的大小之總和
:::
We maintain a high sparsity ratio by ensuring $N_t \ll t$ .
我們透過確保 $N_t \ll t$ 的方式來維持高稀疏率。
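A minimal sketch of how Equations (5)-(6) combine the three remapped key/value sets, assuming the gate scores $g^c_t$ have already been produced by an MLP with sigmoid activation over the input features (the gate network here is a made-up single layer for illustration); all names and shapes are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def attn(q_t, K, V):
    """Plain softmax attention of one query against a small key/value set."""
    w = np.exp(K @ q_t / np.sqrt(q_t.shape[-1]))
    return (w / w.sum()) @ V

def nsa_combine(q_t, branches, gates):
    """branches: list of (K_c, V_c) for c in {cmp, slc, win}; gates: g_t^c in [0,1]."""
    o_t = sum(g * attn(q_t, K_c, V_c) for g, (K_c, V_c) in zip(gates, branches))  # Eq. (5)
    N_t = sum(K_c.shape[0] for K_c, _ in branches)                                # Eq. (6)
    return o_t, N_t

# toy usage with made-up shapes; the gate "MLP" is a single linear layer + sigmoid
rng = np.random.default_rng(0)
d_k, d_v = 4, 4
q_t = rng.normal(size=d_k)
branches = [(rng.normal(size=(m, d_k)), rng.normal(size=(m, d_v))) for m in (6, 8, 4)]
x_t = rng.normal(size=5)
W_gate, b_gate = rng.normal(size=(5, 3)), np.zeros(3)
gates = sigmoid(x_t @ W_gate + b_gate)     # one gate per branch
o_t, N_t = nsa_combine(q_t, branches, gates)
```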
## 3.3. Algorithm Design
In this subsection, we introduce the design of our remapping strategies $f_K$ and $f_V$ : token compression, token selection, and sliding window.
在這個小節中,我們介紹我們的重映射策略 $f_K$ 和 $f_V$ 的設計:token compression、token selection 和 sliding window。
## 3.3.1. Token Compression
By aggregating sequential blocks of keys or values into block-level representations, we obtain compressed keys and values that capture the information of the entire block. Formally, the compressed key representation is defined as:
$$
\tilde{K}_t^{\text{cmp}} = f^{\text{cmp}}_K(\mathbf{K}_{:t}) =
\left\{ \varphi\left( \mathbf{k}_{id+1:id+l} \right)
\middle\vert 0 \leq i \leq \left\lfloor \dfrac{t-l}{d} \right\rfloor \right\} \tag{7}
$$
透過將 keys 或 values 的連續區塊彙總為區塊的表示(block-level representations),我們可以得到壓縮後的 keys 和 values,這些表示能夠捕捉整個區塊中的資訊。正式來說,壓縮後的 key representation 定義如下:
$$
\tilde{K}_t^{\text{cmp}} = f^{\text{cmp}}_K(\mathbf{K}_{:t}) =
\left\{ \varphi\left( \mathbf{k}_{id+1:id+l} \right)
\middle\vert 0 \leq i \leq \left\lfloor \dfrac{t-l}{d} \right\rfloor \right\} \tag{7}
$$
where $l$ is the block length, $d$ is the sliding stride between adjacent blocks, and $\varphi$ is a learnable MLP with intra-block position encoding to map keys in a block to a single compressed key. $\tilde{K}_t^{\text{cmp}}\in\mathbb{R}^{d_k\times \lfloor\dfrac{t-l}{d}\rfloor}$ is the tensor composed of compressed keys. Usually, we adopt $d<l$ to mitigate information fragmentation. An analogous formulation holds for the compressed value representation $\tilde{V}_t^{\text{cmp}}$. Compressed representations capture coarser-grained, higher-level semantic information and reduce the computational burden of attention.
其中,$l$ 表示區塊的長度、$d$ 表示相鄰區塊之間滑動步幅,而 $\varphi$ 是個具有區塊內位置編碼的可學習多層感知器 (MLP),用於將每個區塊中的鍵映射為單一壓縮後的 key。$\tilde{K}_t^{\text{cmp}}\in\mathbb{R}^{d_k\times \lfloor\dfrac{t-l}{d}\rfloor}$ 為由 compressed keys 所組成的張量。通常,我們會採用 $d<l$ 的設置來減輕資訊碎片化問題。類似的公式也適用於壓縮後的 value representation $\tilde{V}_t^{\text{cmp}}$。這些壓縮表示能夠捕捉較粗粒度的高層語義資訊,並降低注意力計算的負擔。
:::warning
GPT解釋:
* $\tilde{K}_t^{\text{cmp}} = f^{\text{cmp}}_K(\mathbf{K}_{:t})$:將前 $t$ 個原始 key 向量 $\mathbf{K}_{:t}$(即 $\mathbf{K}_1$ 到 $\mathbf{K}_t$)經過某個壓縮函數 $f_K^{\text{cmp}}$,生成壓縮後的 key 表示 $\tilde{K}_t^{\text{cmp}}$,簡單說就是我們針對前 $t$ 個 key 做壓縮,得到壓縮後的 key 序列
* $\left\{ \varphi\left( \mathbf{k}_{id+1:id+l} \right) \middle\vert 0 \leq i \leq \left\lfloor \dfrac{t-l}{d} \right\rfloor \right\}$
* 區塊選擇 $\mathbf{k}_{id+1:id+l}$:表示從原始 key 序列中擷取一段長度為 $l$ 的區塊,從索引 $id+1$ 到 $id+l$,其中 $i$ 為區塊編號,$d$ 是滑動步長,所以這是在做滑動區塊採樣?
* 映射函數 $\varphi(\cdot)$:將這段 key 區塊(含 $l$ 個 key 向量)餵給一個可學習函數 $\varphi$(即帶位置編碼的 MLP),映射成一個單一壓縮 key
* 索引範圍 $0 \leq i \leq \left\lfloor \frac{t - l}{d} \right\rfloor$:$i$ 的最大值保證區塊不會超出 $t$ 的邊界,使用整數下取(floor)是為了確保每個區塊都完整
公式 (7) 描述了如何以滑動區塊方式從原始 key 序列中抽取每一段長度為 $l$ 的 key 區塊,並透過一個帶位置編碼的多層感知器($\varphi$)壓縮為單一表示向量。這樣的區塊每次滑動 $d$ 個位置,直到時間步長 $t$ 為止。
:::
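The sketch below walks through Equation (7): slide over the key sequence with block length $l$ and stride $d$, and map each block to a single compressed key. The flatten-and-project step and the zero position encoding are stand-ins for the paper's learnable MLP $\varphi$ with intra-block position encoding; they are placeholders for illustration only.

```python
import numpy as np

def compress_keys(K, l=32, d=16, W=None, pos=None):
    """K: (t, d_k) -> compressed keys: (num_blocks, d_k), num_blocks = floor((t-l)/d) + 1."""
    t, d_k = K.shape
    if W is None:
        W = np.eye(l * d_k, d_k) / l        # placeholder for phi's learnable weights
    if pos is None:
        pos = np.zeros((l, d_k))            # placeholder intra-block position encoding
    blocks = []
    for i in range((t - l) // d + 1):
        blk = K[i * d : i * d + l] + pos    # tokens id+1 .. id+l of Eq. (7), 0-indexed slice
        blocks.append(blk.reshape(-1) @ W)  # phi: map one block to one compressed key
    return np.stack(blocks)                 # K_tilde^cmp

K = np.random.default_rng(0).normal(size=(128, 8))
K_cmp = compress_keys(K, l=32, d=16)        # shape (7, 8): floor((128-32)/16)+1 blocks
```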
## 3.3.2. Token Selection
Using only compressed keys, values might lose important fine-grained information, motivating us to selectively preserve individual keys, values. Below we describe our efficient token selection mechanism that identifies and preserves the most relevant tokens with low computational overhead.
僅使用壓縮後的 keys、values 可能會導致重要細微資訊的遺失,這促使我們需要選擇性地保留個別的 keys 和 values。下面我們將介紹一種高效的 token selection 的機制,它能夠有效地識別並保留最相關的 tokens,同時具備低的計算開銷。
**Blockwise Selection.** Our selection strategy processes key and value sequences in spatially continuous blocks, motivated by two key factors: hardware efficiency considerations and inherent distribution patterns of attention scores. Blockwise selection is crucial to achieve efficient computation on modern GPUs. That is because modern GPU architectures exhibit significantly higher throughput for continuous block accesses compared to random index-based reads. Also, blockwise computation enables optimal utilization of Tensor Cores. This architectural characteristic has established blockwise memory access and computation as a fundamental principle in high-performance attention implementations, as exemplified by FlashAttention's block-based design. Blockwise selection follows the inherent distribution patterns of attention scores. Prior works (Jiang et al., 2024) have shown that attention scores often exhibit spatial continuity, suggesting that neighboring keys tend to share similar importance levels. Our visualization in Section 6.2 also shows this spatial continuous pattern.
**Blockwise Selection.** 我們的選擇策略是在一個空間連續的區塊中處理 key 和 value的序列,主要受兩種關鍵因素所啟發:硬體效率的考量和注意力分數本身的固有分佈模式。分塊選擇對於在現代 GPU 上實現高效計算至關重要。這是因為,相較於基於隨機索引的讀取方式,現代 GPU 架構對於連續區塊存取有著更高的吞吐量。還有就是,分塊計算還能夠最佳化 Tensor Cores 的效能。這種架構特點使分塊式的記憶體存取和計算成為高效能注意力實現的基本原則,例如 FlashAttention 的基於塊的設計。分塊選擇遵循注意力分數的固有分佈模式。早先的研究(Jiang, et al., 2024)也說明著,注意力分數往往會呈現空間連續性,這說明著相鄰的 keys 傾向於共享類似的重要性等級(importance levels)。我們在 Section 6.2 中的可視化也顯示了這種空間連續模式。
To implement blockwise selection, we first divide key, value sequences into selection blocks. To identify the most important blocks for attention computation, we need to assign importance scores to each block. Below we present our method for computing these block-level importance scores.
為了實作區塊式選擇,我們首先將 key 與 value 序列劃分為多個選擇區塊。為了在注意力計算中識別出最關鍵的區塊,我們需要為每個區塊分配一個重要性分數。下面將說明我們用於計算這些區塊級重要性分數的方法。
**Importance Score Computation.** Computing block importance scores could introduce significant overhead. Fortunately, the attention computation of compression tokens produces intermediate attention scores that we can leverage to induce selection block importance scores, formulated as:
$$
\mathbf{p}_t^{\text{cmp}}=\text{Softmax}(\mathbf{q}_t^T\tilde{K}_t^{\text{cmp}})\tag{8}
$$
**Importance Score Computation.** 計算區塊的重要性得分可能會帶來顯著的開銷。幸運的是,壓縮詞彙的注意力計算會產生我們可以利用的中間注意力得分,這些得分可以用於推導出選擇區塊的重要性得分,其公式如下:
$$
\mathbf{p}_t^{\text{cmp}}=\text{Softmax}(\mathbf{q}_t^T\tilde{K}_t^{\text{cmp}})\tag{8}
$$
where $\mathbf{p}_t^{\text{cmp}} \in \mathbb{R}^{\lfloor\dfrac{t-l}{d}\rfloor +1}$ is the attention scores between $q_t$ and compression keys $\tilde{K}_t^{\text{cmp}}$. Let $l'$ denote the selection block size. When compression blocks and selection blocks share the same blocking scheme, i.e., $l'=l=d$ , we can directly obtain the selection block importance scores $\mathbf{p}_t^{\text{slc}}$ by $\mathbf{p}_t^{\text{slc}}=\mathbf{p}_t^{\text{cmp}}$ straightforwardly. For cases where the blocking schemes differ, we derive the importance scores for selection blocks according to their spatial relationship. Given $l \leq l',d\vert l$ and $d\vert l'$ , we have:
$$
\mathbf{p}_t^{\text{slc}}[j]=\sum_{m=0}^{\dfrac{l'}{d}-1}\sum_{n=0}^{\dfrac{l}{d}-1}\mathbf{p}_t^{\text{cmp}}\Bigg[ \dfrac{l'}{d}j-m-n \Bigg] \tag{9}
$$
其中 $\mathbf{p}_t^{\text{cmp}} \in \mathbb{R}^{\lfloor\dfrac{t-l}{d}\rfloor +1}$ 表示查詢向量 $\mathbf{q}_t$ 與壓縮後的 keys $\tilde{K}_t^{\text{cmp}}$ 之間的注意力分數。
假設 $l'$ 表示 selection block(選擇區塊)的大小。當壓縮區塊與選擇區塊使用相同的分區策略(即 $l'=l=d$)時,我們可以直接將 $\mathbf{p}_t^{\text{cmp}}$ 作為選擇區塊的重要性分數,即 $\mathbf{p}_t^{\text{slc}}=\mathbf{p}_t^{\text{cmp}}$。
若兩者的分區策略不同,則需依據其在時間軸上的空間對應關係來推導出選擇區塊的重要性分數。
給定 $l \leq l'$,以及 $d \vert l$ 且 $d \vert l'$ 的情況下,我們可以進一步導出如下關係:
$$
\mathbf{p}_t^{\text{slc}}[j]=\sum_{m=0}^{\dfrac{l'}{d}-1}\sum_{n=0}^{\dfrac{l}{d}-1}\mathbf{p}_t^{\text{cmp}}\Bigg[ \dfrac{l'}{d}j-m-n \Bigg] \tag{9}
$$
where $[\cdot]$ denotes the indexing operator for accessing vector element. For models employing GQA or MQA where key-value caches are shared across query heads, consistent block selection across these heads has to be ensured to minimize KV cache loading during decoding. The shared importance scores across heads in a group are formally defined as:
$$
\mathbf{p}_t^{\text{slc}'}=\sum^{H}_{h=1}\mathbf{p}_t^{\text{slc},(h)}\tag{10}
$$
其中 $[\cdot]$ 表示用於存取向量元素的索引運算符。對於採用 GQA 或 MQA 架構的模型,其 key-value caches 會在 query head 間共享,因此需要確保每個群組內的 query head 使用相同的區塊選取,以最小化解碼期間 KV cache 的載入量。為此,一個群組內各個 query head 共享的重要性分數定義為:
$$
\mathbf{p}_t^{\text{slc}'}=\sum^{H}_{h=1}\mathbf{p}_t^{\text{slc},(h)}\tag{10}
$$
where $(h)$ in the superscript denotes the head index, and $H$ is the number of query heads in each group. This aggregation ensures consistent block selection across heads within the same group.
其中上標的 $(h)$ 表示 head index,$H$ 代表每個群組中的 query head 的數量。這種聚合方式確保同一個群組中各個頭對應一致的區塊選擇。
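One concrete way to realize Equations (8)-(10) is sketched below: credit the score of each compression block (start $id$, length $l$) to every size-$l'$ selection block it overlaps, then sum the resulting scores over the query heads of a GQA group. The index arithmetic replaces the closed-form double sum of Equation (9) and assumes the block geometry stated in the text ($d \mid l$, $d \mid l'$); it is an illustrative reading, not the paper's fused kernel.

```python
import numpy as np

def selection_scores(p_cmp, t, l=32, d=16, l_sel=64):
    """Map compression-block scores (Eq. 8) to selection-block scores by crediting
    each compression block (tokens i*d .. i*d+l-1, 0-indexed) to every selection
    block of size l_sel that it overlaps."""
    num_sel = (t + l_sel - 1) // l_sel
    p_slc = np.zeros(num_sel)
    for i, score in enumerate(p_cmp):
        first = (i * d) // l_sel                          # first overlapped selection block
        last = min((i * d + l - 1) // l_sel, num_sel - 1) # last overlapped selection block
        p_slc[first : last + 1] += score
    return p_slc

def shared_scores(p_slc_per_head):
    """Eq. (10): sum per-head scores so every head in a GQA group picks the same blocks."""
    return np.asarray(p_slc_per_head).sum(axis=0)
```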
**Top $n$ Block Selection.** After obtaining the selection block importance scores, we retain tokens within the top-$n$ sparse blocks ranked by block importance scores, formulated as:
$$
\mathcal{I}_t=\left\{i\vert \text{rank}(\mathbf{p}_t^{\text{slc}'}[i])\leq n \right\} \tag{11}
$$
$$
\tilde{K}^{\text{slc}}_{t}=\text{Cat}\left[\left\{\mathbf{k}_{il'+1:(i+1)l'} \,\middle\vert\, i \in \mathcal{I}_{t}\right\}\right]\tag{12}
$$
**Top $n$ Block Selection.** 在獲得區塊重要性得分後,我們保留根據區塊重要性得分排名靠前的 top-$n$ 個稀疏區塊中的詞彙,提取如下:
$$
\mathcal{I}_t=\left\{i\vert \text{rank}(\mathbf{p}_t^{\text{slc}'}[i])\leq n \right\} \tag{11}
$$
$$
\tilde{K}^{\text{slc}}_{t}=\text{Cat}\left[\left\{\mathbf{k}_{il'+1:(i+1)l'} \,\middle\vert\, i \in \mathcal{I}_{t}\right\}\right]\tag{12}
$$
where $\text{rank} (\cdot)$ denotes the ranking position in descending order, with $\text{rank}=1$ corresponding to the highest score, $\mathcal{I}_t$ is the set of selected blocks' indices, and $\text{Cat}$ denotes the concatenation operation. $\tilde{K}^{\text{slc}}_{t}\in\mathbb{R}^{d_k\times nl'}$ is the tensor composed of selected keys. An analogous formulation applies to the fine-grained value $\tilde{V}^{\text{slc}}_{t}$. The selected keys and values then participate in the attention computation with $\mathbf{q}_t$ as defined in Equation (5).
其中 $\text{rank} (\cdot)$ 表示按降序排列的排序位置,$\text{rank}=1$ 對應最高分數,$\mathcal{I}_t$ 為所選取的區塊索引集合,$\text{Cat}$ 表示串接(concatenation)操作。$\tilde{K}^{\text{slc}}_{t}\in\mathbb{R}^{d_k\times nl'}$ 是由選取後的 keys 所組成的張量。類似的公式適用於 fine-grained value $\tilde{V}^{\text{slc}}_{t}$。然後,選取的 keys 和 values 會與公式 (5) 中定義的 $\mathbf{q}_t$ 一起進行注意力計算。
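A minimal sketch of Equations (11)-(12): keep the $n$ highest-scoring selection blocks and concatenate their raw keys and values. The always-activated initial and local blocks mentioned in Section 4.1 and any tie-breaking rule are omitted for brevity; names and shapes are illustrative.

```python
import numpy as np

def gather_top_n_blocks(p_slc_shared, K, V, l_sel=64, n=16):
    """p_slc_shared: (num_blocks,) shared scores from Eq. (10); K/V: (t, d)."""
    top = np.argsort(-p_slc_shared)[:n]                  # Eq. (11): indices with rank <= n
    idx = np.sort(top)                                   # restore temporal order before Cat
    K_slc = np.concatenate([K[i * l_sel : (i + 1) * l_sel] for i in idx], axis=0)
    V_slc = np.concatenate([V[i * l_sel : (i + 1) * l_sel] for i in idx], axis=0)
    return K_slc, V_slc                                  # Eq. (12): Cat of selected blocks
```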
## 3.3.3. Sliding Window
In attention mechanisms, local patterns typically adapt faster and can dominate the learning process, potentially preventing the model from effectively learning from compression and selection tokens. To address this issue, we introduce a dedicated sliding window branch that explicitly handles local context, allowing other branches (compression and selection) to focus on learning their respective features without being shortcutted by local patterns. Specifically, we maintain recent tokens $\tilde{K}^{\text{win}}_{t}=\mathbf{k}_{t-w:t}$, $\tilde{V}_t^{\text{win}}=\mathbf{v}_{t-w:t}$ in a window $w$ , and isolate attention computations of different information sources (compression tokens, and selected tokens, sliding window) into separate branches. These branch outputs are then aggregated through a learned gating mechanism. To further prevent shortcut learning across attention branches with marginal computational overhead, we provide independent keys and values for three branches. This architectural design enables stable learning by preventing gradient interference between local and long-range pattern recognition, while introducing minimal overhead.
在注意力機制中,局部模式通常適應得比較快,並可能主導整個學習過程,進而抑制模型有效地從 compression tokens 與 selection tokens 中學習。為了解決這個問題,我們引入一種特有的 sliding window branch,此分支可以顯式地處理局部上下文,允許其它分支(branch)(compression and selection)能夠專注於學習它們的表示特徵,而不會受到局部模式的捷徑影響。具體來說,我們在 window $w$ 中保留了近期的 tokens $\tilde{K}^{\text{win}}_{t}=\mathbf{k}_{t-w:t}$、$\tilde{V}_t^{\text{win}}=\mathbf{v}_{t-w:t}$,然後將來自不同來源的注意力計算(compression tokens、selected tokens、sliding window)分別置於獨立的注意力分支中處理。這些分支的輸出隨後會透過一個可學習的門控機制來進行聚合。為了在不明顯增加計算負擔的前提下,進一步預防注意力分支(attention branches)之間的 shortcut learning,我們為三個分支提供各自的 keys 與 values。這種架構設計能夠透過預防局部與長距離模式識別之間的梯度干擾來穩定學習,同時僅增加極少的額外負擔。
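The sliding-window branch itself reduces to a slice over the most recent $w$ tokens, as in the minimal sketch below; note that in NSA each branch has its own keys and values, whereas the single shared K/V here is purely for illustration.

```python
import numpy as np

def window_branch(K, V, t, w=512):
    """Return the sliding-window keys/values k_{t-w:t}, v_{t-w:t} (clipped at the start)."""
    start = max(0, t - w)
    return K[start:t], V[start:t]

# toy usage: at position t=1000 with w=512, the branch sees keys/values 488..999
K = np.zeros((2048, 8)); V = np.zeros((2048, 8))
K_win, V_win = window_branch(K, V, t=1000)   # shapes (512, 8)
```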
After obtaining all three categories of keys and values $(\tilde{K}_t^{\text{cmp}}, \tilde{V}_t^{\text{cmp}}, \tilde{K}_t^{\text{slc}}, \tilde{V}_t^{\text{slc}}; \tilde{K}_t^{\text{win}}, \tilde{V}_t^{\text{win}})$, we compute the final attention output following Equation (5). Together with the compression, selection, and sliding window mechanisms described above, this forms the complete algorithmic framework of NSA.
我們在取得所有三種類別的 keys 與 values $(\tilde{K}_t^{\text{cmp}}, \tilde{V}_t^{\text{cmp}}, \tilde{K}_t^{\text{slc}}, \tilde{V}_t^{\text{slc}}; \tilde{K}_t^{\text{win}}, \tilde{V}_t^{\text{win}})$ 之後,我們根據公式(5)計算最終的注意力輸出。結合上面所說的壓縮(compression)、選擇(selection)和滑動窗口(sliding window)機制,最終構成了 NSA 完整的演算法框架。
## 3.4. Kernel Design
To achieve FlashAttention-level speedup during the training and prefilling, we implement hardware-aligned sparse attention kernels upon Triton. Given MHA is memory-intensive and inefficient for decoding, we focus on architectures with shared KV caches like GQA and MQA following the current state-of-the-art LLMs. While compression and sliding window attention computations are readily compatible with existing FlashAttention-2 kernels, we introduce the specialized kernel design for sparse selection attention. If we were to follow FlashAttention's strategy of loading temporally continuous query blocks into SRAM, it would result in inefficient memory access since queries within a block may require disjoint KV blocks. To address this, our key optimization lies in a different query grouping strategy: for each position on the query sequence, we load all query heads within a GQA group (they share the same sparse KV blocks) into SRAM. Figure 3 illustrates our forward pass implementation. The proposed kernel architecture is characterized by the following key features:
為了在訓練與預填充(prefilling)階段達到如 FlashAttention 的加速效果,我們在 Triton 上實作 hardware-aligned sparse attention kernels。由於多頭注意力(Multi-Head Attention, MHA)在解碼階段屬記憶體密集且效率不佳,我們關注於具有共享 KV caches 架構的模型,例如 GQA 和 MQA,這與目前主流的大語言模型(LLM)趨勢一致。雖然壓縮和滑動窗口注意力計算可以直接相容 FlashAttention-2 kernels,不過我們針對稀疏選取注意力引入了專門設計的核架構。如果我們依循 FlashAttention 將時間連續的查詢區塊載入 SRAM 的策略,那將會導致記憶體存取效率低下,這是因為該區塊內的查詢(query)可能分別需要存取彼此不相連的 KV 區塊(disjoint KV blocks)。為了解決這個問題,我們採用一種新的查詢分組策略作為關鍵的最佳化:對於查詢序列(query sequence)中的每個位置,我們將 GQA group(它們共享相同的 sparse KV blocks)中所有 query head 載入 SRAM 中。Figure 3 說明了我們的正向傳遞的實作方式。 所提出的核架構具有以下主要特徵:
1. **Group-Centric Data Loading .** For each inner loop, load all heads' queries $Q\in\mathbb{R}^{[h,d_k]}$ in the group at position $t$ and their shared sparse key/value block indices $\mathcal{I}_t$ .
2. **Shared KV Fetching .** In the inner loop, sequentially load continuous key/value blocks indexed by $\mathcal{I}_t$ into SRAM as $K\in\mathbb{R}^{[B_k,d_k]}$, $V\in\mathbb{R}^{[B_k,d_v]}$ to minimize memory loading, where $B_k$ is the kernel block size satisfying $B_k\vert l'$.
3. **Outer Loop on Grid .** Since the inner-loop length (proportional to the selected block count $n$) remains nearly identical for different query blocks, we put query/output loops in Triton's grid scheduler to simplify and optimize the kernel.
1. **Group-Centric Data Loading .** 在每一次的內部循環中,我們載入位置 $t$ 在群組內的所有頭部的查詢 $Q\in\mathbb{R}^{[h,d_k]}$ 及其共享稀疏的 key/value block 指標 $\mathcal{I}_t$。
2. **Shared KV Fetching .** 在內部循環中,根據 $\mathcal{I}_t$ 的索引依序將連續的 key/value blocks 載入到 SRAM 作為 $K\in\mathbb{R}^{[B_k,d_k]}$、$V\in\mathbb{R}^{[B_k,d_v]}$,以最小化記憶體載入,其中 $B_k$ 是滿足 $B_k\vert l'$ 的內核區塊大小。
3. **Outer Loop on Grid .** 由於內部循環長度(與所選定的區塊數量 $n$ 成正比)對於不同的查詢區塊幾乎相同,我們將查詢/輸出循環加入 Triton 的網格調度器(grid scheduler)中,以簡化和最佳化核心的運算。
This design achieves near-optimal arithmetic intensity by (1) eliminating redundant KV transfers through group-wise sharing, and (2) balancing compute workloads across GPU streaming multiprocessors.
此設計透過(1)群組共享方式消除冗餘的 KV transfers,以及(2)在 GPU 串流多處理器之間平衡計算工作負載,實現了接近最佳的算術強度。
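To visualize the loop structure of Figure 3 without Triton-specific details, here is a schematic plain-NumPy rendering: the outer (grid) loop walks query positions, loads the whole GQA group of query heads at once, and the inner loop streams that group's shared selected KV blocks while maintaining an online softmax. This illustrates the data flow only and is not the paper's kernel; it also assumes every position has at least one selected block.

```python
import numpy as np

def sparse_selection_attention(Q, K, V, block_idx, l_sel=64):
    """Q: (T, h, d_k) query heads of one GQA group; K: (T, d_k), V: (T, d_v);
    block_idx[t]: selected block indices I_t shared by the whole group at position t."""
    T, h, d_k = Q.shape
    d_v = V.shape[-1]
    O = np.zeros((T, h, d_v))
    for t in range(T):                                   # grid loop over query positions
        q = Q[t]                                         # (h, d_k): all heads loaded together
        m = np.full(h, -np.inf); z = np.zeros(h); acc = np.zeros((h, d_v))
        for i in block_idx[t]:                           # inner loop over shared KV blocks
            Kb = K[i * l_sel : (i + 1) * l_sel]          # contiguous block load
            Vb = V[i * l_sel : (i + 1) * l_sel]
            s = q @ Kb.T / np.sqrt(d_k)                  # (h, block) scores
            m_new = np.maximum(m, s.max(axis=1))         # online softmax bookkeeping
            scale = np.exp(m - m_new)
            p = np.exp(s - m_new[:, None])
            z = z * scale + p.sum(axis=1)
            acc = acc * scale[:, None] + p @ Vb
            m = m_new
        O[t] = acc / z[:, None]
    return O
```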

Figure 3 | Kernel design for NSA. The kernel loads queries by GQA groups (Grid Loop), fetches corresponding sparse KV blocks (Inner Loop), and performs attention computation on SRAM. Green blocks indicate data on SRAM, while blue indicates data on HBM.
Figure 3 | NSA 的核設計。核心透過 GQA 群組(網格循環)載入查詢,並從相應的稀疏 KV 區塊(內部循環)中提取資料,在 SRAM 上進行注意力計算。綠色方塊表示 SRAM 中的資料,藍色則表示 HBM 中的資料。
## 4. Experiments
We evaluate NSA through three lenses: (1) general benchmarks performance, (2) long-context benchmarks performance, and (3) chain-of-thought reasoning performance, comparing against Full Attention baseline and state-of-the-art sparse attention methods. We defer the efficiency analysis of our sparse computation paradigm to Section 5, where we provide detailed discussions on training and inference speed.
我們從三個面向來評估 NSA:(1) 通用的基準效能,(2) 長上下文基準效能,以及 (3) 思維鏈(chain-of-thought)推理效能,並且與全注意力(Full Attention)基線和最新稀疏注意力方法進行比較。針對我們的稀疏運算架構的效率分析,稍後將在 Section 5 討論,在那邊會詳細說明涵蓋訓練和推理速度方面的相關議題。
## 4.1. Pretraining Setup
Following the common practice in state-of-the-art LLMs, our experiments adopt a backbone combining Grouped-Query Attention (GQA) and Mixture-of-Experts (MoE), featuring 27B total parameters with 3B active parameters. The model consists of 30 layers with a hidden dimension of 2560. For GQA, we set the number of groups to 4, with a total of 64 attention heads. For each head, the hidden dimensions of the query, key, and value are configured as $d_q=d_k=192$ and $d_v=128$, respectively. For MoE, we utilize the DeepSeekMoE (Dai et al., 2024; DeepSeek-AI, 2024) structure, with 72 routed experts and 2 shared experts, and set the top-k experts to 6. To ensure training stability, the MoE in the first layer is replaced by an MLP in the form of SwiGLU.
基於當前最新的語言模型 (LLM) 的慣例,我們在實驗中採用一個結合了 Grouped-Query Attention (GQA) 和 Mixture-of-Experts (MoE) 的骨幹結構,總參數為 27B,其中活躍參數為 3B。模型總共包含 30 層,隱藏層維度為 2560。GQA 的部份,我們將組群數量設置為 4,共有 64 個注意力頭(attention heads)。對於每一個注意力頭,其查詢(query)、鍵(key)、值(value)的隱藏維度分別設定為 $d_q=d_k=192$,$d_v=128$。MoE 的部份則是採用 DeepSeekMoE (Dai et al., 2024; DeepSeek-AI, 2024)的架構,有 72 個路由專家(routed experts)和 2 個共享專家(shared experts),並將 top-k experts 設定為 6。為了確保訓練的穩定性,第一層的 MoE 被替換為 SwiGLU 形態的多層感知機 (MLP)。

Figure 4 | Pretraining loss comparison between Full Attention and our NSA on 27B-parameter model. Both models exhibit stable convergence, with NSA achieving lower loss values.
Figure 4 | 於 27B-參數模型上,Full Attention 與我們提出的 NSA 在預訓練損失的比較。兩者模型均表現出穩定的收斂,NSA 則有著更低的損失值。
| Model | MMLU Acc. 5-shot | MMLU-PRO Acc. 5-shot | CMMLU Acc. 5-shot | BBH Acc. 3-shot | GSM8K Acc. 8-shot | MATH Acc. 4-shot | DROP F1 1-shot | MBPP Pass@1 3-shot | HumanEval Pass@1 0-shot | Avg. |
|-----------|--------------------|------------------------|---------------------|-------------------|---------------------|--------------------|------------------|----------------------|---------------------------|--------|
| Full Attn | 0.567 | 0.279 | 0.576 | 0.497 | 0.486 | 0.263 | 0.503 | 0.482 | 0.335 | 0.443 |
| NSA       | 0.565              | 0.286                  | 0.587               | 0.521             | 0.520               | 0.264              | 0.545            | 0.466                | 0.348                     | 0.456  |
Table 1 | Pretraining performance comparison between the full attention baseline and NSA on general benchmarks, across knowledge (MMLU, MMLU-PRO, CMMLU), reasoning (BBH, GSM8K, MATH, DROP), and coding (MBPP, HumanEval) tasks. NSA achieves superior average performance on most benchmarks despite high sparsity.
Table 1 | 在通用基準測試上,完整注意力基線與 NSA 在知識 (MMLU、MMLU-PRO、CMMLU)、推理 (BBH、GSM8K、MATH、DROP) 和編碼 (MBPP、HumanEval) 任務中的預訓練效能的比較。儘管擁有高稀疏性,NSA 在大多數基準測試中都獲得了優異的平均性能。
The proposed architecture achieves an effective trade-off between computation cost and model performance. For NSA, we set compression block size $l = 32$, sliding stride $d = 16$, selected block size $l' = 64$, selected block count $n = 16$ (including fixed activating the 1 initial block and 2 local blocks), and sliding window size $w = 512$. Both Full Attention and sparse attention models are pretrained on 270B tokens of 8k-length texts, followed by continued training and supervised fine-tuning on 32k-length texts with YaRN (Peng et al., 2024) to achieve long-context adaptation. Both models are trained to full convergence to ensure fair comparison. As shown in Figure 4, the pretraining loss curve of our NSA and Full Attention baseline demonstrates stable and smooth decline, with NSA consistently outperforming the Full Attention model.
我們所提出的架構在運算成本與模型效能之間取得有效平衡。NSA 的部份,我們設定壓縮區塊大小(compression block size) $l = 32$、滑動步幅(sliding stride) $d = 16$、選擇區塊大小(selected block size) $l' = 64$、選擇區塊數量(selected block count) $n = 16$(其中包含固定啟動的 1 個初始區塊與 2 個局部區塊),以及滑動視窗大小(sliding window size) $w = 512$。Full Attention models 和稀疏注意力模型均在包含 270B 個 token 的 8k 長文本上進行預訓練,隨後使用 YaRN (Peng et al., 2024) 對長度為 32k 的文本進行持續訓練和監督微調,以實現長上下文自適應。兩種模型均訓練至完全收斂,以確保比較的公平性。如 Figure 4 所示,我們的 NSA 和全注意力基線模型的預訓練損失曲線表現出穩定且平滑的下降趨勢,其中 NSA 在整個訓練過程持續地優於 Full Attention models。
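For reference, the NSA hyperparameters stated above can be collected into a single config object; the field names are illustrative, while the values follow Section 4.1.

```python
from dataclasses import dataclass

@dataclass
class NSAConfig:
    l: int = 32         # compression block size
    d: int = 16         # sliding stride between compression blocks
    l_sel: int = 64     # selected block size l'
    n_sel: int = 16     # selected block count n (includes 1 initial + 2 local blocks always kept)
    window: int = 512   # sliding window size w

cfg = NSAConfig()
```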
## 4.2. Baselines Methods
In addition to comparing with Full Attention, we evaluate several state-of-the-art inference-stage sparse attention methods: H2O (Zhang et al., 2023b), infLLM (Xiao et al., 2024), Quest (Tang et al., 2024), and Exact-Top, which first computes full attention score and select the top-$n$ scores keys corresponding to each query and then calculates attention on these positions. These methods span diverse sparse attention paradigms, including KV-cache eviction, query-aware selection, and exact top-$n$ sparse selection.
除了與全注意力(Full Attention)比較外,我們同時評估了多種最新的推理階段稀疏注意力的方法:H2O (Zhang et al., 2023b)、infLLM (Xiao et al., 2024)、Quest (Tang et al., 2024) 和 Exact-Top。Exact-Top 首先計算全注意力分數,再針對每一個查詢(query)選出得分最高的前 $n$ 個鍵(key),然後在這些位置上進行注意力的計算。這些方法涵蓋了多樣化的稀疏注意力範式,包括 KV-cache eviction、query-aware selection 和 exact top-$n$ sparse selection。
| Model | MFQA-en (SQA) | MFQA-zh (SQA) | Qasper (SQA) | HPQ (MQA) | 2Wiki (MQA) | GovRpt (MQA) | Dur (MQA) | PassR-en (Synthetic) | PassR-zh (Synthetic) | LCC (Code) | Avg. |
|-----------|---------|---------|--------|-------|-------|--------|-------|-------------|-------------|--------|--------|
| H2O | 0.428 | 0.429 | 0.308 | 0.112 | 0.101 | 0.231 | 0.208 | 0.704 | 0.421 | 0.092 | 0.303 |
| InfLLM | 0.474 | 0.517 | 0.356 | 0.306 | 0.250 | 0.277 | 0.257 | 0.766 | 0.486 | 0.143 | 0.383 |
| Quest | 0.495 | 0.561 | 0.365 | 0.295 | 0.245 | 0.293 | 0.257 | 0.792 | 0.478 | 0.135 | 0.392 |
| Exact-Top | 0.502 | 0.605 | 0.397 | 0.321 | 0.288 | 0.316 | 0.291 | 0.810 | 0.548 | 0.156 | 0.423 |
| Full Attn | 0.512 | 0.623 | 0.409 | 0.350 | 0.305 | 0.324 | 0.294 | 0.830 | 0.560 | 0.163 | 0.437 |
| NSA | 0.503 | 0.624 | 0.432 | 0.437 | 0.356 | 0.307 | 0.341 | 0.905 | 0.550 | 0.232 | 0.469 |
Table 2 | Performance comparison between our NSA and baselines on LongBench, including subsets in single document QA, multi-document QA, synthetic and code task categories. NSA outperformed most of the baselines including Full Attention.
Table 2 | 我們的 NSA 與基線模型在 LongBench 上的效能比較,包括 single document QA、multi-document QA、合成任務和程式碼任務類別中的子集。NSA 優於多數基線模型(包括Full Attention)。
For general evaluation, where most samples have lengths within the local context window of sparse attention baselines, these methods are effectively equivalent to Full Attention. Therefore, we present only the comparison results between NSA and the Full Attention baseline in this setting. In the long-context evaluation, we conduct comparisons across all baseline methods, with the sparsity of all sparse attention methods set to the same value to ensure a fair comparison. For chain-of-thought reasoning evaluation, which requires long-text supervised fine-tuning, we limit our comparison to Full Attention, as sparse attention baselines do not support training.
針對一般性的評估,由於大多數樣本的長度皆落在稀疏注意力基線所涵蓋的局部上下文視窗內,因此這些方法在效果上等同於全注意力機制,故在此情境下我們僅呈現 NSA 與全注意力基線之間的比較結果。在長上下文的評估中,我們對所有基線方法進行了比較,並將所有稀疏注意力方法的稀疏度設定為相同,以確保公平比較。對於 chain-of-thought 的推理評估,它需要長文本監督微調,然而稀疏注意力基線並不支援訓練,因此我們僅將 NSA 與全注意力進行比較。
## 4.3. Performance Comparison
**General Evaluation.** We evaluated the pretrained NSA and Full Attention baseline, on a comprehensive suite of benchmarks spanning knowledge, reasoning, and coding capabilities, including MMLU (Hendrycks et al., 2020), MMLU-PRO (Wang et al., 2024), CMMLU (Li et al., 2023), BBH (Suzgun et al., 2022), GSM8K (Cobbe et al., 2021), MATH (Hendrycks et al., 2020), DROP (Dua et al., 2019), MBPP (Austin et al., 2021), and HumanEval (Chen et al., 2021). The results are shown in Table 1. Despite its sparsity, NSA achieves superior overall performance, outperforming all baselines including Full Attention on 7 out of 9 metrics. This indicates that although NSA may not fully leverage its efficiency advantages on shorter sequences, it shows strong performance. Notably, NSA demonstrates significant gains in reasoning-related benchmarks (DROP: +0.042, GSM8K: +0.034), suggesting that our pretraining helps models to develop specialized attention mechanisms. This sparse attention pretraining mechanism forces model to focus on the most important information, potentially enhancing performance by filtering out noise from irrelevant attention pathways. The consistent performance across diverse evaluations also validates NSA's robustness as a general-purpose architecture.
**General Evaluation.** 我們對預先訓練的 NSA 模型與 Full Attention 基線模型進行評估,涵蓋一系列針對知識、推理與程式撰寫能力的綜合性基準測試,包括 MMLU (Hendrycks et al., 2020)、MMLU-PRO (Wang et al., 2024)、CMMLU (Li et al., 2023)、BBH (Suzgun et al., 2022)、GSM8K (Cobbe et al., 2021)、MATH (Hendrycks et al., 2020)、DROP (Dua et al., 2019)、MBPP (Austin et al., 2021) 以及 HumanEval (Chen et al., 2021)。結果如 Table 1 所示。儘管 NSA 的稀疏性較高,但其整體表現優異,在 9 個指標中有 7 個超越了包括 Full Attention 在內的所有基線。這說明即使 NSA 在處理短序列時無法完全發揮其效率上的優勢,其整體效能依然十分強勁。值得注意的是,NSA 在推理相關的基準(DROP:+0.042,GSM8K:+0.034)方面表現出顯著的增益,這說明我們的預訓練有助於模型發展特有的注意力機制。這種稀疏注意力預訓練機制迫使模型關注最關鍵的信息,並可能透過過濾掉來自無關注意力路徑的雜訊來提升效能。在各種評估中的一致性表現也驗證了 NSA 作為通用架構的穩健性。
**Long-Context Evaluation.** As shown in Figure 5, NSA achieves perfect retrieval accuracy across all positions in 64k-context needle-in-a-haystack (Kamradt, 2023) test. This performance stems from our hierarchical sparse attention design, which combines compression tokens for efficient global context scanning, and selection tokens for precise local information retrieval. The coarse-grained compression identifies relevant context blocks at low computational cost, while the token-level attention on selected tokens ensures the preservation of critical fine-grained information. This design enables NSA to maintain both global awareness and local precision.
**Long-Context Evaluation.** 如 Figure 5 所示,NSA 在 64k-context 的「大海撈針」測試中(Kamradt, 2023),於所有位置上皆達成完美的檢索準確率。這表現歸功於我們的分層稀疏注意力設計,它結合了 compression tokens 以有效地掃描全域(global)上下文,以及 selection tokens 可以精確地檢索局部信息。coarse-grained compression 能以低計算成本快速定位相關的上下文區塊,而對選定的 tokens 的 token-level attention 則是確保保留關鍵的精細(fine-grained)信息。這種設計使 NSA 能夠同時保持全域的感知能力與局部的精準性。
We also evaluate NSA on LongBench (Bai et al., 2023) against state-of-the-art sparse attention methods and Full Attention baseline. To ensure consistent sparsity, we set the token activated by each query in all sparse attention baselines to 2560 tokens, which corresponds to the average number of tokens activated in NSA when handling 32k sequence lengths. Following StreamLLM (Xiao et al., 2023), this token budget includes the leading 128 tokens and 512 local tokens. We exclude certain subsets from LongBench due to their low scores across all models, which may not provide meaningful comparisons. As shown in Table 2, NSA achieves the highest average score 0.469, outperforming all baselines (+0.032 over Full Attention and +0.046 over Exact-Top). This improvement arises from two key innovations: (1) our native sparse attention design, which enables end-to-end optimization of sparse patterns during pretraining, facilitates synchronized adaptation between the sparse attention module and other model components; and (2) the hierarchical sparse attention mechanism achieves a balance between local and global information processing.
我們也在 LongBench (Bai et al., 2023) 上評估 NSA,並與最先進的稀疏注意力方法和全注意力基準模型進行比較。為確保一致性的稀疏性,我們將所有稀疏注意力基準模型中每個查詢所啟動(activate)的 token 數量統一設為 2560,這等同於處理 32k 序列長度時 NSA 中平均啟動的 token 數量。參考 StreamLLM (Xiao et al., 2023)的設計,此 token 預算包含前導(leading)的 128 個 token 與 512 個局部 token。由於某些子集在所有模型上的得分都偏低,沒什麼營養,因此我們從 LongBench 中排除它們,以避免影響比較結果的意義。如 Table 2 所示,NSA 取得最高平均分數 0.469,優於所有基準 (相較於全注意力提升了 +0.032,較 Exact-Top 提升了 +0.046)。這項改進歸功於兩項關鍵創新:(1) 我們的原生稀疏注意力設計,它能夠在預訓練期間對稀疏模式進行端到端的最佳化,這有助於稀疏注意力模組與其它模型組件之間的同步適配; (2) 分層稀疏注意力機制實現了局部和全域信息處理之間的平衡。

Figure 5 | Needle-in-a-Haystack retrieval accuracy across context positions with 64k context length. NSA achieves perfect accuracy through its hierarchical sparse attention design.
Figure 5 | 針對 64k 上下文長度,在不同上下文位置下的大海撈針檢索精確度。NSA 藉由其分層稀疏注意力設計達成了完美的檢索精度。
Notably, NSA demonstrates exceptional performance on tasks requiring complex reasoning over long contexts, achieving +0.087 and +0.051 improvements over Full Attention on multi-hop QA tasks (HPQ and 2Wiki), exceeding the performance of baselines on code understanding (LCC: +0.069), and outperforming other methods on passage retrieval (PassR-en: +0.075). These results validate NSA's capability to handle diverse long-context challenges, with its natively pretrained sparse attention providing additional benefits in learning task-optimal patterns.
值得注意的是,NSA 在需要對長文本進行複雜推理的任務中表現出色,在 multi-hop QA tasks (HPQ and 2Wiki) 上相比於 Full Attention models 高出 0.087 和 0.051 分,在程式碼理解任務(LCC)中,超越基線模型 +0.069 分,而在段落檢索任務(PassR-en)中,也以 +0.075 的差距優於其他方法。這些結果驗證了 NSA 處理各種長文本挑戰的能力,其原生的預訓練稀疏注意力機制為學習任務最佳模式提供了額外優勢。
**Chain-of-Thought Reasoning Evaluation.** To evaluate NSA's compatibility with advanced downstream training paradigms, we investigate its capacity to acquire chain-of-thought mathematical reasoning abilities via post-training. Given the limited effectiveness of reinforcement learning on smaller-scale models, we employ knowledge distillation from DeepSeek-R1, conducting supervised fine-tuning (SFT) with 10B tokens of 32k-length mathematical reasoning traces. This produces two comparable models: Full Attention-R (Full Attention baseline) and NSA-R (our sparse variant). We assess both models on the challenging American Invitational Mathematics Examination (AIME 24) benchmark. We use a sampling temperature of 0.7 and a top-$p$ value of 0.95 to generate 16 responses for each question and obtain the average score. To validate the impact of reasoning depth, we conduct experiments with two generation context limits: 8k and 16k tokens, measuring whether extended reasoning chains improve accuracy. Example comparisons of model predictions are provided in Appendix A.
**Chain-of-Thought Reasoning Evaluation.** 為了評估 NSA 是否能與進階的下游訓練範式相容,我們研究其透過後訓練能否獲得思維鏈數學推理能力。鑑於強化學習在小型模型上的成效有限,我們改採自 DeepSeek-R1 進行知識蒸餾,並用總量 10B 個 tokens 長度 32k 的數學推理軌跡來做監督式微調(SFT)。這產生了兩個可比較的模型:Full Attention-R(Full Attention 基線)和 NSA-R(我們的稀疏變體)。我們在具有挑戰性的 American Invitational Mathematics Examination (AIME 24) 評估兩個模型。我們使用 0.7 的 sampling temperature 和 top-$p$=0.95來為每個問題生成 16 個回應並計算平均得分。為了驗證推理深度的影響,我們進行了兩種生成上下文限制實驗:8k 和 16k 個 tokens,觀察延長推理鏈是否能提高準確性。Appendix A 包含模型預測的範例比較。
Table 3 | AIME instruction-based evaluation after supervised fine-tuning. Our NSA-R demonstrates better performance than Full Attention-R at both 8k and 16k sequence lengths.
Table 3 | 使用 AIME 進行指令式評估(於監督式微調後)結果。我們的 NSA-R 在8k和16k長度序列中,表現均優於 Full Attention models (Full Attention-R)。
| Generation Token Limit | 8192 | 16384 |
|--------------------------|--------|---------|
| Full Attention-R | 0.046 | 0.092 |
| NSA-R | 0.121 | 0.146 |

Figure 6 | Comparison of Triton-based NSA kernel with Triton-based FlashAttention-2 kernel. Our implementation significantly reduces latency across all context lengths, with the improvement becoming more pronounced as input length increases.
Figure 6 | 基於 Triton 的 NSA 核心與基於 Triton 的 FlashAttention-2 核心之比較。我們的實作在所有上下文長度上都顯著降低延遲,且隨著輸入長度的增加,這種改進更加明顯。
As shown in Table 3, NSA-R achieves significantly higher accuracy than Full Attention-R under the 8k context setting (+0.075), with this advantage persisting at 16k contexts (+0.054). These results validate two key benefits of native sparse attention: (1) The pretrained sparse attention patterns enable efficient capture of long-range logical dependencies critical for complex mathematical derivations; (2) Our architecture's hardware-aligned design maintains sufficient context density to support growing reasoning depth without catastrophic forgetting. The consistent outperformance across context lengths confirms sparse attention's viability for advanced reasoning tasks when natively integrated into the training pipeline.
如 Table 3 所示,NSA-R 在 8k 上下文設定中,其準確度明顯高於 Full Attention-R (+0.075),且這種優勢在 16k 上下文設定中仍然存在 (+0.054)。這些結果驗證了原生稀疏注意力的兩大關鍵優勢:(1) 預先訓練的稀疏注意力模式能有效捕捉長距離邏輯依賴性,這對複雜的數學推導而言至關重要;(2) 我們在架構上與硬體對齊的設計維持了足夠的上下文密度,以支持不斷增長的推理深度而不會出現災難性遺忘。在不同上下文長度下持續保持的優勢,證明了當稀疏注意力被原生整合進訓練管道時,其在進階推理任務中的可行性。
## 5. Efficiency Analysis
We evaluate the computational efficiency of NSA against Full Attention on an 8-GPU A100 system. In efficiency analysis, we also configure the model with GQA group $g = 4$, heads per group $h = 16$, query/key dimension $d_k = 192$, and value dimension $d_v = 128$. Following the same settings in Section 4, we set NSA compression block size $l = 32$, sliding stride $d = 16$, selected block size $l' = 64$, selected block count $n = 16$, and sliding window size $w = 512$.
我們在一個配備 8 顆 A100 GPU 的系統上,評估 NSA 相對於 Full Attention 的計算效率。在效率分析中,我們將模型配置為 GQA group $g = 4$、每個 group 的 $h = 16$ 個 heads、query/key 的維度 $d_k = 192$ 和 value 的維度 $d_v = 128$。根據 Section 4 的設定,我們設置 NSA 的 compression block size 為 $l = 32$、sliding stride 為 $d = 16$、selected block size 為 $l' = 64$、selected block count 為 $n = 16$,以及 sliding window size 為 $w = 512$。
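For readability, the sketch below (our own illustration, not code from the paper) collects these hyperparameters into a single config object and prints a few derived quantities; all names are ours.

```python
# Illustrative only: the Section 5 hyperparameters gathered into one place.
from dataclasses import dataclass

@dataclass(frozen=True)
class NSAEfficiencyConfig:
    g: int = 4          # GQA groups
    h: int = 16         # query heads per group
    d_k: int = 192      # query/key head dimension
    d_v: int = 128      # value head dimension
    l: int = 32         # compression block size
    stride: int = 16    # sliding stride d
    l_sel: int = 64     # selected block size l'
    n_sel: int = 16     # selected block count n
    w: int = 512        # sliding window size

cfg = NSAEfficiencyConfig()
print("total query heads:", cfg.g * cfg.h)                    # 64
print("tokens covered by selection:", cfg.n_sel * cfg.l_sel)  # 1024
print("tokens covered by sliding window:", cfg.w)             # 512
```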
## 5.1. Training Speed
We compare our Triton-based NSA kernel against a Triton-based FlashAttention-2 kernel (serving as the Full Attention baseline) to ensure a fair speed comparison on the same backend. As shown in Figure 6, our NSA achieves progressively greater speedups as context length increases, up to 9.0× forward and 6.0× backward speedup at 64k context length. Notably, the speed advantage becomes more pronounced with longer sequences. This speedup stems from our hardware-aligned algorithm design, which maximizes the efficiency of the sparse attention architecture: (1) the blockwise memory access pattern maximizes Tensor Core utilization through coalesced loads, and (2) the delicate loop scheduling in the kernel eliminates redundant KV transfers.
為確保在相同的後端環境下進行公平的速度比較,我們將基於 Triton 的 NSA attention 實作與基於 Triton 的 FlashAttention-2(Full Attention 基準)實作進行比較。如 Figure 6 所示,我們的 NSA 隨著上下文長度的增加表現出更顯著的速度提升,在 64k 上下文長度時前向提升達 9.0 倍、反向則有 6.0 倍的速度提升。值得注意的是,這種速度優勢在較長序列中更加明顯。這種速度的提升源於我們硬體對齊的演算法設計,其目的在於最大化稀疏注意力架構的效率:(1) 區塊式記憶體存取模式透過合併載入(coalesced loads)的方式最大化 Tensor Core 的利用率;(2) 核心中精細的循環調度消除了多餘的 KV 傳輸。
Table 4 | Memory access volume (in equivalent number of tokens) per attention operation during decoding. Due to the low arithmetic intensity and memory-bound nature of decoding, the expected speedup is approximately linear with the volume of memory access.
Table 4 | 解碼期間每次注意力運算的記憶體存取量(以等效 token 數計)。由於解碼階段的算術強度低、屬於記憶體受限型運算,預期的加速比大致與記憶體存取量呈線性關係。
| Context Length | 8192 | 16384 | 32768 | 65536 |
|------------------|--------|---------|---------|---------|
| Full Attention | 8192 | 16384 | 32768 | 65536 |
| NSA | 2048 | 2560 | 3584 | 5632 |
| Expected Speedup | 4 × | 6.4 × | 9.1 × | 11.6 × |
## 5.2. Decoding Speed
The decoding speed of attention is primarily determined by the memory access bottleneck, which is closely tied to the amount of KV cache loaded. In each decoding step, our NSA needs to load at most $\lfloor\dfrac{s-l}{d}\rfloor$ compression tokens, $nl'$ selected tokens, and $w$ neighbor tokens, where $s$ is the cached sequence length. As shown in Table 4, our method exhibits a significant reduction in latency as the decoding length increases, achieving up to 11.6× speedup at 64k context length. This advantage in memory access efficiency also amplifies with longer sequences.
注意力的解碼速度主要受限於記憶體的存取瓶頸,這跟 KV cache 的負載密切相關。在每個解碼步驟中,我們的 NSA 最多就只需要載入 $\lfloor\dfrac{s-l}{d}\rfloor$ 個 compression tokens, $nl'$ 個 selected tokens 以及 $w$ 個 neighbor tokens,其中 $s$ 表示 cached 的序列長度。如 Table 4 所示,隨著解碼長度的增加,我們的方法在延遲表現上呈現顯著的改善,在 64k 上下文長度下可實現最高 11.6 倍的速度提升。這種記憶體存取效率的優勢也會隨著序列長度的增加而進一步放大。
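As a quick sanity check on Table 4, the following sketch (ours, not the released kernels) plugs the Section 5 hyperparameters into the per-step load formula above and takes Full Attention's KV loads divided by NSA's as the expected speedup; the outputs track Table 4 up to small rounding differences.

```python
# Estimate NSA's per-step decoding memory access (in tokens) from the formula
# floor((s - l) / d) + n * l' + w, and the implied speedup over Full Attention,
# which must load all s cached tokens. Hyperparameters follow Section 5.
l, d, n, l_sel, w = 32, 16, 16, 64, 512

def nsa_tokens_loaded(s: int) -> int:
    """Compressed tokens + selected tokens + sliding-window tokens."""
    return (s - l) // d + n * l_sel + w

for s in (8192, 16384, 32768, 65536):
    loaded = nsa_tokens_loaded(s)
    print(f"context {s:>5}: NSA loads ~{loaded} tokens, "
          f"expected speedup ~{s / loaded:.1f}x")
# Output is close to Table 4: ~2048/2560/3584/5632 tokens and ~4x/6.4x/9.1x/11.6x.
```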
## 6. Discussion
In this section, we reflect on the development process of NSA and discuss key insights gained from our exploration of different sparse attention strategies. While our approach demonstrates promising results, understanding the challenges encountered with alternative strategies and analyzing attention patterns provides valuable context for future research directions. We first examine challenges with alternative token selection strategies that motivated our design choices, followed by visualizations that offer insights into attention distribution patterns.
在本節中,我們回顧 NSA 的開發過程,並探討我們在研究不同稀疏注意力策略過程中所獲得的關鍵見解。儘管我們所提出的方法已展現出有前景的成果,深入理解其它替代策略所面臨的挑戰,以及分析注意力模式,對未來研究方向而言仍具有重要的參考價值。我們首先探討促成本方法設計決策的替代 token 選擇策略(alternative token selection strategies)所面臨的挑戰,接著透過視覺化分析,呈現注意力分佈的結構與趨勢。
## 6.1. Challenges with Alternative Token Selection Strategies
Before designing NSA, we explored adapting existing sparse attention methods to the training stage. However, these attempts encountered various challenges, prompting us to design a different sparse attention architecture:
在設計 NSA 之前,我們曾嘗試將現有的稀疏注意力方法套用到訓練階段。然而,這些嘗試遇到了各種挑戰,促使我們轉而設計一種不同的稀疏注意力架構
**Key-Clustering Based Strategies.** We examined clustering-based strategies like ClusterKV (Liu et al., 2024). These methods store Keys and Values from the same cluster in contiguous memory regions. While theoretically feasible for training and inference, they face three significant challenges: (1) Non-trivial computational overhead introduced by dynamic clustering mechanisms; (2) Operator optimization difficulties exacerbated by inter-cluster imbalances, especially in Mixture-of-Experts (MoE) systems, where skewed Expert Parallelism (EP) group execution times lead to persistent load imbalances; (3) Implementation constraints arising from the need for mandatory periodic reclustering and chunk-sequential training protocols. These combined factors create substantial bottlenecks, significantly limiting their effectiveness for real-world deployment.
**Key-Clustering Based Strategies.** 我們研究了基於聚類(clustering-based)的方法,像是 ClusterKV (Liu et al., 2024)。這些方法將同一個聚類中的 Keys 和 Values 儲存在連續的記憶體區域中。儘管在訓練和推理方面理論上是可行的,但它們面臨著三個重大挑戰:(1) 動態聚類機制(dynamic clustering mechanisms)帶來不可忽視的計算開銷;(2) 聚類間的失衡加劇了 operator 最佳化的困難,特別是在混合專家(Mixture-of-Experts, MoE)架構中,專家並行(Expert Parallelism, EP)群組的執行時間不均會造成持續的負載不平衡;(3) 因為必須進行週期性重新聚類並採用區塊順序(chunk-sequential)訓練協議而產生的實作限制。這些因素綜合起來形成了明顯的瓶頸,嚴重限制了該類策略在實際部署場景中的效能表現。

Figure 8 | Visualization of the attention map on a Full Attention transformer. Light-colored regions indicate higher attention values. As shown in the figure, attention scores exhibit a blockwise clustering distribution.
Figure 8 | Full Attention transformer 的注意力圖像可視化。淺色區域表示更高的注意力值。如圖所示,注意力得分呈現塊狀聚類分布。

Figure 7 | Comparison of training loss on a 3B-parameter model with Full Attention and different token selection strategies. Our NSA achieves better performance.
Figure 7 | 比較 3B 參數模型在 Full Attention 與不同 token 選擇策略下的訓練損失。我們的 NSA 有著更好的效能。
**Other Blockwise Selection Strategies.** We also considered blockwise key/value selection strategies different from NSA, such as Quest (Tang et al., 2024) and InfLLM (Xiao et al., 2024). These methods rely on computing an importance score for each block and selecting the top-$n$ blocks based on their similarity with $q_t$. However, existing methods face two critical issues: (1) Since the selection operation is non-differentiable, importance score computation based on neural networks relies on auxiliary loss, which increases operator overhead and often degrades model performance; (2) Heuristic parameter-free importance score computation strategies suffer from low recall rates, leading to suboptimal performance. We evaluate both approaches on a 3B-parameter model with similar architecture and compare their loss curves with NSA and Full Attention. For the auxiliary loss-based selection method, we introduce additional queries and representative keys for each block to estimate the block importance scores. These scores are supervised by the mean attention scores between the original queries and keys within each block.
For the heuristic parameter-free selection method, following the strategy of Quest, we implement direct selection using the product between queries and coordinate-wise min-max of the key chunks, without introducing additional parameters. We also explore a cold-start training approach where Full Attention is applied for the initial 1000 steps before transitioning to the heuristic blockwise selection. As shown in Figure 7, both methods exhibited inferior loss.
**Other Blockwise Selection Strategies.** 我們還考慮了與 NSA 不同的區塊式鍵/值(key/value)選擇策略,例如 Quest (Tang et al., 2024) 和 InfLLM (Xiao et al., 2024)。這些方法依賴於計算每個區塊的重要性分數,並根據其與 $q_t$ 的相似度來選取前 $n$ 個區塊。然而,現有的方法存在著兩個關鍵問題:(1) 由於選擇(selection)這個操作是不可微分的,基於神經網路的重要性分數計算就需要引入輔助損失(auxiliary loss),這會增加運算開銷,而且通常會降低模型的效能;(2) 啟發式無參數的重要性分數計算策略通常召回率較低,導致效能不佳。我們在一個架構相似的 3B 參數模型上評估了這兩種方法,並將其損失曲線與 NSA 和 Full Attention 進行比較。對於基於輔助損失的選擇方法,我們為每個區塊引入額外的 queries 和 representative keys,以此估測每個區塊的重要性分數;這些分數由區塊內原始 queries 與 keys 之間的平均注意力分數作為監督信號。至於啟發式無參數選擇方法,則依循 Quest 的策略,直接以 queries 與 key chunks 的 coordinate-wise min-max 之乘積進行選擇,不引入額外的參數。
我們還嘗試了冷啟動訓練方法:前 1000 個 step 採用 Full Attention,之後再切換成啟發式區塊選擇(heuristic blockwise selection)。如 Figure 7 所示,這兩種方法的 loss 都比較高(表現較差)。
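To make the heuristic concrete, here is a rough PyTorch sketch of the Quest-style, parameter-free blockwise scoring described above (single query head for simplicity): each key block is summarized by its coordinate-wise min and max, a per-block upper bound on the query-key dot products is computed, and the top-$n$ blocks are kept. This is our own reconstruction under those assumptions, not the exact code used in the experiments.

```python
import torch

def block_importance(q: torch.Tensor, keys: torch.Tensor, block: int) -> torch.Tensor:
    """q: [d_k], keys: [seq, d_k] -> importance score per key block [num_blocks]."""
    num_blocks = keys.shape[0] // block
    k = keys[: num_blocks * block].view(num_blocks, block, -1)
    k_min = k.min(dim=1).values        # coordinate-wise min per block
    k_max = k.max(dim=1).values        # coordinate-wise max per block
    # Per-dimension upper bound of q . k over the block, summed across dimensions.
    return torch.maximum(q * k_min, q * k_max).sum(dim=-1)

def select_top_blocks(q: torch.Tensor, keys: torch.Tensor, block: int = 64, n: int = 16):
    scores = block_importance(q, keys, block)
    return torch.topk(scores, k=min(n, scores.numel())).indices

# Toy usage with the dimensions used elsewhere in this section.
q = torch.randn(192)
keys = torch.randn(4096, 192)
print(select_top_blocks(q, keys))
```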
## 6.2. Visualization
To explore potential patterns in transformer attention distributions and seek inspiration for our design, we visualize the attention map from our pretrained 27B Full Attention model in Figure 8. The visualization reveals interesting patterns where attention scores tend to exhibit blockwise clustering characteristics, with nearby keys often showing similar attention scores. This observation inspired our design of NSA, suggesting that selecting key blocks based on spatial continuity might be a promising approach.
The blockwise clustering phenomenon indicates that tokens adjacent in the sequence may share certain semantic relationships with query tokens, though the exact nature of these relationships requires further investigation. This observation motivated us to explore a sparse attention mechanism that operates on continuous token blocks rather than individual tokens, aiming to enhance computational efficiency and preserve high-attention patterns.
為了探索 transformer attention distributions 中的潛在模式,並為我們的設計尋找靈感,我們在 Figure 8 中可視化了預訓練的 27B Full Attention model 的 attention map。可視化結果揭示出有趣的模式:注意力分數往往呈現出區塊聚類特徵,相鄰的 keys 通常具有相近的注意力分數。這個觀察啟發了我們的 NSA 設計,顯示根據空間連續性來選擇 key 區塊可能是個有前景的方法。這種區塊狀聚類現象說明,序列中相鄰的 tokens 可能與 query tokens 共享某些語義關係,儘管這些關係的確切性質仍需進一步研究。這個觀察促使我們探索一種作用於連續 token blocks 而非 individual tokens 的稀疏注意力機制,旨在提高計算效率並保留高注意力模式。
## 7. Related Works
We review existing approaches that improve the efficiency of attention computation through sparse attention. These methods can be broadly categorized into three groups based on their core strategies: (1) fixed sparse pattern, (2) dynamic token pruning, and (3) query-aware selection. We introduce several representative works from each category.
我們回顧現有的一些利用稀疏注意力(sparse attention)提升注意力機制計算效率的方法。這些方法可根據其核心策略大致分為三類:(1) 固定稀疏模式,(2) 動態的 token pruning(剪枝),和 (3) query-aware selection。我們從每一種類別中介紹一些代表性的研究。
## 7.1. Fixed Sparse Pattern
Sliding window attention is a commonly used approach that allows the query to compute attention only within a fixed window. StreamingLLM (Xiao et al., 2023) addresses the challenges of processing long text streams by maintaining two critical portions of the context: an attention sink (early tokens) and a local context window. While these approaches effectively reduce memory and computation costs, their rigid pattern of ignoring contexts limits their performance on tasks requiring full context understanding.
滑動視窗 (SlidingWindow) 是一種常用的策略,可以讓 query 在一個固定範圍內計算其注意力。StreamingLLM (Xiao et al., 2023) 針對處理長文本流的挑戰,同時維持兩個關鍵的上下文:一是 attention sink(存放較早的 tokens),二是 局部上下文視窗(local context window)。儘管這些方法能有效降低記憶體與計算成本,但這種忽略上下文的死板模式仍然限制其於需要完整上下文理解的任務上的效能。
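A minimal sketch of this fixed pattern, assuming a 128-token attention sink and a 512-token local window (the budget quoted in our LongBench setup; StreamingLLM's own defaults may differ):

```python
import numpy as np

def streaming_mask(seq_len: int, sink: int = 128, window: int = 512) -> np.ndarray:
    """Boolean [seq_len, seq_len] mask; True means key j is visible to query i."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    causal = j <= i                 # no attention to future tokens
    in_sink = j < sink              # the leading "attention sink" tokens
    in_window = (i - j) < window    # the most recent `window` tokens
    return causal & (in_sink | in_window)

mask = streaming_mask(2048)
print(int(mask[-1].sum()))  # the last query sees 128 sink + 512 local = 640 tokens
```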
## 7.2. Dynamic Token Pruning
H2O (Zhang et al., 2023b) implements an adaptive approach to reduce KV-cache memory usage during decoding. This method dynamically evicts tokens deemed less important for future predictions based on their recent utility according to attention score. SnapKV (Li et al., 2024) also introduces a token pruning strategy that reduces the KV cache by selectively retaining only the most crucial features, enabling efficient memory usage. SnapKV identifies important features through attention weight analysis and voting during prefilling, then updates KV cache by combining selected compressed features with recent context to maintain prompt consistency.
H2O(Zhang et al., 2023b)實作一種自適應機制,在解碼過程中降低 KV-cache memory 的用量。這個方法根據最近的注意力分數動態地釋放那些被認為對未來預測不太重要的 tokens。
SnapKV(Li et al., 2024)則提出一種 token pruning strategy (剪枝策略),透過只保留最關鍵的特徵來減少 KV‐cache,以實現高效的記憶體利用。SnapKV 則是在預填充(prefilling)階段透過注意力權重分析與投票機制識別重要特徵,然後透過結合所選定壓縮後的特徵與近期的上下文的方式來更新 KV‐cache,以維持提示(prompt)的連貫性。
## 7.3. Query-Aware Selection
Quest (Tang et al., 2024) employs a blockwise selection strategy where each chunk's importance is estimated by the product between the query and the coordinate-wise min-max of the key chunks. The resulting scores are used to select the top-$n$ most important key-value chunks for attention. InfLLM (Xiao et al., 2024) combines fixed patterns with retrieval by maintaining attention sinks, local context, and retrievable chunks. This method selects representative keys from each chunk to estimate chunk importance. HashAttention (Desai et al., 2024) formulates pivotal token identification as a recommendation problem by mapping queries and keys to Hamming space using learned functions. ClusterKV (Liu et al., 2024) achieves sparsity by first clustering keys and then selecting the most relevant clusters for attention computation based on query-cluster similarity.
Quest (Tang et al., 2024) 採用分塊選擇策略(blockwise selection strategy),每個 chunk 的重要性由 query 與 key chunks 的 coordinate-wise min-max 之間的乘積估算,所產生的分數用於選取排名前 $n$ 個重要的 key-value chunks 進行注意力計算。InfLLM (Xiao et al., 2024) 則是透過維持注意力匯點(attention sinks)、局部上下文(local context)與可檢索區塊(retrievable chunks)的方式來結合固定模式(fixed patterns)與檢索機制(retrieval),並從每個區塊(chunk)中選出 representative keys(代表性的鍵)來估測區塊的重要性。HashAttention (Desai et al., 2024) 將關鍵 token(pivotal token)的識別視為推薦問題(recommendation problem),透過學習到的函數將 queries 與 keys 映射到漢明空間(Hamming space)。ClusterKV (Liu et al., 2024) 則先對 keys 進行聚類(clustering),再依據 query-cluster similarity 挑選最相關的聚類進行注意力計算,以此實現稀疏化(sparsity)。
## 8. Conclusion
We present NSA, a hardware-aligned sparse attention architecture for efficient long-context modeling. By integrating hierarchical token compression with blockwise token selection within a trainable architecture, our architecture achieves accelerated training and inference while maintaining Full Attention performance. NSA advances the state of the art by matching full-attention baselines on general benchmarks, exceeding them in long-context evaluations, and enhancing reasoning ability, all accompanied by measurable reductions in computational latency and significant speedups.
我們提出 NSA,這是一種針對高效長文本建模的硬體對齊稀疏注意力架構。透過在可訓練架構中整合分層化的 token compression 與 blockwise token selection,我們的架構在維持 Full Attention 效能的同時實現了訓練與推理的加速。NSA 在通用基準測試上與 full-attention baselines 不相上下,在長文本評估中展現更強的建模能力,並提升了推理能力,同時帶來可量化的運算延遲降低與顯著的加速。