
Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention

tags:論文翻譯 deeplearning

說明

排版的順序為先原文,再繁體中文,並且圖片與表格都會出現在第一次出現的段落下面

原文

繁體中文

照片或表格

  1. 個人註解,任何的翻譯不通暢部份都請留言指導
  2. 為了加速閱讀,直接採用自建的反思翻譯(Phi4-14B模型所翻譯)的結果呈現,然後快速看過,語意對了頭過身就過,多多包涵
  3. 這篇論文測試利用docling將論文從pdf取出轉成markdown格式,再利用正則式取片段至dify建置的反思翻譯api取得譯文再做優化調整
  4. 解釋的部份也是直接用Phi4-14B模型來針對片段做理解說明
  5. 機器說明的部份我會把一些不必要的冗文刪除,這可能可以從提示詞中再來做回應的優化

Abstract

原文

Long-context modeling is crucial for next-generation language models, yet the high computational cost of standard attention mechanisms poses significant computational challenges. Sparse attention offers a promising direction for improving efficiency while maintaining model capabilities. We present NSA, a Natively trainable Sparse Attention mechanism that integrates algorithmic innovations with hardware-aligned optimizations to achieve efficient long-context modeling. NSA employs a dynamic hierarchical sparse strategy, combining coarse-grained token compression with fine-grained token selection to preserve both global context awareness and local precision. Our approach advances sparse attention design with two key innovations: (1) We achieve substantial speedups through arithmetic intensity-balanced algorithm design, with implementation optimizations for modern hardware. (2) We enable end-to-end training, reducing pretraining computation without sacrificing model performance. As shown in Figure 1, experiments show the model pretrained with NSA maintains or exceeds Full Attention models across general benchmarks, long-context tasks, and instruction-based reasoning. Meanwhile, NSA achieves substantial speedups over Full Attention on 64k-length sequences across decoding, forward propagation, and backward propagation, validating its efficiency throughout the model lifecycle.

翻譯後的結果

長上下文建模對於下一代語言模型至關重要,然而標準注意力機制的高計算成本帶來了巨大的效率挑戰。稀疏注意力提供了一個有前景的方向,可以在保持模型能力的同時提高效率。我們提出 NSA(原生可訓練稀疏注意力)機制,它將演算法創新與硬體對齊的最佳化相結合,實現高效的長上下文建模。NSA 採用動態分層稀疏策略,將粗粒度詞元壓縮與細粒度詞元選擇相結合,以同時保留全局上下文感知與局部精度。我們的方案透過兩個關鍵創新推進稀疏注意力的設計:(1)透過算術強度平衡的演算法設計,並針對現代硬體進行實作最佳化,達成顯著加速。(2)支援端到端訓練,在不犧牲模型性能的前提下減少預訓練計算量。如圖 1 所示,實驗表明使用 NSA 預訓練的模型在通用基準測試、長上下文任務和指令式推理方面的表現與全注意力模型相當甚至更佳。同時,NSA 在 64k 長度序列的解碼、前向傳播和反向傳播方面都比全注意力實現了顯著加速,驗證了其在整個模型生命週期中的效率。

1. Introduction

原文

The research community increasingly recognizes long-context modeling as a crucial capability for next-generation large language models, driven by diverse real-world applications ranging from in-depth reasoning (DeepSeek-AI, 2025; Zelikman et al., 2022) and repository-level code generation (Zhang et al., 2023a; Zhang et al.) to multi-turn autonomous agent systems (Park et al., 2023). Recent breakthroughs, including OpenAI's o-series models, DeepSeek-R1 (DeepSeek-AI, 2025), and Gemini 1.5 Pro (Google et al., 2024), enable models to process entire codebases and lengthy documents, maintain coherent multi-turn conversations over thousands of tokens, and perform complex reasoning across long-range dependencies. However, the high complexity (Zaheer et al., 2020) of vanilla Attention (Vaswani et al., 2017) mechanisms emerges as a critical latency bottleneck as sequence length increases. Theoretical estimates indicate that attention computation with softmax architectures accounts for 70-80% of total latency when decoding 64k-length contexts, underscoring the urgent need for more efficient attention mechanisms.

翻譯後的結果

研究社群日益認識到長上下文建模是下一代大型語言模型(LLM)的關鍵能力,這受到各種實際應用的推動,涵蓋深入推理(DeepSeek-AI,2025;Zelikman 等人,2022)、程式碼庫層級的程式碼生成(Zhang 等人,2023a;Zhang 等人)以及多回合自主代理系統(Park 等人,2023)。近期突破,例如 OpenAI 的 o 系列模型、DeepSeek-R1(DeepSeek-AI,2025)和 Gemini 1.5 Pro(Google 等人,2024),使模型能夠處理完整的程式碼庫與冗長文件、維持數千個 token 的連貫多回合對話,並跨越長距離依賴執行複雜推理。然而,隨著序列長度增加,原始注意力機制(Vaswani 等人,2017)的高複雜度(Zaheer 等人,2020)成為關鍵的延遲瓶頸。理論估計表明,在解碼 64k 長度的上下文時,softmax 架構的注意力計算佔總延遲的 70-80%,這更凸顯了開發更高效注意力機制的迫切需求。

原文

Figure 1 | Comparison of performance and efficiency between Full Attention model and our NSA. Left: Despite being sparse, NSA surpasses Full Attention baseline on average across general benchmarks, long-context tasks, and reasoning evaluation. Right: For 64k-length sequence processing, NSA achieves substantial computational speedup compared to Full Attention in all stages: decoding, forward propagation, and backward propagation.

翻譯後的結果

圖 1 | 全注意力模型與我們提出的 NSA 在性能與效率方面的比較。左側:儘管是稀疏的,NSA 在一般基準測試、長上下文任務和推理評估中的平均表現仍超越了全注意力基線。右側:對於長度為 64k 的序列處理,NSA 在解碼、前向傳播和反向傳播的所有階段都比全注意力實現了顯著的計算加速。

原文

A natural approach to efficient long-context modeling is to take advantage of the inherent sparsity of softmax attention (Ge et al., 2023; Jiang et al., 2023), where selectively computing critical query-key pairs can significantly reduce computational overhead while preserving performance. Recent advances demonstrate this potential through diverse strategies: KV-cache eviction methods (Li et al., 2024; Zhang et al., 2023b; Zhou et al., 2024), blockwise KV-cache selection methods (Tang et al., 2024; Xiao et al., 2024), and sampling, clustering or hashing-based selection methods (Chen et al., 2024; Desai et al., 2024; Liu et al., 2024). Despite these promising strategies, existing sparse attention methods often fall short in practical deployments. Many approaches fail to achieve speedups comparable to their theoretical gains; also, most methods mainly focus on inference stage, lacking effective training-time support to fully exploit the sparsity patterns of attention.

翻譯後的結果

一種實現高效長上下文建模的自然方法,就是利用 softmax 注意力的內在稀疏性(Ge 等人,2023;Jiang 等人,2023):透過選擇性地計算關鍵的查詢-鍵(query-key)對,可以顯著降低運算負擔,同時保持性能。最近的進展透過多種策略展示了這種潛力:KV 快取淘汰方法(Li 等人,2024;Zhang 等人,2023b;Zhou 等人,2024)、分塊 KV 快取選取方法(Tang 等人,2024;Xiao 等人,2024),以及基於取樣、聚類或雜湊的選擇方法(Chen 等人,2024;Desai 等人,2024;Liu 等人,2024)。儘管有這些有前景的策略,現有的稀疏注意力方法在實際部署中仍常常無法達到預期效果:許多方法未能實現與理論收益相當的加速;此外,大部分方法主要聚焦在推理階段,缺乏有效的訓練階段支援,無法充分發揮注意力的稀疏模式。

原文

To address these limitations, the deployment of effective sparse attention must tackle two key challenges: (1) Hardware-aligned inference speedup : Converting theoretical computation reductions into actual speed improvements requires hardware-friendly algorithm design during both prefilling and decoding stages to mitigate memory access and hardware scheduling bottlenecks; (2) Training-aware algorithm design : Enabling end-to-end computation with trainable operators to reduce training costs while maintaining model performance. These requirements are crucial for real-world applications to achieve fast long-context inference or training. When considering both aspects, existing methods still exhibit a noticeable gap.

翻譯後的結果

為克服這些限制,有效部署稀疏注意力必須解決兩個關鍵挑戰:(1) 硬體對齊的推理加速:要將理論上的計算減少轉化為實際的速度提升,需要在預填充(prefilling)和解碼階段採用對硬體友善的演算法設計,以減輕記憶體存取與硬體排程瓶頸;(2) 面向訓練的演算法設計:以可訓練的運算子實現端到端計算,在降低訓練成本的同時維持模型性能。這些要求對於實際應用達成快速的長上下文推理或訓練至關重要。同時考量這兩個面向時,現有方法仍存在明顯差距。

原文

To achieve more effective and efficient sparse attention, we present NSA, a Natively trainable Sparse Attention architecture that integrates hierarchical token modeling. As shown in Figure 2, NSA reduces per-query computation by organizing keys and values into temporal blocks and processing them through three attention paths: compressed coarse-grained tokens, selectively retained fine-grained tokens, and sliding windows for local contextual information. Then we implement specialized kernels to maximize its practical efficiency. NSA introduces two core innovations corresponding to the key requirements above: (1) Hardware-aligned system: Optimize blockwise sparse attention for Tensor Core utilization and memory access, ensuring balanced arithmetic intensity. (2) Training-aware design: Enable stable end-to-end training through efficient algorithms and backward operators. This optimization enables NSA to support both efficient deployment and end-to-end training.

翻譯後的結果

為了實現更有效且高效的稀疏注意力,我們提出 NSA(原生可訓練稀疏注意力架構),它整合了分層詞元建模。如圖 2 所示,NSA 透過將鍵與值組織成時間區塊,並經由三條注意力路徑處理,來減少每個查詢的計算量:壓縮的粗粒度詞元、選擇性保留的細粒度詞元,以及用於局部上下文資訊的滑動視窗。接著我們實作專用的核心(kernel),以最大化其實際效率。NSA 針對上述關鍵需求引入兩項核心創新:(1) 硬體對齊系統:針對 Tensor Core 利用率與記憶體存取最佳化分塊稀疏注意力,確保算術強度平衡。(2) 訓練感知設計:透過高效的演算法和反向傳播運算子實現穩定的端到端訓練。這些最佳化使 NSA 能同時支援高效部署與端到端訓練。

原文

Figure 2 | Overview of NSA's architecture. Left: The framework processes input sequences through three parallel attention branches: For a given query, preceding keys and values are processed into compressed attention for coarse-grained patterns, selected attention for important token blocks, and sliding attention for local context. Right: Visualization of different attention patterns produced by each branch. Green areas indicate regions where attention scores need to be computed, while white areas represent regions that can be skipped.

翻譯後的結果

圖 2 | NSA 架構概覽。左側:框架以三個並行注意力分支處理輸入序列:對於給定查詢,前序鍵值會被處理成壓縮注意力擷取粗粒度模式、選擇性注意力關注重要詞元區塊,以及滑動注意力捕捉局部上下文。右側:各分支產生的不同注意力模式可視化圖。綠色區域表示需要計算注意力分數的區域,白色區域代表可以跳過的區域。

原文

We evaluate NSA through comprehensive experiments on real-world language corpora. Pretraining on a 27B-parameter transformer backbone with 260B tokens, we assess NSA's performance across general language evaluations, long-context evaluations, and chain-of-thought reasoning evaluation. We further compare the kernel speed on A100 GPUs with optimized Triton (Tillet et al., 2019) implementations. Experimental results demonstrate that NSA achieves comparable or superior performance to full attention baseline, while outperforming existing sparse attention approaches. Additionally, NSA delivers substantial speedups across decoding, forward, and backward stages compared to Full Attention, with the speedup ratio increasing for longer sequences. These results validate that our hierarchical sparse attention design effectively balances model capability and computational efficiency.

翻譯後的結果

我們透過在真實世界語言語料庫上的全方位實驗評估 NSA 的效能。在以 27B 參數的 Transformer 骨幹、260B 個 token 進行預訓練後,我們在通用語言評估、長上下文評估以及思維鏈推理評估等方面評估 NSA 的表現。我們進一步在 A100 GPU 上,與最佳化的 Triton(Tillet 等人,2019)實作比較核心(kernel)速度。實驗結果顯示,NSA 的性能與全注意力基線相當甚至更佳,同時勝過現有的稀疏注意力方法。此外,與全注意力相比,NSA 在解碼、前向傳播和反向傳播階段均取得顯著加速,且序列越長加速比越高。這些結果驗證了我們的分層稀疏注意力設計有效地平衡了模型能力與計算效率。

2. Rethinking Sparse Attention Methods

原文

Modern sparse attention methods have made significant strides in reducing the theoretical computational complexity of transformer models. However, most approaches predominantly apply sparsity during inference while retaining a pretrained Full Attention backbone, potentially introducing architectural bias that limits their ability to fully exploit sparse attention's advantages. Before introducing our native sparse architecture, we systematically analyze these limitations through two critical lenses.

翻譯後的結果

現代稀疏注意力方法在降低 Transformer 模型的理論計算複雜度方面取得了重大進展。但是,大多數方法主要在推理階段應用稀疏性,同時保留以全注意力預訓練的主幹,這可能引入架構偏差,限制其充分利用稀疏注意力優勢的能力。在介紹我們的原生稀疏架構之前,我們先透過兩個關鍵視角系統性地分析這些局限。

2.1. The Illusion of Efficient Inference

原文

Despite achieving sparsity in attention computation, many methods fail to achieve corresponding reductions in inference latency, primarily due to two challenges:

翻譯後的結果

儘管許多方法在注意力計算中實現稀疏性,但由於以下兩個挑戰,未能有效減少推理延遲:

原文

Phase-Restricted Sparsity. Methods such as H2O (Zhang et al., 2023b) apply sparsity during autoregressive decoding while requiring computationally intensive pre-processing (e.g. attention map calculation, index building) during prefilling. In contrast, approaches like MInference (Jiang et al., 2024) focus solely on prefilling sparsity. These methods fail to achieve acceleration across all inference stages, as at least one phase retains computational costs comparable to Full Attention. The phase specialization reduces the speedup ability of these methods in prefilling-dominated workloads like book summarization and code completion, or decoding-dominated workloads like long chain-of-thought (Wei et al., 2022) reasoning.

翻譯後的結果

階段受限的稀疏性。例如 H2O(Zhang 等人,2023b)等方法在自迴歸解碼期間應用稀疏性,但需要在預填充階段進行耗時的前處理(例如注意力圖計算、索引建立)。相比之下,像 MInference(Jiang 等人,2024)的方法只關注預填充階段的稀疏性。這些方法無法在所有推理階段實現加速,因為至少有一個階段的計算成本與全注意力相當。這種階段特化會降低這些方法在以預填充為主的工作負載(例如書籍摘要和程式碼補全),或以解碼為主的工作負載(例如長思維鏈(Wei 等人,2022)推理)中的加速能力。

原文

Incompatibility with Advanced Attention Architecture. Some sparse attention methods fail to adapt to modern decoding efficient architectures like Multi-Query Attention (MQA) (Shazeer, 2019) and Grouped-Query Attention (GQA) (Ainslie et al., 2023), which significantly reduced the memory access bottleneck during decoding by sharing KV across multiple query heads. For instance, in approaches like Quest (Tang et al., 2024), each attention head independently selects its KV-cache subset. Although it demonstrates consistent computation sparsity and memory access sparsity in Multi-Head Attention (MHA) models, it presents a different scenario in models based on architectures like GQA, where the memory access volume of KV-cache corresponds to the union of selections from all query heads within the same GQA group. This architectural characteristic means that while these methods can reduce computation operations, the required KV-cache memory access remains relatively high. This limitation forces a critical choice: while some sparse attention methods reduce computation, their scattered memory access pattern conflicts with efficient memory access design from advanced architectures.

翻譯後的結果

與先進注意力架構的不相容性。部分稀疏注意力方法難以適應現代高效解碼架構,例如多查詢注意力 (MQA) (Shazeer, 2019) 和分組查詢注意力 (GQA) (Ainslie 等人,2023),這些架構透過在多個查詢頭之間共享 KV 來顯著減少解碼過程中的記憶體存取瓶頸。例如,在 Quest (Tang 等人,2024) 這類方法中,每個注意力頭獨立選擇其 KV-cache 子集。儘管它在多頭注意力 (MHA) 模型中展現一致的計算稀疏性和記憶體存取稀疏性,但在 GQA 等架構的模型中情況不同:KV-cache 的記憶體存取量等於同一個 GQA 組內所有查詢頭選取結果的聯集。這種架構特性意味著,儘管這些方法可以減少計算量,所需的 KV-cache 記憶體存取量仍然相對較高。這個限制迫使我們面臨關鍵抉擇:雖然一些稀疏注意力方法能減少計算,但其分散的記憶體存取模式與先進架構的高效記憶體存取設計相衝突。

原文

These limitations arise because many existing sparse attention methods focus on KV-cache reduction or theoretical computation reduction, but struggle to achieve significant latency reduction in advanced frameworks or backends. This motivates us to develop algorithms that combine both advanced architectural and hardware-efficient implementation to fully leverage sparsity for improving model efficiency.

翻譯後的結果

這些限制源於許多現有的稀疏注意力方法專注於 KV 快取縮減或理論計算量的減少,但在先進框架或後端上難以實現顯著的延遲降低。這促使我們開發同時結合先進架構與硬體高效實作的演算法,充分利用稀疏性來提升模型效率。

2.2. The Myth of Trainable Sparsity

原文

Our pursuit of native trainable sparse attention is motivated by two key insights from analyzing inference-only approaches: (1) Performance Degradation : Applying sparsity post-hoc forces models to deviate from their pretrained optimization trajectory. As demonstrated by Chen et al. (2024), top 20% attention can only cover 70% of the total attention scores, rendering structures like retrieval heads in pretrained models vulnerable to pruning during inference. (2) Training Efficiency Demands : Efficient handling of long-sequence training is crucial for modern LLM development. This includes both pretraining on longer documents to enhance model capacity, and subsequent adaptation phases such as long-context fine-tuning and reinforcement learning. However, existing sparse attention methods primarily target inference, leaving the computational challenges in training largely unaddressed. This limitation hinders the development of more capable long-context models through efficient training. Additionally, efforts to adapt existing sparse attention for training also expose challenges:

翻譯後的結果

我們追求原生可訓練稀疏注意力的動機,源於分析僅推理(inference-only)方法時獲得的兩個關鍵洞察:(1) 效能退化:事後(post-hoc)套用稀疏性會迫使模型偏離其預訓練的優化軌跡。如 Chen 等人 (2024) 所示,前 20% 的注意力只能覆蓋總注意力得分的 70%,這使得預訓練模型中諸如檢索頭(retrieval heads)之類的結構在推理期間容易被剪枝破壞。(2) 訓練效率需求:高效處理長序列訓練對於現代大型語言模型 (LLM) 的發展至關重要。這包括在更長的文件上進行預訓練以增強模型能力,以及後續的適應階段,例如長上下文微調和強化學習。然而,現有的稀疏注意力方法主要針對推理,訓練過程中的計算挑戰大多未被解決。這種限制阻礙了透過高效訓練開發更強大的長上下文模型。此外,將現有稀疏注意力改造用於訓練也暴露出以下挑戰:

原文

Non-Trainable Components. Discrete operations in methods like ClusterKV (Liu et al., 2024) (includes k-means clustering) and MagicPIG (Chen et al., 2024) (includes SimHash-based selecting) create discontinuities in the computational graph. These non-trainable components prevent gradient flow through the token selection process, limiting the model's ability to learn optimal sparse patterns.

翻譯後的結果

不可訓練的組件。ClusterKV(Liu 等人,2024)(包含 k-means 聚類)和 MagicPIG(Chen 等人,2024)(包含基於 SimHash 的選取)等方法中的離散操作,會在計算圖中產生不連續性。這些不可訓練的組件阻止了梯度流經詞元(token)選擇過程,限制了模型學習最優稀疏模式的能力。

原文

Inefficient Back-propagation. Some theoretically trainable sparse attention methods suffer from practical training inefficiencies. Token-granular selection strategy used in approaches like HashAttention (Desai et al., 2024) leads to the need to load a large number of individual tokens from the KV cache during attention computation. This non-contiguous memory access prevents efficient adaptation of fast attention techniques like FlashAttention, which rely on contiguous memory access and blockwise computation to achieve high throughput. As a result, implementations are forced to fall back to low hardware utilization, significantly degrading training efficiency.

翻譯後的結果

低效的反向傳播。某些理論上可訓練的稀疏注意力方法在實務上存在訓練效率問題。像 HashAttention(Desai 等人,2024)這類方法所使用的詞元粒度選擇策略,導致在注意力計算過程中需要從 KV 快取載入大量個別詞元。這種非連續的記憶體存取使其難以套用 FlashAttention 等快速注意力技術,後者依賴連續的記憶體存取和分塊計算來達成高吞吐量。因此,實作被迫退回到低硬體利用率,顯著降低訓練效率。

2.3. Native Sparsity as an Imperative

原文

These limitations in inference efficiency and training viability motivate our fundamental redesign of sparse attention mechanisms. We propose NSA, a natively sparse attention framework that addresses both computational efficiency and training requirements. In the following sections, we detail the algorithmic design and operator implementation of NSA.

翻譯後的結果

這些推論效率和訓練可行性上的限制,促使我們對稀疏注意力機制進行根本性的重新設計。我們提出 NSA,一個本質上稀疏的注意力框架,能夠同時兼顧計算效率和訓練需求。接下來的各節將詳細介紹 NSA 的演算法設計及運算元實作。

3. Methodology

原文

Our technical approach spans algorithm design and kernel optimization. In the following subsections, we first introduce the background of our methodology. Then we present the overall framework of NSA, followed by its key algorithmic components. Finally, we detail our hardware-optimized kernel design that maximizes practical efficiency.

翻譯後的結果

我們的技術路線涵蓋演算法設計和核心(kernel)最佳化。在以下小節中,我們首先介紹方法的背景,接著呈現 NSA 的整體框架及其關鍵演算法組件,最後詳細介紹最大化實際效率的硬體最佳化核心設計。

3.1. Background

原文

Attention Mechanism is widely used in language modeling where each query token $q_t$ computes relevance scores against all preceding keys $k_{:t}$ to generate a weighted sum of values $v_{:t}$. Formally, for an input sequence of length $t$, the attention operation is defined as:

翻譯後的結果

注意力機制廣泛應用於語言建模中,每個查詢詞彙 q_t 會計算與所有前序關鍵字 k_{1:t} 的相關性得分,進而生成對應值 v_{1:t} 的加權和。正式地,對於長度為 t 的輸入序列,注意力運算定義如下:
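
個人補充:原文此處的公式 (1) 在 PDF 轉 markdown 時遺失,依上下文補回的標準寫法大致如下(符號定義以論文為準):

$$
\mathbf{o}_t = \operatorname{Attn}\left(\mathbf{q}_t, \mathbf{k}_{:t}, \mathbf{v}_{:t}\right)
$$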

原文

where Attn denotes the attention function:

翻譯後的結果

其中,Attn 代表注意力機制:
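
個人補充:公式 (2) 同樣在轉檔時遺失,依上下文補回的 softmax 注意力標準形式大致為:

$$
\operatorname{Attn}\left(\mathbf{q}_t, \mathbf{k}_{:t}, \mathbf{v}_{:t}\right)
= \sum_{i=1}^{t} \frac{\alpha_{t,i}\,\mathbf{v}_i}{\sum_{j=1}^{t} \alpha_{t,j}},
\qquad
\alpha_{t,i} = e^{\frac{\mathbf{q}_t^{\top}\mathbf{k}_i}{\sqrt{d_k}}}
$$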

原文

Here, $\alpha_{t,i}$ represents the attention weight between $q_t$ and $k_i$, and $d_k$ is the feature dimension of keys. As sequence length increases, attention computation becomes increasingly dominant in the overall computational cost, presenting significant challenges for long-context processing.

翻譯後的結果

這裡,$\alpha_{t,i}$ 代表 $q_t$ 和 $k_i$ 之間的注意力權重,而 $d_k$ 是鍵的特徵維度。隨著序列長度的增加,注意力計算在整體運算成本中占據越來越大的比重,這對處理長上下文任務帶來了顯著挑戰。

原文

Arithmetic Intensity is the ratio of compute operations to memory accesses. It intrinsically shapes algorithm optimization on hardware. Each GPU has a critical arithmetic intensity determined by its peak compute capability and memory bandwidth, calculated as the ratio of these two hardware limits. For computation tasks, arithmetic intensity above this critical threshold becomes compute-bound (limited by GPU FLOPS), while below it becomes memory-bound (limited by memory bandwidth).

翻譯後的結果

算術強度是指計算操作與記憶體存取量的比率,它從根本上影響演算法在硬體上的最佳化。每個 GPU 都有一個由其峰值算力和記憶體頻寬決定的臨界算術強度,計算方式即為這兩個硬體極限的比值。對於計算任務來說,算術強度高於此臨界值時為計算受限(compute-bound,受 GPU FLOPS 限制);低於此臨界值時則為記憶體受限(memory-bound,受記憶體頻寬限制)。
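
個人註解:下面用一小段 Python 示意「臨界算術強度」的概念。其中 A100 的峰值算力與記憶體頻寬取自公開規格的約略值(約 312 TFLOPS 的 BF16 Tensor Core 算力、約 2 TB/s 的 HBM 頻寬),僅作粗略示意,並非論文提供的數據:

```python
# 臨界算術強度 = 峰值算力 / 記憶體頻寬(單位:FLOPs per byte)
# 工作負載的算術強度高於此值為 compute-bound,低於此值為 memory-bound
peak_flops = 312e12        # A100 BF16 Tensor Core 峰值,約 312 TFLOPS(公開規格,僅示意)
mem_bandwidth = 2.0e12     # A100 80GB HBM 頻寬,約 2.0 TB/s(公開規格,僅示意)

critical_intensity = peak_flops / mem_bandwidth
print(f"critical arithmetic intensity ≈ {critical_intensity:.0f} FLOPs/byte")

def bound_type(flops, bytes_accessed):
    """依工作負載的算術強度粗略判斷 compute-bound 或 memory-bound。"""
    return "compute-bound" if flops / bytes_accessed > critical_intensity else "memory-bound"

# 解碼一個 token 時,注意力大致是「讀完整個 KV cache,但每個 byte 只做少量運算」,
# 算術強度遠低於臨界值,因此通常落在 memory-bound(數字僅為示意)。
print(bound_type(flops=2e9, bytes_accessed=1e9))
```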

原文

Specifically for causal self-attention mechanism, during training and prefilling phases, batched matrix multiplications and attention computations exhibit high arithmetic intensity, making these stages compute-bound on modern accelerators. In contrast, auto-regressive decoding becomes memory-bandwidth constrained because it generates one token per forward pass while requiring loading the entire key-value cache, resulting in low arithmetic intensity. This leads to different optimization goals - reducing computation cost during training and prefilling, while reducing memory access during decoding.

翻譯後的結果

具體到因果自注意力機制:在訓練和預填充階段,批次矩陣乘法與注意力計算呈現高算術強度,使這些階段在現代加速器上屬於計算受限。相反地,自迴歸解碼每次前向傳遞僅生成一個詞元,卻需要載入整個鍵值快取(KV cache),算術強度很低,因而受記憶體頻寬限制。這導致了不同的最佳化目標:訓練和預填充階段要降低計算成本,而解碼階段則要減少記憶體存取。

3.2. Overall Framework

原文

To leverage the potential of attention with natural sparse pattern, we propose replacing the original key-value pairs $k_{:t}, v_{:t}$ in Equation (1) with a more compact and information-dense set of representation key-value pairs $\tilde{K}_t, \tilde{V}_t$ given each query $q_t$. Specifically, we formally define the optimized attention output as follows:

翻譯後的結果

為充分利用注意力機制的自然稀疏模式的潛力,我們建議將方程式 (1) 中原始的鍵值對 $k_{:t}$、$v_{:t}$ 替換為針對每個查詢 $q_t$ 構造的、更緊湊且資訊密度更高的表示鍵值對 $\tilde{K}_t$、$\tilde{V}_t$。我們正式定義最佳化後的注意力輸出如下:

原文

where $\tilde{K}_t, \tilde{V}_t$ are dynamically constructed based on the current query $q_t$ and the contextual memory $k_{:t}, v_{:t}$. We can design various mapping strategies to get different categories of $\tilde{K}_t^c, \tilde{V}_t^c$, and combine them as follows:

翻譯後的結果

其中 $\tilde{K}_t$、$\tilde{V}_t$ 是基於當前查詢 $q_t$ 和上下文記憶 $k_{:t}$、$v_{:t}$ 動態構造的。我們可以設計各種映射策略來獲得不同類別的 $\tilde{K}_t^c$、$\tilde{V}_t^c$,並以下列方式將它們結合起來:

原文

As illustrated in Figure 2, NSA has three mapping strategies $\mathcal{C} = \{\text{cmp}, \text{slc}, \text{win}\}$, representing compression, selection, and sliding window for keys and values. $g_t^c \in [0, 1]$ is the gate score for the corresponding strategy $c$, derived from input features via an MLP and sigmoid activation. Let $N_t$ denote the total number of remapped keys/values:

翻譯後的結果

如圖 2 所示,NSA 包含三個映射策略 $\mathcal{C} = \{\text{cmp}, \text{slc}, \text{win}\}$,分別代表鍵與值的壓縮、選擇和滑動視窗。其中,$g_t^c \in [0, 1]$ 為對應策略 $c$ 的門控分數,由輸入特徵經多層感知機 (MLP) 和 sigmoid 激活函數計算得出。令 $N_t$ 表示重映射後的鍵/值總數:
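
個人補充:此處遺失的公式依上下文補回,大意是以門控分數把三條分支的注意力輸出加權相加,並把 $N_t$ 定義為三類重映射鍵/值的數量總和(寫法為個人重建,非原文排版):

$$
\mathbf{o}_t^{*} = \sum_{c \in \mathcal{C}} g_t^{c} \cdot \operatorname{Attn}\!\left(\mathbf{q}_t, \tilde{K}_t^{c}, \tilde{V}_t^{c}\right),
\qquad
N_t = \sum_{c \in \mathcal{C}} \operatorname{size}\!\left[\tilde{K}_t^{c}\right]
$$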

原文

We maintain a high sparsity ratio by ensuring $N_t \ll t$.

翻譯後的結果

我們透過確保 $N_t \ll t$ 來維持高稀疏率。

3.3. Algorithm Design

原文

In this subsection, we introduce the design of our remapping strategies $f_K$ and $f_V$: token compression, token selection, and sliding window.

翻譯後的結果

在本小節中,我們介紹重映射策略 $f_K$ 與 $f_V$ 的設計:詞元壓縮、詞元選擇和滑動視窗。

3.3.1. Token Compression

原文

By aggregating sequential blocks of keys or values into block-level representations, we obtain compressed keys and values that capture the information of the entire block. Formally, the compressed key representation is defined as:

翻譯後的結果

透過將順序塊中的鍵或值彙總為塊級表示,我們可以得到壓縮的鍵和值,這些壓縮後的鍵和值包含整體資訊。正式地,壓縮后的鍵表示定義如下:

原文

where $l$ is the block length, $d$ is the sliding stride between adjacent blocks, and $\varphi$ is a learnable MLP with intra-block position encoding that maps the keys in a block to a single compressed key. $\tilde{K}_t^{\text{cmp}} \in \mathbb{R}^{d_k \times \lfloor (t-l)/d \rfloor}$ is the tensor composed of compressed keys. Usually, we adopt $d < l$ to mitigate information fragmentation. An analogous formulation holds for the compressed value representation $\tilde{V}_t^{\text{cmp}}$. Compressed representations capture coarser-grained higher-level semantic information and reduce the computational burden of attention.

翻譯後的結果

其中,$l$ 表示區塊長度,$d$ 表示相鄰區塊之間的滑動步幅,而 $\varphi$ 是一個帶有區塊內位置編碼的可學習多層感知器 (MLP),用於將一個區塊中的鍵映射成單一壓縮鍵。$\tilde{K}_t^{\text{cmp}} \in \mathbb{R}^{d_k \times \lfloor (t-l)/d \rfloor}$ 為由壓縮鍵組成的張量。通常我們設定 $d < l$ 來減輕資訊碎片化問題。類似的表述也適用於壓縮值表示 $\tilde{V}_t^{\text{cmp}}$。這些壓縮表示捕捉更粗粒度、更高層次的語義資訊,並降低注意力的計算負擔。
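
個人註解:以下用幾行 Python 示意「把長度 $l$ 的鍵區塊(步幅 $d$)壓縮成單一向量」的形狀變化。論文中的 $\varphi$ 是帶區塊內位置編碼的可學習 MLP,這裡以平均池化代替,僅作維度上的示意,並非論文實作:

```python
import torch

def compress_keys(keys, l=32, d=16):
    """keys: [t, d_k]。回傳約 ⌊(t-l)/d⌋ 個壓縮鍵,每個代表一個長度 l 的區塊。"""
    t, d_k = keys.shape
    blocks = [keys[i:i + l].mean(dim=0)        # 以平均池化代替可學習的 phi(示意)
              for i in range(0, t - l + 1, d)]
    return torch.stack(blocks, dim=0)          # [num_blocks, d_k]

k = torch.randn(1024, 192)                     # t=1024,d_k=192(論文的每頭維度)
print(compress_keys(k).shape)                  # 約為 torch.Size([63, 192])
```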

3.3.2. Token Selection

原文

Using only compressed keys, values might lose important fine-grained information, motivating us to selectively preserve individual keys, values. Below we describe our efficient token selection mechanism that identifies and preserves the most relevant tokens with low computational overhead.

翻譯後的結果

僅使用壓縮鍵值,可能會導致重要細微資訊的遺失,因此我們需要選擇性地保留個別關鍵和值。以下,我們將描述我們的有效標記選取機制,它能夠有效地識別並保留最相關的標記,同時具備低的計算開銷。

原文

Blockwise Selection. Our selection strategy processes key and value sequences in spatially continuous blocks, motivated by two key factors: hardware efficiency considerations and inherent distribution patterns of attention scores. Blockwise selection is crucial to achieve efficient computation on modern GPUs. That is because modern GPU architectures exhibit significantly higher throughput for continuous block accesses compared to random index-based reads. Also, blockwise computation enables optimal utilization of Tensor Cores. This architectural characteristic has established blockwise memory access and computation as a fundamental principle in high-performance attention implementations, as exemplified by FlashAttention's block-based design. Blockwise selection follows the inherent distribution patterns of attention scores. Prior works (Jiang et al., 2024) have shown that attention scores often exhibit spatial continuity, suggesting that neighboring keys tend to share similar importance levels. Our visualization in Section 6.2 also shows this spatial continuous pattern.

翻譯後的結果

分塊選擇。我們的選擇策略以空間上連續的區塊處理鍵和值序列,主要考量兩個因素:硬體效率,以及注意力分數固有的分佈模式。分塊選擇對於在現代 GPU 上實現高效計算至關重要,因為現代 GPU 架構對連續區塊存取的吞吐量顯著高於基於隨機索引的讀取;此外,分塊計算也能充分利用 Tensor Core。這種架構特性使分塊記憶體存取與計算成為高性能注意力實作的基本原則,FlashAttention 的分塊設計就是一個例子。分塊選擇也符合注意力分數固有的分佈模式:先前的研究(Jiang 等人,2024)表明注意力分數往往呈現空間連續性,亦即相鄰的鍵傾向具有相似的重要性。我們在第 6.2 節中的可視化也顯示了這種空間連續模式。

原文

To implement blockwise selection, we first divide key, value sequences into selection blocks. To identify the most important blocks for attention computation, we need to assign importance scores to each block. Below we present our method for computing these block-level importance scores.

翻譯後的結果

為實施區塊式選擇,我們首先將關鍵和值序列分割成選擇區塊。為了在注意力計算中識別最為重要的區塊,我們需要為每個區塊分配重要性得分。以下介紹我們計算這些區塊級重要性分數的方法。

原文

Importance Score Computation. Computing block importance scores could introduce significant overhead. Fortunately, the attention computation of compression tokens produces intermediate attention scores that we can leverage to induce selection block importance scores, formulated as:

翻譯後的結果

重要性評分計算。計算區塊的重要性得分可能會帶來顯著的開銷。幸運的是,壓縮詞彙的注意力計算會產生我們可以利用的中間注意力得分,這些得分可以用於推導出選擇區塊的重要性得分,其公式如下:

原文

where $p_t^{\text{cmp}} \in \mathbb{R}^{\lfloor (t-l)/d \rfloor}$ is the attention scores between $q_t$ and the compression keys $\tilde{K}_t^{\text{cmp}}$. Let $l'$ denote the selection block size. When compression blocks and selection blocks share the same blocking scheme, i.e., $l' = l = d$, we can directly obtain the selection block importance scores $p_t^{\text{slc}}$ by $p_t^{\text{slc}} = p_t^{\text{cmp}}$ straightforwardly. For cases where the blocking schemes differ, we derive the importance scores for selection blocks according to their spatial relationship. Given $d \mid l$ and $d \mid l'$, we have:

翻譯後的結果

其中 $p_t^{\text{cmp}} \in \mathbb{R}^{\lfloor (t-l)/d \rfloor}$ 為 $q_t$ 與壓縮鍵 $\tilde{K}_t^{\text{cmp}}$ 之間的注意力分數,$l'$ 表示選取區塊大小。當壓縮區塊和選取區塊共享相同的切塊方式,即 $l' = l = d$ 時,我們可以直接得到選取區塊重要性分數 $p_t^{\text{slc}} = p_t^{\text{cmp}}$。對於切塊方式不同的情況,我們根據其空間關係推導選取區塊的重要性分數。在 $d \mid l$ 和 $d \mid l'$ 的條件下,我們有:

原文

where [·] denotes the indexing operator for accessing vector element. For models employing GQA or MQA where key-value caches are shared across query heads, consistent block selection across these heads has to be ensured to minimize KV cache loading during decoding. The shared importance scores across heads in a group are formally defined as:

翻譯後的結果

其中 [·] 表示用於存取向量元素的索引運算子。對於採用 GQA 或 MQA 的模型,其鍵值快取在查詢頭之間共享,因此必須確保這些頭之間的區塊選取一致,以最小化解碼期間 KV 快取的載入量。一個組內各查詢頭共享的重要性分數正式定義為:

原文

where the superscript $(h)$ denotes the head index, and $H$ is the number of query heads in each group. This aggregation ensures consistent block selection across heads within the same group.

翻譯後的結果

其中上標 $(h)$ 表示查詢頭的索引,$H$ 代表每個組中的查詢頭數量。這種聚合方式確保同一組內各個頭選取一致的區塊。

原文

Top-$n$ Block Selection. After obtaining the selection block importance scores, we retain tokens within the top-$n$ sparse blocks ranked by block importance scores, formulated as:

翻譯後的結果

前 n 個區塊篩選。在獲得區塊重要性得分後,我們保留根據區塊重要性得分排名靠前的 top-n 個稀疏區塊中的詞彙,提取如下:

原文

where $\mathrm{rank}(\cdot)$ denotes the ranking position in descending order, with $\mathrm{rank} = 1$ corresponding to the highest score, $\mathcal{I}_t$ is the set of selected blocks' indices, and $\mathrm{Cat}$ denotes the concatenation operation. $\tilde{K}_t^{\text{slc}} \in \mathbb{R}^{d_k \times n l'}$ is the tensor composed of the selected keys. An analogous formulation applies to the fine-grained values $\tilde{V}_t^{\text{slc}}$. The selected keys and values then participate in the attention computation with $q_t$ as defined in Equation (5).

翻譯後的結果

其中 $\mathrm{rank}(\cdot)$ 表示按降序排列的名次,$\mathrm{rank} = 1$ 對應最高分數,$\mathcal{I}_t$ 是被選取區塊的索引集合,$\mathrm{Cat}$ 代表串接操作。$\tilde{K}_t^{\text{slc}} \in \mathbb{R}^{d_k \times n l'}$ 是由被選取的鍵組成的張量;類似的公式也適用於細粒度的值 $\tilde{V}_t^{\text{slc}}$。接著,被選取的鍵和值會與 $q_t$ 依公式 (5) 進行注意力計算。
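
個人註解:以下用 Python 粗略示意「以壓縮注意力分數作為區塊重要性、在 GQA 群組內跨頭加總、再取 top-n 區塊並串接其鍵/值」的流程。此處假設壓縮區塊與選擇區塊採同一種切塊方式($l' = l = d$),所有函式與變數命名皆為示意,並非論文的官方實作:

```python
import torch

def select_topn_blocks(p_cmp, n=16):
    """p_cmp: [H, num_blocks],群組內每個查詢頭對各壓縮鍵的注意力分數。
    依論文做法跨頭加總後取 top-n,讓同一 GQA 群組共用同一組區塊索引。"""
    p_slc = p_cmp.sum(dim=0)                          # [num_blocks]
    return torch.topk(p_slc, k=min(n, p_slc.numel())).indices

def gather_selected_kv(keys, values, block_idx, block_size=64):
    """依選中的區塊索引,把連續區塊內的鍵/值串接起來(對應論文中的 Cat)。"""
    k_blocks, v_blocks = [], []
    for b in block_idx.tolist():
        s, e = b * block_size, (b + 1) * block_size
        k_blocks.append(keys[s:e])
        v_blocks.append(values[s:e])
    return torch.cat(k_blocks), torch.cat(v_blocks)

H, t, d_k, d_v = 16, 4096, 192, 128                   # 每組 16 個頭(論文設定)
p_cmp = torch.rand(H, t // 64)                        # 假設已有壓縮注意力分數
idx = select_topn_blocks(p_cmp)
k_slc, v_slc = gather_selected_kv(torch.randn(t, d_k), torch.randn(t, d_v), idx)
print(k_slc.shape, v_slc.shape)                       # [1024, 192] [1024, 128]
```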

3.3.3. Sliding Window

原文

In attention mechanisms, local patterns typically adapt faster and can dominate the learning process, potentially preventing the model from effectively learning from compression and selection tokens. To address this issue, we introduce a dedicated sliding window branch that explicitly handles local context, allowing other branches (compression and selection) to focus on learning their respective features without being shortcutted by local patterns. Specifically, we maintain recent tokens $\tilde{K}_t^{\text{win}} = k_{t-w:t}$, $\tilde{V}_t^{\text{win}} = v_{t-w:t}$ in a window $w$, and isolate attention computations of different information sources (compression tokens, selected tokens, and the sliding window) into separate branches. These branch outputs are then aggregated through a learned gating mechanism. To further prevent shortcut learning across attention branches with marginal computational overhead, we provide independent keys and values for three branches. This architectural design enables stable learning by preventing gradient interference between local and long-range pattern recognition, while introducing minimal overhead.

翻譯後的結果

在注意力機制中,局部模式通常適應得更快,並可能主導學習過程,從而阻礙模型有效地從壓縮和選擇詞元中學習。為了解決這個問題,我們引入了專門的滑動視窗分支,明確處理局部上下文,讓其他分支(壓縮與選擇)能專注於學習各自的特徵,而不被局部模式「走捷徑」。具體來說,我們在視窗 $w$ 內保留最近的詞元 $\tilde{K}_t^{\text{win}} = k_{t-w:t}$、$\tilde{V}_t^{\text{win}} = v_{t-w:t}$,並將不同資訊來源(壓縮詞元、被選取詞元、滑動視窗)的注意力計算隔離到各自的分支中,再透過一個可學習的門控機制聚合這些分支的輸出。為了以極小的計算開銷進一步防止注意力分支之間的捷徑學習,我們為三個分支提供獨立的鍵和值。這種架構設計透過防止局部與長距離模式識別之間的梯度干擾來實現穩定學習,同時只引入極小的額外開銷。

原文

After obtaining all three categories of keys and values ($\tilde{K}_t^{\text{cmp}}, \tilde{V}_t^{\text{cmp}}$; $\tilde{K}_t^{\text{slc}}, \tilde{V}_t^{\text{slc}}$; and $\tilde{K}_t^{\text{win}}, \tilde{V}_t^{\text{win}}$), we compute the final attention output following Equation (5). Together with the compression, selection, and sliding window mechanisms described above, this forms the complete algorithmic framework of NSA.

翻譯後的結果

在取得所有三種類別的鍵值($\tilde{K}_t^{\text{cmp}}, \tilde{V}_t^{\text{cmp}}$;$\tilde{K}_t^{\text{slc}}, \tilde{V}_t^{\text{slc}}$;以及 $\tilde{K}_t^{\text{win}}, \tilde{V}_t^{\text{win}}$)後,我們根據公式 (5) 計算最終的注意力輸出。結合上述壓縮、選擇和滑動視窗機制,便構成了 NSA 的完整演算法框架。
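
個人註解:下面是三條分支(壓縮、選擇、滑動視窗)以門控分數加權合併的極簡示意。每條分支的注意力直接以 PyTorch 的 scaled_dot_product_attention 代替,門控部分假設為「一層線性 + sigmoid」的小 MLP,皆是依論文文字所做的簡化假設,並非官方實作:

```python
import torch
import torch.nn.functional as F

def branch_attention(q, k, v):
    """q: [1, d_k],k: [m, d_k],v: [m, d_v];單一查詢對某一分支 K/V 的注意力。"""
    return F.scaled_dot_product_attention(
        q.unsqueeze(0), k.unsqueeze(0), v.unsqueeze(0)
    ).squeeze(0)

def nsa_output(q, branches, gate_mlp):
    """branches: {"cmp": (K, V), "slc": (K, V), "win": (K, V)};
    gate_mlp 由查詢特徵產生三個 [0, 1] 的門控分數(示意用)。"""
    gates = torch.sigmoid(gate_mlp(q)).squeeze(0)     # [3],對應 cmp / slc / win
    out = 0.0
    for g, (k, v) in zip(gates, branches.values()):
        out = out + g * branch_attention(q, k, v)
    return out

d_k, d_v = 192, 128
q = torch.randn(1, d_k)
branches = {
    "cmp": (torch.randn(254, d_k), torch.randn(254, d_v)),    # 壓縮後的粗粒度 K/V
    "slc": (torch.randn(1024, d_k), torch.randn(1024, d_v)),  # top-n 選出的細粒度 K/V
    "win": (torch.randn(512, d_k), torch.randn(512, d_v)),    # 最近 w=512 個詞元
}
gate_mlp = torch.nn.Linear(d_k, 3)
print(nsa_output(q, branches, gate_mlp).shape)                # torch.Size([1, 128])
```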

3.4. Kernel Design

原文

To achieve FlashAttention-level speedup during the training and prefilling, we implement hardware-aligned sparse attention kernels upon Triton. Given MHA is memory-intensive and inefficient for decoding, we focus on architectures with shared KV caches like GQA and MQA following the current state-of-the-art LLMs. While compression and sliding window attention computations are readily compatible with existing FlashAttention-2 kernels, we introduce the specialized kernel design for sparse selection attention. If we were to follow FlashAttention's strategy of loading temporally continuous query blocks into SRAM, it would result in inefficient memory access since queries within a block may require disjoint KV blocks. To address this, our key optimization lies in a different query grouping strategy: for each position on the query sequence, we load all query heads within a GQA group (they share the same sparse KV blocks) into SRAM. Figure 3 illustrates our forward pass implementation. The proposed kernel architecture is characterized by the following key features:

翻譯後的結果

為了在訓練和預填充過程中達成 FlashAttention 等級的加速效果,我們基於 Triton 實作硬體對齊的稀疏注意力核心(kernel)。由於多頭注意力(MHA)在解碼時記憶體密集且效率低下,我們跟隨目前最先進的大語言模型(LLM),聚焦於具有共享 KV 快取的架構,例如 GQA 和 MQA。雖然壓縮和滑動視窗的注意力計算可以直接沿用現有的 FlashAttention-2 核心,但我們針對稀疏選取注意力引入了專門設計的核心架構。

若依循 FlashAttention 將時間上連續的查詢區塊載入 SRAM 的策略,由於同一區塊內的查詢可能需要不相交的 KV 區塊,將導致記憶體存取效率低下。為此,我們的關鍵最佳化在於不同的查詢分組策略:對於查詢序列中的每個位置,我們將同一個 GQA 組內(它們共享相同的稀疏 KV 區塊)的所有查詢頭一起載入 SRAM。圖 3 說明了我們的前向傳播實作方式。所提出的核心架構具有以下主要特徵:

原文

    1. Group-Centric Data Loading. For each inner loop, load all heads' queries $Q \in \mathbb{R}^{[h, d_k]}$ in the group at position $t$ and their shared sparse key/value block indices $\mathcal{I}_t$.
    2. Shared KV Fetching. In the inner loop, sequentially load continuous key/value blocks indexed by $\mathcal{I}_t$ into SRAM as $K \in \mathbb{R}^{[B_k, d_k]}$, $V \in \mathbb{R}^{[B_k, d_v]}$ to minimize memory loading, where $B_k$ is the kernel block size satisfying $B_k \mid l'$.
    3. Outer Loop on Grid. Since the inner-loop length (proportional to the selected block count $n$) remains nearly identical for different query blocks, we put query/output loops in Triton's grid scheduler to simplify and optimize the kernel.

翻譯後的結果

  1. 分組中心化資料載入。每次內層迴圈,載入位置 $t$ 上該群組所有頭的查詢 $Q \in \mathbb{R}^{[h, d_k]}$,及其共享的稀疏鍵/值區塊索引 $\mathcal{I}_t$。

  2. 共享 KV 讀取。在內層迴圈中,依序將 $\mathcal{I}_t$ 所指的連續鍵/值區塊載入 SRAM 成為 $K \in \mathbb{R}^{[B_k, d_k]}$、$V \in \mathbb{R}^{[B_k, d_v]}$,以最小化記憶體載入,其中 $B_k$ 是滿足 $B_k \mid l'$ 的核心區塊大小。

  3. 網格上的外層迴圈。由於內層迴圈長度(與被選取的區塊數 $n$ 成正比)對不同查詢區塊幾乎相同,我們將查詢/輸出迴圈放進 Triton 的網格調度器中,以簡化並最佳化核心(此迴圈結構的簡化示意見下方程式碼)。
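
個人註解:以下用純 Python 的參考迴圈對照上述三個要點(外層走查詢位置、同一 GQA 群組的所有頭一起處理、內層依 $\mathcal{I}_t$ 依序讀取連續 KV 區塊)。真正的實作是 Triton kernel 並使用線上 softmax,這裡只是數值上較樸素的概念示意,變數與流程均為假設:

```python
import torch

def sparse_selection_attention_reference(Q, K, V, block_indices, B_k=64):
    """Q: [T, h, d_k] 同一 GQA 群組的 h 個查詢頭;K/V: [T, d_k] / [T, d_v] 群組共享的 KV;
    block_indices[t]: 位置 t 選中的 KV 區塊索引(假設每個位置至少選中一個區塊)。"""
    T, h, d_k = Q.shape
    outputs = torch.zeros(T, h, V.shape[-1])
    for t in range(T):                                   # 對應 Triton grid 的外層迴圈
        q = Q[t]                                         # 一次載入整組 h 個查詢頭
        acc, denom = 0.0, 0.0
        for b in block_indices[t]:                       # 內層迴圈:依序載入連續 KV 區塊
            k_blk = K[b * B_k:(b + 1) * B_k]             # [B_k, d_k],群組內所有頭共用
            v_blk = V[b * B_k:(b + 1) * B_k]             # [B_k, d_v]
            scores = torch.exp(q @ k_blk.T / d_k ** 0.5) # [h, B_k](未做數值穩定處理)
            acc = acc + scores @ v_blk                   # [h, d_v]
            denom = denom + scores.sum(dim=-1, keepdim=True)
        outputs[t] = acc / denom
    return outputs
```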

原文

This design achieves near-optimal arithmetic intensity by (1) eliminating redundant KV transfers through group-wise sharing, and (2) balancing compute workloads across GPU streaming multiprocessors.

翻譯後的結果

此設計透過 (1) 以群組共享消除冗餘的 KV 傳輸,以及 (2) 在 GPU 串流多處理器之間平衡計算工作量,達成接近最佳的算術強度。

原文

Figure 3 | Kernel design for NSA. The kernel loads queries by GQA groups (Grid Loop), fetches corresponding sparse KV blocks (Inner Loop), and performs attention computation on SRAM. Green blocks indicate data on SRAM, while blue indicates data on HBM.

翻譯後的結果

圖 3 | NSA 的核設計。核心透過 GQA 群組(網格循環)載入查詢,並從相應稀疏 KV 區塊(內部循環)中提取數據,在 SRAM 上進行注意力計算。綠色方塊表示 SRAM 中的數據,藍色則表示 HBM 中的數據。

4. Experiments

原文

We evaluate NSA through three lenses: (1) general benchmarks performance, (2) long-context benchmarks performance, and (3) chain-of-thought reasoning performance, comparing against Full Attention baseline and state-of-the-art sparse attention methods. We defer the efficiency analysis of our sparse computation paradigm to Section 5, where we provide detailed discussions on training and inference speed.

翻譯後的結果

我們從三個方面評估 NSA:(1) 通用基準測試表現,(2) 長上下文基準測試表現,以及 (3) 思維鏈推理表現,並與全注意力基線和最先進的稀疏注意力方法進行比較。我們的稀疏計算範式的效率分析留待第 5 節,屆時將詳細討論訓練和推理速度。

4.1. Pretraining Setup

原文

Following the common practice in state-of-the-art LLMs, our experiments adopt a backbone combining Grouped-Query Attention (GQA) and Mixture-of-Experts (MoE), featuring 27B total parameters with 3B active parameters. The model consists of 30 layers with a hidden dimension of 2560. For GQA, we set the number of groups to 4, with a total of 64 attention heads. For each head, the hidden dimensions of the query, key, and value are configured as $d_q = d_k = 192$ and $d_v = 128$, respectively. For MoE, we utilize the DeepSeekMoE (Dai et al., 2024; DeepSeek-AI, 2024) structure, with 72 routed experts and 2 shared experts, and set the top-k experts to 6. To ensure training stability, the MoE in the first layer is replaced by an MLP in the form of SwiGLU.

翻譯後的結果

基於最新先進的語言模型 (LLM) 的慣例,我們實驗中采用一個融合了組群式查詢注意機制 (GQA) 和專家混合機制 (MoE) 的骨幹結構,總參數為 27B,其中活躍參數為 3B。該模型由 30 層組成,隱藏維度為 2560。對於 GQA,我們將組群數量設置為 4,共有 64 個注意力頭。每個注意力頭的查詢、鍵和值隱藏維度分別設定為 dq = dk = 192 和 dv = 128。對於 MoE,我們採用 DeepSeekMoE(Dai 等人,2024;DeepSeek-AI,2024)結構,其中有 72 個路由專家和 2 個共享專家,並將 top-k 專家設定為 6。為了確保訓練穩定性,第一層的 MoE 被替換為 SwiGLU 形態的多層感知機 (MLP)。
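
個人註解:把本段文字描述的骨幹設定整理成設定檔樣式的片段,方便快速對照;欄位名稱為自行取的,僅供閱讀參考:

```python
from dataclasses import dataclass

@dataclass
class BackboneConfig:
    total_params: str = "27B"         # 總參數量(活躍參數約 3B)
    active_params: str = "3B"
    num_layers: int = 30
    hidden_dim: int = 2560
    gqa_groups: int = 4               # GQA 組數
    num_heads: int = 64               # 注意力頭總數
    d_q: int = 192                    # 查詢維度
    d_k: int = 192                    # 鍵維度
    d_v: int = 128                    # 值維度
    moe_routed_experts: int = 72      # DeepSeekMoE:72 個路由專家
    moe_shared_experts: int = 2       # 加 2 個共享專家
    moe_top_k: int = 6                # 每個 token 啟用 top-6 專家
    first_layer_ffn: str = "SwiGLU"   # 第一層以 SwiGLU MLP 取代 MoE 以穩定訓練

print(BackboneConfig())
```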

原文

Figure 4 | Pretraining loss comparison between Full Attention and our NSA on 27B-parameter model. Both models exhibit stable convergence, with NSA achieving lower loss values.

翻譯後的結果

圖 4 | 於 27B-參數模型上,全注意力與我們提出的 NSA 在預訓練損失的比較。兩者模型均表現出穩定的收斂,NSA 達到更低的損失值。

原文

Table 1 | Pretraining performance comparison between the full attention baseline and NSA on general benchmarks, across knowledge (MMLU, MMLU-PRO, CMMLU), reasoning (BBH, GSM8K, MATH, DROP), and coding (MBPP, HumanEval) tasks. NSA achieves superior average performance on most benchmarks despite high sparsity.

翻譯後的結果

表 1 | 在通用基準測試上,全注意力基線與 NSA 在知識 (MMLU、MMLU-PRO、CMMLU)、推理 (BBH、GSM8K、MATH、DROP) 和程式碼 (MBPP、HumanEval) 任務中的預訓練表現比較。儘管具有高稀疏性,NSA 在大多數基準測試上取得了更優的平均表現。

| Model | MMLU (Acc., 5-shot) | MMLU-PRO (Acc., 5-shot) | CMMLU (Acc., 5-shot) | BBH (Acc., 3-shot) | GSM8K (Acc., 8-shot) | MATH (Acc., 4-shot) | DROP (F1, 1-shot) | MBPP (Pass@1, 3-shot) | HumanEval (Pass@1, 0-shot) | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Full Attn | 0.567 | 0.279 | 0.576 | 0.497 | 0.486 | 0.263 | 0.503 | 0.482 | 0.335 | 0.443 |
| NSA | 0.565 | 0.286 | 0.587 | 0.521 | 0.520 | 0.264 | 0.545 | 0.466 | 0.348 | 0.456 |

原文

The proposed architecture achieves an effective trade-off between computation cost and model performance. For NSA, we set compression block size $l = 32$, sliding stride $d = 16$, selected block size $l' = 64$, selected block count $n = 16$ (including a fixed activation of the 1 initial block and 2 local blocks), and sliding window size $w = 512$. Both Full Attention and sparse attention models are pretrained on 270B tokens of 8k-length texts, followed by continued training and supervised fine-tuning on 32k-length texts with YaRN (Peng et al., 2024) to achieve long-context adaptation. Both models are trained to full convergence to ensure fair comparison. As shown in Figure 4, the pretraining loss curve of our NSA and Full Attention baseline demonstrates stable and smooth decline, with NSA consistently outperforming the Full Attention model.

翻譯後的結果

所提出的架構在運算成本與模型效能之間取得有效平衡。對於 NSA,我們設定壓縮區塊大小 $l = 32$、滑動步幅 $d = 16$、選擇區塊大小 $l' = 64$、選擇區塊數量 $n = 16$(包含固定啟用的 1 個起始區塊和 2 個局部區塊),以及滑動視窗大小 $w = 512$。全注意力與稀疏注意力模型均在 270B 個 token 的 8k 長度文本上進行預訓練,隨後使用 YaRN(Peng 等人,2024)在 32k 長度文本上進行持續訓練和監督式微調,以實現長上下文適應。兩個模型都訓練至完全收斂,以確保公平比較。如圖 4 所示,NSA 和全注意力基線的預訓練損失曲線均呈現穩定且平滑的下降,且 NSA 在整個訓練過程中始終優於全注意力模型。

4.2. Baselines Methods

原文

In addition to comparing with Full Attention, we evaluate several state-of-the-art inference-stage sparse attention methods: H2O (Zhang et al., 2023b), infLLM (Xiao et al., 2024), Quest (Tang et al., 2024), and Exact-Top, which first computes the full attention scores, selects the top-$n$ scoring keys for each query, and then calculates attention on these positions. These methods span diverse sparse attention paradigms, including KV-cache eviction, query-aware selection, and exact top-$n$ sparse selection.

翻譯後的結果

除了與全注意力(Full Attention)比較外,我們還評估了多種最新的推理階段稀疏注意力方法:H2O (Zhang et al., 2023b)、infLLM (Xiao et al., 2024)、Quest (Tang et al., 2024) 和 Exact-Top。Exact-Top首先計算全注意力分數,並選擇每個查詢對應的前 n 個得分鍵,然後在此處計算注意力。這些方法涵蓋了多樣化的稀疏注意力範式,包括 KV 緩存淘汰、查詢感知選取和精確前 n 個稀疏選取。

原文

Table 2 | Performance comparison between our NSA and baselines on LongBench, including subsets in single document QA, multi-document QA, synthetic and code task categories. NSA outperformed most of the baselines including Full Attention.

翻譯後的結果

表 2 | 我們的 NSA 與基線模型在 LongBench 上的性能比較,包含單文件問答、多文件問答、合成任務和程式碼任務類別中的子集。NSA 的表現優於包括全注意力在內的大多數基線。

| Model | MFQA-en (SQA) | MFQA-zh (SQA) | Qasper (SQA) | HPQ (MQA) | 2Wiki (MQA) | GovRpt (MQA) | Dur (MQA) | PassR-en (Synthetic) | PassR-zh (Synthetic) | LCC (Code) | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| H2O | 0.428 | 0.429 | 0.308 | 0.112 | 0.101 | 0.231 | 0.208 | 0.704 | 0.421 | 0.092 | 0.303 |
| InfLLM | 0.474 | 0.517 | 0.356 | 0.306 | 0.250 | 0.277 | 0.257 | 0.766 | 0.486 | 0.143 | 0.383 |
| Quest | 0.495 | 0.561 | 0.365 | 0.295 | 0.245 | 0.293 | 0.257 | 0.792 | 0.478 | 0.135 | 0.392 |
| Exact-Top | 0.502 | 0.605 | 0.397 | 0.321 | 0.288 | 0.316 | 0.291 | 0.810 | 0.548 | 0.156 | 0.423 |
| Full Attn | 0.512 | 0.623 | 0.409 | 0.350 | 0.305 | 0.324 | 0.294 | 0.830 | 0.560 | 0.163 | 0.437 |
| NSA | 0.503 | 0.624 | 0.432 | 0.437 | 0.356 | 0.307 | 0.341 | 0.905 | 0.550 | 0.232 | 0.469 |

原文

For general evaluation, where most samples have lengths within the local context window of sparse attention baselines, these methods are effectively equivalent to Full Attention. Therefore, we present only the comparison results between NSA and Full Attention baseline in this setting. In the long-context evaluation, we conduct comparisons across all baseline methods, with the sparsity of all sparse attention methods set to the same to ensure a fair comparison. For chainof-thought reasoning evaluation, which requires long-text supervised fine-tuning, we limit our comparison to Full Attention, as sparse attention baselines do not support training.

翻譯後的結果

針對一般評估,其中大多數樣本的長度處於稀疏注意力基準線的局部上下文窗口內,這些方法在效果上等同於全注意力。因此,我們僅在此設置下呈現 NSA 與全注意力基線之間的比較結果。在長文本評估中,我們對所有基線方法進行了比較,並將所有稀疏注意力方法的稀疏度設定為相同,以確保公平比較。對於鏈式思維推理評估,它需要長文本監督微調,我們僅將比較限制在全注意力上,因為稀疏注意力基線不支持訓練。

4.3. Performance Comparison

原文

General Evaluation. We evaluated the pretrained NSA and Full Attention baseline, on a comprehensive suite of benchmarks spanning knowledge, reasoning, and coding capabilities, including MMLU (Hendrycks et al., 2020), MMLU-PRO (Wang et al., 2024), CMMLU (Li et al., 2023), BBH (Suzgun et al., 2022), GSM8K (Cobbe et al., 2021), MATH (Hendrycks et al., 2020), DROP (Dua et al., 2019), MBPP (Austin et al., 2021), and HumanEval (Chen et al., 2021). The results are shown in Table 1. Despite its sparsity, NSA achieves superior overall performance, outperforming all baselines including Full Attention on 7 out of 9 metrics. This indicates that although NSA may not fully leverage its efficiency advantages on shorter sequences, it shows strong performance. Notably, NSA demonstrates significant gains in reasoning-related benchmarks (DROP: +0.042, GSM8K: +0.034), suggesting that our pretraining helps models to develop specialized attention mechanisms. This sparse attention pretraining mechanism forces model to focus on the most important information, potentially enhancing performance by filtering out noise from irrelevant attention pathways. The consistent performance across diverse evaluations also validates NSA's robustness as a general-purpose architecture.

翻譯後的結果

整體評估。我們在涵蓋知識、推理和程式碼能力的一整套基準上評估了預訓練的 NSA 和全注意力基線,包括 MMLU(Hendrycks 等人,2020)、MMLU-PRO(Wang 等人,2024)、CMMLU(Li 等人,2023)、BBH(Suzgun 等人,2022)、GSM8K(Cobbe 等人,2021)、MATH(Hendrycks 等人,2020)、DROP(Dua 等人,2019)、MBPP(Austin 等人,2021)和 HumanEval(Chen 等人,2021)。結果如表 1 所示。儘管具有稀疏性,NSA 的整體表現仍然更優,在 9 項指標中有 7 項勝過包括全注意力在內的所有基線。這表明即使 NSA 在較短序列上可能無法完全發揮其效率優勢,其性能依然強勁。值得注意的是,NSA 在推理相關基準(DROP:+0.042,GSM8K:+0.034)上展現顯著增益,顯示我們的預訓練有助於模型發展出專門的注意力機制。這種稀疏注意力預訓練機制迫使模型聚焦於最重要的資訊,有可能藉由過濾無關注意力路徑的雜訊來提升性能。在各種評估中的一致表現也驗證了 NSA 作為通用架構的穩健性。

原文

Long-Context Evaluation. As shown in Figure 5, NSA achieves perfect retrieval accuracy across all positions in 64k-context needle-in-a-haystack (Kamradt, 2023) test. This performance stems from our hierarchical sparse attention design, which combines compression tokens for efficient global context scanning, and selection tokens for precise local information retrieval. The coarse-grained compression identifies relevant context blocks at low computational cost, while the token-level attention on selected tokens ensures the preservation of critical fine-grained information. This design enables NSA to maintain both global awareness and local precision.

翻譯後的結果

長上下文評估。如圖 5 所示,NSA 在 64k 上下文的「大海撈針」(needle-in-a-haystack,Kamradt,2023)測試中,於所有位置都達到完美的檢索準確率。這種表現源自我們的分層稀疏注意力設計:結合壓縮詞元以高效掃描全局上下文,以及選擇詞元以精確檢索局部資訊。粗粒度的壓縮以低計算成本識別相關的上下文區塊,而對被選取詞元的詞元級注意力則確保保留關鍵的細粒度資訊。這種設計使 NSA 能同時維持全局感知與局部精度。

原文

We also evaluate NSA on LongBench (Bai et al., 2023) against state-of-the-art sparse attention methods and Full Attention baseline. To ensure consistent sparsity, we set the token activated by each query in all sparse attention baselines to 2560 tokens, which corresponds to the average number of tokens activated in NSA when handling 32k sequence lengths. Following StreamLLM (Xiao et al., 2023), this token budget includes the leading 128 tokens and 512 local tokens. We exclude certain subsets from LongBench due to their low scores across all models, which may not provide meaningful comparisons. As shown in Table 2, NSA achieves the highest average score 0.469, outperforming all baselines (+0.032 over Full Attention and +0.046 over Exact-Top). This improvement arises from two key innovations: (1) our native sparse attention design, which enables end-to-end optimization of sparse patterns during pretraining, facilitates synchronized adaptation between the sparse attention module and other model components; and (2) the hierarchical sparse attention mechanism achieves a balance between local and global information processing.

翻譯後的結果

我們也在 LongBench(Bai 等人,2023)上評估 NSA,並與最先進的稀疏注意力方法和全注意力基線進行比較。為確保一致的稀疏度,我們將所有稀疏注意力基線中每個查詢激活的詞元數設為 2560 個,這相當於 NSA 在處理 32k 序列長度時平均激活的詞元數。參考 StreamLLM(Xiao 等人,2023),這個詞元預算包括前 128 個詞元和 512 個局部詞元。由於某些子集在所有模型上的得分都偏低,無法提供有意義的比較,我們將其從 LongBench 中排除。如表 2 所示,NSA 取得最高平均分數 0.469,優於所有基線(比全注意力高 0.032,比 Exact-Top 高 0.046)。這種改進來自兩個關鍵創新:(1) 我們的原生稀疏注意力設計,使稀疏模式能在預訓練期間進行端到端最佳化,促進稀疏注意力模組與其他模型組件的同步適應;(2) 分層稀疏注意力機制在局部與全局資訊處理之間取得平衡。

原文

Figure 5 | Needle-in-a-Haystack retrieval accuracy across context positions with 64k context length. NSA achieves perfect accuracy through its hierarchical sparse attention design.

翻譯後的結果

圖 5 | 在 64k 上下文長度下,「大海撈針」檢索準確率隨上下文位置的變化。NSA 憑藉其分層稀疏注意力設計達到完美的準確率。

原文

Notably, NSA demonstrates exceptional performance on tasks requiring complex reasoning over long contexts, achieving +0.087 and +0.051 improvements over Full Attention on multi-hop QA tasks (HPQ and 2Wiki), exceeding the performance of baselines on code understanding (LCC: +0.069), and outperforming other methods on passage retrieval (PassR-en: +0.075). These results validate NSA's capability to handle diverse long-context challenges, with its natively pretrained sparse attention providing additional benefits in learning task-optimal patterns.

翻譯後的結果

值得注意的是,NSA 在需要對長文本進行複雜推理的任務中表現出色,在多跳問答任務(HPQ 和 2Wiki)上相比於全注意力模型高出 0.087 和 0.051 分,在程式碼理解(LCC:+0.069)方面超越基線模型,並且在段落檢索(PassR-en:+0.075)方面勝過其他方法。這些結果驗證了 NSA 處理各種長文本挑戰的能力,其原生的預訓練稀疏注意力機制為學習任務最佳模式提供了額外優勢。

原文

Chain-of-Thought Reasoning Evaluation. To evaluate NSA's compatibility with advanced downstream training paradigms, we investigate its capacity to acquire chain-of-thought mathematical reasoning abilities via post-training. Given the limited effectiveness of reinforcement learning on smaller-scale models, we employ knowledge distillation from DeepSeek-R1, conducting supervised fine-tuning (SFT) with 10B tokens of 32k-length mathematical reasoning traces. This produces two comparable models: Full Attention-R (Full Attention baseline) and NSA-R (our sparse variant). We assess both models on the challenging American Invitational Mathematics Examination (AIME 24) benchmark. We use a sampling temperature of 0.7 and a top-$p$ value of 0.95 to generate 16 responses for each question and obtain the average score. To validate the impact of reasoning depth, we conduct experiments with two generation context limits: 8k and 16k tokens, measuring whether extended reasoning chains improve accuracy. Example comparisons of model predictions are provided in Appendix A.

翻譯後的結果

思維鏈推理評估。為探討 NSA 與先進下游訓練範式的相容性,我們研究其能否透過後訓練(post-training)獲得思維鏈數學推理能力。鑑於強化學習在較小規模模型上的效果有限,我們採用來自 DeepSeek-R1 的知識蒸餾,使用 10B 個 32k 長度的數學推理軌跡 token 進行監督式微調(SFT)。這產生了兩個可比較的模型:Full Attention-R(全注意力基線)和 NSA-R(我們的稀疏變體)。我們在具挑戰性的美國數學邀請賽(AIME 24)基準上評估兩者。我們使用 0.7 的取樣溫度和 0.95 的 top-p 值,為每個問題生成 16 個回應並取平均分數。為了驗證推理深度的影響,我們以兩種生成上下文上限(8k 和 16k 個 token)進行實驗,觀察延長推理鏈是否能提高準確率。模型預測的比較範例見附錄 A。

原文

Table 3 | AIME instruction-based evaluation after supervised fine-tuning. Our NSA-R demonstrates better performance than Full Attention-R at both 8k and 16k sequence lengths.

翻譯後的結果

表 3 | 監督式微調後的 AIME 指令式評估。我們的 NSA-R 在 8k 和 16k 序列長度下的表現均優於 Full Attention-R。

| Generation Token Limit | 8192 | 16384 |
| --- | --- | --- |
| Full Attention-R | 0.046 | 0.092 |
| NSA-R | 0.121 | 0.146 |

原文

Figure 6 | Comparison of Triton-based NSA kernel with Triton-based FlashAttention-2 kernel. Our implementation significantly reduces latency across all context lengths, with the improvement becoming more pronounced as input length increases.

翻譯後的結果

圖 6 | 基於 Triton 的 NSA 核心與基於 Triton 的 FlashAttention-2 核心的比較。我們的實作在所有上下文長度上顯著降低了延遲,隨著輸入長度的增加,這種改進變得更加明顯。

原文

As shown in Table 3, NSA-R achieves significantly higher accuracy than Full Attention-R under the 8k context setting (+0.075), with this advantage persisting at 16k contexts (+0.054). These results validate two key benefits of native sparse attention: (1) The pretrained sparse attention patterns enable efficient capture of long-range logical dependencies critical for complex mathematical derivations; (2) Our architecture's hardware-aligned design maintains sufficient context density to support growing reasoning depth without catastrophic forgetting. The consistent outperformance across context lengths confirms sparse attention's viability for advanced reasoning tasks when natively integrated into the training pipeline.

翻譯後的結果

如表 3 所示,NSA-R 在 8k 上下文設定下,其準確度顯著高於 Full Attention-R (+0.075),且這種優勢在 16k 上下文設定下持續存在 (+0.054)。這些結果驗證了原生稀疏注意力的兩大關鍵優勢: (1) 預先訓練的稀疏注意力模式能有效捕捉長距離邏輯依賴性,對於複雜數學推導至關重要; (2) 架構與硬體對齊設計維持足夠的上下文密度,支持不斷增長的推理深度而不會出現災難性遺忘。在不同上下文長度下的持續優勢,證實了稀疏注意力作為訓練管道原生整合的一部分,在高級推理任務中的可行性。

5. Efficiency Analysis

原文

We evaluate the computational efficiency of NSA against Full Attention on an 8-GPU A100 system. In efficiency analysis, we also configure the model with GQA group $g = 4$, heads per group $h = 16$, query/key dimension $d_k = 192$, and value dimension $d_v = 128$. Following the same settings in Section 4, we set NSA compression block size $l = 32$, sliding stride $d = 16$, selected block size $l' = 64$, selected block count $n = 16$, and sliding window size $w = 512$.

翻譯後的結果

我們在 8 卡 A100 GPU 系統上評估 NSA 相對於全注意力的計算效率。在效率分析中,我們同樣將模型配置為 GQA 組數 $g = 4$、每組頭數 $h = 16$、查詢/鍵維度 $d_k = 192$、值維度 $d_v = 128$。依照第 4 節的設定,我們設置 NSA 壓縮區塊大小 $l = 32$、滑動步幅 $d = 16$、選擇區塊大小 $l' = 64$、選擇區塊數量 $n = 16$,以及滑動視窗大小 $w = 512$。

5.1. Training Speed

原文

We compare the Triton-based implementations of our NSA attention and Full Attention with Triton-based FlashAttention-2 to ensure fair speed comparison across the same backend. As shown in Figure 6, our NSA achieves progressively greater speedups as context length increases, up to 9.0 × forward and 6.0 × backward speedup at 64k context-length. Notably, the speed advantage becomes more pronounced with longer sequences. This speedup stems from our hardware-aligned algorithm design to maximize the efficiency of sparse attention architecture: (1) The Blockwise memory access pattern maximizes Tensor Core utilization through coalesced loads, (2) The delicate loop scheduling in the kernel eliminates redundant KV transfers.

翻譯後的結果

我們在相同的後端基於Triton的實現中,將我們的NSA注意力和全注意力與基於Triton的FlashAttention-2進行比較,以確保公平的速度比較。如圖6所示,我們的NSA隨著上下文長度的增加表現出更顯著的速度提升,最高可達64k上下文長度時前向9.0倍、反向6.0倍的速度提升。值得注意的是,這種速度優勢在較長序列中更加明顯。這種加速源於我們針對硬件對齊的算法設計,旨在最大化稀疏注意力架構的效率:(1)塊狀記憶存取模式通過合併載入最大化Tensor Core 的利用率; (2)核心中的精細循環調度消除了多餘的KV轉移。

原文

Table 4 | Memory access volume (in equivalent number of tokens) per attention operation during decoding. Due to the low arithmetic intensity and memory-bound nature of decoding, the expected speedup is approximately linear with the volume of memory access.

翻譯後的結果

表 4 | 解碼期間每次注意力操作的記憶體存取量(以等效詞元數計)。由於解碼的算術強度低且屬於記憶體受限,預期的加速比大致與記憶體存取量成線性關係。

| Context Length | 8192 | 16384 | 32768 | 65536 |
| --- | --- | --- | --- | --- |
| Full Attention | 8192 | 16384 | 32768 | 65536 |
| NSA | 2048 | 2560 | 3584 | 5632 |
| Expected Speedup | 4× | 6.4× | 9.1× | 11.6× |

5.2. Decoding Speed

原文

The decoding speed of Attention is primarily determined by the memory access bottleneck, which is closely tied to the amount of KV cache loading. In each decoding step, our NSA just needs to load at most $\lfloor (s-l)/d \rfloor$ compression tokens, $n l'$ selected tokens, and $w$ neighbor tokens, where $s$ is the cached sequence length. As shown in Table 4, our method exhibits a significant reduction in latency as the decoding length increases, achieving up to 11.6× speedup at 64k context-length. This advantage in memory access efficiency also amplifies with longer sequences.

翻譯後的結果

注意力的解碼速度主要受限於記憶體存取瓶頸,而這與 KV 快取的載入量密切相關。在每個解碼步驟中,我們的 NSA 最多只需載入 $\lfloor (s-l)/d \rfloor$ 個壓縮詞元、$n l'$ 個被選取詞元以及 $w$ 個鄰近詞元,其中 $s$ 為已快取的序列長度。如表 4 所示,隨著解碼長度增加,我們的方法展現出顯著的延遲降低,在 64k 上下文長度下可達最高 11.6 倍的加速。這種記憶體存取效率的優勢也會隨序列變長而放大。
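
個人註解:用一小段 Python 依本節的公式驗算表 4(壓縮詞元約 ⌊(s−l)/d⌋、選擇詞元 n·l'、視窗詞元 w,參數沿用第 4 節設定)。算出的數值與表 4 只差在取整方式,預期加速比即為全注意力需載入的 s 個詞元除以 NSA 需載入的詞元數:

```python
l, d, l_sel, n, w = 32, 16, 64, 16, 512   # 壓縮區塊、步幅、選擇區塊、top-n、視窗(論文設定)

def nsa_tokens_loaded(s):
    """解碼一步時 NSA 需載入的 KV 詞元數(近似):壓縮 + 選擇 + 滑動視窗。"""
    return (s - l) // d + n * l_sel + w

for s in (8192, 16384, 32768, 65536):
    nsa = nsa_tokens_loaded(s)
    print(f"s={s:6d}  Full={s:6d}  NSA≈{nsa:5d}  speedup≈{s / nsa:.1f}x")
# 輸出約為 2046 / 2558 / 3582 / 5630 個詞元,加速比約 4.0x / 6.4x / 9.1x / 11.6x,與表 4 相近
```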

6. Discussion

原文

In this section, we reflect on the development process of NSA and discuss key insights gained from our exploration of different sparse attention strategies. While our approach demonstrates promising results, understanding the challenges encountered with alternative strategies and analyzing attention patterns provides valuable context for future research directions. We first examine challenges with alternative token selection strategies that motivated our design choices, followed by visualizations that offer insights into attention distribution patterns.

翻譯後的結果

在本節中,我們回顧 NSA 的開發過程,並討論從探索各種稀疏注意力策略中獲得的重要見解。雖然我們的方法展現出令人期待的結果,但理解替代策略所遭遇的挑戰並分析注意力模式,能為未來的研究方向提供寶貴的脈絡。我們首先檢視促成我們設計決策的替代詞元選擇策略所面臨的挑戰,接著以可視化結果提供對注意力分佈模式的洞察。

6.1. Challenges with Alternative Token Selection Strategies

原文

Before designing NSA, we explored adapting existing sparse attention methods to the training stage. However, these attempts encountered various challenges, prompting us to design a different sparse attention architecture:

翻譯後的結果

在設計 NSA 之前,我們曾嘗試將現有的稀疏注意力方法應用於訓練階段,但這些嘗試遇到了各種挑戰,促使我們設計一個不同的稀疏注意力架構:

原文

Key-Clustering Based Strategies. We examined clustering-based strategies like ClusterKV (Liu et al., 2024). These methods store Keys and Values from the same cluster in contiguous memory regions. While theoretically feasible for training and inference, they face three significant challenges: (1) Non-trivial computational overhead introduced by dynamic clustering mechanisms; (2) Operator optimization difficulties exacerbated by inter-cluster imbalances, especially in Mixture-of-Experts (MoE) systems, where skewed Expert Parallelism (EP) group execution times lead to persistent load imbalances; (3) Implementation constraints arising from the need for mandatory periodic reclustering and chunk-sequential training protocols. These combined factors create substantial bottlenecks, significantly limiting their effectiveness for real-world deployment.

翻譯後的結果

基於鍵聚類的策略。我們研究了像 ClusterKV(Liu 等人,2024)這類基於聚類的方法。這些方法將同一個聚類中的鍵和值儲存在連續的記憶體區域中。儘管在訓練和推理上理論上可行,它們面臨三個重大挑戰:(1) 動態聚類機制引入不可忽視的計算開銷;(2) 聚類之間的不平衡加劇了運算子最佳化的困難,尤其是在混合專家(MoE)系統中,偏斜的專家並行(EP)組執行時間會導致持續的負載不平衡;(3) 必須定期重新聚類以及分塊順序(chunk-sequential)訓練協議所帶來的實作限制。這些因素加總造成顯著瓶頸,大幅限制了它們在實際部署中的效果。

原文

Figure 8 | Visualization of Attention Map on a Full Attention transformer. Lightcolored regions indicate higher attention values. As shown in the figure, attention scores exhibit blockwise clustering distribution.

翻譯後的結果

圖 8 | 於全注意力變換器上的注意力圖像可視化。淺色區域表示更高的注意力值。如圖所示,注意力得分呈現塊狀聚類分布。

原文

Figure 7 | Comparison of training loss on a 3B-parameter model with Full Attention and different token selection strategies. Our NSA achieves better performance.

翻譯後的結果

圖 7 | 在 3B 參數模型上比較全注意力與不同詞元選擇策略的訓練損失。我們的 NSA 達到更好的效能。

原文

Other Blockwise Selection Strategies. We also considered blockwise key, value selection strategies different from NSA, such as Quest (Tang et al., 2024) and InfLLM (Xiao et al., 2024). These methods rely on computing an importance score for each block and selecting the top-$n$ blocks based on their similarity with $q_t$. However, existing methods face two critical issues: (1) Since the selection operation is non-differentiable, importance score computation based on neural networks relies on auxiliary loss, which increases operator overhead and often degrades model performance; (2) Heuristic parameter-free importance score computation strategies suffer from low recall rates, leading to suboptimal performance. We evaluate both approaches on a 3B-parameter model with similar architecture and compare their loss curve with NSA and Full Attention. For the auxiliary loss-based selection method, we introduce additional queries and representative keys for each block to estimate the block importance scores. These scores are supervised by the mean attention scores between the original queries and keys within each block. For the heuristic parameter-free selection method, following the strategy of Quest, we implement direct selection using the product between queries and coordinate-wise min-max of the key chunks, without introducing additional parameters. We also explore a cold-start training approach where Full Attention is applied for the initial 1000 steps before transitioning to the heuristic blockwise selection. As shown in Figure 7, both methods exhibited inferior loss.

翻譯後的結果

其他分塊選擇策略。我們也考慮了與 NSA 不同的分塊式鍵值選擇策略,例如 Quest(Tang 等人,2024)和 InfLLM(Xiao 等人,2024)。這些方法依賴為每個區塊計算重要性分數,並根據其與查詢 $q_t$ 的相似度選取前 $n$ 個區塊。然而,現有方法面臨兩個關鍵問題:(1) 由於選擇操作不可微分,基於神經網路的重要性分數計算需要依賴輔助損失,這會增加運算子開銷且常常降低模型表現;(2) 無參數的啟發式重要性分數計算策略召回率偏低,導致次佳的性能。我們在一個架構相似的 3B 參數模型上評估了這兩種方法,並將其損失曲線與 NSA 和全注意力進行比較。對於基於輔助損失的選擇方法,我們為每個區塊引入額外的查詢和代表性鍵來估計區塊重要性分數,並以原始查詢與該區塊內鍵之間的平均注意力分數作為監督訊號。對於無參數的啟發式選擇方法,我們遵循 Quest 的策略,直接以查詢與鍵區塊逐維度 min-max 的乘積進行選擇,不引入額外參數。我們還探索了一種冷啟動訓練方式:先以全注意力訓練最初 1000 步,再轉換為啟發式分塊選擇。如圖 7 所示,這兩種方法的損失都較差。
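
個人註解:下面用幾行 Python 示意文中提到的 Quest 式「無參數啟發式」區塊分數:以每個鍵區塊逐維度的 min / max 作為區塊代表,與查詢相乘後逐維取較大者再加總,近似該區塊內積的上界(概念上的簡化示意,非原實作):

```python
import torch

def heuristic_block_scores(q, keys, block_size=64):
    """q: [d_k];keys: [t, d_k]。回傳每個鍵區塊的啟發式重要性分數。"""
    t, d_k = keys.shape
    num_blocks = t // block_size
    blocks = keys[: num_blocks * block_size].view(num_blocks, block_size, d_k)
    k_min = blocks.min(dim=1).values               # [num_blocks, d_k] 逐維度最小值
    k_max = blocks.max(dim=1).values               # [num_blocks, d_k] 逐維度最大值
    # 對每一維取 q*min 與 q*max 中較大者,加總即為該區塊分數(內積上界的近似)
    return torch.maximum(q * k_min, q * k_max).sum(dim=-1)

q = torch.randn(192)
keys = torch.randn(4096, 192)
scores = heuristic_block_scores(q, keys)
topn = torch.topk(scores, k=16).indices            # 取分數最高的 16 個區塊
print(scores.shape, topn.shape)                    # torch.Size([64]) torch.Size([16])
```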

6.2. Visualization

原文

To explore potential patterns in transformer attention distributions and seek inspiration for our design, we visualize the attention map from our pretrained 27B Full Attention model in Figure 8. The visualization reveals interesting patterns where attention scores tend to exhibit blockwise clustering characteristics, with nearby keys often showing similar attention scores. This observation inspired our design of NSA, suggesting that selecting key blocks based on spatial continuity might be a promising approach. The blockwise clustering phenomenon indicates that tokens adjacent in the sequence may share certain semantic relationships with query tokens, though the exact nature of these relationships requires further investigation. This observation motivated us to explore a sparse attention mechanism that operates on continuous token blocks rather than individual tokens, aiming to enhance computational efficiency and preserve high-attention patterns.

翻譯後的結果

為了探索 Transformer 注意力分配中的潛在模式,並為我們的設計尋找靈感,我們在圖 8 中可視化了預先訓練的 27B 全注意力模型的注意力圖。這種可視化揭示出有趣模式,其中注意力得分傾向於呈現塊狀聚類特徵,相鄰關鍵詞通常呈現高度相似度的注意力得分。這個觀察啟發了我們的 NSA 設計,表明根據空間連續性選擇關鍵塊可能是個很有前途的方法。塊狀聚類現象表明,序列中相鄰的詞彙可能與查詢詞彙共享某些語義關聯,儘管這些關係的確切性質需要進一步研究。這個觀察點燃了我們探索一種稀疏注意力機制的想法,這種機制作用於連續的詞彙塊而不是單個詞彙,旨在提高計算效率並保留高注意力模式。

7. Related Works

原文

We review existing approaches that improve the efficiency of attention computation through sparse attention. These methods can be broadly categorized into three groups based on their core strategies: (1) fixed sparse pattern, (2) dynamic token pruning, and (3) query-aware selection. We introduce several representative works from each category.

翻譯後的結果

我們回顧現有的提高注意力計算效率的稀疏注意力方法。這些方法可根據其核心策略大致分為三類:(1) 固定稀疏模式,(2) 動態 token 剪枝,和 (3) 查詢感知選擇。我們從每種類別中介紹了一些代表性的工作。

7.1. Fixed Sparse Pattern

原文

Sliding Window is a commonly used approach that allows the query to compute attention only within a fixed window. StreamingLLM (Xiao et al., 2023) addresses the challenges of processing long text streams by maintaining two critical portions of the context: an attention sink (early tokens) and a local context window. While these approaches effectively reduce memory and computation costs, their rigid pattern of ignoring contexts limits their performance on tasks requiring full context understanding.

翻譯後的結果

滑動視窗是一種常用的方法,允許查詢僅在固定視窗內計算注意力。StreamingLLM(Xiao 等人,2023)透過維護上下文的兩個關鍵部分來應對處理長文本串流的挑戰:注意力匯聚點(attention sink,即前段詞元)和局部上下文視窗。雖然這些方法有效降低了記憶體和計算成本,但其僵硬的忽略上下文模式限制了它們在需要完整理解上下文的任務上的性能。

7.2. Dynamic Token Pruning

原文

H2O (Zhang et al., 2023b) implements an adaptive approach to reduce KV-cache memory usage during decoding. This method dynamically evicts tokens deemed less important for future predictions based on their recent utility according to attention score. SnapKV (Li et al., 2024) also introduces a token pruning strategy that reduces the KV cache by selectively retaining only the most crucial features, enabling efficient memory usage. SnapKV identifies important features through attention weight analysis and voting during prefilling, then updates KV cache by combining selected compressed features with recent context to maintain prompt consistency.

翻譯後的結果

H2O(Zhang 等人,2023b)實作了一種自適應方法來減少解碼過程中的 KV 快取記憶體使用量。此方法根據注意力分數評估詞元近期的效用,動態淘汰被認為對未來預測較不重要的詞元。SnapKV(Li 等人,2024)也提出了一種詞元剪枝策略,透過選擇性地只保留最關鍵的特徵來縮減 KV 快取,實現高效的記憶體使用。SnapKV 在預填充期間透過注意力權重分析與投票來識別重要特徵,然後將選出的壓縮特徵與近期上下文結合來更新 KV 快取,以維持與提示的一致性。

7.3. Query-Aware Selection

原文

Quest (Tang et al., 2024) employs a blockwise selection strategy where each chunk's importance is estimated by the product between the query and the coordinate-wise min-max of the key chunks. The resulting scores help to select the top-$n$ important key-value chunks for attention. InfLLM (Xiao et al., 2024) combines fixed patterns with retrieval by maintaining attention sinks, local context, and retrievable chunks. This method selects representative keys from each chunk to estimate chunk importance. HashAttention (Desai et al., 2024) formulates pivotal token identification as a recommendation problem by mapping queries and keys to Hamming space using learned functions. ClusterKV (Liu et al., 2024) achieves sparsity by firstly clustering keys and then selecting the most relevant clusters for attention computation based on query-cluster similarity.

翻譯後的結果

Quest(Tang 等人,2024)採用分塊選擇策略,每個區塊的重要性由查詢與鍵區塊逐維度 min-max 的乘積估算,所得分數用於選取前 $n$ 個最重要的鍵值區塊進行注意力計算。InfLLM(Xiao 等人,2024)透過維護注意力匯聚點(attention sink)、局部上下文和可檢索區塊,將固定模式與檢索相結合,並從每個區塊中選取代表性的鍵來估算區塊重要性。HashAttention(Desai 等人,2024)使用可學習的函數將查詢和鍵映射到 Hamming 空間,把關鍵詞元的識別表述為一個推薦問題。ClusterKV(Liu 等人,2024)則先對鍵進行聚類,再基於查詢與聚類的相似度選擇最相關的聚類進行注意力計算,從而實現稀疏性。

8. Conclusion

原文

We present NSA, a hardware-aligned sparse attention architecture for efficient long-context modeling. By integrating hierarchical token compression with blockwise token selection within a trainable architecture, our architecture achieves accelerated training and inference while maintaining Full Attention performance. NSA advances the state-of-the-art by demonstrating general benchmark performance matches full-attention baselines, exceeding modeling capability in long-context evaluations, and enhanced reasoning ability, all accompanied by measurable reductions in computational latency and achieving significant speedup.

翻譯後的結果

我們提出 NSA,一種面向高效長上下文建模、與硬體對齊的稀疏注意力架構。透過在可訓練的架構中整合分層詞元壓縮與分塊詞元選擇,我們的架構在保持全注意力等級性能的同時,實現了訓練與推理的加速。NSA 推進了當前最先進水準:其通用基準表現與全注意力基線相當,在長上下文評估中展現更強的建模能力,並提升了推理能力,同時帶來可量測的計算延遲降低與顯著加速。