# Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention
[TOC]
## 說明
排版的順序為先原文,再繁體中文,並且圖片與表格都會出現在第一次出現的段落下面
1. 原文
2. 繁體中文
3. 照片或表格
:::warning
1. 個人註解,任何的翻譯不通暢部份都請留言指導
2. 為了加速閱讀,直接採用自建的反思翻譯(Phi4-14B模型所翻譯)的結果呈現,然後快速看過,語意對了頭過身就過,多多包涵
3. 這篇論文測試利用docling將論文從pdf取出轉成markdown格式,再利用正則式取片段至dify建置的反思翻譯api取得譯文再做優化調整
4. 解釋的部份也是直接用Phi4-14B模型來針對片段做理解說明
5. 機器說明的部份我會把一些不必要的冗文刪除,這可能可以從提示詞中再來做回應的優化
:::
:::danger
* [paper hyperlink](https://arxiv.org/abs/2502.11089)
:::
## Abstract
Long-context modeling is crucial for next-generation language models, yet the high computational cost of standard attention mechanisms poses significant computational challenges. Sparse attention offers a promising direction for improving efficiency while maintaining model capabilities. We present NSA, a Natively trainable Sparse Attention mechanism that integrates algorithmic innovations with hardware-aligned optimizations to achieve efficient long-context modeling. NSA employs a dynamic hierarchical sparse strategy, combining coarse-grained token compression with fine-grained token selection to preserve both global context awareness and local precision. Our approach advances sparse attention design with two key innovations: (1) We achieve substantial speedups through arithmetic intensity-balanced algorithm design, with implementation optimizations for modern hardware. (2) We enable end-to-end training, reducing pretraining computation without sacrificing model performance. As shown in Figure 1, experiments show the model pretrained with NSA maintains or exceeds Full Attention models across general benchmarks, long-context tasks, and instruction-based reasoning. Meanwhile, NSA achieves substantial speedups over Full Attention on 64k-length sequences across decoding, forward propagation, and backward propagation, validating its efficiency throughout the model lifecycle.
長上下文建模對於下一代語言模型來說是非常重要的,然而標準注意力機制的高計算成本給效率帶來巨大挑戰。稀疏注意力(Sparse attention)提供了一個有希望的方向,可以在保持模型能力的同時提高效率。我們提出 NSA(原生訓練稀疏注意力)機制 (Natively trainable Sparse Attention mechanism),它將演算法創新與硬體最佳化相結合,實現高效的長上下文建模。NSA 採用動態分層稀疏策略 (dynamic hierarchical sparse strategy),將 coarse-grained token 壓縮與 fine-grained token 選擇相結合,以保持全域上下文感知力和局部精度。我們的方法通過兩個關鍵創新的作法推動稀疏注意力的設計:(1)我們通過算術強度平衡演算法的設計來實現顯著的加速,並對現代硬體進行最佳化。(2)我們允許端到端 (end-to-end) 的訓練,在不犧牲模型效能的情況下減少預訓練的計算。如Figure 1 所示,實驗結果表明,使用 NSA 預訓練的模型在通用的基准測試、長上下文任務和指令型推理等方面表現至少與 Full Attention models 相等或超越它。同時,NSA 在 64k 長序列的解碼、前向傳播和反向傳播方面都比 Full Attention models 實現了顯著加速,驗證了其於整個模型生命週期中的效率。

Figure 1 | Comparison of performance and efficiency between Full Attention model and our NSA. Left: Despite being sparse, NSA surpasses Full Attention baseline on average across general benchmarks, long-context tasks, and reasoning evaluation. Right: For 64k-length sequence processing, NSA achieves substantial computational speedup compared to Full Attention in all stages: decoding, forward propagation, and backward propagation.
Figure 1 | Full Attention models 與我們提出的 NSA 在準確性和效率方面的比較。左側:儘管稀疏,但 NSA 在一般基准測試、長文本任務和推理評估中均超越了 Full Attention models 的平均水平。右側:對於長度為 64k 的序列處理,NSA 在解碼、前向傳播和反向傳播的所有階段都比 Full Attention models 實現了顯著的計算效率的提升。
## 1. Introduction
The research community increasingly recognizes long-context modeling as a crucial capability for next-generation large language models, driven by diverse real-world applications ranging from in-depth reasoning (DeepSeek-AI, 2025; Zelikman et al., 2022) and repository-level code generation (Zhang et al., 2023a; Zhang et al.) to multi-turn autonomous agent systems (Park et al., 2023). Recent breakthroughs, including OpenAI's o-series models, DeepSeek-R1 (DeepSeek-AI, 2025), and Gemini 1.5 Pro (Google et al., 2024), enable models to process entire codebases and lengthy documents, maintain coherent multi-turn conversations over thousands of tokens, and perform complex reasoning across long-range dependencies. However, the high complexity (Zaheer et al., 2020) of vanilla Attention (Vaswani et al., 2017) mechanisms emerges as a critical latency bottleneck as sequence length increases. Theoretical estimates indicate that attention computation with softmax architectures accounts for 70-80% of total latency when decoding 64k-length contexts, underscoring the urgent need for more efficient attention mechanisms.
研究社群日益認識到長上下文處理能力是下一代大型語言模型(LLM)的核心,這受到各種實踐應用推動,涵蓋深度推理(DeepSeek-AI, 2025; Zelikman et al., 2022)、程式碼庫層級的程式碼生成(Zhang et al., 2023a; Zhang et al.)以及多回合自主代理系統(Park et al., 2023)。近期的技術突破,例如 OpenAI 的 o 系列模型、DeepSeek-R1(DeepSeek-AI,2025)和 Gemini 1.5 Pro(Google et al., 2024),使模型能夠處理完整的程式碼庫、長篇文件,並在數千個 token 長度的情況下維持連貫的多回合對話,同時在長期依賴性中執行複雜的推理。然而,原始注意力機制(Vaswani et al., 2017)的複雜度隨著序列長度的增加(Zaheer et al., 2020),而成為關鍵的延遲瓶頸。理論估計顯示,在解碼 64k 長度的上下文時,softmax 架構的注意力計算佔據了總處理時間的 70-80%,這更加突顯出了開發更有效率的注意力機制的重要性。
A natural approach to efficient long-context modeling is to take advantage of the inherent sparsity of softmax attention (Ge et al., 2023; Jiang et al., 2023), where selectively computing critical query-key pairs can significantly reduce computational overhead while preserving performance. Recent advances demonstrate this potential through diverse strategies: KV-cache eviction methods (Li et al., 2024; Zhang et al., 2023b; Zhou et al., 2024), blockwise KV-cache selection methods (Tang et al., 2024; Xiao et al., 2024), and sampling, clustering or hashing-based selection methods (Chen et al., 2024; Desai et al., 2024; Liu et al., 2024). Despite these promising strategies, existing sparse attention methods often fall short in practical deployments. Many approaches fail to achieve speedups comparable to their theoretical gains; also, most methods mainly focus on inference stage, lacking effective training-time support to fully exploit the sparsity patterns of attention.
一種有效處理長文本建模的自然方法,就是充分利用 softmax 注意力的內在稀疏性(inherent sparsity)(Ge et al., 2023; Jiang et al., 2023),透過選擇性計算關鍵的查詢與鍵(query-key)組合,可以顯著降低運算負擔,同時維持模型效能。近期的發展已經透過多種策略來證明這種潛力:KV-cache eviction methods (Li et al., 2024; Zhang et al., 2023b; Zhou et al., 2024)、blockwise KV-cache selection methods (Tang et al., 2024; Xiao et al., 2024)以及基於採樣、聚類或雜湊的注意力選取策略(Chen et al., 2024; Desai et al., 2024; Liu et al., 2024)。儘管這些方法具備潛力,但現有的稀疏注意力方法在實際部署中常常無法達到預期的效果。許多方法未能實現與理論收益相媲美的加速;此外,大部分方法主要集中在推理階段,缺乏訓練時期的有效支持,難以充分發揮注意力稀疏性的優勢。
To address these limitations, the deployment of effective sparse attention must tackle two key challenges: (1) Hardware-aligned inference speedup : Converting theoretical computation reductions into actual speed improvements requires hardware-friendly algorithm design during both prefilling and decoding stages to mitigate memory access and hardware scheduling bottlenecks; (2) Training-aware algorithm design : Enabling end-to-end computation with trainable operators to reduce training costs while maintaining model performance. These requirements are crucial for real-world applications to achieve fast long-context inference or training. When considering both aspects, existing methods still exhibit a noticeable gap.
為了克服這些限制,有效稀疏注意力機制的部署必須解決兩個關鍵挑戰:(1) 硬體對齊的推理加速:要將理論上的運算減少真正轉化為實際速度提升,必須在預填充與解碼階段設計符合硬體友善的演算法,以緩解記憶體存取與硬體排程的瓶頸;(2) 訓練感知的演算法設計:啟用可訓練操作以進行端到端計算,同時降低訓練成本,並同時維持模型效能。 這些要求對於實踐應用來說至關重要,才能實現快速長上下文的推理或訓練。當考慮兩個方面時,現有的方法仍存在明顯的差距。
To achieve more effective and efficient sparse attention, we present NSA, a Natively trainable Sparse Attention architecture that integrates hierarchical token modeling. As shown in Figure 2, NSA reduces per-query computation by organizing keys and values into temporal blocks and processing them through three attention paths: compressed coarse-grained tokens, selectively retained fine-grained tokens, and sliding windows for local contextual information. Then we implement specialized kernels to maximize its practical efficiency. NSA introduces two core innovations corresponding to the key requirements above: (1) Hardware-aligned system: Optimize blockwise sparse attention for Tensor Core utilization and memory access, ensuring balanced arithmetic intensity. (2) Training-aware design: Enable stable end-to-end training through efficient algorithms and backward operators. This optimization enables NSA to support both efficient deployment and end-to-end training.
為了實現更有效率和高效的稀疏注意力,我們提出了一個名為 NSA 的新模型,其整合了分層 token 模型。如Figure 2 所示,NSA 通過將 Key 與 Value 組織成時間區塊(temporal blocks)並通過三個注意力路徑進行處理來減少每個查詢的計算量:coarse-grained token、選擇性保留 fine-grained token 和滑動窗口以獲取局部上下文信息。然後我們實現專用的 kernels以最大化提高其實際效率。NSA 引入兩個核心創新,滿足上述關鍵需求:(1) 硬體對齊的系統設計:針對 Tensor Core 的使用率與記憶體存取進行區塊式稀疏注意力的優化,確保計算密度的平衡。(2) 訓練感知設計:透過高效的演算法和反向運算器實現穩定的端到端訓練。這種最佳化使 NSA 可以支持高效部署和端到端的訓練。

Figure 2 | Overview of NSA's architecture. Left: The framework processes input sequences through three parallel attention branches: For a given query, preceding keys and values are processed into compressed attention for coarse-grained patterns, selected attention for important token blocks, and sliding attention for local context. Right: Visualization of different attention patterns produced by each branch. Green areas indicate regions where attention scores need to be computed, while white areas represent regions that can be skipped.
Figure 2 | NSA 架構概覽。左側:框架以三個並行注意力分支處理輸入序列:對於給定查詢,前序鍵值會被處理成壓縮注意力擷取 coarse-grained 模式、選擇性注意力關注重要詞元區塊,以及滑動注意力捕捉局部上下文。右側:各分支產生的不同注意力模式可視化圖。綠色區域表示需要計算注意力分數的區域,白色區域代表可以跳過的區域。
We evaluate NSA through comprehensive experiments on real-world language corpora. Pretraining on a 27B-parameter transformer backbone with 260B tokens, we assess NSA's performance across general language evaluations, long-context evaluations, and chain-of-thought reasoning evaluation. We further compare the kernel speed on A100 GPUs with optimized Triton (Tillet et al., 2019) implementations. Experimental results demonstrate that NSA achieves comparable or superior performance to full attention baseline, while outperforming existing sparse attention approaches. Additionally, NSA delivers substantial speedups across decoding, forward, and backward stages compared to Full Attention, with the speedup ratio increasing for longer sequences. These results validate that our hierarchical sparse attention design effectively balances model capability and computational efficiency.
我們在真實世界語言語料庫上全方位實驗評估 NSA 的效能。在使用 27B 個參數的 Transformer 主幹模型(訓練資料量為 260B 個 tokens)進行預訓練後,我們在通用語言任務、長上下文任務及鏈式思維推理任務等方面評估了 NSA 的表現。我們進一步比較在 A100 GPU 上與最佳化後的 Triton(Tillet et al., 2019)實作之間的效能差異。實驗結果顯示,NSA 在效能上可與全注意力基線相媲美甚至超越,同時勝過現有的稀疏注意力方法。此外,與 Full Attention models 相比,NSA 在解碼、前向傳播和反向傳播階段均取得顯著的加速效果,序列愈長,其加速比愈高。這些結果驗證了我們的分層稀疏注意力設計有效地平衡了模型效能與計算效率。
## 2. Rethinking Sparse Attention Methods
Modern sparse attention methods have made significant strides in reducing the theoretical computational complexity of transformer models. However, most approaches predominantly apply sparsity during inference while retaining a pretrained Full Attention backbone, potentially introducing architectural bias that limits their ability to fully exploit sparse attention's advantages. Before introducing our native sparse architecture, we systematically analyze these limitations through two critical lenses.
現代稀疏注意力機制在降低 transformer models 的理論計算複雜度方面取得了重大的進展。然而,多數方法主要是在推理階段應用稀疏性,同時保留預訓練的全注意力主幹架構,這可能會引入架構上的偏差,限制其充分利用稀疏注意力的優勢。在介紹我們的原生稀疏架構之前,我們將從兩個關鍵視角系統性地分析這些侷限性。
## 2.1. The Illusion of Efficient Inference
Despite achieving sparsity in attention computation, many methods fail to achieve corresponding reductions in inference latency, primarily due to two challenges:
儘管有不少方法可以在注意力計算中實現稀疏性,但由於下面兩個挑戰造成無法有效減少推理延遲:
**Phase-Restricted Sparsity.** Methods such as H2O (Zhang et al., 2023b) apply sparsity during autoregressive decoding while requiring computationally intensive pre-processing (e.g. attention map calculation, index building) during prefilling. In contrast, approaches like MInference (Jiang et al., 2024) focus solely on prefilling sparsity. These methods fail to achieve acceleration across all inference stages, as at least one phase retains computational costs comparable to Full Attention. This phase specialization limits the speedup these methods can deliver in prefilling-dominated workloads such as book summarization and code completion, or in decoding-dominated workloads such as long chain-of-thought (Wei et al., 2022) reasoning.
**Phase-Restricted Sparsity.** 諸如 H2O (Zhang et al., 2023b) 等方法在自回歸解碼期間應用稀疏性,但在預填充階段仍需執行運算密集的前處理操作(如注意力映射計算與索引構建)。相比之下,MInference(Jiang et al., 2024)等方法僅針對預填充階段實施稀疏性。這類方法無法涵蓋推理過程中的所有階段,導致至少有一個階段的計算成本與完整注意力機制相當。由於稀疏性侷限於單一階段,這些方法在以預填充為主(如書籍摘要、程式碼補全)或以解碼為主(如長鏈式思維(Wei et al., 2022))的工作負載中,其加速效益將受到明顯限制。
**Incompatibility with Advanced Attention Architecture.** Some sparse attention methods fail to adapt to modern decoding-efficient architectures like Multi-Query Attention (MQA) (Shazeer, 2019) and Grouped-Query Attention (GQA) (Ainslie et al., 2023), which significantly reduce the memory access bottleneck during decoding by sharing KV across multiple query heads. For instance, in approaches like Quest (Tang et al., 2024), each attention head independently selects its KV-cache subset. Although it demonstrates consistent computation sparsity and memory access sparsity in Multi-Head Attention (MHA) models, it presents a different scenario in models based on architectures like GQA, where the memory access volume of KV-cache corresponds to the union of selections from all query heads within the same GQA group. This architectural characteristic means that while these methods can reduce computation operations, the required KV-cache memory access remains relatively high. This limitation forces a critical choice: while some sparse attention methods reduce computation, their scattered memory access pattern conflicts with efficient memory access design from advanced architectures.
**Incompatibility with Advanced Attention Architecture.** 部分稀疏注意力方法難以相容現代高效解碼的架構,例如 Multi-Query Attention (MQA) (Shazeer, 2019) 和 Grouped-Query Attention (GQA) (Ainslie et al., 2023),這些架構通過在多個 query head 間共享 KV 的方式來顯著減少解碼過程中的記憶體存取瓶頸。例如,Quest (Tang et al., 2024) 這樣的演算法,每個注意力頭都獨立地選擇其 KV-cache subset。儘管它在多頭注意力 (MHA) 模型中展現一致的計算稀疏性和記憶體存取稀疏性,但在 GQA 等架構基礎的模型中表現不同,因為 KV-cache 的記憶體存取量等於同一組內所有 query head 選擇的併集。這種架構特性意味著,儘管這些方法可以減少運算量,但所需的 KV-cache 記憶體存取仍相對較高。這個限制迫使我們做出重要抉擇:雖然一些稀疏注意力方法可以減小計算成本,但其分散的記憶體存取模式與先進架構高效的記憶體存取設計相衝突。
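To make the union effect above concrete, here is a toy sketch (with assumed numbers for heads per group, blocks selected per head, and total candidate blocks; none of these figures come from the paper) that counts how many distinct KV blocks a GQA group must load when each head selects blocks independently versus when the selection is shared across the group.

```python
import numpy as np

# Assumed toy numbers: 16 query heads per GQA group, each selecting 16 blocks
# out of 1024 candidate KV blocks. None of these values come from the paper.
rng = np.random.default_rng(0)
H, n, num_blocks = 16, 16, 1024

per_head = [set(rng.choice(num_blocks, size=n, replace=False)) for _ in range(H)]
union = set().union(*per_head)      # blocks the whole group must actually load

print(f"independent selection loads ~{len(union)} blocks (upper bound {H * n})")
print(f"shared selection loads exactly {n} blocks")
```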
These limitations arise because many existing sparse attention methods focus on KV-cache reduction or theoretical computation reduction, but struggle to achieve significant latency reduction in advanced frameworks or backends. This motivates us to develop algorithms that combine both advanced architectural and hardware-efficient implementation to fully leverage sparsity for improving model efficiency.
這些限制源於許多現有的稀疏注意力方法專注於 KV-cache 的減少或理論計算的減少,但在先進框架或後端卻是難以實現顯著的延遲降低。這促使我們開發結合先進架構與硬體高效實作的演算法,以充分利用稀疏性來提升模型效率。
## 2.2. The Myth of Trainable Sparsity
Our pursuit of native trainable sparse attention is motivated by two key insights from analyzing inference-only approaches: (1) Performance Degradation : Applying sparsity post-hoc forces models to deviate from their pretrained optimization trajectory. As demonstrated by Chen et al. (2024), top 20% attention can only cover 70% of the total attention scores, rendering structures like retrieval heads in pretrained models vulnerable to pruning during inference. (2) Training Efficiency Demands : Efficient handling of long-sequence training is crucial for modern LLM development. This includes both pretraining on longer documents to enhance model capacity, and subsequent adaptation phases such as long-context fine-tuning and reinforcement learning. However, existing sparse attention methods primarily target inference, leaving the computational challenges in training largely unaddressed. This limitation hinders the development of more capable long-context models through efficient training. Additionally, efforts to adapt existing sparse attention for training also expose challenges:
我們追求原生可訓練稀疏注意力的目標源於分析推理型方法時獲得的兩個關鍵洞察: (1) 效能退化:事後應用稀疏性迫使模型偏離其預先訓練的最佳化軌跡。如同陳等人的研究(2024)所示,排名前 20% 的注意力僅能覆蓋總注意力分數的 70%,這使得預訓練模型中的檢索頭結構在推理階段易受剪枝影響。(2)訓練效率需求:高效處理長序列訓練對於現代大型語言模型 (LLM) 的發展至關重要。這包括在更長的文獻上進行預訓練以增強模型能力,以及後續的調適階段,例如長上下文微調和強化學習。然而,現有的稀疏注意力方法主要針對推理,並沒有有效解決訓練過程中的計算挑戰。這個限制阻礙了通過高效訓練開發更具能力的長上下文模型。此外,將現有稀疏注意力應用於訓練也引發了一些新的挑戰:
**Non-Trainable Components.** Discrete operations in methods like ClusterKV (Liu et al., 2024) (includes k-means clustering) and MagicPIG (Chen et al., 2024) (includes SimHash-based selecting) create discontinuities in the computational graph. These non-trainable components prevent gradient flow through the token selection process, limiting the model's ability to learn optimal sparse patterns.
**Non-Trainable Components.** 類似 ClusterKV(Liu et al., 2024,包含 k-means clustering)與 MagicPIG(Chen et al., 2024,包含 SimHash-based selecting)等方法中的離散操作,會在計算圖中造成不連續。這些無法訓練的組件阻斷了梯度在 token 選擇過程中的流動,從而限制了模型學習最適稀疏模式的能力。
**Inefficient Back-propagation.** Some theoretically trainable sparse attention methods suffer from practical training inefficiencies. Token-granular selection strategy used in approaches like HashAttention (Desai et al., 2024) leads to the need to load a large number of individual tokens from the KV cache during attention computation. This non-contiguous memory access prevents efficient adaptation of fast attention techniques like FlashAttention, which rely on contiguous memory access and blockwise computation to achieve high throughput. As a result, implementations are forced to fall back to low hardware utilization, significantly degrading training efficiency.
**Inefficient Back-propagation.** 某些在理論上可訓練的稀疏注意力方法,在實際訓練中仍面臨效率低落的問題。例如 HashAttention(Desai et al., 2024)所採用的 Token-granular selection strategy,會在注意力計算期間導致需要從 KV cache 中載入大量不連續的個別的 tokens。這種非連續性的記憶體存取模式,使得像 FlashAttention 這類仰賴連續記憶體與區塊式運算來實現高吞吐量的快速注意力演算法,難以被有效套用。因此,實作上被迫退回到硬體資源利用率較低的替代方式,進一步導致訓練效率大幅下降。
## 2.3. Native Sparsity as an Imperative
These limitations in inference efficiency and training viability motivate our fundamental redesign of sparse attention mechanisms. We propose NSA, a natively sparse attention framework that addresses both computational efficiency and training requirements. In the following sections, we detail the algorithmic design and operator implementation of NSA.
這些推論效率和訓練可行性上的限制,促使我們對稀疏注意力機制進行根本性的重新設計。我們提出 NSA,一個原生稀疏的注意力框架,能夠同時兼顧計算效率和訓練需求。接下來的各節將詳細介紹 NSA 的演算法設計及運算元實作。
## 3. Methodology
Our technical approach spans algorithm design and kernel optimization. In the following subsections, we first introduce the background of our methodology. Then we present the overall framework of NSA, followed by its key algorithmic components. Finally, we detail our hardware-optimized kernel design that maximizes practical efficiency.
我們的技術策略涵蓋演算法設計和核心的最佳化。在以下子節中,我們將首先介紹我們方法的背景。接著,我們會說明 NSA 的整體架構,並闡述其關鍵演算法組成部分。最後,我們詳細介紹了我們的硬體最佳化的核心設計,以最大效率提升實際效能。
## 3.1. Background
**Attention Mechanism** is widely used in language modeling where each query token $\mathbf{q}_t$ computes relevance scores against all preceding keys $\mathbf{k}_{:t}$ to generate a weighted sum of values $\mathbf{v}_{:t}$ . Formally, for an input sequence of length $t$ , the attention operation is defined as:
$$
\mathbf{o}_t = \text{Attn}(\mathbf{q}_t, \mathbf{k}_{:t}, \mathbf{v}_{:t}) \tag{1}
$$
**注意力機制** 廣泛應用於語言建模中,其中每個 query token $\mathbf{q}_t$ 會計算與所有前序的 keys $\mathbf{k}_{:t}$ 的關聯性分數,進而生成 values $\mathbf{v}_{:t}$ 的加權和。正式地,對於長度為 $t$ 的輸入序列,注意力計算定義如下:
$$
\mathbf{o}_t = \text{Attn}(\mathbf{q}_t, \mathbf{k}_{:t}, \mathbf{v}_{:t}) \tag{1}
$$
GPT4o補充說明:
| 元素 | 說明 |
|------------------|----------------------------------------------------------------------|
| $\mathbf{o}_t$ | 第 $t$ 個位置的輸出向量(output vector),通常為 attention 機制的輸出。 |
| $\text{Attn}$ | 注意力函數(Attention function),例如 softmax attention 或 sparse attention。 |
| $\mathbf{q}_t$ | 第 $t$ 個位置的查詢向量(query vector),代表查詢目前所需的資訊。 |
| $\mathbf{k}_{:t}$ | 從第 $1$ 到第 $t$ 個位置的鍵向量(key vectors),冒號 `:` 表示切片範圍。 |
| $\mathbf{v}_{:t}$ | 與 $\mathbf{k}_{:t}$ 對應的值向量(value vectors),用於生成加權輸出。 |
> 此式代表 **自回歸(causal)self-attention**,即第 $t$ 個輸出僅依賴於時間步 $1$ 到 $t$ 的資訊,不包含未來($t+1$ 以後)的資訊。
where Attn denotes the attention function:
$$
\text{Attn}(\mathbf{q}_t, \mathbf{k}_{:t}, \mathbf{v}_{:t}) = \sum^t_{i=1}\dfrac{\alpha_{t,i}\mathbf{v}_i}{\sum^t_{j=1}\alpha_{t,j}}, \quad \alpha_{t,i}=e^{\dfrac{\mathbf{q}^\top_t\mathbf{k}_i}{\sqrt{d_k}}} \tag{2}
$$
其中,Attn 表示注意力機制:
$$
\text{Attn}(\mathbf{q}_t, \mathbf{k}_{:t}, \mathbf{v}_{:t}) = \sum^t_{i=1}\dfrac{\alpha_{t,i}\mathbf{v}_i}{\sum^t_{j=1}\alpha_{t,j}}, \quad \alpha_{t,i}=e^{\dfrac{\mathbf{q}^\top_t\mathbf{k}_i}{\sqrt{d_k}}} \tag{2}
$$
GPT4o補充說明:
| 元素 | 說明 |
|-----------------------------|------------------------------------------------------------------------------------------|
| $\text{Attn}(\cdot)$ | 表示注意力函數,其輸入為 query、key、value 三個向量組。 |
| $\mathbf{q}_t$ | 第 $t$ 個位置的查詢向量(query vector)。 |
| $\mathbf{k}_{:t}, \mathbf{v}_{:t}$ | 到第 $t$ 為止的所有 key 和 value 向量,符合自回歸注意力的因果性(causal masking)。 |
| $\alpha_{t,i}$ | 表示 query $\mathbf{q}_t$ 與第 $i$ 個 key 向量 $\mathbf{k}_i$ 的相似度分數,透過點積加上縮放。 |
| $\mathbf{q}_t^\top \mathbf{k}_i$ | 表示 query 和 key 的 dot product(點積)結果,代表相似程度。 |
| $\sqrt{d_k}$ | $d_k$ 是 key 向量的維度,做平方根後用於縮放,防止點積值過大造成 softmax 梯度消失或爆炸。 |
| $\sum_{j=1}^t \alpha_{t,j}$ | 對所有權重 $\alpha$ 做歸一化處理,使注意力權重為 softmax-like 分佈。 |
| $\sum_{i=1}^t \dfrac{\alpha_{t,i} \mathbf{v}_i}{\sum \alpha}$ | 表示根據注意力權重對所有 value 向量做加權平均,即為最終的 attention 輸出。 |
這表示在第 $t$ 個位置,模型會計算當前查詢向量 $\mathbf{q}_t$ 與所有過去(從 $1$ 到 $t$)鍵向量 $\mathbf{k}_i$ 的相似度分數 $\alpha_{t,i}$,並對對應的值向量 $\mathbf{v}_i$ 做加權平均。權重是透過將點積結果 $\mathbf{q}_t^\top \mathbf{k}_i$ 除以 $\sqrt{d_k}$ 後取指數,再經過歸一化而得。
Here, $\alpha_{t,i}$ , represents the attention weight between $\mathbf{q}_t$ and $\mathbf{k}_i$ , and $d_k$ is the feature dimension of keys. As sequence length increases, attention computation becomes increasingly dominant in the overall computational cost, presenting significant challenges for long-context processing.
這裡,$\alpha_{t,i}$ 代表 $\mathbf{q}_t$ 和 $\mathbf{k}_i$ 之間的注意力權重,而 $d_k$ 是 keys 的特徵維度。隨著序列長度的增加,注意力計算在整體運算成本中占據越來越大的比重,這對處理長上下文任務帶來了顯著挑戰。
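As a minimal illustration of Equations (1)-(2), the NumPy sketch below computes causal attention for a single query position $t$; the shapes, the max-subtraction for numerical stability, and the toy data are illustrative choices rather than the paper's implementation.

```python
import numpy as np

def causal_attention_at_t(q_t, K, V):
    """q_t: (d_k,), K: (t, d_k), V: (t, d_v) -> output o_t: (d_v,)."""
    d_k = q_t.shape[-1]
    scores = K @ q_t / np.sqrt(d_k)           # q_t^T k_i / sqrt(d_k), i = 1..t
    alpha = np.exp(scores - scores.max())     # numerically stable exponentials
    weights = alpha / alpha.sum()             # softmax normalization over i
    return weights @ V                        # weighted sum of values, Eq. (2)

# toy usage
rng = np.random.default_rng(0)
t, d_k, d_v = 8, 4, 4
o_t = causal_attention_at_t(rng.normal(size=d_k),
                            rng.normal(size=(t, d_k)),
                            rng.normal(size=(t, d_v)))
```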
**Arithmetic Intensity** is the ratio of compute operations to memory accesses. It intrinsically shapes algorithm optimization on hardware. Each GPU has a critical arithmetic intensity determined by its peak compute capability and memory bandwidth, calculated as the ratio of these two hardware limits. For computation tasks, arithmetic intensity above this critical threshold becomes compute-bound (limited by GPU FLOPS), while below it becomes memory-bound (limited by memory bandwidth).
算術強度(Arithmetic Intensity)是指計算操作與記憶存取的比率。它本質上影響著演算法在硬體上的最佳化。每個 GPU 都有一個由其峰值計算能力與記憶體頻寬所決定的臨界算術強度,計算公式為這兩個硬體極限的比值。對於計算任務來說,如果算術強度超過此臨界閾值,則為[受計算界限](https://terms.naer.edu.tw/detail/ceec1461ce15ed40d7609fbc7af733aa/)(受 GPU FLOPS 限制);反之,若低於此閾值,則為[記憶體界限](https://terms.naer.edu.tw/detail/7d81fec708dbeb607d46d8d804fa9a1f/)(受到記憶體頻寬的限制)。
Specifically for causal self-attention mechanism, during training and prefilling phases, batched matrix multiplications and attention computations exhibit high arithmetic intensity, making these stages compute-bound on modern accelerators. In contrast, auto-regressive decoding becomes memory-bandwidth constrained because it generates one token per forward pass while requiring loading the entire key-value cache, resulting in low arithmetic intensity. This leads to different optimization goals - reducing computation cost during training and prefilling, while reducing memory access during decoding.
針對因果自注意力機制而言,在訓練和預填充階段,批量矩陣乘法與注意力計算呈現高算術強度,使得這些階段在現代加速器上成為計算瓶頸。相對地,自迴歸解碼因為每一次的 forward pass 僅生成一個符號,同時需要載入整個 key-value cache,導致低算術強度,進而造成記憶體頻寬受限。因此,不同階段的最佳化目標也有所不同:訓練和預填充階段降低計算成本,而解碼階段則降低記憶體存取。
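A back-of-the-envelope sketch of the roofline argument above. The peak-FLOPS and bandwidth figures are assumptions (roughly nominal A100 BF16 Tensor Core throughput and HBM bandwidth), used only to show how the critical arithmetic intensity separates compute-bound from memory-bound kernels.

```python
# Assumed device numbers, not measurements from the paper.
PEAK_FLOPS = 312e12      # ~312 TFLOPS (BF16 Tensor Core), assumed
MEM_BW     = 2.0e12      # ~2.0 TB/s HBM bandwidth, assumed

critical_intensity = PEAK_FLOPS / MEM_BW   # ~156 FLOPs per byte

def bound_regime(flops, bytes_moved):
    """Classify a kernel as compute- or memory-bound on this (assumed) device."""
    intensity = flops / bytes_moved
    return "compute-bound" if intensity >= critical_intensity else "memory-bound"

# Decoding one token against a long KV cache does O(t*d) FLOPs while also moving
# O(t*d) bytes, so its intensity is a small constant -> memory-bound, whereas
# batched prefilling/training matmuls reuse each loaded byte many times -> compute-bound.
```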
## 3.2. Overall Framework
To leverage the potential of attention with natural sparse pattern, we propose replacing the original key-value pairs $\mathbf{k}_{:t},\mathbf{v}_{:t}$ in Equation (1) with a more compact and information-dense set of representation key-value pairs $\tilde{K}_{t},\tilde{V}_t$ given each query $\mathbf{q}_t$ . Specifically, we formally define the optimized attention output as follows:
$$
\tilde{K}_t=f_K(\mathbf{q}_t,\mathbf{k}_{:t},\mathbf{v}_{:t}),\quad\tilde{V}_t=f_V(\mathbf{q}_t,\mathbf{k}_{:t},\mathbf{v}_{:t})\tag{3}
$$
$$
\mathbf{o}^*_t=\text{Attn}(\mathbf{q}_t, \tilde{K}_t,\tilde{V}_t)\tag{4}
$$
為了發揮注意力機制在原生稀疏模式下的潛力,我們建議將方程式 (1) 中原始的 key-value pairs $\mathbf{k}_{:t},\mathbf{v}_{:t}$ 替換為針對每個查詢向量 $\mathbf{q}_t$ 所對應的、更緊湊且資訊密度更高的表示性 key-value 對 $\tilde{K}_t,\tilde{V}_t$。具體而言,我們將優化後的注意力輸出正式定義如下:
$$
\tilde{K}_t=f_K(\mathbf{q}_t,\mathbf{k}_{:t},\mathbf{v}_{:t}),\quad\tilde{V}_t=f_V(\mathbf{q}_t,\mathbf{k}_{:t},\mathbf{v}_{:t})\tag{3}
$$
$$
\mathbf{o}^*_t=\text{Attn}(\mathbf{q}_t, \tilde{K}_t,\tilde{V}_t)\tag{4}
$$
:::warning
快速回憶方程式(1)
$$
\mathbf{o}_t = \text{Attn}(\mathbf{q}_t, \mathbf{k}_{:t}, \mathbf{v}_{:t}) \tag{1}
$$
:::
where $\tilde{K}_t, \tilde{V}_t$ are dynamically constructed based on the current query $\bf{q}_t$ and the contextual memory $\bf{k}_{:t},\bf{v}_{:t}$ . We can design various mapping strategies to get different categories of $\tilde{K}^c_t,\tilde{V}^c_t$, and combine them as follows:
$$
\bf{o}^*_t=\sum_{c\in C}g^c_t\cdot\text{Attn}(\bf{q}_t,\tilde{K}^c_t,\tilde{V}^c_t)\tag{5}
$$
其中 $\tilde{K}_t, \tilde{V}_t$ 是基於當前查詢 $\bf{q}_t$ 和上下文記憶 $\bf{k}_{:t},\bf{v}_{:t}$ 動態構造的。我們可以設計各種映射策略來獲得不同類別的 $\tilde{K}^c_t,\tilde{V}^c_t$,並將它們結合起來:
$$
\bf{o}^*_t=\sum_{c\in C}g^c_t\cdot\text{Attn}(\bf{q}_t,\tilde{K}^c_t,\tilde{V}^c_t)\tag{5}
$$
:::warning
* $C$:映射策略集合(多種稀疏 attention 類型)
* $c \in C$:第 $c$ 種注意力策略
* $g^c_t$:對第 $c$ 種策略的加權係數,通常由 gating 機制(如 softmax)決定
* $\text{Attn}(\mathbf{q}_t, \tilde{K}^c_t, \tilde{V}^c_t)$:使用第 $c$ 組壓縮 key-value 所得到的注意力輸出
:::
As illustrated in Figure 2, NSA has three mapping strategies $C = \left\{ \text{cmp} , \text{slc}, \text{win} \right\}$, representing compression, selection, and sliding window for keys and values. $g^c_t\in[0,1]$ is the gate score for the corresponding strategy $c$, derived from input features via an MLP and sigmoid activation. Let $N_t$ denote the total number of remapped keys/values:
$$
N_t=\sum_{c\in C}\text{size}[\tilde{K}^c_t]\tag{6}
$$
如 Figure 2 所示,NSA 使用三種映射策略 $C = \left\{ \text{cmp} , \text{slc}, \text{win} \right\}$,分別代表對 keys、values 做壓縮、選擇和滑動窗口的處理。 其中, $g^c_t\in[0,1]$ 表示每個策略 $c$ 的門控分數,由輸入特徵經多層感知機 (MLP) 和 sigmoid activation 計算而得。 假設 $N_t$ 表示remapped keys/values 的總數:
$$
N_t=\sum_{c\in C}\text{size}[\tilde{K}^c_t]\tag{6}
$$
:::warning
* $\sum_{c\in C}$:所有映射策略進行總和計算
* $\text{size}[\tilde{K}^c_t]$:所有映射策略通道$c$中,針對查詢 $\bf{q}_t$ 所生成的壓縮大小
在時間步 $t$ 時,總共參與注意力運算的壓縮 key 數量 $N_t$,定義為所有通道 $c \in C$ 中對應壓縮 key 表示 $\tilde{K}^c_t$ 的大小之總和
:::
We maintain a high sparsity ratio by ensuring $N_t \ll t$ .
我們透過確保 $N_t \ll t$ 的方式來維持高稀疏率。
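A minimal sketch of how Equations (5)-(6) combine the three remapped key/value sets, assuming the gate scores $g^c_t$ have already been produced by an MLP with sigmoid activation over the input features (the gate network here is a made-up single layer for illustration); all names and shapes are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def attn(q_t, K, V):
    """Plain softmax attention of one query against a small key/value set."""
    w = np.exp(K @ q_t / np.sqrt(q_t.shape[-1]))
    return (w / w.sum()) @ V

def nsa_combine(q_t, branches, gates):
    """branches: list of (K_c, V_c) for c in {cmp, slc, win}; gates: g_t^c in [0,1]."""
    o_t = sum(g * attn(q_t, K_c, V_c) for g, (K_c, V_c) in zip(gates, branches))  # Eq. (5)
    N_t = sum(K_c.shape[0] for K_c, _ in branches)                                # Eq. (6)
    return o_t, N_t

# toy usage with made-up shapes; the gate "MLP" is a single linear layer + sigmoid
rng = np.random.default_rng(0)
d_k, d_v = 4, 4
q_t = rng.normal(size=d_k)
branches = [(rng.normal(size=(m, d_k)), rng.normal(size=(m, d_v))) for m in (6, 8, 4)]
x_t = rng.normal(size=5)
W_gate, b_gate = rng.normal(size=(5, 3)), np.zeros(3)
gates = sigmoid(x_t @ W_gate + b_gate)     # one gate per branch
o_t, N_t = nsa_combine(q_t, branches, gates)
```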
## 3.3. Algorithm Design
In this subsection, we introduce the design of our remapping strategies $f_K$ and $f_V$ : token compression, token selection, and sliding window.
在這個小節中,我們介紹我們的重映射策略 $f_K$ 和 $f_V$ 的設計:token compression、token selection 和 sliding window。
## 3.3.1. Token Compression
By aggregating sequential blocks of keys or values into block-level representations, we obtain compressed keys and values that capture the information of the entire block. Formally, the compressed key representation is defined as:
$$
\tilde{K}_t^{\text{cmp}} = f^{\text{cmp}}_K(\mathbf{K}_{:t}) =
\left\{ \varphi\left( \mathbf{k}_{id+1:id+l} \right)
\middle\vert 0 \leq i \leq \left\lfloor \dfrac{t-l}{d} \right\rfloor \right\} \tag{7}
$$
透過將 keys 或 values 的連續區塊彙總為區塊的表示(block-level representations),我們可以得到壓縮後的 keys 和 values,這些表示能夠捕捉整個區塊中的資訊。正式來說,壓縮後的 key representation 定義如下:
$$
\tilde{K}_t^{\text{cmp}} = f^{\text{cmp}}_K(\mathbf{K}_{:t}) =
\left\{ \varphi\left( \mathbf{k}_{id+1:id+l} \right)
\middle\vert 0 \leq i \leq \left\lfloor \dfrac{t-l}{d} \right\rfloor \right\} \tag{7}
$$
where $l$ is the block length, $d$ is the sliding stride between adjacent blocks, and $\varphi$ is a learnable MLP with intra-block position encoding to map keys in a block to a single compressed key. $\tilde{K}_t^{\text{cmp}}\in\mathbb{R}^{d_k\times \lfloor\dfrac{t-l}{d}\rfloor}$ is the tensor composed of compressed keys. Usually, we adopt $d<l$ to mitigate information fragmentation. An analogous formulation holds for the compressed value representation $\tilde{V}_t^{\text{cmp}}$. Compressed representations capture coarser-grained, higher-level semantic information and reduce the computational burden of attention.
其中,$l$ 表示區塊的長度、$d$ 表示相鄰區塊之間滑動步幅,而 $\varphi$ 是個具有區塊內位置編碼的可學習多層感知器 (MLP),用於將每個區塊中的鍵映射為單一壓縮後的 key。$\tilde{K}_t^{\text{cmp}}\in\mathbb{R}^{d_k\times \lfloor\dfrac{t-l}{d}\rfloor}$ 為由 compressed keys 所組成的張量。通常,我們會採用 $d<l$ 的設置來減輕資訊碎片化問題。類似的公式也適用於壓縮後的 value representation $\tilde{V}_t^{\text{cmp}}$。這些壓縮表示能夠捕捉較粗粒度的高層語義資訊,並降低注意力計算的負擔。
:::warning
GPT解釋:
* $\tilde{K}_t^{\text{cmp}} = f^{\text{cmp}}_K(\mathbf{K}_{:t})$:將前 $t$ 個原始 key 向量 $\mathbf{K}_{:t}$(即 $\mathbf{K}_1$ 到 $\mathbf{K}_t$)經過某個壓縮函數 $f_K^{\text{cmp}}$,生成壓縮後的 key 表示 $\tilde{K}_t^{\text{cmp}}$,簡單說就是我們針對前 $t$ 個 key 做壓縮,得到壓縮後的 key 序列
* $\left\{ \varphi\left( \mathbf{k}_{id+1:id+l} \right) \middle\vert 0 \leq i \leq \left\lfloor \dfrac{t-l}{d} \right\rfloor \right\}$
* 區塊選擇 $\mathbf{k}_{id+1:id+l}$:表示從原始 key 序列中擷取一段長度為 $l$ 的區塊,從索引 $id+1$ 到 $id+l$,其中 $i$ 為區塊編號,$d$ 是滑動步長,所以這是在做滑動區塊採樣?
* 映射函數 $\varphi(\cdot)$:將這段 key 區塊(含 $l$ 個 key 向量)餵給一個可學習函數 $\varphi$(即帶位置編碼的 MLP),映射成一個單一壓縮 key
* 索引範圍 $0 \leq i \leq \left\lfloor \frac{t - l}{d} \right\rfloor$:$i$ 的最大值保證區塊不會超出 $t$ 的邊界,使用整數下取(floor)是為了確保每個區塊都完整
公式 (7) 描述了如何以滑動區塊方式從原始 key 序列中抽取每一段長度為 $l$ 的 key 區塊,並透過一個帶位置編碼的多層感知器($\varphi$)壓縮為單一表示向量。這樣的區塊每次滑動 $d$ 個位置,直到時間步長 $t$ 為止。
:::
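The sketch below walks through Equation (7): slide over the key sequence with block length $l$ and stride $d$, and map each block to a single compressed key. The flatten-and-project step and the zero position encoding are stand-ins for the paper's learnable MLP $\varphi$ with intra-block position encoding; they are placeholders for illustration only.

```python
import numpy as np

def compress_keys(K, l=32, d=16, W=None, pos=None):
    """K: (t, d_k) -> compressed keys: (num_blocks, d_k), num_blocks = floor((t-l)/d) + 1."""
    t, d_k = K.shape
    if W is None:
        W = np.eye(l * d_k, d_k) / l        # placeholder for phi's learnable weights
    if pos is None:
        pos = np.zeros((l, d_k))            # placeholder intra-block position encoding
    blocks = []
    for i in range((t - l) // d + 1):
        blk = K[i * d : i * d + l] + pos    # tokens id+1 .. id+l of Eq. (7), 0-indexed slice
        blocks.append(blk.reshape(-1) @ W)  # phi: map one block to one compressed key
    return np.stack(blocks)                 # K_tilde^cmp

K = np.random.default_rng(0).normal(size=(128, 8))
K_cmp = compress_keys(K, l=32, d=16)        # shape (7, 8): floor((128-32)/16)+1 blocks
```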
## 3.3.2. Token Selection
Using only compressed keys, values might lose important fine-grained information, motivating us to selectively preserve individual keys, values. Below we describe our efficient token selection mechanism that identifies and preserves the most relevant tokens with low computational overhead.
僅使用壓縮後的 keys、values 可能會導致重要細微資訊的遺失,這促使我們需要選擇性地保留個別的 keys 和 values。下面我們將介紹一種高效的 token selection 的機制,它能夠有效地識別並保留最相關的 tokens,同時具備低的計算開銷。
**Blockwise Selection.** Our selection strategy processes key and value sequences in spatially continuous blocks, motivated by two key factors: hardware efficiency considerations and inherent distribution patterns of attention scores. Blockwise selection is crucial to achieve efficient computation on modern GPUs. That is because modern GPU architectures exhibit significantly higher throughput for continuous block accesses compared to random index-based reads. Also, blockwise computation enables optimal utilization of Tensor Cores. This architectural characteristic has established blockwise memory access and computation as a fundamental principle in high-performance attention implementations, as exemplified by FlashAttention's block-based design. Blockwise selection follows the inherent distribution patterns of attention scores. Prior works (Jiang et al., 2024) have shown that attention scores often exhibit spatial continuity, suggesting that neighboring keys tend to share similar importance levels. Our visualization in Section 6.2 also shows this spatial continuous pattern.
**Blockwise Selection.** 我們的選擇策略是在一個空間連續的區塊中處理 key 和 value的序列,主要受兩種關鍵因素所啟發:硬體效率的考量和注意力分數本身的固有分佈模式。分塊選擇對於在現代 GPU 上實現高效計算至關重要。這是因為,相較於基於隨機索引的讀取方式,現代 GPU 架構對於連續區塊存取有著更高的吞吐量。還有就是,分塊計算還能夠最佳化 Tensor Cores 的效能。這種架構特點使分塊式的記憶體存取和計算成為高效能注意力實現的基本原則,例如 FlashAttention 的基於塊的設計。分塊選擇遵循注意力分數的固有分佈模式。早先的研究(Jiang, et al., 2024)也說明著,注意力分數往往會呈現空間連續性,這說明著相鄰的 keys 傾向於共享類似的重要性等級(importance levels)。我們在 Section 6.2 中的可視化也顯示了這種空間連續模式。
To implement blockwise selection, we first divide key, value sequences into selection blocks. To identify the most important blocks for attention computation, we need to assign importance scores to each block. Below we present our method for computing these block-level importance scores.
為了實作區塊式選擇,我們首先將 key 與 value 序列劃分為多個選擇區塊。為了在注意力計算中識別出最關鍵的區塊,我們需要為每個區塊分配一個重要性分數。下面將說明我們用於計算這些區塊級重要性分數的方法。
**Importance Score Computation.** Computing block importance scores could introduce significant overhead. Fortunately, the attention computation of compression tokens produces intermediate attention scores that we can leverage to induce selection block importance scores, formulated as:
$$
\mathbf{p}_t^{\text{cmp}}=\text{Softmax}(\mathbf{q}_t^T\tilde{K}_t^{\text{cmp}})\tag{8}
$$
**Importance Score Computation.** 計算區塊的重要性得分可能會帶來顯著的開銷。幸運的是,壓縮詞彙的注意力計算會產生我們可以利用的中間注意力得分,這些得分可以用於推導出選擇區塊的重要性得分,其公式如下:
$$
\mathbf{p}_t^{\text{cmp}}=\text{Softmax}(\mathbf{q}_t^T\tilde{K}_t^{\text{cmp}})\tag{8}
$$
where $\mathbf{p}_t^{\text{cmp}} \in \mathbb{R}^{\lfloor\dfrac{t-l}{d}\rfloor +1}$ is the attention scores between $q_t$ and compression keys $\tilde{K}_t^{\text{cmp}}$. Let $l'$ denote the selection block size. When compression blocks and selection blocks share the same blocking scheme, i.e., $l'=l=d$ , we can directly obtain the selection block importance scores $\mathbf{p}_t^{\text{slc}}$ by $\mathbf{p}_t^{\text{slc}}=\mathbf{p}_t^{\text{cmp}}$ straightforwardly. For cases where the blocking schemes differ, we derive the importance scores for selection blocks according to their spatial relationship. Given $l \leq l',d\vert l$ and $d\vert l'$ , we have:
$$
\mathbf{p}_t^{\text{slc}}[j]=\sum_{m=0}^{\dfrac{l'}{d}-1}\sum_{n=0}^{\dfrac{l}{d}-1}\mathbf{p}_t^{\text{cmp}}\Bigg[ \dfrac{l'}{d}j-m-n \Bigg] \tag{9}
$$
其中 $\mathbf{p}_t^{\text{cmp}} \in \mathbb{R}^{\lfloor\dfrac{t-l}{d}\rfloor +1}$ 表示查詢向量 $\mathbf{q}_t$ 與壓縮後的 keys $\tilde{K}_t^{\text{cmp}}$ 之間的注意力分數。
假設 $l'$ 表示 selection block(選擇區塊)的大小。當壓縮區塊與選擇區塊使用相同的分區策略(即 $l'=l=d$)時,我們可以直接將 $\mathbf{p}_t^{\text{cmp}}$ 作為選擇區塊的重要性分數,即 $\mathbf{p}_t^{\text{slc}}=\mathbf{p}_t^{\text{cmp}}$。
若兩者的分區策略不同,則需依據其在時間軸上的空間對應關係來推導出選擇區塊的重要性分數。
給定 $l \leq l'$,以及 $d \vert l$ 且 $d \vert l'$ 的情況下,我們可以進一步導出如下關係:
$$
\mathbf{p}_t^{\text{slc}}[j]=\sum_{m=0}^{\dfrac{l'}{d}-1}\sum_{n=0}^{\dfrac{l}{d}-1}\mathbf{p}_t^{\text{cmp}}\Bigg[ \dfrac{l'}{d}j-m-n \Bigg] \tag{9}
$$
where $[\cdot]$ denotes the indexing operator for accessing vector element. For models employing GQA or MQA where key-value caches are shared across query heads, consistent block selection across these heads has to be ensured to minimize KV cache loading during decoding. The shared importance scores across heads in a group are formally defined as:
$$
\mathbf{p}_t^{\text{slc}'}=\sum^{H}_{h=1}\mathbf{p}_t^{\text{slc},(h)}\tag{10}
$$
其中 $[\cdot]$ 表示用於存取向量元素的索引運算符。對於採用 GQA 或 MQA 架構的模型,其 key-value caches 會在 query head 間共享,因此需要確保每個群組內的 query head 使用相同的區塊選取,以最小化解碼期間 KV cache 的載入量。為此,一個群組內各個 query head 共享的重要性分數定義為:
$$
\mathbf{p}_t^{\text{slc}'}=\sum^{H}_{h=1}\mathbf{p}_t^{\text{slc},(h)}\tag{10}
$$
where $(h)$ in the superscript denotes the head index, and $H$ is the number of query heads in each group. This aggregation ensures consistent block selection across heads within the same group.
其中上標的 $(h)$ 表示 head index,$H$ 代表每個群組中的 query head 的數量。這種聚合方式確保同一個群組中各個頭對應一致的區塊選擇。
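One concrete way to realize Equations (8)-(10) is sketched below: credit the score of each compression block (start $id$, length $l$) to every size-$l'$ selection block it overlaps, then sum the resulting scores over the query heads of a GQA group. The index arithmetic replaces the closed-form double sum of Equation (9) and assumes the block geometry stated in the text ($d \mid l$, $d \mid l'$); it is an illustrative reading, not the paper's fused kernel.

```python
import numpy as np

def selection_scores(p_cmp, t, l=32, d=16, l_sel=64):
    """Map compression-block scores (Eq. 8) to selection-block scores by crediting
    each compression block (tokens i*d .. i*d+l-1, 0-indexed) to every selection
    block of size l_sel that it overlaps."""
    num_sel = (t + l_sel - 1) // l_sel
    p_slc = np.zeros(num_sel)
    for i, score in enumerate(p_cmp):
        first = (i * d) // l_sel                          # first overlapped selection block
        last = min((i * d + l - 1) // l_sel, num_sel - 1) # last overlapped selection block
        p_slc[first : last + 1] += score
    return p_slc

def shared_scores(p_slc_per_head):
    """Eq. (10): sum per-head scores so every head in a GQA group picks the same blocks."""
    return np.asarray(p_slc_per_head).sum(axis=0)
```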
**Top $n$ Block Selection.** After obtaining the selection block importance scores, we retain tokens within the top-$n$ sparse blocks ranked by block importance scores, formulated as:
$$
\mathcal{I}_t=\left\{i\vert \text{rank}(\mathbf{p}_t^{\text{slc}'}[i])\leq n \right\} \tag{11}
$$
$$
\tilde{K}^{\text{slc}}_{t}=\text{Cat}\left[\left\{\mathbf{k}_{il'+1:(i+1)l'} \,\middle\vert\, i \in \mathcal{I}_{t}\right\}\right]\tag{12}
$$
**Top $n$ Block Selection.** 在獲得區塊重要性得分後,我們保留根據區塊重要性得分排名靠前的 top-$n$ 個稀疏區塊中的詞彙,提取如下:
$$
\mathcal{I}_t=\left\{i\vert \text{rank}(\mathbf{p}_t^{\text{slc}'}[i])\leq n \right\} \tag{11}
$$
$$
\tilde{K}^{\text{slc}}_{t}=\text{Cat}\left[\left\{\mathbf{k}_{il'+1:(i+1)l'} \,\middle\vert\, i \in \mathcal{I}_{t}\right\}\right]\tag{12}
$$
where $\text{rank} (\cdot)$ denotes the ranking position in descending order, with $\text{rank}=1$ corresponding to the highest score, $\mathcal{I}_t$ is the set of selected blocks' indices, and $\text{Cat}$ denotes the concatenation operation. $\tilde{K}^{\text{slc}}_{t}\in\mathbb{R}^{d_k\times nl'}$ is the tensor composed of selected keys. An analogous formulation applies to the fine-grained value $\tilde{V}^{\text{slc}}_{t}$. The selected keys and values then participate in the attention computation with $\mathbf{q}_t$ as defined in Equation (5).
其中 $\text{rank} (\cdot)$ 表示按降序排列的排序位置,$\text{rank}=1$ 對應最高分數,$\mathcal{I}_t$ 為所選取的區塊索引集合,$\text{Cat}$ 表示串接(concatenation)操作。$\tilde{K}^{\text{slc}}_{t}\in\mathbb{R}^{d_k\times nl'}$ 是由選取後的 keys 所組成的張量。類似的公式適用於 fine-grained value $\tilde{V}^{\text{slc}}_{t}$。然後,選取的 keys 和 values 會與公式 (5) 中定義的 $\mathbf{q}_t$ 一起進行注意力計算。
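A minimal sketch of Equations (11)-(12): keep the $n$ highest-scoring selection blocks and concatenate their raw keys and values. The always-activated initial and local blocks mentioned in Section 4.1 and any tie-breaking rule are omitted for brevity; names and shapes are illustrative.

```python
import numpy as np

def gather_top_n_blocks(p_slc_shared, K, V, l_sel=64, n=16):
    """p_slc_shared: (num_blocks,) shared scores from Eq. (10); K/V: (t, d)."""
    top = np.argsort(-p_slc_shared)[:n]                  # Eq. (11): indices with rank <= n
    idx = np.sort(top)                                   # restore temporal order before Cat
    K_slc = np.concatenate([K[i * l_sel : (i + 1) * l_sel] for i in idx], axis=0)
    V_slc = np.concatenate([V[i * l_sel : (i + 1) * l_sel] for i in idx], axis=0)
    return K_slc, V_slc                                  # Eq. (12): Cat of selected blocks
```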
## 3.3.3. Sliding Window
In attention mechanisms, local patterns typically adapt faster and can dominate the learning process, potentially preventing the model from effectively learning from compression and selection tokens. To address this issue, we introduce a dedicated sliding window branch that explicitly handles local context, allowing other branches (compression and selection) to focus on learning their respective features without being shortcutted by local patterns. Specifically, we maintain recent tokens $\tilde{K}^{\text{win}}_{t}=\mathbf{k}_{t-w:t}$, $\tilde{V}_t^{\text{win}}=\mathbf{v}_{t-w:t}$ in a window $w$ , and isolate attention computations of different information sources (compression tokens, and selected tokens, sliding window) into separate branches. These branch outputs are then aggregated through a learned gating mechanism. To further prevent shortcut learning across attention branches with marginal computational overhead, we provide independent keys and values for three branches. This architectural design enables stable learning by preventing gradient interference between local and long-range pattern recognition, while introducing minimal overhead.
在注意力機制中,局部模式通常適應得比較快,並可能主導整個學習過程,進而抑制模型有效地從 compression tokens 與 selection tokens 中學習。為了解決這個問題,我們引入一種特有的 sliding window branch,此分支可以顯式地處理局部上下文,允許其它分支(branch)(compression and selection)能夠專注於學習它們的表示特徵,而不會受到局部模式的捷徑影響。具體來說,我們在 window $w$ 中保留了近期的 tokens $\tilde{K}^{\text{win}}_{t}=\mathbf{k}_{t-w:t}$、$\tilde{V}_t^{\text{win}}=\mathbf{v}_{t-w:t}$,然後將來自不同來源的注意力計算(compression tokens、selected tokens、sliding window)分別置於獨立的注意力分支中處理。這些分支的輸出隨後會透過一個可學習的門控機制來進行聚合。為了在不明顯增加計算負擔的前提下,進一步預防注意力分支(attention branches)之間的 shortcut learning,我們為三個分支提供各自的 keys 與 values。這種架構設計能夠透過預防局部與長距離模式識別之間的梯度干擾來穩定學習,同時僅增加極少的額外負擔。
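The sliding-window branch itself reduces to a slice over the most recent $w$ tokens, as in the minimal sketch below; note that in NSA each branch has its own keys and values, whereas the single shared K/V here is purely for illustration.

```python
import numpy as np

def window_branch(K, V, t, w=512):
    """Return the sliding-window keys/values k_{t-w:t}, v_{t-w:t} (clipped at the start)."""
    start = max(0, t - w)
    return K[start:t], V[start:t]

# toy usage: at position t=1000 with w=512, the branch sees keys/values 488..999
K = np.zeros((2048, 8)); V = np.zeros((2048, 8))
K_win, V_win = window_branch(K, V, t=1000)   # shapes (512, 8)
```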
After obtaining all three categories of keys and values $(\tilde{K}_t^{\text{cmp}}, \tilde{V}_t^{\text{cmp}}, \tilde{K}_t^{\text{slc}}, \tilde{V}_t^{\text{slc}}; \tilde{K}_t^{\text{win}}, \tilde{V}_t^{\text{win}})$, we compute the final attention output following Equation (5). Together with the compression, selection, and sliding window mechanisms described above, this forms the complete algorithmic framework of NSA.
我們在取得所有三種類別的 keys 與 values $(\tilde{K}_t^{\text{cmp}}, \tilde{V}_t^{\text{cmp}}, \tilde{K}_t^{\text{slc}}, \tilde{V}_t^{\text{slc}}; \tilde{K}_t^{\text{win}}, \tilde{V}_t^{\text{win}})$ 之後,我們根據公式(5)計算最終的注意力輸出。結合上面所說的壓縮(compression)、選擇(selection)和滑動窗口(sliding window)機制,最終構成了 NSA 完整的演算法框架。
## 3.4. Kernel Design
To achieve FlashAttention-level speedup during the training and prefilling, we implement hardware-aligned sparse attention kernels upon Triton. Given MHA is memory-intensive and inefficient for decoding, we focus on architectures with shared KV caches like GQA and MQA following the current state-of-the-art LLMs. While compression and sliding window attention computations are readily compatible with existing FlashAttention-2 kernels, we introduce the specialized kernel design for sparse selection attention. If we were to follow FlashAttention's strategy of loading temporally continuous query blocks into SRAM, it would result in inefficient memory access since queries within a block may require disjoint KV blocks. To address this, our key optimization lies in a different query grouping strategy: for each position on the query sequence, we load all query heads within a GQA group (they share the same sparse KV blocks) into SRAM. Figure 3 illustrates our forward pass implementation. The proposed kernel architecture is characterized by the following key features:
為了在訓練與預填充(prefilling)階段達到如 FlashAttention 的加速效果,我們在 Triton 上實作 hardware-aligned sparse attention kernels。由於多頭注意力(Multi-Head Attention, MHA)在解碼階段屬記憶體密集且效率不佳,我們關注於具有共享 KV caches 架構的模型,例如 GQA 和 MQA,這與目前主流的大語言模型(LLM)趨勢一致。雖然壓縮和滑動窗口注意力計算可以直接相容 FlashAttention-2 kernels,不過我們針對稀疏選取注意力引入了專門設計的核架構。如果我們依循 FlashAttention 將時間連續的查詢區塊載入 SRAM 的策略,那將會導致記憶體存取效率低下,這是因為該區塊內的查詢(query)可能分別需要存取彼此不相連的 KV 區塊(disjoint KV blocks)。為了解決這個問題,我們採用一種新的查詢分組策略作為關鍵的最佳化:對於查詢序列(query sequence)中的每個位置,我們將 GQA group(它們共享相同的 sparse KV blocks)中所有 query head 載入 SRAM 中。Figure 3 說明了我們的正向傳遞的實作方式。 所提出的核架構具有以下主要特徵:
1. **Group-Centric Data Loading .** For each inner loop, load all heads' queries $Q\in\mathbb{R}^{[h,d_k]}$ in the group at position $t$ and their shared sparse key/value block indices $\mathcal{I}_t$ .
2. **Shared KV Fetching .** In the inner loop, sequentially load continuous key/value blocks indexed by $\mathcal{I}_t$ into SRAM as $K\in\mathbb{R}^{[B_k,d_k]}$, $V\in\mathbb{R}^{[B_k,d_v]}$ to minimize memory loading, where $B_k$ is the kernel block size satisfying $B_k\vert l'$.
3. **Outer Loop on Grid .** Since the inner-loop length (proportional to the selected block count $n$) remains nearly identical for different query blocks, we put query/output loops in Triton's grid scheduler to simplify and optimize the kernel.
1. **Group-Centric Data Loading .** 在每一次的內部循環中,我們載入位置 $t$ 在群組內的所有頭部的查詢 $Q\in\mathbb{R}^{[h,d_k]}$ 及其共享稀疏的 key/value block 指標 $\mathcal{I}_t$。
2. **Shared KV Fetching .** 在內部循環中,根據 $\mathcal{I}_t$ 的索引依序將連續的 key/value blocks 載入到 SRAM 作為 $K\in\mathbb{R}^{[B_k,d_k]}$、$V\in\mathbb{R}^{[B_k,d_v]}$,以最小化記憶體載入,其中 $B_k$ 是滿足 $B_k\vert l'$ 的內核區塊大小。
3. **Outer Loop on Grid .** 由於內部循環長度(與所選定的區塊數量 $n$ 成正比)對於不同的查詢區塊幾乎相同,我們將查詢/輸出循環加入 Triton 的網格調度器(grid scheduler)中,以簡化和最佳化核心的運算。
This design achieves near-optimal arithmetic intensity by (1) eliminating redundant KV transfers through group-wise sharing, and (2) balancing compute workloads across GPU streaming multiprocessors.
此設計透過(1)群組共享方式消除冗餘的 KV transfers,以及(2)在 GPU 串流多處理器之間平衡計算工作負載,實現了接近最佳的算術強度。
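To visualize the loop structure of Figure 3 without Triton-specific details, here is a schematic plain-NumPy rendering: the outer (grid) loop walks query positions, loads the whole GQA group of query heads at once, and the inner loop streams that group's shared selected KV blocks while maintaining an online softmax. This illustrates the data flow only and is not the paper's kernel; it also assumes every position has at least one selected block.

```python
import numpy as np

def sparse_selection_attention(Q, K, V, block_idx, l_sel=64):
    """Q: (T, h, d_k) query heads of one GQA group; K: (T, d_k), V: (T, d_v);
    block_idx[t]: selected block indices I_t shared by the whole group at position t."""
    T, h, d_k = Q.shape
    d_v = V.shape[-1]
    O = np.zeros((T, h, d_v))
    for t in range(T):                                   # grid loop over query positions
        q = Q[t]                                         # (h, d_k): all heads loaded together
        m = np.full(h, -np.inf); z = np.zeros(h); acc = np.zeros((h, d_v))
        for i in block_idx[t]:                           # inner loop over shared KV blocks
            Kb = K[i * l_sel : (i + 1) * l_sel]          # contiguous block load
            Vb = V[i * l_sel : (i + 1) * l_sel]
            s = q @ Kb.T / np.sqrt(d_k)                  # (h, block) scores
            m_new = np.maximum(m, s.max(axis=1))         # online softmax bookkeeping
            scale = np.exp(m - m_new)
            p = np.exp(s - m_new[:, None])
            z = z * scale + p.sum(axis=1)
            acc = acc * scale[:, None] + p @ Vb
            m = m_new
        O[t] = acc / z[:, None]
    return O
```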

Figure 3 | Kernel design for NSA. The kernel loads queries by GQA groups (Grid Loop), fetches corresponding sparse KV blocks (Inner Loop), and performs attention computation on SRAM. Green blocks indicate data on SRAM, while blue indicates data on HBM.
Figure 3 | NSA 的核設計。核心透過 GQA 群組(網格循環)載入查詢,並從相應的稀疏 KV 區塊(內部循環)中提取資料,在 SRAM 上進行注意力計算。綠色方塊表示 SRAM 中的資料,藍色則表示 HBM 中的資料。
## 4. Experiments
We evaluate NSA through three lenses: (1) general benchmarks performance, (2) long-context benchmarks performance, and (3) chain-of-thought reasoning performance, comparing against Full Attention baseline and state-of-the-art sparse attention methods. We defer the efficiency analysis of our sparse computation paradigm to Section 5, where we provide detailed discussions on training and inference speed.
我們從三個面向來評估 NSA:(1) 通用的基準效能,(2) 長上下文基準效能,以及 (3) 思維鏈(chain-of-thought)推理效能,並且與全注意力(Full Attention)基線和最新稀疏注意力方法進行比較。針對我們的稀疏運算架構的效率分析,稍後將在 Section 5 討論,在那邊會詳細說明涵蓋訓練和推理速度方面的相關議題。
## 4.1. Pretraining Setup
Following the common practice in state-of-the-art LLMs, our experiments adopt a backbone combining Grouped-Query Attention (GQA) and Mixture-of-Experts (MoE), featuring 27B total parameters with 3B active parameters. The model consists of 30 layers with a hidden dimension of 2560. For GQA, we set the number of groups to 4, with a total of 64 attention heads. For each head, the hidden dimensions of the query, key, and value are configured as $d_q=d_k=192$ and $d_v=128$, respectively. For MoE, we utilize the DeepSeekMoE (Dai et al., 2024; DeepSeek-AI, 2024) structure, with 72 routed experts and 2 shared experts, and set the top-k experts to 6. To ensure training stability, the MoE in the first layer is replaced by an MLP in the form of SwiGLU.
基於當前最新的語言模型 (LLM) 的慣例,我們在實驗中採用一個結合了 Grouped-Query Attention (GQA) 和 Mixture-of-Experts (MoE) 的骨幹結構,總參數為 27B,其中活躍參數為 3B。模型總共包含 30 層,隱藏層維度為 2560。GQA 的部份,我們將組群數量設置為 4,共有 64 個注意力頭(attention heads)。對於每一個注意力頭,其查詢(query)、鍵(key)、值(value)的隱藏維度分別設定為 $d_q=d_k=192$,$d_v=128$。MoE 的部份則是採用 DeepSeekMoE (Dai et al., 2024; DeepSeek-AI, 2024)的架構,有 72 個路由專家(routed experts)和 2 個共享專家(shared experts),並將 top-k experts 設定為 6。為了確保訓練的穩定性,第一層的 MoE 被替換為 SwiGLU 形態的多層感知機 (MLP)。

Figure 4 | Pretraining loss comparison between Full Attention and our NSA on 27B-parameter model. Both models exhibit stable convergence, with NSA achieving lower loss values.
Figure 4 | 於 27B-參數模型上,Full Attention 與我們提出的 NSA 在預訓練損失的比較。兩者模型均表現出穩定的收斂,NSA 則有著更低的損失值。
| Model | MMLU Acc. 5-shot | MMLU-PRO Acc. 5-shot | CMMLU Acc. 5-shot | BBH Acc. 3-shot | GSM8K Acc. 8-shot | MATH Acc. 4-shot | DROP F1 1-shot | MBPP Pass@1 3-shot | HumanEval Pass@1 0-shot | Avg. |
|-----------|--------------------|------------------------|---------------------|-------------------|---------------------|--------------------|------------------|----------------------|---------------------------|--------|
| Full Attn | 0.567 | 0.279 | 0.576 | 0.497 | 0.486 | 0.263 | 0.503 | 0.482 | 0.335 | 0.443 |
| NSA       | 0.565              | 0.286                  | 0.587               | 0.521             | 0.520               | 0.264              | 0.545            | 0.466                | 0.348                     | 0.456  |
Table 1 | Pretraining performance comparison between the full attention baseline and NSA on general benchmarks, across knowledge (MMLU, MMLU-PRO, CMMLU), reasoning (BBH, GSM8K, MATH, DROP), and coding (MBPP, HumanEval) tasks. NSA achieves superior average performance on most benchmarks despite high sparsity.
Table 1 | 在通用基準測試上,完整注意力基線與 NSA 在知識 (MMLU、MMLU-PRO、CMMLU)、推理 (BBH、GSM8K、MATH、DROP) 和編碼 (MBPP、HumanEval) 任務中的預訓練效能的比較。儘管擁有高稀疏性,NSA 在大多數基準測試中都獲得了優異的平均性能。
The proposed architecture achieves an effective trade-off between computation cost and model performance. For NSA, we set compression block size $l = 32$, sliding stride $d = 16$, selected block size $l' = 64$, selected block count $n = 16$ (including fixed activating the 1 initial block and 2 local blocks), and sliding window size $w = 512$. Both Full Attention and sparse attention models are pretrained on 270B tokens of 8k-length texts, followed by continued training and supervised fine-tuning on 32k-length texts with YaRN (Peng et al., 2024) to achieve long-context adaptation. Both models are trained to full convergence to ensure fair comparison. As shown in Figure 4, the pretraining loss curve of our NSA and Full Attention baseline demonstrates stable and smooth decline, with NSA consistently outperforming the Full Attention model.
我們所提出的架構在運算成本與模型效能之間取得有效平衡。NSA 的部份,我們設定壓縮區塊大小(compression block size) $l = 32$、滑動步幅(sliding stride) $d = 16$、選擇區塊大小(selected block size) $l' = 64$、選擇區塊數量(selected block count) $n = 16$(其中包含固定啟動的 1 個初始區塊與 2 個局部區塊),以及滑動視窗大小(sliding window size) $w = 512$。Full Attention models 和稀疏注意力模型均在包含 270B 個 token 的 8k 長文本上進行預訓練,隨後使用 YaRN (Peng et al., 2024) 對長度為 32k 的文本進行持續訓練和監督微調,以實現長上下文自適應。兩種模型均訓練至完全收斂,以確保比較的公平性。如 Figure 4 所示,我們的 NSA 和全注意力基線模型的預訓練損失曲線表現出穩定且平滑的下降趨勢,其中 NSA 在整個訓練過程持續地優於 Full Attention models。
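For reference, the NSA hyperparameters stated above can be collected into a single config object; the field names are illustrative, while the values follow Section 4.1.

```python
from dataclasses import dataclass

@dataclass
class NSAConfig:
    l: int = 32         # compression block size
    d: int = 16         # sliding stride between compression blocks
    l_sel: int = 64     # selected block size l'
    n_sel: int = 16     # selected block count n (includes 1 initial + 2 local blocks always kept)
    window: int = 512   # sliding window size w

cfg = NSAConfig()
```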
## 4.2. Baselines Methods
In addition to comparing with Full Attention, we evaluate several state-of-the-art inference-stage sparse attention methods: H2O (Zhang et al., 2023b), infLLM (Xiao et al., 2024), Quest (Tang et al., 2024), and Exact-Top, which first computes full attention score and select the top-$n$ scores keys corresponding to each query and then calculates attention on these positions. These methods span diverse sparse attention paradigms, including KV-cache eviction, query-aware selection, and exact top-$n$ sparse selection.
除了與全注意力(Full Attention)比較外,我們同時評估了多種最新的推理階段稀疏注意力的方法:H2O (Zhang et al., 2023b)、infLLM (Xiao et al., 2024)、Quest (Tang et al., 2024) 和 Exact-Top。Exact-Top 首先計算全注意力分數,再針對每一個查詢(query)選出得分最高的前 $n$ 個鍵(key),然後在這些位置上進行注意力的計算。這些方法涵蓋了多樣化的稀疏注意力範式,包括 KV-cache eviction、query-aware selection 和 exact top-$n$ sparse selection。
| Model | MFQA-en (SQA) | MFQA-zh (SQA) | Qasper (SQA) | HPQ (MQA) | 2Wiki (MQA) | GovRpt (MQA) | Dur (MQA) | PassR-en (Synthetic) | PassR-zh (Synthetic) | LCC (Code) | Avg. |
|-----------|---------|---------|--------|-------|-------|--------|-------|-------------|-------------|--------|--------|
| H2O | 0.428 | 0.429 | 0.308 | 0.112 | 0.101 | 0.231 | 0.208 | 0.704 | 0.421 | 0.092 | 0.303 |
| InfLLM | 0.474 | 0.517 | 0.356 | 0.306 | 0.250 | 0.277 | 0.257 | 0.766 | 0.486 | 0.143 | 0.383 |
| Quest | 0.495 | 0.561 | 0.365 | 0.295 | 0.245 | 0.293 | 0.257 | 0.792 | 0.478 | 0.135 | 0.392 |
| Exact-Top | 0.502 | 0.605 | 0.397 | 0.321 | 0.288 | 0.316 | 0.291 | 0.810 | 0.548 | 0.156 | 0.423 |
| Full Attn | 0.512 | 0.623 | 0.409 | 0.350 | 0.305 | 0.324 | 0.294 | 0.830 | 0.560 | 0.163 | 0.437 |
| NSA | 0.503 | 0.624 | 0.432 | 0.437 | 0.356 | 0.307 | 0.341 | 0.905 | 0.550 | 0.232 | 0.469 |
Table 2 | Performance comparison between our NSA and baselines on LongBench, including subsets in single document QA, multi-document QA, synthetic and code task categories. NSA outperformed most of the baselines including Full Attention.
Table 2 | 我們的 NSA 與基線模型在 LongBench 上的效能比較,包括 single document QA、multi-document QA、合成任務和程式碼任務類別中的子集。NSA 優於多數基線模型(包括Full Attention)。
For general evaluation, where most samples have lengths within the local context window of sparse attention baselines, these methods are effectively equivalent to Full Attention. Therefore, we present only the comparison results between NSA and the Full Attention baseline in this setting. In the long-context evaluation, we conduct comparisons across all baseline methods, with the sparsity of all sparse attention methods set to the same value to ensure a fair comparison. For chain-of-thought reasoning evaluation, which requires long-text supervised fine-tuning, we limit our comparison to Full Attention, as sparse attention baselines do not support training.
針對一般性的評估,由於大多數樣本的長度皆落在稀疏注意力基線所涵蓋的局部上下文視窗內,因此這些方法在效果上等同於全注意力機制,故在此情境下我們僅呈現 NSA 與全注意力基線之間的比較結果。在長上下文的評估中,我們對所有基線方法進行了比較,並將所有稀疏注意力方法的稀疏度設定為相同,以確保公平比較。對於 chain-of-thought 的推理評估,它需要長文本監督微調,然而稀疏注意力基線並不支援訓練,因此我們僅將 NSA 與全注意力進行比較。
## 4.3. Performance Comparison
**General Evaluation.** We evaluated the pretrained NSA and Full Attention baseline, on a comprehensive suite of benchmarks spanning knowledge, reasoning, and coding capabilities, including MMLU (Hendrycks et al., 2020), MMLU-PRO (Wang et al., 2024), CMMLU (Li et al., 2023), BBH (Suzgun et al., 2022), GSM8K (Cobbe et al., 2021), MATH (Hendrycks et al., 2020), DROP (Dua et al., 2019), MBPP (Austin et al., 2021), and HumanEval (Chen et al., 2021). The results are shown in Table 1. Despite its sparsity, NSA achieves superior overall performance, outperforming all baselines including Full Attention on 7 out of 9 metrics. This indicates that although NSA may not fully leverage its efficiency advantages on shorter sequences, it shows strong performance. Notably, NSA demonstrates significant gains in reasoning-related benchmarks (DROP: +0.042, GSM8K: +0.034), suggesting that our pretraining helps models to develop specialized attention mechanisms. This sparse attention pretraining mechanism forces model to focus on the most important information, potentially enhancing performance by filtering out noise from irrelevant attention pathways. The consistent performance across diverse evaluations also validates NSA's robustness as a general-purpose architecture.
**General Evaluation.** 我們對預先訓練的 NSA 模型與 Full Attention 基線模型進行評估,涵蓋一系列針對知識、推理與程式撰寫能力的綜合性基準測試,包括 MMLU (Hendrycks et al., 2020)、MMLU-PRO (Wang et al., 2024)、CMMLU (Li et al., 2023)、BBH (Suzgun et al., 2022)、GSM8K (Cobbe et al., 2021)、MATH (Hendrycks et al., 2020)、DROP (Dua et al., 2019)、MBPP (Austin et al., 2021) 以及 HumanEval (Chen et al., 2021)。結果如 Table 1 所示。儘管 NSA 的稀疏性較高,但其整體表現優異,在 9 個指標中有 7 個超越了包括 Full Attention 在內的所有基線。這說明即使 NSA 在處理短序列時無法完全發揮其效率上的優勢,其整體效能依然十分強勁。值得注意的是,NSA 在推理相關的基準(DROP:+0.042,GSM8K:+0.034)方面表現出顯著的增益,這說明我們的預訓練有助於模型發展特有的注意力機制。這種稀疏注意力預訓練機制迫使模型關注最關鍵的信息,並可能透過過濾掉來自無關注意力路徑的雜訊來提升效能。在各種評估中的一致性表現也驗證了 NSA 作為通用架構的穩健性。
**Long-Context Evaluation.** As shown in Figure 5, NSA achieves perfect retrieval accuracy across all positions in 64k-context needle-in-a-haystack (Kamradt, 2023) test. This performance stems from our hierarchical sparse attention design, which combines compression tokens for efficient global context scanning, and selection tokens for precise local information retrieval. The coarse-grained compression identifies relevant context blocks at low computational cost, while the token-level attention on selected tokens ensures the preservation of critical fine-grained information. This design enables NSA to maintain both global awareness and local precision.
**Long-Context Evaluation.** 如 Figure 5 所示,NSA 在 64k-context 的「大海撈針」測試中(Kamradt, 2023),於所有位置上皆達成完美的檢索準確率。這表現歸功於我們的分層稀疏注意力設計,它結合了 compression tokens 以有效地掃描全域(global)上下文,以及 selection tokens 可以精確地檢索局部信息。coarse-grained compression 能以低計算成本快速定位相關的上下文區塊,而對選定的 tokens 的 token-level attention 則是確保保留關鍵的精細(fine-grained)信息。這種設計使 NSA 能夠同時保持全域的感知能力與局部的精準性。
We also evaluate NSA on LongBench (Bai et al., 2023) against state-of-the-art sparse attention methods and Full Attention baseline. To ensure consistent sparsity, we set the token activated by each query in all sparse attention baselines to 2560 tokens, which corresponds to the average number of tokens activated in NSA when handling 32k sequence lengths. Following StreamLLM (Xiao et al., 2023), this token budget includes the leading 128 tokens and 512 local tokens. We exclude certain subsets from LongBench due to their low scores across all models, which may not provide meaningful comparisons. As shown in Table 2, NSA achieves the highest average score 0.469, outperforming all baselines (+0.032 over Full Attention and +0.046 over Exact-Top). This improvement arises from two key innovations: (1) our native sparse attention design, which enables end-to-end optimization of sparse patterns during pretraining, facilitates synchronized adaptation between the sparse attention module and other model components; and (2) the hierarchical sparse attention mechanism achieves a balance between local and global information processing.
我們也在 LongBench (Bai et al., 2023) 上評估 NSA,並與最先進的稀疏注意力方法和全注意力基準模型進行比較。為確保一致性的稀疏性,我們將所有稀疏注意力基準模型中每個查詢所啟動(activate)的 token 數量統一設為 2560,這等同於處理 32k 序列長度時 NSA 中平均啟動的 token 數量。參考 StreamLLM (Xiao et al., 2023)的設計,此 token 預算包含前導(leading)的 128 個 token 與 512 個局部 token。由於某些子集在所有模型上的得分都偏低,沒什麼營養,因此我們從 LongBench 中排除它們,以避免影響比較結果的意義。如 Table 2 所示,NSA 取得最高平均分數 0.469,優於所有基準 (相較於全注意力提升了 +0.032,較 Exact-Top 提升了 +0.046)。這項改進歸功於兩項關鍵創新:(1) 我們的原生稀疏注意力設計,它能夠在預訓練期間對稀疏模式進行端到端的最佳化,這有助於稀疏注意力模組與其它模型組件之間的同步適配; (2) 分層稀疏注意力機制實現了局部和全域信息處理之間的平衡。

Figure 5 | Needle-in-a-Haystack retrieval accuracy across context positions with 64k context length. NSA achieves perfect accuracy through its hierarchical sparse attention design.
Figure 5 | 針對 64k 上下文長度,在不同上下文位置下的大海撈針檢索精確度。NSA 藉由其分層稀疏注意力設計達成了完美的檢索精度。
Notably, NSA demonstrates exceptional performance on tasks requiring complex reasoning over long contexts, achieving +0.087 and +0.051 improvements over Full Attention on multi-hop QA tasks (HPQ and 2Wiki), exceeding the performance of baselines on code understanding (LCC: +0.069), and outperforming other methods on passage retrieval (PassR-en: +0.075). These results validate NSA's capability to handle diverse long-context challenges, with its natively pretrained sparse attention providing additional benefits in learning task-optimal patterns.
值得注意的是,NSA 在需要對長文本進行複雜推理的任務中表現出色,在 multi-hop QA tasks (HPQ and 2Wiki) 上相比於 Full Attention models 高出 0.087 和 0.051 分,在程式碼理解任務(LCC)中,超越基線模型 +0.069 分,而在段落檢索任務(PassR-en)中,也以 +0.075 的差距優於其他方法。這些結果驗證了 NSA 處理各種長文本挑戰的能力,其原生的預訓練稀疏注意力機制為學習任務最佳模式提供了額外優勢。
**Chain-of-Thought Reasoning Evaluation.** To evaluate NSA's compatibility with advanced downstream training paradigms, we investigate its capacity to acquire chain-of-thought mathematical reasoning abilities via post-training. Given the limited effectiveness of reinforcement learning on smaller-scale models, we employ knowledge distillation from DeepSeek-R1, conducting supervised fine-tuning (SFT) with 10B tokens of 32k-length mathematical reasoning traces. This produces two comparable models: Full Attention-R (Full Attention baseline) and NSA-R (our sparse variant). We assess both models on the challenging American Invitational Mathematics Examination (AIME 24) benchmark. We use a sampling temperature of 0.7 and a top-$p$ value of 0.95 to generate 16 responses for each question and obtain the average score. To validate the impact of reasoning depth, we conduct experiments with two generation context limits: 8k and 16k tokens, measuring whether extended reasoning chains improve accuracy. Example comparisons of model predictions are provided in Appendix A.
**Chain-of-Thought Reasoning Evaluation.** 為了評估 NSA 是否能與進階的下游訓練範式相容,我們研究其透過後訓練能否獲得思維鏈數學推理能力。鑑於強化學習在小型模型上的成效有限,我們改採自 DeepSeek-R1 進行知識蒸餾,並用總量 10B 個 tokens 長度 32k 的數學推理軌跡來做監督式微調(SFT)。這產生了兩個可比較的模型:Full Attention-R(Full Attention 基線)和 NSA-R(我們的稀疏變體)。我們在具有挑戰性的 American Invitational Mathematics Examination (AIME 24) 評估兩個模型。我們使用 0.7 的 sampling temperature 和 top-$p$=0.95來為每個問題生成 16 個回應並計算平均得分。為了驗證推理深度的影響,我們進行了兩種生成上下文限制實驗:8k 和 16k 個 tokens,觀察延長推理鏈是否能提高準確性。Appendix A 包含模型預測的範例比較。
Table 3 | AIME instruction-based evaluation after supervised fine-tuning. Our NSA-R demonstrates better performance than Full Attention-R at both 8k and 16k sequence lengths.
Table 3 | 使用 AIME 進行指令式評估(於監督式微調後)結果。我們的 NSA-R 在8k和16k長度序列中,表現均優於 Full Attention models (Full Attention-R)。
| Generation Token Limit | 8192 | 16384 |
|--------------------------|--------|---------|
| Full Attention-R | 0.046 | 0.092 |
| NSA-R | 0.121 | 0.146 |

Figure 6 | Comparison of Triton-based NSA kernel with Triton-based FlashAttention-2 kernel. Our implementation significantly reduces latency across all context lengths, with the improvement becoming more pronounced as input length increases.
Figure 6 | 基於 Triton 的 NSA 核心與基於 Triton 的 FlashAttention-2 核心之比較。我們的實作在所有上下文長度上都顯著降低延遲,且隨著輸入長度的增加,這種改進更加明顯。
As shown in Table 3, NSA-R achieves significantly higher accuracy than Full Attention-R under the 8k context setting (+0.075), with this advantage persisting at 16k contexts (+0.054). These results validate two key benefits of native sparse attention: (1) The pretrained sparse attention patterns enable efficient capture of long-range logical dependencies critical for complex mathematical derivations; (2) Our architecture's hardware-aligned design maintains sufficient context density to support growing reasoning depth without catastrophic forgetting. The consistent outperformance across context lengths confirms sparse attention's viability for advanced reasoning tasks when natively integrated into the training pipeline.
如 Table 3 所示,NSA-R 在 8k 上下文設定中,其準確度明顯高於 Full Attention-R (+0.075),且這種優勢在 16k 上下文設定中仍然存在 (+0.054)。這些結果驗證了原生稀疏注意力的兩大關鍵優勢:(1) 預先訓練的稀疏注意力模式能有效捕捉長距離邏輯依賴性,這對複雜的數學推導而言至關重要;(2) 我們在架構上與硬體對齊的設計維持了足夠的上下文密度,以支持不斷增長的推理深度而不會出現災難性遺忘。在不同上下文長度下持續保持的優勢,證明了當稀疏注意力被原生整合進訓練管道時,其在進階推理任務中的可行性。
## 5. Efficiency Analysis
We evaluate the computational efficiency of NSA against Full Attention on an 8-GPU A100 system. In efficiency analysis, we also configure the model with GQA group $g = 4$, heads per group $h = 16$, query/key dimension $d_k = 192$, and value dimension $d_v = 128$. Following the same settings in Section 4, we set NSA compression block size $l = 32$, sliding stride $d = 16$, selected block size $l' = 64$, selected block count $n = 16$, and sliding window size $w = 512$.
我們在一個配備 8 顆 A100 GPU 的系統上,評估 NSA 相對於 Full Attention 的計算效率。在效率分析中,我們將模型配置為 GQA group $g = 4$、每個 group 的 $h = 16$ 個 heads、query/key 的維度 $d_k = 192$ 和 value 的維度 $d_v = 128$。根據 Section 4 的設定,我們設置 NSA 的 compression block size 為 $l = 32$、sliding stride 為 $d = 16$、selected block size 為 $l' = 64$、selected block count 為 $n = 16$,以及 sliding window size 為 $w = 512$。
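For readability, the sketch below (our own illustration, not code from the paper) collects these hyperparameters into a single config object and prints a few derived quantities; all names are ours.

```python
# Illustrative only: the Section 5 hyperparameters gathered into one place.
from dataclasses import dataclass

@dataclass(frozen=True)
class NSAEfficiencyConfig:
    g: int = 4          # GQA groups
    h: int = 16         # query heads per group
    d_k: int = 192      # query/key head dimension
    d_v: int = 128      # value head dimension
    l: int = 32         # compression block size
    stride: int = 16    # sliding stride d
    l_sel: int = 64     # selected block size l'
    n_sel: int = 16     # selected block count n
    w: int = 512        # sliding window size

cfg = NSAEfficiencyConfig()
print("total query heads:", cfg.g * cfg.h)                    # 64
print("tokens covered by selection:", cfg.n_sel * cfg.l_sel)  # 1024
print("tokens covered by sliding window:", cfg.w)             # 512
```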
## 5.1. Training Speed
We compare our Triton-based NSA kernel against a Triton-based FlashAttention-2 kernel (serving as the Full Attention baseline) to ensure a fair speed comparison on the same backend. As shown in Figure 6, our NSA achieves progressively greater speedups as context length increases, up to 9.0× forward and 6.0× backward speedup at 64k context length. Notably, the speed advantage becomes more pronounced with longer sequences. This speedup stems from our hardware-aligned algorithm design, which maximizes the efficiency of the sparse attention architecture: (1) the blockwise memory access pattern maximizes Tensor Core utilization through coalesced loads, and (2) the delicate loop scheduling in the kernel eliminates redundant KV transfers.
為確保在相同的後端環境下進行公平的速度比較,我們將基於 Triton 的 NSA attention 實作與基於 Triton 的 FlashAttention-2(Full Attention 基準)實作進行比較。如 Figure 6 所示,我們的 NSA 隨著上下文長度的增加表現出更顯著的速度提升,在 64k 上下文長度時前向提升達 9.0 倍、反向則有 6.0 倍的速度提升。值得注意的是,這種速度優勢在較長序列中更加明顯。這種速度的提升源於我們硬體對齊的演算法設計,其目的在於最大化稀疏注意力架構的效率:(1) 區塊式記憶體存取模式透過合併載入(coalesced loads)的方式最大化 Tensor Core 的利用率;(2) 核心中精細的循環調度消除了多餘的 KV 傳輸。
Table 4 | Memory access volume (in equivalent number of tokens) per attention operation during decoding. Due to the low arithmetic intensity and memory-bound nature of decoding, the expected speedup is approximately linear with the volume of memory access.
Table 4 | 解碼期間每次注意力運算的記憶體存取量(以等效 token 數計)。由於解碼階段的算術強度低、屬於記憶體受限型運算,預期的加速比大致與記憶體存取量呈線性關係。
| Context Length | 8192 | 16384 | 32768 | 65536 |
|------------------|--------|---------|---------|---------|
| Full Attention | 8192 | 16384 | 32768 | 65536 |
| NSA | 2048 | 2560 | 3584 | 5632 |
| Expected Speedup | 4 × | 6.4 × | 9.1 × | 11.6 × |
## 5.2. Decoding Speed
The decoding speed of attention is primarily determined by the memory access bottleneck, which is closely tied to the amount of KV cache loaded. In each decoding step, our NSA needs to load at most $\lfloor\dfrac{s-l}{d}\rfloor$ compression tokens, $nl'$ selected tokens, and $w$ neighbor tokens, where $s$ is the cached sequence length. As shown in Table 4, our method exhibits a significant reduction in latency as the decoding length increases, achieving up to 11.6× speedup at 64k context length. This advantage in memory access efficiency also amplifies with longer sequences.
注意力的解碼速度主要受限於記憶體的存取瓶頸,這跟 KV cache 的負載密切相關。在每個解碼步驟中,我們的 NSA 最多就只需要載入 $\lfloor\dfrac{s-l}{d}\rfloor$ 個 compression tokens, $nl'$ 個 selected tokens 以及 $w$ 個 neighbor tokens,其中 $s$ 表示 cached 的序列長度。如 Table 4 所示,隨著解碼長度的增加,我們的方法在延遲表現上呈現顯著的改善,在 64k 上下文長度下可實現最高 11.6 倍的速度提升。這種記憶體存取效率的優勢也會隨著序列長度的增加而進一步放大。
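As a quick sanity check on Table 4, the following sketch (ours, not the released kernels) plugs the Section 5 hyperparameters into the per-step load formula above and takes Full Attention's KV loads divided by NSA's as the expected speedup; the outputs track Table 4 up to small rounding differences.

```python
# Estimate NSA's per-step decoding memory access (in tokens) from the formula
# floor((s - l) / d) + n * l' + w, and the implied speedup over Full Attention,
# which must load all s cached tokens. Hyperparameters follow Section 5.
l, d, n, l_sel, w = 32, 16, 16, 64, 512

def nsa_tokens_loaded(s: int) -> int:
    """Compressed tokens + selected tokens + sliding-window tokens."""
    return (s - l) // d + n * l_sel + w

for s in (8192, 16384, 32768, 65536):
    loaded = nsa_tokens_loaded(s)
    print(f"context {s:>5}: NSA loads ~{loaded} tokens, "
          f"expected speedup ~{s / loaded:.1f}x")
# Output is close to Table 4: ~2048/2560/3584/5632 tokens and ~4x/6.4x/9.1x/11.6x.
```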
## 6. Discussion
In this section, we reflect on the development process of NSA and discuss key insights gained from our exploration of different sparse attention strategies. While our approach demonstrates promising results, understanding the challenges encountered with alternative strategies and analyzing attention patterns provides valuable context for future research directions. We first examine challenges with alternative token selection strategies that motivated our design choices, followed by visualizations that offer insights into attention distribution patterns.
在本節中,我們回顧 NSA 的開發過程,並探討我們在研究不同稀疏注意力策略過程中所獲得的關鍵見解。儘管我們所提出的方法已展現出有前景的成果,深入理解其它替代策略所面臨的挑戰,以及分析注意力模式,對未來研究方向而言仍具有重要的參考價值。我們首先探討促成本方法設計決策的替代 token 選擇策略(alternative token selection strategies)所面臨的挑戰,接著透過視覺化分析,呈現注意力分佈的結構與趨勢。
## 6.1. Challenges with Alternative Token Selection Strategies
Before designing NSA, we explored adapting existing sparse attention methods to the training stage. However, these attempts encountered various challenges, prompting us to design a different sparse attention architecture:
在設計 NSA 之前,我們曾嘗試將現有的稀疏注意力方法套用到訓練階段。然而,這些嘗試遇到了各種挑戰,促使我們轉而設計一種不同的稀疏注意力架構
**Key-Clustering Based Strategies.** We examined clustering-based strategies like ClusterKV (Liu et al., 2024). These methods store Keys and Values from the same cluster in contiguous memory regions. While theoretically feasible for training and inference, they face three significant challenges: (1) Non-trivial computational overhead introduced by dynamic clustering mechanisms; (2) Operator optimization difficulties exacerbated by inter-cluster imbalances, especially in Mixture-of-Experts (MoE) systems, where skewed Expert Parallelism (EP) group execution times lead to persistent load imbalances; (3) Implementation constraints arising from the need for mandatory periodic reclustering and chunk-sequential training protocols. These combined factors create substantial bottlenecks, significantly limiting their effectiveness for real-world deployment.
**Key-Clustering Based Strategies.** 我們研究了基於聚類(clustering-based)的方法,像是 ClusterKV (Liu et al., 2024)。這些方法將同一個聚類中的 Keys 和 Values 儲存在連續的記憶體區域中。儘管在訓練和推理方面理論上是可行的,但它們面臨著三個重大挑戰:(1) 動態聚類機制(dynamic clustering mechanisms)帶來不可忽視的計算開銷;(2) 聚類間的失衡加劇了 operator 最佳化的困難,特別是在混合專家(Mixture-of-Experts, MoE)架構中,專家並行(Expert Parallelism, EP)群組的執行時間不均會造成持續的負載不平衡;(3) 因為必須進行週期性重新聚類並採用區塊順序(chunk-sequential)訓練協議而產生的實作限制。這些因素綜合起來形成了明顯的瓶頸,嚴重限制了該類策略在實際部署場景中的效能表現。

Figure 8 | Visualization of the attention map on a Full Attention transformer. Light-colored regions indicate higher attention values. As shown in the figure, attention scores exhibit a blockwise clustering distribution.
Figure 8 | Full Attention transformer 的注意力圖像可視化。淺色區域表示更高的注意力值。如圖所示,注意力得分呈現塊狀聚類分布。

Figure 7 | Comparison of training loss on a 3B-parameter model with Full Attention and different token selection strategies. Our NSA achieves better performance.
Figure 7 | 比較 3B 參數模型在 Full Attention 與不同 token 選擇策略下的訓練損失。我們的 NSA 有著更好的效能。
**Other Blockwise Selection Strategies.** We also considered blockwise key/value selection strategies different from NSA, such as Quest (Tang et al., 2024) and InfLLM (Xiao et al., 2024). These methods rely on computing an importance score for each block and selecting the top-$n$ blocks based on their similarity with $q_t$. However, existing methods face two critical issues: (1) Since the selection operation is non-differentiable, importance score computation based on neural networks relies on auxiliary loss, which increases operator overhead and often degrades model performance; (2) Heuristic parameter-free importance score computation strategies suffer from low recall rates, leading to suboptimal performance. We evaluate both approaches on a 3B-parameter model with similar architecture and compare their loss curves with NSA and Full Attention. For the auxiliary loss-based selection method, we introduce additional queries and representative keys for each block to estimate the block importance scores. These scores are supervised by the mean attention scores between the original queries and keys within each block.
For the heuristic parameter-free selection method, following the strategy of Quest, we implement direct selection using the product between queries and coordinate-wise min-max of the key chunks, without introducing additional parameters. We also explore a cold-start training approach where Full Attention is applied for the initial 1000 steps before transitioning to the heuristic blockwise selection. As shown in Figure 7, both methods exhibited inferior loss.
**Other Blockwise Selection Strategies.** 我們還考慮了與 NSA 不同的區塊式鍵/值(key/value)選擇策略,例如 Quest (Tang et al., 2024) 和 InfLLM (Xiao et al., 2024)。這些方法依賴於計算每個區塊的重要性分數,並根據其與 $q_t$ 的相似度來選取前 $n$ 個區塊。然而,現有的方法存在著兩個關鍵問題:(1) 由於選擇(selection)這個操作是不可微分的,基於神經網路的重要性分數計算就需要引入輔助損失(auxiliary loss),這會增加運算開銷,而且通常會降低模型的效能;(2) 啟發式無參數的重要性分數計算策略通常召回率較低,導致效能不佳。我們在一個架構相似的 3B 參數模型上評估了這兩種方法,並將其損失曲線與 NSA 和 Full Attention 進行比較。對於基於輔助損失的選擇方法,我們為每個區塊引入額外的 queries 和 representative keys,以此估測每個區塊的重要性分數;這些分數由區塊內原始 queries 與 keys 之間的平均注意力分數作為監督信號。至於啟發式無參數選擇方法,則依循 Quest 的策略,直接以 queries 與 key chunks 的 coordinate-wise min-max 之乘積進行選擇,不引入額外的參數。
我們還嘗試了冷啟動訓練方法:前 1000 個 step 採用 Full Attention,之後再切換成啟發式區塊選擇(heuristic blockwise selection)。如 Figure 7 所示,這兩種方法的 loss 都比較高(表現較差)。
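To make the heuristic concrete, here is a rough PyTorch sketch of the Quest-style, parameter-free blockwise scoring described above (single query head for simplicity): each key block is summarized by its coordinate-wise min and max, a per-block upper bound on the query-key dot products is computed, and the top-$n$ blocks are kept. This is our own reconstruction under those assumptions, not the exact code used in the experiments.

```python
import torch

def block_importance(q: torch.Tensor, keys: torch.Tensor, block: int) -> torch.Tensor:
    """q: [d_k], keys: [seq, d_k] -> importance score per key block [num_blocks]."""
    num_blocks = keys.shape[0] // block
    k = keys[: num_blocks * block].view(num_blocks, block, -1)
    k_min = k.min(dim=1).values        # coordinate-wise min per block
    k_max = k.max(dim=1).values        # coordinate-wise max per block
    # Per-dimension upper bound of q . k over the block, summed across dimensions.
    return torch.maximum(q * k_min, q * k_max).sum(dim=-1)

def select_top_blocks(q: torch.Tensor, keys: torch.Tensor, block: int = 64, n: int = 16):
    scores = block_importance(q, keys, block)
    return torch.topk(scores, k=min(n, scores.numel())).indices

# Toy usage with the dimensions used elsewhere in this section.
q = torch.randn(192)
keys = torch.randn(4096, 192)
print(select_top_blocks(q, keys))
```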
## 6.2. Visualization
To explore potential patterns in transformer attention distributions and seek inspiration for our design, we visualize the attention map from our pretrained 27B Full Attention model in Figure 8. The visualization reveals interesting patterns where attention scores tend to exhibit blockwise clustering characteristics, with nearby keys often showing similar attention scores. This observation inspired our design of NSA, suggesting that selecting key blocks based on spatial continuity might be a promising approach.
The blockwise clustering phenomenon indicates that tokens adjacent in the sequence may share certain semantic relationships with query tokens, though the exact nature of these relationships requires further investigation. This observation motivated us to explore a sparse attention mechanism that operates on continuous token blocks rather than individual tokens, aiming to enhance computational efficiency and preserve high-attention patterns.
為了探索 transformer attention distributions 中的潛在模式,並為我們的設計尋找靈感,我們在 Figure 8 中可視化了預訓練的 27B Full Attention model 的 attention map。可視化結果揭示出有趣的模式:注意力分數往往呈現出區塊聚類特徵,相鄰的 keys 通常具有相近的注意力分數。這個觀察啟發了我們的 NSA 設計,顯示根據空間連續性來選擇 key 區塊可能是個有前景的方法。這種區塊狀聚類現象說明,序列中相鄰的 tokens 可能與 query tokens 共享某些語義關係,儘管這些關係的確切性質仍需進一步研究。這個觀察促使我們探索一種作用於連續 token blocks 而非 individual tokens 的稀疏注意力機制,旨在提高計算效率並保留高注意力模式。
## 7. Related Works
We review existing approaches that improve the efficiency of attention computation through sparse attention. These methods can be broadly categorized into three groups based on their core strategies: (1) fixed sparse pattern, (2) dynamic token pruning, and (3) query-aware selection. We introduce several representative works from each category.
我們回顧現有的一些利用稀疏注意力(sparse attention)提升注意力機制計算效率的方法。這些方法可根據其核心策略大致分為三類:(1) 固定稀疏模式,(2) 動態的 token pruning(剪枝),和 (3) query-aware selection。我們從每一種類別中介紹一些代表性的研究。
## 7.1. Fixed Sparse Pattern
Sliding window attention is a commonly used approach that allows the query to compute attention only within a fixed window. StreamingLLM (Xiao et al., 2023) addresses the challenges of processing long text streams by maintaining two critical portions of the context: an attention sink (early tokens) and a local context window. While these approaches effectively reduce memory and computation costs, their rigid pattern of ignoring contexts limits their performance on tasks requiring full context understanding.
滑動視窗 (SlidingWindow) 是一種常用的策略,可以讓 query 在一個固定範圍內計算其注意力。StreamingLLM (Xiao et al., 2023) 針對處理長文本流的挑戰,同時維持兩個關鍵的上下文:一是 attention sink(存放較早的 tokens),二是 局部上下文視窗(local context window)。儘管這些方法能有效降低記憶體與計算成本,但這種忽略上下文的死板模式仍然限制其於需要完整上下文理解的任務上的效能。
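A minimal sketch of this fixed pattern, assuming a 128-token attention sink and a 512-token local window (the budget quoted in our LongBench setup; StreamingLLM's own defaults may differ):

```python
import numpy as np

def streaming_mask(seq_len: int, sink: int = 128, window: int = 512) -> np.ndarray:
    """Boolean [seq_len, seq_len] mask; True means key j is visible to query i."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    causal = j <= i                 # no attention to future tokens
    in_sink = j < sink              # the leading "attention sink" tokens
    in_window = (i - j) < window    # the most recent `window` tokens
    return causal & (in_sink | in_window)

mask = streaming_mask(2048)
print(int(mask[-1].sum()))  # the last query sees 128 sink + 512 local = 640 tokens
```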
## 7.2. Dynamic Token Pruning
H2O (Zhang et al., 2023b) implements an adaptive approach to reduce KV-cache memory usage during decoding. This method dynamically evicts tokens deemed less important for future predictions based on their recent utility according to attention score. SnapKV (Li et al., 2024) also introduces a token pruning strategy that reduces the KV cache by selectively retaining only the most crucial features, enabling efficient memory usage. SnapKV identifies important features through attention weight analysis and voting during prefilling, then updates KV cache by combining selected compressed features with recent context to maintain prompt consistency.
H2O(Zhang et al., 2023b)實作一種自適應機制,在解碼過程中降低 KV-cache memory 的用量。這個方法根據最近的注意力分數動態地釋放那些被認為對未來預測不太重要的 tokens。
SnapKV(Li et al., 2024)則提出一種 token pruning strategy (剪枝策略),透過只保留最關鍵的特徵來減少 KV‐cache,以實現高效的記憶體利用。SnapKV 則是在預填充(prefilling)階段透過注意力權重分析與投票機制識別重要特徵,然後透過結合所選定壓縮後的特徵與近期的上下文的方式來更新 KV‐cache,以維持提示(prompt)的連貫性。
## 7.3. Query-Aware Selection
Quest (Tang et al., 2024) employs a blockwise selection strategy where each chunk's importance is estimated by the product between the query and the coordinate-wise min-max of the key chunks. The resulting scores are used to select the top-$n$ most important key-value chunks for attention. InfLLM (Xiao et al., 2024) combines fixed patterns with retrieval by maintaining attention sinks, local context, and retrievable chunks. This method selects representative keys from each chunk to estimate chunk importance. HashAttention (Desai et al., 2024) formulates pivotal token identification as a recommendation problem by mapping queries and keys to Hamming space using learned functions. ClusterKV (Liu et al., 2024) achieves sparsity by first clustering keys and then selecting the most relevant clusters for attention computation based on query-cluster similarity.
Quest (Tang et al., 2024) 採用分塊選擇策略(blockwise selection strategy),每個 chunk 的重要性由 query 與 key chunks 的 coordinate-wise min-max 之間的乘積估算,所產生的分數用於選取排名前 $n$ 個重要的 key-value chunks 進行注意力計算。InfLLM (Xiao et al., 2024) 則是透過維持注意力匯點(attention sinks)、局部上下文(local context)與可檢索區塊(retrievable chunks)的方式來結合固定模式(fixed patterns)與檢索機制(retrieval),並從每個區塊(chunk)中選出 representative keys(代表性的鍵)來估測區塊的重要性。HashAttention (Desai et al., 2024) 將關鍵 token(pivotal token)的識別視為推薦問題(recommendation problem),透過學習到的函數將 queries 與 keys 映射到漢明空間(Hamming space)。ClusterKV (Liu et al., 2024) 則先對 keys 進行聚類(clustering),再依據 query-cluster similarity 挑選最相關的聚類進行注意力計算,以此實現稀疏化(sparsity)。
## 8. Conclusion
We present NSA, a hardware-aligned sparse attention architecture for efficient long-context modeling. By integrating hierarchical token compression with blockwise token selection within a trainable architecture, our architecture achieves accelerated training and inference while maintaining Full Attention performance. NSA advances the state of the art by matching full-attention baselines on general benchmarks, exceeding them in long-context evaluations, and enhancing reasoning ability, all accompanied by measurable reductions in computational latency and significant speedups.
我們提出 NSA,這是一種針對高效長文本建模的硬體對齊稀疏注意力架構。透過在可訓練架構中整合分層化的 token compression 與 blockwise token selection,我們的架構在維持 Full Attention 效能的同時實現了訓練與推理的加速。NSA 在通用基準測試上與 full-attention baselines 不相上下,在長文本評估中展現更強的建模能力,並提升了推理能力,同時帶來可量化的運算延遲降低與顯著的加速。