Accelerating Ultra-Long-Context LLM Inference [S72568]

Let's get started with our session. The topic is accelerating LLM inference for ultra-long-context MoE models. Due to company policy, the speaker will not be able to take questions during the session, but you are welcome to ask them afterwards. We are halfway through our agenda.
Next, please welcome Boyuan Huang, Product Director of the Big Data and AI Platform at Alibaba Cloud. He is a technology leader specializing in big data, AI, and cloud computing, with a strong background in engineering and product development. Boyuan joined Alibaba Group in 2014, initially overseeing engineering for all of Alimama's online advertising and commercial search teams.
Since 2018, he has served as product owner of key platforms within Alibaba Cloud, including Platform for AI (PAI), DataWorks, and the search platform. Before Alibaba, Boyuan worked at the Microsoft Search Technology Center from 2007, where he was Senior Development Lead for the search and display ad teams. His experience in both engineering and product management gives him a distinctive perspective on driving innovation in the tech industry.
Now, please welcome Boyuan Huang with a warm round of applause!
Hello, everyone. Welcome to GTC. It's my honor to have the chance to introduce some of our work. We work together with Nvidia on super-long-context inference, particularly for MoE models. I'll start with some background. More and more reasoning models have appeared, such as our Qwen2.5-Max, and many new models are being trained and released to the community. These models bring growing requirements for long context: from GPT-3.5 to Gemini 1.5, the context window has grown from several thousand tokens to 1 million tokens.
From the scenario perspective, we've built many new applications, such as code analysis. Normally we put a large amount of a project's source code into one large language model. That means you might feed thousands of Python files to the model and ask it to help with the analysis. We also have new scenarios like deep search, where we crawl websites, gather a lot of pages, and then put all of the content into a large language model for deep research. In this way, the context window naturally becomes larger and larger.
Meanwhile, this brings a lot of challenges, especially around hardware limitations, because growing the context window significantly increases the computational complexity. Let me give an example: if you run prefill on 1 million tokens, it may take tens of minutes, and that is definitely not acceptable for customers. If I ask a large language model to analyze my code and have to wait 10 minutes, that's terrible. Also, as the KV cache grows bigger and bigger, we hit a memory wall.
The other piece of background is MoE. Starting last year, more and more MoE models have been trained on Alibaba Cloud and around the world. MoE is a bit different from dense models. On one hand, MoE can be much faster than a dense model at inference time, because only some of the experts are activated for each token. But there are some challenges.
From the figure on the right, you can see the feedforward network (FFN). For MoE models, we need to load the parameters of all the experts into memory.
When we start an inference, the computation begins with attention, performs normalization, and then enters the router. The router decides which experts should be activated, and then the computation goes to the expert layer to run the feedforward network. That means during the inference phase we need to load more parameters, namely the parameters of all the experts, which occupy more memory than before. This has become a new challenge.
At Alibaba Cloud, we built an inference engine called BladeLLM, a high-performance LLM inference engine. As you can see from the diagram, the engine is built on top of an elastic deployment platform, which we call PAI-EAS (Platform for AI Elastic Algorithm Service). At the bottom, the Elastic Algorithm Service connects the inference engine to the GPU resources on our cloud. We've made a lot of optimizations targeting Nvidia GPUs.
We divide the optimizations into three layers. First, we perform model optimization with an AI compiler so that we can use different kinds of instructions, especially the low-level instructions provided by Nvidia GPUs.
Second, we have a layer called the generation engine, which provides mechanisms like a synchronized runtime, batch scheduling, and prompt cache to increase generation speed. Third, we have a service architecture that offers capabilities such as a web server, distributed request scheduling, and prefill-decode disaggregation.
On top of our inference engine there are different kinds of applications. Today I will introduce three technologies: the work we've done to accelerate attention computation, the work to accelerate feedforward network computation, and the dynamic chunked pipeline parallelism we've built to improve inference efficiency, especially for long context on MoE models. First, let's talk about sparse attention.
As context windows get longer and longer, attention computation costs a lot of time in both the prefill phase and the decode phase, so sparse attention has been introduced into this area. What you see here is an example from MInference, a sparse attention method from, I think, Microsoft. The basic idea is that we don't need to compute attention for every single element of the attention matrix. Instead, we pick some of the elements to compute and skip the rest, which saves a lot of computation.
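To make the idea of "computing only some elements" concrete, here is a minimal NumPy sketch of a vertical-slash style mask: a few key columns (vertical lines) and a few diagonals (slashes) are kept, and everything else is dropped. The function names, the chosen columns and offsets, and the sizes are illustrative assumptions, not values from the talk. For clarity the sketch masks a dense score matrix; a real kernel would skip the dropped entries entirely rather than compute and discard them.

```python
import numpy as np

def vertical_slash_mask(n, vertical_cols, slash_offsets):
    """Causal boolean mask that keeps selected key columns and diagonals."""
    q = np.arange(n)[:, None]                      # query index, one per row
    k = np.arange(n)[None, :]                      # key index, one per column
    # Always keep the main diagonal (offset 0) so no row is left empty.
    slash_offsets = np.union1d(slash_offsets, [0])
    causal = k <= q
    vertical = np.isin(k, vertical_cols)           # same columns for every query
    slash = np.isin(q - k, slash_offsets)          # fixed distance behind the query
    return causal & (vertical | slash)

def masked_attention(Q, K, V, mask):
    """Softmax attention restricted to the entries where mask is True."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    scores = np.where(mask, scores, -np.inf)
    scores -= scores.max(axis=-1, keepdims=True)   # numerically stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

if __name__ == "__main__":
    n, d = 256, 32                                 # hypothetical sizes
    rng = np.random.default_rng(0)
    Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
    mask = vertical_slash_mask(n, vertical_cols=[0, 1, 2, 3],
                               slash_offsets=[0, 1, 2, 64])
    kept = mask.sum() / np.tril(np.ones((n, n), dtype=bool)).sum()
    out = masked_attention(Q, K, V, mask)
    print(f"fraction of causal entries kept: {kept:.3f}, output shape: {out.shape}")
```

Which columns and diagonals to keep is exactly the accuracy question the talk raises next: a poor choice drops entries that carry real attention weight.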
But there are several challenges. The first is accuracy: if we pick the wrong elements for the sparse computation, we lose some accuracy. Microsoft's MInference provides a vertical-slash sparsification algorithm to pick the elements to compute. Besides the algorithm itself, we also found that the utilization rate is not very high, especially on some GPUs like the H20. We ran some tests, and the kernel utilization is less than 40%. That means the kernel does not make full use of the hardware.
So we did some optimization to improve utilization and developed an optimized sparse attention kernel. We intensively optimized the loading of sparse key-value pairs from global memory. Let me show the results first. You can see from the graph that, at a context length of 1 million tokens, the existing sparse attention is already more than 13.7 times faster than FlashAttention, and BladeLLM achieves nearly a 30 times speedup on the same GPU configuration.
We've already open-sourced this code to the vLLM project. Later we can share these slides and the link, and you can go directly to the open-source community to check out our code and what we did. We also tried using FP4 for the KV cache.
Recently a lot of training and inference has started to use FP8, and we want to use FP4 for the KV cache. We overlap the data loading by using the ldmatrix instruction, do the type conversion with the DP4A instruction, and then use the Tensor Cores for the computation. Overall, by using an FP4 KV cache we can save up to 75% of the memory, and at the same time we get a 2.5 times speedup for decoding attention with no accuracy loss. That is the first part.
Second, I will go into the feedforward network computation, especially for MoE models. Let me go back to this slide. There is a lot of FFN computation in MoE models, and it takes up a significant portion of end-to-end inference. So improving FFN efficiency helps overall inference efficiency, especially for long-context MoE models. Decoding is a memory-bound phase, and for MoE models there are many more FFN weights to read at each decoding step. What we observe is that existing MoE kernels, such as fused MoE, have relatively low bandwidth utilization.
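The "up to 75%" memory figure for the FP4 KV cache quoted above follows from bit widths alone: 4 bits per value instead of 16. The sketch below works through that arithmetic and shows how two 4-bit codes share one byte. The model shape (48 layers, 8 KV heads, head dimension 128) is a made-up example, the FP16 baseline is an assumption, and the sketch ignores the per-block scale factors that a real FP4 format stores, which is one reason the saving is "up to" 75%.

```python
import numpy as np

def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bits_per_value):
    """Size of the KV cache for one sequence, counting both keys and values."""
    values = 2 * tokens * layers * kv_heads * head_dim
    return values * bits_per_value // 8

def pack_nibbles(codes):
    """Pack an even-length array of 4-bit codes (0..15) into bytes, two per byte."""
    codes = np.asarray(codes, dtype=np.uint8)
    assert codes.size % 2 == 0 and codes.max() < 16
    return (codes[0::2] << 4) | codes[1::2]

def unpack_nibbles(packed):
    """Inverse of pack_nibbles."""
    packed = np.asarray(packed, dtype=np.uint8)
    out = np.empty(packed.size * 2, dtype=np.uint8)
    out[0::2] = packed >> 4
    out[1::2] = packed & 0x0F
    return out

if __name__ == "__main__":
    shape = dict(tokens=1_000_000, layers=48, kv_heads=8, head_dim=128)  # hypothetical model
    fp16 = kv_cache_bytes(bits_per_value=16, **shape)
    fp4 = kv_cache_bytes(bits_per_value=4, **shape)
    print(f"FP16: {fp16 / 1e9:.1f} GB, FP4: {fp4 / 1e9:.1f} GB, saving: {1 - fp4 / fp16:.0%}")
```

The packing step is also why the type conversion matters for speed: every 4-bit code has to be unpacked and converted before the Tensor Cores can use it.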
Why is that? If we look at how the GPU does the computation, a kernel normally runs one main loop: it reads data from off-chip global memory, loads it into the main loop, and sends it to the Tensor Cores for computation. After this phase it enters the epilogue, which writes the results back to off-chip global memory. From this graph you can see that once the data has been loaded and the Tensor Cores are computing, the memory bandwidth is not fully used during that period. So we try to use the full bandwidth between global memory and the Tensor Cores to improve efficiency.
How do we do that? This is a highly parallel design with persistent ping-pong. In short, we start with three warp groups. Warp group 0 handles data loading. Warp group 1 starts the main loop computation once the data is loaded, and then runs the epilogue to write the results back. There is also warp group 2: when warp group 1 enters the epilogue, warp group 2 can start its main loop computation. However, this is still not a perfect solution. Because there are many warps, the overhead of warp scheduling and register usage keeps the performance below what it could be.
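The benefit of the ping-pong arrangement described above can be seen with a toy timing model. This is not a CUDA kernel, just a sketch under stated assumptions: data loading is fully hidden by the loader group, the Tensor Cores run one main loop at a time, and the per-tile durations (10 and 3 time units) are invented for illustration.

```python
def serial_time(tiles, mainloop, epilogue):
    """One compute group: every epilogue delays the next tile's main loop."""
    return tiles * (mainloop + epilogue)

def pingpong_time(tiles, mainloop, epilogue):
    """Two compute groups alternate tiles.

    The Tensor Cores are treated as an exclusive resource for main loops,
    while one group's epilogue may overlap the other group's main loop.
    """
    tensor_core_free = 0.0          # time at which the Tensor Cores become idle
    group_free = [0.0, 0.0]         # time at which each compute group is idle
    finish = 0.0
    for tile in range(tiles):
        g = tile % 2                # tiles alternate between the two groups
        start = max(tensor_core_free, group_free[g])
        tensor_core_free = start + mainloop
        group_free[g] = tensor_core_free + epilogue
        finish = max(finish, group_free[g])
    return finish

if __name__ == "__main__":
    tiles, mainloop, epilogue = 8, 10.0, 3.0   # hypothetical durations
    print("serial   :", serial_time(tiles, mainloop, epilogue))
    print("ping-pong:", pingpong_time(tiles, mainloop, epilogue))
```

In this model only the last epilogue remains exposed whenever the epilogue is no longer than the main loop. The model deliberately leaves out the scheduling and register-pressure overheads that the talk identifies as the weakness of this design.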
Another drawback of this solution is that warp group 1 and warp group 2 are both used for the main loop computation. The total number of registers per thread is 255, so we have to split the registers among the different warp groups. That means we cannot use very large tiles for the computation, which hurts overall throughput. As I mentioned before, GPU utilization is not very high; we observed less than 40%, I think. So we introduced another way to improve performance: we started to use the Tensor Memory Accelerator (TMA), another tool provided by Nvidia. With TMA, we don't need that many threads for loads and stores.
In this case, we use just two warp groups. In warp group 0, warps 0 and 1 do the loading, and the other two warps, warps 2 and 3, handle loading and writing. Each warp group has four warps. We still use warp group 1 for the main loop computation.
With this architecture we can reduce the number of warps. You may think there are still eight warps: warp group 0 has four warps and warp group 1 has another four. Yes, there is still some overhead, but we benefit from ping-pong. By leveraging ping-pong, we can use two shared memory buffers.
These buffers are shared between warp group 0 and warp group 1, which means the data is shared between the warp group that loads and writes and the warp group that computes. Overall this dramatically improves the efficiency of the computation. We can also support large tiles, because the registers can be fully used by one warp group. Even if you introduce another warp group, say warp group 2, for the main loop, we can still use ping-pong: warp group 1 and warp group 2 are both used for computation, and we can still compute the same tiles without sacrificing ping-pong.
With these techniques, you can see our performance. Our FP8 MoE kernel achieves 80% to 90% of peak memory bandwidth on the H20. When we measured this previously, bandwidth utilization was less than 40%; now we get a 1.6 times speedup over mainstream MoE kernels. Here are some tests with the test environment configuration; we tested this with DeepSeek V3 and DeepSeek R1. So this is the work we've done in this area.
Another interesting piece of work is dynamic chunked pipeline parallelism.
Chunked pipeline parallelism is a very popular approach; a lot of teams use pipeline and chunked parallelism. With this parallelism we can improve prefill efficiency. If your context becomes very long, you cut it into chunks so that the computation can be done in parallel. But there are still some small challenges with chunked parallelism, especially for very long contexts.
For example, suppose you set every chunk to a size of 2,048. In stage 1, after you finish computing chunk 0, stage 2 starts computing chunk 0. Compare stage 1's chunk 1 with chunk 0: the chunk lengths are the same, but the KV cache is different, because chunk 1 also has to attend to the KV cache already produced by chunk 0. So chunk 1 in stage 1 takes a little longer than chunk 0 takes in stage 2. This creates a small bubble after chunk 0, and stage 2 has to wait for chunk 1's computation to complete.
After stage 1, the computations for chunk 0 and chunk 1 contend in stage 2: each has to wait for the previous stage's computation to finish.
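The reason equal-sized chunks do not take equal time can be sketched with a toy cost model: the projection and FFN work grows with the chunk length, while the attention work also grows with the length of the prefix that the chunk has to attend to. The cost coefficients below are arbitrary illustrative assumptions, not measurements from the talk.

```python
def chunk_cost(chunk_len, prefix_len, linear_cost=1.0, attn_cost=1e-4):
    """Toy prefill cost of one chunk.

    linear_cost * chunk_len covers projections and FFN work.
    The attention term assumes each token attends to the whole prefix plus,
    on average, half of its own chunk (causal attention).
    """
    return linear_cost * chunk_len + attn_cost * chunk_len * (prefix_len + chunk_len / 2)

def uniform_chunk_costs(total_len, chunk_len, **cost_params):
    """Per-chunk costs when the prompt is cut into fixed-size chunks."""
    costs, prefix = [], 0
    while prefix < total_len:
        current = min(chunk_len, total_len - prefix)
        costs.append(chunk_cost(current, prefix, **cost_params))
        prefix += current
    return costs

if __name__ == "__main__":
    costs = uniform_chunk_costs(total_len=16_384, chunk_len=2_048)
    for i, cost in enumerate(costs):
        print(f"chunk {i}: relative cost {cost / costs[0]:.2f}")
```

Under this model every chunk is more expensive than the one before it, so a pipeline stage that has finished chunk i always waits for the previous stage to finish the slower chunk i+1. The gap widens as the prompt gets longer.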
If the context gets longer and longer, the bubbles in chunked pipeline parallelism become large, which wastes a lot of GPU time. So we started to use another approach, which we call dynamic chunked pipeline parallelism. As you can see, chunk 0 is 2,048, but the chunk size is dynamic: we change the chunk size for chunks 0, 1, and 2 dynamically.
After computing chunk 0, we compute chunk 1 with a slightly smaller size; for example, instead of 2,048 we use 2,000. With this design, the computation is aligned across the different stages, which saves a lot of time. Of course, real data is not as perfect as what I show in the picture. There are still some small bubbles in between, but they are not big.
Here are some test results. We use tensor parallelism as the baseline. If you use tensor parallelism directly, it takes a lot of time, because the context window is huge and there is no computation in parallel. It may take 89 seconds for a 64,000-token prompt.
If we use plain chunked pipeline parallelism, it doesn't actually save time compared with tensor parallelism, because it has a bubble time of 3.4 seconds. With dynamic chunked pipeline parallelism, the bubble time is not zero, but it is reduced to less than 1 second, and we get a 32% speedup. We also tested with 128,000 tokens. The prefill latency was mainly tested on the Qwen 2 model, and we already use dynamic chunked pipeline parallelism in our production environment. You can try the Qwen models in Qwen Chat; our new website is globally available, and you can try the latest Qwen model there. That is dynamic chunked pipeline parallelism.
Optimizing the engine alone can only improve the performance of a single inference by about 1x. As I mentioned before, underneath the engine we have PAI-EAS, the Elastic Algorithm Service, which is a deployment platform.
Currently there is a lot of discussion about prefill and decode (PD) disaggregation. Our product already supports this capability; it is integrated inside the product itself.
We have a built-in LLM scheduler, which can handle prefill and decode disaggregation or multi-tenant allocation based on the configuration. We also let all users of our product set their own configuration for PD disaggregation. On the right, we show how the product enables it: you can turn on prefill and decode disaggregation directly. We provide both a user interface and an API for customers to decide how many prefill nodes and decode nodes to deploy, which gives them a lot of flexibility. If you don't know how to do the disaggregation, you can use our default configuration. But if you know your traffic well, you can set up the configuration yourself, because when the user scenario changes, the PD configuration should change too. For example, if you are writing a development tool that reads a lot of source code as input and generates only a small amount of output for developers to read, prefill will take a long time.
In that scenario you may need more instances for prefill. On the other hand, deep research does a lot of searches but then writes a very long report, which is a different profile. Similarly, if you ask a large language model to do mathematics, a reasoning model will repeatedly do a lot of reasoning work and generate very long content. In those cases you need more decode instances. We not only provide tools that integrate everything together, we also provide the means for our customers and users to set up their own configurations.
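As a rough illustration of why the prefill-to-decode split depends on the workload, here is a small heuristic that divides a fixed number of instances in proportion to the estimated prefill and decode time per request. The function, its parameters, and the throughput figures are all hypothetical. This is not the PAI-EAS API or its scheduling logic, only one simple way a team might derive a starting ratio from its own traffic.

```python
def split_instances(total, avg_prompt_tokens, avg_output_tokens,
                    prefill_tokens_per_s, decode_tokens_per_s):
    """Split `total` instances into (prefill, decode) by estimated time per request.

    Assumes per-instance throughputs are known and that load scales linearly.
    Both roles always get at least one instance, so `total` must be >= 2.
    """
    if total < 2:
        raise ValueError("need at least one prefill and one decode instance")
    prefill_seconds = avg_prompt_tokens / prefill_tokens_per_s
    decode_seconds = avg_output_tokens / decode_tokens_per_s
    share = prefill_seconds / (prefill_seconds + decode_seconds)
    prefill = min(total - 1, max(1, round(total * share)))
    return prefill, total - prefill

if __name__ == "__main__":
    # Hypothetical code-analysis tool: very long prompts, short answers.
    print(split_instances(8, 200_000, 500, 10_000, 50))
    # Hypothetical math/reasoning workload: short prompts, very long outputs.
    print(split_instances(8, 1_000, 8_000, 10_000, 50))
```

The first workload ends up prefill-heavy and the second decode-heavy, which matches the two scenarios described above. Real deployments would also need to account for latency targets and batching effects that this sketch ignores.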
They can do this easily and freely. Here is also an example. We used Qwen 2.5, the latest MoE model open-sourced by the Qwen team, and tested with a latency target of less than 66 milliseconds. We set the best prefill-to-decode ratios, 1:1 and 1:2, and concurrency and TPS both grow by over 90%. You can go to Alibaba Cloud, try our service, and set up your first PD disaggregation service.
Let me summarize. For BladeLLM, at the top, we worked together with Nvidia to provide a group of capabilities that improve attention computation, feedforward computation, and KV cache loading and conversion. We also wrote new MoE kernels to improve memory bandwidth utilization.
We use dynamic chunked pipeline parallelism to improve overall long-context prefill efficiency. At the bottom, the platform itself provides PD disaggregation capabilities for our customers. Again, the test numbers and test settings only fit our scenarios. For your own businesses and scenarios, you need to figure out what your best configurations are.
By leveraging our platform, you can set the configuration yourself, or control the settings through the API or through algorithms written by your own team.
I think that's all for my part. Thank you.
Thank you so much for listening; that's all the time we have today. He will not take questions now, but you're welcome to ask questions offline. Enjoy the rest of GTC. Thank you.
The first is accuracy: if the wrong elements are chosen for sparse computation, accuracy may suffer. Microsoft’s approach uses a vertical-slash pattern algorithm to select the elements to compute. Beyond the algorithm, we found that resource utilization remains low, especially on GPUs like the H20, where tests showed kernel usage below 40%, indicating that the cores are underutilized. 為此，我們進行了優化以提升通道利用率(Channel Utilization)，開發出優化的稀疏注意力(Optimized Sparse Attention)。我們深入優化了從全局記憶體(Global Memory)載入稀疏鍵值對(Sparse Key-Value Pairs)的過程。結果顯示，在100萬個令牌的上下文長度(Context Length)下，相較於FlashAttention的13.7倍加速，我們的Blade LLM在相同GPU配置下實現近30倍加速。 To address this, we optimized channel utilization and developed an enhanced sparse attention approach. We refined the loading of sparse key-value pairs from global memory. Results show that, at a context length of 1 million tokens, compared to FlashAttention’s 13.7x speedup, our Blade LLM achieves nearly 30x speedup under the same GPU configuration. 我們已將此程式碼開源(Open Source)至vLLM專案，稍後將分享幻燈片和連結，歡迎至開源社群(Open Source Community)查看。此外，我們嘗試以FP4優化鍵值緩存(KV Cache)。近期許多訓練與推理採用FP8，而我們探索FP4以進一步提升效率。 We’ve open-sourced this code to the vLLM project and will share slides and links later, so feel free to check it out in the open-source community. Additionally, we’re experimenting with FP4 to optimize the KV cache. While recent training and inference often use FP8, we’re exploring FP4 to further boost efficiency. 我們利用ldmatrix重疊數據載入(Data Loading)，搭配高階DP4A指令進行型別轉換(Type Conversion)，再由張量核心(Tensor Core)執行計算(Computation)。使用FP4鍵值緩存(KV Cache)可節省高達75%的記憶體(Memory)，並在解碼注意力(Decoding Attention)實現2.5倍加速，且不損失準確性(Accuracy)。這是第一部分。接下來，我將介紹前饋網路計算(Feedforward Network Computation)。 We overlap data loading using ldmatrix, use high-level DP4A instructions for type conversion, and perform the computation on Tensor Cores. Employing an FP4 KV cache saves up to 75% of memory and achieves a 2.5x speedup in decoding attention without accuracy loss. That’s the first part. 
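The memory figure for a 4-bit KV cache can be illustrated with a small Python sketch. It is only a storage-layout illustration under stated assumptions: it packs generic 4-bit codes two per byte and does not implement a real FP4 number format, the scaling factors a real quantizer would also store, or the GPU kernel described in the talk. The function names are hypothetical.

```python
def pack_4bit(codes):
    """Pack 4-bit codes (integers 0..15) into bytes, two codes per byte."""
    codes = list(codes)
    if any(not 0 <= c <= 15 for c in codes):
        raise ValueError("every code must fit in 4 bits")
    if len(codes) % 2:
        codes.append(0)  # pad to an even number of codes
    return bytes((hi << 4) | lo for hi, lo in zip(codes[0::2], codes[1::2]))


def unpack_4bit(packed, count):
    """Recover the first `count` 4-bit codes from packed bytes."""
    codes = []
    for byte in packed:
        codes.append(byte >> 4)    # high nibble
        codes.append(byte & 0x0F)  # low nibble
    return codes[:count]


def memory_saving(bits_new, bits_old):
    """Fraction of memory saved when moving from bits_old to bits_new per value."""
    return 1.0 - bits_new / bits_old


codes = [1, 15, 7, 0, 9]
packed = pack_4bit(codes)
print(len(packed), unpack_4bit(packed, len(codes)))
print(memory_saving(4, 16), memory_saving(4, 8))
```

Going from 16-bit to 4-bit values gives the 75% upper bound quoted in the talk, and going from 8-bit to 4-bit gives 50%. Any per-block scale factors kept alongside the 4-bit codes would reduce the saving slightly, which is consistent with the "up to" wording.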
Next, I’ll cover feedforward network computation. 對於Moe模型(Moe Models)，其包含大量FFN計算(Feedforward Network Computation)。在端到端模型推理(Model Inference)中，FFN計算佔據重要比重。提升FFN效率(FFN Efficiency)有助於改善整體推理效率(Inference Efficiency)，特別是在長上下文Moe模型中。解碼階段(Decoding Phase)受記憶體限制(Memory Bound)，Moe模型每一步解碼(Decoding Step)涉及更多FFN權重(FFN Weights)。我們觀察到，現有Moe核心如Fusion Moe的頻寬利用率(Bandwidth Utilization)偏低。 For Moe models, there’s significant feedforward network computation. In end-to-end model inference, FFN computation plays a major role. Improving FFN efficiency enhances overall inference efficiency, especially for long-context Moe models. The decoding phase is memory-bound, with Moe models involving more FFN weights per decoding step. We’ve observed that existing Moe kernels, like Fusion Moe, exhibit low bandwidth utilization. 為什麼？深入探究GPU運算流程可知，GPU通常執行主迴圈計算(Main Loop Computation)，從全局記憶體(Global Memory)讀取數據，傳至張量核心(Tensor Core)計算，隨後進入收尾階段(Epilogue)寫回結果。圖表顯示，數據載入張量核心後，頻寬(Bandwidth)在該時段未被充分利用。我們因此致力於充分利用全局記憶體至張量核心的頻寬，提升效率。 Why? Examining GPU computation reveals that it typically performs main loop computation, reading data from global memory, sending it to Tensor Core for processing, and then entering the epilogue to write back results. The chart shows that after data is loaded into Tensor Core, bandwidth remains underutilized during that period. We thus aim to fully leverage bandwidth from global memory to Tensor Core to improve efficiency. 我們採用高度並行的持久化乒乓(Persistent Ping-Pong)設計，使用三個工作組(Work Groups)：工作組0負責數據載入(Data Loading)，工作組1執行主迴圈計算(Main Loop Computation)後進入收尾階段，工作組2則在工作組1收尾時開始主迴圈計算。然而，這並非完美方案，因執行緒(Warps)、調度器(Warp Scheduler)和寄存器使用(Register Usage)的額外負擔影響性能。 We employ a highly parallel persistent ping-pong design with three work groups: Work Group 0 handles data loading, Work Group 1 performs main loop computation and then the epilogue, while Work Group 2 starts main loop computation as Work Group 1 enters the epilogue. 
However, this isn’t a perfect solution, as overhead from warps, warp schedulers, and register usage impacts performance. 此方案的另一缺點是，工作組1和2共享主迴圈計算，但每個執行緒(Thread)的寄存器數量(Register Number)僅255個，限制了分塊大小(Tiles)，影響吞吐量(Throughput)。GPU利用率(Utilization)低於40%。我們遂引入張量記憶體加速器(Tensor Memory Accelerator, TMA)，由輝達提供，減少載入與儲存所需的執行緒數量。 Another drawback is that Work Groups 1 and 2 share main loop computation, but with only 255 registers per thread, tile size is constrained, affecting throughput. GPU utilization falls below 40%. We thus introduced the Tensor Memory Accelerator (TMA), provided by Nvidia, reducing the number of threads needed for loading and storage. 在此情況下，我們僅使用兩個工作組。工作組0中，執行緒組0和1(Warp 0 and 1)負責載入，執行緒組2和3(Warp 2 and 3)處理載入與寫入，每組含四個執行緒組。我們仍使用執行緒組1執行主迴圈計算。 In this case, we use only two work groups. In Work Group 0, Warp 0 and 1 handle loading, while Warp 2 and 3 manage loading and writing, with each group containing four warps. We continue using Warp 1 for main loop computation. 此架構減少執行緒組(Warps)數量。雖仍有八個執行緒組—執行緒組0(Warp Group 0)和1(Warp Group 1)各四個—存在一定額外負擔(Overhead)，但乒乓機制(Ping-Pong)帶來益處，使我們能使用兩個共享記憶體緩存(Shared Memory Cache)。 This architecture reduces the number of warps. Though there are still eight—four in Warp Group 0 and four in Warp Group 1—with some overhead, the ping-pong mechanism benefits us, enabling two shared memory caches. 此記憶體緩存可在工作組0(Work Group 0)和1(Work Group 1)間共享，實現載入(Loading)、寫入(Writing)和計算工作組(Computation Work Group)間的數據共享(Data Sharing)。這大幅提升計算效率(Computation Efficiency)，並支援更大分塊(Tiles)，因寄存器可被單一工作組充分利用。即使考慮加入工作組2(Work Group 2)進行主迴圈計算，我們仍可利用乒乓機制，工作組1和2皆用於計算，且不犧牲機制即可處理相同分塊。 This memory cache is shared between Work Group 0 and Work Group 1, enabling data sharing among loading, writing, and computation work groups. This significantly boosts computation efficiency and supports larger tiles, as registers are fully utilized by a single work group. 
Even if we consider adding Work Group 2 for main loop computation, we can still use ping-pong, with both Work Groups 1 and 2 handling computation, processing the same tiles without sacrificing the mechanism. 利用這些技術，我們的RMFFP8 Moe核心(Moe Kernel)在H20上實現80%至90%的峰值記憶體頻寬(Peak Memory Bandwidth)。先前頻寬利用率(Bandwidth Utilization)低於40%，如今比主流Moe核心快1.6倍。我們以DeepSeek V3和R1進行測試，驗證了這一領域的成果。 Leveraging these technologies, our RMFFP8 Moe kernel achieves 80% to 90% peak memory bandwidth on H20. Previously, bandwidth utilization was below 40%, but now it’s 1.6x faster than mainstream Moe kernels. We validated this with tests using DeepSeek V3 and R1. 另一項有趣工作是動態分塊流水線並行(Dynamic Chunked Pipeline Parallelism)，這是廣泛採用的算法，許多團隊使用流水線(Pipeline)和分塊並行(Chunked Parallelism)。 Another intriguing effort is dynamic chunked pipeline parallelism, a widely adopted algorithm used by many teams for pipeline and chunked parallelism. 通過並行性(Parallelism)，我們提升模型預填充效率(Prefill Efficiency)。若上下文(Context)過長，需切分成不同分塊(Chunks)以並行計算(Computation)。但分塊並行(Chunked Parallelism)在超長上下文時仍存挑戰。例如，將分塊設為2048，第一階段(Stage 1)完成分塊0(Chunk 0)後，第二階段(Stage 2)開始計算分塊0。第一階段的分塊1(Chunk 1)和分塊0長度相似，但鍵值緩存(KV Cache)略有不同，導致分塊1計算時間略長，產生小氣泡(Bubble)，第二階段需等待分塊1完成。 Through parallelism, we enhance model prefill efficiency. For very long contexts, they must be split into chunks for parallel computation. However, chunked parallelism poses challenges with ultra-long contexts. For instance, with chunks set at 2048, after Stage 1 completes Chunk 0, Stage 2 begins computing Chunk 0. Stage 1’s Chunk 1 and Chunk 0 are similar in length, but their KV caches differ slightly, making Chunk 1’s computation slightly longer, creating a small bubble, with Stage 2 waiting for Chunk 1 to finish. 第一階段後，分塊0(Chunk 0)和分塊1(Chunk 1)的計算在第二階段(Stage 2)競爭。謝謝。 After Stage 1, computation for Chunk 0 and Chunk 1 competes in Stage 2. Thank you. 
分塊0(Chunk 0)和分塊1(Chunk 1)需等待前階段(Previous Stage)完成。若上下文長度(Context Length)持續增長，分塊流水線並行(Chunked Pipeline Parallelism)會出現較大氣泡，浪費GPU時間(GPU Time)。我們因此採用動態分塊流水線並行(Dynamic Chunked Pipeline Parallelism)。分塊0設為2048，但分塊大小(Dynamic Chunk Size)可逐塊動態調整。 Chunk 0 and Chunk 1 must wait for the previous stage to complete. As context length grows, chunked pipeline parallelism generates larger bubbles, wasting GPU time. Thus, we adopted dynamic chunked pipeline parallelism. Chunk 0 is set to 2048, but the chunk size can be adjusted dynamically from chunk to chunk. 計算分塊0(Chunk 0)後，接著計算分塊1(Chunk 1)。 After computing Chunk 0, we proceed to compute Chunk 1. 這稍微減少計算負擔。例如，我們可能使用2000而非2048。第一階段(Stage 1)計算分塊1(Chunk 1)時，第二階段(Stage 2)同時計算分塊0(Chunk 0)，所有階段間的計算(Computation)保持對齊(Aligned)，節省時間。真實數據中仍有小氣泡(Small Bubbles)，但影響不大。測試結果以張量並行(Tensor Parallel)為基準，直接並行模式(Parallel Pattern)因上下文窗口過大耗時89秒（64,000個令牌的提示，Prompt）。分塊流水線並行(Chunked Pipeline Parallelism)因氣泡時間(Bubble Time)3.4秒未節省時間。但動態分塊流水線並行將氣泡時間降至1秒以下，加速32%。我們測試128,000令牌(Token)，針對Qwen 2模型(Qwen 2 Model)的預填充延遲(Prefill Latency)，已在生產環境(Production Environment)應用此技術。歡迎在Qwen聊天(Qwen Chat)或我們全球可用新網站試用最新Qwen模型。 This slightly reduces the compute load per chunk. For example, we might use 2000 instead of 2048. While Stage 1 computes Chunk 1, Stage 2 computes Chunk 0, so computation stays aligned across stages, which saves time. Real data still shows small bubbles, but they’re minor. Test results, benchmarked against tensor parallelism, show the direct parallel pattern taking 89 seconds for a 64,000-token prompt because the context window is so large. Chunked pipeline parallelism, with a bubble time of 3.4 seconds, doesn’t save time. But dynamic chunked pipeline parallelism reduces bubble time to below 1 second, a 32% speedup. We tested prefill latency at 128,000 tokens on the Qwen 2 model, and have deployed this technique in production. Try the latest Qwen model in Qwen Chat or on our new, globally available website. 
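The idea of shrinking later chunks so that every chunk takes about the same time can be sketched in Python. This is a toy model, not the rule used in Blade LLM: the talk does not say how chunk sizes are chosen. The cost model (a linear term plus an attention term that grows with the amount of cached context) and the constants `a` and `b` are assumptions made for illustration.

```python
import math


def chunk_cost(start, size, a=1.0, b=1e-4):
    """Assumed cost of prefilling `size` tokens that begin at position `start`.

    a * size                       : per-token work (for example FFN layers)
    b * size * (start + size / 2)  : attention over the cached prefix plus the chunk
    """
    return a * size + b * size * (start + size / 2)


def dynamic_chunks(total_tokens, first_chunk=2048, a=1.0, b=1e-4):
    """Pick chunk sizes so each chunk costs about as much as the first one."""
    target = chunk_cost(0, first_chunk, a, b)
    chunks, start = [], 0
    while start < total_tokens:
        lin = a + b * start
        # Solve (b / 2) * size**2 + lin * size - target = 0 for size > 0.
        size = (-lin + math.sqrt(lin * lin + 2 * b * target)) / b
        size = max(1, min(round(size), total_tokens - start))
        chunks.append(size)
        start += size
    return chunks


chunks = dynamic_chunks(16384)
print(chunks)
```

With a fixed chunk size, each later chunk attends to a longer prefix and so takes longer, which is where the pipeline bubbles come from. Under this toy model the chunk sizes shrink as the prefix grows, so the pipeline stages stay roughly in step.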
引擎優化(Engine Optimization)僅提升單次推理(One-time Inference)性能1倍。如前所述，引擎底層有PAI-EAS彈性算法服務(PAI-EAS, Elastic Algorithm Service)，即部署平台(Deployment Platform)。 Engine optimization only boosts one-time inference performance by 1x. As mentioned, beneath the engine we have PAI-EAS, the Platform for AI Elastic Algorithm Service, which is the deployment platform. 目前，業界熱議預填充與解碼分離(Prefill and Decode Disaggregation)。我們的產品已內建支援，包含大規模語言模型調度器(LLM Scheduler)。此內建調度器(Built-in Scheduler)根據配置(Configuration)實現預填充(Prefill)和解碼(Decode)分離，或多任務分配(Multi-tenant Allocation)。使用者可自訂預填充與解碼分離(PD Disaggregation)配置。右邊展示如何啟用此功能：直接開啟，提供使用者介面(User Interface)和API，讓客戶決定預填充節點(Prefill Nodes)和解碼節點(Decode Nodes)數量。這極具靈活性。不懂分離者可用預設配置(Default Configuration)，熟悉流量(Traffic)者可自訂，因使用場景(User Scenario)改變需調整配置。例如，開發工具(Development Tool)需大量原始碼(Source Code)輸入，生成少量結果，預填充耗時長。 Currently, there’s much discussion about prefill and decode disaggregation. Our products natively support this, featuring an LLM scheduler. This built-in scheduler handles prefill and decode disaggregation or multi-tenant allocation based on configuration. Users can customize PD disaggregation settings. The right side shows how to enable it: it can be switched on directly, with a user interface and an API that let customers specify the number of prefill nodes and decode nodes. This offers great flexibility. Those unfamiliar with disaggregation can use the default configuration, while users who know their traffic can tailor it, since a change in usage scenario requires a change in configuration. For example, a development tool that takes a large amount of source code as input and generates a small amount of output spends most of its time in prefill. 在此場景下，你可能需為預填充(Prefill)配置更多實例(Instances)。反之，深入研究(Deep Research)需大量搜索(Search)並生成長報告(Report)，或處理數學問題(Mathematics)，推理模型(Reasoning Model)反覆運算，需更多解碼實例(Decode Instances)。我們提供工具(Tools)整合功能，並有指引(Guides)助客戶自訂配置。 In this scenario, you might need more prefill instances. Conversely, deep research requiring extensive searches and long reports, or mathematical tasks where reasoning models repeatedly compute, demands more decode instances. 
We provide tools for integration and guides to help customers customize configurations. 這很簡單自由。我們以Qwen 2.5（Qwen團隊開源(Open Source)的最新MoE模型(MoE Model)）測試，安全延遲(Security Latency)低於66毫秒(Milliseconds)。最佳預填充與解碼比率(Prefill-to-Decode Ratio)設為1:1和1:2，並發性(Concurrency)和每秒交易數(TPS)增長超90%。歡迎至阿里雲(Alibaba Cloud)試用，設定你的預填充與解碼分離服務(PD Disaggregation Service)。 It’s simple and flexible. We tested with Qwen 2.5, the latest MoE model open-sourced by the Qwen team, with security latency below 66 milliseconds. With the optimal prefill-to-decode ratio set at 1:1 and 1:2, concurrency and TPS increased by over 90%. Visit Alibaba Cloud to try it and set up your PD disaggregation service. 總結來說，對於Blade LLM，頂層與輝達(Nvidia)合作，增強注意力計算(Attention Calculation)、前饋計算(Feedforward Computation)、鍵值緩存載入(KV Cache Load)和轉換方法(Conversion Methods)，並開發MoeFDA核心(MoeFDA Kernels)提升記憶體頻寬利用率(Memory Bandwidth Utilization)。 In summary, for Blade LLM, at the top layer we collaborated with Nvidia to enhance attention calculation, feedforward computation, KV cache loading, and conversion methods, and developed MoeFDA kernels to improve memory bandwidth utilization. 我們採用動態分塊流水線並行(Dynamic Chunked Pipeline Parallelism)提升長上下文預填充效率(Long Context Prefill Efficiency)。底層平台提供預填充與解碼分離功能(PD Disaggregation Capabilities)。測試數據(Test Numbers)和設定(Test Settings)僅適用我們的場景(Scenarios)。你需根據業務場景找出最佳配置(Best Configurations)，利用我們的平台自行設定，或通過API和自訂算法控制。 We use dynamic chunked pipeline parallelism to boost long-context prefill efficiency. The bottom-layer platform offers PD disaggregation capabilities. The test numbers and settings apply to our scenarios only. You need to find the best configuration for your own business scenarios, either setting it directly on our platform or controlling it through the API with your own algorithms. 這是我今天的分享。謝謝大家聆聽，這是今天的全部內容。 That’s my presentation. Thank you for listening. That’s all for today. 他目前不接受提問，但歡迎會後(Offline)交流。請享受GTC餘下時光。謝謝。 He won’t take questions now, but feel free to discuss offline. Enjoy the rest of GTC. Thank you.