Hung-yi Lee
NTU
Introduction to Generative AI, 2024
A course designed for the general public. It is not about teaching you how to use ChatGPT; if that is all you want, following online influencers is enough. The course is mainly about understanding the concepts and principles behind generative AI.
This course can be the very first artificial intelligence course of your life.
The course will not overlap with the Machine Learning course, so there is no need to worry.
The homework will have students train a 7B model, that is, a model with 7 billion trainable parameters. By today's standards this is nothing special, but compared with models released five years ago it is remarkable.
GPT-2 at the time had only 1.5 billion parameters...
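To get a feel for those numbers, here is a rough back-of-the-envelope parameter count. This is not course material; the two configurations below (a Llama-7B-style model and a GPT-2-XL-style model) are assumed values used purely for illustration.

```python
def decoder_param_count(vocab, d_model, n_layers, d_ff, gated_mlp, tied_embeddings):
    """Rough parameter count for a decoder-only transformer.
    Ignores biases, layer norms, and positional embeddings (negligible at this scale)."""
    embed = vocab * d_model                          # token embedding matrix
    attn = 4 * d_model * d_model                     # Q, K, V, O projections per layer
    mlp = (3 if gated_mlp else 2) * d_model * d_ff   # gated (SwiGLU-style) vs. plain 2-matrix MLP
    lm_head = 0 if tied_embeddings else vocab * d_model
    return embed + n_layers * (attn + mlp) + lm_head

# Llama-style "7B" configuration (assumed values): roughly 6.7e9 parameters
print(decoder_param_count(32_000, 4096, 32, 11_008, gated_mlp=True, tied_embeddings=False))

# GPT-2 XL-style configuration: roughly 1.5e9 parameters
print(decoder_param_count(50_257, 1600, 48, 6_400, gated_mlp=False, tied_embeddings=True))
```

Almost all of the gap comes from the wider hidden size and deeper stack; the counting formula itself barely changes between the two generations.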
Homework list; anyone interested can look for similar problems to practice on their own.
Long-context modeling is crucial for next-generation language models, yet the high computational cost of standard attention mechanisms poses significant challenges. Sparse attention offers a promising direction for improving efficiency while maintaining model capabilities. We present NSA, a Natively trainable Sparse Attention mechanism that integrates algorithmic innovations with hardware-aligned optimizations to achieve efficient long-context modeling. NSA employs a dynamic hierarchical sparse strategy, combining coarse-grained token compression with fine-grained token selection to preserve both global context awareness and local precision. Our approach advances sparse attention design with two key innovations: (1) We achieve substantial speedups through arithmetic intensity-balanced algorithm design, with implementation optimizations for modern hardware. (2) We enable end-to-end training, reducing pretraining computation without sacrificing model performance. As shown in Figure 1, experiments show the model pretrained with NSA maintains or exceeds Full Attention models across general benchmarks, long-context tasks, and instruction-based reasoning. Meanwhile, NSA achieves substantial speedups over Full Attention on 64k-length sequences across decoding, forward propagation, and backward propagation, validating its efficiency throughout the model lifecycle.
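NSA's actual design relies on hardware-aligned kernels and combines compression, selection, and sliding-window branches through a learned gate; none of that is reproduced here. The following is only a minimal single-query NumPy sketch of the compress-then-select idea described in the abstract: the mean-pooling, the function name, and the `block_size`/`top_k` values are all illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def compress_then_select_attention(q, K, V, block_size=64, top_k=4):
    """Toy single-query sketch of hierarchical sparse attention.

    Coarse stage: mean-pool each key block into one compressed key and score it
    against the query (cheap global context awareness).
    Fine stage: keep only the top_k best-scoring blocks and run ordinary softmax
    attention over the tokens inside them (local precision).
    """
    T, d = K.shape
    n_blocks = (T + block_size - 1) // block_size

    # coarse scores: one pooled key per block
    block_scores = np.array([
        q @ K[b * block_size:(b + 1) * block_size].mean(axis=0) / np.sqrt(d)
        for b in range(n_blocks)
    ])

    # fine attention over the selected blocks only
    selected = np.sort(np.argsort(block_scores)[-top_k:])
    idx = np.concatenate([
        np.arange(b * block_size, min((b + 1) * block_size, T)) for b in selected
    ])
    weights = softmax(q @ K[idx].T / np.sqrt(d))
    return weights @ V[idx]

# Usage: one query over a 4096-token context touches only top_k * block_size keys.
rng = np.random.default_rng(0)
T, d = 4096, 64
q, K, V = rng.normal(size=d), rng.normal(size=(T, d)), rng.normal(size=(T, d))
print(compress_then_select_attention(q, K, V).shape)  # (64,)
```

The point of the design is that the coarse stage sees every token only through cheap pooled summaries, while the expensive softmax runs over a small, query-dependent subset of keys.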
Mathematical reasoning poses a significant challenge for language models due to its complex and structured nature. In this paper, we introduce DeepSeekMath 7B, which continues pretraining DeepSeek-Coder-Base-v1.5 7B with 120B math-related tokens sourced from Common Crawl, together with natural language and code data. DeepSeekMath 7B has achieved an impressive score of 51.7% on the competition-level MATH benchmark without relying on external toolkits and voting techniques, approaching the performance level of Gemini-Ultra and GPT-4. Self-consistency over 64 samples from DeepSeekMath 7B achieves 60.9% on MATH. The mathematical reasoning capability of DeepSeekMath is attributed to two key factors: First, we harness the significant potential of publicly available web data through a meticulously engineered data selection pipeline. Second, we introduce Group Relative Policy Optimization (GRPO), a variant of Proximal Policy Optimization (PPO) that enhances mathematical reasoning abilities while concurrently optimizing the memory usage of PPO.
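The central algorithmic idea in GRPO is to drop PPO's learned value network and instead baseline each sampled answer against the other answers drawn for the same question. Below is a minimal sketch of that group-relative advantage with hypothetical 0/1 correctness rewards; the full method additionally uses a clipped policy-ratio objective and a KL penalty against a reference model, which are not shown here.

```python
import numpy as np

def grpo_advantages(rewards):
    """Group-relative advantages: normalize each sampled answer's reward by the
    mean and standard deviation of its own group, replacing PPO's value-network
    baseline (and the memory that critic would cost)."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

# For one math question, sample G answers from the current policy and score each
# (here: 1.0 if the final answer is correct, 0.0 otherwise -- hypothetical values).
rewards = [1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0]   # G = 8
advantages = grpo_advantages(rewards)
print(advantages)  # correct answers get positive advantages, incorrect ones negative
```

Every token of answer i is then reinforced with advantage `advantages[i]`, so the group itself plays the role of the baseline that PPO would otherwise estimate with a separate value model.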