# The Transformer Architecture: An Analytical Report on its Evolution and Application in Large Language Models # Transformer 架構:其演化與大型語言模型應用之分析報告 --- https://www.youtube.com/watch?v=Q86qzJ1K1Ss Stanford CME295 Transformers & LLMs | Autumn 2025 | Lecture 9 - Recap & Current Trends ## 1.0 Introduction ## 1.0 緒論 The Transformer architecture has established itself as the foundational technology of modern Artificial Intelligence, serving as the engine behind the remarkable capabilities of today's Large Language Models (LLMs). Originally conceived as an innovative solution for machine translation, its core principles have proven so versatile that they now underpin a vast array of generative AI applications, from text generation to image understanding. Understanding its evolution is therefore of strategic importance for professionals seeking to grasp the forces shaping the current technological landscape. The objective of this report is to provide a systematic analysis of the Transformer architecture. It aims to dissect the core principles that enabled its success, trace the key architectural improvements that have refined its performance and efficiency, and examine the different model variants that have emerged to address specific tasks. The analysis begins with an exploration of the foundational self-attention mechanism, contrasting it with its predecessors. It then details the critical optimizations that have scaled the architecture for modern hardware and larger models. Subsequently, the report presents a taxonomy of the primary Transformer model types, leading into a discussion of how these architectures are scaled into LLMs through a multi-stage training paradigm. Finally, the report looks toward the next frontier, covering emerging cross-modal applications, novel generation paradigms such as diffusion, and the core challenges that continue to drive research and innovation. --- 變壓器(Transformer)架構已成為現代人工智慧的關鍵技術基礎,是當今大型語言模型(Large Language Models, LLMs)卓越能力背後的核心引擎。此架構最初被提出作為機器翻譯的創新解決方案,但其核心原理具有高度通用性,現已支撐從文字生成到影像理解等多種類型的生成式 AI 應用。對於希望掌握當前技術發展脈動的專業人士而言,理解 Transformer 的演化歷程具有重要的策略意義。 本報告的目標是對 Transformer 架構進行系統性的分析。內容包括:拆解其成功背後的核心原理、追蹤關鍵架構改良如何提升效能與效率,以及檢視為因應不同任務需求而發展出的各種模型變體。 本分析首先從自注意力(self-attention)機制切入,說明其與前一代架構的差異。接著說明為因應現代硬體與大型模型訓練規模所提出的關鍵最佳化措施。之後,本報告將建立主要 Transformer 模型類型的分類架構,並進一步說明這些架構如何透過多階段訓練典範被擴展為大型語言模型。最後,報告將探討下一階段的發展方向,包括跨模態應用、新型生成典範(例如 diffusion 模型),以及持續驅動研究與創新的核心挑戰。 --- ## 2.0 Foundational Principles: The Self-Attention Mechanism ## 2.0 基礎原理:自注意力機制 To fully appreciate the innovation of the Transformer, it is essential to first understand the limitations of the architectures that preceded it. The development of the self-attention mechanism was not merely an incremental improvement, but a fundamental shift that directly addressed the core challenges of representing and processing sequential data. This mechanism became the central idea that unlocked the Transformer's power. The primary limitations of preceding architectures included: - **Word2Vec**: While it represented a significant step forward in learning vector representations of words, Word2Vec produced *static* embeddings. These representations were not context-aware, meaning a word would have the same vector regardless of its usage in different sentences, and therefore failed to capture nuanced meaning. - **Recurrent Neural Networks (RNNs)**: RNNs process sequences token-by-token while maintaining an internal state. 
However, this recurrent structure struggles with the problem of long-range dependencies, where gradients either vanish or explode during backpropagation. As a result, it is difficult for RNNs to capture relationships between distant tokens.

The self-attention mechanism solved these problems by allowing every token in a sequence to directly attend to every other token, regardless of their position. This is achieved by projecting each token's embedding into three distinct vectors: a **Query (Q)**, a **Key (K)**, and a **Value (V)**. The model calculates the similarity between one token's Query vector and all other tokens' Key vectors. These similarity scores are then used as weights to compute a weighted average of all the Value vectors in the sequence, producing a new, context-rich representation for that token.

This operation can be written compactly in matrix form, which is highly optimized for modern hardware such as GPUs:

\[ \text{Attention}(Q, K, V) = \text{Softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V \]

The division by \(\sqrt{d_k}\) is a critical scaling factor that mitigates the tendency for the dot products to grow too large in magnitude, ensuring that the softmax function operates in a region with stable gradients. The result is a highly parallelizable computation that forms the heart of the Transformer.

The original Transformer architecture, introduced in 2017, comprised two main components: an **Encoder** and a **Decoder**. The Encoder processes the input sequence, and the Decoder generates the output sequence, with attention mechanisms in both modules. This design proved exceptionally successful in its initial application to machine translation, and it set the stage for subsequent developments. Nonetheless, this foundational design was merely the starting point for a series of critical improvements that would further enhance its capabilities.

---

若要充分理解 Transformer 的創新之處,必須先認識其問世之前主流架構的限制。自注意力機制的提出並非微幅調整,而是一種根本性的架構轉換,直接回應了「如何表示與處理序列資料」的核心問題。這個機制成為釋放 Transformer 威力的關鍵概念。

前一代架構的主要限制包括:

- **Word2Vec**:Word2Vec 在詞向量學習上是重要的里程碑,但其產生的是「靜態」詞向量。這些向量不具情境敏感性,同一個單字在不同句子中會被賦予相同向量,因此難以表達語意上的細微差異。
- **循環神經網路(Recurrent Neural Networks, RNNs)**:RNN 以逐詞(token-by-token)方式處理序列,並維持一個內部狀態。然而,此種遞迴結構在處理長距依賴(long-range dependencies)時常面臨梯度消失或梯度爆炸問題,使模型難以有效捕捉彼此距離較遠的詞元關係。

自注意力機制透過讓序列中「每一個詞元都能直接關注(attend)序列中的所有其他詞元」,不受位置距離限制,來回應上述問題。具體作法是將每個詞元的嵌入向量分別映射為三種向量:**查詢向量(Query, Q)**、**鍵向量(Key, K)**與**值向量(Value, V)**。模型計算該詞元 Query 向量與所有詞元 Key 向量之間的相似度,並將這些相似度作為權重,對所有 Value 向量進行加權平均,以產生該詞元新的、兼具上下文資訊的表示。

此運算可用矩陣形式簡潔表示,並且非常適合在 GPU 等現代硬體上進行最佳化:

\[ \text{Attention}(Q, K, V) = \text{Softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V \]

其中除以 \(\sqrt{d_k}\) 的縮放因子用來抑制內積值幅度過大,確保 softmax 運算落在梯度穩定的區域。這樣的設計使整個計算高度平行化,成為 Transformer 架構的核心。

2017 年提出的原始 Transformer 架構由兩大組成要素構成:**編碼器(Encoder)**與**解碼器(Decoder)**。編碼器負責處理輸入序列,解碼器則負責產生輸出序列,兩者皆大量使用注意力機制。此設計在機器翻譯任務上展現出卓越成效,為後續發展奠定基礎。然而,這個基礎設計只是起點,隨後一系列關鍵改良大幅提升了其能力與適用範圍。

---

## 3.0 Architectural Evolution and Key Optimizations
## 3.0 架構演化與關鍵最佳化

Since its introduction in 2017, the original Transformer architecture has undergone substantial refinement. Researchers and engineers have introduced a series of modifications aimed at improving performance, computational efficiency, and scalability, making it feasible to train the massive models we observe today. This section examines the most impactful of these architectural optimizations.
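As a concrete baseline for those optimizations, the scaled dot-product attention formula from Section 2.0 can be sketched in a few lines of NumPy. This is a minimal, single-head illustration of the equation (the toy sequence length and dimension are assumptions for the example), not a production implementation.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V for one head.

    Q, K, V: arrays of shape (seq_len, d_k).
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # pairwise similarities, (seq_len, seq_len)
    scores -= scores.max(axis=-1, keepdims=True)    # subtract row max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the Key dimension
    return weights @ V                              # weighted average of the Value vectors

# Toy example: 4 tokens, d_k = 8 (illustrative sizes only).
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((4, 8)) for _ in range(3))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (4, 8): one context-rich vector per token
```

Because every output row is computed from the same matrix products, the whole sequence is processed in parallel, which is exactly the property the hardware-oriented optimizations below build on.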
--- 自 2017 年原始 Transformer 架構問世以來,相關設計已歷經多次重要修正與優化。研究與工程社群陸續提出多種改良,目的在於提升模型效能、計算效率與可擴展性,使當前規模龐大的模型訓練成為可能。本節將說明其中幾項影響最為顯著的架構最佳化方向。 --- ### 3.1 From Absolute to Relative: Positional Encoding ### 3.1 由絕對到相對:位置編碼的演進 In the original Transformer, token positions were encoded in an absolute manner: each position was assigned a unique embedding that was added to the token embedding. Subsequent research, however, showed that what matters more in self-attention is not a token's absolute position, but its *relative* position to other tokens. This insight led to the development of **Rotary Position Embeddings (RoPE)**, a widely adopted method that encodes relative positional information directly within the self-attention computation. RoPE operates by rotating the Query and Key vectors by an angle determined by their position. When their dot product is later computed, the resulting attention score becomes a function of the *relative distance* between tokens, making the attention mechanism more robust and more aligned with how context is structured in sequences. --- 在原始 Transformer 中,詞元的位置是以「絕對位置編碼」的方式處理:每個位置都有對應的向量,與詞元嵌入向量相加。然而,後續研究指出,在自注意力機制中,真正關鍵的往往不是「絕對位置」,而是詞元彼此之間的「相對位置」。 基於這個觀察,發展出現今廣泛使用的 **旋轉位置編碼(Rotary Position Embeddings, RoPE)**。RoPE 將相對位置信息直接融入自注意力計算之中,其作法是依據詞元所在位置,對 Query 與 Key 向量施加對應角度的旋轉。之後在計算內積時,所得的注意力分數自然就反映出詞元間的相對距離,讓注意力機制在建模序列結構時更加穩健且具直覺性。 --- ### 3.2 Enhancing Attention Efficiency ### 3.2 提升注意力機制的效率 The multi-head attention layer, a central component of the Transformer, has also been a major target for optimization. In the original design, each attention *head* possessed its own projection matrices for Queries, Keys, and Values. To improve efficiency, especially in large models, techniques were developed to share some of these matrices. A key optimization in this direction is **Grouped Query Attention (GQA)**. Instead of learning a unique Key and Value projection matrix for each attention head, GQA groups multiple heads so that they share a single Key and Value projection. This significantly reduces both the parameter count and memory footprint, while preserving most of the performance, thus enabling faster and more resource-efficient inference. --- 多頭注意力(multi-head attention)層是 Transformer 的核心組件之一,也因此成為重要的最佳化對象。原始設計中,每一個注意力頭(head)都擁有各自獨立的 Query、Key 與 Value 投影矩陣。為了提升效率,特別是在大型模型中,後續提出了多種參數共享技術。 其中一項關鍵技術是 **Grouped Query Attention(GQA)**。GQA 的作法是:不再為每一個注意力頭分別學習獨立的 Key 與 Value 投影矩陣,而是將多個注意力頭分組,共享同一組 Key 與 Value 投影。此舉可大幅減少參數量與記憶體使用,且在維持大部分效能的情況下,明顯加速推論並提升資源利用效率。 --- ### 3.3 Refining Normalization Layers ### 3.3 正規化層的調整與精煉 Layer normalization is crucial for stabilizing the training of deep neural networks. In Transformer blocks, both the placement and the specific form of normalization have evolved. The original architecture used a **Post-norm** configuration, where normalization is applied after adding the residual connection for each sub-layer. Modern architectures, however, have largely switched to a **Pre-norm** configuration, in which normalization is applied *before* each sub-layer. This change has been shown to improve training stability and to support much deeper Transformer stacks by mitigating gradient-related issues that occur with Post-norm designs. Furthermore, more parameter-efficient normalization methods have been proposed. **RMSNorm** is a notable alternative to standard LayerNorm. 
By simplifying the normalization computation and focusing on the root-mean-square of activations, RMSNorm reduces the number of learnable parameters while contributing to overall training and inference efficiency. --- 正規化(normalization)對於穩定深度神經網路的訓練至關重要。在 Transformer 區塊中,正規化層的「放置位置」以及「具體形式」都歷經演進。原始架構採用的是 **Post-norm** 配置,也就是在子層(sub-layer)計算完成並加入殘差連結(residual connection)之後,再進行層正規化。許多現代架構則改用 **Pre-norm** 配置,即在進入子層計算之前先進行正規化。此改變被證實有助於提升訓練穩定性,並使得構建更深層的 Transformer 成為可能,因為可減輕 Post-norm 設計中常見的梯度問題。 此外,研究者也提出了更具參數效率的正規化方法,其中 **RMSNorm** 是重要代表之一。RMSNorm 聚焦於輸出值的均方根(root mean square),以較為精簡的計算流程取代部分 LayerNorm 的操作,降低可學參數數量,同時維持甚至改善訓練與推論效率。 --- ### 3.4 Optimizing for Hardware: The Flash Attention Mechanism ### 3.4 面向硬體的最佳化:Flash Attention 機制 As models grew larger, the self-attention computation became a major bottleneck—not primarily due to floating-point operations (FLOPs), but because of memory input/output (I/O). **Flash Attention** is a landmark algorithm designed to optimize attention computation by minimizing the number of read/write operations to the GPU's large but relatively slow high-bandwidth memory (HBM). Its core principles are: 1. **Memory Hierarchy Utilization**: Flash Attention leverages the two-level memory hierarchy of GPUs: a small but extremely fast on-chip SRAM and a large but slower HBM. It partitions the large attention computation into blocks that fit entirely within SRAM. Each block is computed end-to-end in SRAM, markedly reducing accesses to HBM. 2. **Recomputation Strategy**: To further conserve memory, Flash Attention avoids storing certain large intermediate results (such as the full attention matrix) that would ordinarily be written to HBM. Instead, these values are recomputed on demand during the backward pass in training. Although this increases the number of arithmetic operations, the overall speed improves because it avoids the latency associated with frequent HBM access. 3. **Exactness**: Importantly, Flash Attention is an *exact* algorithm rather than an approximation. It produces outputs that are numerically identical to those from a standard attention implementation, while delivering significant runtime speedups. These architectural refinements, taken together, paved the way for the diversification of the Transformer framework into distinct model families, each tailored to different objectives and application domains. --- 隨著模型規模持續擴大,自注意力計算逐漸成為效能瓶頸,其主要限制不在於運算次數(FLOPs),而是記憶體 I/O。**Flash Attention** 是為了解決這個問題所提出的重要演算法,其核心目標是減少對 GPU 高頻寬記憶體(High-Bandwidth Memory, HBM)的讀寫次數,進而加速注意力計算。 其主要設計原則包括: 1. **善用記憶體階層結構**:Flash Attention 利用 GPU 內部的兩層記憶體結構:容量較小但速度極快的片上 SRAM,以及容量較大但速度較慢的 HBM。演算法將原本龐大的注意力矩陣運算切分為多個能完全放入 SRAM 的小區塊,並在 SRAM 中完成該區塊從前向到後向的計算,大幅降低對 HBM 的存取需求。 2. **重新計算策略(Recomputation)**:為了進一步節省記憶體用量,Flash Attention 不會將某些大型中間結果(例如完整的注意力矩陣)寫入 HBM,而是在反向傳播時即時重新計算需要的值。雖然這增加了計算量,但整體速度反而提升,因為減少了高延遲的記憶體 I/O。 3. **精確性**:重要的是,Flash Attention 並非近似演算法,而是「數值上等價」的實作。也就是說,其輸出與標準注意力計算在數值上相同,同時大幅縮短執行時間。 這些架構層面的精進,使 Transformer 架構得以擴展成多種不同的模型家族,分別對應不同的任務需求與應用場景。 --- ## 4.0 A Taxonomy of Transformer-Based Models ## 4.0 基於 Transformer 的模型分類 The modular design of the Transformer—built from stackable Encoder and Decoder blocks—has led to the emergence of several distinct model variants. Each archetype utilizes different portions of the original architecture and is particularly well-suited for specific categories of tasks. This section provides a comparative overview of the dominant variants. 
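At the attention level, the variants compared in the table below differ mainly in which positions a token is allowed to attend to: the full sequence (bidirectional, Encoder-style) or only earlier positions (causal, Decoder-style). The sketch below shows the two mask patterns under the common convention, assumed here for illustration, that blocked positions receive an additive value of negative infinity before the softmax.

```python
import numpy as np

def attention_masks(seq_len: int):
    """Return additive attention masks (0 = may attend, -inf = blocked).

    Encoder-style (bidirectional): every token sees every other token.
    Decoder-style (causal): token i sees only positions <= i, which is
    what enables left-to-right autoregressive generation.
    """
    bidirectional = np.zeros((seq_len, seq_len))
    causal = np.triu(np.full((seq_len, seq_len), -np.inf), k=1)
    return bidirectional, causal

full_mask, causal_mask = attention_masks(4)
print(causal_mask)
# [[  0. -inf -inf -inf]
#  [  0.   0. -inf -inf]
#  [  0.   0.   0. -inf]
#  [  0.   0.   0.   0.]]
```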
---

Transformer 以可重複堆疊的編碼器與解碼器區塊為基礎,具有高度模組化的特性。這種設計促成多種不同模型變體的發展,各自取用原始架構中的不同部件,並針對特定任務類型展現優勢。本節對主要模型變體進行比較性說明。

---

### 4.1 Model Variants Overview
### 4.1 模型變體總覽

| Model Variant | Core Characteristics | Primary Use Cases |
|---|---|---|
| **Encoder-only** (e.g., BERT) | - Processes the entire input sequence simultaneously with bidirectional context. <br> - Produces rich, context-aware embeddings. <br> - Often uses a special `[CLS]` token for classification. <br> - Not designed for autoregressive text generation. | Text classification, sentiment analysis, named entity recognition, and other natural language understanding tasks. |
| **Decoder-only** (e.g., GPT) | - Generates text autoregressively in a left-to-right fashion. <br> - Follows a “text in, text out” paradigm. <br> - Serves as the foundation for most modern LLMs. | Open-ended text generation, chatbots, summarization, and question answering. |
| **Encoder–Decoder** (e.g., T5) | - Maps an entire input sequence to an output sequence. <br> - The encoder constructs a representation of the input. <br> - The decoder generates the output autoregressively conditioned on the encoder output. | Machine translation, text summarization, and other sequence-to-sequence transformation tasks. |

The remarkable success and flexibility of the Decoder-only architecture in generating coherent, contextually relevant text led to its adoption as the de facto standard for constructing the next generation of AI systems: Large Language Models.

---

| 模型類型 | 核心特徵 | 主要應用情境 |
|---|---|---|
| **僅編碼器(Encoder-only)**(如 BERT) | - 以雙向方式同時處理整段輸入序列。<br> - 產生具高度情境資訊的向量表示。<br> - 常透過特殊的 `[CLS]` 標記作為分類任務的整體表示。<br> - 不適合用於自回歸式文字生成。 | 文字分類、情緒分析、命名實體辨識等各類自然語言理解任務。 |
| **僅解碼器(Decoder-only)**(如 GPT) | - 以由左至右的自回歸方式逐 token 生成文字。<br> - 採「文字輸入、文字輸出」的互動模式。<br> - 是多數現代大型語言模型的核心架構。 | 開放式文字生成、對話機器人、摘要生成、問答等。 |
| **編碼器–解碼器(Encoder–Decoder)**(如 T5) | - 將整段輸入序列轉換為整段輸出序列。<br> - 編碼器先為輸入構建表示。<br> - 解碼器在此表示的條件下,以自回歸方式產生輸出。 | 機器翻譯、文字摘要,以及各類輸入需轉換為新輸出的序列轉換任務。 |

僅解碼器架構在產生連貫且符合情境的文字方面展現出極高彈性與效能,因此成為建構下一代 AI 系統——大型語言模型——的事實標準。

---

## 5.0 Scaling Transformers into Large Language Models (LLMs)
## 5.0 將 Transformer 擴展為大型語言模型

Large Language Models are the result of scaling Transformer architectures, primarily the Decoder-only variant, to unprecedented sizes. This strategic shift was motivated by the empirical observation that larger models, trained on more data with greater computational resources, exhibit qualitatively better performance. Realizing this potential, however, required new training paradigms and efficiency techniques to manage the vast computational demands.
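Since the Decoder-only design described above is the one scaled into LLMs, it is worth making its autoregressive "text in, text out" loop concrete before discussing scaling. The sketch below assumes a hypothetical `logits_fn` standing in for a trained model; it illustrates greedy decoding only and is not any particular library's API.

```python
from typing import Callable, List

def greedy_generate(
    logits_fn: Callable[[List[int]], List[float]],  # hypothetical: maps a token-id prefix to next-token logits
    prompt_ids: List[int],
    eos_id: int,
    max_new_tokens: int = 32,
) -> List[int]:
    """Minimal autoregressive loop for a Decoder-only model.

    One forward pass per generated token: the model conditions on the
    growing prefix and appends the most likely next token until EOS.
    """
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = logits_fn(ids)                                   # score every vocabulary item
        next_id = max(range(len(logits)), key=logits.__getitem__) # greedy choice
        ids.append(next_id)
        if next_id == eos_id:
            break
    return ids
```

The strictly sequential nature of this loop is also the limitation that motivates the diffusion-based alternatives discussed in Section 6.2.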
---

大型語言模型的誕生,來自於將 Transformer 架構——特別是僅解碼器(Decoder-only)變體——擴展到前所未有的規模。這樣的策略轉向源自一項重要經驗:在足夠的資料量與運算資源支撐下,**更大的模型不僅在數值指標上表現更好,還會展現質的能力飛躍**。然而,要實際達成這種規模,必須引入新的訓練典範與效率最佳化技術,以因應龐大的計算需求。

---

### 5.1 Scaling Laws and Efficiency
### 5.1 擴展定律與效率考量

In the early 2020s, researchers discovered a set of **scaling laws** describing a predictable relationship between model performance and three key factors: model size (number of parameters), dataset size, and training compute. These studies showed that many contemporary models were effectively “undertrained”—too large relative to the amount of data they had seen. A widely cited heuristic emerged: for near-optimal performance, a model should be trained on a dataset containing *at least* roughly twenty tokens per parameter.

To make such scaling economically and computationally viable, architectures such as **Mixture of Experts (MoE)** were introduced. In an MoE layer, the standard feed-forward network is replaced by a collection of smaller “expert” networks. For each token, a gating mechanism activates only a subset of these experts. This design allows the total parameter count to increase substantially while keeping the computational cost per forward pass relatively stable.

---

在 2020 年代初期,研究者提出一系列 **擴展定律(scaling laws)**,描述模型效能與三個關鍵變數之間可預測的關係:模型規模(參數數量)、資料集大小,以及訓練所投入的計算量。這些研究顯示,許多當代模型在嚴格意義上屬於「訓練不足」:也就是模型參數數量相對於其所見資料量而言過大。一項常見的經驗法則是:若要接近最佳表現,訓練資料中的 token 數量應約為模型參數數量的 20 倍以上。

為了在實務上達成此種規模,同時控制成本與計算負荷,研究社群提出 **專家混合架構(Mixture of Experts, MoE)**。在 MoE 層中,傳統的前饋網路被一組較小的「專家網路」所取代。對每一個 token,門控機制只會啟動其中少數幾個專家。如此一來,總參數規模可以大幅增加,而單次前向傳遞的計算成本則維持在可接受範圍。

---

### 5.2 The Three-Stage LLM Training Paradigm
### 5.2 大型語言模型的三階段訓練典範

Training a modern LLM is no longer a single-step process; instead, it follows a sophisticated three-stage paradigm designed to impart both broad knowledge and specific, desirable behaviors:

1. **Pre-training**: This initial stage is computationally intensive. The model is trained on a massive corpus of text and code, typically with a next-token prediction objective. During this stage it learns the statistical structure of language, grammatical rules, and common reasoning patterns, producing a powerful "autocomplete-style" base model.
2. **Supervised Fine-Tuning (SFT)**: In the second stage, the pre-trained model is fine-tuned on a smaller, carefully curated set of high-quality input–output pairs. These examples demonstrate the behavior we want the model to exhibit, teaching it to follow instructions, structure its answers, and act as a helpful assistant.
3. **Preference Tuning**: Whereas the first two stages rely mainly on positive examples, preference tuning introduces negative signals and human values. It uses pairwise preference data, in which annotators pick the better of two model responses. These preference signals calibrate the model's outputs toward human expectations of helpfulness and safety.

---

訓練現代大型語言模型已不再是單一步驟的過程,而是遵循一套精心設計的三階段典範,目的在於同時賦予模型廣泛知識與特定且合乎期望的行為:

1. **預訓練(Pre-training)**:此初始階段的計算成本極高。模型在極大量的文字與程式碼上進行訓練,目標通常是下一個 token 的預測(next-token prediction)。在此階段中,模型學習語言的統計結構、語法規則與常見推理模式,因而形成強大的「自動補全」基礎模型。
2. **監督式微調(Supervised Fine-Tuning, SFT)**:在第二階段,預訓練模型會在規模較小、精心挑選的高品質輸入–輸出配對資料上進行微調。這些資料展示了我們希望模型展現的行為範例,藉此訓練模型學會遵守指令、組織回答並扮演有幫助的助理角色。
3. **偏好調整(Preference Tuning)**:前兩個階段主要依賴正向範例,而偏好調整階段則引入「負面訊號」與人類價值觀。此階段使用成對偏好資料:標註者會在兩個模型回應中選擇較好的那一個。藉由這些偏好訊息,模型得以校準自身的輸出,使其更符合人類在「有用性」與「安全性」上的期待。

---

### 5.3 Advanced Alignment and Reasoning
### 5.3 進階對齊與推理能力強化

Preference tuning is typically implemented using techniques from **reinforcement learning (RL)**, creating an alignment-oriented feedback loop. In this framework, the LLM is treated as the *policy*, the text it generates as an *action*, and human preference judgments as the *reward*.

To make the process scalable, a separate **Reward Model** is trained on the human preference data to predict how strongly one response would be preferred over another. During the subsequent RL tuning, the LLM generates candidate responses, the reward model scores them, and the LLM's weights are updated to maximize the expected reward. The RL loss typically includes a **KL-divergence penalty** that keeps the model from drifting too far from the SFT-stage baseline, guarding against "reward hacking," in which the model exploits flaws in the reward model to produce outputs that do not match genuine human intent.

To further strengthen performance on complex tasks, modern LLMs are often trained to produce an explicit reasoning trace before giving their final answer, an approach known as **Chain of Thought**. Outputting these intermediate reasoning steps has been shown to markedly improve accuracy on logical reasoning, mathematics, and multi-step problems.

The continued refinement of these training and alignment methods is a major part of the current technical frontier, while the research community also explores next-generation training paradigms that may further change how such models are built and used.

---

偏好調整通常以 **強化學習(Reinforcement Learning, RL)** 技術實作,形成一個對齊導向的回饋迴圈。在此框架下,LLM 被視為「策略(policy)」,模型產生的文字被視為「行動(action)」,而人類偏好的評價則扮演「獎勵(reward)」角色。

為了讓此過程可擴展,訓練流程中會建立一個獨立的 **獎勵模型(Reward Model)**。該模型以人類偏好資料為訓練基礎,學習預測某一回應相對另一回應的偏好分數。在後續的強化學習調整中,LLM 會產生候選回應,由獎勵模型評分,然後更新 LLM 的權重以最大化預期獎勵。此類 RL 損失函數通常會加入 **KL 散度懲罰項**,用來限制模型偏離 SFT 階段基準模型過多,以避免所謂的「獎勵駭客(reward hacking)」,即模型利用獎勵模型的漏洞產生不符合真實人類意圖的輸出。

為了進一步增強模型在複雜任務上的表現,現代 LLM 常被訓練成在給出最終答案前,先顯式產生一段推理過程,這種方法被稱為 **Chain of Thought(思維鏈)**。研究顯示,在邏輯推理、數學題目與多步驟問題上,讓模型輸出中介推理步驟可以顯著提升答題正確率。

上述訓練與對齊方法的持續優化,構成當前技術前緣的重要部分;同時,研究社群也在探索下一代訓練典範,可能進一步改變這類模型的建構方式與使用方式。

---

## 6.0 The Next Frontier: Emerging Paradigms and Future Challenges
## 6.0 下一個前沿:新興典範與未來挑戰

The extraordinary success of text-based Transformers is now inspiring both their extension into new domains and the exploration of fundamentally different generation paradigms. As the field matures, attention is increasingly directed toward core architectural challenges involving efficiency, data quality, and safety—factors that will shape the next wave of innovation.
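Before turning to these emerging directions, the preference-tuning loop of Section 5.3 can be summarized as a single objective. The formulation below is a standard way to express "maximize the reward model's score, subject to a KL penalty toward the SFT baseline," consistent with the description above; the symbols \( \pi_\theta \) (the LLM policy being tuned), \( r_\phi \) (the learned reward model), \( \pi_{\text{SFT}} \) (the frozen SFT baseline), and \( \beta \) (the penalty strength) are notation introduced here for illustration.

\[ \max_{\theta} \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\big[ r_\phi(x, y) \big] \;-\; \beta \, \mathbb{E}_{x \sim \mathcal{D}}\Big[ D_{\text{KL}}\big( \pi_\theta(\cdot \mid x) \,\|\, \pi_{\text{SFT}}(\cdot \mid x) \big) \Big] \]

The KL term is what discourages the "reward hacking" behavior described in Section 5.3: it penalizes policies that score well under \( r_\phi \) only by drifting far from the SFT model's distribution.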
---

文字型 Transformer 的成功,正推動其應用擴展至其他資料模態,同時也促使研究社群探索截然不同的生成典範。隨著領域逐漸成熟,研究焦點愈來愈集中在效率、資料品質與安全性等核心架構議題上,而這些因素將深刻影響下一階段的技術演進。

---

### 6.1 Beyond Text: Cross-Modal Applications
### 6.1 超越文字:跨模態應用

The core idea of self-attention—learning relationships between vector representations—is not restricted to text. This has enabled its successful adaptation to other modalities, particularly images.

- **Vision Transformer (ViT)**: ViT adapts the Transformer Encoder for image classification. An image is divided into a grid of patches, each treated as a token, and fed into the encoder. A special `[CLS]` token aggregates global information, and its final embedding is used for classification. This result was notable because it demonstrated that a general-purpose architecture with relatively low inductive bias, such as the Transformer, can outperform highly specialized convolutional neural networks (CNNs)—which strongly encode spatial locality—provided that sufficient training data are available.
- **Vision–Language Models (VLMs)**: VLMs are designed to answer questions about images or to generate descriptions conditioned on visual input. A common approach uses an image encoder to convert an image into a sequence of patch tokens, which are then concatenated with text tokens and processed by a Decoder-only LLM that generates a textual response autoregressively.

---

自注意力的核心概念——學習向量表示之間的關係——並不侷限於文字資料,也可有效應用於其他模態,特別是影像。

- **Vision Transformer(ViT)**:ViT 將 Transformer 編碼器應用於影像分類。其作法是將影像切分為固定大小的影像區塊(patch),每個區塊視為一個 token,再送入編碼器。模型通常會引入一個特殊的 `[CLS]` 標記來匯聚全域資訊,其最終嵌入向量用於分類決策。此結果具有指標性,顯示在資料量充足的情況下,像 Transformer 這種歸納偏誤(inductive bias)較弱的通用架構,亦可超越專為影像設計、具強烈空間局部性假設的卷積神經網路(CNN)。
- **視覺–語言模型(Vision–Language Models, VLMs)**:VLMs 旨在對影像進行問答,或是生成與影像相關的文字描述。一種典型作法是先透過影像編碼器將影像轉換為一串 patch token,再與文字 token 串接,作為輸入提供給僅解碼器式的 LLM。該 LLM 便可在此聯合序列的條件下,自回歸地生成文字回答。

---

### 6.2 A New Generation Paradigm: Diffusion-Based LLMs
### 6.2 新型生成典範:基於 Diffusion 的文字模型

Although autoregressive models (ARMs) dominate current LLM design, their generation process at inference time is inherently sequential: only one token can be produced at a time, which limits full parallelization. This has motivated interest in alternative approaches, with **diffusion models** emerging as a closely watched candidate.

In the image domain, diffusion models generate samples by learning to reverse a gradual noising process. For discrete text, the analogue of noise is the `[MASK]` token: the forward process progressively replaces tokens in the original sequence with `[MASK]`, while the model learns to progressively "unmask" a fully masked sequence and reconstruct the original text.

Such **Masked Diffusion Models (MDMs)** offer several potential advantages:

- **Speed**: Inference can be markedly faster (for example, several-fold up to roughly ten times), because the number of forward passes depends on a fixed number of diffusion steps rather than on the output length.
- **Task fit**: Because they revise the entire sequence holistically rather than generating token by token, MDMs are often a more natural fit for fill-in-the-middle tasks such as code completion.

---

儘管自回歸模型(autoregressive models, ARMs)主導了當前 LLM 的設計,其推論階段的生成過程本質上是序列式的:每次只能生成一個 token,難以完全平行化。這激發了研究社群對替代方法的興趣,其中 **diffusion 模型** 成為備受關注的候選方案。

在影像領域,diffusion 模型透過學習「反轉逐步加噪過程」來生成樣本。對於離散文字,可將雜訊類比為 `[MASK]` token:前向過程逐步將原始序列中的 token 以 `[MASK]` 取代,而模型則學習如何從完全遮蔽的序列中逐步「去遮蔽」,重建原始文字。

此類 **Masked Diffusion Models(MDMs)** 具有數項潛在優勢:

- **速度**:推論可明顯加快(例如可達數倍至十倍),因為前向傳遞次數取決於固定的 diffusion 步數,而非輸出長度。
- **適合特定任務**:由於其對整段序列進行整體性修正,而非逐 token 生成,因此在「中間補全」(fill-in-the-middle)等任務,例如程式碼補全,往往更為自然。
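To make the masked-diffusion idea of Section 6.2 concrete, the toy sketch below shows its two ingredients: a forward process that progressively replaces tokens with `[MASK]`, and a reverse loop that commits a fraction of the masked positions per step. The `denoise_fn` is a hypothetical stand-in for a trained model, and the step count and masking schedule are illustrative assumptions rather than a specific published algorithm.

```python
import random
from typing import Callable, List

MASK = "[MASK]"

def forward_mask(tokens: List[str], mask_ratio: float, rng: random.Random) -> List[str]:
    """Forward ('noising') process: replace a fraction of tokens with [MASK]."""
    out = list(tokens)
    for i in range(len(out)):
        if rng.random() < mask_ratio:
            out[i] = MASK
    return out

def reverse_unmask(
    seq: List[str],
    denoise_fn: Callable[[List[str]], List[str]],  # hypothetical model: proposes a token for every position
    steps: int,
    rng: random.Random,
) -> List[str]:
    """Reverse process: a fixed number of denoising steps, one forward pass each.

    Every step the model proposes tokens for the whole sequence, and a
    fraction of the remaining [MASK] positions is committed, so the number
    of forward passes depends on `steps`, not on the output length.
    """
    seq = list(seq)
    for _ in range(steps):
        masked = [i for i, t in enumerate(seq) if t == MASK]
        if not masked:
            break
        proposal = denoise_fn(seq)        # one forward pass over the full sequence
        k = max(1, len(masked) // 2)      # simple schedule: commit half the masks per step
        for i in rng.sample(masked, k):
            seq[i] = proposal[i]
    return seq
```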
---

### 6.3 Ongoing Research and Core Challenges
### 6.3 持續中的研究與核心挑戰

Research on Transformers remains highly active, spanning incremental architectural refinements and fundamental open problems. The interaction between these directions defines the current frontier:

- **Component Refinements**: Nearly every component is still being improved, including new optimizers (such as Muon), more efficient normalization layers (such as RMSNorm), and novel activation functions.
- **Data Curation and Model Collapse**: Data quality is becoming a critical concern. As the share of LLM-generated content on the web grows rapidly, training successive model generations on this "regenerated" data risks **model collapse**, in which models gradually lose diversity and real-world structure and performance degrades. This is pushing researchers toward more rigorous data filtering and annotation pipelines, and toward new training stages designed to safeguard long-term model quality.
- **Hardware Innovation**: The compute demands implied by scaling laws keep climbing, and existing hardware architectures are approaching their limits. This has spurred new hardware designs tailored to the Transformer (including both digital and analog approaches), aimed at improving latency and energy efficiency.
- **Fundamental Challenges: Continuous Learning, Personalization, and Safety**: Several deep problems remain open, such as moving models from a "train once, deploy statically" regime toward genuine **continuous learning**, achieving meaningful personalization while preserving privacy, and maintaining safety as capabilities grow.

In addition, **hallucination** remains a core challenge. Hallucinations are not merely random errors; they stem from the next-token prediction objective itself, which favors statistical plausibility over factual correctness, and they must therefore be mitigated by combining retrieval, external tools, and safety mechanisms.

These issues are tightly interconnected and together form the main engine driving Transformer research forward.

---

Transformer 相關研究仍十分活躍,涵蓋漸進式的架構改良與根本性的開放問題,兩者的交互作用界定了當前的研究前沿:

- **組件層面的持續改良(Component Refinements)**:幾乎所有組件都仍在持續改良之中,包括新的最佳化器(例如 Muon)、更高效的正規化層(例如 RMSNorm),以及新型啟動函數等。
- **資料整理與模型崩壞(Data Curation and Model Collapse)**:資料品質正成為關鍵議題。隨著網路上由 LLM 生成的內容比例快速增加,若新一代模型持續在這些「再生成資料」上訓練,可能導致所謂的 **model collapse**:模型逐漸失去多樣性與真實世界結構,效能反而下降。這迫使研究者發展更嚴謹的資料篩選與標註流程,並思考新的訓練階段,以確保長期模型品質。
- **硬體創新(Hardware Innovation)**:擴展定律對計算資源的需求持續攀升,現有硬體架構正逐漸逼近極限。因此出現專為 Transformer 設計的新一代硬體架構(包含數位與類比方案),目標在於提升延遲與能源效率。
- **根本性挑戰:持續學習、個人化與安全性(Fundamental Challenges: Continuous Learning, Personalization, and Safety)**:多項深層問題仍未解決,例如如何讓模型從「一次訓練、靜態部署」轉向真正的 **持續學習(continuous learning)**,如何在保護隱私前提下實現有意義的個人化,以及如何在提升能力的同時維持安全性。

此外,「幻覺(hallucination)」問題亦屬核心挑戰之一。幻覺並非單純的隨機錯誤,而是源自下一個 token 預測目標本身:模型優先追求統計上的合理性,而非事實正確性,因此必須結合檢索、外部工具與安全機制來加以抑制。

這些議題環環相扣,構成 Transformer 研究持續向前推進的主要動力。

---

## 7.0 Conclusion
## 7.0 結論

The Transformer architecture has undergone a remarkable evolution, progressing from a novel solution for machine translation to the foundational framework of modern generative AI. Its core innovation—the self-attention mechanism—overcame key limitations of earlier architectures and provided a scalable, parallelizable approach for modeling complex relationships in sequential data. This report has traced its development through a sequence of critical optimizations, including relative positional encodings, efficient attention variants, and hardware-aware algorithms such as Flash Attention, which collectively enabled the scaling of Transformers into today's massive Large Language Models.

We have also seen how the modular design of the architecture gave rise to multiple model variants—Encoder-only, Decoder-only, and Encoder–Decoder—with the Decoder-only configuration becoming the standard for generative LLMs. The three-stage training paradigm of pre-training, supervised fine-tuning, and preference tuning now serves as the backbone for aligning these models with complex human intents.

The influence of the Transformer has extended beyond text into cross-modal domains such as vision, further demonstrating its generality. Looking ahead, progress will be pulled along two axes, efficiency and innovation: new generation paradigms such as diffusion-based LLMs promise significant gains in speed and performance, while the research community must still confront fundamental questions of data quality, hardware design, continuous learning, and safety.

In sum, the Transformer is not a static endpoint but a continually evolving architectural framework. As new techniques and applications keep emerging, the Transformer and its successors are likely to remain the core engine driving the frontier of artificial intelligence for a long time to come.

---

Transformer 架構經歷了非凡的演化歷程:從機器翻譯的創新解決方案,發展為現代生成式 AI 的基礎框架。其核心創新,亦即自注意力機制,克服了先前架構的關鍵限制,為建模序列資料中的複雜關係提供了可擴展、可平行化的方法。本報告追溯了其發展過程中的一系列關鍵最佳化,包括相對位置編碼、高效注意力變體,以及 Flash Attention 等面向硬體的演算法;這些改良共同促成 Transformer 擴展為今日規模龐大的大型語言模型。

我們也看到,此架構的模組化設計催生了多種模型變體(僅編碼器、僅解碼器、編碼器–解碼器),其中僅解碼器配置已成為生成式 LLM 的標準。由預訓練、監督式微調與偏好調整構成的三階段訓練典範,如今是讓這些模型對齊複雜人類意圖的骨幹。

Transformer 的影響力已超越文字,延伸至視覺等跨模態領域,進一步展示其通用性。展望未來,整體發展方向將同時受到「效率」與「創新」兩大軸線的牽引:一方面,新型生成典範(例如 diffusion-based LLMs)有望帶來顯著的速度與效能提升;另一方面,研究社群仍須面對資料品質、硬體設計、持續學習與安全性等基本問題。

總結而言,Transformer 並非一個靜態終點,而是一套持續演進的架構框架。隨著新技術與新應用不斷湧現,Transformer 及其後續變體極可能在相當長的一段時間內,持續扮演推動人工智慧前沿發展的核心角色。