# [Week 11] LLM Foundations [Course Catalog](https://areganti.notion.site/Applied-LLMs-Mastery-2024-562ddaa27791463e9a1286199325045c) [Course Link](https://areganti.notion.site/Week-11-LLM-Foundations-f6c9f5f8e2ef4522a5358ac57361073c)

## ETMI5: Explain to Me in 5

In the first week of our course, we looked at the difference between two types of machine learning models: generative models, which LLMs are a part of, and discriminative models. Generative models are good at learning from data and creating new things. This week, we'll learn how LLMs were developed by looking at the history of neural networks used in language processing. We start with the basics of Recurrent Neural Networks (RNNs) and move to more advanced architectures like sequence-to-sequence models, attention mechanisms, and transformers. We'll also review some of the earlier language models that used transformers, like BERT and GPT. Finally, we'll talk about how the LLMs we use today were built on these earlier developments.

## Generative vs Discriminative models

In the first week, we briefly covered the idea of Generative AI. It's essential to note that all machine learning models fall into one of two categories: generative or discriminative. LLMs belong to the generative category, meaning they learn text features and produce them for various applications. While we won't delve deeply into the mathematical intricacies, it's important to grasp the distinctions between generative and discriminative models to gain a general understanding of how LLMs operate:

### **Generative Models**

Generative models try to understand how data is generated. They learn the patterns and structures in the data so they can create new, similar data points.

For example, if you have a generative model for images of dogs, it learns what features and characteristics make up a dog (like fur, ears, and tails), and then it can generate new images of dogs that look realistic, even though they've never been seen before.

### **Discriminative Models**

Discriminative models, on the other hand, are focused on making decisions or predictions based on the input they receive.

Using the same example of images of dogs, a discriminative model would look at an image and decide whether it contains a dog or not. It doesn't worry about how the data was generated; it's just concerned with making the right decision based on the input it's given.

Therefore, generative models learn the underlying patterns in the data to create new samples, while discriminative models focus on making decisions or predictions based on the input data without worrying about how the data was generated.

**Essentially, generative models create, while discriminative models classify or predict.**
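To make the distinction concrete, here is a minimal sketch in Python (NumPy only; the data, the single 1-D feature, and all variable names are illustrative, not from any real library): the generative half models each class's data distribution and can sample new points, while the discriminative half only learns a decision rule for P(class | x).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D data: one feature value per example for two classes ("dog" vs "not dog").
x_dog = rng.normal(loc=5.0, scale=1.0, size=200)
x_other = rng.normal(loc=1.0, scale=1.0, size=200)

# --- Generative view: model p(x | class), then sample brand-new data points ---
dog_mean, dog_std = x_dog.mean(), x_dog.std()
new_dogs = rng.normal(dog_mean, dog_std, size=5)   # newly generated "dog-like" values
print("sampled dog-like features:", new_dogs.round(2))

# --- Discriminative view: model p(class | x) directly -------------------------
# A simple logistic-regression decision rule trained by gradient descent.
X = np.concatenate([x_dog, x_other])
y = np.concatenate([np.ones_like(x_dog), np.zeros_like(x_other)])

w, b = 0.0, 0.0
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-(w * X + b)))   # predicted P(dog | x)
    grad_w = np.mean((p - y) * X)            # gradient of the cross-entropy loss
    grad_b = np.mean(p - y)
    w -= 0.1 * grad_w
    b -= 0.1 * grad_b

print("P(dog | x=4.5):", 1.0 / (1.0 + np.exp(-(w * 4.5 + b))))
```

The generative model can create new "dog-like" values because it models the data itself; the discriminative model can only answer the classification question it was trained for.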
## Neural Networks for Language

For several years, neural networks have been integral to machine learning. Among these, a prominent class of models heavily reliant on neural networks is referred to as deep learning models. The first neural network type introduced for text generation was the Recurrent Neural Network (RNN). Improved variants emerged later, such as Long Short-Term Memory networks (LSTMs), Bidirectional LSTMs, and Gated Recurrent Units (GRUs). Now, let's explore how RNNs generate text.

### Recurrent Neural Network (RNN)

Recurrent Neural Networks (RNNs) are a type of artificial neural network designed to handle sequential data by allowing information to persist through loops within the network architecture. Traditional neural networks lack the ability to retain information over time, which can be a major limitation when dealing with sequential data like text, audio, or time-series data.

The basic principle behind RNNs is that they have connections that form a directed cycle, allowing information to be passed from one step of the network to the next. This means that the output of the network at a particular time step depends not only on the current input but also on the previous inputs and the internal state of the network, which captures information from earlier time steps.

![image](https://hackmd.io/_uploads/HJVcmx4AA.png)

Image Source: https://colah.github.io/posts/2015-08-Understanding-LSTMs/

Here's a simplified explanation of how RNNs work (a code sketch follows this list):

1. **Input Processing**: At each time step $t$, the RNN receives an input $x_t$. This input could be a single element of a sequence (e.g., a word in a sentence) or a feature vector representing some aspect of the input data.
2. **State Update**: The input $x_t$ is combined with the internal state $h_{t-1}$ of the network from the previous time step to produce a new state $h_t$ using a set of weighted connections (parameters) within the network. This update process allows the network to retain information from previous time steps.
3. **Output Generation**: The current state $h_t$ is used to generate an output $y_t$ at the current time step. This output can be used for various tasks, such as classification, prediction, or sequence generation.
4. **Recurrent Connections**: The key feature of RNNs is the presence of recurrent connections, which allow information to flow through the network over time. These connections create a form of memory within the network, enabling it to capture dependencies and patterns in sequential data.
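To make the four steps above concrete, here is a minimal sketch of a vanilla RNN forward pass in NumPy; the dimensions, weight names, and random "embeddings" are illustrative, not taken from any particular library.

```python
import numpy as np

rng = np.random.default_rng(0)

input_dim, hidden_dim, output_dim = 8, 16, 8

# Parameters (weighted connections) shared across all time steps.
W_xh = rng.normal(0, 0.1, (hidden_dim, input_dim))
W_hh = rng.normal(0, 0.1, (hidden_dim, hidden_dim))
W_hy = rng.normal(0, 0.1, (output_dim, hidden_dim))
b_h = np.zeros(hidden_dim)
b_y = np.zeros(output_dim)

def rnn_forward(inputs):
    """Run the RNN over a sequence of input vectors x_1..x_T."""
    h = np.zeros(hidden_dim)                        # initial internal state h_0
    outputs = []
    for x_t in inputs:                              # 1. input processing, one step at a time
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)    # 2. state update: h_t from x_t and h_{t-1}
        y_t = W_hy @ h + b_y                        # 3. output generation at this step
        outputs.append(y_t)                         # 4. h carries information forward (recurrence)
    return outputs, h

# Toy sequence of 5 random "token embeddings".
sequence = [rng.normal(size=input_dim) for _ in range(5)]
outputs, final_state = rnn_forward(sequence)
print(len(outputs), final_state.shape)
```

In practice, $y_t$ would be passed through a softmax over the vocabulary to predict the next token, and the weights would be learned by backpropagation through time.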
While RNNs are powerful models for handling sequential data, they suffer from certain limitations, such as difficulty learning long-range dependencies and vanishing/exploding gradient problems during training. To address these issues, more advanced variants of RNNs, such as Long Short-Term Memory (LSTM) networks and Gated Recurrent Units (GRUs), have been developed. These architectures incorporate mechanisms for better handling long-term dependencies and mitigating gradient-related problems, leading to improved performance on a wide range of sequential data tasks.

### Long Short-Term Memory (LSTM)

LSTM networks are thus an enhanced version of RNNs. Like RNNs, they are designed to handle sequences of data such as text, but with the improvements below:

![image](https://hackmd.io/_uploads/HyR9Qe4A0.png)

![image](https://hackmd.io/_uploads/BkbsXeE0C.png)

Image Source: https://colah.github.io/posts/2015-08-Understanding-LSTMs/

1. **Memory Cell**: LSTMs have a special memory cell that can store information over time.
2. **Gating Mechanism**: LSTMs use gates to control the flow of information into and out of the memory cell:
    - Input Gate: Decides how much new information to keep.
    - Forget Gate: Decides how much old information to forget.
    - Output Gate: Decides how much of the current cell state to output.
3. **Gradient Flow**: LSTMs help gradients flow better during training, which helps in learning from long sequences of data.
4. **Learning Long-Term Dependencies**: LSTMs are good at remembering important information from earlier in the sequence, making them useful for tasks where understanding context over long distances is crucial.

Therefore, LSTMs are better at handling sequences by remembering important information and forgetting what's not needed, which makes them more effective than traditional RNNs for tasks like language processing.
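A minimal sketch of a single LSTM step in NumPy, showing how the input, forget, and output gates control the memory cell; the shapes and parameter names are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step: returns the new hidden state h_t and cell state c_t."""
    xh = np.concatenate([x_t, h_prev])                      # gates see [x_t, h_{t-1}]
    i = sigmoid(params["W_i"] @ xh + params["b_i"])         # input gate: how much new info to keep
    f = sigmoid(params["W_f"] @ xh + params["b_f"])         # forget gate: how much old info to forget
    o = sigmoid(params["W_o"] @ xh + params["b_o"])         # output gate: how much cell state to expose
    c_tilde = np.tanh(params["W_c"] @ xh + params["b_c"])   # candidate new memory
    c_t = f * c_prev + i * c_tilde                          # memory cell: blend old and new information
    h_t = o * np.tanh(c_t)                                  # hidden state read out from the cell
    return h_t, c_t

# Toy dimensions and random parameters.
rng = np.random.default_rng(0)
input_dim, hidden_dim = 8, 16
params = {}
for name in ["i", "f", "o", "c"]:
    params[f"W_{name}"] = rng.normal(0, 0.1, (hidden_dim, input_dim + hidden_dim))
    params[f"b_{name}"] = np.zeros(hidden_dim)

h, c = np.zeros(hidden_dim), np.zeros(hidden_dim)
for _ in range(5):                                          # run over a toy sequence of 5 inputs
    h, c = lstm_step(rng.normal(size=input_dim), h, c, params)
print(h.shape, c.shape)
```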
Both RNNs and LSTMs (and their variants) are widely used for language modeling tasks, where the goal is to predict the next word in a sequence of words. They can learn the underlying structure of language and generate coherent text. However, they struggle to handle input sequences of variable lengths and generate output sequences of variable lengths because their fixed-size hidden states limit their ability to capture long-range dependencies and maintain context over time.

### Sequence-to-Sequence (Seq2Seq) models

That's where Sequence-to-Sequence (Seq2Seq) models come in; they employ an encoder-decoder architecture, where the input sequence is encoded into a fixed-size representation (context vector) by the encoder and then decoded into an output sequence by the decoder. This architecture allows Seq2Seq models to handle sequences of variable lengths and effectively capture the semantic meaning and structure of the input sequence while generating the corresponding output sequence. A simple Seq2Seq model is depicted below. Each unit in the Seq2Seq model is still an RNN-type architecture.

We won't dive too deep into the workings here for brevity; [this](https://www.analyticsvidhya.com/blog/2020/08/a-simple-introduction-to-sequence-to-sequence-models/#:~:text=Sequence%20to%20Sequence%20(often%20abbreviated,Chatbots%2C%20Text%20Summarization%2C%20etc.) article is a great read for those interested:

![image](https://hackmd.io/_uploads/rJe27e4AR.png)

Image Source: https://towardsdatascience.com/sequence-to-sequence-model-introduction-and-concepts-44d9b41cd42d

### Seq2Seq models + Attention

![image](https://hackmd.io/_uploads/BkN2QgVAR.png)

Image Source: https://lena-voita.github.io/nlp_course/seq2seq_and_attention.html

The problem with traditional Seq2Seq models lies in their inability to effectively handle long input sequences, especially when generating output sequences of variable lengths. In standard Seq2Seq models, a fixed-length context vector is used to summarize the entire input sequence, which can lead to information loss, particularly for long sequences. Additionally, when generating output sequences, the decoder may struggle to focus on relevant parts of the input sequence, resulting in suboptimal translations or predictions.

To address these issues, attention mechanisms were introduced. Attention mechanisms allow Seq2Seq models to dynamically focus on different parts of the input sequence during the decoding process.

**Here's how attention works** (a code sketch follows this list):

1. **Encoder Representation**: First, the input sequence is processed by an encoder. The encoder converts each word or element of the input sequence into a hidden state. These hidden states represent different parts of the input sequence and contain information about the sequence's content and structure.
2. **Calculating Attention Weights**: During decoding, the decoder needs to decide which parts of the input sequence to focus on. To do this, it calculates attention weights. These weights indicate the relevance or importance of each encoder hidden state to the current decoding step. Essentially, the model is trying to determine which parts of the input sequence are most relevant for generating the next output token.
3. **Softmax Normalization**: After calculating the attention weights, the model normalizes them using a softmax function. This ensures that the attention weights sum up to one, effectively turning them into a probability distribution. By doing this, the model can ensure that it allocates its attention appropriately across different parts of the input sequence.
4. **Weighted Sum**: With the attention weights calculated and normalized, the model then takes a weighted sum of the encoder hidden states. Essentially, it combines information from different parts of the input sequence based on their importance or relevance as determined by the attention weights. This weighted sum represents the "attended" information from the input sequence, focusing on the parts that are most relevant for the current decoding step.
5. **Combining Context with Decoder State**: Finally, the context vector obtained from the weighted sum is combined with the current state of the decoder. This combined representation contains information from both the input sequence (through the context vector) and the decoder's previous state. It serves as the basis for generating the output of the decoder for the current decoding step.
6. **Repeating for Each Decoding Step**: Steps 2 to 5 are repeated for each decoding step until the end-of-sequence token is generated or a maximum length is reached. At each step, the attention mechanism helps the model decide where to focus its attention in the input sequence, enabling it to generate accurate and contextually relevant output sequences.
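Here is a minimal NumPy sketch of steps 2 to 5 for one decoding step, using plain dot-product scores; the dimensions, names, and the simple concatenation used to combine the context with the decoder state are illustrative simplifications.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                      # numerical stability
    e = np.exp(z)
    return e / e.sum()

def attention_step(encoder_states, decoder_state):
    """Compute the attended context vector for one decoder step."""
    # 2. Score each encoder hidden state against the current decoder state.
    scores = encoder_states @ decoder_state              # shape: (src_len,)
    # 3. Normalize scores into a probability distribution (attention weights).
    weights = softmax(scores)
    # 4. Weighted sum of encoder states = context vector ("attended" information).
    context = weights @ encoder_states                   # shape: (hidden_dim,)
    # 5. Combine context with the decoder state (here: simple concatenation).
    combined = np.concatenate([context, decoder_state])
    return combined, weights

rng = np.random.default_rng(0)
src_len, hidden_dim = 6, 16
encoder_states = rng.normal(size=(src_len, hidden_dim))  # 1. encoder outputs, one per source token
decoder_state = rng.normal(size=hidden_dim)

combined, weights = attention_step(encoder_states, decoder_state)
print(weights.round(3), weights.sum())                   # weights sum to 1.0
```

In a full model, step 6 would wrap this in a loop over decoder steps, and the scores would typically come from a small learned scoring function rather than a raw dot product.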
### Transformer Models

The problem with Seq2Seq models with attention lies in their computational inefficiency and inability to capture dependencies effectively across long sequences. While attention mechanisms significantly improve the model's ability to focus on relevant parts of the input sequence during decoding, they also introduce computational overhead due to the need to compute attention weights for each decoder step. Additionally, as we mentioned before, traditional Seq2Seq models with attention still rely on RNN or LSTM networks, which have limitations in capturing long-range dependencies.

The Transformer model was introduced to address these limitations and improve the efficiency and effectiveness of sequence-to-sequence tasks. Here's how the Transformer model solves the problems of Seq2Seq models with attention (a code sketch follows this list):

![image](https://hackmd.io/_uploads/Hyl6QlNAC.png)

Image Source: https://arxiv.org/pdf/1706.03762.pdf

1. **Self-Attention Mechanism**: Instead of relying solely on attention mechanisms between the encoder and decoder, the Transformer model introduces a self-attention mechanism. This mechanism allows each position in the input sequence to attend to all other positions, capturing dependencies across the entire input sequence simultaneously. Self-attention enables the model to capture long-range dependencies more effectively compared to traditional Seq2Seq models with attention.
2. **Parallelization**: The Transformer model relies on self-attention layers that can be computed in parallel for each position in the input sequence. This parallelization greatly improves the model's computational efficiency compared to traditional Seq2Seq models with recurrent layers, which process sequences sequentially. As a result, the Transformer model can process sequences much faster, making it more suitable for handling long sequences and large-scale datasets.
3. **Positional Encoding**: Since the Transformer model does not use recurrent layers, it lacks inherent information about the order of elements in the input sequence. To address this, positional encoding is added to the input embeddings to provide information about the position of each element in the sequence. Positional encoding allows the model to distinguish between elements based on their position, ensuring that the model can effectively process sequences with ordered elements.
4. **Transformer Architecture**: The Transformer model consists of an encoder-decoder architecture, similar to traditional Seq2Seq models. However, it replaces recurrent layers with self-attention layers, which enables the model to capture dependencies across long sequences more efficiently. Additionally, the Transformer architecture allows for greater flexibility and scalability, making it easier to train and deploy on various tasks and datasets.
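A minimal NumPy sketch of single-head scaled dot-product self-attention with sinusoidal positional encoding, corresponding to items 1 to 3 above (no masking, no multi-head splitting; sizes and names are illustrative).

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model = 6, 32

# Toy token embeddings for one input sequence.
x = rng.normal(size=(seq_len, d_model))

# 3. Sinusoidal positional encoding, added so the model knows token order.
pos = np.arange(seq_len)[:, None]                         # (seq_len, 1)
dim = np.arange(0, d_model, 2)[None, :]                   # (1, d_model/2)
angle = pos / np.power(10000.0, dim / d_model)
pe = np.zeros((seq_len, d_model))
pe[:, 0::2] = np.sin(angle)
pe[:, 1::2] = np.cos(angle)
x = x + pe

# 1. Scaled dot-product self-attention: every position attends to every position.
W_q = rng.normal(0, 0.1, (d_model, d_model))
W_k = rng.normal(0, 0.1, (d_model, d_model))
W_v = rng.normal(0, 0.1, (d_model, d_model))

Q, K, V = x @ W_q, x @ W_k, x @ W_v
scores = Q @ K.T / np.sqrt(d_model)                       # (seq_len, seq_len) pairwise scores
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = weights / weights.sum(axis=-1, keepdims=True)   # softmax over each row

# 2. The whole (seq_len x seq_len) attention map comes from one matrix product,
#    i.e., all positions are processed in parallel, not step by step as in an RNN.
output = weights @ V                                      # (seq_len, d_model)
print(output.shape, weights[0].sum())
```

The single matrix product that produces the full attention map is exactly what makes the computation parallel across positions, in contrast to the sequential recurrence of RNN-based Seq2Seq models.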
In summary, the Transformer model addresses the limitations of Seq2Seq models with attention by introducing self-attention mechanisms, parallelization, positional encoding, and a flexible architecture. These advancements improve the model's ability to capture long-range dependencies, process sequences efficiently, and achieve state-of-the-art performance on various sequence-to-sequence tasks.

### Older Language Models

Although LLMs have gained significant attention recently, especially with models like GPT from OpenAI, it's important to recognize that the groundwork for this architecture was laid by earlier models such as BERT, GPT (older versions), and T5, explained below.

LLMs like BERT (Bidirectional Encoder Representations from Transformers), GPT (Generative Pre-trained Transformer), and T5 (Text-To-Text Transfer Transformer) build on top of the concepts introduced by the Transformer model (described in the previous sections) using the following steps (a code sketch follows this list):

1. **Pre-training and Fine-Tuning**: These models utilize a pre-training and fine-tuning approach. During pre-training, the model is trained on large-scale corpora using unsupervised learning objectives, such as masked language modeling (BERT), autoregressive language modeling (GPT), or text-to-text pre-training (T5). This pre-training phase allows the model to learn rich representations of language and general knowledge from large amounts of text data. After pre-training, the model can be fine-tuned on specific downstream tasks with labeled data, enabling it to adapt its learned representations to perform various NLP tasks such as text classification, question answering, and machine translation.
2. **Bidirectional Context**: BERT introduced bidirectional context modeling by utilizing a masked language modeling objective. Instead of processing text in a left-to-right or right-to-left manner, BERT is able to consider context from both directions by masking some of the input tokens and predicting them based on the surrounding context. This bidirectional context modeling enables BERT to capture deeper semantic relationships and dependencies within text, leading to improved performance on a wide range of NLP tasks.
3. **Autoregressive Generation**: GPT models leverage autoregressive generation, where the model predicts the next token in a sequence based on the previously generated tokens. This approach allows GPT models to generate coherent and contextually relevant text by considering the entire history of the generated sequence. GPT models are particularly effective for tasks that involve generating natural language, such as text generation, dialogue generation, and summarization.
4. **Text-to-Text Approach**: T5 introduces a unified text-to-text framework, where all NLP tasks are framed as text-to-text mapping problems. This approach unifies various NLP tasks, such as translation, classification, summarization, and question answering, under a single framework, simplifying the training and deployment process. T5 achieves this by representing both the input and output of each task as textual strings, enabling the model to learn a single mapping function that can be applied across different tasks.
5. **Large-Scale Training**: These models are trained on large-scale datasets containing billions of tokens, leveraging massive computational resources and distributed training techniques. By training on extensive data and with powerful hardware, these models can capture rich linguistic patterns and semantic relationships, leading to significant improvements in performance across a wide range of NLP tasks.
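To contrast the masked (BERT-style) and autoregressive (GPT-style) pre-training objectives from items 1 to 3, here is a toy pure-Python sketch; the sentence, the 30% masking rate (BERT used roughly 15%), and the `[MASK]` string are illustrative.

```python
import random

random.seed(1)
tokens = ["the", "cat", "sat", "on", "the", "mat"]

# BERT-style masked language modeling: hide random tokens and predict them
# from context on BOTH sides (bidirectional).
masked = list(tokens)
targets = {}
for i in range(len(masked)):
    if random.random() < 0.30:          # mask ~30% here so the toy example shows several masks
        targets[i] = masked[i]
        masked[i] = "[MASK]"
print("MLM input :", masked)
print("MLM labels:", targets)           # the model must fill in the masked positions

# GPT-style autoregressive (causal) language modeling: at every position,
# predict the NEXT token given only the tokens to its left.
for t in range(1, len(tokens)):
    context, target = tokens[:t], tokens[t]
    print(f"context={context!r} -> predict {target!r}")
```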
### Large Language Models

The latest LLMs, such as Llama and ChatGPT, represent significant advancements over earlier models like BERT and GPT in several key ways:

1. **Task Specialization**: While earlier LLMs like BERT and GPT were designed to perform a wide range of NLP tasks, including text classification, language generation, and question answering, newer models like Llama and ChatGPT are more specialized. For example, Llama is specifically tailored for multimodal tasks, such as image captioning and visual question answering, while ChatGPT is optimized for conversational applications, such as dialogue generation and chatbots.
2. **Multimodal Capabilities**: Llama and other recent LLMs integrate multimodal capabilities, allowing them to process and generate text in conjunction with other modalities such as images, audio, and video. This enables LLMs to perform tasks that require understanding and generating content across multiple modalities, opening up new possibilities for applications like image captioning, video summarization, and multimodal dialogue systems.
3. **Improved Efficiency**: Recent advancements in LLM architecture and training methodologies have led to improvements in efficiency, allowing models like Llama and ChatGPT to achieve comparable performance to their predecessors with fewer parameters and computational resources. This increased efficiency makes it more practical to deploy these models in real-world applications and reduces the environmental impact associated with training large models.
4. **Fine-Tuning and Transfer Learning**: LLMs like ChatGPT are often fine-tuned on specific datasets or tasks to further improve performance in targeted domains. By fine-tuning on domain-specific data, these models can adapt their pre-trained knowledge to better suit the requirements of particular applications, leading to enhanced performance and generalization.
5. **Interactive and Dynamic Responses**: ChatGPT and similar conversational models are designed to generate interactive and dynamic responses in natural language conversations. These models leverage context from previous turns in the conversation to generate more coherent and contextually relevant responses, making them more suitable for human-like interaction in chatbot applications and dialogue systems (see the sketch after this list).
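To illustrate item 5, here is a minimal, hypothetical sketch of how a conversational application keeps multi-turn context: every user and assistant turn is appended to a history, and the whole history is passed to the model so a reply can refer back to earlier turns. `call_model` is a placeholder, not a real API.

```python
from typing import Dict, List

def call_model(messages: List[Dict[str, str]]) -> str:
    """Placeholder for a real chat-model call; it receives the full conversation history."""
    last_user = [m["content"] for m in messages if m["role"] == "user"][-1]
    return f"(reply that can reference earlier turns; latest question: {last_user!r})"

history: List[Dict[str, str]] = [
    {"role": "system", "content": "You are a helpful assistant."}
]

def chat(user_message: str) -> str:
    history.append({"role": "user", "content": user_message})
    reply = call_model(history)          # the model sees ALL previous turns, not just the last one
    history.append({"role": "assistant", "content": reply})
    return reply

print(chat("Who wrote 'Attention Is All You Need'?"))
print(chat("When was it published?"))    # "it" only makes sense because the history is kept
```

The second question only resolves correctly because the full history, including the first exchange, is sent along with it.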
## Read/Watch These Resources (Optional)

1. Understanding LSTM Networks: https://colah.github.io/posts/2015-08-Understanding-LSTMs/
2. Sequence to Sequence (seq2seq) and Attention: https://lena-voita.github.io/nlp_course/seq2seq_and_attention.html
3. Sequence to Sequence models: https://www.youtube.com/watch?v=kklo05So99U
4. How Attention works in Deep Learning: understanding the attention mechanism in sequence models: https://theaisummer.com/attention/
5. Intro to LLMs:
    1. https://www.youtube.com/watch?v=zjkBMFhNj_g&t=1845s
    2. https://www.youtube.com/watch?v=zizonToFXDs
6. Transformers: https://www.youtube.com/watch?v=wl3mbqOtlmM

## Read These Papers (Optional)

1. https://arxiv.org/abs/1706.03762
2. https://arxiv.org/abs/2005.14165
3. https://arxiv.org/abs/1910.10683