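Hung-yi Lee_Introduction to Generative AI 2024_Lecture 10: How today's language models do text continuation (next-token prediction): a brief look at the Transformer (students already familiar with the Transformer can skip this lecture)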

tags: `Hung-yi Lee` `NTU` `生成式導論 2024`

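Course playlist

Lecture 10: How today's language models do text continuation (next-token prediction): a brief look at the Transformer (students already familiar with the Transformer can skip this lecture)

Course link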

Introduction

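As covered earlier in the course, today's large language models are simply doing text continuation (next-token prediction), which is why Yann LeCun is fuming (my own aside). The basic idea is to learn parameters from a large amount of language data. The model behind that learning is the Transformer, a kind of neural network.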

Model evolution

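This slide shows the history of how language models evolved. The Transformer covered today is a simplified version; for the details, see the course link above.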

Transformer overview

The slide briefly illustrates the whole process from input to output. The key point is that steps 3 and 4 together make up one Transformer Block.

1. Converting text into tokens

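Reference link_Byte-Pair Encoding tokenization

The first step in processing language is to convert the text into tokens, the most basic unit the model works with. The conversion follows a token list that is defined according to what we need from the model.

In the slide's example, "introduction of generative ai" becomes int, roduction, of, gener, ative, ai.

In Chinese, a single character is usually treated as one token.

The token list can also be obtained with an algorithm; the course gives a reference link for the BPE algorithm. In short, it looks through a large amount of data for letters or symbols that frequently appear together, and any such frequent combination becomes a token.

Take "generative": the whole string does not appear together often enough, whereas "gener" and "ative" each appear very often, so the word is split into two tokens.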
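The "merge what frequently appears together" idea can be sketched in a few lines. This is a minimal character-level sketch on a toy word list; the function name `learn_bpe_merges` and the corpus are assumptions made for illustration, and production tokenizers typically work on bytes, respect word boundaries, and train on far more data.

```python
from collections import Counter

def learn_bpe_merges(words, num_merges):
    """Toy BPE: repeatedly fuse the most frequent adjacent symbol pair."""
    # Start with every word split into single characters.
    corpus = [list(w) for w in words]
    merges = []
    for _ in range(num_merges):
        # Count every adjacent pair of symbols across the corpus.
        pairs = Counter()
        for symbols in corpus:
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # ties go to the pair seen first
        merges.append(best)
        # Rewrite every word with the chosen pair fused into one symbol.
        new_corpus = []
        for symbols in corpus:
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_corpus.append(out)
        corpus = new_corpus
    return merges, corpus

merges, corpus = learn_bpe_merges(["low", "lower", "lowest", "slow"], num_merges=2)
print(merges)  # the pairs that were fused, in order
print(corpus)  # each word as a list of the resulting tokens
```

After two merges the frequent string "low" has become a single symbol, while rarer endings such as "er" and "est" are still split into characters, which mirrors why a long word can end up as several tokens.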

1. Converting text into tokens

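Reference link_Learn about language model tokenization

The reference link is a page on OpenAI's website where you can see how many tokens your input is converted into, and from that get a feel for how the text is split.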

2. Understanding each token - meaning

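After the text has been converted into tokens, each token goes through an Embedding that turns it into a vector. Tokens with similar meanings end up close to each other in the vector space. In the slide's example, animals, plants, and verbs each form their own cluster.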

2. Understanding each token - meaning

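The vectors for these tokens are obtained through training; they are the parameters mentioned in the course, and nobody has to set them by hand. One key point: this kind of embedding does not take context into account.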
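A minimal sketch of why a token embedding is context-free: it is a table lookup, so the same token always maps to the same vector. The random values and the 4-dimensional size are assumptions standing in for trained parameters.

```python
import random

random.seed(0)
vocab = ["int", "roduction", "of", "gener", "ative", "ai"]
dim = 4  # toy size, chosen only for illustration

# One vector per token. Random here; in a real model these are trained parameters.
embedding_table = {tok: [random.uniform(-1, 1) for _ in range(dim)] for tok in vocab}

def embed(tokens):
    """Look up each token's vector; neighbouring tokens are never consulted."""
    return [embedding_table[t] for t in tokens]

a = embed(["gener", "ative", "ai"])
b = embed(["ai", "of", "gener"])
# "ai" gets the identical vector in both sequences, whatever surrounds it.
print(a[2] == b[0])
```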

2. Understanding each token - position

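The meaning of a token comes from the embedding, but we also need to consider its position, because the same token at a different position may carry a different meaning. So the position information is added to the semantic information.

This position information is called Positional Embedding. It can be hand-designed or learned; in the original Transformer, at least, it is hand-designed.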
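As an illustration of a hand-designed position vector, the sketch below uses the sine/cosine scheme from the "Attention Is All You Need" paper and adds it to each token vector. The helper names and the toy 4-dimensional vectors are assumptions for this example.

```python
import math

def sinusoidal_position(pos, dim):
    """Hand-designed position vector: sin on even indices, cos on odd indices."""
    vec = []
    for i in range(dim):
        angle = pos / (10000 ** ((2 * (i // 2)) / dim))
        vec.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return vec

def add_position(token_vectors):
    """Add the position vector to each token's semantic vector, element by element."""
    dim = len(token_vectors[0])
    return [
        [x + p for x, p in zip(vec, sinusoidal_position(pos, dim))]
        for pos, vec in enumerate(token_vectors)
    ]

# The same token (same semantic vector) at position 0 and position 1.
same_token_twice = [[0.5, 0.5, 0.5, 0.5], [0.5, 0.5, 0.5, 0.5]]
with_pos = add_position(same_token_twice)
print(with_pos[0])
print(with_pos[0] != with_pos[1])
```

After the addition, the two copies of the token no longer have identical vectors, so later layers can tell the positions apart.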

3. Attention: considering context

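In the slide's example, with plain word embedding the character 果 is just 果: it gets the same vector wherever it appears. Once context is taken into account, however, the two occurrences of 果 take on different meanings because of attention. An embedding produced this way is called Contextualized.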

3. Attention: considering context

Reference paper_Attention Is All You Need

Attention Is All You Need (Traditional Chinese translation)

3. Attention: considering context

This part explains what attention does:

The first step is to compute the relevance between tokens.

3. Attention: considering context

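The next step is to aggregate the relevant information.

This is a heavily simplified explanation; for a clearer treatment, see the earlier courses.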
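The two steps (score the relevance, then aggregate) can be sketched as follows. This is a simplified sketch that uses a plain dot product as the relevance function and no learned parameters, which is an assumption; an actual Transformer scores relevance with learned query and key projections.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def attend(vectors, i):
    """Return (context-aware vector for position i, attention weights used)."""
    # Step 1: relevance of token i to every token (plain dot product here).
    scores = [dot(vectors[i], v) for v in vectors]
    weights = softmax(scores)
    # Step 2: aggregate, i.e. take a weighted sum of all token vectors.
    dim = len(vectors[0])
    mixed = [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]
    return mixed, weights

vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
out, weights = attend(vectors, 0)
print(weights)  # tokens 0 and 2 are more relevant to token 0 than token 1 is
print(out)
```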

3. Attention: considering context

Collected together, the relevance scores between all pairs of tokens form an attention matrix.

3. Attention: considering context

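In practice, each token only considers information from positions before it. In the slide's example, the third token only considers the first and second tokens (standard implementations also let a token attend to itself). This is called Causal Attention, and it is explained further below.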
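A minimal sketch of how the causal restriction shapes the attention matrix: row i keeps only columns up to i, and everything to the right gets weight zero. The function name is hypothetical, and letting a token attend to itself is an assumption that follows common implementations.

```python
import math

def causal_attention_matrix(scores):
    """Row i keeps only columns 0..i, then is normalised with softmax."""
    n = len(scores)
    matrix = []
    for i in range(n):
        visible = scores[i][: i + 1]  # positions up to and including i
        m = max(visible)
        exps = [math.exp(s - m) for s in visible]
        total = sum(exps)
        # Later tokens are invisible to position i, so their weight is 0.
        row = [e / total for e in exps] + [0.0] * (n - i - 1)
        matrix.append(row)
    return matrix

raw_scores = [[0.0, 0.0, 0.0] for _ in range(3)]  # all tokens equally relevant
causal = causal_attention_matrix(raw_scores)
for row in causal:
    print(row)
```

The result is lower-triangular: the first token can only look at itself, while the third spreads its weight over all three positions.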

3. Attention: Multi-head Attention

In general, a model has several attention modules computing relevance in parallel; this is called Multi-head Attention.

3. Attention: Multi-head Attention

Just as before, each of the attention heads performs its own weighted sum.

3. Attention: Multi-head Attention

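After the multiple attention heads have done their computation, the results pass through a Feed Forward network to produce one vector. This whole structure is called a Transformer Block.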
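The structure described here (several heads, then feed forward) can be sketched in plain Python. This is a toy sketch under several assumptions: each head simply works on its own slice of the token vector with dot-product relevance, and the learned projections, residual connections, and layer normalisation found in real blocks are left out. All names are hypothetical.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def causal_head(vectors):
    """One head: each position takes a weighted sum over positions <= itself."""
    out = []
    for i, q in enumerate(vectors):
        seen = vectors[: i + 1]
        weights = softmax([sum(a * b for a, b in zip(q, k)) for k in seen])
        out.append([sum(w * v[d] for w, v in zip(weights, seen)) for d in range(len(q))])
    return out

def feed_forward(vec, w1, w2):
    """Two linear maps with a ReLU in between, applied to one position."""
    hidden = [max(0.0, sum(x * w for x, w in zip(vec, row))) for row in w1]
    return [sum(h * w for h, w in zip(hidden, row)) for row in w2]

def toy_block(vectors, num_heads, w1, w2):
    dim = len(vectors[0])
    size = dim // num_heads
    # Each head works on its own slice of every token vector.
    heads = [
        causal_head([v[h * size:(h + 1) * size] for v in vectors])
        for h in range(num_heads)
    ]
    # Concatenate the heads' outputs per position, then apply feed forward.
    merged = [sum((head[i] for head in heads), []) for i in range(len(vectors))]
    return [feed_forward(m, w1, w2) for m in merged]

# Identity weights keep the example easy to follow; real weights are trained.
identity = [[1.0 if r == c else 0.0 for c in range(4)] for r in range(4)]
tokens = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0]]
block_out = toy_block(tokens, num_heads=2, w1=identity, w2=identity)
print(block_out)
```

The output has one vector per input position with the same dimension as the input, which is what allows blocks to be stacked one after another.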

3. Attention: Transformer Block

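Thanks to deep learning frameworks, even though a block contains many components, from a programming point of view using one amounts to adding a single network layer.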

3. Attention: Transformer Block

Finally, the last output of the last Block is used to obtain a probability distribution over the next token.
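A minimal sketch of that last step: the vector at the final position is scored against every vocabulary entry and the scores are normalised with softmax. The output weights, the tiny vocabulary, and the function name are hypothetical values chosen for illustration.

```python
import math

def next_token_distribution(last_vector, output_weights, vocab):
    """Score every vocabulary entry against the last position's vector, then softmax."""
    logits = [sum(x * w for x, w in zip(last_vector, row)) for row in output_weights]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return {tok: e / total for tok, e in zip(vocab, exps)}

vocab = ["int", "roduction", "of"]
output_weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # one hypothetical row per token
probs = next_token_distribution([2.0, 1.0], output_weights, vocab)
print(probs)
```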

How a language model produces its answer

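This part explains the Causal Attention mentioned above.

A language model produces its answer by generating one output from the question, appending that output to the question, and then generating again.

In the slide's example, the model produces w1 from your input, then w2 from your input + w1, then w3 from your input + w1 + w2.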
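The loop described above can be sketched as follows. `toy_model` is a hypothetical stand-in that simply emits w1, w2, w3 in turn, and the `<end>` stop token is an assumption; a real model would return the next token chosen from its probability distribution.

```python
def generate(model, prompt_tokens, max_new_tokens, stop_token="<end>"):
    """Append one token at a time; each step sees the prompt plus all earlier outputs."""
    sequence = list(prompt_tokens)
    for _ in range(max_new_tokens):
        next_token = model(sequence)  # the model sees input + w1 + w2 + ...
        if next_token == stop_token:
            break
        sequence.append(next_token)
    return sequence[len(prompt_tokens):]

def toy_model(sequence):
    # Hypothetical stand-in for a language model: output depends only on length so far.
    script = {1: "w1", 2: "w2", 3: "w3"}
    return script.get(len(sequence), "<end>")

generated = generate(toy_model, ["input"], max_new_tokens=10)
print(generated)
```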

How a language model produces its answer

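Overall, the model first computes relevance over the input to get the first output, w1, and then w1 is appended to the input. Because the relevance among the input tokens has already been computed, it does not need to be computed again; what may still need computing is the part involving the newly added token.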
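To make the saving concrete, the sketch below counts how many token-to-token relevance scores are computed when every step starts from scratch versus when earlier rows are kept and only the new token's row is added. The function names and the sizes are assumptions for illustration, and the count assumes causal attention in which each token also attends to itself.

```python
def pairs_recomputed_from_scratch(prompt_len, new_tokens):
    """Relevance scores computed if every step redoes the whole causal matrix."""
    total = 0
    for step in range(new_tokens):
        n = prompt_len + step  # tokens present when producing this output
        total += n * (n + 1) // 2
    return total

def pairs_with_reuse(prompt_len, new_tokens):
    """Relevance scores computed if earlier rows are kept and only new rows are added."""
    total = prompt_len * (prompt_len + 1) // 2  # the prompt, done once
    for step in range(1, new_tokens):
        total += prompt_len + step  # one new row for the appended token
    return total

print(pairs_recomputed_from_scratch(4, 3))
print(pairs_with_reuse(4, 3))
```

With a 4-token prompt and 3 generated tokens, reuse needs fewer than half as many score computations, and the gap widens as the sequence grows.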

How a language model produces its answer

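According to the literature, however, this extra computation seems unnecessary: the earlier input tokens do not need to attend to the AI's later output, so nothing already computed for them has to be updated. Research results suggest that even when that relationship is taken into account, the outcome is no better.

PS: ChatGPT basically does not bother with it.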

Why processing very long text is a challenge

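Once you know the basic principle, it is not hard to see why processing very long text is a challenge.

Accepting 100k tokens means that the number of attention weights to compute is, by itself, on the order of 100k squared. The compute required is proportional not to the length of the text but to the square of its length.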
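The arithmetic is easy to check. The sketch below counts one weight per ordered pair of tokens, i.e. a full attention matrix; with causal attention roughly half of these are needed, which is still quadratic. The function name is an assumption for this example.

```python
def attention_weight_count(num_tokens):
    """Full attention matrix size: one weight per ordered pair of tokens."""
    return num_tokens * num_tokens

for n in (1_000, 10_000, 100_000):
    print(n, attention_weight_count(n))
```

Going from 10k to 100k tokens multiplies the length by 10 but the number of weights by 100, from 10^8 to 10^10.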

Research directions

Transformer overview
