[pb note] 2025 Machine Learning in the Era of Generative AI, Hung-yi Lee (ch1-4)

Full notes collection (Book): https://hackmd.io/@4j/r1U_UJ_pye/

Course website: https://speech.ee.ntu.edu.tw/~hylee/ml/2025-spring.php

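Lecture 1: Understanding the Technical Breakthroughs and Future of Generative AI in One Class
Lecture 2: Understanding How AI Agents Work in One Class (how AI adjusts its behavior from experience, uses tools, and makes plans)
Lecture 3: The Neuroscience of AI: Analyzing the Inner Workings of Language Models (from a single neuron to groups of neurons, and how to get a language model to state what it is thinking)
Lecture 4: Is the Transformer Era Coming to an End? An Introduction to the Transformer's Competitors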

Lecture 1: Understanding the Technical Breakthroughs and Future of Generative AI in One Class

video: https://www.youtube.com/watch?v=QLiKmca4kzI
pdf: https://speech.ee.ntu.edu.tw/~hylee/ml/ml2025-course-data/introduction.pdf

Review the highlight clip

Deciding what goes on the slides is the most time-consuming part! (This is where the teacher cannot be replaced.)

Reasoning

Like a little theater playing out inside the model's head


▪ AI Agent


Deep Research also has some AI Agent ability:
it changes what it searches for next depending on the results found so far

How It Works


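==> Complex objects are built from a finite set of basic units
A finite number of choices can be combined into a nearly infinite number of possibilities
The basic units (tokens) may be pieces of text, pixels, audio sample points, and so on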


Strategy: generate only one y_i at a time, in a fixed order
This is called Autoregressive Generation (like a word-chain game)


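Given a sequence of tokens, decide (the probability distribution of) the next token; this is a multiple-choice problem
Tokens from several modalities can also be pooled together for generation

▪ How to decide the next token
The answer is often not unique,
so the model outputs a probability distribution and then rolls dice to pick the answer (the output can differ on every run)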
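
The generate-one-token-then-sample loop described above can be sketched as follows. This is a toy illustration only: `TOY_MODEL`, its vocabulary and its probabilities are made up, and the stand-in "model" looks only at the last token, whereas a real language model conditions on the whole prefix.

```python
import random

# Hypothetical toy "model": maps the last token to a distribution over next tokens.
TOY_MODEL = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.5},
    "a": {"cat": 0.5, "dog": 0.5},
    "cat": {"sleeps": 0.7, "</s>": 0.3},
    "dog": {"sleeps": 0.7, "</s>": 0.3},
    "sleeps": {"</s>": 1.0},
}

def next_token_distribution(tokens):
    """Return P(next token | tokens); this toy version only uses the last token."""
    return TOY_MODEL[tokens[-1]]

def generate(rng, max_len=10):
    """Autoregressive generation: sample one token at a time until </s>."""
    tokens = ["<s>"]
    while tokens[-1] != "</s>" and len(tokens) < max_len:
        dist = next_token_distribution(tokens)
        choices, weights = zip(*dist.items())
        # "Rolling the dice": sample according to the predicted probabilities.
        tokens.append(rng.choices(choices, weights=weights, k=1)[0])
    return tokens

print(generate(random.Random(0)))
print(generate(random.Random(1)))
```

Different seeds may yield different sequences, which mirrors the point that the output can change from run to run.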


Deep learning: many layers


Deep learning: like breaking a complex problem into simple ones (multiple steps),
which turns out to be more efficient
ref: https://www.youtube.com/watch?v=KKT2VkTdFyc


Hard problems need many steps of thinking, and there are not enough layers
==> When depth is not enough, make up for it with length (Test-Time Scaling)


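A layer contains several smaller layers; some look at the whole input, others at a single position
Self-attention layer: considers all of the input when producing each output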


Models that contain self-attention layers are usually called Transformers


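▪ Limitations of the Transformer?
It runs into problems when the input gets too long
This has led to modified architectures ==> Mamba

Where the Mechanism Comes From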


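▪ Architecture
Hyperparameters <– what people mean by "tuning parameters"
Decided by the developer (a human): the model's innate talent

▪ Parameters
Decided by the training data: what the model learns after birth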


Find the parameters θ that make f_θ fit the training data best
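
Written as an optimization problem, this search is usually stated as below. The notation is an assumption added for illustration: N training pairs (x_n, ŷ_n) and a per-example loss ℓ, which for next-token prediction is typically the cross-entropy between the predicted distribution and the correct token.

```latex
\theta^{*} = \arg\min_{\theta} \sum_{n=1}^{N} \ell\bigl(f_{\theta}(x_n),\, \hat{y}_n\bigr)
```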

General-Purpose Machine Learning Models

"General-purpose" has meant different things at different points in history


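▪ General-purpose machine learning models: first form (2018-2019)
Only an encoder, which cannot be used directly
A specialized model has to be attached on top to accomplish a task
ex: the Sesame Street family of models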


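▪ General-purpose machine learning models: second form (2020-2022)
Has full text generation ability, but cannot be controlled with instructions
The parameters still need some fine-tuning
Using the model on different tasks means: same architecture, different parameters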


▪ General-purpose machine learning models: third form (2023)
For different tasks the model no longer needs adjusting; just give it instructions


Review refs:
Pre-train https://youtu.be/cCpErV7To2o?s
Fine-tune https://youtu.be/Q9cNkUPXUB8?s
RLHF https://youtu.be/v12IKvF6Cj8?

How to Give the Model New Abilities

The era of machine life-long learning


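Adjusting the prompt (giving the model the information it needs)
The model's parameters stay fixed in this case
The behavior does not change permanently (once the instruction is removed, the model reverts)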


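To give it a new skill permanently –> the parameters have to change
▪ Fine-tuning
Beware that it can degrade the model's original abilities; it is not easy to do well
It is the last resort!
(Reminder: choose fine-tuning only after confirming the target ability cannot be obtained without it)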


ChatGPT's models can be fine-tuned by uploading data through an interface:
https://platform.openai.com/docs/guides/fine-tuning


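Fine-tuning can scramble the model's original abilities
Q: If we only want to change one small thing in the foundation model, is it worth going through parameter fine-tuning?

▪ Model Editing / neural network editing
Find the relevant parameters and modify them by hand!
ref: Lecture 8, Homework 8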


▪ Model Merging
Combine the abilities of two models directly, without any training data
ref: Lecture 9, Homework 9


Lecture 2: Understanding How AI Agents Work in One Class (how AI adjusts its behavior from experience, uses tools, and makes plans)

video: https://www.youtube.com/watch?v=M2Yg1kwPpts
[ppt] [pdf]

AI Agent


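AI Agent: a human gives the goal, and the AI works out by itself how to achieve it
For a complex goal –> it takes multiple steps and flexible adjustment of the plan

Goal: the human-specified goal, given as input
Observation: the agent observes the current state of the environment
Action: the agent takes an action

▪ Building an AI Agent with RL
Take playing a board game as an example
The agent's objective is "learn to maximize the reward"
The reward is defined by a human (ex: +1 for winning)
One limitation: an agent trained this way does not transfer across tasks

▪ **New idea: use the LLM directly as the agent**
Goal: tell it the rules of the game and the objective
Observation: also converted into a text description
Action: the output text is converted into an action that can actually be executed
This keeps running until the goal is reached

From the LLM's point of view, an AI Agent is still just continuing a sequence;
it is simply one application of a language model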
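
The goal / observation / action loop above can be sketched like this. Everything here is hypothetical scaffolding: `fake_llm` is a hand-written stand-in for a real language model call, and `CounterEnv` is a toy environment invented for the example.

```python
def fake_llm(prompt):
    """Hypothetical stand-in for a language model: reads the last observation line."""
    last_obs = prompt.strip().splitlines()[-1]
    current = int(last_obs.split("=")[1])
    return "increment" if current < 3 else "stop"

class CounterEnv:
    """Toy environment: the goal is to bring a counter up to 3."""
    def __init__(self):
        self.value = 0

    def observe(self):
        # The observation is turned into text so it can go into the prompt.
        return f"counter={self.value}"

    def step(self, action):
        # The text action is turned into something that actually executes.
        if action == "increment":
            self.value += 1

def run_agent(llm, env, goal, max_steps=10):
    """Append observations and actions to one growing text and keep asking the LLM."""
    history = [f"Goal: {goal}"]
    for _ in range(max_steps):
        history.append(f"Observation: {env.observe()}")
        action = llm("\n".join(history))
        if action == "stop":
            break
        env.step(action)
        history.append(f"Action: {action}")
    return history

env = CounterEnv()
for line in run_agent(fake_llm, env, "make the counter reach 3"):
    print(line)
```

The whole interaction is one text that keeps growing, which is why the agent can be seen as ordinary sequence continuation.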

Everything described below relies on the abilities of existing language models
!! No model is trained !!

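Advantages of Running an AI Agent with an LLM

Actions are no longer restricted: the output space is practically unlimited

Before: doing RL meant conjuring up a reward by guesswork
Now: give it the compile log and let it revise its own behavior (a log also carries more information than a single number)

More realistic interaction should allow switching actions in real time instead of being turn-based
ex: a voice conversation can be interrupted, and the other party's response arrives while you are still speaking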

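Key Capabilities of an AI Agent

Adjusting Behavior Based on Experience

▪ Hyperthymesia
When the memory is too long and compute is insufficient, the agent may fail to reach the correct answer
Remembering everything is not good for an agent

memory: the long-term memory gets too long :(
▪ read module: selects from memory the information relevant to the current situation
so the model decides based on relevant experience plus the observation
–> built with retrieval
Like RAG, but retrieving from a database made of the agent's own memories

▪ write module: makes the database record only important information
–> can be built with a language model (or the agent itself) that asks: does this need to be remembered?

▪ reflection module: reflects on and reorganizes the information in memory (may produce new ideas)
–> can be built with a language model (or the agent itself) asking itself questions
Decisions are then based on the new ideas inferred from this reorganization

▪ Building a Knowledge Graph: use past observations to establish relations between experiences
so the read module can find relevant information through the Knowledge Graph

ex: GraphRAG turns the database into a Knowledge Graph to make RAG more efficient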
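
A minimal sketch of the read / write / reflection modules, under strong simplifying assumptions: retrieval is plain word overlap instead of embeddings, and the `is_important` and `summarize` callables are hypothetical stand-ins for the language-model calls the notes describe. The class and method names are invented for this example.

```python
class AgentMemory:
    """Toy long-term memory with write, read and reflection steps."""

    def __init__(self, is_important):
        self.entries = []
        self.is_important = is_important  # stand-in for asking a language model

    def write(self, text):
        # Write module: store only what is judged worth remembering.
        if self.is_important(text):
            self.entries.append(text)

    def read(self, observation, k=2):
        # Read module: rank stored entries by word overlap with the observation.
        query = set(observation.lower().split())
        scored = [(len(query & set(e.lower().split())), e) for e in self.entries]
        scored = [s for s in scored if s[0] > 0]
        scored.sort(key=lambda s: s[0], reverse=True)
        return [e for _, e in scored[:k]]

    def reflect(self, summarize):
        # Reflection module: derive a new entry from the existing ones.
        if self.entries:
            self.entries.append(summarize(self.entries))

memory = AgentMemory(is_important=lambda t: "error" in t or "works" in t)
memory.write("compile error when using tabs in config")
memory.write("the weather is nice")            # judged unimportant, not stored
memory.write("using spaces in config works")
print(memory.read("editing the config file now"))
```

Only the retrieved entries, not the whole memory, would be placed in the prompt next to the current observation.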

Related work: MemGPT, Agent Workflow Memory, A-MEM: Agentic Memory for LLM Agents

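How AI Uses Tools

Tool: you only need to know how to use it, not how it works inside
Using a tool is also called a "function call"

System Prompt: the prompt written by the developer of the application; it is placed at the very front every time and stays fixed (higher priority)
User Prompt: what the user of the service types in; it differs every time

–> The system prompt defines which tools exist and how to use them
The text the model outputs is then used to call the function (the developer has to set up this flow and hide it from the user)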
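
The flow can be sketched as below. The `<tool>...</tool>` and `<output>...</output>` tag format, the `temperature` tool and its data are all assumptions made up for this example; real systems define their own calling convention in the system prompt.

```python
import re

# Hypothetical system prompt describing the (made-up) tool-calling convention.
SYSTEM_PROMPT = (
    "You can use tools. To call one, write <tool>name(argument)</tool>. "
    "Available tools: temperature(city)."
)

def temperature(city):
    # Hypothetical tool with made-up data, for illustration only.
    return {"Taipei": "25C"}.get(city, "unknown")

TOOLS = {"temperature": temperature}
TOOL_CALL = re.compile(r"<tool>(\w+)\((.*?)\)</tool>")

def handle_model_output(text):
    """Run a tool call if the model's output contains one; otherwise pass the text through."""
    match = TOOL_CALL.search(text)
    if match is None:
        return text  # ordinary answer, shown to the user
    name, argument = match.groups()
    result = TOOLS[name](argument)
    # The result is fed back to the model as text; the user never sees this step.
    return f"<output>{result}</output>"

print(handle_model_output("<tool>temperature(Taipei)</tool>"))
print(handle_model_output("Hello!"))
```

The developer's code, not the model, executes the function; the model only ever produces and reads text.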

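(The model goes through many internal steps before the final output reaches the user)

▪ Some examples of tool use

▪ What if there are many tools?

Store all the tool manuals in memory
Build a tool-selection module to help pick the right tool

The model can also build its own tools and add them to the toolkit

▪ Deciding whether to trust a tool

The model should not make mistakes by trusting tools too much

It turns out the model does have judgment of its own!
Internal and external knowledge pull against each other

What kind of external knowledge is more likely to convince the AI?
–> If it differs too much from the model's internal knowledge, it is not believed

Information with a more recent date is also trusted more

Even when the tool is right, the model can still get things wrong
ex: even with RAG, it mixed up two people named 李宏毅 (the actor and the professor)

Using a tool is not always more efficient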

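Can AI Make Plans?

Asking it to produce a plan first and then execute it may give better results
but the environment may not behave as expected, and then the original plan stops working!

Fix: whenever the model sees a new observation, it reconsiders the plan

It can try actually interacting with the real world to find the best path!
If the path is long, judge along the way whether to keep going, to cut down the search

but some actions cannot be undone ><
–> let everything happen inside the model's head!

This requires a World Model to simulate how the environment changes
ex: the model plays both roles itself

Is reasoning (the theater inside the model's head) exactly this kind of verification and planning?
Future research: models sometimes overthink instead of just trying the action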

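Lecture 3: The Neuroscience of AI: Analyzing the Inner Workings of Language Models (from a single neuron to groups of neurons, and how to get a language model to state what it is thinking)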

20250317
video: https://www.youtube.com/watch?v=Xnil63UDW2o
[ppt]

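What a Single Neuron Does

Generative AI: given a sequence z_1 ... z_{t-1}, predict the next token z_t
The output says how likely each possible token is to be z_t; it is a "probability distribution"

embedding: each token is mapped to a vector
unembedding: take the last vector of the final layer's vector sequence and turn it into a distribution
(the process of turning a vector back into a token)

A layer contains several kinds of sub-layers

The output of a neuron = an activation function (ReLU) applied to a weighted sum of its inputs
The transformation from red to blue in the figure is what the neuron does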
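
As a small worked example of "weighted sum, then ReLU" (the `bias` term and the numbers are assumptions added for illustration):

```python
def relu(z):
    """ReLU activation: negative values become 0."""
    return max(0.0, z)

def neuron(inputs, weights, bias=0.0):
    """Weighted sum of the inputs, followed by the ReLU activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return relu(z)

print(neuron([1.0, 2.0], [0.5, 1.0]))   # weighted sum is 2.5, stays 2.5
print(neuron([1.0, 2.0], [0.5, -1.0]))  # weighted sum is -1.5, clipped to 0.0
```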

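▪ How do we find out what a neuron does?
By "removing" it, ex: setting its output to 0 or to its average value

One function may be managed jointly by many neurons

A group of neurons handles one task together

What a Layer of Neurons Does

Representation: the output of a layer of neurons

Goal: isolate the component that "controls refusal"

Take the representations of refused prompts and subtract those of prompts that were not refused
to get a pure refusal vector

Removing the vector from a representation, ex: by direct subtraction or by subtracting its projection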
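
A minimal sketch of this difference-of-means idea with plain Python lists. The two-dimensional "representations" are made-up numbers, and the function names are hypothetical; real representations would be taken from a chosen layer of the model.

```python
import math

def mean_vector(vectors):
    """Elementwise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def steering_vector(refused, complied):
    """Mean representation of refused prompts minus mean of non-refused prompts."""
    r, c = mean_vector(refused), mean_vector(complied)
    return [a - b for a, b in zip(r, c)]

def remove_direction(h, direction):
    """Subtract from h its projection onto the given direction."""
    norm = math.sqrt(sum(d * d for d in direction))
    unit = [d / norm for d in direction]
    coeff = sum(a * u for a, u in zip(h, unit))
    return [a - coeff * u for a, u in zip(h, unit)]

refused = [[2.0, 1.0], [4.0, 1.0]]    # made-up layer outputs when the model refuses
complied = [[0.0, 1.0], [0.0, 1.0]]   # made-up layer outputs when it answers
v_refuse = steering_vector(refused, complied)
print(v_refuse)
print(remove_direction([5.0, 2.0], v_refuse))
```

Adding `v_refuse` to a representation would push the model toward refusing; removing it pushes the other way.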

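Changing the model's behavior by modifying representations like this is called: Representation Engineering, Activation Engineering, Activation Steering ...
[ref] Machine Learning 2021

Example paper: In-Context Vector

Each representation is a linear combination (weighted sum) of function vectors
e is the part not explained by function vectors –> we want e to be as small as possible

As few function vectors as possible should be selected each time –> we want the α coefficients to be as small as possible

The loss function follows from the two goals above!
It can be solved with a Sparse Auto-Encoder (SAE)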
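
The slide with the loss is missing from these notes, so the following is only the standard sparse-coding form implied by the two goals, not necessarily the exact expression from the lecture. The symbols h (a representation), f_i (function vectors), K (their number) and λ (the trade-off weight) are assumed notation.

```latex
h = \sum_{i=1}^{K} \alpha_i f_i + e,
\qquad
L = \underbrace{\Bigl\lVert h - \sum_{i=1}^{K} \alpha_i f_i \Bigr\rVert_2^{2}}_{\text{make } e \text{ small}}
  + \lambda \underbrace{\sum_{i=1}^{K} \lvert \alpha_i \rvert}_{\text{use few function vectors}}
```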

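What a Group of Neurons Does

We need a model of the language model to help analyze the language model
faithfulness: the simplified model keeps the characteristics of the real thing

▪ An example from early research

The first few layers build an understanding of the subject
The "relation" between subject and object gives rise to a linear function

Use a few (x, y) pairs to solve for W and b,
then check the predictions against the answers, so that a simple model mimics the language model's behavior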
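
Solving for W and b from a few pairs is an ordinary least-squares fit. The sketch below shrinks the problem to scalar x and y so it fits in a few lines of plain Python; in the setting described above x and y are vectors and W is a matrix, and the data here are made up.

```python
def fit_linear(xs, ys):
    """Ordinary least squares for y = w * x + b with scalar x and y."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    w = cov / var
    return w, mean_y - w * mean_x

# A few (x, y) pairs used to recover the linear function.
w, b = fit_linear([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])

# Checking a held-out pair tells us how faithful the simple model is.
prediction = w * 4.0 + b
print(w, b, prediction)
```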

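The results are mixed: the fit is good for some relations and poor for others

▪ Changing the real model based on predictions from the model

There have also been experiments that modify the language model this way
(applying conclusions from the model of the model directly to the language model)
and it succeeds in quite a few cases
ex: forcing it to give the wrong answer 高雄 (Kaohsiung)

▪ A systematic way to build a "model" of a language model

pruning: remove some of the model's components
circuit: the resulting model of the language model

Getting the Language Model to State What It Is Thinking

The simplified account earlier left out the residual connection
residual connection: a layer's output is added to its input to give the final output
This design makes very deep networks easier to train

So the layers of a transformer have residual connections: after a layer produces its output, the original input is added back

!!! The right-hand side draws the same thing in a different way !!!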


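▪ residual stream: like a highway that carries the input straight through to the output
▪ Along the way, every layer adds a little something to it
until the final distribution is obtained

▪ Idea: the intermediate layers have the same form as the last one, so could they also be passed through the unembedding layer to get a token probability distribution?
–> logit lens

▪ logit lens: inspect the logits at every layer to see how the transformer is thinking
(logit: the value before softmax is called the logit)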
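
A toy sketch of the logit lens: apply the same unembedding to the residual-stream vector after every layer and see which token comes out on top. The two-token vocabulary, the unembedding matrix and the per-layer vectors are all made-up numbers for illustration.

```python
import math

def softmax(logits):
    """Turn logits into probabilities that sum to one."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def logit_lens(residual_states, unembedding, vocab):
    """For each layer's residual-stream vector, return the most likely token."""
    guesses = []
    for h in residual_states:
        # Logits: one dot product per vocabulary row of the unembedding matrix.
        logits = [sum(w * x for w, x in zip(row, h)) for row in unembedding]
        probs = softmax(logits)
        best = max(range(len(vocab)), key=lambda i: probs[i])
        guesses.append((vocab[best], probs[best]))
    return guesses

vocab = ["it", "album"]
unembedding = [[1.0, 0.0], [0.0, 1.0]]             # hypothetical unembedding rows
states = [[2.0, 0.0], [1.0, 1.5], [0.0, 3.0]]      # hypothetical states after layers 1..3
for layer, (token, prob) in enumerate(logit_lens(states, unembedding, vocab), start=1):
    print(layer, token, round(prob, 3))
```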

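Paper examples

In the later layers the model resolves "it" to "album"

The logit lens shows the model translating French into English first and then into Chinese, which suggests this model thinks in English

▪ Every layer just adds something to the residual stream

What does it add?

Green: the original input
Orange: as said before, a neuron takes a weighted sum of the previous layer's results <–> equivalently, each neuron in the previous layer is multiplied by a weight and passed on to different neurons in the next layer

"Transformer Feed-Forward Layers Are Key-Value Memories" proposes this alternative interpretation!
View the previous layer (orange) as the attention weights –> k
The weights on the connecting lines –> values v

To replace 金城武 (Takeshi Kaneshiro) with 李宏毅 (Hung-yi Lee),
find which v is responsible and swap one for the other
The paper finds that this changes the output 48% of the time and makes the swap successfully 34% of the time

▪ Patchscopes

Interprets a representation from different angles
but the result is affected by the examples you choose

ex: decode what the last token at each layer means to the model
Prince of Wales

- Another example paper analyzes each layer's output and finds that key information is sometimes resolved only in very late layers (too late)
- So it feeds the later layer's representation back to the earliest layers and runs again,
finding that 40-60% of the originally wrong answers become correct!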

Lecture 4: Is the Transformer Era Coming to an End? An Introduction to the Transformer's Competitors

video: https://www.youtube.com/watch?v=gjsdVi90yQo
[ppt] [pdf]
(20250323)

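Every architecture exists for a reason!!

The competitor Mamba and the transformer are actually alike

1. Why Does CNN Exist?

- fully connected layer
–> receptive field: remove weights that are not needed
–>> parameter sharing: force some weights to be identical
==> Designed specifically for images; cuts unnecessary parameters and avoids overfitting

Review CNN [ref]

2. Why Does the Residual Connection Exist?

Deep neural networks were found to perform poorly not only in testing but also in "training"
so the residual connection was designed to make optimization easier, letting deeper networks train well
(as in the illustration, the error surface is smoother and training is less likely to get stuck in a local minimum)

3. Why Does the Transformer Exist?

Succession: RNN (LSTM) >>> self-attention layer >>> Mamba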

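▪ The problem to solve
Input a vector sequence, let the layer mix information, output a vector sequence
(when producing y_i, only earlier information is visible)

3-1. RNN-Style

The RNN school's solution

A hidden state mixes the information

- H_t: the information stored in the hidden state, formed from the previous hidden state and the current input x_t (it can be a vector or a large matrix)
H_t then goes through f_C to give y_t
- f_A, f_B, f_C: learned functions

Make f_A, f_B, f_C depend on time t: they can vary over time and with the input x
ex: designed to support varied operations such as forgetting or clearing information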
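
In symbols, the recurrence described above can be written as follows. The additive combination of f_A and f_B is read off from the later step in these notes where dropping f_A leaves H_t as a sum of f_B terms; the slide itself is not reproduced here, so treat the exact form as a reconstruction.

```latex
H_t = f_A(H_{t-1}) + f_B(x_t), \qquad y_t = f_C(H_t)

% time-dependent version
H_t = f_{A,t}(H_{t-1}) + f_{B,t}(x_t), \qquad y_t = f_{C,t}(H_t)
```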

Correspondence with LSTM:
f_{A,t} –> forget gate
f_{B,t} –> input gate
f_{C,t} –> output gate

▪ RNN-Style vs. AI Agent's Memory

Correspondence with the AI Agent memory mechanism from the previous lecture:
memory –> H
read –> f_{C,t}
write –> f_{B,t}
reflection –> f_{A,t}

Running an RNN

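3-2. Self-Attention Style

Self-attention procedure: computing y_t
Input sequence x_1, ..., x_t
Each x_i is multiplied by three transforms to get v_i, k_i, q_i

The query q_t at time t takes the inner product with the key k_i at every position
to get α_{t,i} (the attention weight)

Softmax makes the α sum to one
Each α' is multiplied by its v_i and the results are added up (weighted sum)
which gives y_t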
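
The steps above for a single time step can be sketched in plain Python. The vectors are made-up numbers, the three transforms that produce q, k and v are assumed to have been applied already, and the usual scaling of the scores by the square root of the dimension is left out because the notes do not mention it.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend(q_t, keys, values):
    """Compute y_t from the query q_t and the keys/values of positions 1..t."""
    scores = [dot(q_t, k) for k in keys]          # alpha_{t,i}
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]           # alpha'_{t,i}, sums to one
    dim = len(values[0])
    y_t = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return y_t, weights

keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0]]
y_t, weights = attend([0.0, 0.0], keys, values)
print(weights)  # equal scores give equal weights
print(y_t)      # so y_t is the plain average of the two value vectors
```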

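(Simplified attention diagram used in the course)

▪ The idea of attention has been around for a long time

Neural Turing Machine https://arxiv.org/abs/1410.5401
Memory Networks https://arxiv.org/pdf/1410.3916

Attention-based Memory Selection Recurrent Network for Language Modeling
https://arxiv.org/abs/1611.08656

▪ Attention at inference time

Every step has to attend to all previous positions

- RNN (top)
The computation per step is fixed
Memory use is small: only the previous H has to be kept
- attention (bottom)
Memory-hungry, and the computation grows the further you go (all earlier steps have to be considered)

Q: RNNs cannot remember a lot of information but attention can (?) ⇒ Wrong, this is a misconception!

"Attention Is All You Need"
did not invent attention; it removed everything other than attention and found that the model still works well,
which makes training much easier to parallelize!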

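Training a Language Model (Finding the Parameters)

Review of how training works
Backpropagation
Computational Graph

The transformer was designed so that training can be parallelized

▪ Training steps
Compute the current answer, measure the difference from the correct answer, update the parameters

A transformer can compute its current answers quickly

Earlier models had to emit results one token at a time
The transformer's advantage: the complete sequence is fed in at once, and the output for every time step is computed in parallel

"Given the complete input" sequence
Tokens are converted to vectors x_1, ..., x_6
Self-attention computes y_1, ..., y_6 in parallel (no y_i has to wait for another)
"Parallel output" of the token at every time step

▪ A GPU-friendly design (matrix operations)
x is multiplied by the transformations to get q, k, v
(left) k times q gives the attention matrix between every pair of time steps
(bottom) apply softmax to the attention matrix, then multiply by the values v to get y
The whole process is matrix operations –> exactly what GPUs do best :D

In contrast, an RNN cannot be computed in parallel: H_6 has to wait until H_1 ~ H_5 are done, and GPUs hate waiting!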

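▪ Self-attention vs. RNN-style

(RNNs can in fact be parallelized; read on)

People need longer and longer sequences, so they have started to miss what RNNs were good at

Can an RNN Be Parallelized During Training?

Unrolled this way, it still has to be computed sequentially and is hard to parallelize
but notice that the sequential part is all f_A

–> Remove f_A
H_t is then the sum of f_B applied to each of x_1, ..., x_t

–> H_t is a d x d matrix; write D_t for f_{B,t}(x_t)

–> D_t = v_t k_t^T

–> k_i^T q_t is a scalar α_{t,i}
Move the scalar to the front
and the result is a weighted sum of the vectors v

!!! It has turned into attention !!!
Without the softmax, it is called linear attention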
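
The derivation can be checked numerically: the recurrent form (accumulate H_t = H_{t-1} + v_t k_t^T, read out y_t = H_t q_t) and the attention form (a weighted sum of v_i with weights k_i · q_t and no softmax) give the same outputs. This is a plain-Python sketch with made-up vectors; the function names are invented for the example.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def linear_attention_recurrent(qs, ks, vs):
    """RNN view: H_t = H_{t-1} + v_t k_t^T, then y_t = H_t q_t."""
    d_v, d_k = len(vs[0]), len(ks[0])
    H = [[0.0] * d_k for _ in range(d_v)]
    outputs = []
    for q, k, v in zip(qs, ks, vs):
        for r in range(d_v):
            for c in range(d_k):
                H[r][c] += v[r] * k[c]       # write v_t k_t^T into memory
        outputs.append([dot(row, q) for row in H])
    return outputs

def linear_attention_parallel(qs, ks, vs):
    """Attention view: y_t = sum over i <= t of (k_i . q_t) v_i, without softmax."""
    outputs = []
    for t, q in enumerate(qs):
        y = [0.0] * len(vs[0])
        for i in range(t + 1):
            alpha = dot(ks[i], q)            # scalar attention weight
            y = [acc + alpha * x for acc, x in zip(y, vs[i])]
        outputs.append(y)
    return outputs

qs = [[1.0, 0.0], [1.0, 1.0]]
ks = [[1.0, 2.0], [0.0, 1.0]]
vs = [[1.0, 0.0], [2.0, 3.0]]
print(linear_attention_recurrent(qs, ks, vs))
print(linear_attention_parallel(qs, ks, vs))
```

The recurrent form needs only the previous H at each step, while the attention form has no step-to-step dependency, which is what allows parallel training.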

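Linear Attention

linear attention: an RNN without "Reflection" (f_{A,t}), like a generalized RNN
RNN: linear attention plus "Reflection" (f_{A,t})

▪ Linear Attention
At training time it behaves like self-attention, so training can be parallelized and accelerated
At inference time it behaves like an RNN

Intuition for the d x d matrix "v k^T"
v: the information to write into memory (the hidden state)
k: where to write it (ex: which column)

H: each column stores a separate piece of information
q: decides how much information to read out of which column

A variant of linear attention can approximate softmax [yt]

Linear attention still cannot beat self-attention

RNN (Linear Attention) Cannot Beat Transformer (Self-attention with Softmax)?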

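Q: Is it because the RNN's memory is limited?
A: No, both are limited

▪ The RNN's memory is limited: it can store the v of at most d time steps; beyond that, entries overlap and interfere with each other

▪ Transformer (self-attention with softmax)
Its memory is limited too!
When the time step t exceeds the dimension d,
no key can be found that retrieves a single clean v, and the model's memory starts to get scrambled

–> so the gap more likely comes from the softmax mechanism

The biggest problem with linear attention: memory never changes
Softmax, in contrast, lets memory change (as in the figure): once something more important shows up later, earlier memories matter less (their weights shrink)

What if we let the memory change?
▪ Add Reflection: gradual forgetting
Retention Network (RetNet) adds a constant γ between 0 and 1 so that memory gradually fades

During training, α_{t,i} is additionally multiplied by γ^{t-i}
During inference, H_{t-1} is additionally multiplied by γ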
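
The two views are the same computation. Writing α_{t,i} = k_i^T q_t as in the linear attention derivation, unrolling the decayed recurrence gives the extra power of γ on each attention weight:

```latex
H_t = \gamma H_{t-1} + v_t k_t^{\top}
\quad\Longrightarrow\quad
y_t = H_t q_t
    = \sum_{i=1}^{t} \gamma^{\,t-i}\, v_i \,\bigl(k_i^{\top} q_t\bigr)
    = \sum_{i=1}^{t} \gamma^{\,t-i}\, \alpha_{t,i}\, v_i
```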

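Gated Retention: replace γ with γ_t, so the forgetting rate is not fixed and can change over time
γ_t is learned by the model: what to remember and what to forget

During training, every γ_t has to be computed as well

▪ Putting some restrictions on Reflection

⊙: elementwise multiplication
G_t: decides what to do with the memory in each column of H_t (erase / keep / weaken)

Mamba

= Left figure =
Mamba (a linear-attention-style architecture) beats the transformer for the first time
x-axis: FLOPs, for models of different sizes
y-axis: perplexity; the lower, the better the model

= Right figure =
y-axis: tokens processed per second
Mamba is also faster than the transformer at inference

DeltaNet
Second line: clear the memory slot (subtract the previously stored information v_{t,old}) and then write in the new information v_t
Derive, derive, derive ...
and it turns into gradient descent!!!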
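
A reconstruction of that derivation, since the slide is not reproduced in these notes. The step size β_t and the loss L_t are assumed notation; v_{t,old} is what the current memory returns for the key k_t.

```latex
v_{t,\mathrm{old}} = H_{t-1} k_t

H_t = H_{t-1} - \beta_t\, v_{t,\mathrm{old}}\, k_t^{\top} + \beta_t\, v_t\, k_t^{\top}
    = H_{t-1} - \beta_t \bigl(H_{t-1} k_t - v_t\bigr) k_t^{\top}

L_t(H) = \tfrac{1}{2}\,\lVert H k_t - v_t \rVert_2^{2},
\qquad
\nabla_H L_t(H) = \bigl(H k_t - v_t\bigr) k_t^{\top}

\Longrightarrow\quad H_t = H_{t-1} - \beta_t\, \nabla_H L_t(H_{t-1})
```

So each memory update is one gradient descent step that makes the memory return v_t when queried with k_t.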

▪ Other models that use linear attention:

Large models: Jamba, MiniMax-01

Sana (images)

MambaOut: Do We Really Need Mamba for Vision?
https://arxiv.org/abs/2405.07992
Mamba is not necessarily needed for vision; for classification tasks, removing it works better

Do not train from scratch
ex: fine-tune
