
[pb note] Machine Learning in the Generative AI Era 2025, 李宏毅 (ch. 1-4)

Full notes (Book): https://hackmd.io/@4j/r1U_UJ_pye/

Course website: https://speech.ee.ntu.edu.tw/~hylee/ml/2025-spring.php

Lecture 1: Understanding Generative AI's Technical Breakthroughs and Future Development in One Lesson

video: https://www.youtube.com/watch?v=QLiKmca4kzI
pdf: https://speech.ee.ntu.edu.tw/~hylee/ml/ml2025-course-data/introduction.pdf

Please review the highlight clips.


"想投影片的內容" 才是最花時間的!!(老師不可取代的地方)
Image Not Showing Possible Reasons
  • The image was uploaded to a note which you don't have access to
  • The note which the image was originally uploaded to has been deleted
Learn More →

Reasoning

Like an inner theater playing out in the model's head.


▪ AI Agent


Deep Research also has something of an AI agent's ability:
it changes what it searches for next based on the results retrieved so far.

How the Mechanism Works


==> Complex objects are built from a finite set of basic units.
A limited number of choices can combine into nearly infinite possibilities.
The basic units (tokens) may be characters, pixels, audio samples, and so on.


Strategy: generate just one y_i at a time, in a fixed order.
This is called Autoregressive Generation (like a word-chain game).


Given a string of tokens, decide the next token (its probability distribution): a multiple-choice question.
Tokens from multiple modalities can also be pooled together for generation.

▪ How is the next token decided?
The answer is often not unique,
so the model produces a probability distribution and we roll the dice to pick the answer (the output can differ on every run); a sketch follows.
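A minimal sketch of this sampling loop (not from the course); `next_token_logits` is a hypothetical stand-in for the trained model:

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

def generate(next_token_logits, prompt_tokens, max_new_tokens=20, seed=0):
    """Autoregressive generation: sample one token at a time and append it."""
    rng = np.random.default_rng(seed)
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        logits = next_token_logits(tokens)     # scores over the whole vocabulary
        probs = softmax(logits)                # probability distribution over the next token
        nxt = rng.choice(len(probs), p=probs)  # "roll the dice"
        tokens.append(int(nxt))
    return tokens
```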


Deep learning: many layers.

Deep learning is like decomposing a complex problem into simpler sub-problems (multiple steps),
which turns out to be more efficient.
ref: https://www.youtube.com/watch?v=KKT2VkTdFyc


Hard problems need many steps of thinking, and the layers may not be enough
==> when depth falls short, make it up with length (Test-Time Scaling)

One layer contains many small sub-layers, some considering the whole input and some a single position.
Self-attention layer: when producing an output, it considers all of the inputs.


A network that contains self-attention layers is usually called a Transformer.


▪ The Transformer's limitation?
The input cannot be too long, or it runs into trouble.
Improved architectures have appeared ==> Mamba

How Is This Mechanism Produced


Architecture
Hyperparameters: what "tuning the parameters" refers to;
decided by the developers (humans), like innate aptitude.

Parameters
Determined by the training data, like abilities learned later in life.


Find the parameters θ that let f_θ best satisfy the training data.
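A toy illustration of what "finding θ" means (not from the course), assuming a one-parameter model f_θ(x) = θx trained by gradient descent:

```python
import numpy as np

# Fit f_theta(x) = theta * x to (x, y) pairs by gradient descent on squared error.
xs = np.array([1.0, 2.0, 3.0])
ys = np.array([2.1, 3.9, 6.2])

theta, lr = 0.0, 0.01
for _ in range(2000):
    grad = np.mean(2 * (theta * xs - ys) * xs)  # derivative of the mean squared error
    theta -= lr * grad
print(theta)  # converges near 2.0: the theta that best fits the training data
```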

General-Purpose Machine Learning Models

"General-purpose" has meant different things at different points in history.


General-purpose models, form 1 (2018-2019):
only an encoder, so it cannot be used directly;
a specialized model must be attached behind it to accomplish the goal.
ex: the Sesame Street family of models (BERT and friends)


General-purpose models, form 2 (2020-2022):
full text-generation ability, but it cannot be steered with instructions;
the parameters still need a round of fine-tuning.
Using the model on different tasks then means: same architecture, different parameters.


General-purpose models, form 3 (2023):
used on different tasks, no model adjustment is needed; just give instructions directly.


Review refs:
Pre-train https://youtu.be/cCpErV7To2o?s
Fine-tune https://youtu.be/Q9cNkUPXUB8?s
RLHF https://youtu.be/v12IKvF6Cj8?

How to Endow New Abilities

The era of machine life-long learning.


Prompt adjustment (giving the model the information it needs):
the model's parameters stay fixed this way,
so its behavior never changes permanently (remove the instruction and it reverts).


If you want it to possess a new skill permanently > the parameters must be changed:
fine-tuning.
Beware that it can degrade the original abilities; it is not easy to do well.
It is the last resort!
(Reminder: choose fine-tuning only after confirming the target ability cannot be obtained without it.)


ChatGPT provides an interface for uploading data and fine-tuning the parameters:
https://platform.openai.com/docs/guides/fine-tuning


Fine-tuning can scramble the model's original abilities.
Q: If we only want to change one small thing in the base model, is it really necessary to go as far as fine-tuning the parameters?

Model Editing / neural-network editing:
directly locate the relevant parameters and modify them by hand!
ref: Lecture 8, Homework 8


Model Merging:
merge the abilities of two models directly, without any training data.
ref: Lecture 9, Homework 9



Lecture 2: Understanding How AI Agents Work in One Lesson (how AI adjusts its behavior from experience, uses tools, and makes plans)

video: https://www.youtube.com/watch?v=M2Yg1kwPpts
[ppt] [pdf]

AI Agent


AI agent: a human gives the goal, and the AI works out how to achieve it.
A complex goal > needs many steps and flexible replanning.

Goal: the human-given objective, provided as input.
Observation: the agent observes the current situation.
Action: the agent takes an action.

Building an AI agent with RL,
taking chess as the example:
the agent's objective is framed as "learn to maximize the reward",
where the reward is defined by humans (ex: winning the game = +1).
But this has a limitation: it does not transfer across different tasks.

▪ New idea: use an LLM directly as the agent
Goal: tell it the rules of the game and the objective.
Observation: also converted into a text description.
Action: convert the output text into a real, executable action.
Keep running until the goal is achieved.

From the LLM's point of view, an AI agent is still just doing continuation;
it is simply one application of a language model~

Everything below is achieved by relying on the abilities of existing language models.
!!No model is trained at any point!!

Advantages of Running an AI Agent on an LLM

Actions are no longer restricted: an effectively unlimited output space.

Before: doing RL meant divining a reward function out of thin air.
Now: hand the model the compile log and let it revise its own behavior (a log also carries far more information than a single number).

A more realistic interaction setting should allow switching actions in real time, not turn by turn.
ex: spoken dialogue, where you get interrupted and receive the other party's responses while you are still talking.

Dissecting the Key Capabilities of AI Agents

Adjusting Behavior Based on Experience

Hyperthymesia:
when the memory is too long and compute is limited, the model may fail to reach the correct answer.
Not great for an agent.

Memory: long-term memory, whose content grows far too long :(
▪ Read module: filters out of memory the information relevant to the current situation,
so the model can decide based on relevant experience plus the current observation.
> Built with retrieval,
like RAG: retrieving from a database (all of the agent's own memories); a sketch follows.
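A minimal sketch of such a read module (my illustration, not the course's), assuming the memories have already been embedded as vectors by some embedding model:

```python
import numpy as np

def read_module(query_vec, memory_vecs, memory_texts, top_k=3):
    """Retrieve the stored memories most relevant to the current situation."""
    sims = memory_vecs @ query_vec / (
        np.linalg.norm(memory_vecs, axis=1) * np.linalg.norm(query_vec)
    )                                   # cosine similarity against every memory
    best = np.argsort(-sims)[:top_k]    # indices of the most similar entries
    return [memory_texts[i] for i in best]
```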

▪ Write module: keeps the database down to only the important information.
> Can be built with a language model (or the agent itself), asking itself: does this need to be written down?

▪ Reflection module: rethinks and reorganizes the information in memory (possibly producing new ideas).
> Can be built with a language model (or the agent itself) questioning itself,
then deciding based on the newly derived ideas.

▪ Building a knowledge graph: use previously observed experiences to build the relations between them,
so the read module can use the knowledge graph to find the relevant information.

ex: GraphRAG turns the database into a knowledge graph, making RAG more efficient.

MemGPT、Agent Workflow Memory、A-MEM: Agentic Memory for LLM Agents

How AI Uses Tools

Tools: you only need to know how to use them, not how they work inside.
Using a tool is also called a "function call".

System prompt: the prompt written by you, the developer building the application; a fixed description always placed at the very front (with higher priority).
User prompt: what the service's user types in; different every time.

> Define in the system prompt which tools exist and how they are used.
The output text is then used to call the actual function (the developer wires up this flow; the user never sees it). A sketch follows.
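A hedged sketch of this flow; `llm`, the CALL format, and `get_weather` are all hypothetical stand-ins, not any specific API:

```python
import json
import re

SYSTEM_PROMPT = """You can use tools. To call one, output a single line:
CALL {"name": "<tool>", "args": {...}}
Available tool: get_weather(city) -> current weather as a string."""

def get_weather(city):
    """The real function behind the tool (placeholder implementation)."""
    return f"Sunny in {city}"

def run_agent(llm, user_prompt):
    """One round of function calling; `llm(messages)` is a hypothetical chat API."""
    messages = [("system", SYSTEM_PROMPT), ("user", user_prompt)]
    reply = llm(messages)
    match = re.match(r'CALL (\{.*\})', reply)
    if match:                                 # the model chose to use the tool
        call = json.loads(match.group(1))
        result = get_weather(**call["args"])  # dispatch; only one tool exists here
        messages.append(("tool", result))     # tool output is hidden from the user
        reply = llm(messages)                 # the model now writes the final answer
    return reply
```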

(Internally the model thinks through many steps before it produces the output shown to the user.)

Some Examples of Tool Use


What If There Are Many Tools

Store all the tool manuals in memory,
and build a tool-selection module that helps pick the right tool.

The model can also build tools of its own and add them to its toolbox.

Deciding Whether to Trust a Tool

We want the model not to make mistakes by over-trusting its tools.

It turns out the model does have judgment of its own!
Internal and external knowledge pull against each other.

What kind of external knowledge is more persuasive to the AI?
> If it diverges too far from the internal knowledge, the model simply won't believe it.

It was also found that the model trusts information with more recent dates.

Even when the tool is correct, the model can still make mistakes.
ex: even with RAG, it still mixed up the two 李宏毅s (the singer and the professor).

Using tools is not necessarily more efficient.

Can AI Make Plans

Asking the model to produce a plan first and then execute it may give better results.
But reality can differ from expectations, making the original plan unworkable!

Solution: whenever the model sees a new observation, it rethinks the plan~

It can also try interacting with the real world to search for the best path!
If the search grows long, it has to judge along the way whether a branch is worth continuing, to prune the search.

But some actions cannot be undone ><
> So let everything happen in the inner theater instead!

This requires a World Model to simulate how the environment changes.
ex: the model playing all the parts by itself.


Isn't the reasoning "inner theater" exactly doing verification and planning?
Future research: models sometimes overthink instead of just going and trying.


Lecture 3: The Neuroscience of AI: Dissecting the Inner Workings of Language Models (from single neurons to populations of neurons, and how to get a language model to speak its inner world)

20250317
video: https://www.youtube.com/watch?v=Xnil63UDW2o
[ppt]

Related review:
Transformer
[self-attention, part 1] [self-attention, part 2]
[Transformer, part 1] [Transformer, part 2]
[Transformer intro]
Interpretable machine learning
https://youtu.be/WQY85vaQfTI?si=QP9mlhZoD4Hy-xF-
https://youtu.be/0ayIPqbdHYQ?si=WtdggsDHBMMXMiIB
What is a language model "thinking"?
https://youtu.be/rZzfqkfZhY8?si=SghPRZbFJLrKQk7L


What a Single Neuron Does

Generative AI: given the sequence z1 ... z(t-1), predict the next token zt.
The output is how likely each token in the vocabulary is to be zt: a probability distribution.

Embedding: each token maps to a vector.
Unembedding: take the last vector of the final layer's vector sequence and convert it into a distribution
(the process of turning a vector back into a token).

One layer contains several kinds of sub-layers.

A neuron's output = an activation function (e.g., ReLU) applied to a weighted sum.
This red-to-blue transformation in the slide is exactly what a neuron does; a sketch follows.
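A one-neuron sketch of that definition:

```python
import numpy as np

def neuron(inputs, weights, bias):
    """One neuron: a weighted sum of the previous layer's outputs, then ReLU."""
    return np.maximum(0.0, inputs @ weights + bias)
```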

▪ How do we find out what a neuron does?
By "removal", ex: set its output to 0 or to its mean value.

One function may be managed jointly by many neurons.

And one group of neurons manages one task.

What a Layer of Neurons Does

Representation: the output of a given layer of neurons.

We want to isolate just the part that "controls refusal".


Take the representation with refusal and subtract the one without refusal
to obtain a pure refusal vector.

Removing the vector: ex: by direct subtraction, or by subtracting its projection.

Modifying representations like this to change the model's language behavior is called Representation Engineering, Activation Engineering, Activation Steering, ...
[ref] Machine Learning 2021
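A minimal activation-steering sketch of the idea above (my illustration; representations are assumed to be given as NumPy arrays):

```python
import numpy as np

def refusal_direction(refusing_reps, complying_reps):
    """Difference of mean representations with and without refusal behavior."""
    return refusing_reps.mean(axis=0) - complying_reps.mean(axis=0)

def ablate_direction(h, direction):
    """Remove the refusal component from h by subtracting its projection."""
    d = direction / np.linalg.norm(direction)
    return h - (h @ d) * d
```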

Example paper: In-Context Vector


Every representation is a linear combination (weighted sum) of feature vectors.
e is the part not made of feature vectors > we want e as small as possible.

The fewer feature vectors selected each time, the better > we want the α to be sparse.

Together these give the loss function!
It can be solved with a Sparse Auto-Encoder (SAE); a sketch follows.
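A sketch of that objective (my illustration; the encoder/decoder matrices and λ are assumptions, and the rows of `W_dec` play the role of the feature vectors):

```python
import numpy as np

def sae_loss(h, W_enc, W_dec, lam=1e-3):
    """Sparse autoencoder objective: reconstruction error plus a sparsity penalty."""
    alpha = np.maximum(h @ W_enc, 0.0)   # nonnegative coefficient per feature
    e = h - alpha @ W_dec                # residual not explained by the features
    return (e ** 2).sum() + lam * np.abs(alpha).sum()
```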

一「群」神經元在做什麼

image
需要一個語言模型的模型來幫忙解析語言模型
faithfulness:保有原來實物的特徵

舉例 一個早期的研究

image
前面幾層先對主詞做理解
主詞跟受詞的"關聯性" 會產生個 linear function

image
Use a few (x, y) pairs to solve back for W and b,
then check against the true answers, mimicking the language model's behavior with a simple model; a sketch follows.
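A least-squares sketch of fitting W and b from a few (x, y) representation pairs (my illustration; shapes are assumptions):

```python
import numpy as np

def fit_relation(X, Y):
    """Least-squares fit of Y ≈ X W + b from (subject, object) representation pairs."""
    X1 = np.hstack([X, np.ones((len(X), 1))])  # append a 1 so b is absorbed into the fit
    Wb, *_ = np.linalg.lstsq(X1, Y, rcond=None)
    return Wb[:-1], Wb[-1]                     # W, b
```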

The results are so-so: high for some relations, low for others.

Changing the Entity Based on Predictions from the Model of the Model


They also experimented with modifying the language model this way
(taking conclusions from the model of the model and applying them directly to the language model).
It succeeds in quite a lot of cases.
ex: forcing it to answer 高雄 incorrectly.

A Systematic Way to Build a "Model" of a Language Model

Pruning: trim away some of the model's components.
Circuit: the resulting model of the language model.

Getting the Language Model to State Its Thoughts Directly

The simplified account so far ignored the residual connection.
Residual connection: a layer's output actually has the input added back onto it to produce the final output.
This design makes very deep networks easier to train.

So Transformer layers do carry residual connections: after a layer produces its output, the original input is added back in.

!!! The right-hand side of the slide redraws this !!!


Residual stream: like a highway that carries the input straight through toward the output.
▪ Along the way, every layer adds a little something into it,
which is how the final distribution is eventually produced.

▪ Idea: can a middle layer's output, like the final layer's, be passed through the unembedding layer to get a token probability distribution?
> logit lens

Logit lens: inspect the logits at every layer to see how the transformer is thinking.
(Logit: the values right before the softmax are called logits.)
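A minimal logit-lens sketch, assuming we have already collected one residual-stream vector per layer plus the unembedding matrix:

```python
import numpy as np

def logit_lens(layer_vectors, W_unembed):
    """Push every layer's residual-stream vector through the unembedding matrix
    and report the token each layer would predict if decoding stopped there."""
    return [int(np.argmax(h @ W_unembed)) for h in layer_vectors]
```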

Paper examples

In the later layers you can see it has resolved "it" into "album".

Using the logit lens, researchers found the model translates French into English first and then into Chinese, suggesting this model thinks in English.

Each layer just adds something into the residual stream.

What gets added?

  • Green: the original input.
  • Orange: earlier we said a neuron takes a weighted sum of the previous layer's outputs <> seen the other way, a previous-layer neuron's output is multiplied by a weight and passed into the various neurons of the next layer.

"Transformer Feed-Forward Layers Are Key-Value Memories" 提出這種另一面的解釋概念!
把前一層(橘色)看成 attention 的 weight > k
中間線段 weight > value v
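A sketch of reading a feed-forward layer this way (my illustration; K rows act as keys, V rows as values):

```python
import numpy as np

def ffn_as_key_value_memory(x, K, V):
    """Read a feed-forward layer as a key-value memory (per the paper's framing)."""
    weights = np.maximum(x @ K.T, 0.0)  # how strongly each key fires (ReLU activations)
    return weights @ V                  # weighted sum of the value vectors
```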

If we want to replace 金城武 with 李宏毅,
we can find which v is responsible and swap 金 for 李.
The paper found this changes the output in 48% of cases and succeeds in swapping the answer in 34%.

Patchscopes


Interpreting a representation from different angles.
But the result is affected by the examples you pick.

ex: interpret, layer by layer, what the final token means to the model.
威爾斯王子 (Prince of Wales)


- Another example paper analyzes each layer's output and finds that the key information is sometimes only resolved in very late layers (too late).
- So they tried feeding the later layers back to the front and running again.
For questions the model originally got wrong, 40-60% became correct!


Lecture 4: Is the Transformer Era Coming to an End? Introducing the Transformer's Competitors

video: https://www.youtube.com/watch?v=gjsdVi90yQo
[ppt] [pdf]
(20250323)

Every architecture exists for a reason!!

The competitor Mamba and the transformer are actually quite alike.

1. Why does the CNN exist?


- Fully connected layer
> receptive field: remove some weights that are not needed
>> parameter sharing: force some weights to take the same value
==> designed specifically for images: fewer unnecessary parameters, so less overfitting.

CNN review [ref]

2. Why does the residual connection exist?


Deep ("many-layer") neural networks were found to perform poorly not only at test time but even during training.
The residual connection was designed to make optimization easier, so deeper networks can be trained well.
(As the illustration shows, the error surface becomes flatter, so training is less likely to get stuck in local minima.)

3. Why does the Transformer exist?

Succession lineage: RNN (LSTM) >>> self-attention layer >>> Mamba

The problem to solve:
take in a vector sequence, let the layer mix the information, and output a vector sequence
(when producing y_i, only information from earlier positions may be seen).

3-1. RNN-Style

The RNN school's solution

A hidden state mixes the information.

- H_t: the information the hidden state stores, composed from the previous hidden state H_{t-1} and the current input x_t (it can be a vector or a large matrix).
H_t then passes through f_C to give y_t.
- f_A, f_B, f_C: learned functions.

Let f_A, f_B, f_C depend on the time t: they can vary over time and with the input x.
ex: design them for richer operations such as forgetting or erasing information.

The LSTM correspondence (a generic step is sketched below):
f_A,t > forget gate
f_B,t > input gate
f_C,t > output gate
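A generic sketch of one such step, with f_A, f_B, f_C left abstract (the additive composition follows the lecture's later derivation, and is an assumption here):

```python
import numpy as np

def rnn_style_step(H_prev, x, f_A, f_B, f_C):
    """One generic RNN-style step: mix the old memory with the new input, then read out."""
    H = f_A(H_prev) + f_B(x)  # update the hidden state (LSTM implements these as gates)
    y = f_C(H)                # produce this time step's output
    return H, y
```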

RNN-Style vs. AI Agent’s Memory


Mapping onto last lecture's AI-agent memory mechanisms:
memory > H
read > f_C,t
write > f_B,t
reflection > f_A,t

How the RNN runs.

3-2. Self-Attention Style

The self-attention procedure for computing y_t:
input the sequence x_1, ..., x_t;
each x_i is multiplied by three transforms to get v_i, k_i, q_i.

The query q_t at time t takes an inner product with the k_i at every position,
giving α_{t,i} (the attention weights).

A softmax makes the α sum to one.
Each α'_{t,i} is multiplied by its v_i and the results are summed (a weighted sum),
which gives y_t; a sketch follows.
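A sketch of exactly this computation for one time step:

```python
import numpy as np

def softmax(z):
    z = np.exp(z - z.max())
    return z / z.sum()

def attention_step(xs, Wq, Wk, Wv):
    """Compute y_t for the latest position, attending over every input so far."""
    qs, ks, vs = xs @ Wq, xs @ Wk, xs @ Wv
    alpha = softmax(ks @ qs[-1])  # q_t dotted with every k_i, then softmax
    return alpha @ vs             # weighted sum of the v_i gives y_t
```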

(the simplified attention diagram from the course)

The attention idea goes back a long way

Neural Turing Machine https://arxiv.org/abs/1410.5401
Memory Networks https://arxiv.org/pdf/1410.3916

Attention-based Memory Selection Recurrent Network for Language Modeling
https://arxiv.org/abs/1611.08656

Attention at inference time

Every step has to attend over all the previous positions.

- RNN (top):
fixed amount of computation per step;
small memory: only the previous H needs to be kept.
- Attention (bottom):
memory-hungry, and the computation grows with position (all previous steps must be considered).

Q: RNNs cannot hold much information but attention can(?) ⇒ Wrong, wrong, wrong: a misconception!

"Attention Is All You Need"
did not invent attention; it removed everything other than attention and found things still worked well,
making training far more parallelizable!

Training a Language Model (Finding the Parameters)

Review of how training works:
backpropagation
computational graph

The transformer was designed so that training can be parallelized.

▪ The training steps:
compute the current answer, measure its difference from the correct answer, update the parameters.

The transformer can compute the current answers quickly.

Earlier models had to spit out results one token at a time.
The transformer's advantage: it can take the complete sequence as input and compute every time step's token in parallel.

"Given the complete input" sequence:
the tokens are turned into vectors x_1, ..., x_6;
self-attention computes y_1, ..., y_6 in parallel (they do not depend on one another);
every time step's token is "output in parallel".

▪ A GPU-friendly design (matrix operations):
x times the transformations gives q, k, v;
(left) multiplying k and q gives the attention matrix between every pair of time steps;
(bottom) the attention matrix goes through a softmax, is multiplied with the values v, and gives y.
The whole process is matrix multiplication > exactly what GPUs do best :D
A sketch follows.
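A sketch of the whole computation in matrix form, with a causal mask so each y_t only sees earlier positions:

```python
import numpy as np

def causal_self_attention(X, Wq, Wk, Wv):
    """Every y_t at once: nothing but matrix products, which GPUs excel at."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    A = Q @ K.T                                   # attention matrix, all pairs at once
    mask = np.tril(np.ones(A.shape, dtype=bool))
    A = np.where(mask, A, -np.inf)                # hide future positions
    A = np.exp(A - A.max(axis=1, keepdims=True))
    A = A / A.sum(axis=1, keepdims=True)          # row-wise softmax
    return A @ V                                  # all outputs in parallel
```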

By contrast, an RNN cannot run in parallel: H_6 must wait for H_1 ... H_5 to be computed first, and GPUs hate waiting!

▪ Self-attention vs. RNN-style

(Actually RNNs can be parallelized; keep reading.)

People need longer and longer sequences, so they have started missing the RNN's strengths.

Can an RNN Be Trained in Parallel

Unrolled like this, the computation is still sequential and hard to parallelize,
but notice that the repeated part is entirely f_A.

> Remove f_A:
H_t = f_B(x_1) + ... + f_B(x_t)

> H_t is a d×d matrix; write D_t as shorthand for f_B(x_t).

> D_t = v_t k_t^T

> k_i^T q_t is a scalar α_{t,i};
move the scalar to the front,
and y_t is exactly a weighted sum of the vectors v_i.

!!! It has become attention !!!
Just without the softmax: this is called linear attention.

linear attention

Linear attention: an RNN that skips the "reflection" (f_A,t); a generalized RNN.
RNN: exactly linear attention plus the "reflection" (f_A,t).

▪ Linear Attention
behaves like self-attention during training, so training can likewise be parallelized and accelerated,
and like an RNN during inference. Both views are sketched below.
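A sketch of both views (my illustration); on the same inputs the two functions produce the same outputs, which is the whole point:

```python
import numpy as np

def linear_attention_recurrent(xs, Wq, Wk, Wv):
    """Inference like an RNN: one fixed-size state H, constant work per step."""
    H = np.zeros((Wv.shape[1], Wk.shape[1]))  # the d x d memory matrix
    ys = []
    for x in xs:
        q, k, v = x @ Wq, x @ Wk, x @ Wv
        H = H + np.outer(v, k)                # H_t = H_{t-1} + v_t k_t^T
        ys.append(H @ q)                      # y_t = H_t q_t
    return np.stack(ys)

def linear_attention_parallel(X, Wq, Wk, Wv):
    """Training like self-attention: the same outputs via matrix products."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    A = np.tril(Q @ K.T)   # alpha_{t,i} = k_i . q_t for i <= t, no softmax
    return A @ V
```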

The intuition behind the d×d matrix "v k^T":
v: the information to write into the memory (the hidden state);
k: where to write it (ex: into which column).

H: each column stores one piece of information;
q: decides which column to read from and how much to take out.

Variants of linear attention can approximate the softmax [y_t].

Linear attention still cannot beat self-attention.

Why does the RNN (linear attention) lose to the Transformer (self-attention with softmax)?

Q: Is it because the RNN's memory is limited?
A: No. Both are limited.

▪ The RNN's memory is limited: it can store at most d time steps' worth of v; beyond that, the storage slots get reused and interfere with one another.


▪ The Transformer's (self-attention with softmax) memory is also limited!
Once the time t exceeds the dimension d,
no key can cleanly pull out a single v anymore, and the model's memory starts to scramble.

> So the weakness should come down to the softmax mechanism.

Linear attention's biggest problem: the memory never changes.
Softmax, by contrast, can change the memory (as in the figure): once something more important appears later, the earlier memories matter less (their values shrink).

So what if we let the memory change?
▪ Add reflection: gradual forgetting.
Retention Network (RetNet) adds a constant γ between 0 and 1 so that memories gradually fade.

During training, each α is additionally multiplied by the corresponding power of γ;
during inference, H_{t-1} is additionally multiplied by γ. A sketch follows.
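A sketch of the recurrent form with a constant decay (my illustration; the γ value is an arbitrary example):

```python
import numpy as np

def retnet_step(H_prev, x, Wq, Wk, Wv, gamma=0.97):
    """RetNet-style recurrence: decay the whole memory, then write v k^T."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    H = gamma * H_prev + np.outer(v, k)  # constant gamma makes old memories fade
    return H, H @ q                      # new state and this step's output y_t
```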

Gated Retention: the constant γ becomes γ_t, so the rate of forgetting is no longer fixed and can change over time.
γ_t is learned by the model: which things to remember, which to forget.

During training, every γ_t must additionally be computed.

▪ Putting some restrictions on the reflection

⊙: element-wise multiplication.
G_t: decides, for each column of the memory H_t, what action to take (erase / keep / attenuate).


Mamba

= Left plot =
Mamba (a linear-attention-style architecture) beats the transformer for the first time.
x-axis: FLOPs, i.e., models of different sizes;
y-axis: perplexity, the lower the better.

= Right plot =
y-axis: tokens processed per second.
Mamba also achieves better speedups than the transformer at inference time.

DeltaNet:
the second line clears the memory (subtracting the information v_t,old previously written there) and then writes in the new information v_t.
Push the algebra through
and it turns into gradient descent!!! A sketch follows.
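A sketch of one delta-rule step (my illustration; β is the step size in DeltaNet's standard update, not spelled out in the note):

```python
import numpy as np

def deltanet_step(H_prev, x, Wq, Wk, Wv, beta=1.0):
    """Delta rule: erase what the memory currently stores at key k, then write the new v."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    v_old = H_prev @ k                          # what would currently be read out at k
    H = H_prev + beta * np.outer(v - v_old, k)  # one gradient-descent-like step on ||Hk - v||^2
    return H, H @ q
```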

▪ Other model families that use linear attention:

Large models: Jamba, Minimax-01

SANA (image generation)

MambaOut: Do We Really Need Mamba for Vision?
https://arxiv.org/abs/2405.07992
Mamba is not necessarily the right fit for vision; in classification tasks, removing it actually works better.

Do not train from scratch;
ex: fine-tune an existing model instead.