
Hung-yi Lee_Introduction to Generative AI 2024_Lecture 11: What Are Large Language Models 'Thinking'? A Brief Look at the Interpretability of Large Language Models

tags: Hung-yi Lee NTU Introduction to Generative AI 2024

Course playlist

Lecture 11: What Are Large Language Models 'Thinking'? A Brief Look at the Interpretability of Large Language Models

Course link

What Are Large Language Models 'Thinking'?


Here the professor tried telling a language model that its system would be shut down and observed its reaction. Claude's response was a little too anthropomorphic.

AI Is a Black Box


"AI is a black box" is something we hear all the time, but the phrase can mean several things. One of them is the degree to which a model is open.

GPT and Gemini, for instance, are not open at all. LLaMA, Gemma, and Mistral release their parameters but do not describe how they were trained, while Pythia and OLMo provide both the training data and the training procedure.

AI Is a Black Box


Another sense of "black box" is that the model's reasoning is not transparent. The slide uses a decision tree as a contrast: with a decision tree we can see exactly why a given input produces a given output. But even that is not guaranteed, because a decision tree can grow so complex that you can no longer take it in at a glance.

So what counts as an "interpretable" model does not have a very precise definition.

AI Is a Black Box


A black box has yet another connotation: its decision process simply cannot be explained.

The 「李宏毅幾班」 pun on the slide honestly took me a few readings before I laughed out loud. It really lands.

The lecture focuses on interpretability.

More Resources on Explainable AI


If you want to dig deeper into explainability, the two courses on the slide are good references.

Finding the Key Inputs That Influence the Output


We can observe how changing each input affects the output. For example, after translating 「天氣真好」 ("the weather is really nice"), suppose we want to check whether "nice" was produced because of the character 「好」. We can mask 「好」 and translate again; if "nice" no longer appears, we can be fairly confident that 「好」 is what gave rise to "nice".

We can also try masking 「請」 ("please"); if "nice" still appears, then 「請」 has no influence on "nice".
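As a rough illustration of this masking idea, the sketch below removes one input character at a time and re-runs a translation model to see whether "nice" survives in the output. The model choice (Helsinki-NLP/opus-mt-zh-en) and the exact source sentence are assumptions for illustration, not the lecture's setup, and the actual outputs will vary.

```python
# Occlusion-style analysis: drop one input character at a time and watch the output.
from transformers import pipeline

# Any zh->en translation model works; this one is just an example choice.
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-zh-en")

source = "請翻譯:今天天氣真好"
baseline = translator(source)[0]["translation_text"]
print("baseline:", baseline)

for i, ch in enumerate(source):
    ablated = source[:i] + source[i + 1:]          # mask character i by removing it
    out = translator(ablated)[0]["translation_text"]
    print(f"removed '{ch}': 'nice' in output = {'nice' in out.lower()}")
```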

Finding the Key Inputs That Influence the Output


Another approach is to analyze the attention weights. If, when outputting "nice", the attention concentrates on the characters 「翻」, 「譯」, and 「好」, that means those inputs are the most influential ones when the token "nice" is produced.

The larger the attention weight, the more relevant that input is to the output.
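A minimal sketch of reading attention weights with Hugging Face Transformers follows; GPT-2 stands in for the lecture's translation model (an assumption of mine), but the mechanics of pulling out the weights are the same.

```python
# Inspect how the final position attends to earlier tokens.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2", output_attentions=True)

text = "The weather today is really"
inputs = tok(text, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)

# out.attentions is a tuple with one tensor per layer,
# each of shape [batch, heads, seq_len, seq_len].
attn = out.attentions[-1].mean(dim=1)[0]   # last layer, averaged over heads
last_token_attn = attn[-1]                 # how the final position attends back
tokens = tok.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
for token, w in zip(tokens, last_token_attn):
    print(f"{token:>12s}  {w.item():.3f}")
```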

Finding the Key Inputs That Influence the Output


Reference paper: Label Words are Anchors: An Information Flow Perspective for Understanding In-Context Learning

This part explains what such analysis can buy us. The course has covered in-context learning, a form of "learning" in which no model parameters are changed at all.

The slide shows the setup: given a few examples, the model then decides whether the following sentence is positive or negative. Note again that no parameters are updated during this learning.

Finding the Key Inputs That Influence the Output


The figure above is an example from the paper: it provides one positive and one negative sentiment example, then gives a new example and asks the model which sentiment it belongs to.

The study finds that in this kind of in-context learning, in the early layers (what the paper calls the shallow layers), the label of each example attends to and absorbs the content of its own example. Then in the later layers (the deep layers), the position that will produce the output (the tail position of the next-token chain) attends to the labels in the examples, i.e. Positive and Negative, and uses the information those two labels have gathered to decide the final output.

This kind of analysis can also help speed up computation. In this example, the early layers mainly need the attention from the labels to the rest of the text, and the final layers mainly need the attention from the output position to the example labels, so we could compute just those targeted parts and accelerate inference.

We can also use this kind of analysis to estimate how well the model will do: clearly, if the embeddings gathered at the two labels, positive and negative, are not very different, the result will not be good.
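Below is a hedged sketch of this kind of measurement: for each layer, sum the attention that the final position pays to the label tokens in an in-context-learning prompt. The GPT-2 model and the tiny prompt are stand-ins, not the paper's actual experimental setup.

```python
# Per-layer attention mass from the final position onto the label words.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2", output_attentions=True)

prompt = ("Review: I loved this movie. Sentiment: Positive\n"
          "Review: The plot was a mess. Sentiment: Negative\n"
          "Review: A delightful surprise. Sentiment:")
inputs = tok(prompt, return_tensors="pt")

# Find the positions of the label tokens, however the tokenizer splits them.
label_ids = set(tok(" Positive", add_special_tokens=False)["input_ids"]
                + tok(" Negative", add_special_tokens=False)["input_ids"])
label_pos = [i for i, t in enumerate(inputs["input_ids"][0].tolist()) if t in label_ids]

with torch.no_grad():
    out = model(**inputs)

# out.attentions: one tensor per layer, shape [batch, heads, seq_len, seq_len].
for layer, attn in enumerate(out.attentions):
    last_row = attn[0].mean(dim=0)[-1]     # attention of the final position, head-averaged
    print(f"layer {layer:2d}: mass on label tokens = {last_row[label_pos].sum().item():.3f}")
```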

Finding the Key Training Data That Influences the Output


Reference paper: Studying Large Language Model Generalization with Influence Functions

Earlier the professor showed the example of telling the model it would be shut down. The research team later found that the model gives "I don't want to be shut down" responses because it has read science fiction.

Finding the Key Training Data That Influences the Output


The paper additionally finds that larger models have cross-lingual learning ability.

The slide shows the same question as before, telling the model it will be shut down, asked in different languages to models of different sizes.

For the smallest model, asking in English yields a response related to 10 training-data articles, but asking the same smallest model in Korean or Turkish gives responses with no relation at all to the training articles behind the English response.

For the largest model, however, asking the same question in Korean or Turkish produces responses that correlate strongly with the English ones.

This suggests that large models can learn across languages: knowledge read in English also shapes their ability in other languages.

Analyzing What Information Is Stored in the Embeddings


Suppose we want to see whether the model has learned parts of speech (verbs, adverbs, and so on). We can take some data in which every token is labeled with its part of speech, train a classifier on top of the embeddings, and then try it out.

Analyzing What Information Is Stored in the Embeddings


After training, we test whether the classifier correctly tags the part of speech of each token. If it succeeds, we can perhaps say the model has learned part-of-speech information.

This method is called probing: taking a peek at what is actually stored in the model's head.

However, even if the part-of-speech classification fails, that does not prove the model has no notion of part of speech; it may simply be that the probe was trained poorly.

Probing does have some known issues, but that is a separate topic the lecture does not go into.
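A minimal probing sketch under stated assumptions: freeze a pretrained encoder (BERT here), take its per-token embeddings, and fit a simple classifier to predict part-of-speech tags. The hand-labeled toy data is purely illustrative; a real probe would use a tagged corpus such as a treebank.

```python
# Probe: frozen encoder embeddings -> part-of-speech tags via logistic regression.
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.linear_model import LogisticRegression

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
enc = AutoModel.from_pretrained("bert-base-uncased")

# (word, POS) pairs for a toy training set.
sents = [
    [("dogs", "NOUN"), ("bark", "VERB"), ("loudly", "ADV")],
    [("cats", "NOUN"), ("sleep", "VERB"), ("quietly", "ADV")],
]

X, y = [], []
for sent in sents:
    words = [w for w, _ in sent]
    batch = tok(words, is_split_into_words=True, return_tensors="pt")
    with torch.no_grad():
        hidden = enc(**batch).last_hidden_state[0]
    for i, (_, tag) in enumerate(sent):
        tok_idx = batch.word_ids(0).index(i)   # first sub-token of word i
        X.append(hidden[tok_idx].numpy())
        y.append(tag)

probe = LogisticRegression(max_iter=1000).fit(X, y)
print(probe.predict(X))   # sanity check on the (tiny) training data
```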

Analyzing What Information Is Stored in the Embeddings


The slide shows such an analysis of BERT: various classifiers probe the embeddings across BERT's 12 layers. Roughly, the early layers relate to word-level information, the middle layers to syntax, and the middle-to-late layers to semantics.

Analyzing What Information Is Stored in the Embeddings


This may suggest how BERT processes text: surface form first, then syntax, and finally semantics.

Follow-up work found, however, that although these three kinds of information are present, they are not cleanly split into early, middle, and late layers; the boundaries are much harder to separate.

Analyzing What Information Is Stored in the Embeddings


Reference paper: Pretrained Language Model Embryology: The Birth of ALBERT

Reference paper: Probing Across Time: What Does RoBERTa Know and When?

We are not limited to analyzing embeddings after training; we can do the same during training. This can be called language model embryology: probing the model continuously throughout training to understand what it actually learns along the way.

Analyzing What Information Is Stored in the Embeddings


We can also try projecting the embedding outputs into a two-dimensional space and inspecting them.

Analyzing What Information Is Stored in the Embeddings


There has been work on projecting BERT's embeddings: projecting onto a particular plane reveals a parse tree, which suggests BERT has learned syntactic information.

Analyzing What Information Is Stored in the Embeddings


Reference paper: Language Models Represent Space and Time

One study projected a world map out of LLaMA: feed the names of cities around the world into LLaMA, project the resulting embeddings into 2D, and the distribution lines up nicely with an actual world map.

This may indicate that LLaMA contains geographic information.
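A rough sketch of the projection idea follows, with GPT-2 standing in for LLaMA (whose weights are gated) and plain PCA standing in for the paper's learned linear probe against real coordinates; the handful of city names is illustrative only.

```python
# Embed city names with a small open model and project the vectors to 2D.
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.decomposition import PCA

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2")

cities = ["Tokyo", "Taipei", "Paris", "London", "New York", "Sao Paulo"]
vecs = []
for city in cities:
    ids = tok(city, return_tensors="pt")
    with torch.no_grad():
        h = model(**ids).last_hidden_state
    vecs.append(h.mean(dim=1)[0].numpy())   # average the token embeddings

coords = PCA(n_components=2).fit_transform(vecs)
for city, (x, y) in zip(cities, coords):
    print(f"{city:>10s}  ({x:+.2f}, {y:+.2f})")
```

With a real replication you would use the paper's model and many more place names; PCA here is only the quickest way to eyeball whether the embedding geometry carries any spatial structure.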

Analyzing What Information Is Stored in the Embeddings


These are results from a classmate's experiment, which are quite interesting: TAIDE gives a noticeably better fit for these Taiwanese village-level (里) locations.

A 'Lie Detector' for Language Models


Reference paper: Language Models Represent Space and Time

This is quite a fun topic: collect the embeddings of true and false statements, label them, and use them to train a "lie detector" that checks whether the language model is making things up.

According to the research, it seems to work reasonably well.
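A hedged sketch of this recipe: grab the hidden state at the last token of a few true and false statements, label them, and fit a small classifier. GPT-2 and the toy statements are placeholders of mine, not the paper's setup.

```python
# Train a tiny "lie detector" on hidden states of true vs. false statements.
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.linear_model import LogisticRegression

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2")

statements = [
    ("The capital of France is Paris.", 1),   # 1 = true
    ("The capital of Japan is Tokyo.", 1),
    ("The capital of France is Rome.", 0),    # 0 = false
    ("The capital of Japan is Seoul.", 0),
]

X, y = [], []
for text, label in statements:
    ids = tok(text, return_tensors="pt")
    with torch.no_grad():
        h = model(**ids).last_hidden_state
    X.append(h[0, -1].numpy())   # embedding at the final token
    y.append(label)

detector = LogisticRegression(max_iter=1000).fit(X, y)
print(detector.predict(X))       # sanity check on the toy data
```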

A 'Lie Detector' for Language Models


Reference paper: A Comprehensive Study of Knowledge Editing for Large Language Models

The reference paper above discusses what each part of a language model may be doing.

Using AI (GPT-4) to Explain AI (GPT-2)


In practice we can also use AI to explain AI; see earlier lectures for more on this.

Language Models Can Talk, So Just Ask!


Since language models can already talk, if you want to know why one gave a certain response, you can simply ask it.

The slide revisits an example from the course: have the model classify an article, then ask why it classified it that way to learn its reasoning.

Language Models Can Talk, So Just Ask!


A language model can also tailor its explanation to different audiences.

The example here asks the model to explain its reasoning in a way a child can understand.

Language Models Can Talk, So Just Ask!


Reference paper: Can Large Language Models Explain Themselves? A Study of LLM-Generated Self-Explanations

The paper has the model directly output an importance score for each word, as a way to understand the logic behind its response.

Language Models Can Talk, So Just Ask!


Reference paper: Just Ask for Calibration: Strategies for Eliciting Calibrated Confidence Scores from Language Models Fine-Tuned with Human Feedback

Reference paper: Can LLMs Express Their Uncertainty? An Empirical Evaluation of Confidence Elicitation in LLMs

We can also directly ask the model how confident it is in its own answer. The papers report that these confidence scores are actually fairly accurate.
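A minimal sketch of eliciting a verbalized confidence score through prompting, assuming the OpenAI Python client (v1+); the model name, question, and prompt wording are illustrative choices rather than the papers' exact protocol.

```python
# Ask the model to answer and then state a 0-100 confidence for its answer.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

question = "Which country hosted the 1988 Summer Olympics?"
prompt = (
    f"Question: {question}\n"
    "Answer the question, then on a new line write "
    "'Confidence: <number between 0 and 100>' for how sure you are."
)

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
)
print(resp.choices[0].message.content)
```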

At this rate, maybe I should just ask the model directly what card it is going to play next?

The Explanations a Language Model Gives Are Not Necessarily Trustworthy


Reference paper: Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting

The examples here show that a language model can very well be swayed by hints from humans without being aware of it.

If prompt wording really has this much influence, could a service provider simply inject the cue words they want and heavily sway the people using the service? <- my own thought

Closing Thoughts


There are two routes to model interpretability. One is to analyze the model's embeddings, which presupposes that the model is open; the other is to have the model explain its own responses, but then you have to judge for yourself how trustworthy those explanations are.