ML(Hung-yi Lee)_Lecture03. CNN & Self-Attention

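# ML(Hung-yi Lee)_Lecture03. CNN & Self-Attention

###### tags: `Machine Learning`

## Convolutional Neural Network

**image classification**

![](https://i.imgur.com/rUhcKdh.png)

* How is an image used as input?
    * An image is a 3-dimensional tensor (height $\times$ width $\times$ channel; the channels are RGB).
    * Flatten the 3-D tensor into a single vector.

### CNN Explained_Neuron Version

#### Given the properties of images, not every input needs its own weight to every neuron

Ways to simplify the fully connected layer:

1. **Feed each neuron only a small part of the image**

    **Receptive field**: each neuron only cares about what happens inside its own receptive field.
    * Flatten the $3 \times 3 \times 3$ receptive field into a 27-dimensional vector. The neuron has 27 weights plus a bias, and its output becomes an input to the neurons of the next layer.

    ![](https://i.imgur.com/4uauDeO.png)
    * Receptive fields can be defined freely.
    * Typical Setting
        1. A receptive field covers all channels.
        2. Because all channels are covered, only height and width need to be stated. Together they are called the kernel size (commonly $3 \times 3$).
        3. Each receptive field is watched by a group of neurons.
        4. **stride**: how far the receptive field moves at each step, chosen by the designer. It controls how much neighbouring receptive fields overlap.
2. **Let neurons with different receptive fields share parameters**

    **parameter sharing**: two neurons use exactly the same weights (a **filter**).
    * Typical Setting: the neurons covering every receptive field use the same sets of parameters. Each shared set is a **filter**, and the job of a filter is to capture a particular pattern in the image.
    * The same pattern may appear in different regions of the image. For example, a bird's beak may appear at the top or at the bottom of the picture. Without sharing, both regions would need their own beak-detecting neuron, which costs too many parameters.

    ![](https://i.imgur.com/SNpEfog.png)

**receptive field** + **parameter sharing** = **Convolutional Layer**

* Adding these constraints makes the network less flexible and gives it a larger model bias. CNN is designed specifically for images.
* When the model bias is small, the model is more flexible and overfits more easily. A fully connected network can do all kinds of things, but it does not necessarily do any particular one well.

### CNN Explained_Filter Version

1. We have a set of filters, and each one detects only a small pattern (it does not need to look at the whole image).
2. Each filter sweeps across the whole image (shared parameters).

![](https://i.imgur.com/KSJ5Aml.png)

* The values on the diagonal of the filter are all 1, so the output is largest where three 1s in a row appear along a diagonal in the image. In the picture this pattern appears in the top-left and bottom-left corners.
* Taking the inner product of the filter with each image patch produces a **feature map** (right side of the picture). With 64 filters we get 64 feature maps. Zero padding: where the filter extends beyond the image border, the missing values are filled with 0.

![](https://i.imgur.com/NJEeVBs.png)

With $3 \times 3$ filters, a value in the second convolutional layer actually corresponds to a $5 \times 5$ region of the original image.

**The deeper the network, the larger the region that the same $3 \times 3$ filter sees.**

### Pooling

![](https://i.imgur.com/eHbb5wz.png)

Subsampling a large image (for example, dropping the even-numbered rows) does not change what the image shows.

* Pooling has no parameters.
* It **makes the image smaller**, which may hurt performance because small objects become harder to detect.
* Purpose: reduce the amount of computation.
* max pooling
    * Each filter produces a grid of numbers. Group these numbers into small blocks and keep the largest value from each block.

![](https://i.imgur.com/879WtlV.png)

### The Whole CNN

![](https://i.imgur.com/57WmJRN.png)

* CNN application: playing Go.
* CNN cannot handle scaling and rotation $\rightarrow$ **data augmentation** is needed.
* An architecture that can handle scaling: [**spatial transformer layer**](https://www.youtube.com/watch?v=SoCywZ1hZak&list=PLJV_el3uVTsPMxPbjeX7PicgWbY7F8wW9&index=6)

## Self-attention

* The problem to solve:
    * past: the input is a single vector
    * self-attention: the input is **a set of vectors** (the number of vectors can vary)
* input
    1. Text: how do we turn words into vectors?
        * **one-hot encoding**: cannot express relationships between different words

        ![](https://i.imgur.com/cthuD3b.png)
        * **word embedding**
    2. Audio signals
    3. Graph (social network)
* output
    1. Each input vector has its own output label: **sequence labeling**

        ![](https://i.imgur.com/QCJR0NY.png)

        ex. POS tagging

        ![](https://i.imgur.com/U56oQrD.png)
    2. The whole input produces a single label, ex. sentiment analysis

        ![](https://i.imgur.com/jD4j8ho.png)
    3. The number of output labels is not known in advance, ex. translation (the two languages use different numbers of words)

### Sequence labeling

A plain fully connected network applied to each vector separately ignores the surrounding context (ex. "I saw a saw."). Self-attention takes the whole sequence into account.

![](https://i.imgur.com/0wCM8OA.png)

![](https://i.imgur.com/NtdSEzz.png)

* Starting from $a^1$, measure how relevant each other vector $a$ in the sequence is. The degree of relevance is written $\alpha$.
* The attention score is computed with a dot product: $\alpha_{1,2} = q^1 \cdot k^2, \ q^1 = W^q a^1, \ k^2 = W^k a^2$ ($q =$ query, $k =$ key)

![](https://i.imgur.com/RFHVDQx.png)

If $a^1$ and $a^2$ are strongly related, the score $\alpha_{1,2}$ is high, and after the weighted sum the output is close to $v^2$.

![](https://i.imgur.com/2DH8Xkk.png)

The input vector sequence is multiplied by three different matrices.

Self-attention steps (a series of matrix operations):

1. Produce $q$, $k$, $v$.
2. Use $q$ to find the relevant positions and take the weighted sum of $v$.

The self-attention parameters $W^q, W^k, W^v$ are learned from the training data.

### Multi-head Self-attention

* Different heads capture different types of relevance.
* Self-attention has no position information (but position can matter for meaning).
    * **positional encoding**
        * each position has a unique positional vector $e^i$
        * it tells self-attention where each input sits in the sequence

### self-attention application

* speech

    Speech is a very long vector sequence. **Truncated self-attention** looks only at a limited range instead of the whole utterance, which speeds up computation.
* image

    Treat an image as a set of vectors, with each pixel as a 3-dim vector.
* graph

    Edges have to be considered as well as nodes.

    ![](https://i.imgur.com/NFGWoh7.png)

    On a graph, attention only needs to be computed between nodes connected by an edge.

    [**Graph Neural Network(GNN)**](https://www.youtube.com/watch?v=eybCCtNKwzA)

### self-attention vs CNN

* self-attention: the pixel of interest produces a query and the other pixels produce keys, so the whole image is considered at once.

![](https://i.imgur.com/zMOEQXI.png)

* CNN: has a receptive field and only attends to pixels inside it. CNN can be seen as a simplified version of self-attention.

[Paper: On the Relationship between Self-Attention and Convolutional Layers](https://arxiv.org/abs/1911.03584)

* It proves mathematically that CNN is a special case of self-attention.

![](https://i.imgur.com/FSmbg4s.png)

![](https://i.imgur.com/wZFPnVZ.png)

As the amount of training data grows, self-attention performs better (self-attention is more flexible, so it needs more data to avoid overfitting).

### self-attention vs RNN

[**Recurrent Neural Network(RNN)**](https://www.youtube.com/watch?v=xCGidAeyS4M&feature=youtu.be): like self-attention, it handles problems whose input is a sequence.

![](https://i.imgur.com/RroS2F1.png)

Differences:

1. In an RNN, for the rightmost output to take the leftmost input into account, that input has to be kept in memory through every step until the last time step. Self-attention relates any two positions directly.
2. An RNN cannot compute all outputs in parallel, so self-attention is more efficient in computation speed.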
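To make the Filter Version and Pooling sections above more concrete, here is a minimal pure-Python sketch of one filter sweeping over an image, followed by $2 \times 2$ max pooling. The $6 \times 6$ binary image, the $-1$ off-diagonal filter values, and the function names `conv2d` and `max_pool` are assumptions chosen for illustration. They are not taken from the figures in the note.

```python
def conv2d(image, kernel, stride=1):
    """Slide `kernel` over `image` (no padding); each output is an inner product."""
    k = len(kernel)
    out_h = (len(image) - k) // stride + 1
    out_w = (len(image[0]) - k) // stride + 1
    feature_map = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            r, c = i * stride, j * stride
            # Inner product between the filter and the patch it currently covers.
            row.append(sum(image[r + a][c + b] * kernel[a][b]
                           for a in range(k) for b in range(k)))
        feature_map.append(row)
    return feature_map


def max_pool(feature_map, size=2):
    """Keep the largest value in each non-overlapping size x size block."""
    return [[max(feature_map[i + a][j + b]
                 for a in range(size) for b in range(size))
             for j in range(0, len(feature_map[0]) - size + 1, size)]
            for i in range(0, len(feature_map) - size + 1, size)]


# Hypothetical 6x6 image: a diagonal run of three 1s sits in the
# top-left corner and another one in the bottom-left corner.
image = [
    [1, 0, 0, 0, 0, 1],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 1, 1, 0, 0],
    [1, 0, 0, 0, 1, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 1, 0, 1, 0],
]
# Assumed filter: 1 on the diagonal, -1 elsewhere, so it responds
# most strongly to a diagonal line of 1s.
diagonal_filter = [
    [1, -1, -1],
    [-1, 1, -1],
    [-1, -1, 1],
]

feature_map = conv2d(image, diagonal_filter)   # 4x4 feature map
pooled = max_pool(feature_map)                 # 2x2 after max pooling

for row in feature_map:
    print(row)
print(pooled)
```

In this toy setup the feature map peaks at its top-left and bottom-left entries, where the diagonal pattern sits. Max pooling keeps those peaks while halving each side of the map.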
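The note's remark that a $3 \times 3$ filter in the second convolutional layer corresponds to a $5 \times 5$ region can be checked with a short calculation. This is a sketch under the assumption that every layer uses the same kernel size and stride and that there is no pooling in between. The helper name `receptive_field` is hypothetical.

```python
def receptive_field(num_layers, kernel_size=3, stride=1):
    """Side length of the input region seen by one value after `num_layers` conv layers."""
    field = 1   # a single input pixel sees itself
    jump = 1    # distance in input pixels between neighbouring values of the current layer
    for _ in range(num_layers):
        field += (kernel_size - 1) * jump
        jump *= stride
    return field


for depth in (1, 2, 3):
    print(depth, receptive_field(depth))
```

With stride 1 each extra $3 \times 3$ layer widens the region by 2 pixels per side pair, which is why stacking layers lets small filters see larger patterns.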
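The two self-attention steps from the Sequence labeling section (produce $q$, $k$, $v$, then take a weighted sum of $v$ according to the attention scores) can be sketched in pure Python as follows. The softmax normalisation of the scores is an assumption, since the note does not say how scores are normalised. The identity weight matrices and toy inputs are hypothetical placeholders for parameters that would be learned from training data.

```python
import math


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def dot(u, v):
    """Dot product of two vectors."""
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    """Normalise scores so they are positive and sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(inputs, W_q, W_k, W_v):
    """Return one output vector b^i for every input vector a^i."""
    # Step 1: produce q, k, v for every input vector.
    queries = [matvec(W_q, a) for a in inputs]
    keys = [matvec(W_k, a) for a in inputs]
    values = [matvec(W_v, a) for a in inputs]
    outputs = []
    for q in queries:
        # alpha_{i,j} = q^i . k^j, normalised over j (softmax is an assumption).
        weights = softmax([dot(q, k) for k in keys])
        # Step 2: b^i is the weighted sum of all v^j.
        b = [sum(w * v[d] for w, v in zip(weights, values))
             for d in range(len(values[0]))]
        outputs.append(b)
    return outputs


# Hypothetical toy example: three 2-dim input vectors, identity weight matrices.
identity = [[1.0, 0.0], [0.0, 1.0]]
inputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs = self_attention(inputs, identity, identity, identity)
for b in outputs:
    print(b)
```

Every output is computed from the whole input sequence, and the loop over queries has no dependency between iterations. That independence is why the outputs can be computed in parallel, unlike in an RNN.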
