---
# System prepended metadata

title: Machine Learning

---

# Machine Learning


> These notes cover Prof. Hung-yi Lee's 2021 Machine Learning course.


## 0705

[Fully Connected Layer & Loss function](https://ithelp.ithome.com.tw/articles/10220782)

### Training Steps
>**Step 1:** Function with unknown parameters
>**Step 2:** Define loss from training data
>**Step 3:** ==Optimization==
### Possible Loss Outcomes
![](https://i.imgur.com/M6CSsRN.png)

:::info
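- Terminology
    - Hyperparameter = chosen by a person rather than learned by the machine (e.g. batch size, learning rate, choice of activation function such as sigmoid)
    - Two ReLUs added together = one hard sigmoid (both are activation functions)
    - In general, ReLU is the usual choice for training; sigmoid is harder to train
    - (N-fold) cross validation
    - Model bias (looking for a needle in a haystack that contains no needle, ==model set is too simple==)
    - Optimization issue (the needle is in the haystack but cannot be found)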
:::
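
The remark that two ReLUs combine into one hard sigmoid can be checked numerically. The sketch below is only an illustration: the breakpoints `lo` and `hi` are assumed values, and the second ReLU enters with a negative weight so that the ramp flattens out again.

```python
def relu(x):
    return max(0.0, x)

def hard_sigmoid(x, lo=-1.0, hi=1.0):
    """Piecewise-linear ramp: 0 below lo, 1 above hi, linear in between."""
    slope = 1.0 / (hi - lo)
    # The first ReLU starts the ramp at lo; the second cancels the slope after hi.
    return slope * relu(x - lo) - slope * relu(x - hi)

for x in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(x, hard_sigmoid(x))
```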


## 0707 


[【機器學習2021】類神經網路訓練不起來怎麼辦 (二)： 批次 (batch) 與動量 (momentum)](https://www.youtube.com/watch?v=zzbr1h9sF54&list=PLJV_el3uVTsMhtt7_Y6sgTHGHp1Vb2P2J&index=7&ab_channel=Hung-yiLee) [![small-gradient-v7.pdf](https://i.imgur.com/TfnlIqI.png)](https://speech.ee.ntu.edu.tw/~hylee/ml/ml2021-course-data/small-gradient-v7.pdf)


### Optimization with Batch
- 1 **epoch** = see all the batches once -> **Shuffle** after each epoch

- Small Batch vs. Large Batch
    - Small : ~~**Short** time for cooldown~~, but **noisy**
    - Large : ~~**Long** time for cooldown~~, but **powerful**
    - A larger batch does not necessarily need more time to compute the gradient, because the examples are processed in parallel (unless the batch is too large)
    - e.g. comparing batch size 1 and batch size 1000 on a dataset of 60,000 examples

    ![](https://i.imgur.com/goNRZIr.png =600x250)

    - "Noisy" updates are better for training (each batch gives a slightly different loss function, so a step that is stuck on one may still move on another)
    - Minima (flat is better than sharp)
    - e.g.
    ![](https://i.imgur.com/T7mFUyX.png =550x300)
    
:::danger
<center>Batch size is a hyperparameter you have to decide.</center>
:::

### Momentum
- Next movement = the direction opposite to the gradient + the previous movement
![](https://i.imgur.com/aOXy9Sn.png =550x350)
- $m^i$ is a weighted sum of all the previous gradients (starting from $m^0 = 0$)
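
The momentum rule above can be sketched in a few lines. This is a minimal illustration on a one-dimensional quadratic loss; the names `lr` (learning rate) and `lam` (weight on the previous movement) and their values are assumptions, not settings from the lecture.

```python
def gd_with_momentum(grad, theta0, lr=0.1, lam=0.9, steps=200):
    theta, m = theta0, 0.0            # m^0 = 0
    for _ in range(steps):
        g = grad(theta)
        m = lam * m - lr * g          # new movement = previous movement - scaled gradient
        theta = theta + m
    return theta

def grad(t):
    return 2.0 * (t - 3.0)            # gradient of L(theta) = (theta - 3)^2

theta = gd_with_momentum(grad, theta0=0.0)
print(theta)                          # close to the minimum at 3
```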

:::warning
++Conclude:++
- Critical points have zero gradient
- Critical points can be <font color="red" >saddle points</font> or <font color="red" >local minima</font>

	- The [Hessian matrix](https://www.easyatm.com.tw/wiki/%E9%BB%91%E5%A1%9E%E7%9F%A9%E9%99%A3) tells which one it is
	- At a saddle point, moving along an eigenvector of the Hessian with a negative eigenvalue lowers the loss
	- Local minima may be rare
- ==Smaller batch size== and ==momentum== help escape critical points.
:::
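
As a small illustration of how the Hessian separates the cases, the sketch below classifies a critical point of a two-parameter loss from the eigenvalues of its 2x2 symmetric Hessian. The function names are hypothetical helpers, and the closed-form eigenvalue formula only applies to the 2x2 case.

```python
import math

def eig_sym_2x2(a, b, d):
    """Eigenvalues of the symmetric matrix [[a, b], [b, d]]."""
    mean = (a + d) / 2.0
    radius = math.sqrt(((a - d) / 2.0) ** 2 + b ** 2)
    return mean - radius, mean + radius

def classify_critical_point(a, b, d):
    lo, hi = eig_sym_2x2(a, b, d)
    if lo > 0:
        return "local minimum"        # all eigenvalues positive
    if hi < 0:
        return "local maximum"        # all eigenvalues negative
    if lo < 0 < hi:
        return "saddle point"         # mixed signs
    return "inconclusive"             # some eigenvalue is zero

# L(w1, w2) = w1^2 - w2^2 has Hessian [[2, 0], [0, -2]] at the origin.
print(classify_critical_point(2.0, 0.0, -2.0))
```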
---
### Learning Rate

[【機器學習2021】類神經網路訓練不起來怎麼辦 (三)：自動調整學習速率 (Learning Rate)](https://www.youtube.com/watch?v=HYUXEeh3kwY&list=PLJV_el3uVTsMhtt7_Y6sgTHGHp1Vb2P2J&index=6&ab_channel=Hung-yiLee) [![optimizer_v4.pdf](https://i.imgur.com/TfnlIqI.png)](https://speech.ee.ntu.edu.tw/~hylee/ml/ml2021-course-data/optimizer_v4.pdf)

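- Training stuck != small gradient (the loss can stop decreasing while the gradient is still large)
- Steep error surface: use a small learning rate; flat error surface: use a large learning rate
- Adapting the learning rate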
    - Used in **Adagrad**
    ![](https://i.imgur.com/9riKyFU.png =550x350)
    - Used in **RMSProp**
    ![](https://i.imgur.com/5tR72BV.png =550x350)
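- **Adam**, a commonly used optimizer, is ==RMSProp + Momentum==
- Learning rate scheduling 
    - Learning rate decay -> lower the rate as training gets close to the destination
    - Warm up -> the rate first increases, then decreases (early on, $\sigma$ is estimated from few gradients, so small steps are taken while these statistics are collected)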
:::info
- Terminology
    - Convex function
    - Error surface (how the total loss changes with the parameters)
    - Adagrad, RMSProp 
:::
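
To make the difference between the two $\sigma$ definitions concrete, here is a minimal sketch for a single parameter. Adagrad takes the root mean square of every gradient so far, while RMSProp weights recent gradients more through `alpha`; the gradient sequence, `alpha`, and `lr` are made-up values for illustration.

```python
import math

def adagrad_sigma(grads):
    """Root mean square of all gradients seen so far."""
    return math.sqrt(sum(g * g for g in grads) / len(grads))

def rmsprop_sigma(grads, alpha=0.9):
    """Exponentially weighted RMS: recent gradients count more."""
    sigma_sq = grads[0] ** 2
    for g in grads[1:]:
        sigma_sq = alpha * sigma_sq + (1.0 - alpha) * g * g
    return math.sqrt(sigma_sq)

# 50 small gradients (flat region) followed by 5 large ones (steep region).
grads = [0.01] * 50 + [1.0] * 5
lr = 0.1
print("Adagrad step size:", lr / adagrad_sigma(grads))
print("RMSProp step size:", lr / rmsprop_sigma(grads))
```

In this run RMSProp reacts faster to the sudden steep region, so its $\sigma$ is larger and its step is smaller than Adagrad's.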

:::warning
++Conclude:++
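- Momentum sums the past gradients with their directions, whereas $\sigma$ uses only their magnitudes, so **the two do not cancel each other out**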

    ![](https://i.imgur.com/dSoLTD8.png =550x350)

:::

## 0708
[【機器學習2021】類神經網路訓練不起來怎麼辦 (四)：損失函數 (Loss) 也可能有影響](https://www.youtube.com/watch?v=O2VkP8dJ5FE&list=PLJV_el3uVTsMhtt7_Y6sgTHGHp1Vb2P2J&index=7&ab_channel=Hung-yiLee) [![classification_v2.pdf](https://i.imgur.com/TfnlIqI.png)](https://speech.ee.ntu.edu.tw/~hylee/ml/ml2021-course-data/classification_v2.pdf)

### Classification
- Regression & Classification
    - Classification: Class as **one-hot vector**
    ![](https://i.imgur.com/t5ABlDZ.png =550x350)
    - Softmax (keeps the outputs between 0 and 1 and makes them sum to 1; used with more than 2 classes)
        ![](https://i.imgur.com/A6DWZxS.png =450x80)
        - exp(x) = $e^x$, $e \approx 2.718$
    - Sigmoid (used for binary classification)
- Loss of Classification
    ![](https://i.imgur.com/2ljDlJg.png =450x150)
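    - In PyTorch, the cross-entropy loss already includes **softmax**
    - With MSE, the error surface is very flat where the **loss is large**, so training gets stuck; cross-entropy does not have this problem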
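
The claim about MSE being flat where the loss is large can be checked on a toy example. The sketch below builds softmax, cross-entropy, and MSE by hand (it does not call PyTorch) and estimates the gradient with respect to the correct class's logit by finite differences; the logits are made-up values for a confidently wrong prediction.

```python
import math

def softmax(logits):
    m = max(logits)                               # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, target):
    """target is the index of the 1 in the one-hot label."""
    return -math.log(probs[target])

def mse(probs, target):
    return sum((p - (1.0 if i == target else 0.0)) ** 2 for i, p in enumerate(probs))

def grad_wrt_logit(loss_fn, logits, target, k, eps=1e-4):
    """Central finite difference of the loss with respect to logit k."""
    up, down = list(logits), list(logits)
    up[k] += eps
    down[k] -= eps
    return (loss_fn(softmax(up), target) - loss_fn(softmax(down), target)) / (2 * eps)

logits = [10.0, -10.0, 0.0]                       # correct class is index 1, model is very wrong
ce_grad = grad_wrt_logit(cross_entropy, logits, 1, k=1)
mse_grad = grad_wrt_logit(mse, logits, 1, k=1)
print("cross-entropy gradient:", ce_grad)         # large in magnitude
print("MSE gradient:", mse_grad)                  # almost zero: flat error surface
```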

:::warning
++Concludes:++ 
- **Minimizing cross-entropy** is equivalent to **maximizing likelihood**

    > The difference between MLE and cross-entropy is that **MLE** represents ==a structured and principled approach to modeling and training==, and **binary/softmax cross-entropy** simply ==represent special cases of that applied to problems== that people typically care about.
- Changing the loss function can change the difficulty of optimization
:::

## 0709

[【機器學習2021】類神經網路訓練不起來怎麼辦 (五)： 批次標準化 (Batch Normalization) 簡介](https://www.youtube.com/watch?v=BABPWOkSbLE&list=PLJV_el3uVTsMhtt7_Y6sgTHGHp1Vb2P2J&index=9&ab_channel=Hung-yiLee) [![normalization_v4.pdf](https://i.imgur.com/TfnlIqI.png)](https://speech.ee.ntu.edu.tw/~hylee/ml/ml2021-course-data/normalization_v4.pdf)

### Batch Normalization

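- ==Changing Landscape== (bring the different input features $x$ into similar ranges so that the error surface is easier to descend)
    - Methods that give the values of the nodes similar ranges are collectively called feature normalization
    - This makes gradient descent converge more easily
    - Normalizing the initial input is not enough: **the values obtained after multiplying by a layer's weights** are inputs to the next layer and need normalizing too
- Internal nodes also need feature normalization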
    ![](https://i.imgur.com/Uqru495.png =600x400)
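    - e.g. changing $z^1$ changes $\mu$ and $\sigma$ and therefore every other normalized value, so the statistics are computed per batch rather than over the whole dataset at once, which reduces the load on the GPU
- Batch Normalization formula
    - $\gamma$ controls the scale and $\beta$ controls the shift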
    
    ![](https://i.imgur.com/wkE4YqT.png =400x300)
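- Testing = Inference
    - At test time, $\mu$ and $\sigma$ are not computed from the batch; the values obtained during training are used
    - During training, PyTorch tracks a moving average of the $\mu$ and $\sigma$ of every batch
- Internal ==Covariate Shift==  <font color="red">(experiments suggest it may have little to do with network training or with why BN helps)</font>
    - Batch Normalization makes $a$ and $a'$ have similar statistics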
    ![](https://i.imgur.com/jZCVaGE.png =600x400)
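
The training and testing behaviour described above can be sketched for a single scalar feature. This is a simplified illustration, not PyTorch's implementation: the class name, the `momentum` value, and the use of the biased variance for the running statistics are assumptions.

```python
import math

class BatchNorm1Feature:
    """Batch normalization for one scalar feature."""

    def __init__(self, momentum=0.1, eps=1e-5):
        self.gamma, self.beta = 1.0, 0.0              # learnable scale and shift
        self.running_mean, self.running_var = 0.0, 1.0
        self.momentum, self.eps = momentum, eps

    def forward(self, batch, training):
        if training:
            mu = sum(batch) / len(batch)
            var = sum((z - mu) ** 2 for z in batch) / len(batch)
            # Moving averages kept for use at inference time.
            self.running_mean += self.momentum * (mu - self.running_mean)
            self.running_var += self.momentum * (var - self.running_var)
        else:
            mu, var = self.running_mean, self.running_var
        return [self.gamma * (z - mu) / math.sqrt(var + self.eps) + self.beta
                for z in batch]

bn = BatchNorm1Feature()
print(bn.forward([100.0, 200.0, 300.0], training=True))   # roughly zero mean, unit variance
print(bn.forward([150.0], training=False))                # uses the stored statistics
```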

:::info
- Terminology
    - Covariate shift: when predicting Y from X, if the distribution of X changes over time, **the model gradually stops working**
    - Well-known normalization methods
        - Batch Renormalization
        - Layer Normalization
        - Instance Normalization
        - Group Normalization
        - Weight Normalization
        - Spectral Normalization
:::

:::warning
++Conclusions:++ 
- Experimental results and theoretical analysis both indicate that Batch Normalization **helps optimization (it improves the error surface)**
- Advantages of Batch Normalization
    - Trains faster (quicker convergence)
    - Allows higher learning rates
    - Parameter initialization is easier
    - Makes activation functions viable by regulating the inputs to them: activations that would otherwise saturate or stop learning early become usable again
    - Better results overall 
    - Has **an effect similar to Dropout** and helps prevent overfitting; when using Batch Normalization, use less Dropout, otherwise the model may underfit
        - **Explanation**: without BN, when input $x_1$ is very small and $x_2$ is very large, $W_1$ must become very large for $x_1 W_1$ to match $x_2 W_2$. That only holds while $x_1$ stays small; if the test data differ, the model overfits
:::

## 0716

[【機器學習2021】卷積神經網路 (Convolutional Neural Networks, CNN)](https://www.youtube.com/watch?v=OP5HcXJg2Aw&list=PLJV_el3uVTsMhtt7_Y6sgTHGHp1Vb2P2J&index=10&ab_channel=Hung-yiLee) [![cnn_v4.pdf](https://i.imgur.com/TfnlIqI.png)](https://speech.ee.ntu.edu.tw/~hylee/ml/ml2021-course-data/cnn_v4.pdf)

### Image Classification(Convolutional Layers & Pooling)
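- The target is a one-hot vector, e.g. output = 2000 x 1 -> 2000 possible classes
- 3-D tensor = height x width x channels
- Typical setting of the receptive field:
    - Simplification 1
        - Covers all channels (**kernel** size: 3x3)
        - Usually 64 or 128 neurons watch the same receptive field
        - Stride: the step between neighbouring receptive fields; the fields should overlap so that no pattern is missed
        - Padding handles the borders
    - Simplification 2
        - Neurons can share parameters (only neurons with different receptive fields; with the same field, their outputs would be identical)
- The whole CNN
1. Convolution (turns the image into a feature map with one channel per filter)
    - The deeper the network, the larger the pattern a 3x3 filter can see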
    ![](https://i.imgur.com/gEAeHUx.png =600x400)
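2. Pooling (shrinks the image)
    - Max pooling (keep only the largest value in each NxN block)
    - Reduces computation but may ==lose detail==; with enough computing power, the network can use convolution only
3. Flatten (unroll the matrix into a vector)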
![](https://i.imgur.com/gYM1YqV.png =600x400)
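
The convolution and max-pooling steps can be traced by hand on a small example. The sketch below handles one channel, one filter, and no padding; the 6x6 image and the diagonal-pattern filter are made-up illustrative values, and the function names are hypothetical.

```python
def conv2d(image, kernel, stride=1):
    """Slide one square filter over one square single-channel image (no padding)."""
    k = len(kernel)
    out_size = (len(image) - k) // stride + 1
    out = []
    for i in range(out_size):
        row = []
        for j in range(out_size):
            s = 0
            for di in range(k):
                for dj in range(k):
                    s += image[i * stride + di][j * stride + dj] * kernel[di][dj]
            row.append(s)
        out.append(row)
    return out

def max_pool(fmap, n=2):
    """Keep only the largest value in each non-overlapping n x n block."""
    return [[max(fmap[i + di][j + dj] for di in range(n) for dj in range(n))
             for j in range(0, len(fmap[0]) - n + 1, n)]
            for i in range(0, len(fmap) - n + 1, n)]

image = [
    [1, 0, 0, 0, 0, 1],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 1, 1, 0, 0],
    [1, 0, 0, 0, 1, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 1, 0, 1, 0],
]
diagonal_filter = [
    [1, -1, -1],
    [-1, 1, -1],
    [-1, -1, 1],
]
feature_map = conv2d(image, diagonal_filter)   # 4x4: large where the diagonal pattern appears
pooled = max_pool(feature_map)                 # 2x2 after pooling
print(feature_map)
print(pooled)
```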

:::info
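- Terminology:
    - Tensor: an array with more than 2 dimensions
    - Channel: color intensity planes; 3 (RGB) for a color image, 1 for grayscale
    - Receptive field: the small local region a neuron is connected to, instead of being fully connected to the whole input
        - Fields can overlap (several neurons can even watch the same field to capture more patterns)
        - Can be rectangular
        - Can cover only some of the channels
    - Filter: the name for a set of parameters shared across neurons; it scans the image for the pattern it encodes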
:::

:::warning
++Concludes:++ 
- Convolutional Layer
    ![](https://i.imgur.com/yGR85mt.png =550x350)
    
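    - A CNN cannot recognize images that have been **scaled or rotated**
    - ==Data augmentation== addresses this: during training, crop patches from the images and scale or rotate them so the CNN learns from these variants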
:::
---
[【機器學習2021】自注意力機制 (Self-attention) (上)](https://www.youtube.com/watch?v=hYdO9CscNes&list=PLJV_el3uVTsMhtt7_Y6sgTHGHp1Vb2P2J&index=10&ab_channel=Hung-yiLee) [![self_v7.pdf](https://i.imgur.com/TfnlIqI.png)](https://speech.ee.ntu.edu.tw/~hylee/ml/ml2021-course-data/self_v7.pdf)

[【機器學習2021】自注意力機制 (Self-attention) (下)](https://www.youtube.com/watch?v=gmsMY5kc-zw&list=PLJV_el3uVTsMhtt7_Y6sgTHGHp1Vb2P2J&index=11&ab_channel=Hung-yiLee) [![self_v7.pdf](https://i.imgur.com/TfnlIqI.png)](https://speech.ee.ntu.edu.tw/~hylee/ml/ml2021-course-data/self_v7.pdf)

### Self Attention

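- Self-attention: every vector in the sequence is turned into a vector that carries the context of the whole sequence
- Text, speech, video, and graphs (social networks, molecules) can all be represented as a set of vectors
- Input length = output length (e.g. POS tagging, speech recognition, node properties in a social graph)
- The whole sequence has only one label (e.g. sentiment analysis, speaker recognition, molecule classification)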
![](https://i.imgur.com/njmbn9s.png =600x400)
- Implementation
    ![](https://i.imgur.com/OI2nNZu.png =600x400)
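    - To compute the attention scores for $a^1$, use $a^1$ as the query and the other $a^x$ as keys; each is multiplied by its weight matrix, and their **dot product** gives the attention score (**the larger the value, the stronger the relevance**)
    - A vector is also paired with itself, acting as both query and key
    - The softmax (formula in the upper right) can be replaced with another activation function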
    ![](https://i.imgur.com/ZcflmFD.png =600x400)
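    - The attention scores decide which inputs dominate $b^i$: **the information is extracted from each $v^j$ in proportion to its score**
    - The formula in the upper right gives $b^1$; every $b^i$ is computed the same way, with $a^i$ as the query and all vectors as keys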
    ![](https://i.imgur.com/xWMT8hb.png =600x400)
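
The query, key, and value computation above can be sketched with plain lists. This is a minimal single-head version without score scaling; the weight matrices `Wq`, `Wk`, `Wv` would normally be learned, and the identity matrices used here are placeholders for illustration.

```python
import math

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(inputs, Wq, Wk, Wv):
    q = [matvec(Wq, a) for a in inputs]
    k = [matvec(Wk, a) for a in inputs]
    v = [matvec(Wv, a) for a in inputs]
    outputs = []
    for qi in q:
        # Attention score of this query against every key (including its own).
        scores = [sum(x * y for x, y in zip(qi, kj)) for kj in k]
        alpha = softmax(scores)
        # b^i is the score-weighted sum of all value vectors.
        b = [sum(a * vj[d] for a, vj in zip(alpha, v)) for d in range(len(v[0]))]
        outputs.append(b)
    return outputs

identity = [[1.0, 0.0], [0.0, 1.0]]
inputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for b in self_attention(inputs, identity, identity, identity):
    print(b)
```

Every output has the same length as the input sequence, and each $b^i$ is a weighted average of the value vectors.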

:::info
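- Terminology:
	- Multi-head Self-attention
	    - The number of heads is how many q, k, v each input generates
	    - With 2 heads, the outputs are $b^{i,1}$ and $b^{i,2}$, which are then transformed by a matrix $W$ into $b^i$
	- Truncated Self-attention (the whole input may not be needed, because a speech sequence is very long)
	- Positional Encoding
	    - ==Self-attention has no position information==
	    - Each position has its own vector $e^i$, which is added to the input
	    - Can be hand-crafted or learned from data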
:::

:::warning
++Concludes:++ 
- CNN v.s. Self-attention
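    - In a CNN each neuron only considers its receptive field, whereas self-attention computes the relevance between all pixels
    - With particular parameter settings, self-attention ==behaves exactly like a CNN== (CNN is a special case of self-attention)
    - **With little data, the more constrained** model (CNN) is suitable; a more flexible model may overfit
- RNN vs. Self-attention
    - Although an RNN can be bidirectional, the rightmost output can hardly take the earliest input into account
    - An RNN cannot be parallelized (self-attention is gradually replacing RNN architectures)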
:::

## 0720

[【機器學習2021】Transformer (上)](https://www.youtube.com/watch?v=n9TlOhRjYoc&list=PLJV_el3uVTsMhtt7_Y6sgTHGHp1Vb2P2J&index=14&ab_channel=Hung-yiLee) [![seq2seq_v9.pdf](https://i.imgur.com/TfnlIqI.png)](https://speech.ee.ntu.edu.tw/~hylee/ml/ml2021-course-data/seq2seq_v9.pdf)
[【機器學習2021】Transformer (下)](https://www.youtube.com/watch?v=N6aRv06iv2g&list=PLJV_el3uVTsMhtt7_Y6sgTHGHp1Vb2P2J&index=14&ab_channel=Hung-yiLee) [![seq2seq_v9.pdf](https://i.imgur.com/TfnlIqI.png)](https://speech.ee.ntu.edu.tw/~hylee/ml/ml2021-course-data/seq2seq_v9.pdf)


### Transformer (Seq2Seq)

- Input a sequence, output a sequence
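    - Speech recognition, speech translation (useful when the input language has no written form; possible training set: local TV dramas)
    - Machine translation
    - Chatbot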
    - Syntactic Parsing

- QA can be done by seq2seq
    - question, context -> Seq2Seq model -> answer

#### Encoder
![](https://i.imgur.com/cNFd4nt.png =600x400)
- Positional encoding solves the lack of position information
- Add & Norm -> Residual + layer norm
- Feed Forward(Fully connected network)

![](https://i.imgur.com/IkSjypr.png =200x250)


## 0724

#### Decoder (Autoregressive)

- Masked Self-attention
    ![](https://i.imgur.com/SmpKyGQ.png =600x400)
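    - Tokens that have not appeared yet are ignored (e.g. $b^2$ only considers $a^1$ & $a^2$)
    - The decoder produces its outputs one at a time and feeds each one back as its next input, so it can only consider what is on the left
- AT (autoregressive) vs. NAT (non-autoregressive)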
    ![](https://i.imgur.com/CbRsrUU.png =600x400)
- The decoder has one more step than the encoder: cross attention
    ![](https://i.imgur.com/AHfk0kJ.png =600x400)
    - In the original paper, the decoder takes the output of the encoder's last layer as its cross-attention input
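
Masked self-attention differs from ordinary self-attention only in which positions may receive weight. The sketch below turns a square matrix of raw scores into attention weights where position i sees only positions up to i; the function name and the score values are made up for illustration.

```python
import math

def masked_attention_weights(scores):
    """scores[i][j]: raw attention score of query i on key j (square matrix)."""
    weights = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]                      # positions 0..i only
        m = max(visible)
        exps = [math.exp(s - m) for s in visible]
        total = sum(exps)
        # Positions to the right (not generated yet) get weight 0.
        weights.append([e / total for e in exps] + [0.0] * (len(row) - i - 1))
    return weights

scores = [[1.0, 2.0, 3.0],
          [1.0, 1.0, 5.0],
          [0.0, 0.0, 0.0]]
for row in masked_attention_weights(scores):
    print(row)
```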

### Training

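- Copy mechanism (chatbot: copy words from the input into the output; summarization)
- Guided Attention 
    - Suited to speech recognition and speech synthesis
    - In speech synthesis the attention should move from left to right; training may not learn this by itself, so it has to be ==enforced==
- Beam Search
    - Picking the highest-scoring token at every step -> greedy decoding, which may not give the best sequence
    - It depends on the task: when the answer is well defined (speech recognition has only one correct output), beam search is very useful
    - For sentence completion and TTS (Text to Speech), many outputs are acceptable, and beam search does not help
- Cross entropy vs. BLEU score -> training minimizes cross entropy but evaluation uses BLEU; to optimize BLEU directly, try reinforcement learning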
:::info
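- Terminology:
    - Residual connection: widely used in deep learning; the input is added to the output
    - Ground truth: the correct labels in the training set, against which a supervised model's predictions are compared
    - Cross entropy: measures the gap between the predicted probability distribution and the true distribution
    - Teacher forcing: using the ground truth as the decoder's input
    - Text to Speech (TTS): speech synthesis
    - Exposure bias: during training the decoder always receives correct inputs, so at test time a single mistake can cause many more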
:::
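
The difference between greedy decoding and beam search can be seen on a toy model. In the sketch below, `next_probs` is a hypothetical lookup table standing in for a decoder over the vocabulary {A, B}; the probabilities are made up so that the greedy path is not the most probable sequence.

```python
import math

def next_probs(prefix):
    """Hypothetical toy model: P(next token | prefix)."""
    table = {
        "": {"A": 0.6, "B": 0.4},
        "A": {"A": 0.4, "B": 0.6},
        "B": {"A": 0.1, "B": 0.9},
        "AA": {"A": 0.5, "B": 0.5},
        "AB": {"A": 0.6, "B": 0.4},
        "BA": {"A": 0.5, "B": 0.5},
        "BB": {"A": 0.1, "B": 0.9},
    }
    return table[prefix]

def beam_search(beam_width, length=3):
    beams = [("", 0.0)]                          # (prefix, log-probability)
    for _ in range(length):
        candidates = [(prefix + tok, logp + math.log(prob))
                      for prefix, logp in beams
                      for tok, prob in next_probs(prefix).items()]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]          # keep only the best prefixes
    return beams[0]

greedy = beam_search(beam_width=1)               # beam width 1 is greedy decoding
beam = beam_search(beam_width=2)
print("greedy:", greedy[0], math.exp(greedy[1]))
print("beam:  ", beam[0], math.exp(beam[1]))
```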

++Concludes:++ 

![](https://i.imgur.com/mtusi3b.png =350x550)

## 0727

[【機器學習2021】自督導式學習 (Self-supervised Learning) (一) – 芝麻街與進擊的巨人](https://www.youtube.com/watch?v=e422eloJ0W4&list=PLJV_el3uVTsMhtt7_Y6sgTHGHp1Vb2P2J&index=18&ab_channel=Hung-yiLee) 
[【機器學習2021】自督導式學習 (Self-supervised Learning) (二) – BERT簡介](https://www.youtube.com/watch?v=gh0hewYkjgo&list=PLJV_el3uVTsMhtt7_Y6sgTHGHp1Vb2P2J&index=19&ab_channel=Hung-yiLee)
[【機器學習2021】自督導式學習 (Self-supervised Learning) (三) – BERT的奇聞軼事](https://www.youtube.com/watch?v=ExXA05i8DEQ&list=PLJV_el3uVTsMhtt7_Y6sgTHGHp1Vb2P2J&index=21&ab_channel=Hung-yiLee) 
[【機器學習2021】自督導式學習 (Self-supervised Learning) (四) – GPT的野望](https://www.youtube.com/watch?v=WY_E0Sd4K80&list=PLJV_el3uVTsMhtt7_Y6sgTHGHp1Vb2P2J&index=21&ab_channel=Hung-yiLee) 
[![bert_v8.pdf](https://i.imgur.com/TfnlIqI.png)](https://speech.ee.ntu.edu.tw/~hylee/ml/ml2021-course-data/bert_v8.pdf)
### Self-supervised Learning

#### BERT (Transformer Encoder)

- Input = (Word + Segment + Position) embedding
    - [Implemented way](https://medium.com/@_init_/why-bert-has-3-embedding-layers-and-their-implementation-details-9c261108e28a)
- Pre-train
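1. Masked token prediction (SpanBERT masks several consecutive tokens -> the span length is chosen at random)
2. Next Sentence Prediction (the RoBERTa study found it not very useful)
    - SOP (sentence order prediction) is more useful -> used in ALBERT
- Fine-tune (How to use BERT)

    - Case 1: Sentiment analysis
        - Input: a sequence; output: a class (positive or negative)
        - BERT (initialized by pre-training, better than random)
        - Linear transform (randomly initialized, updated by gradient descent)
        - Training from scratch vs. fine-tuning (fine-tuning wins)
    - Case 2: POS (part of speech) tagging
        - Input sequence length = output sequence length
    - Case 3: Natural Language Inference
        - Input: two sequences; output: a class (e.g. contradiction)
    - Case 4: Extraction-based Question Answering
        - The output (2 integers: start, end) always lies inside the input (document, query)
        - Two randomly initialized vectors are each inner-producted with the document tokens and passed through a softmax to find the start and end positions
- GLUE scores: human performance is set to 1, and models are scored relative to it
- Pre-train + fine-tune is a semi-supervised approach
- While learning to fill in blanks during pre-training, BERT also learns from ==context== that the same word can have different meanings -> similar to (CBOW) word embedding, but the result is a **contextualized** word embedding
- Multi-lingual BERT (104 languages in pre-training)
    - The amount of data matters: the more, the better -> better alignment
    - Although embeddings with the same meaning in different languages are close, the model still knows the languages differ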
- GPT (Predict Next Token)
    ![](https://i.imgur.com/Lud0n3k.png =600x400)
    - Resembles a Transformer decoder: attention does not look at later inputs
    - GPT models are so large that fine-tuning may be impractical -> ==In-context Learning==
        - Input: task description, examples, prompt (the model outputs the answer to the prompt)
        - A few examples -> few-shot learning
        - Only one example -> one-shot learning
        - No examples -> zero-shot learning
- Self-supervised learning can also be used for images, speech, ...

:::info
- 名詞解釋:
    - Downstream Tasks
        - The tasks we care about
        - Only a little labeled data is available
    - From scratch: training without initializing from a pre-trained model
:::

## 0810
[【機器學習2021】生成式對抗網路 (Generative Adversarial Network, GAN) (一) – 基本概念介紹](https://www.youtube.com/watch?v=4OWp0wDu6Xw&list=PLJV_el3uVTsMhtt7_Y6sgTHGHp1Vb2P2J&index=15&ab_channel=Hung-yiLee)
[【機器學習2021】生成式對抗網路 (Generative Adversarial Network, GAN) (二) – 理論介紹與WGAN](https://www.youtube.com/watch?v=jNY1WBb8l4U&list=PLJV_el3uVTsMhtt7_Y6sgTHGHp1Vb2P2J&index=15&ab_channel=Hung-yiLee)
[【機器學習2021】生成式對抗網路 (Generative Adversarial Network, GAN) (三) – 生成器效能評估與條件式生成](https://www.youtube.com/watch?v=MP0BnVH2yOo&ab_channel=Hung-yiLee)
[【機器學習2021】生成式對抗網路 (Generative Adversarial Network, GAN) (四) – Cycle GAN](https://www.youtube.com/watch?v=wulqhgnDr7E&list=PLJV_el3uVTsMhtt7_Y6sgTHGHp1Vb2P2J&index=17&ab_channel=Hung-yiLee)[![gan_v10.pdf](https://i.imgur.com/TfnlIqI.png)](https://speech.ee.ntu.edu.tw/~hylee/ml/ml2021-course-data/gan_v10.pdf)

![](https://i.imgur.com/ti4auSJ.png =600x400)
- A network that takes an input and outputs a distribution -> Generator
- Normal distribution -> Generator -> Complex distribution
- Why do we need a distribution?
    - The same input can have several different outputs (the task needs creativity)
    - e.g. drawing, video prediction
- One more component is needed: the Discriminator
    - input: image, output: scalar (large: real, small: fake)
    - It tells the generator's results apart from real images, which pushes the generator to update
    - This interaction is called adversarial (the generator tries to fool the discriminator)
- Algorithm
    - Step 1. Fix the generator, train the discriminator
    - Step 2. Fix the discriminator, train the generator
    - Repeat Steps 1 & 2 
    
![](https://i.imgur.com/bK7DDEx.png =600x400)
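- How to compute the divergence?
    - The distributions $P_G$ & $P_{data}$ are unknown, but we can sample from them
    - $P_G$: samples produced by the generator, $P_{data}$: samples drawn from the database
    - ![](https://i.imgur.com/Oqww1Qo.png =600x400)
    - For a fixed generator, the discriminator finds the maximum of the objective function
    - The generator is then chosen to make that maximum as small as possible
- Tips for GAN
    - JS divergence: it cannot show how far apart two distributions are; it is always $\log 2$ when they do not overlap, however far apart, and drops to 0 only when they coincide (works poorly)
    - Another solution (Wasserstein distance)
        - Picture a bulldozer moving the earth of P onto Q; there are infinitely many ways to move it
        - Consider all moving plans and take the one with the smallest average distance
        - WGAN (Spectral Normalization) -> keeps the gradient norm smaller than 1 everywhere
- Evaluation
    - The more concentrated the class distribution, the better the quality (for one image)
    - The flatter the averaged class distribution, the higher the diversity (over many images)
    - Inception Score (IS): good quality, large diversity
    - Also make sure the generator does not simply reproduce the real data
- Conditional GAN
    - input: a condition such as a text description (or an image), output: an image that matches the condition -> text-to-image, image translation
    - Combining GAN with supervised learning gives better generated results
- Use GAN in Unsupervised (everything above was mainly supervised)
    - Cycle GAN 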
        - Image style transfer 3D -> 2D
        ![](https://i.imgur.com/33YHFqn.png =600x400)
            - Needs 2 generators & 2 discriminators
            - One converts 3D -> 2D and the other 2D -> 3D
            - Goal: the generated 2D image keeps the characteristics of the original 3D image
        - Text style transfer(Seq 2 Seq)
        ![](https://i.imgur.com/70JBxwE.png =600x400)
            - Convert a negative sentence into a positive sentence
    - Document -> Summary, Language 1 -> Language 2, Audio -> Text
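
The contrast between JS divergence and Wasserstein distance in the GAN tips can be reproduced with two point masses on a line. The sketch below uses discrete histograms over bins 0, 1, 2, ...; the helper names and the bin count are assumptions, and the running-sum formula for the earth mover's distance only holds in one dimension.

```python
import math

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def wasserstein_1d(p, q):
    """Earth mover's distance between two histograms on bins 0, 1, 2, ..."""
    carried, cost = 0.0, 0.0
    for pi, qi in zip(p, q):
        carried += pi - qi          # earth that still has to be pushed to the next bin
        cost += abs(carried)
    return cost

def point_mass(position, bins=10):
    return [1.0 if i == position else 0.0 for i in range(bins)]

p = point_mass(0)
for pos in (1, 5, 9):
    q = point_mass(pos)
    print(pos, js_divergence(p, q), wasserstein_1d(p, q))
```

The JS value stays at log 2 for every non-overlapping pair, while the Wasserstein distance grows with the gap, which is the signal the generator needs.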


:::info
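- Terminology:
    - Mode collapse: the generator produces only a few distinct images (late in training they all look nearly the same); it has found a blind spot of the discriminator and keeps producing the images that exploit it
    - Mode dropping: diversity looks sufficient and there is no mode collapse, yet every generated image shares the same traits, because part of the real distribution is never produced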
:::

## 0817

[【機器學習2021】自編碼器 (Auto-encoder) (上) – 基本概念](https://www.youtube.com/watch?v=3oHlf8-J3Nc&list=PLJV_el3uVTsMhtt7_Y6sgTHGHp1Vb2P2J&index=22&ab_channel=Hung-yiLee)
[【機器學習2021】自編碼器 (Auto-encoder) (下) – 領結變聲器與更多應用](https://www.youtube.com/watch?v=JZvEzb5PV3U&list=PLJV_el3uVTsMhtt7_Y6sgTHGHp1Vb2P2J&index=23&ab_channel=Hung-yiLee)[![auto_v8.pdf](https://i.imgur.com/TfnlIqI.png)](https://speech.ee.ntu.edu.tw/~hylee/ml/ml2021-course-data/auto_v8.pdf)

### Auto-encoder
![](https://i.imgur.com/moi4vmG.png =600x400)
- Also Called: Dimension Reduction
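- De-noising Auto-encoder (noise is added to the input; the output must restore the noise-free input)
    - Masking in BERT is a similar idea
- Feature Disentanglement (also applicable to speech)
    - Can the code (the vector the encoder produces from the input) be split further into the pieces of information it represents?
        - Vector = Content + Speaker info
    - Application: Voice conversion
- Discrete Latent Representation
    - The code can be binary, or even a one-hot vector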
    ![](https://i.imgur.com/DAEhnes.png =600x400)
    ![](https://i.imgur.com/78vj7I8.png =600x400)
- More Application
    - Compression (image -> low dimension -> image, lossy)
    - Anomaly Detection
        - Decide whether an input is similar to the training data
        - Compare the auto-encoder's input and output: a large reconstruction loss indicates an anomaly
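
Anomaly detection by reconstruction error can be illustrated with a very small stand-in for a trained auto-encoder. In the sketch below the encoder and decoder are a fixed linear projection onto one direction instead of a trained network, on the assumption that the training data lies near the line y = x; the `threshold` is also a made-up value.

```python
import math

def make_linear_autoencoder(direction):
    norm = math.sqrt(sum(d * d for d in direction))
    w = [d / norm for d in direction]                  # unit vector

    def encode(x):
        return sum(wi * xi for wi, xi in zip(w, x))    # 2-D input -> 1-D code

    def decode(code):
        return [code * wi for wi in w]                 # 1-D code -> 2-D output

    return encode, decode

def reconstruction_error(x, encode, decode):
    x_hat = decode(encode(x))
    return sum((a - b) ** 2 for a, b in zip(x, x_hat))

# Assumption: the (hypothetical) training data lies close to the line y = x.
encode, decode = make_linear_autoencoder([1.0, 1.0])
threshold = 0.1                                        # hypothetical decision threshold

for x in ([2.0, 2.1], [3.0, -3.0]):
    err = reconstruction_error(x, encode, decode)
    print(x, err, "anomaly" if err > threshold else "normal")
```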

## 0819
[【機器學習2021】來自人類的惡意攻擊 (Adversarial Attack) (上) – 基本概念](https://www.youtube.com/watch?v=xGQKhbjrFRk&list=PLJV_el3uVTsMhtt7_Y6sgTHGHp1Vb2P2J&index=24&ab_channel=Hung-yiLee)
[【機器學習2021】來自人類的惡意攻擊 (Adversarial Attack) (下) – 類神經網路能否躲過人類深不見底的惡意？](https://www.youtube.com/watch?v=z-Q9ia5H2Ig&list=PLJV_el3uVTsMhtt7_Y6sgTHGHp1Vb2P2J&index=25&ab_channel=Hung-yiLee)[![attack_v3.pdf](https://i.imgur.com/TfnlIqI.png)](https://speech.ee.ntu.edu.tw/~hylee/ml/ml2021-course-data/attack_v3.pdf)

### Adversarial Attack
- How to attack
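    - Benign image (the original image), attacked image (the image plus noise that the naked eye cannot see)
    - Non-targeted: any wrong output is enough
        - Maximize the cross-entropy between the output and the correct answer -> i.e. minimize the negative cross-entropy
    - Targeted: the output must be wrong and also be the chosen target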
    ![](https://i.imgur.com/6EypG7G.png =600x400)
    ![](https://i.imgur.com/df6gB3h.png =600x400)
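        - Design noise that the naked eye cannot see but that makes a big difference to the machine
        - The smaller the L-infinity norm, the harder it is for humans to notice
        - The attack is found with gradient descent, minimizing a loss as in training
            - Update the input, not the parameters
            - When the change becomes large enough to be visible, pull it back into the invisible square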
            ![](https://i.imgur.com/zO5Zvzt.png =200x200)
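    - FGSM (Fast Gradient Sign Method)
        - Only one update
        - Apply the sign function: every gradient component becomes +1 or -1
        - The result always lands on one of the four corners of the L-infinity box
- White Box Attack (requires knowing the network parameters)
- Black Box Attack
    - Train a proxy network on the same training set
    - The black-box network & the proxy network are likely to be similar
    - Attacking the proxy network can achieve the same effect
    - Ensemble Attack (attack with an input that fools several networks)
    - Main reason -> possibly the features in the data are much the same across datasets
- Other kinds of data are affected too
    - Speech Processing, NLP
- Backdoor in model (the attack starts at the training stage)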
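
The FGSM step described above can be sketched on a tiny white-box model. The example below attacks a logistic-regression classifier whose weights `w` and bias `b` are made-up fixed values; for this model the gradient of the cross-entropy loss with respect to the input is `(p - y) * w`, so no autograd library is needed.

```python
import math

def predict(w, b, x):
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))              # P(class 1 | x)

def sign(v):
    return (v > 0) - (v < 0)

def fgsm(w, b, x, y, eps):
    """One-step non-targeted attack: push the input so that the loss goes up."""
    p = predict(w, b, x)
    # Gradient of the cross-entropy loss with respect to the INPUT, not the parameters.
    grad_x = [(p - y) * wi for wi in w]
    # Every component moves by exactly eps, so the result sits on a corner of the box.
    return [xi + eps * sign(g) for xi, g in zip(x, grad_x)]

w, b = [2.0, -3.0], 0.0          # hypothetical fixed model parameters
x, y = [0.3, 0.1], 1             # benign input, classified as class 1
x_adv = fgsm(w, b, x, y, eps=0.2)
print("benign:  ", predict(w, b, x))
print("attacked:", predict(w, b, x_adv))
```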

### Defense
- Passive
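    - Add a filter that blocks the noise signal 
    - Smoothing -> has side effects
    - Image Compression, Generator
    - Randomization -> change the image at random so the attacker does not know what to target
- Proactive
    - Adversarial Training (Like Data Augmentation)
        - During training, attack your own model to find inputs it labels wrongly
        - Give those inputs their correct labels and train on them as well
        - Expensive for large datasets -> the amount of data doubles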
- Conclude

    ![](https://i.imgur.com/SfF6NtL.png =550x180)

## 0825

[【機器學習2021】機器學習模型的可解釋性 (Explainable ML) (上) – 為什麼類神經網路可以正確分辨寶可夢和數碼寶貝呢？](https://www.youtube.com/watch?v=WQY85vaQfTI&list=PLJV_el3uVTsMhtt7_Y6sgTHGHp1Vb2P2J&index=27&ab_channel=Hung-yiLee)
[【機器學習2021】機器學習模型的可解釋性 (Explainable ML) (下) –機器心中的貓長什麼樣子？](https://www.youtube.com/watch?v=0ayIPqbdHYQ&list=PLJV_el3uVTsMhtt7_Y6sgTHGHp1Vb2P2J&index=27&ab_channel=Hung-yiLee)[![xai_v4.pdf](https://i.imgur.com/TfnlIqI.png)](https://speech.ee.ntu.edu.tw/~hylee/ml/ml2021-course-data/xai_v4.pdf)

### Explainable ML
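- Local Explanation -> why is this image a cat?
    ![](https://i.imgur.com/Gh5Ks4f.jpg =600x400)
    - A saliency map shows what the machine bases its decision on
    - Make sure the machine is correct because it saw the key information (e.g. not recognizing a water buffalo because of the water plants)
    - Smooth Gradient -> add random noise to the image several times and average the saliency maps; the result is clearer
    - Probing
        - Train a classifier on one layer of the network; its accuracy indicates whether that layer holds the information we want to classify (careful: low accuracy may come from poor training or badly chosen hyperparameters)
        - e.g. put a POS classifier on a layer to check whether the layer holds part-of-speech information
        - e.g. check whether some layer has removed the speaker characteristics and kept only the speech content
- Global Explanation -> what does a cat look like?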

![](https://i.imgur.com/KLy6Tqn.png =600x400)

## 0901
[【機器學習2021】概述領域自適應 (Domain Adaptation)](https://www.youtube.com/watch?v=Mnk_oUrgppM&list=PLJV_el3uVTsMhtt7_Y6sgTHGHp1Vb2P2J&index=30&ab_channel=Hung-yiLee)[![da_v6.pdf](https://i.imgur.com/TfnlIqI.png)](https://speech.ee.ntu.edu.tw/~hylee/ml/ml2021-course-data/da_v6.pdf)

### Domain Adaptation

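- Domain shift: when the training set and the test set have different data distributions, the results may be poor
- When to use: a large amount of (unlabeled) target data is available
- The feature extractor is expected to map the two datasets to the same distribution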
![](https://i.imgur.com/KdpWgH6.png =600x400)
    - The domain classifier is a binary classifier
    - Source = labeled data, Target = unlabeled data