# Machine Learning

> These notes are based on Prof. Hung-yi Lee's 2021 Machine Learning course.

## 0705 [Fully Connected Layer & Loss function](https://ithelp.ithome.com.tw/articles/10220782)

### Training steps
>**Step 1:** Function with unknown
>**Step 2:** Define loss from training data
>**Step 3:** ==Optimization==

### Possible loss functions
![](https://i.imgur.com/M6CSsRN.png)

:::info
- Terminology
    - Hyperparameter = chosen by a human rather than found by the machine (e.g. batch size, learning rate, number of sigmoids)
    - Two ReLUs stacked together = one hard sigmoid (both are activation functions)
    - In general, ReLU is preferred for training; sigmoid is harder to train
    - (N-fold) Cross validation
    - Model bias (searching for a needle in the ocean, but ==the model set is too simple== and contains no needle)
    - Optimization issue (the needle is in the set, but we fail to find it)
:::

## 0707 [【機器學習2021】類神經網路訓練不起來怎麼辦 (二): 批次 (batch) 與動量 (momentum)](https://www.youtube.com/watch?v=zzbr1h9sF54&list=PLJV_el3uVTsMhtt7_Y6sgTHGHp1Vb2P2J&index=7&ab_channel=Hung-yiLee)
[![small-gradient-v7.pdf](https://i.imgur.com/TfnlIqI.png)](https://speech.ee.ntu.edu.tw/~hylee/ml/ml2021-course-data/small-gradient-v7.pdf)

### Optimization with Batch
- 1 **epoch** = see all the batches once -> **shuffle** after each epoch
- Small batch v.s. large batch
    - Small: ~~**long** time per epoch~~, but **powerful** (noisy updates)
    - Large: ~~**short** time per epoch~~, but **weaker** (stable updates)
    - A larger batch size does not necessarily take longer to compute one gradient (as long as it is not too large)
        - e.g. dataset size = 60,000, comparing batch size = 1 vs. 1000 ![](https://i.imgur.com/goNRZIr.png =600x250)
    - "Noisy" updates are better for training (each step may be optimizing a slightly different loss function)
    - Minima (flat minima are better than sharp ones)
        - e.g. ![](https://i.imgur.com/T7mFUyX.png =550x300)

:::danger
<center>Batch size is a hyperparameter you have to decide.</center>
:::

### Momentum
- Next step = the opposite direction of the gradient + the direction of the previous movement ![](https://i.imgur.com/aOXy9Sn.png =550x350)
- $m^i$ is the weighted sum of all the previous gradients (initialize with $m^0 = 0$); a small numerical sketch follows the conclusion box below

:::warning
++Conclusion:++
- Critical points have zero gradient
- Critical points can be <font color="red">saddle points</font> or <font color="red">local minima</font>
    - Which one it is can be determined from the [Hessian matrix](https://www.easyatm.com.tw/wiki/%E9%BB%91%E5%A1%9E%E7%9F%A9%E9%99%A3)
    - The eigenvectors of the Hessian matrix give a direction to escape a saddle point
    - Local minima may actually be rare
- ==Smaller batch size== and ==momentum== help escape critical points.
:::
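A minimal numerical sketch of the momentum update described above. The learning rate, momentum coefficient, and the toy loss function are illustrative assumptions, not values from the lecture.

```python
import numpy as np

# Toy loss L(w) = w^2 with gradient dL/dw = 2w, purely for illustration.
def grad(w):
    return 2 * w

eta = 0.1   # learning rate (assumed value)
lam = 0.9   # momentum coefficient (assumed value)

w = 5.0     # initial parameter
m = 0.0     # m^0 = 0, as in the notes

for step in range(200):
    g = grad(w)
    # movement = previous movement * lam - learning rate * gradient
    m = lam * m - eta * g
    # next step = opposite direction of the gradient + previous movement direction
    w = w + m

print(w)  # w oscillates but approaches the minimum at 0
```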
---

### Learning Rate
[【機器學習2021】類神經網路訓練不起來怎麼辦 (三):自動調整學習速率 (Learning Rate)](https://www.youtube.com/watch?v=HYUXEeh3kwY&list=PLJV_el3uVTsMhtt7_Y6sgTHGHp1Vb2P2J&index=6&ab_channel=Hung-yiLee)
[![optimizer_v4.pdf](https://i.imgur.com/TfnlIqI.png)](https://speech.ee.ntu.edu.tw/~hylee/ml/ml2021-course-data/optimizer_v4.pdf)
- Training stuck != small gradient
- Where the error surface is steep, use a small learning rate; where it is flat, use a large one
- Adapting the learning rate
    - Used in **Adagrad** ![](https://i.imgur.com/9riKyFU.png =550x350)
    - Used in **RMSProp** ![](https://i.imgur.com/5tR72BV.png =550x350)
    - The **Adam** optimizer is ==RMSProp + Momentum==
- Learning rate scheduling
    - Learning rate decay -> as we get close to the destination, reduce the learning rate
    - Warm up -> the rate first increases and then decreases (explore at the beginning to collect statistics for sigma and the learning rate)

:::info
- Terminology
    - Convex function
    - Error surface (how the total loss changes with the parameters)
    - Adagrad, RMSProp
:::

:::warning
++Conclusion:++
- Momentum sums all the past gradients with their directions, while sigma uses only their magnitudes (absolute values), so **the two do not cancel out** ![](https://i.imgur.com/dSoLTD8.png =550x350)
:::

## 0708 [【機器學習2021】類神經網路訓練不起來怎麼辦 (四):損失函數 (Loss) 也可能有影響](https://www.youtube.com/watch?v=O2VkP8dJ5FE&list=PLJV_el3uVTsMhtt7_Y6sgTHGHp1Vb2P2J&index=7&ab_channel=Hung-yiLee)
[![classification_v2.pdf](https://i.imgur.com/TfnlIqI.png)](https://speech.ee.ntu.edu.tw/~hylee/ml/ml2021-course-data/classification_v2.pdf)

### Classification
- Regression & classification
    - Classification: class as a **one-hot vector** ![](https://i.imgur.com/t5ABlDZ.png =550x350)
    - Softmax (squashes $\hat{y}$ into 0~1; used when there are more than two classes) ![](https://i.imgur.com/A6DWZxS.png =450x80)
        - exp(x) = $e^x$, e ≈ 2.71
    - Sigmoid (used for binary classification)
- Loss of classification ![](https://i.imgur.com/2ljDlJg.png =450x150)
    - PyTorch's cross-entropy loss **already includes softmax** (see the check after the conclusion box below)
    - With MSE, the error surface is very flat where the loss is **large**, so training gets stuck; cross-entropy does not have this problem

:::warning
++Conclusion:++
- **Minimizing cross-entropy** is equivalent to **maximizing likelihood**
> The difference between MLE and cross-entropy is that **MLE** represents ==a structured and principled approach to modeling and training==, and **binary/softmax cross-entropy** simply ==represent special cases of that applied to problems== that people typically care about.
- Changing the loss function can change the difficulty of optimization
:::
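A small check, in PyTorch, of the note above that the built-in cross-entropy already applies softmax: passing raw logits to `F.cross_entropy` gives the same value as applying `log_softmax` by hand and taking the negative log-probability of the target class. The tensor values are made up for illustration.

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, 0.5, -1.0]])   # raw network outputs (no softmax applied)
target = torch.tensor([0])                   # index of the correct class

# PyTorch's cross-entropy expects logits; softmax is applied internally.
loss_builtin = F.cross_entropy(logits, target)

# Doing it by hand: softmax -> log -> negative log-probability of the target class.
log_probs = F.log_softmax(logits, dim=1)
loss_manual = -log_probs[0, target[0]]

print(loss_builtin.item(), loss_manual.item())  # the two values match
```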
## 0709 [【機器學習2021】類神經網路訓練不起來怎麼辦 (五): 批次標準化 (Batch Normalization) 簡介](https://www.youtube.com/watch?v=BABPWOkSbLE&list=PLJV_el3uVTsMhtt7_Y6sgTHGHp1Vb2P2J&index=9&ab_channel=Hung-yiLee)
[![normalization_v4.pdf](https://i.imgur.com/TfnlIqI.png)](https://speech.ee.ntu.edu.tw/~hylee/ml/ml2021-course-data/normalization_v4.pdf)

### Batch Normalization
- ==Changing the landscape== (make the ranges of the different input dimensions comparable, so the error surface is easier to optimize)
    - How to make the values of internal nodes similar in scale (collectively called feature normalization)
    - This helps gradient descent
    - Not only the raw inputs need normalization: **the values after multiplying by the weights inside the network** are the inputs to the next layer, so they need to be normalized too
        - Internal nodes also get feature normalization ![](https://i.imgur.com/Uqru495.png =600x400)
        - e.g. changing $z^1$ affects the other values, so normalization is done per batch rather than over the whole dataset at once, which keeps the GPU load manageable
- Batch Normalization formula
    - γ controls the scale, β controls the shift ![](https://i.imgur.com/wkE4YqT.png =400x300)
- Testing = inference
    - At test time we do not recompute μ and σ; we reuse the values computed during training
    - During training, PyTorch records the μ and σ of each batch and keeps a moving average
- Internal ==covariate shift== <font color="red">(experiments suggest it may not have much to do with how the network trains or with why BN works)</font>
    - Batch normalization makes a and $a'$ have similar statistics ![](https://i.imgur.com/jZCVaGE.png =600x400)

:::info
- Terminology
    - Covariate shift: when using X to predict Y, if the distribution of X changes over time, **the model gradually stops working**
    - Well-known normalization variants
        - Batch Renormalization
        - Layer Normalization
        - Instance Normalization
        - Group Normalization
        - Weight Normalization
        - Spectrum Normalization
:::

:::warning
++Conclusion:++
- According to both experiments and theoretical analysis, batch normalization **helps optimization (it improves the error surface)**
- Advantages of batch normalization
    - Trains faster (quicker convergence)
    - Allows higher learning rates
    - Parameter initialization is easier
    - Activation functions tend to saturate or stop learning during training; batch normalization revives them (it makes activation functions viable by regulating their inputs)
    - Better results overall
    - It has a **Dropout-like regularizing effect** against overfitting; with batch normalization you can use less Dropout, otherwise you may underfit
- **Explanation**: without BN, if input $x_1$ is very small and $x_2$ is very large, then $W_1$ has to be very large so that $x_1 W_1$ is comparable to $x_2 W_2$. This only works as long as $x_1$ stays small; if the test data behaves differently, the model overfits.
:::

## 0716 [【機器學習2021】卷積神經網路 (Convolutional Neural Networks, CNN)](https://www.youtube.com/watch?v=OP5HcXJg2Aw&list=PLJV_el3uVTsMhtt7_Y6sgTHGHp1Vb2P2J&index=10&ab_channel=Hung-yiLee)
[![cnn_v4.pdf](https://i.imgur.com/TfnlIqI.png)](https://speech.ee.ntu.edu.tw/~hylee/ml/ml2021-course-data/cnn_v4.pdf)

### Image Classification (Convolutional Layers & Pooling)
- The output is a one-hot vector, e.g. output = 2000 x 1 -> 2000 possible classes
- 3-D tensor = height x width x channels
- Typical settings for the receptive field:
    - Simplification 1
        - All channels (**kernel** size: 3x3)
        - Usually 64 or 128 neurons cover the same receptive field
        - Stride: step size of the shift; receptive fields should overlap so no information is missed
        - The borders are handled with padding
    - Simplification 2
        - Neurons can **share parameters** (only across different receptive fields; on the same field the outputs would be identical)
- The whole CNN (a minimal PyTorch sketch of this pipeline follows the list)
    1. Convolution (the region each filter effectively sees grows with depth)
        - The deeper the CNN, the larger the pattern in the original image that a 3x3 filter can capture ![](https://i.imgur.com/gEAeHUx.png =600x400)
    2. Pooling (shrinks the image)
        - Max pooling (within each NxN patch, keep only the maximum)
        - It reduces computation but may ==miss details==; with enough compute you can skip pooling and use convolution all the way
    3. Flatten (stretch the matrices into a vector) ![](https://i.imgur.com/gYM1YqV.png =600x400)
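A minimal sketch of the convolution → pooling → flatten pipeline described above, in PyTorch. The channel counts, kernel sizes, input resolution, and the 10-class output are illustrative assumptions, not the lecture's settings.

```python
import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    def __init__(self, num_classes=10):                   # 10 classes is an assumption
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, padding=1),   # 3x3 kernels over all 3 channels
            nn.ReLU(),
            nn.MaxPool2d(2),                              # max pooling: keep the max in each 2x2 patch
            nn.Conv2d(64, 128, kernel_size=3, padding=1), # deeper layer: effectively sees a larger region
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),                                 # stretch the feature maps into a vector
            nn.Linear(128 * 8 * 8, num_classes),          # assumes 32x32 inputs -> 8x8 after two poolings
        )

    def forward(self, x):
        return self.classifier(self.features(x))

x = torch.randn(4, 3, 32, 32)        # a batch of 4 RGB images, 32x32
logits = TinyCNN()(x)
print(logits.shape)                  # torch.Size([4, 10])
```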
:::info
- Terminology:
    - Tensor: a matrix with more than two dimensions
    - Channel: colour intensity; channel = 3 means RGB colour, channel = 1 means greyscale
    - Receptive field: instead of being fully connected, each neuron connects only to a small local region of the input
        - Fields can overlap (several neurons can even share the same field, to capture a more complete pattern)
        - They can be rectangular
        - They can look at only some of the channels
    - Filter: the name for a set of shared neuron parameters, used to search the image for patterns that match the filter
:::

:::warning
++Conclusion:++
- Convolutional layer ![](https://i.imgur.com/yGR85mt.png =550x350)
- A CNN does not recognize **scaled, shrunk, or rotated** versions of an image
    - ==Data augmentation== addresses this: during training, crop out patches of the image and scale or rotate them so the CNN can learn from them
:::

---
[【機器學習2021】自注意力機制 (Self-attention) (上)](https://www.youtube.com/watch?v=hYdO9CscNes&list=PLJV_el3uVTsMhtt7_Y6sgTHGHp1Vb2P2J&index=10&ab_channel=Hung-yiLee) [![self_v7.pdf](https://i.imgur.com/TfnlIqI.png)](https://speech.ee.ntu.edu.tw/~hylee/ml/ml2021-course-data/self_v7.pdf)
[【機器學習2021】自注意力機制 (Self-attention) (下)](https://www.youtube.com/watch?v=gmsMY5kc-zw&list=PLJV_el3uVTsMhtt7_Y6sgTHGHp1Vb2P2J&index=11&ab_channel=Hung-yiLee) [![self_v7.pdf](https://i.imgur.com/TfnlIqI.png)](https://speech.ee.ntu.edu.tw/~hylee/ml/ml2021-course-data/self_v7.pdf)

### Self Attention
- Self-attention turns the whole sequence into vectors that carry context
- Text, speech, video, and graphs (social networks, molecules) can all be represented as vector sets
- Input length = output length (e.g. POS tagging, speech recognition, per-node properties of a social graph)
- The whole sequence gets a single label (e.g. sentiment analysis, speaker identification, molecule classification) ![](https://i.imgur.com/njmbn9s.png =600x400)
- Implementation (a small code sketch follows at the end of this section) ![](https://i.imgur.com/OI2nNZu.png =600x400)
    - To compute $b^1$, take $a^1$ as the query and the other $a^x$ as keys; after multiplying each by its weight matrix, take the **dot product** to get the attention scores (**a larger score means a stronger association**)
    - A vector can also act as query and key at the same time and attend to itself
    - The soft-max (formula in the top-right corner) can be replaced with other activation functions ![](https://i.imgur.com/ZcflmFD.png =600x400)
    - From $b^i$ we can tell which $a'$ has the largest influence, **and then extract the key information from $v^i$** (?)
    - The formula in the top-right corner gives $b^1$; each $b^i$ is computed the same way, taking position i as the query and the others as keys ![](https://i.imgur.com/xWMT8hb.png =600x400)

:::info
- Terminology:
    - Multi-head self-attention
        - The number of heads is how many q, k, v triplets are generated per input
        - With 2 heads, the output is two results ($b^{i,1}$ & $b^{i,2}$), which are then transformed by a matrix W into $b^i$
    - Truncated self-attention (you may not need to attend over the whole input -> e.g. speech carries a huge amount of data)
    - Positional encoding
        - ==Self-attention has no notion of position==
        - Each position gets its own vector $e^i$; adding $e^i$ to the input marks the position
        - It can be hand-crafted or learned from data
:::

:::warning
++Conclusion:++
- CNN v.s. self-attention
    - A CNN pixel only considers its receptive field, while self-attention computes associations with everything
    - With the right parameter settings, self-attention ==can do exactly what a CNN does== (CNN is a subset of self-attention)
    - **With little data, the more constrained model** (CNN) is a better fit; a more flexible model may overfit
- RNN v.s. self-attention
    - Even though an RNN can be bidirectional, it is still hard for the right-most output to take the very first input into account
    - RNNs cannot be parallelized (self-attention is gradually replacing RNN architectures)
:::
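A compact NumPy sketch of the single-head self-attention computation described above: project the inputs to queries, keys, and values, take dot-product attention scores, apply softmax, and form the weighted sum. The matrix sizes and random weights are illustrative assumptions; the 1/sqrt(d_k) scaling follows the usual convention.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)

seq_len, d_in, d_k = 4, 8, 16          # illustrative sizes, not the lecture's
A = rng.normal(size=(seq_len, d_in))   # the input vectors a^1 ... a^4

W_q = rng.normal(size=(d_in, d_k))     # weight matrices for query, key, value
W_k = rng.normal(size=(d_in, d_k))
W_v = rng.normal(size=(d_in, d_k))

Q, K, V = A @ W_q, A @ W_k, A @ W_v    # q^i, k^i, v^i for every position

# attention scores: dot products between every query and every key
# (a larger score means a stronger association); each vector also attends to itself
scores = Q @ K.T / np.sqrt(d_k)        # the 1/sqrt(d_k) scaling is the common convention

alpha = softmax(scores, axis=-1)       # softmax could be swapped for another activation
B = alpha @ V                          # b^i = weighted sum of the value vectors

print(B.shape)                         # (4, 16): one context-aware vector per input position
```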
## 0720 [【機器學習2021】Transformer (上)](https://www.youtube.com/watch?v=n9TlOhRjYoc&list=PLJV_el3uVTsMhtt7_Y6sgTHGHp1Vb2P2J&index=14&ab_channel=Hung-yiLee) [![seq2seq_v9.pdf](https://i.imgur.com/TfnlIqI.png)](https://speech.ee.ntu.edu.tw/~hylee/ml/ml2021-course-data/seq2seq_v9.pdf)
[【機器學習2021】Transformer (下)](https://www.youtube.com/watch?v=N6aRv06iv2g&list=PLJV_el3uVTsMhtt7_Y6sgTHGHp1Vb2P2J&index=14&ab_channel=Hung-yiLee) [![seq2seq_v9.pdf](https://i.imgur.com/TfnlIqI.png)](https://speech.ee.ntu.edu.tw/~hylee/ml/ml2021-course-data/seq2seq_v9.pdf)

### Transformer (Seq2Seq)
- Input a sequence, output a sequence
    - Speech recognition; speech translation (useful when the input language has no written form; possible training data: local TV dramas)
    - Machine translation
    - Chatbots
    - Syntactic parsing
    - QA can be done by seq2seq
        - question, context -> Seq2Seq model -> answer

#### Encoder
![](https://i.imgur.com/cNFd4nt.png =600x400)
- Positional encoding compensates for self-attention having no position information
- Add & Norm -> residual connection + layer norm
- Feed forward (fully connected network) ![](https://i.imgur.com/IkSjypr.png =200x250)

## 0724
#### Decoder (Autoregressive)
- Masked self-attention ![](https://i.imgur.com/SmpKyGQ.png =600x400)
    - Do not attend to tokens that have not appeared yet (e.g. $b^2$ only considers $a^1$ & $a^2$)
    - The decoder produces its output one token at a time and feeds each token back in as input, so it can only look at what is currently to its left
- AT v.s. NAT ![](https://i.imgur.com/CbRsrUU.png =600x400)
- The decoder has one step more than the encoder: cross attention ![](https://i.imgur.com/AHfk0kJ.png =600x400)
    - The original paper feeds the output of the encoder's last layer into the decoder

### Training
- Copy mechanism (chatbots: copy words from the input into the output; also document summarization)
- Guided attention
    - Used for speech recognition and speech synthesis
    - In speech synthesis the attention should move from left to right, but training may not produce that, so it has to be ==enforced==
- Beam search
    - Taking the highest-scoring token at every step is greedy decoding, which may not give the best overall sequence
    - Whether it helps depends on the task: if there is essentially one correct answer (speech recognition), beam search is very useful
    - For sentence completion or TTS (Text to Speech), where many outputs are acceptable, beam search is not useful
- Cross entropy v.s. BLEU score -> to optimize BLEU directly, try reinforcement learning

:::info
- Terminology:
    - Residual connection: widely used in deep learning; the input is added to the output
    - Ground truth: the correct labels against which a supervised classifier is measured
    - Cross entropy: the discrepancy between the predicted probability distribution and the true distribution
    - Teacher forcing: feeding the ground truth to the decoder as input
    - Text to Speech (TTS): speech synthesis
    - Exposure bias: during training the decoder always sees correct inputs, but at test time a single mistake can cascade into many
:::

++Conclusion:++
![](https://i.imgur.com/mtusi3b.png =350x550)

## 0727 [【機器學習2021】自督導式學習 (Self-supervised Learning) (一) – 芝麻街與進擊的巨人](https://www.youtube.com/watch?v=e422eloJ0W4&list=PLJV_el3uVTsMhtt7_Y6sgTHGHp1Vb2P2J&index=18&ab_channel=Hung-yiLee)
[【機器學習2021】自督導式學習 (Self-supervised Learning) (二) – BERT簡介](https://www.youtube.com/watch?v=gh0hewYkjgo&list=PLJV_el3uVTsMhtt7_Y6sgTHGHp1Vb2P2J&index=19&ab_channel=Hung-yiLee)
[【機器學習2021】自督導式學習 (Self-supervised Learning) (三) – BERT的奇聞軼事](https://www.youtube.com/watch?v=ExXA05i8DEQ&list=PLJV_el3uVTsMhtt7_Y6sgTHGHp1Vb2P2J&index=21&ab_channel=Hung-yiLee)
[【機器學習2021】自督導式學習 (Self-supervised Learning) (四) – GPT的野望](https://www.youtube.com/watch?v=WY_E0Sd4K80&list=PLJV_el3uVTsMhtt7_Y6sgTHGHp1Vb2P2J&index=21&ab_channel=Hung-yiLee)
[![bert_v8.pdf](https://i.imgur.com/TfnlIqI.png)](https://speech.ee.ntu.edu.tw/~hylee/ml/ml2021-course-data/bert_v8.pdf)

### Self-supervised Learning
#### BERT (Transformer Encoder)
- Input = (word + segment + position) embedding
    - [Implemented way](https://medium.com/@_init_/why-bert-has-3-embedding-layers-and-their-implementation-details-9c261108e28a)
- Pre-train (a small sketch of the masked-token objective follows this list)
    1. Masked token prediction (SpanBERT studies masking several tokens at once, choosing the span length probabilistically)
    2. Next Sentence Prediction (RoBERTa found it not very useful)
        - SOP (Sentence Order Prediction) is more useful -> used in ALBERT
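A toy sketch of the masked-token-prediction objective: randomly replace some token ids with a mask id, run a small (randomly initialized) encoder, and compute cross-entropy only at the masked positions. The vocabulary size, mask rate, and the tiny `nn.TransformerEncoder` are illustrative assumptions, not BERT's actual configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, d_model, mask_id = 1000, 64, 0       # assumed values; real BERT uses a ~30k vocabulary

embed = nn.Embedding(vocab_size, d_model)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)
to_vocab = nn.Linear(d_model, vocab_size)        # predicts a distribution over the vocabulary

tokens = torch.randint(1, vocab_size, (2, 16))   # a batch of 2 "sentences" of 16 token ids

# randomly mask ~15% of the positions (the 15% rate follows BERT's convention)
mask = torch.rand(tokens.shape) < 0.15
corrupted = tokens.masked_fill(mask, mask_id)

hidden = encoder(embed(corrupted))               # contextual vectors for every position
logits = to_vocab(hidden)                        # (batch, seq_len, vocab_size)

# the loss is computed only on the masked positions: predict the original tokens back
loss = F.cross_entropy(logits[mask], tokens[mask])
print(loss.item())
```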
- Fine-tune (how to use BERT)
    - Case 1: Sentiment analysis
        - Input a sequence, output yes or no
        - BERT (initialized by pre-training, better than random initialization)
        - Linear transform (randomly initialized, updated by gradient descent)
        - Training from scratch vs. fine-tuning (fine-tuning wins)
    - Case 2: POS (part of speech) tagging
        - Input sequence length = output length
    - Case 3: Natural Language Inference
        - Input two sequences, output a class (e.g. contradiction or not)
    - Case 4: Extraction-based Question Answering
        - The output (2 integers: start, end) always lies inside the input (document, query)
        - Two randomly initialized vectors take an inner product with the document embeddings, followed by softmax, to find the start and end positions
- GLUE scores: human performance is taken as the baseline (1), and the machine is scored relative to humans
- Pre-train + fine-tune is a kind of semi-supervised method
- (Pre-train) While learning to fill in the blanks, BERT also learns from ==context== that the same word can have different meanings -> similar to (CBOW) word embedding, but **contextualized** word embedding
- Multi-lingual BERT (104 languages in pre-training)
    - The amount of data matters, the more the better -> better alignment
    - Although embeddings of the same meaning in different languages end up close, the model still knows which language is which
- GPT (predict the next token) ![](https://i.imgur.com/Lud0n3k.png =600x400)
    - Like a transformer decoder: the attention never looks at later inputs
    - GPT models are so large that fine-tuning may be infeasible -> ==in-context learning==
        - Input: task description, examples, prompt (asking for the answer to this question)
        - A few examples -> few-shot learning
        - Only one example -> one-shot learning
        - No examples -> zero-shot learning
- Self-supervised learning can also be applied to images, speech, ...

:::info
- Terminology:
    - Downstream tasks
        - The tasks we actually care about
        - We have a little bit of labeled data
    - Scratch: initializing without a pre-trained model
:::

## 0810 [【機器學習2021】生成式對抗網路 (Generative Adversarial Network, GAN) (一) – 基本概念介紹](https://www.youtube.com/watch?v=4OWp0wDu6Xw&list=PLJV_el3uVTsMhtt7_Y6sgTHGHp1Vb2P2J&index=15&ab_channel=Hung-yiLee)
[【機器學習2021】生成式對抗網路 (Generative Adversarial Network, GAN) (二) – 理論介紹與WGAN](https://www.youtube.com/watch?v=jNY1WBb8l4U&list=PLJV_el3uVTsMhtt7_Y6sgTHGHp1Vb2P2J&index=15&ab_channel=Hung-yiLee)
[【機器學習2021】生成式對抗網路 (Generative Adversarial Network, GAN) (三) – 生成器效能評估與條件式生成](https://www.youtube.com/watch?v=MP0BnVH2yOo&ab_channel=Hung-yiLee)
[【機器學習2021】生成式對抗網路 (Generative Adversarial Network, GAN) (四) – Cycle GAN](https://www.youtube.com/watch?v=wulqhgnDr7E&list=PLJV_el3uVTsMhtt7_Y6sgTHGHp1Vb2P2J&index=17&ab_channel=Hung-yiLee)
[![gan_v10.pdf](https://i.imgur.com/TfnlIqI.png)](https://speech.ee.ntu.edu.tw/~hylee/ml/ml2021-course-data/gan_v10.pdf)

### Generative Adversarial Network (GAN)
![](https://i.imgur.com/ti4auSJ.png =600x400)
- Passing the input through the network outputs a distribution -> Generator
    - Normal distribution -> Generator -> complex distribution
- Why do we need a distribution?
    - The same input can have many different outputs (creativity is required)
    - i.e. drawing, video prediction
- A second mechanism is added: the Discriminator
    - Input: image, output: a scalar (large: real, small: fake)
    - It distinguishes the generator's outputs from real images, pushing the generator to improve its model
    - This interaction is called adversarial (the generator tries to fool the discriminator)
- Algorithm (a training-loop sketch follows this list)
    - Step 1. Fix the Generator, train the Discriminator
    - Step 2. Fix the Discriminator, train the Generator
    - Repeat Steps 1 & 2 ![](https://i.imgur.com/bK7DDEx.png =600x400)
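A hedged sketch of the alternating training loop described above, in PyTorch, with placeholder fully-connected generator/discriminator networks and made-up sizes; a real setup would use proper architectures, a real dataset and dataloader, and more careful loss handling.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
z_dim, x_dim = 16, 64                               # assumed latent / data dimensions

G = nn.Sequential(nn.Linear(z_dim, 128), nn.ReLU(), nn.Linear(128, x_dim))
D = nn.Sequential(nn.Linear(x_dim, 128), nn.ReLU(), nn.Linear(128, 1))
opt_G = torch.optim.Adam(G.parameters(), lr=1e-4)
opt_D = torch.optim.Adam(D.parameters(), lr=1e-4)
bce = nn.BCEWithLogitsLoss()

real_data = torch.randn(256, x_dim)                 # stand-in for a real dataset

for step in range(100):
    real = real_data[torch.randint(0, 256, (32,))]  # sample a batch of real examples
    z = torch.randn(32, z_dim)                      # sample from a normal distribution

    # Step 1: fix the generator, train the discriminator (real -> 1, fake -> 0)
    fake = G(z).detach()                            # detach = the generator stays fixed here
    loss_D = bce(D(real), torch.ones(32, 1)) + bce(D(fake), torch.zeros(32, 1))
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()

    # Step 2: fix the discriminator, train the generator to fool it (fake -> 1)
    loss_G = bce(D(G(z)), torch.ones(32, 1))
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
```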
- How to estimate the divergence?
    - We do not know the distributions $P_G$ & $P_{data}$, but we can draw samples from them
        - $P_G$: samples produced by the Generator; $P_{data}$: samples drawn from the database
    - ![](https://i.imgur.com/Oqww1Qo.png =600x400)
    - For a fixed generator, train the discriminator to maximize the objective function
    - Then pick the generator that minimizes that maximum
- Tips for GAN
    - JS divergence: it cannot reflect how far apart the distributions are; the value is $\log 2$ unless they overlap, in which case it is 0 (works poorly)
    - Another solution (Wasserstein distance)
        - Imagine an earth mover pushing the dirt of P over to Q; there are infinitely many ways to move it
        - Enumerate all the moving plans and take the one with the smallest cost
    - WGAN (spectral normalization) -> keep the gradient below 1 everywhere
- Evaluation
    - For a single image, the more concentrated the class distribution, the better the **quality**
    - Averaged over many images, the flatter the class distribution, the higher the **diversity**
    - Inception Score (IS): good quality, large diversity
    - We also have to make sure the generator is not simply reproducing the real data
- Conditional GAN
    - Input: image + text description, output: an image matching the description -> image translation
    - Combining GAN + supervised learning gives better generation results
- Using GAN in an unsupervised setting (everything above is mainly supervised)
    - Cycle GAN
        - Image style transfer, 3D -> 2D (real photos to anime-style drawings) ![](https://i.imgur.com/33YHFqn.png =600x400)
        - It needs 2 Generators & 2 Discriminators
        - They convert in both directions, 3D -> 2D and 2D -> 3D
        - Goal: the generated 2D image should keep the characteristics of the original 3D photo
    - Text style transfer (Seq2Seq) ![](https://i.imgur.com/70JBxwE.png =600x400)
        - Convert a negative sentence into a positive sentence
        - Document -> summary, Language 1 -> Language 2, Audio -> Text

:::info
- Terminology:
    - Mode collapse: the generator ends up producing only a handful of images (late in training they almost all look alike); it exploits a blind spot of the discriminator and keeps generating the same kind of image that fools it
    - Mode dropping: diversity looks sufficient and there is no mode collapse, but every generated photo still shares some common trait (part of the real distribution is never produced)
:::

## 0817 [【機器學習2021】自編碼器 (Auto-encoder) (上) – 基本概念](https://www.youtube.com/watch?v=3oHlf8-J3Nc&list=PLJV_el3uVTsMhtt7_Y6sgTHGHp1Vb2P2J&index=22&ab_channel=Hung-yiLee)
[【機器學習2021】自編碼器 (Auto-encoder) (下) – 領結變聲器與更多應用](https://www.youtube.com/watch?v=JZvEzb5PV3U&list=PLJV_el3uVTsMhtt7_Y6sgTHGHp1Vb2P2J&index=23&ab_channel=Hung-yiLee)
[![auto_v8.pdf](https://i.imgur.com/TfnlIqI.png)](https://speech.ee.ntu.edu.tw/~hylee/ml/ml2021-course-data/auto_v8.pdf)

### Auto-encoder
![](https://i.imgur.com/moi4vmG.png =600x400)
- Also called: dimension reduction
- De-noising auto-encoder (noise is added to the input; the output must reconstruct the clean input)
    - The MASK in BERT follows a similar idea
- Feature disentanglement (also applicable to speech)
    - Can the code (the vector the encoder produces from the input) be further split into the information it represents?
    - Vector = content + speaker info
    - Application: voice conversion
- Discrete latent representation
    - The code can be binary, or even a one-hot vector ![](https://i.imgur.com/DAEhnes.png =600x400) ![](https://i.imgur.com/78vj7I8.png =600x400)
- More applications
    - Compression (image -> low dimension -> image; lossy)
    - Anomaly detection
        - Decide whether an input resembles the training data
        - Compare the auto-encoder's input & output; a large reconstruction loss indicates an anomaly

## 0819 [【機器學習2021】來自人類的惡意攻擊 (Adversarial Attack) (上) – 基本概念](https://www.youtube.com/watch?v=xGQKhbjrFRk&list=PLJV_el3uVTsMhtt7_Y6sgTHGHp1Vb2P2J&index=24&ab_channel=Hung-yiLee)
[【機器學習2021】來自人類的惡意攻擊 (Adversarial Attack) (下) – 類神經網路能否躲過人類深不見底的惡意?](https://www.youtube.com/watch?v=z-Q9ia5H2Ig&list=PLJV_el3uVTsMhtt7_Y6sgTHGHp1Vb2P2J&index=25&ab_channel=Hung-yiLee)
[![attack_v3.pdf](https://i.imgur.com/TfnlIqI.png)](https://speech.ee.ntu.edu.tw/~hylee/ml/ml2021-course-data/attack_v3.pdf)

### Adversarial Attack
- How to attack
    - Benign image (the original image), attacked image (noise is added, invisible to the human eye)
    - Non-targeted: any wrong output will do
        - Maximize the cross-entropy between the output and the correct answer -> i.e. minimize the negative cross-entropy
    - Targeted: not just wrong, the output must be a chosen target ![](https://i.imgur.com/6EypG7G.png =600x400) ![](https://i.imgur.com/df6gB3h.png =600x400)
    - The goal is noise that humans cannot see but that makes a big difference to the machine
        - The smaller the L-infinity norm, the harder it is for humans to notice
    - The attack is trained like gradient descent, minimizing the loss
        - But the input is adjusted instead of the parameters
        - When the perturbation becomes large enough to be visible, project it back into the invisible square (the L-infinity box) ![](https://i.imgur.com/zO5Zvzt.png =200x200)
    - FGSM (Fast Gradient Sign Method); a code sketch follows at the end of this attack list
        - Only one update
        - The sign method forces each gradient component to be +-1
        - The attacked point always lands on one of the four corners of the L-infinity box
    - White box attack (the network parameters must be known to attack)
    - Black box attack
        - Train a proxy network on the same training set
        - The black-box network & the proxy network are likely similar
        - Attacking the proxy network can then achieve the same effect
    - Ensemble attack (attack with a model that fools several networks at once)
    - Main reason -> the features of the data are probably similar across different datasets
        - So the data itself also plays a role
- Attacks also exist in speech processing and NLP
- Backdoor in the model (the attack starts at the training stage)
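A hedged sketch of the one-step FGSM update described above, in PyTorch. The placeholder classifier, the 32x32 input size, and the epsilon value are assumptions for illustration rather than anything from the lecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def fgsm_attack(model: nn.Module, x: torch.Tensor, y: torch.Tensor, eps: float = 0.03):
    """One-step FGSM: move each input pixel by +-eps in the direction that increases the loss."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)     # non-targeted: push the output away from the true label
    loss.backward()
    # sign() keeps only the direction (+1 / -1), so the update lands on a corner of the L-infinity box
    x_adv = x_adv + eps * x_adv.grad.sign()
    return x_adv.clamp(0, 1).detach()           # keep pixel values in a valid range

# toy usage with a placeholder classifier (assumed, not the lecture's network)
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
x = torch.rand(1, 3, 32, 32)                    # a "benign image"
y = torch.tensor([3])                           # its correct label
x_attack = fgsm_attack(model, x, y)
print((x_attack - x).abs().max())               # the perturbation never exceeds eps
```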
### Defense
- Passive
    - Add a filter to block the noisy signal
        - Smoothing -> has side effects
        - Image compression, passing the image through a generator
    - Randomization -> transform the image randomly so the attacker does not know what it is actually attacking
- Proactive
    - Adversarial training (like data augmentation)
        - During training, attack your own model to produce a set of wrongly labeled examples
        - Relabel those examples with the correct labels and train on them
        - Hard to do with a large dataset -> it roughly doubles the amount of data to generate
- Conclusion ![](https://i.imgur.com/SfF6NtL.png =550x180)

## 0825 [【機器學習2021】機器學習模型的可解釋性 (Explainable ML) (上) – 為什麼類神經網路可以正確分辨寶可夢和數碼寶貝呢?](https://www.youtube.com/watch?v=WQY85vaQfTI&list=PLJV_el3uVTsMhtt7_Y6sgTHGHp1Vb2P2J&index=27&ab_channel=Hung-yiLee)
[【機器學習2021】機器學習模型的可解釋性 (Explainable ML) (下) –機器心中的貓長什麼樣子?](https://www.youtube.com/watch?v=0ayIPqbdHYQ&list=PLJV_el3uVTsMhtt7_Y6sgTHGHp1Vb2P2J&index=27&ab_channel=Hung-yiLee)
[![xai_v4.pdf](https://i.imgur.com/TfnlIqI.png)](https://speech.ee.ntu.edu.tw/~hylee/ml/ml2021-course-data/xai_v4.pdf)

### Explainable ML
- Local explanation -> why is this particular image a cat? ![](https://i.imgur.com/Gh5Ks4f.jpg =600x400)
    - A saliency map reveals what the machine bases its decision on
    - It helps ensure the machine is right for the right reasons (i.e. not classifying a water buffalo just because it saw water plants)
    - SmoothGrad -> add random noise to the image several times and average the results; the saliency map becomes much clearer
    - Probing
        - Train a classifier on top of some layer of the network; its accuracy tells us whether that layer contains the information we want (be careful: low accuracy may also mean the probe was badly trained or its hyperparameters were badly set)
        - i.e. attach a part-of-speech classifier to a layer to check whether it encodes POS information
        - i.e. check whether some layer strips away speaker characteristics and keeps only the content of the speech
- Global explanation -> what does a cat look like to the network? ![](https://i.imgur.com/KLy6Tqn.png =600x400)

## 0901 [【機器學習2021】概述領域自適應 (Domain Adaptation)](https://www.youtube.com/watch?v=Mnk_oUrgppM&list=PLJV_el3uVTsMhtt7_Y6sgTHGHp1Vb2P2J&index=30&ab_channel=Hung-yiLee)
[![da_v6.pdf](https://i.imgur.com/TfnlIqI.png)](https://speech.ee.ntu.edu.tw/~hylee/ml/ml2021-course-data/da_v6.pdf)

### Domain Adaptation
- Domain shift: if the training set and the test set have different data distributions, results may be poor
- When to use it: you have a large amount of **unlabeled** target data
- The hope is that a feature extractor maps the two data sets onto the same feature distribution (a sketch of the adversarial training setup follows this list) ![](https://i.imgur.com/KdpWgH6.png =600x400)
    - The domain classifier is a binary classifier
    - Source = labeled data, Target = unlabeled data
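A hedged sketch of the feature extractor + domain classifier idea above, written in the usual DANN style with a gradient reversal layer; the notes only describe the high-level picture, so the layer sizes, the unweighted loss sum, and the gradient-reversal trick itself are assumptions about one standard way to implement it.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; flips the gradient sign in the backward pass,
    so the feature extractor learns features the domain classifier cannot separate."""
    @staticmethod
    def forward(ctx, x):
        return x.view_as(x)
    @staticmethod
    def backward(ctx, grad_output):
        return -grad_output

feature_extractor = nn.Sequential(nn.Linear(100, 64), nn.ReLU())   # sizes are assumptions
label_predictor   = nn.Linear(64, 10)    # trained on labeled source data only
domain_classifier = nn.Linear(64, 1)     # binary: source vs. target

source_x, source_y = torch.randn(32, 100), torch.randint(0, 10, (32,))
target_x = torch.randn(32, 100)          # unlabeled target-domain data

f_src, f_tgt = feature_extractor(source_x), feature_extractor(target_x)

# task loss: only the labeled source data has class labels
task_loss = F.cross_entropy(label_predictor(f_src), source_y)

# domain loss: the classifier tries to tell source (1) from target (0);
# the reversed gradient pushes the feature extractor to make the two indistinguishable
feats = torch.cat([f_src, f_tgt])
domain_logits = domain_classifier(GradReverse.apply(feats))
domain_y = torch.cat([torch.ones(32, 1), torch.zeros(32, 1)])
domain_loss = F.binary_cross_entropy_with_logits(domain_logits, domain_y)

(task_loss + domain_loss).backward()     # one combined optimizer step would follow
```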