:::info
* **Special thanks to Vickie; part of this note is modified from hers.**
* For Colab and PyTorch tutorial, see ML2022 slides.
* **Still, this note does not cover the whole course; it focuses only on classical ANN, CNN, and GNN**.
* RNN has largely been replaced by self-attention and is covered in 紀老師's handouts.
Based on:
1. Prof. Lee's ML 2021 ([YouTube](https://www.youtube.com/playlist?list=PLJV_el3uVTsPM2mM-OQzJXziCGJa8nJL8), [Homepage](https://speech.ee.ntu.edu.tw/~hylee/ml/2021-spring.php), [Github](https://github.com/ga642381/ML2021-Spring))
2. Prof. Lee's ML 2022 ([YouTube](https://www.youtube.com/watch?v=7XZR0-4uS5s&list=PLJV_el3uVTsPM2mM-OQzJXziCGJa8nJL8), [Homepage](https://speech.ee.ntu.edu.tw/~hylee/ml/2022-spring.php), [Github](https://github.com/virginiakm1988/ML2022-Spring)) & Homeworks
3. extra materials (video links @ each section)
4. 紀老師's deep learning handouts (計資中心)
:::
###### tags: `AI`
[TOC]
# Intro. of ML / DL
:::success
What is machine learning? It is simply the process of a machine finding a function!
For example, if we want to do speech processing, the machine can use a function that takes an audio signal as input and outputs the content of that signal!
:::
## Different types of Functions
* **Regression** : The function outputs a **scalar**.
* **Classification** : Given **options (classes)**, the function outputs the correct one. (e.g. playing Go, classifying spam mails, etc.)
* **Structured Learning** : create **something with structure** (e.g. an image or a document).
## Machine Learning - Find the function

### 1. Function with Unknown Parameters -> domain knowledge

### 2. Define Loss from Training Data
* Loss is a function of the parameters, $L(b,w)$, measuring how good a set of parameter values is (smaller is better); a tiny numeric check follows after this list
* Loss : $L= {1\over N} \sum\limits_ne_n$
* $e=|y-\hat y|$ ----- $L$ is mean absolute error **(MAE)**
* $e=(y-\hat y)^2$ ------ $L$ is mean square error **(MSE)**
* If $y$ and $\hat y$ are both probability distributions ------ **Cross-entropy (get to this later)**
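A tiny NumPy check of the two losses above (the example values are made up):
```python
import numpy as np

y_hat = np.array([1.0, 2.0, 3.0])   # labels (hat y)
y     = np.array([1.5, 1.0, 2.0])   # model outputs

mae = np.mean(np.abs(y - y_hat))    # L with e = |y - hat y|   -> 0.833...
mse = np.mean((y - y_hat) ** 2)     # L with e = (y - hat y)^2 -> 0.75
print(mae, mse)
```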

### 3. Optimization
:::info
$w^*, b^*=\arg \min\limits_{w,b} L$
:::
* **Gradient Descent**
* (Randomly) Pick an initial value $w^0$
* Compute ${\partial L \over \partial w}|_{w=w^0}$ to decide which way to step; the larger/smaller the |slope|, the larger/smaller the step
* Negative $\rightarrow$ Increase $w$
* Positive $\rightarrow$ Decrease $w$
* Step size: <text style="color:red">$\eta$</text>${\partial L \over \partial w}|_{w=w^0}$
* <text style="color:red">$\eta$</text> : learning rate
==Things like this that you have to set yourself are called hyperparameters==

* Update $w$ iteratively: what if we accidentally land on a local extremum? It turns out this is not really a problem. More on that later! (If you have experience training neural networks, you will know there is nothing to worry about.)

* Generalize to two parameters (see the sketch below)
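A minimal sketch for the two-parameter case, assuming the model $y=b+wx$ and MSE loss (the toy data and $\eta$ are made up):
```python
import numpy as np

# toy data: y is roughly 2x + 1
x = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = np.array([3.1, 4.9, 7.2, 8.8])

w, b = 0.0, 0.0          # (randomly) picked initial values
eta = 0.05               # learning rate (a hyperparameter)

for step in range(2000):
    y = b + w * x                          # model prediction
    grad_w = np.mean(2 * (y - y_hat) * x)  # dL/dw for the MSE loss
    grad_b = np.mean(2 * (y - y_hat))      # dL/db for the MSE loss
    w -= eta * grad_w                      # step against the gradient
    b -= eta * grad_b

print(w, b)   # ends up roughly around 2 and 1, matching the toy data
```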


### 4. Increasing the number of features to consider...?

:::danger
But the linear model has a severe issue: **model bias**. Many target curves are not just a straight line, so we need to join several piecewise-linear segments into a continuous curve.

Let's see how to handle this below!


:::
## Issue with linear model
### Step 1. Function with Unknown Parameters - Sigmoid
#### Use a smooth sigmoid function to approximate the hard sigmoid (the one with corners)

:::success
This is one reason people say feature scaling is needed $\rightarrow$ we want the feature distribution to fall inside the region where the sigmoid's gradient actually varies!
:::
##### Adjusting $w$, $b$, $c$ changes the slope, the shift, and the height, respectively

##### The red curve = a combination of many sigmoid functions; more pieces, smoother curve


##### Break it down further, and write it as matrices
* $i$ indexes each sigmoid function; think of it as a piece at a different position / with a different shape
* $j$ indexes each feature, e.g. the view counts of the previous seven days
* $r$ is what will be fed into the sigmoid function

#### Write it in matrix form

#### Feed $r$ into the sigmoid to get $a$

#### Obtain $y$


==**Note**: the green $b$ is a vector, the gray $b$ is a scalar== (a small numeric sketch below)
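A minimal NumPy sketch of the model above, $y = b + \sum_i c_i \,\mathrm{sigmoid}(b_i + \sum_j w_{ij} x_j)$, with made-up shapes (3 sigmoids, 7 features) and random parameters:
```python
import numpy as np

def sigmoid(r):
    return 1.0 / (1.0 + np.exp(-r))

rng = np.random.default_rng(0)
x = rng.random(7)        # features, e.g. the view counts of the previous 7 days
W = rng.random((3, 7))   # w_ij
b_vec = rng.random(3)    # the green b (a vector)
c = rng.random(3)        # c_i
b = 0.5                  # the gray b (a scalar)

r = b_vec + W @ x        # r_i = b_i + sum_j w_ij x_j
a = sigmoid(r)           # a = sigmoid(r)
y = b + c @ a            # y = b + sum_i c_i a_i
print(y)
```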
### Step 2. Define Loss from Training Data
#### Loss is a function of the parameters: $L(\theta)$

### Step 3. Optimization of New Model
$\theta ^*=\arg \min\limits_{\theta}L$
* (Randomly) Pick an initial value $\theta^0$
* Compute $g$; in practice the gradient almost never becomes exactly 0, usually we just decide when to stop


* In practice : split the training data into many batches when computing $L$
* The loss computed from one batch $B$ is $L^1$
* Of course, if $B$ is large, then $L^1$ may well be close to $L$
* **Batch** -> seeing all batches once = 1 **epoch** -> **shuffle** between epochs -> prevents learning an ordering bias

* epoch & update are different terms
* One epoch alone does not tell you how many updates it contains
* Below is an example (see the sketch after this list)
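A hedged PyTorch sketch of batch training with made-up data and sizes, which also shows why one epoch = several updates:
```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# made-up data: 20 samples, 7 features each
x = torch.randn(20, 7)
y_hat = torch.randn(20, 1)

model = torch.nn.Linear(7, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loader = DataLoader(TensorDataset(x, y_hat), batch_size=5, shuffle=True)  # shuffled every epoch

for epoch in range(3):             # one epoch = seeing every batch once
    for xb, yb in loader:          # each batch -> one update
        loss = torch.nn.functional.mse_loss(model(xb), yb)   # L^1, L^2, ...
        opt.zero_grad()
        loss.backward()
        opt.step()
# here one epoch = 20 / 5 = 4 updates
```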

## ReLU
* Rectified Linear Unit (ReLU)
* Two ReLUs can be combined into one hard sigmoid (see the small check below)

* Sigmoid and ReLU are both called **activation functions** $\rightarrow$ ReLU works better
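A one-line NumPy check (my own toy example) that two ReLUs add up to one hard sigmoid:
```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

x = np.linspace(-2, 3, 11)
hard_sigmoid = relu(x) - relu(x - 1)   # rises linearly from 0 to 1 on [0, 1]
print(np.allclose(hard_sigmoid, np.clip(x, 0, 1)))   # True
```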

### Another hyperparameter?

:::warning
In short, we can go very deep, stacking many layers of sigmoid / ReLU.
As for why we use many layers (a deep network) instead of just wiring up many sigmoids in a single layer (a fat network), that will be explained later!
:::
### This is Deep learning?
* Deep = Many hidden layer
* A function like the sigmoid = a neuron


* This situation, where adding more layers actually makes the error worse, is called overfitting
:::warning
**Overfitting** : Better on training data, worse on unseen data (Testing data).
:::
## Tradeoff of model complexity (Poke vs. Digi)
### Inequality
#### Assume Pokémon drawings have fewer edge lines.... use h as the threshold...

#### L as the loss function; here it is essentially just the error rate

#### Keep in mind that we are never going to have $D_{all}$

#### but we hope they are close

#### Mathematical proof

#### Let $\epsilon = \delta /2$; we want to know the probability...



#### Hoeffding's inequality
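A hedged reconstruction of the bound used in this argument (the standard Hoeffding form, assuming the loss values lie in $[0,1]$ and the $N$ training examples are sampled i.i.d. from $D_{all}$):
* For one particular $h$ : $P\big(|L(h,D_{train})-L(h,D_{all})|>\epsilon\big)\le 2\exp(-2N\epsilon^2)$
* Union bound over the whole hypothesis set ("$D_{train}$ is bad") : $P(D_{train}\ \text{is bad})\le|H|\cdot 2\exp(-2N\epsilon^2)$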


#### example

### Model complexity

#### The problem now is that we usually cannot choose our N; but if we make H very small, the achievable loss will naturally be large (even though the two losses will be close to each other)
#### You can see the picture below

:::info
How do we get the best of both worlds? $\rightarrow$ Deep learning ([video](https://youtu.be/yXd2D5J0QDU?si=okZxHeozCGma1a-w))
* A single hidden layer can produce any function, but:
tall-and-thin > short-and-fat: the deep network can produce more complex functions, and the short-and-fat network needs more parameters, so it overfits more easily; this also means **the tall-and-thin network needs less training data**
* Usually, if the expected function is complex but has regular structure -> deep is suitable
(e.g. image, speech)

:::
# Deep Learning
:::success
Added notes from extra supplementary videos ([video 1](https://youtu.be/Dr-WRlEFefw?si=7H51g3fESWsdlZ5e), [video 2](https://youtu.be/ibJpTrp5mcE))
For extra topics such as initializers, loss functions, and optimizers, see 紀老師's Lec. 6
But ==some of that content is wrong== haha... like the Adam optimizer, etc.

:::
## Deep Learning vs. Machine Learning
:::danger
So...
machine learning $\rightarrow$ deep learning did not make things easier!
Before deep learning, what we did was feature engineering,
i.e. for image recognition we had to carefully extract hand-chosen features and feed them to the function for training.
**Issue: which features to extract**
With deep learning we can brute-force it, but what we have to decide becomes the function itself,
i.e. for image recognition we feed in the raw pixels directly, but you have to design the network structure.
**Issue: how the network should extract features**
Which one is easier? $\rightarrow$ it depends on your task.
:::
### Step 1. Define a function (Neural Network)
* Each neuron has its own weights and bias
* **Given network structure $\rightarrow$ define a function set**
#### Fully connected feedforward network


* These can be expressed in matrix form, **easy for GPUs to compute**.
#### example


==The function we need... is a neural network!==

* How many hidden layers and how many neurons to use is entirely up to us (a minimal sketch follows below)
* So we need **"intuition"** and **"trial and error"**
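A minimal PyTorch sketch (all layer sizes are made up) of a fully connected feedforward network; fixing the structure defines the function set, and the weights and biases inside are the unknown parameters:
```python
import torch.nn as nn

# made-up structure: 7 input features, two hidden layers of 16 neurons, 1 output
model = nn.Sequential(
    nn.Linear(7, 16), nn.Sigmoid(),   # hidden layer 1
    nn.Linear(16, 16), nn.Sigmoid(),  # hidden layer 2
    nn.Linear(16, 1),                 # output layer
)
# each nn.Linear holds a weight matrix and a bias vector,
# so the whole thing is exactly the matrix form mentioned above
```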
### Step 2. & Step 3. Pick good function
==**for cross entropy, here's a**== **[reference](https://flag-editors.medium.com/%E6%A9%9F%E5%99%A8%E5%AD%B8%E7%BF%92%E5%8B%95%E6%89%8B%E5%81%9Alesson-10-%E5%88%B0%E5%BA%95cross-entropy-loss-logistic-loss-log-loss%E6%98%AF%E4%B8%8D%E6%98%AF%E5%90%8C%E6%A8%A3%E7%9A%84%E6%9D%B1%E8%A5%BF-%E4%B8%8A%E7%AF%87-2ebb74d281d)**


## BackPropagation
:::success
To compute the gradients efficiently...

The figure above shows that as long as we can compute $\partial l^n/\partial w$ for one particular data point, summing over all data points gives the partial derivative of the total loss with respect to $w$. (So the calculations below all focus on one data point $\rightarrow$ one neuron.)

:::
### Forward pass


### Backward pass


* Assume the "?" terms are already known
* In the figure above, if the last layer has 1000 neurons, you simply sum 1000 terms!!!

* Now imagine there is another (imaginary) neuron

### Split orange neuron into two cases...
#### Case 1

#### Case 2

* If it is not the output layer, we can still assume its next layer is known, and keep going until we reach the output layer

* ==Conclusion: compute it backwards== (a tiny numerical sketch below)
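A tiny NumPy sketch of one forward pass + backward pass, for a single made-up data point and a 2-neuron hidden layer (all numbers are invented):
```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# made-up tiny network: 2 inputs -> 2 hidden neurons (sigmoid) -> 1 output, loss = (y - y_hat)^2
x = np.array([1.0, -1.0])
y_hat = 1.0
W1, b1 = np.array([[1.0, -2.0], [-1.0, 1.0]]), np.array([0.1, 0.2])
W2, b2 = np.array([2.0, -1.0]), 0.3

# forward pass: compute and cache z and a for every layer
z1 = W1 @ x + b1
a1 = sigmoid(z1)
y = W2 @ a1 + b2
loss = (y - y_hat) ** 2

# backward pass: start from the output and work backwards
dl_dy = 2 * (y - y_hat)
dl_da1 = W2 * dl_dy              # pass the derivative back through the next layer's weights
dl_dz1 = dl_da1 * a1 * (1 - a1)  # sigma'(z) = sigma(z) * (1 - sigma(z))
dl_dW2 = a1 * dl_dy              # dz/dw is the activation feeding that weight
dl_dW1 = np.outer(dl_dz1, x)     # same idea, one layer earlier
print(dl_dW1, dl_dW2)
```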

# General Guidance on workflow
:::info
Framework of ML

:::
## General Guide 1 - Training Loss large
:::warning
The cause could be **1) model bias** or **2) an optimization issue**

:::
### 1) Model Bias
#### The function that could lower the loss simply cannot be represented by the current model (the orange dot)

### 2) Optimization Issue
#### e.g. gradient descent gets stuck at a local minimum (and cannot reach the orange dot)

### 3) How to know which one?
1. **Compare with a deeper model**: the figure below shows an optimization issue, because a 56-layer network can always do whatever a 20-layer one does, so it is not model bias; moreover, the 56-layer network is worse than the 20-layer one even on the training data, so it is not overfitting either.
2. Start from **shallower / simpler** models (which are easier to optimize)
3. Then go deeper; if the deeper model does not achieve a lower training loss -> optimization issue

## General Guide 2 - Testing Loss large
:::warning
The cause could be **1) overfitting** or **2) mismatch**

:::
### 1) overfitting
#### 1 - Use more training data, e.g. collect or look up more yourself
#### 2 - Data augmentation: create new data yourself; e.g. a cat photo that is zoomed in or flipped left-right is still a cat.

#### 3 - Constrained model
Design a less flexible model yourself, but do not constrain it too much. We will come back to this point with CNN: it is precisely because CNN imposes stronger constraints that it performs so well on images, even though its function set is smaller.


#### 4 - Regularization, dropout (a short sketch below)
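A hedged PyTorch sketch of point 4 (the sizes and coefficients are made up): dropout inside the model, L2 regularization via the optimizer's `weight_decay`:
```python
import torch.nn as nn
import torch.optim as optim

# a made-up constrained model: dropout randomly zeroes activations during training
model = nn.Sequential(
    nn.Linear(7, 16), nn.ReLU(),
    nn.Dropout(p=0.5),
    nn.Linear(16, 1),
)
# L2 regularization is usually applied through weight_decay in the optimizer
opt = optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)
# remember model.train() / model.eval(), so dropout is only active during training
```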
#### 5 - Over-constraining brings us back to model bias...


#### 6 - Bias-Complexity Trade-off

### 2) N-fold Cross Validation

### 3) Mismatch
- Your training and testing data have different distribution

# Optimization - Gradient is small...
#### Optimization fails because... critical points
* **local minima**
* **saddle point**

:::warning
* Critical points have zero gradients
* Critical points can be either saddle points or local minima
* Can be determined by the Hessian matrix
* It is possible to escape saddle points along the direction of eigenvectors of the Hessian matrix
* Local minima may be rare, i.e. local min. in 2D might be saddle point in higher dimensions such as 5D!
* Smaller batch size and momentum help escape critical points
:::
## Taylor Series Approximation
* Gradient $g$ : the first derivatives
* Hessian $H$ : the second derivatives
* At a critical point $g=0$, so only the Hessian term is left

### Hessian (symmetric if the second derivatives are continuous)
let $v=\theta-\theta'$
At critical point,
$L(\theta)\approx L(\theta')+{1\over2}(\theta - \theta')^TH(\theta - \theta')$
$=L(\theta')+{1\over2}v^THv$
* $v^THv > 0$
Around $\theta'$ : $L(\theta)>L(\theta')$ $\rightarrow$ **Local minima**
$\rightarrow$ $H$ is positive definite = all eigenvalues are positive
* $v^THv < 0$
Around $\theta'$ : $L(\theta)<L(\theta')$ $\rightarrow$ **Local maxima**
$\rightarrow$ $H$ is negative definite = all eigenvalues are negative
* Sometimes $v^THv > 0$ , sometimes $v^THv < 0$
$\rightarrow$ **Saddle point**
$\rightarrow$ Some eigenvalues are positive, some are negative
==eigenvalues and matrix entries (elements) are two different things==
==In short: compute the matrix, then just look at its eigenvalues!!!!!==
### example



## Don't be afraid of saddle point?
==$H$ may tell us the update direction of parameters!==
$u$ is an eigenvector of $H$
$\lambda$ is the eigenvalue corresponding to $u$
$\rightarrow$ $u^THu=u^T(\lambda u)=\lambda ||u||^2$
* if $\lambda < 0$ , $\lambda ||u||^2 <0$ , $u^THu<0$
$L(\theta)\approx L(\theta')+{1\over2}(\theta - \theta')^TH(\theta - \theta')$
$\rightarrow L(\theta)<L(\theta')$
$\theta-\theta'=u$ $\rightarrow$ $\theta=\theta'+u$ , **decrease L**
Update the parameters along the direction of $u$
==Simply put: find an eigenvector whose eigenvalue is negative, add it to the original $\theta$, and that gives the new parameters!!!!! (cool, by Tony)== (a toy check below)
:::info
Well, but we seldom use this in practice.
:::
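A minimal NumPy sketch on a made-up two-parameter loss $L(\theta)=\theta_0^2-\theta_1^2$, whose Hessian is known in closed form: classify the critical point by the eigenvalues, then escape along an eigenvector with a negative eigenvalue:
```python
import numpy as np

# toy loss with a saddle point at the origin: L(theta) = theta_0^2 - theta_1^2
L = lambda t: t[0]**2 - t[1]**2
theta_c = np.array([0.0, 0.0])          # critical point (the gradient is zero here)
H = np.array([[2.0, 0.0],
              [0.0, -2.0]])             # Hessian of this toy loss

eigvals, eigvecs = np.linalg.eigh(H)
print(eigvals)                          # [-2.  2.] -> mixed signs => saddle point

# escape along an eigenvector whose eigenvalue is negative: L decreases
u = eigvecs[:, eigvals < 0][:, 0]
theta_new = theta_c + 0.1 * u           # theta = theta' + (small step) * u
print(L(theta_c), L(theta_new))         # 0.0 -> -0.01
```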
### example

## Large v.s. Small Batch (we saw this at Step 3 of the Intro)
### Time

* A **larger** batch size **does not require longer time** to compute a gradient (unless the batch size is too large)
* A **smaller** batch **requires longer time** for one epoch (longer time to see all the data once, since there are many more updates)

### Performance
* **A smaller batch size gives better performance** (as in the figure below: $L^1$ gets stuck, but $L^2$ will not necessarily get stuck at the same place)
* What’s wrong with large batch size? **Optimization Fails** (not model bias or overfitting.)

#### Small batch is better on testing data
small : 256
large : 0.1*dataset
Large batches tend to end up at sharp minima (the worse kind of minima)
Small batches bounce around, so they can easily jump out of sharp minima
==In the end this is just one paper's explanation==


## Momentum
### Gradient Descent + Momentum
Movement: **movement of the last step** minus **gradient at present**
i.e. a compromise between the two

* Notice that $m^i$ can be written as a weighted sum of all the previous gradients: $g^0, g^1, ..., g^{i-1}$
* **Even at a local minimum, the last movement can still push us forward** (a toy sketch below)
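A toy sketch of the update rule (the 1-D loss, $\eta$, and $\lambda$ are all made up):
```python
# toy 1-D loss L(w) = (w - 3)^2 and its gradient
dL = lambda w: 2 * (w - 3.0)

w, m = 0.0, 0.0            # initial parameter and movement m^0 = 0
eta, lam = 0.1, 0.9        # learning rate and momentum weight

for step in range(100):
    g = dL(w)              # gradient at the present position
    m = lam * m - eta * g  # movement = weighted last movement minus gradient
    w = w + m              # m is a weighted sum of all the previous gradients
print(w)                   # approaches 3.0 (it can overshoot the minimum and come back)
```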

# Error surface is rugged...
==training stuck $\neq$ small gradient (the loss can plateau while the gradient is still sizeable)==
* Different parameters need different learning rates
$\rightarrow$ the larger the gradient, the smaller the learning rate should be
* Sometimes we never even reach a critical point, yet the loss has already flattened out
Vanilla update : $\theta^{t+1}_i \leftarrow \theta ^t_i-\eta g^t_i$ , $g^t_i={\partial L \over \partial \theta_i}|_{\theta=\theta^t}$ , where $t$ denotes the $t$-th update

Now customized per parameter : $\theta^{t+1}_i \leftarrow \theta ^t_i-{\eta \over \sigma^t_i} g^t_i$
## Root Mean Square
$\sigma^{t}_i=\sqrt{{1\over{t+1}}\sum\limits^t_{k=0}(g^k_i)^2}$

:::danger
The assumption just now was a bit like saying that, for a given parameter, the gradient magnitude stays roughly the same throughout training
:::
## RMSProp
$\sigma^{t}_i=\sqrt{\alpha (\sigma^{t-1}_i)^2+(1-\alpha)(g^t_i)^2}$,$0<\alpha < 1$
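A toy sketch of the $\eta/\sigma$ update with the RMSProp $\sigma$ (the 1-D loss, $\eta$, and $\alpha$ are made up):
```python
import numpy as np

dL = lambda w: 2 * (w - 3.0)   # gradient of the toy loss L(w) = (w - 3)^2

w, sigma = 0.0, 0.0
eta, alpha, eps = 0.1, 0.9, 1e-8

for t in range(200):
    g = dL(w)
    sigma = np.sqrt(alpha * sigma**2 + (1 - alpha) * g**2)   # RMSProp sigma
    w = w - eta / (sigma + eps) * g                          # per-parameter step eta / sigma
print(w)   # ends up near 3.0, but keeps hovering around it with steps of roughly eta
```
This hovering near the minimum is one motivation for the learning rate scheduling discussed below.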


:::danger
Adam = RMSProp + Momentum
:::
With & without an adaptive learning rate
w/o :
w/ :
The loss still blows up occasionally, but it comes back
:::danger
Solution : learning rate scheduling
:::
## Learning Rate Scheduling
$\eta^t$
$\theta^{t+1}_i \leftarrow \theta ^t_i-{\eta^t \over \sigma^t_i} g^t_i$
* **Learning Rate Decay** : As the training goes, we are closer to the destination, so we reduce the learning rate


* **Warm Up** : Increase and then decrease (one possible schedule is sketched below)
At the beginning, the estimate of $\sigma^t_i$ has large variance
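One possible shape of $\eta^t$ combining warm up and decay (a made-up schedule, just for illustration; real recipes vary):
```python
# made-up warm-up + decay schedule for eta^t
def eta_t(t, eta_max=1e-3, warmup_steps=100, total_steps=1000):
    if t < warmup_steps:                 # warm up: increase first
        return eta_max * (t + 1) / warmup_steps
    # then decay toward 0 as we get closer to the destination
    return eta_max * (total_steps - t) / (total_steps - warmup_steps)

print(eta_t(0), eta_t(99), eta_t(500), eta_t(999))
```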

:::warning
Summary:
* Momentum takes the direction of past gradients into account (while $\sigma$ considers only their magnitude)

:::
## Batch Normalization (no more rugged surface)
:::warning
**recap**:
* **dimension** usually refers to the number of features.
* **feature vector** usually refers to one row containing all the features that one sample has. (You can also swap the columns and rows.)


:::
* Here we assume that $x^1$ to $x^R$ are all the feature vectors.
* We want to make the mean of every dimension 0 and the standard deviation 1


* Inside a neural network, the normalized inputs produce layer after layer of outputs, and we also want to compute their means and standard deviations; over the whole dataset the amount of computation would be enormous!

* So we use **batch normalization**
* But note that the batch should be reasonably large; otherwise the batch statistics cannot represent the distribution of the whole data.

* Batch normalization at testing time: use the moving averages of the batch statistics collected during training (a short sketch below)
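A minimal NumPy sketch of normalizing one made-up batch so that every dimension has mean 0 and standard deviation 1:
```python
import numpy as np

# a batch of made-up feature vectors: 4 samples (rows) x 3 dimensions (columns)
x = np.array([[1.0, 10.0, 100.0],
              [2.0, 20.0, 200.0],
              [3.0, 30.0, 300.0],
              [4.0, 40.0, 400.0]])

mu = x.mean(axis=0)          # mean of each dimension over the batch
sigma = x.std(axis=0)        # std of each dimension over the batch
x_norm = (x - mu) / sigma    # every dimension now has mean 0 and std 1

print(x_norm.mean(axis=0), x_norm.std(axis=0))
# at testing time, frameworks typically reuse moving averages of mu / sigma from training
```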

# Classification
* class as one-hot vector
* We can extend the network from outputting a single value to outputting three values, and then we can work with one-hot vectors
* However, in **binary classification watch out for multicollinearity among the features**: drop the first column (e.g. `drop_first` in pandas' `get_dummies`); for example, if I have three categories and the first two one-hot columns are 0, the third must be 1!
* But also note that for multiclass classification we cannot drop the first class of the labels, haha.
* ==Only use drop-first for the one-hot encoding of features.== (a small example below)
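A small pandas sketch of the drop-first idea for encoding an input feature (the column and values are made up); labels for multiclass classification keep all classes:
```python
import pandas as pd

# made-up categorical feature with 3 levels
df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# for input features: drop the first column to avoid multicollinearity
features = pd.get_dummies(df["color"], drop_first=True)
print(features)   # only "green" and "red" columns remain; an all-zero row means "blue"
```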



:::danger
Adding softmax turns $y$ into $y'$, which is closer to $\hat{y}$
:::
## Soft-max
* $y_i'={\exp(y_i)\over \sum_j \exp(y_j)}$ , where $y$ are the logits (the inputs)
* $1>y_i'>0$
* $\sum_iy_i'=1$
:::danger
vs. sigmoid : [link](https://medium.com/@yingyuchentw003/%E6%B7%B1%E5%BA%A6%E5%AD%B8%E7%BF%92%E5%9F%BA%E6%9C%AC%E6%A6%82%E5%BF%B5-activation-functions-8d890d650e8a)
:::
## Loss of Classification
$L={1\over N}\sum_ne_n$

* MSE
$e=\sum_i(\hat{y_i}-y_i')^2$
* **Cross-entropy** `win`
$e=-\sum_i \hat{y}_i \ln y_i'$
**Minimizing cross-entropy** is equivalent to **maximizing likelihood**
Proof that cross-entropy is better:
with MSE, the error surface is very flat where the loss is large -> training gets stuck
[Supplementary video](http://speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2/Lecture/Deep%20More%20(v2).ecm.mp4/index.html)
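A minimal NumPy sketch comparing the two losses on one made-up example (the logits and target are invented):
```python
import numpy as np

def softmax(y):
    e = np.exp(y - y.max())        # subtract the max for numerical stability
    return e / e.sum()

y = np.array([3.0, 1.0, -2.0])     # logits from the network
y_prime = softmax(y)               # each entry in (0, 1), entries sum to 1
y_hat = np.array([1.0, 0.0, 0.0])  # one-hot target

cross_entropy = -np.sum(y_hat * np.log(y_prime))
mse = np.sum((y_hat - y_prime) ** 2)
print(y_prime, cross_entropy, mse)
```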

# CNN
:::info
The larger the dimension of the one-hot vector (the more entries), the more classes we can classify; you can think of each element as one feature.
Later we will see one problem connected to CNN $\rightarrow$ Do we really need a fully connected network?


:::
## Receptive Field
* channel : the depth, which encodes the image's colors; not limited to 1 or 3
* a black-and-white image has channel = 1
* tensor : a matrix with more than 2 dimensions

* One small neuron only needs to look after a 3\*3\*3 region : its receptive field

* Multiple neurons can share the same receptive field
* Receptive fields can overlap
* Receptive fields can differ in size and can cover only specific channels
* A receptive field does not even have to be a contiguous region (whatever works, roughly, as long as you know why you chose it)
### typical setting : kernel size
* kernel size : $3*3$ (height * width)
* stride = how far the field shifts; we usually want overlap, because if the fields did not overlap and a pattern happened to appear right at their boundary... that would be awkward
* padding = filling in values; generally we pad with 0 wherever the field goes beyond the image boundary

## Parameter Sharing
* Different neurons share the same parameters : identical weights
* It is like students from every department taking the same big machine learning class, haha

### typical setting
* Every receptive field uses the same predefined sets of parameters for its neurons
* **filter** : these shared parameter sets are called filters
* **In the figure below each circle represents one neuron**; filters of the same color are identical

## Benefit of Convolutional Layer
* More constraints on the neurons
* Although this means a larger model bias, the model is also far less likely to overfit on images

:::warning
CNN = receptive field + sharing parameters
:::
### Steps (Story 2)
* First assume filter 1 has size $3\times3\times$ channel size (take the channel size as 1 for now), and inside it are its parameters; what we do is take the inner product between the filter and the image values.
* The filter in Figure 1 below, for example, detects the pattern of 1s along the diagonal direction of the image.


Apply 64 filters as in the figure above -> get 64 groups of numbers -> the 64 groups form a feature map -> it can be viewed as a new image with 64 channels -> feed it into the next layer -> at this point a 3\*3 filter actually covers a 5\*5 region of the original image (this makes no sense without the figure, which is pretty funny, so re-read the handout or re-watch the video) -> so ==a deep enough network can detect patterns over a very large region==


### Pooling - Max Pooling
Keep the maximum value -> the image gets smaller


Usually one conv layer is followed by one pooling layer -> but pooling may throw away information -> AlphaGo uses no pooling
### The whole CNN

# Self-Attention
:::success
* An important module in the transformer
* If your input is a set of vectors...

* A graph is also a set of vectors
* e.g. social networks, molecules...


:::
## Type of output
* Each vector has a label

* The whole sequence has a label

* Model decides the number of labels
#### seq2seq

## Intro of self-attention
### Overview
* Self-attention is a type of network architecture
* However many vectors we feed in, self-attention produces the same number of vectors, which are then sent into a fully connected network.
* And the vectors it produces are **informed vectors** that have read **all the information in the sequence** and know the context.
* We can keep stacking self-attention and fully connected networks on top of each other

### mechanism

* And although we say "the whole sequence", we still do not want the window length to equal the whole sequence length.
* So, for each vector, we need to find the other input vectors that are relevant to it $\rightarrow$ find the relevant vectors in the sequence.

* That relevance is called the attention score.
* There are many ways to compute the attention score, but here we will use the dot product.
* After computing the dot products, we pass them through a softmax to normalize them. (You can also use other functions.)
* q stands for **query**, and k for **key**


* And finally we extract the information weighted by the attention scores (a minimal sketch follows after this list)

* You can see that the higher a vector's attention score, the closer the resulting $b^1$ will be to that vector!!!
* Whoever has the higher attention score, their $v$ has the dominant effect on the resulting $b$
* I will stop here. For more detail on self-attention, you can continue with **"self-attention part 2" + "transformer" and so on** on the Machine Learning 2021 channel (e.g. multi-head self-attention, positional information in self-attention, etc.).
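A minimal NumPy sketch of single-head dot-product self-attention, with made-up sizes and random stand-ins for the learned matrices:
```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 4, 8                         # made-up: 4 input vectors of dimension 8
A = rng.standard_normal((n, d))     # each row is one input vector a^i

Wq = rng.standard_normal((d, d))    # "learned" matrices (random placeholders here)
Wk = rng.standard_normal((d, d))
Wv = rng.standard_normal((d, d))

Q, K, V = A @ Wq, A @ Wk, A @ Wv    # queries, keys, values

scores = Q @ K.T                              # dot-product attention scores (n x n)
scores -= scores.max(axis=1, keepdims=True)   # for numerical stability
alpha = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)   # softmax per row
B = alpha @ V                       # b^i = sum_j alpha_ij * v^j
print(B.shape)                      # (4, 8): the same number of (informed) vectors out
```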
### Self-attention for Graph

* In the picture above, note that in the graph case there is **no need to compute attention scores between every pair of nodes (in most cases)**.
* If we consider the edges, i.e. the lines connecting two nodes, then only those connected nodes **need attention!**
* This is also one type of the Graph Neural Network (GNN)
# GNN
:::danger
* Please refer to [video1](https://youtu.be/eybCCtNKwzA?si=bOECe4pSrhKnUoCh) and [video2](https://youtu.be/M9ht8vsVEw8?si=QQ0KaKoxHfVUUXbu)
* Graph = node + edge
* Basically, as long as we can turn the input into a graph, and the output is also graph-structured, then it is a graph neural network.
* To put it simply: how do we consider an entity's own features and, **at the same time**, its relationships with other entities? $\rightarrow$ GNN.
:::
## How?
* As in the figure below: in general, even when there is a lot of training data, the unlabeled data is usually far more abundant than the labeled data.
* So how do we deal with this problem? How to let nodes learn from its neighbors?

## Roadmap
* We like GCN and GAT
* They can also be used in NLP

### Spatial-based
* It is like generalizing CNN to graphs.
* We use **aggregation**: the features of the neighbors are used to update the next hidden state.
* And then **readout**. (For example, when we are predicting some property of the molecule such as hydrophilicity.) A generic sketch follows below.
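A generic, hedged sketch of sum aggregation plus readout on a made-up 4-node graph (not exactly any of the specific models below):
```python
import numpy as np

# made-up graph: adjacency matrix (1 = edge) and 2-dimensional node features
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
H = np.array([[1.0, 0.0],
              [0.0, 2.0],
              [3.0, 1.0],
              [1.0, 1.0]])          # h_v for each node v
W = np.array([[0.5, -0.1],
              [0.2,  0.3]])         # made-up transform

# aggregation: sum the neighbors' features, add the node's own, transform, ReLU
H_next = np.maximum(0.0, (A @ H + H) @ W)

# readout: summarize the whole graph, e.g. by averaging node features
graph_repr = H_next.mean(axis=0)
print(H_next, graph_repr)
```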

#### NN4G
==I think it is the simplest one, since both its aggregation and its readout are computed with plain sums.==


#### MoNET
* Uses a weighted sum instead of simply summing all the neighbor features (compared to NN4G)
#### GAT
* Not only a weighted sum: we also let the model decide the weights itself (compared to MoNET)
* You **do attention** over your neighbors
:::info
And still, there are lots of models using different types of aggregation; for example, some models use an LSTM.
:::
### Spectral-based
* First take the Fourier transform of both the signal and the filter (similar to the filter in CNN), then multiply them.
* Finally, take the inverse Fourier transform to get the result.

## Summary
* Please note that I skipped a lot and only recorded what I found relevant. Detailed content is in the slides.
* GAT and GCN are the most popular GNNs
* Although GCN is mathematically driven, we tend to ignore its math
* GNN (or GCN) suffers from information loss as it gets deeper
* Many deep learning models can be slightly modified and designed to fit graph data, such as Deep Graph InfoMax, Graph Transformer, GraphBert.
* GNN can be applied to a variety of tasks