:::info
* **Special thanks to Vickie; part of this note is modified from hers.**
* For Colab and PyTorch tutorial, see ML2022 slides.
* **Still, this note does not cover the whole course; it focuses only on classical ANN, CNN, and GNN**.
* RNN has largely been replaced by self-attention and is covered in 紀老師's handouts.
Based on:
1. Prof. Lee's ML 2021 ([YouTube](https://www.youtube.com/playlist?list=PLJV_el3uVTsPM2mM-OQzJXziCGJa8nJL8), [Homepage](https://speech.ee.ntu.edu.tw/~hylee/ml/2021-spring.php), [Github](https://github.com/ga642381/ML2021-Spring))
2. Prof. Lee's ML 2022 ([YouTube](https://www.youtube.com/watch?v=7XZR0-4uS5s&list=PLJV_el3uVTsPM2mM-OQzJXziCGJa8nJL8), [Homepage](https://speech.ee.ntu.edu.tw/~hylee/ml/2022-spring.php), [Github](https://github.com/virginiakm1988/ML2022-Spring)) & Homeworks
3. extra materials (video links @ each section)
4. 紀老師's deep learning handouts (計資中心)
:::
###### tags: `AI`
[TOC]
# Intro. of ML / DL
:::success
What is machine learning? It is simply the process of a machine finding a function!
For example, if we want to do speech processing, the machine can use a function that takes an audio signal as input and outputs the content of that signal!
:::
## Different types of Functions
* **Regression** : The function outputs a **scalar**.
* **Classification** : Given **options (classes)**, the function outputs the correct one. (e.g. playing Go, classifying spam mails, etc.)
* **Structured Learning** : create **something with structure** (e.g. an image or a document).
## Machine Learning - Find the function

### 1. Function with Unknown Parameters -> domain knowledge

### 2. Define Loss from Training Data
* Loss is a function of the parameters, $L(b,w)$, measuring how good a set of parameter values is (smaller is better); a tiny numeric check follows after this list
* Loss : $L= {1\over N} \sum\limits_ne_n$
* $e=|y-\hat y|$ ----- $L$ is mean absolute error **(MAE)**
* $e=(y-\hat y)^2$ ------ $L$ is mean square error **(MSE)**
* If $y$ and $\hat y$ are both probability distributions ------ **Cross-entropy (get to this later)**
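A tiny NumPy check of the two losses above (the example values are made up):
```python
import numpy as np

y_hat = np.array([1.0, 2.0, 3.0])   # labels (hat y)
y     = np.array([1.5, 1.0, 2.0])   # model outputs

mae = np.mean(np.abs(y - y_hat))    # L with e = |y - hat y|   -> 0.833...
mse = np.mean((y - y_hat) ** 2)     # L with e = (y - hat y)^2 -> 0.75
print(mae, mse)
```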

### 3. Optimization
:::info
$w^*, b^*=\arg \min\limits_{w,b} L$
:::
* **Gradient Descent**
* (Randomly) Pick an initial value $w^0$
* Compute ${\partial L \over \partial w}|_{w=w^0}$ to decide which way to step; the larger/smaller the |slope|, the larger/smaller the step
* Negative $\rightarrow$ Increase $w$
* Positive $\rightarrow$ Decrease $w$
* Step size: <text style="color:red">$\eta$</text>${\partial L \over \partial w}|_{w=w^0}$
* <text style="color:red">$\eta$</text> : learning rate
==Things like this that you have to set yourself are called hyperparameters==

* Update $w$ iteratively: what if we accidentally land on a local extremum? It turns out this is not really a problem. More on that later! (If you have experience training neural networks, you will know there is nothing to worry about.)

* Generalize to two parameters (see the sketch below)
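A minimal sketch for the two-parameter case, assuming the model $y=b+wx$ and MSE loss (the toy data and $\eta$ are made up):
```python
import numpy as np

# toy data: y is roughly 2x + 1
x = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = np.array([3.1, 4.9, 7.2, 8.8])

w, b = 0.0, 0.0          # (randomly) picked initial values
eta = 0.05               # learning rate (a hyperparameter)

for step in range(2000):
    y = b + w * x                          # model prediction
    grad_w = np.mean(2 * (y - y_hat) * x)  # dL/dw for the MSE loss
    grad_b = np.mean(2 * (y - y_hat))      # dL/db for the MSE loss
    w -= eta * grad_w                      # step against the gradient
    b -= eta * grad_b

print(w, b)   # ends up roughly around 2 and 1, matching the toy data
```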


### 4. Increasing the number of features to consider...?

:::danger
But the linear model has a severe issue: **model bias**. Many target curves are not just a straight line, so we need to join several piecewise-linear segments into a continuous curve.

Let's see how to handle this below!


:::
## Issue with linear model
### Step 1. Function with Unknown Parameters - Sigmoid
#### Use a smooth sigmoid function to approximate the hard sigmoid (the one with corners)

:::success
This is one reason people say feature scaling is needed $\rightarrow$ we want the feature distribution to fall inside the region where the sigmoid's gradient actually varies!
:::
##### Adjusting $w$, $b$, $c$ changes the slope, the shift, and the height, respectively

##### The red curve = a combination of many sigmoid functions; more pieces, smoother curve


##### Break it down further, and write it as matrices
* $i$ indexes each sigmoid function; think of it as a piece at a different position / with a different shape
* $j$ indexes each feature, e.g. the view counts of the previous seven days
* $r$ is what will be fed into the sigmoid function

#### Write it in matrix form

#### Feed $r$ into the sigmoid to get $a$

#### Obtain $y$


==**Note**: the green $b$ is a vector, the gray $b$ is a scalar== (a small numeric sketch below)
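A minimal NumPy sketch of the model above, $y = b + \sum_i c_i \,\mathrm{sigmoid}(b_i + \sum_j w_{ij} x_j)$, with made-up shapes (3 sigmoids, 7 features) and random parameters:
```python
import numpy as np

def sigmoid(r):
    return 1.0 / (1.0 + np.exp(-r))

rng = np.random.default_rng(0)
x = rng.random(7)        # features, e.g. the view counts of the previous 7 days
W = rng.random((3, 7))   # w_ij
b_vec = rng.random(3)    # the green b (a vector)
c = rng.random(3)        # c_i
b = 0.5                  # the gray b (a scalar)

r = b_vec + W @ x        # r_i = b_i + sum_j w_ij x_j
a = sigmoid(r)           # a = sigmoid(r)
y = b + c @ a            # y = b + sum_i c_i a_i
print(y)
```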
### Step 2. Define Loss from Training Data
#### Loss is a function of the parameters: $L(\theta)$

### Step 3. Optimization of New Model
$\theta ^*=\arg \min\limits_{\theta}L$
* (Randomly) Pick an initial value $\theta^0$
* Compute $g$; in practice the gradient almost never becomes exactly 0, usually we just decide when to stop


* In practice : split the training data into many batches when computing $L$
* The loss computed from one batch $B$ is $L^1$
* Of course, if $B$ is large, then $L^1$ may well be close to $L$
* **Batch** -> seeing all batches once = 1 **epoch** -> **shuffle** between epochs -> prevents learning an ordering bias

* epoch & update are different terms
* One epoch alone does not tell you how many updates it contains
* Below is an example (see the sketch after this list)
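A hedged PyTorch sketch of batch training with made-up data and sizes, which also shows why one epoch = several updates:
```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# made-up data: 20 samples, 7 features each
x = torch.randn(20, 7)
y_hat = torch.randn(20, 1)

model = torch.nn.Linear(7, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loader = DataLoader(TensorDataset(x, y_hat), batch_size=5, shuffle=True)  # shuffled every epoch

for epoch in range(3):             # one epoch = seeing every batch once
    for xb, yb in loader:          # each batch -> one update
        loss = torch.nn.functional.mse_loss(model(xb), yb)   # L^1, L^2, ...
        opt.zero_grad()
        loss.backward()
        opt.step()
# here one epoch = 20 / 5 = 4 updates
```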

## ReLU
* Rectified Linear Unit (ReLU)
* Two ReLUs can be combined into one hard sigmoid (see the small check below)

* Sigmoid and ReLU are both called **activation functions** $\rightarrow$ ReLU works better
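A one-line NumPy check (my own toy example) that two ReLUs add up to one hard sigmoid:
```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

x = np.linspace(-2, 3, 11)
hard_sigmoid = relu(x) - relu(x - 1)   # rises linearly from 0 to 1 on [0, 1]
print(np.allclose(hard_sigmoid, np.clip(x, 0, 1)))   # True
```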

### Another hyperparameter?

:::warning
In short, we can go very deep, stacking many layers of sigmoid / ReLU.
As for why we use many layers (a deep network) instead of just wiring up many sigmoids in a single layer (a fat network), that will be explained later!
:::
### This is Deep learning?
* Deep = Many hidden layer
* A function like the sigmoid = a neuron


* This situation, where adding more layers actually makes the error worse, is called overfitting
:::warning
**Overfitting** : Better on training data, worse on unseen data (Testing data).
:::
## Tradeoff of model complexity (Poke vs. Digi)
### Inequality
#### Assume Pokémon drawings have fewer edge lines.... use h as the threshold...

#### L as the loss function; here it is essentially just the error rate

#### Keep in mind that we are never going to have $D_{all}$

#### but we hope they are close

#### Mathematical proof

#### Let $\epsilon = \delta /2$; we want to know the probability...



#### Hoeffding's inequality
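A hedged reconstruction of the bound used in this argument (the standard Hoeffding form, assuming the loss values lie in $[0,1]$ and the $N$ training examples are sampled i.i.d. from $D_{all}$):
* For one particular $h$ : $P\big(|L(h,D_{train})-L(h,D_{all})|>\epsilon\big)\le 2\exp(-2N\epsilon^2)$
* Union bound over the whole hypothesis set ("$D_{train}$ is bad") : $P(D_{train}\ \text{is bad})\le|H|\cdot 2\exp(-2N\epsilon^2)$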


#### example

### Model complexity

#### The problem now is that we usually cannot choose our N; but if we make H very small, the achievable loss will naturally be large (even though the two losses will be close to each other)
#### You can see the picture below

:::info
How do we get the best of both worlds? $\rightarrow$ Deep learning ([video](https://youtu.be/yXd2D5J0QDU?si=okZxHeozCGma1a-w))
* A single hidden layer can produce any function, but:
tall-and-thin > short-and-fat: the deep network can produce more complex functions, and the short-and-fat network needs more parameters, so it overfits more easily; this also means **the tall-and-thin network needs less training data**
* Usually, if the expected function is complex but has regular structure -> deep is suitable
(e.g. image, speech)

:::
# Deep Learning
:::success
Added notes from extra supplementary videos ([video 1](https://youtu.be/Dr-WRlEFefw?si=7H51g3fESWsdlZ5e), [video 2](https://youtu.be/ibJpTrp5mcE))
For extra topics such as initializers, loss functions, and optimizers, see 紀老師's Lec. 6
But ==some of that content is wrong== haha... like the Adam optimizer, etc.

:::
## Deep Learning vs. Machine Learning
:::danger
So...
machine learning $\rightarrow$ deep learning did not make things easier!
Before deep learning, what we did was feature engineering,
i.e. for image recognition we had to carefully extract hand-chosen features and feed them to the function for training.
**Issue: which features to extract**
With deep learning we can brute-force it, but what we have to decide becomes the function itself,
i.e. for image recognition we feed in the raw pixels directly, but you have to design the network structure.
**Issue: how the network should extract features**
Which one is easier? $\rightarrow$ it depends on your task.
:::
### Step 1. Define a function (Neural Network)
* Each neuron has its own weights and bias
* **Given network structure $\rightarrow$ define a function set**
#### Fully connected feedforward network


* These can be expressed in matrix form, **easy for GPUs to compute**.
#### example


==The function we need... is a neural network!==

* How many hidden layers and how many neurons to use is entirely up to us (a minimal sketch follows below)
* So we need **"intuition"** and **"trial and error"**
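A minimal PyTorch sketch (all layer sizes are made up) of a fully connected feedforward network; fixing the structure defines the function set, and the weights and biases inside are the unknown parameters:
```python
import torch.nn as nn

# made-up structure: 7 input features, two hidden layers of 16 neurons, 1 output
model = nn.Sequential(
    nn.Linear(7, 16), nn.Sigmoid(),   # hidden layer 1
    nn.Linear(16, 16), nn.Sigmoid(),  # hidden layer 2
    nn.Linear(16, 1),                 # output layer
)
# each nn.Linear holds a weight matrix and a bias vector,
# so the whole thing is exactly the matrix form mentioned above
```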
### Step 2. & Step 3. Pick good function
==**for cross entropy, here's a**== **[reference](https://flag-editors.medium.com/%E6%A9%9F%E5%99%A8%E5%AD%B8%E7%BF%92%E5%8B%95%E6%89%8B%E5%81%9Alesson-10-%E5%88%B0%E5%BA%95cross-entropy-loss-logistic-loss-log-loss%E6%98%AF%E4%B8%8D%E6%98%AF%E5%90%8C%E6%A8%A3%E7%9A%84%E6%9D%B1%E8%A5%BF-%E4%B8%8A%E7%AF%87-2ebb74d281d)**


## BackPropagation
:::success
To compute the gradients efficiently...

The figure above shows that as long as we can compute $\partial l^n/\partial w$ for one particular data point, summing over all data points gives the partial derivative of the total loss with respect to $w$. (So the calculations below all focus on one data point $\rightarrow$ one neuron.)

:::
### Forward pass


### Backward pass


* Assume the "?" terms are already known
* In the figure above, if the last layer has 1000 neurons, you simply sum 1000 terms!!!

* Now imagine there is another (imaginary) neuron

### Split orange neuron into two cases...
#### Case 1

#### Case 2

* If it is not the output layer, we can still assume its next layer is known, and keep going until we reach the output layer

* ==Conclusion: compute it backwards== (a tiny numerical sketch below)
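A tiny NumPy sketch of one forward pass + backward pass, for a single made-up data point and a 2-neuron hidden layer (all numbers are invented):
```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# made-up tiny network: 2 inputs -> 2 hidden neurons (sigmoid) -> 1 output, loss = (y - y_hat)^2
x = np.array([1.0, -1.0])
y_hat = 1.0
W1, b1 = np.array([[1.0, -2.0], [-1.0, 1.0]]), np.array([0.1, 0.2])
W2, b2 = np.array([2.0, -1.0]), 0.3

# forward pass: compute and cache z and a for every layer
z1 = W1 @ x + b1
a1 = sigmoid(z1)
y = W2 @ a1 + b2
loss = (y - y_hat) ** 2

# backward pass: start from the output and work backwards
dl_dy = 2 * (y - y_hat)
dl_da1 = W2 * dl_dy              # pass the derivative back through the next layer's weights
dl_dz1 = dl_da1 * a1 * (1 - a1)  # sigma'(z) = sigma(z) * (1 - sigma(z))
dl_dW2 = a1 * dl_dy              # dz/dw is the activation feeding that weight
dl_dW1 = np.outer(dl_dz1, x)     # same idea, one layer earlier
print(dl_dW1, dl_dW2)
```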

# General Guidance on workflow
:::info
Framework of ML

:::
## General Guide 1 - Training Loss large
:::warning
The cause could be **1) model bias** or **2) an optimization issue**

:::
### 1) Model Bias
#### The function that could lower the loss simply cannot be represented by the current model (the orange dot)

### 2) Optimization Issue
#### e.g. gradient descent gets stuck at a local minimum (and cannot reach the orange dot)

### 3) How to know which one?
1. **Compare with a deeper model**: the figure below shows an optimization issue, because a 56-layer network can always do whatever a 20-layer one does, so it is not model bias; moreover, the 56-layer network is worse than the 20-layer one even on the training data, so it is not overfitting either.
2. Start from **shallower / simpler** models (which are easier to optimize)
3. Then go deeper; if the deeper model does not achieve a lower training loss -> optimization issue

## General Guide 2 - Testing Loss large
:::warning
The cause could be **1) overfitting** or **2) mismatch**

:::
### 1) overfitting
#### 1 - Use more training data, e.g. collect or look up more yourself
#### 2 - Data augmentation: create new data yourself; e.g. a cat photo that is zoomed in or flipped left-right is still a cat.

#### 3 - Constrained model
Design a less flexible model yourself, but do not constrain it too much. We will come back to this point with CNN: it is precisely because CNN imposes stronger constraints that it performs so well on images, even though its function set is smaller.


#### 4 - Regularization, dropout (a short sketch below)
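A hedged PyTorch sketch of point 4 (the sizes and coefficients are made up): dropout inside the model, L2 regularization via the optimizer's `weight_decay`:
```python
import torch.nn as nn
import torch.optim as optim

# a made-up constrained model: dropout randomly zeroes activations during training
model = nn.Sequential(
    nn.Linear(7, 16), nn.ReLU(),
    nn.Dropout(p=0.5),
    nn.Linear(16, 1),
)
# L2 regularization is usually applied through weight_decay in the optimizer
opt = optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)
# remember model.train() / model.eval(), so dropout is only active during training
```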
#### 5 - Over-constraining brings us back to model bias...


#### 6 - Bias-Complexity Trade-off

### 2) N-fold Cross Validation

### 3) Mismatch
- Your training and testing data have different distribution

# Optimization - Gradient is small...
#### Optimization fails because... critical points
* **local minima**
* **saddle point**

:::warning
* Critical points have zero gradients
* Critical points can be either saddle points or local minima
* Can be determined by the Hessian matrix
* It is possible to escape saddle points along the direction of eigenvectors of the Hessian matrix
* Local minima may be rare, i.e. local min. in 2D might be saddle point in higher dimensions such as 5D!
* Smaller batch size and momentum help escape critical points
:::
## Taylor Series Approximation
* Gradient $g$ : the first derivatives
* Hessian $H$ : the second derivatives
* At a critical point $g=0$, so only the Hessian term is left

### Hessian (symmetric if the second derivatives are continuous)
let $v=\theta-\theta'$
At critical point,
$L(\theta)\approx L(\theta')+{1\over2}(\theta - \theta')^TH(\theta - \theta')$
$=L(\theta')+{1\over2}v^THv$
* $v^THv > 0$
Around $\theta'$ : $L(\theta)>L(\theta')$ $\rightarrow$ **Local minima**
$\rightarrow$ $H$ is positive definite = all eigenvalues are positive
* $v^THv < 0$
Around $\theta'$ : $L(\theta)<L(\theta')$ $\rightarrow$ **Local maxima**
$\rightarrow$ $H$ is negative definite = all eigenvalues are negative
* Sometimes $v^THv > 0$ , sometimes $v^THv < 0$
$\rightarrow$ **Saddle point**
$\rightarrow$ Some eigenvalues are positive, some are negative
==eigenvalues and matrix entries (elements) are two different things==
==In short: compute the matrix, then just look at its eigenvalues!!!!!==
### example



## Don't be afraid of saddle point?
==$H$ may tell us the update direction of parameters!==
$u$ is an eigenvector of $H$
$\lambda$ is the eigenvalue corresponding to $u$
$\rightarrow$ $u^THu=u^T(\lambda u)=\lambda ||u||^2$
* if $\lambda < 0$ , $\lambda ||u||^2 <0$ , $u^THu<0$
$L(\theta)\approx L(\theta')+{1\over2}(\theta - \theta')^TH(\theta - \theta')$
$\rightarrow L(\theta)<L(\theta')$
$\theta-\theta'=u$ $\rightarrow$ $\theta=\theta'+u$ , **decrease L**
Update the parameters along the direction of $u$
==Simply put: find an eigenvector whose eigenvalue is negative, add it to the original $\theta$, and that gives the new parameters!!!!! (cool, by Tony)== (a toy check below)
:::info
Well, but we seldom use this in practice.
:::
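A minimal NumPy sketch on a made-up two-parameter loss $L(\theta)=\theta_0^2-\theta_1^2$, whose Hessian is known in closed form: classify the critical point by the eigenvalues, then escape along an eigenvector with a negative eigenvalue:
```python
import numpy as np

# toy loss with a saddle point at the origin: L(theta) = theta_0^2 - theta_1^2
L = lambda t: t[0]**2 - t[1]**2
theta_c = np.array([0.0, 0.0])          # critical point (the gradient is zero here)
H = np.array([[2.0, 0.0],
              [0.0, -2.0]])             # Hessian of this toy loss

eigvals, eigvecs = np.linalg.eigh(H)
print(eigvals)                          # [-2.  2.] -> mixed signs => saddle point

# escape along an eigenvector whose eigenvalue is negative: L decreases
u = eigvecs[:, eigvals < 0][:, 0]
theta_new = theta_c + 0.1 * u           # theta = theta' + (small step) * u
print(L(theta_c), L(theta_new))         # 0.0 -> -0.01
```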
### example

## Large v.s. Small Batch (we saw this at Step 3 of the Intro)
### Time

* A **larger** batch size **does not require longer time** to compute a gradient (unless the batch size is too large)
* A **smaller** batch **requires longer time** for one epoch (longer time to see all the data once, since there are many more updates)

### Performance
* **A smaller batch size gives better performance** (as in the figure below: $L^1$ gets stuck, but $L^2$ will not necessarily get stuck at the same place)
* What’s wrong with large batch size? **Optimization Fails** (not model bias or overfitting.)

#### Small batch is better on testing data
small : 256
large : 0.1*dataset
Large batches tend to end up at sharp minima (the worse kind of minima)
Small batches bounce around, so they can easily jump out of sharp minima
==In the end this is just one paper's explanation==


## Momentum
### Gradient Descent + Momentum
Movement: **movement of the last step** minus **gradient at present**
i.e. a compromise between the two

* Notice that $m^i$ can be written as a weighted sum of all the previous gradients: $g^0, g^1, ..., g^{i-1}$
* **Even at a local minimum, the last movement can still push us forward** (a toy sketch below)
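A toy sketch of the update rule (the 1-D loss, $\eta$, and $\lambda$ are all made up):
```python
# toy 1-D loss L(w) = (w - 3)^2 and its gradient
dL = lambda w: 2 * (w - 3.0)

w, m = 0.0, 0.0            # initial parameter and movement m^0 = 0
eta, lam = 0.1, 0.9        # learning rate and momentum weight

for step in range(100):
    g = dL(w)              # gradient at the present position
    m = lam * m - eta * g  # movement = weighted last movement minus gradient
    w = w + m              # m is a weighted sum of all the previous gradients
print(w)                   # approaches 3.0 (it can overshoot the minimum and come back)
```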

# Error surface is rugged...
==training stuck $\neq$ small gradient (the loss can plateau while the gradient is still sizeable)==
* Different parameters need different learning rates
$\rightarrow$ the larger the gradient, the smaller the learning rate should be
* Sometimes we never even reach a critical point, yet the loss has already flattened out
Vanilla update : $\theta^{t+1}_i \leftarrow \theta ^t_i-\eta g^t_i$ , $g^t_i={\partial L \over \partial \theta_i}|_{\theta=\theta^t}$ , where $t$ denotes the $t$-th update

Now customized per parameter : $\theta^{t+1}_i \leftarrow \theta ^t_i-{\eta \over \sigma^t_i} g^t_i$
## Root Mean Square
$\sigma^{t}_i=\sqrt{{1\over{t+1}}\sum\limits^t_{k=0}(g^k_i)^2}$

:::danger
The assumption just now was a bit like saying that, for a given parameter, the gradient magnitude stays roughly the same throughout training
:::
## RMSProp
$\sigma^{t}_i=\sqrt{\alpha (\sigma^{t-1}_i)^2+(1-\alpha)(g^t_i)^2}$,$0<\alpha < 1$
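A toy sketch of the $\eta/\sigma$ update with the RMSProp $\sigma$ (the 1-D loss, $\eta$, and $\alpha$ are made up):
```python
import numpy as np

dL = lambda w: 2 * (w - 3.0)   # gradient of the toy loss L(w) = (w - 3)^2

w, sigma = 0.0, 0.0
eta, alpha, eps = 0.1, 0.9, 1e-8

for t in range(200):
    g = dL(w)
    sigma = np.sqrt(alpha * sigma**2 + (1 - alpha) * g**2)   # RMSProp sigma
    w = w - eta / (sigma + eps) * g                          # per-parameter step eta / sigma
print(w)   # ends up near 3.0, but keeps hovering around it with steps of roughly eta
```
This hovering near the minimum is one motivation for the learning rate scheduling discussed below.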


:::danger
Adam = RMSProp + Momentum
:::
With & without an adaptive learning rate
w/o :
w/ :
The loss still blows up occasionally, but it comes back
:::danger
Solution : learning rate scheduling
:::
## Learning Rate Scheduling
$\eta^t$
$\theta^{t+1}_i \leftarrow \theta ^t_i-{\eta^t \over \sigma^t_i} g^t_i$
* **Learning Rate Decay** : As the training goes, we are closer to the destination, so we reduce the learning rate


* **Warm Up** : Increase and then decrease (one possible schedule is sketched below)
At the beginning, the estimate of $\sigma^t_i$ has large variance
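One possible shape of $\eta^t$ combining warm up and decay (a made-up schedule, just for illustration; real recipes vary):
```python
# made-up warm-up + decay schedule for eta^t
def eta_t(t, eta_max=1e-3, warmup_steps=100, total_steps=1000):
    if t < warmup_steps:                 # warm up: increase first
        return eta_max * (t + 1) / warmup_steps
    # then decay toward 0 as we get closer to the destination
    return eta_max * (total_steps - t) / (total_steps - warmup_steps)

print(eta_t(0), eta_t(99), eta_t(500), eta_t(999))
```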

:::warning
Summary:
* Momentum takes the direction of past gradients into account (while $\sigma$ considers only their magnitude)

:::
## Batch Normalization (no more rugged surface)
:::warning
**recap**:
* **dimension** usually refers to the number of features.
* **feature vector** usually refers to one row containing all the features that one sample has. (You can also swap the columns and rows.)


:::
* Here we assume that $x^1$ to $x^R$ are all the feature vectors.
* We want to make the mean of every dimension 0 and the standard deviation 1


* Inside a neural network, the normalized inputs produce layer after layer of outputs, and we also want to compute their means and standard deviations; over the whole dataset the amount of computation would be enormous!

* So we use **batch normalization**
* But note that the batch should be reasonably large; otherwise the batch statistics cannot represent the distribution of the whole data.

* Batch normalization at testing time: use the moving averages of the batch statistics collected during training (a short sketch below)
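A minimal NumPy sketch of normalizing one made-up batch so that every dimension has mean 0 and standard deviation 1:
```python
import numpy as np

# a batch of made-up feature vectors: 4 samples (rows) x 3 dimensions (columns)
x = np.array([[1.0, 10.0, 100.0],
              [2.0, 20.0, 200.0],
              [3.0, 30.0, 300.0],
              [4.0, 40.0, 400.0]])

mu = x.mean(axis=0)          # mean of each dimension over the batch
sigma = x.std(axis=0)        # std of each dimension over the batch
x_norm = (x - mu) / sigma    # every dimension now has mean 0 and std 1

print(x_norm.mean(axis=0), x_norm.std(axis=0))
# at testing time, frameworks typically reuse moving averages of mu / sigma from training
```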

# Classification
* class as one-hot vector
* We can extend the network from outputting a single value to outputting three values, and then we can work with one-hot vectors
* However, in **binary classification watch out for multicollinearity among the features**: drop the first column (e.g. `drop_first` in pandas' `get_dummies`); for example, if I have three categories and the first two one-hot columns are 0, the third must be 1!
* But also note that for multiclass classification we cannot drop the first class of the labels, haha.
* ==Only use drop-first for the one-hot encoding of features.== (a small example below)
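A small pandas sketch of the drop-first idea for encoding an input feature (the column and values are made up); labels for multiclass classification keep all classes:
```python
import pandas as pd

# made-up categorical feature with 3 levels
df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# for input features: drop the first column to avoid multicollinearity
features = pd.get_dummies(df["color"], drop_first=True)
print(features)   # only "green" and "red" columns remain; an all-zero row means "blue"
```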



:::danger
Adding softmax turns $y$ into $y'$, which is closer to $\hat{y}$
:::
## Soft-max
* $y_i'={\exp(y_i)\over \sum_j \exp(y_j)}$ , where $y$ are the logits (the inputs)
* $1>y_i'>0$
* $\sum_iy_i'=1$
:::danger
vs. sigmoid : [link](https://medium.com/@yingyuchentw003/%E6%B7%B1%E5%BA%A6%E5%AD%B8%E7%BF%92%E5%9F%BA%E6%9C%AC%E6%A6%82%E5%BF%B5-activation-functions-8d890d650e8a)
:::
## Loss of Classification
$L={1\over N}\sum_ne_n$

* MSE
$e=\sum_i(\hat{y_i}-y_i')^2$
* **Cross-entropy** `win`
$e=-\sum_i \hat{y}_i \ln y_i'$
**Minimizing cross-entropy** is equivalent to **maximizing likelihood**
Proof that cross-entropy is better:
with MSE, the error surface is very flat where the loss is large -> training gets stuck
[Supplementary video](http://speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2/Lecture/Deep%20More%20(v2).ecm.mp4/index.html)
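A minimal NumPy sketch comparing the two losses on one made-up example (the logits and target are invented):
```python
import numpy as np

def softmax(y):
    e = np.exp(y - y.max())        # subtract the max for numerical stability
    return e / e.sum()

y = np.array([3.0, 1.0, -2.0])     # logits from the network
y_prime = softmax(y)               # each entry in (0, 1), entries sum to 1
y_hat = np.array([1.0, 0.0, 0.0])  # one-hot target

cross_entropy = -np.sum(y_hat * np.log(y_prime))
mse = np.sum((y_hat - y_prime) ** 2)
print(y_prime, cross_entropy, mse)
```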

# CNN
:::info
The larger the dimension of the one-hot vector (the more entries), the more classes we can classify; you can think of each element as one feature.
Later we will see one problem connected to CNN $\rightarrow$ Do we really need a fully connected network?


:::
## Receptive Field
* channel : the depth, which encodes the image's colors; not limited to 1 or 3
* a black-and-white image has channel = 1
* tensor : a matrix with more than 2 dimensions

* One small neuron only needs to look after a 3\*3\*3 region : its receptive field

* Multiple neurons can share the same receptive field
* Receptive fields can overlap
* Receptive fields can differ in size and can cover only specific channels
* A receptive field does not even have to be a contiguous region (whatever works, roughly, as long as you know why you chose it)
### typical setting : kernel size
* kernel size : $3*3$ (height * width)
* stride = how far the field shifts; we usually want overlap, because if the fields did not overlap and a pattern happened to appear right at their boundary... that would be awkward
* padding = filling in values; generally we pad with 0 wherever the field goes beyond the image boundary

## Parameter Sharing
* Different neurons share the same parameters : identical weights
* It is like students from every department taking the same big machine learning class, haha

### typical setting
* Every receptive field uses the same predefined sets of parameters for its neurons
* **filter** : these shared parameter sets are called filters
* **In the figure below each circle represents one neuron**; filters of the same color are identical

## Benefit of Convolutional Layer
* More constraints on the neurons
* Although this means a larger model bias, the model is also far less likely to overfit on images

:::warning
CNN = receptive field + sharing parameters
:::
### Steps (Story 2)
* First assume filter 1 has size $3\times3\times$ channel size (take the channel size as 1 for now), and inside it are its parameters; what we do is take the inner product between the filter and the image values.
* The filter in Figure 1 below, for example, detects the pattern of 1s along the diagonal direction of the image.


Apply 64 filters as in the figure above -> get 64 groups of numbers -> the 64 groups form a feature map -> it can be viewed as a new image with 64 channels -> feed it into the next layer -> at this point a 3\*3 filter actually covers a 5\*5 region of the original image (this makes no sense without the figure, which is pretty funny, so re-read the handout or re-watch the video) -> so ==a deep enough network can detect patterns over a very large region==


### Pooling - Max Pooling
Keep the maximum value -> the image gets smaller


Usually one conv layer is followed by one pooling layer -> but pooling may throw away information -> AlphaGo uses no pooling
### The whole CNN

# Self-Attention
:::success
* An important module in the transformer
* If your input is a set of vectors...

* A graph is also a set of vectors
* e.g. social networks, molecules...


:::
## Type of output
* Each vector has a label

* The whole sequence has a label

* Model decides the number of labels
#### seq2seq

## Intro of self-attention
### Overview
* Self-attention is a type of network architecture
* However many vectors we feed in, self-attention produces the same number of vectors, which are then sent into a fully connected network.
* And the vectors it produces are **informed vectors** that have read **all the information in the sequence** and know the context.
* We can keep stacking self-attention and fully connected networks on top of each other

### mechanism

* And although we say "the whole sequence", we still do not want the window length to equal the whole sequence length.
* So, for each vector, we need to find the other input vectors that are relevant to it $\rightarrow$ find the relevant vectors in the sequence.

* That relevance is called the attention score.
* There are many ways to compute the attention score, but here we will use the dot product.
* After computing the dot products, we pass them through a softmax to normalize them. (You can also use other functions.)
* q stands for **query**, and k for **key**


* And finally we extract the information weighted by the attention scores (a minimal sketch follows after this list)

* You can see that the higher a vector's attention score, the closer the resulting $b^1$ will be to that vector!!!
* Whoever has the higher attention score, their $v$ has the dominant effect on the resulting $b$
* I will stop here. For more detail on self-attention, you can continue with **"self-attention part 2" + "transformer" and so on** on the Machine Learning 2021 channel (e.g. multi-head self-attention, positional information in self-attention, etc.).
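A minimal NumPy sketch of single-head dot-product self-attention, with made-up sizes and random stand-ins for the learned matrices:
```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 4, 8                         # made-up: 4 input vectors of dimension 8
A = rng.standard_normal((n, d))     # each row is one input vector a^i

Wq = rng.standard_normal((d, d))    # "learned" matrices (random placeholders here)
Wk = rng.standard_normal((d, d))
Wv = rng.standard_normal((d, d))

Q, K, V = A @ Wq, A @ Wk, A @ Wv    # queries, keys, values

scores = Q @ K.T                              # dot-product attention scores (n x n)
scores -= scores.max(axis=1, keepdims=True)   # for numerical stability
alpha = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)   # softmax per row
B = alpha @ V                       # b^i = sum_j alpha_ij * v^j
print(B.shape)                      # (4, 8): the same number of (informed) vectors out
```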
### Self-attention for Graph

* In the picture above, note that in the graph case there is **no need to compute attention scores between every pair of nodes (in most cases)**.
* If we consider the edges, i.e. the lines connecting two nodes, then only those connected nodes **need attention!**
* This is also one type of the Graph Neural Network (GNN)
# GNN
:::danger
* Please refer to [video1](https://youtu.be/eybCCtNKwzA?si=bOECe4pSrhKnUoCh) and [video2](https://youtu.be/M9ht8vsVEw8?si=QQ0KaKoxHfVUUXbu)
* Graph = node + edge
* Basically, as long as we can turn the input into a graph, and the output is also graph-structured, then it is a graph neural network.
* To put it simply: how do we consider an entity's own features and, **at the same time**, its relationships with other entities? $\rightarrow$ GNN.
:::
## How?
* As in the figure below: in general, even when there is a lot of training data, the unlabeled data is usually far more abundant than the labeled data.
* So how do we deal with this problem? How to let nodes learn from its neighbors?

## Roadmap
* We like GCN and GAT
* They can also be used in NLP

### Spatial-based
* It is like generalizing CNN to graphs.
* We use **aggregation**: the features of the neighbors are used to update the next hidden state.
* And then **readout**. (For example, when we are predicting some property of the molecule such as hydrophilicity.) A generic sketch follows below.
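A generic, hedged sketch of sum aggregation plus readout on a made-up 4-node graph (not exactly any of the specific models below):
```python
import numpy as np

# made-up graph: adjacency matrix (1 = edge) and 2-dimensional node features
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
H = np.array([[1.0, 0.0],
              [0.0, 2.0],
              [3.0, 1.0],
              [1.0, 1.0]])          # h_v for each node v
W = np.array([[0.5, -0.1],
              [0.2,  0.3]])         # made-up transform

# aggregation: sum the neighbors' features, add the node's own, transform, ReLU
H_next = np.maximum(0.0, (A @ H + H) @ W)

# readout: summarize the whole graph, e.g. by averaging node features
graph_repr = H_next.mean(axis=0)
print(H_next, graph_repr)
```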

#### NN4G
==I think it is the simplest one, since both its aggregation and its readout are computed with plain sums.==


#### MoNET
* Uses a weighted sum instead of simply summing all the neighbor features (compared to NN4G)
#### GAT
* Not only a weighted sum: we also let the model decide the weights itself (compared to MoNET)
* You **do attention** over your neighbors
:::info
And still, there are lots of models using different types of aggregation; for example, some models use an LSTM.
:::
### Spectral-based
* First take the Fourier transform of both the signal and the filter (similar to the filter in CNN), then multiply them.
* Finally, take the inverse Fourier transform to get the result.

## Summary
* Please note that I skipped a lot and only recorded what I found relevant. Detailed content is in the slides.
* GAT and GCN are the most popular GNNs
* Although GCN is mathematically driven, we tend to ignore its math
* GNN (or GCN) suffers from information loss as it gets deeper
* Many deep learning models can be slightly modified and designed to fit graph data, such as Deep Graph InfoMax, Graph Transformer, GraphBert.
* GNN can be applied to a variety of tasks