Dive into Deep Learning 筆記

# 目錄&介紹教授推薦介紹 AI 的書: https://d2l.ai/ ，這是零AI基礎學生的自學筆記 [TOC] # Introduction - **machine learning 用途: 想不到 100% 正確的演算法時，另一種解決方案** - 問題特性: - pattern 不斷改變, (pixels, 分類項目等)關係太複雜... - 解決問題需要的步驟，可能極為複雜 - e.g. computer vision, natural language processing - **machine learning 進行方式**: - 不知道 Input 如何 mapping 到 Output? - 用 *dataset* 讓一些 *parameter* 改變，使 mapping(*model*) 愈加正確就好! - 此過程稱為 *training* - **learning algorithm**: 以 dataset 決定 parameter 的演算法 - 用大量 dataset 讓 model 更正確，可以想成正在 *learning* 的程式 ## Key Components - **Data**: - Each *example* (or *data point*, *data instance*, *sample*) = a set of *features* (or *covariates*, *inputs*) - 能用數值表示的特徵就能當 *feature* - 要注意 data 必須是**正確的** (*garbage in, garbage out.*) - e.g. 沒訓練黑膚色 data 的皮膚癌檢測 model, 有偏見 data 的履審查 model - **Model**: - 吃 input data，predict 成多種 output 中一種 - 跟 [statistical models](https://www.youtube.com/watch?v=KOBaNFEyrLA&ab_channel=Dr.JackAuty) 概念一樣 (用 sample 推論 population 特性) - **Objective Functions**: - 決定什麼是 "正確", "更好" 的嚴謹數學定義 - 通常定義 Output 愈低愈"好"，所以又稱 **loss functions** - 在 optimization 時就以降低此(*loss*)為目標 - *overfitting*: training dataset(改 parameter 用) 表現很好，但在 test dataset(實際測試用) 表現不佳的現象 - **Optimization Algorithms**: - searching for the best possible parameters for minimizing the loss function - e.g. *gradient descent*: 改一點 parameter, 如果 loss 小，就往該方向修 parameter ## Kinds of Machine Learning Problems ### **Supervised Learning** - *example*: 一個 feature–label pair - Supervised Learning 目標: **給 features，predict "對應 labels" 的機率** - e.g. - Predict cancer vs. not cancer, given a computer tomography image. - Predict the translation in French, given a sentence in English - 大致流程: - training data&label --`supervised learning algorithm`--> model - Input data --`model`--> Output label #### Regression - *feature vector*: - e.g. (sq. footage, no. of bedrooms, no. of bathrooms, walking distance) - New York 房子的 *feature vector* = $[600, 1, 1, 60]$ - 目標: predict 對應 label 的 *value* (**how much/many?**) - e.g. 房子*預計*能賣*多少錢* #### Classification - 目標: predict 是*哪個*對應 label (**which one?**) - e.g. 辨識是 0-9 哪一個 - 與 Regression 用的 toolset 不太一樣 - 會以 "機率" 表示結果 e.g. 是 0 的機率是 80% - 所以判斷時要考慮**犯錯的機率&代價** i.e. 不是毒菇機率是 80%，就接受它能吃? - *hierarchical classification*: - 必須犯錯時，寧願選擇 *relative* class 而非 *distant* class - i.g. 貴賓狗 > 哈士奇 ~~迅猛龍~~ #### Tagging - 目標: predict 是*哪些*對應 label - 又稱 *multi-label classification* - e.g. 圖片裡有哪些動物、文章要 tag 什麼關鍵字 #### Search - 目標: predict 對應的 label (*優先順序*) - 可以 assign a score to each element in set 來實現 - e.g. 原本 Google 搜尋引擎演算法: [PageRank](https://en.wikipedia.org/wiki/PageRank) #### Recommender Systems - 目標: predict *使用者* 對應的 label (喜歡程度) - 一樣是用 score system，但以使用者的各種數據為 input - 有很多問題待解決 e.g. - censored feedback: 大多人只在很有感時評價(所以 1星 5星人數遠大於其他等級) - feedback loops: 現在的 habit 是來自目前演算法，而 habit 又會影響演算法，導致某產品因賣的好，被更嚴重地 recommend #### Sequence Learning - 目標: predict *連續、相關輸入* 對應的 (*sequences of*) label - *sequence-to-sequence learning*: inputs & outputs 都是 variable-length sequences - e.g. - 一句話中辨識"名字" (可 *align* Input 與 Output，使其有 one-to-one 關係) - speech-to-text transcription (Input 與 Output 沒有 one-to-one，難度極高) - Machine Translation (不只不可 align, Output 順序還可能與 Input 不同) ### **Unsupervised and Self-Supervised Learning** - 丟一坨 data，然後叫程式 "*just do some data science!*" - 能做的事 e.g. - ***clustering*: 去幫 data 分成幾類** - e.g. 一坨 photo, 自己分成 landscape, dogs, cats, mountain peaks 四類 - ***subspace estimation*: 找出形容 data 的 parameter** - e.g. 球的拋物線，用"初度、質量、重力、球直徑"能準確描述 - data 能否用數學表示? - e.g. Rome - Italy + France = Paris - data 間是否有關係? - e.g. 房價、犯罪率、汙染、薪資、教育成度統計數據關聯? - 關鍵字: *causality* and *probabilistic graphical models* - 各種生成式 Model - e.g. variational autoencoders, generative adversarial networks, normalizing flows, diffusion models - **Self-Supervised Learning**: - 沒 label? 用其他 model 產生，給要訓練的 model 訓練 - e.g. 訓練 "語詞填空" model 的 label, 就用生成文字的 model 生成 data ## Interacting with an Environment - *offline learning*: 正常 learning 方式， model 接收 learnging data&label 後就不會與環境互動 ![image](https://hackmd.io/_uploads/rksrc2QP0.png) - 如果 Output 不只是 predict，還會影響 Envirnoment (再影響 input) 就要換方式 - 此叫做 *distribution shift* problem - 此問題要考慮: - envirnoment 會記得 Output(Action) 嗎? - envirnoment 會幫助/妨礙 model 達到 Output? - envirnoment 會"不斷"改變嗎？ (shifting dynamics) ## Reinforcement Learning - 解決 distribution shift problem 的 machine learning 方式 - e.g. dialogue systems, 部分 video game AI - *Deep reinforcement learning* = 用 deep learning 技術在 reinforcement learning - e.g. AlphaGo, deep Q-network - Reinforcement Learning 順序: ![image](https://hackmd.io/_uploads/HkAl1lVDC.png) 1. *agent* receives *observation* 2. choose an *action*, 去影響 *environment* via *actuator*(傳遞機制的統一名詞) 3. *agent* receive *reward* from *environment* - agent 在此改進、變棒 - 可能很多 action 後才有 reward. e.g. 下棋，只有輸/贏後才知道 reward 4. 重複 1. - 3. - ***policy*: function that maps from *observations* to *actions*** - Reinforcement Learning 目標: **產生 good policies** - supervised learning can be recast as reinforcement learning 喔 - 如 classification problem，class --> action, loss function --> reward - problems: - ***credit assignment problem*: determining which actions to credit or blame for an outcome** - e.g. 決定做什麼會影響 promotion - ***partial observability problem*: current observation 可能沒有 current state 所有資訊** - e.g. 避免掃地機器人進衣櫥、不斷原地打轉 - **選擇 policy 方式:** - 直接 *exploit* the best (currently) known strategy? - 先 *explore* 其他 strategies，(可能)放棄短期 reward 換取 knowledge - 可能的 speical case: - *Markov decision process*: environment is fully observed - *contextual bandit problem*: states 與 previous actions 無關 - *multi-armed bandit problem*: 沒有 states，只有 actions 與對應未知的 reward ## Roots - 歷史，有興趣再看: https://d2l.ai/chapter_introduction/index.html#roots - 有趣的點: - Bernoulli distribution, Gaussian distribution 等統計學工具，做的事與 machine learning 一樣: Data --> model 然後 Input -- model --> Output - 現在 deep learning 一部分 base on 古人概念(*Hebbian learning rule*): **neurons learn by positive reinforcement** - 如果像以前，**資料量很少**，能試試以前的 AI 技術: - *kernel method, decision tree, graphical model* - 還有**強大理論基礎、可預測的 result** ## The Road to Deep Learning - 也是歷史: https://d2l.ai/chapter_introduction/index.html#the-road-to-deep-learning - 列了很多 optimize machine learning 的事蹟，很值得看看 ## Success Stories - 更多歷史: https://d2l.ai/chapter_introduction/index.html#success-stories - ps. 目前的 AI 做不到 "主動修正 model"，所以無法做到自行進化 ## The Essence of Deep Learning - *Deep Learning* = many-layered neural networks (machine learning 的一種) - ***end-to-end training*: 全部、很多層的 layer 會一起被 data training** - 正常 machine learning 是把不同 component 分別完成，再組一起 - e.g. 之前 computer vision 需要用各種 filter 取 *feature* (累)，現在直接丟 feature extractor 就好 - 能做到正常 machine learning 做不到的事 - *fully nonparametric models*: - 不做 "解 euqation" 去形容一件事(產生 Output) - 做 partial differential equation 的 numerical simulations - 理解: 對難解的 equation，用 data discretize 以取得 "近似" 解 ## Summary - learning: **automatically find the appropriate way to represent data** - deep learning: **multi-level representation learning** through learning many layers of transformations # Linear Neural Networks for Regression 程式使用 python 3.10, Jupyter Notebook, 以 **PyTorch** 為 deep learning frameworks - 根據[這個](https://www.simplilearn.com/keras-vs-tensorflow-vs-pytorch-article)，pytorch performance 比 tensorflow 慢，但比較 flexible，易上手 - 適合初學者, prototyping, 探索 AI 功能用 - 課本預設也用這個 - 依照[課本需求](https://d2l.ai/chapter_installation/index.html#installing-the-deep-learning-framework-and-the-d2l-package)，載 [CUDA](https://developer.nvidia.com/cuda-toolkit-archive)([維基介紹](https://zh.wikipedia.org/zh-tw/CUDA))等啟用 GPU，然後載 [PyTorch](https://pytorch.org/get-started/previous-versions/) 與 d2l (課本幫忙寫好的常用 liberary) - 根據[這個](https://medium.com/@anannannan0102/pytorch-%E6%83%B3%E7%94%A8gpu%E8%B7%91ml%E7%92%B0%E5%A2%83%E5%8D%BB%E8%A3%9D%E4%B8%8D%E5%A5%BD-%E7%9C%8B%E5%AE%8C%E9%80%99%E7%AF%87%E5%B8%B6%E4%BD%A0%E9%81%BF%E9%96%8B%E5%90%84%E7%A8%AE%E9%9B%B7-3bf259fc7396)，不要馬上裝 CUDA！要取 GPU, Python, PyTorch 中**最低**支援 cuda 版本 (因為版本向下兼容): - GPU CUDA 版本: terminal 指令 `nvidia-smi` - Python 版本: terminal 指令 `python -V` - PyTorch 查詢: https://github.com/pytorch/pytorch/blob/main/RELEASE.md - 好的安裝順序：**依 GPU 載 [CUDA](https://developer.nvidia.com/cuda-toolkit-archive) > 看對應 PyTorch 的 Python 版本 > 載 Python > 載 [PyTorch](https://pytorch.org/get-started/previous-versions/)**(要用 `#CUDA` 的指令) - 要確定: ```py import torch torch.cuda.is_available() # true ``` 才是 GPU 版安裝成功！ - 記得 `pip install d2l==1.0.3` 喔 ## Linear Regression - **Regression problems: predict a *numerical value*** - e.g. 以屋齡和坪數預測房價 - ***training dataset* or *training set*: Input 與已有 Output 的 data** (屋齡, 坪數, 房價都要有) - *example* (or *data point*, *instance*, *sample*): 1 筆 data (屋齡, 坪數, 房價)，通常在 row - *label* (or *target*): 要 predict 的 Output - *features* (or *covariates*): Input (predict 用 variable) ### Basics - **conditional mean(expectation):** - $\vec{x}$ 是 features 的 (values) vector, $y$ 是 output target，$X, Y$ 分別是 $\vec{x}$, $y$ 的 random variable - $E(Y | X=\vec{x}) = \sum_yy\frac{P(X=\vec{x}, Y=y)}{P(X=\vec{x})}$， **$X=\vec{x}$ 時 $Y$(random variable output) 的期望值** - where $P(X=\vec{x}, Y=y)$ 是 $\vec{x}, y$ 都發生的機率 - 如果 $X, Y$ 是獨立事件，$P(X=\vec{x}, Y=y) = P(X=\vec{x})P(Y=y)$，那 $E(Y | X=\vec{x}) = \sum_yyP(Y=y)$，也就是正常的 expectation value - 如果數值連續，要換成 $E(Y | X=\vec{x}) = \int_{-\infty}^{\infty}yf_{Y|X}(y|\vec{x})$ - where $f_{Y|X}(y|\vec{x}) = \frac{f_{Y,X}(y,\vec{x})}{f_X(\vec{x})}$ - **Linear regression 的假設:** 1. 假設 features $\vec{x}$ and target $y$(可能只有一個 value) is *approximately **linear***, i.e. **假設 output $E(Y | X=\vec{x})$ 會 $=$ weighted sum of features $\vec{x}$** - 以期望值形容 output，就允許 output 與實際 observation 有誤差 (by *observation noise*(測量誤差)) 2. 假設 **observation noise "well behaved" (follow Gaussian distribution)** > Q: 假設成 Gaussian distribution 原因？ > 推測Ans (*我亂想的，**需專家 clarification***): > - 經驗法則(極端值發生機率極小，在期望值附近發生機率大)，有這種現象都會近似鐘形分布，所以想用 Gaussian distribution 近似(假設)它 > - 特別選 "Gaussian" distribution，可能因為 1. 只需用 mean, sd 表示 2. 對稱 3. 因為 central limit theorem, 這個 distribution 被研究最深，已"證明出"很多數據、特性與延伸公式（要"假設"當然先用方便、熟悉的東西假設） - Notations: - 第 $i$ 個 *sample*: $\vec{x}^{(i)}$ - 第 $i$ 個 *sample* 中第 $j$ 個 element($j$-th coordinate): $x^{(i)}_j$ ### Model - linearity 假設用於 "屋齡, 坪數預測房價" e.g. $$ price = w_{area} \cdot area+w_{age} \cdot age+b $$ - $w_{area}$, $w_{age}$ 等稱為 *weight* - 表示 "the influence of each feature on our prediction" - $b$ 稱為 *bias* (or *offset*, *intercept*) - 表示 "the value of the estimate when all features are zero" - ***affine transformation* = Linear Transformation(*weights*) + Translation(*bias*)** - 我理解成圖學 Rotation, Scale, Shear + Position 的概念 - Linear Regression 目標: **給 Input($\vec{x}$)，找 affine transformation($\vec{w}$ 與 $b$) 使 output ($y$) 與實際 data 愈接近愈好** - generalize formula: $$ \hat{y} = w_1x_1+ \ldots + w_dx_d + b $$ - $\hat{y}$: Output - $d$: 總共有 d 個 feature - 以 $d$ dimision vector($d*1$ matrix), matrix 相乘表示 (dot product 關係): $$ \hat{y} = \vec{x}^T \vec{w} + b $$ - $\vec{x}$ 裡面有 1 個 example 個 data，我們要表示 $n$ 個 example，所以把 $\vec{x}^T$ ($1*d$ matrix) 擴展成 $\vec{X}$ ($n*d$ matrix)，其中 row 是 examples, 每個 column 是各自的 feature $$ \hat{\vec{y}} = \vec{X} \cdot \vec{w} + b $$ - $n$ 筆 data 會有 $n$ 筆 Ouput，所以 $\hat{\vec{y}}$ 是 $n*1$ matrix - $\vec{X}$ 又稱 *design matrix* - $+b$ 的部分要 *broadcasting* (element-wise operation to result array) - Linear Regression 目標重申: **給 Input($\vec{X}$) 與 known labels $\vec{y}$，找 affine transformation($\vec{w}$ 與 $b$) 使新 Input 的 output ($\hat{\vec{y}}$) 與實際 data 誤差最小** - 即使 100% 確定是 linear model，一定也有測量誤差，所以"誤差"一定要考慮 - $\vec{w}$ 與 $b$: 就是 *model parameter* ## Loss Function - *Loss functions* quantify **the distance between the real and predicted values of the target** - 通常結果 >= 0，愈接近 0 愈好 - **function 都是自己定義的** - regression problem 最常用: *squared error* $$ l^{(i)}(\vec{w}, b) =\frac{1}{2}(\hat{y}^{(i)}-y^{(i)})^2 $$ - example $i$ 的輸出 $\hat{y}^{(i)}$，真正 label $y^{(i)}$ (都是 value)，*squared error* 就是 $l^{(i)}(\vec{w}, b)$ - $\frac{1}{2}$ 是後續計算方便(微分後 $*2$ 消除)，才加上去 - $\vec{w}, b$ 是我們僅能控制的 variable (藏在 $\hat{y}^{(i)}$ 項!) - 平方，使 loss >= 0 且強迫我們誤差要更精準 (記得**平方**在那邊就好) - 如果要 $n$ sample 的 squared error: $$ L(\vec{w},b)=\frac{1}{n}\sum_{i=1}^n l^{(i)}(\vec{w}, b) $$ - 把 $1$ 到 $n$ 的 squared error 取平均就是了 - 表示 "找 $(\vec{w}^*, b^*)$ 使 $L(\vec{w},b)$ 最小" 的數學式: $$ \DeclareMathOperator*{\argmin}{argmin} \vec{w}^*, b^* = \argmin_{\vec{w},b}L(\vec{w},b) $$ ### Analytic Solution -在 n 維做 Linear regression 能用簡單 formula 去解! 1. 把 $b$ 納入 $\vec{w}$ by $$ \begin{flalign} & \hat{\vec{y}} \\ &= \vec{X} \cdot \vec{w} + b \\ & = \vec{X} \cdot \vec{w} + \begin{bmatrix}1\\1\\ \vdots \\ 1 \end{bmatrix} \cdot b\\ & = \begin{bmatrix}&1\\ \vec{X} &\vdots \\& 1 \end{bmatrix} \cdot \begin{bmatrix}\vec{w} \\b \end{bmatrix} \end{flalign} $$ - $\vec{X}$ 多加一個全是 1s 的 column， $b$ 就能加入 $\vec{w}$ 中 2. 又根據 squared error 定義 $$ \begin{flalign} &L(\vec{w},b) \\ &=\frac{1}{n}\sum_{i=1}^{n} l^{(i)}(\vec{w}, b) \\ & = \frac{1}{n}\sum_{i=1}^{n} \frac{1}{2}(\hat{y}^{(i)}-y^{(i)})^2 \\ & = \frac{1}{2n}\sum_{i=1}^{n} ({\vec{x}^{(i)}}^T\vec{w}^{(i)}- y^{(i)})^2 \\ & = \frac{1}{2n}||\vec{y}-\vec{X}\vec{w}||^2 \\ \end{flalign} $$ - 每項相減、平方、相加，就是距離公式，$n$ elements 的 $\vec{y}$ 與 $\vec{X}\vec{w}$ 的距離公式 - 所以目標是 minimize $||\vec{y}-\vec{X}\vec{w}||^2 = (\vec{y}-\vec{X}\vec{w})^T(\vec{y}-\vec{X}\vec{w})$ 的**值**(它不是 vector!) 3. 令: $$ \begin{flalign} & G(\vec{w})\\ &=(\vec{y}-\vec{X}\vec{w})^T(\vec{y}-\vec{X}\vec{w}) \\ &=(\vec{y}^T-(\vec{X}\vec{w})^T)(\vec{y}-\vec{X}\vec{w}) \\ & =\vec{y}^T\vec{y} -\vec{y}^T(\vec{X}\vec{w}) - (\vec{X}\vec{w})^T\vec{y}+(\vec{X}\vec{w})^T(\vec{X}\vec{w}) \\ & = \vec{y}^T\vec{y} -2\vec{y}^T\vec{X}\vec{w}+\vec{w}^T\vec{X}^T\vec{X}\vec{w} \\ \end{flalign} $$ - 最後的 $\vec{y}^T(\vec{X}\vec{w})=(\vec{X}\vec{w})^T\vec{y}$，來自: 這兩個都在做 $\vec{y}$ 與 $(\vec{X}\vec{w})$ 的 dot product 4. 做 $\frac{\partial G(\vec{w})}{\partial \vec{w}}=\begin{bmatrix} \frac{\partial G(\vec{w})}{\partial w_1}\\\frac{\partial G(\vec{w})}{\partial w_2}\\ \vdots \\ \frac{\partial G(\vec{w})}{\partial w_n} \end{bmatrix}$ (這是定義，就是 Gradient vector，**$\vec{w}$ 的每一項為 $G(\vec{w})$ 帶來的變化速率**)，在 $=\vec{0}$ 處 $G(\vec{w})$ 會有極小值 (極大值在此不存在，所以極值就是極小) $$ \begin{flalign} & \frac{\partial G(\vec{w})}{\partial \vec{w}}=-2\vec{X}^T\vec{y}+2(\vec{X}^T\vec{X})\vec{w}=\vec{0}\\ \end{flalign} $$ - $\vec{y}^T\vec{y}$ 沒有任一項 $\vec{w}$，被微分掉了 - $-2\vec{y}^T\vec{X}\vec{w}$ 是 $(\vec{y}^T\vec{X})^T$ 與 $\vec{w}$ 的 dot product，對 $\vec{w}$ 每一項微分，會剩原本 $(\vec{y}^T\vec{X})^T$ 對應的項，也就是 $-2\vec{X}^T\vec{y}$ $$ \begin{bmatrix} \frac{\partial (a_1w_1+a_2w_2+ \ldots + a_nw_n)}{\partial w_1}\\ \frac{\partial (a_1w_1+a_2w_2+ \ldots + a_nw_n)}{\partial w_2} \\ \vdots \\ \frac{\partial (a_1w_1+a_2w_2+ \ldots + a_nw_n)}{\partial w_n} \\ \end{bmatrix} $$ - $\vec{w}^T\vec{X}^T\vec{X}\vec{w}$，想成 $\vec{w}^T(\vec{X}^T\vec{X})\vec{w}$，對 $w_i$ 微分後，留下的只有 $\sum_{k=1}^nx_{ik}w_k$ 與 $\sum_{m=1}^nx_{mi}w_m$，也就是被 $w_i$ 乘過的 row 與 column，**但 $\vec{X}^T\vec{X}$ 是 symmetric matrix**，$x_{ik}=x_{ki}$，所以能看成剩兩倍的 $\sum_{k=1}^nx_{ik}w_k$，也就是 $2(\vec{X}^T\vec{X})\vec{w}$ $$ \begin{flalign} & \begin{bmatrix}w_1 & w_2 & \ldots & w_n\end{bmatrix} \begin{bmatrix}x_{11}&x_{12} & \ldots & x_{1n}\\ x_{21}&x_{22} & \ldots & x_{2n}\\ &\vdots \\ x_{n1}&x_{n2} & \ldots & x_{nn}\end{bmatrix} \begin{bmatrix}w_1 \\ w_2 \\ \vdots \\ w_n\end{bmatrix} \\ & =\begin{bmatrix}w_1 & w_2 & \ldots & w_n\end{bmatrix} \begin{bmatrix}x_{11}w_1+x_{12}w_2+ \ldots + x_{1n}w_n \\ x_{21}w_1+x_{22}w_2+ \ldots + x_{2n}w_n \\ \vdots \\ x_{n1}w_1+x_{n2}w_2+ \ldots + x_{nn}w_n\end{bmatrix}\\ \end{flalign} $$ 5. 解 $-2\vec{X}^T\vec{y}+2(\vec{X}^T\vec{X})\vec{w}=\vec{0}$ $$ \begin{flalign} &(\vec{X}^T\vec{X})\vec{w}=\vec{X}^T\vec{y} \\ &\vec{w}=(\vec{X}^T\vec{X})^{-1}\vec{X}^T\vec{y} \\ \end{flalign} $$ 6. 因此只要 $\vec{X}$ invertible 使 $\vec{X}^T, \vec{X}^T\vec{X}$ invertible，那 $||\vec{y}-\vec{X}\vec{w}||^2$ min 的 $\vec{w}$ 就會存在，所求 $\vec{w}^*=(\vec{X}^T\vec{X})^{-1}\vec{X}^T\vec{y}$ > Q: $||\vec{y}-\vec{X}\vec{w}||^2$ 最小明顯是 $\vec{y}=\vec{X}\vec{w}$ 情形，為什麼不是 $\vec{w}^*=\vec{X}^{-1}\vec{y}$ 就好？ > A: **因為 $\vec{X}$ 可以不是 square Matrix！** > - 所以也能想成: 為了湊 square Matrix 把 $\vec{X}^T\vec{y}=\vec{X}^T\vec{X}\vec{w}^*$，就能推出 $\vec{w}^*=(\vec{X}^T\vec{X})^{-1}\vec{X}^T\vec{y}$ - 結: 要用公式解 $\vec{w}^*=(\vec{X}^T\vec{X})^{-1}\vec{X}^T\vec{y}$，前提是 $\vec{X}$ invertible - 雖然很合理(有 scalar mutiple 關係的 row 應該是一樣的 data 吧，或許挑掉就好)，但我們要*更自由*，用於*更複雜*(可能無法解 $\frac{\partial G(\vec{w})}{\partial \vec{w}}=\vec{0}$)的 model，因此發明... ### Minibatch Stochastic Gradient Descent - Gradient Descent: - **以 Gradient 方向調整參數 $\vec{w}$，減少 loss function 值的方式** - *naive way*: **直接對 loss function 取 derivative** - 雖然每次改動會極度有效，但**超慢** (每次要把"整個" dataset $\vec{X}$ 傳入、微分) - 如果 dataset 內很多 redundancy，改動效益也會大減 - 如果要像 linear regression 有漂亮公式解， loss function (含產出 y 值 model 與 y' 的複雜運算) 微分後要有漂亮的 $<只有 w_1 項, 只有 w_2 項, ..., 只有 w_n 項>$，但實際上第 $i$ 項都會混雜其他非 $w_i$ 的項(假設 $y_0$ = $w_0x_0w_1x_1$ i.e. 閹割版 fully connect, 用距離公式當 loss function 就好) - *stochastic gradient descent* (SGD): 根據"每筆資料"去 update loss - 看似比較快，但總時長會輸 naive way (一次大矩陣乘法比多次 memory load/store 的 vector 相乘好) - 而且無法 normalization (畢竟只有一筆 data) - 現在用折衷方式: *Minibatch Stochastic Gradient Descent* - 一次用一個 *minibatch* of observations - *minibatch* 大小由 memory, total data size 等決定，通常是 32 - 256 (2 的倍數) - 詳細執行方式: 1. 隨機 init model 參數 $\vec{w}, b$ 2. 在第 $t$ 個 iteration 中，隨機選到一個 minibatch $B_t$，以 $B_t$ 內所有 data 和 $\vec{w}, b$ 算出 Gradient vector (每單位 loss 增加的向量) 的值，取平均，往反方向減掉 $\vec{w}, b$ $$ \begin{flalign} & \vec{w} \leftarrow \vec{w} - \frac{\eta}{|B_t|}\sum_{i \in B_t} \partial_{\vec{w}}l^{(i)}(\vec{w},b) \\ & b \leftarrow b - \frac{\eta}{|B_t|}\sum_{i \in B_t} \partial_bl^{(i)}(\vec{w},b) \\ \end{flalign} $$ - $\eta$ 是 *learning rate*，控制每次要往 Gradient vector 減少多少 - 在單純 Linear Regression 中，以 square error 定義微分: $$ \begin{flalign} & \partial_{\vec{w}}l^{(i)}(\vec{w},b)\\ &=\partial_{\vec{w}}\frac{1}{2}({\vec{x}^{(i)}}^T\vec{w}^{(i)}+b- y^{(i)})^2\\ & =\partial_{\vec{w}} \frac{1}{2}({\vec{w}^{(i)}}^T\vec{x}^{(i)}{\vec{x}^{(i)}}^T\vec{w}^{(i)}+b^2+{y^{(i)}}^2+2{\vec{x}^{(i)}}^T\vec{w}^{(i)}b-2{\vec{x}^{(i)}}^T\vec{w}^{(i)}y^{(i)}-2by^{(i)}) \\ & =\frac{1}{2}(2\vec{x}^{(i)}{\vec{x}^{(i)}}^T\vec{w}^{(i)}+2\vec{x}^{(i)}b-2\vec{x}^{(i)}y^{(i)})\\ & = \vec{x}^{(i)}({\vec{x}^{(i)}}^T\vec{w}^{(i)}+b-y^{(i)})\\ \end{flalign} $$ $$ \begin{flalign} & \partial_bl^{(i)}(\vec{w},b)\\ & =\partial_b \frac{1}{2}({\vec{w}^{(i)}}^T\vec{x}^{(i)}{\vec{x}^{(i)}}^T\vec{w}^{(i)}+b^2+{y^{(i)}}^2+2{\vec{x}^{(i)}}^T\vec{w}^{(i)}b-2{\vec{x}^{(i)}}^T\vec{w}^{(i)}y^{(i)}-2by^{(i)}) \\ & =\frac{1}{2}(2b+2{\vec{x}^{(i)}}^T\vec{w}^{(i)}-2y^{(i)})\\ & = {\vec{x}^{(i)}}^T\vec{w}^{(i)}+b-y^{(i)}\\ \end{flalign} $$ 所以公式會變成: $$ \begin{flalign} & \vec{w} \leftarrow \vec{w} - \frac{\eta}{|B_t|}\sum_{i \in B_t}\vec{x}^{(i)}({\vec{x}^{(i)}}^T\vec{w}^{(i)}+b-y^{(i)}) \\ & b \leftarrow b - \frac{\eta}{|B_t|}\sum_{i \in B_t}{\vec{x}^{(i)}}^T\vec{w}^{(i)}+b-y^{(i)}\\ \end{flalign} $$ - 在微分時 $\vec{x}^{(i)}$ 不會動，如同 $b$ 的係數 $1$，所以 $\partial_{\vec{w}}$ 與 $\partial_b$ 結果一樣，不意外 - **用 chain rule 想就更合理了~** - square error 特性，會把 $b$ 全部乘一遍，所以每項都是 $b$ 係數，微分後也會留著，有 $2$ 倍是因為左乘右+右乘左+自己的平方項 3. 重複 2.，直到指定次數 or 滿足某些條件，得到的 $\vec{w}, b$ 應該會使 loss 在哪筆 data 的 loss 都不錯 - minibatch 大小 $|B_t|$ (**所有 minibatch 會一樣**)與 learning rate $\eta$ 是 user define 的，可以有一些 tuning 方式使結果更好 (e.g. Bayesian optimization) - 這些不會在 training loop update 的 parameter 叫做 *hyperparameters* - 最後訓練完，會把 $\vec{w},b$ 餵給 *validation dataset*(or *validation set*) 看 function 表現是否優異 - **Minibatch Stochastic Gradient Descent 無法真的找到 minimizers of loss** - 因為 "隨機選" $B_t$ - 但這不遭，這只是在 "training set" 上無法達到 min，實務上要在 "unseen data" 準確 predict 才好(要 *generalization*)，所以這種方式才廣被大家使用 - 所以才說我們在算**期望值**: $E(Y | X=\vec{x})$，只是**假設**這個值 $=\vec{x}^T\vec{w}+b$ ### Prediction 有 $\hat{y}=\vec{x}^T\vec{w}+b$ 的 $\vec{w},b$，就能吃新的 $\vec{x}$ 來 "Predict" 新的 $\hat{y}$ 了！ ## Vectorization for Speed - 這樣，可能會有硬體 vector 運算(GPU, SIMD等) 加持: ```py n = 10000 a = torch.ones(n) b = torch.ones(n) c = a + b # operator overloaded ``` - 所以不要這樣: ```py c = torch.zeros(n) for i in range(n): c[i] = a[i] + b[i] ``` ## The Normal Distribution and Squared Loss - Squared Loss 定義的特性: - 會給出一個 $E(Y | X=\vec{x})$ 即使 $X$ 不是真的 linear - data 離目前預測值有差，會有 large penalty - 還能兼容對 noise distribution 的 probabilistic assumptions - normal distribution 公式 (給標準差 $\sigma$ 與平均 $\mu$): $$ p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \cdot e^{-\frac{1}{2\sigma^2}(x-\mu)^2} $$ - 需要理解 central limit theorem 的推倒才能懂公式來源，我數學力不足了... > NEED HELP! - 記得這是 PDF(probability density function)，是連續的，積分($p(x)$ 底下面積)才是機率，單一個 $p(x)$ 高度沒有實際意義 - 定義 noise $\epsilon$ 來自**對 normal distribution 的抽樣** - "對 distribution 抽樣" 的意義: - **distribution 來自"某個 process" 抽樣無限多次，把抽樣結果的機率記下來後，會*遵守*此 function，"對 distribution 抽樣" 是指對 "某個 process" 抽樣的意思** - normal distribution 中，"某個 process" 能想成 central limit theorem 提到的 "從任意母群 $n$ 次抽樣的平均, $n \rightarrow\infty$" (記得 process 要做無限次) - 可能的抽樣方式 : - **隨機產生 0-1 之間的數，丟入 normal distribution 的 CDF(cumulative distribution function) 的 inverse** 就好 - 意義: - process 抽出 $\le \epsilon$ 數字的機率是"隨機產生 0-1 之間的數" - 但 inverse CDF 沒有公式，要用數值 mapping (以 x=0.001 區間代入正常 CDP 求(x,y)) 或使用各種近似的公式 - "隨機產生" 整數的方式: https://www.pcg-random.org/ - 簡單來說，先做 64 bits 的 `new=(a*old+c) % m`，只取 `new` 前 37 bits，對自己 shift 之後 XOR 原本自己，再用前 5 bits 決定後 32bits 要做多少 bit rotation，輸出後 32 bits 就是"隨機"整數 - 但為什麼是"好的隨機"還不懂... > NEED HELP! - 0-1 只要再除以能產生的最大產生整數即可 - Box-Muller transform: - 好文激推: https://medium.com/mti-technology/how-to-generate-gaussian-samples-3951f2203ab0 - 簡單來說: 假設 $f(x),f(y)$ 都來自 standard normal distribution 抽樣，$f(x,y)=f(x)f(y)$ 會是 $\frac{1}{2\pi}$ 與 $e^{-\frac{r^2}{2}}$ 的 joint PDF ($r^2=x^2+y^2$)，可以分別理解成 $\theta$ 與 $r$ 的 PDF 結果，所以把隨機 0-1 丟入 $\frac{1}{2\pi}$ 與 $e^{-s}$ 的 inverse CDF，求出 $\theta, r=\sqrt{2s}$，就會是對 $f(x)f(y)$ 的抽樣，$x=r\cos\theta$, $y=r\sin\theta$ - 實際上我們正在用 $s=\frac{r^2}{2}$ 與 $\theta$ 去算 $f(x,y)$，需要把 $f(x,y)$ 除以 [Jacobian](https://math.stackexchange.com/questions/267267/intuitive-proof-of-multivariable-changing-of-variables-formula-jacobian-withou) 確保 $f(r,\theta)$ 底下面積仍為 $1$，但 Jacobian = 1，所以不用了~ - Jacobian 簡單來說: 是 $dr, d\theta$ 為座標，$dx, dy$ 為 basis 的 transformation，會得到 $dr, d\theta$ 轉成 $dx, dy$ 坐標系後的樣子，因為積分要求面積，單位 $dr, d\theta$ 被轉換後會變成平行四邊形，才取 determine ![image](https://hackmd.io/_uploads/B1dLkf9tC.png) - 圖源: https://zh.wikipedia.org/zh-tw/%E8%A1%8C%E5%88%97%E5%BC%8F#%E5%87%A0%E4%BD%95%E6%84%8F%E4%B9%89%EF%BC%9A%E4%BA%8C%E7%BB%B4%E5%92%8C%E4%B8%89%E7%BB%B4%E6%AC%A7%E6%B0%8F%E7%A9%BA%E9%97%B4%E4%B8%AD%E7%9A%84%E4%BE%8B%E5%AD%90 - (ps. 統計學中，對任意母群一次抽樣 $n \ge 30$ 的抽樣平均，就大概能當成在 $\mu = \mu_{母群}$, $\sigma=\frac{\sigma_{母群}}{n^2}$ 的 noraml distribution 中的抽樣了，但我們能用以上方式，做更好) - 終於，能把預測結果 $y$ 寫成: $$ y=\vec{x}^T\vec{w}+b+\epsilon $$ - $\epsilon$ 來自 $\mu=0$, $\sigma^2$ 未定的 normal distribution 抽樣: $$ \epsilon \sim N(0,\sigma^2) $$ - 用 normal distribution 公式，能回推 $\vec{x}$ 下測到 $y$ 的 *likelihood*: $$ l(y|\vec{x}) = \frac{1}{\sqrt{2\pi\sigma^2}} \cdot e^{-\frac{1}{2\sigma^2}(y-\vec{x}^T\vec{w}-b)^2} $$ 根據 *principle of maximum likelihood*，要**猜一個 $\vec{w},b$ 最好描述使這些抽樣 $\vec{X}$**，**最好猜能讓所有 *likelihood* 最大的 $\vec{w},b$，使 $\vec{y}$ 最可能發生**: $$ L(\vec{y}|\vec{X}) = \prod_{i=1}^{n}\frac{1}{\sqrt{2\pi\sigma^2}} \cdot e^{-\frac{1}{2\sigma^2}(y^{(i)}-{\vec{x}^{(i)}}^T\vec{w}-b)^2} $$ - 雖然在 PDF 中，函數值 * 底才是機率，但 /////////////////// - 最大化 $L(\vec{y}|\vec{X})$ 與最小化 $-ln(L(\vec{y}|\vec{X}))$ 等價: $$ \begin{flalign} & -\ln(L(\vec{y}|\vec{X}))\\ &=-1 \cdot (\ln((2\pi\sigma^2)^{-\frac{n}{2}})+\sum_{i=1}^n-\frac{1}{2\sigma^2}(y^{(i)}-{\vec{x}^{(i)}}^T\vec{w}-b)^2) \\ &=-\ln(e^{\ln(2\pi\sigma^2) \cdot -\frac{n}{2}})+\sum_{i=1}^n\frac{1}{2\sigma^2}(y^{(i)}-{\vec{x}^{(i)}}^T\vec{w}-b)^2 \\ & = \frac{n}{2}\ln(2\pi\sigma^2)+\sum_{i=1}^n\frac{1}{2\sigma^2}l^{(i)}(\vec{w},b) \end{flalign} $$ - $l^{(i)}(\vec{w},b)$ 是一開始 linear regression 定義的 *squared error* $(y^{(i)}-\hat{y})^2$ - 因為 $\frac{n}{2}\ln(2\pi\sigma^2)$ 與 $\frac{1}{2\sigma^2}$ 對"改動 $\vec{w}, b$ 使 $-\ln(L(\vec{y}|\vec{X}))$ 最小"沒影響(**$\sigma$ 是測量誤差的標準差，不會隨"預測模型"的參數 $\vec{w},b$ 變動而改變**)，所以我們的**目標還是最小化 *squared error*** - **結論**: - 假設 noise 來自 noraml distribution + 用 squared error 當成 loss function = 可以忽略 noise，直接 mininize loss function 就能找到最佳的 linear regression model - 在這樣假設下，**noise 對 model 沒有影響** ## Linear Regression as a Neural Network