[Theory] Choosing Algorithms, CNN, RNN, Layers

# [Theory] Choosing Algorithms, CNN, RNN, Layers
###### tags: `機器學習`

[TOC]

## Measures of Model Quality
### The ideal model
* **Data preprocessing should be written as part of the model itself.**
* Following from that, **it is best to write the preprocessing code yourself**. Relying on an off-the-shelf pipeline to handle the whole chain can make the model less portable: when you switch programming languages, a missing feature may severely degrade performance or even make the whole model unusable.
    > The ideal model should expect as input something as close as possible to raw data: an image model should expect RGB pixel values in the [0, 255] range, and a text model should accept strings of utf-8 characters.
    >
    > --[Introduction to Keras for Engineers](https://keras.io/getting_started/intro_to_keras_for_engineers/)
* Some machine learning algorithms cannot directly handle data with non-numeric labels, so such data has to be converted to numbers first:
    * Ex. ``'pet':['dog', 'fish', 'fox']``.
    > Many machine learning algorithms cannot operate on label data directly. They require all input variables and output variables to be numeric.
    >
    > In general, this is mostly a constraint of the efficient implementation of machine learning algorithms rather than hard limitations on the algorithms themselves.
    >
    > This means that categorical data must be converted to a numerical form.
    >
    > -- [Why One-Hot Encode Data in Machine Learning?](https://machinelearningmastery.com/why-one-hot-encode-data-in-machine-learning/)

### R-squared
* Meaning: what percentage of the variation between the two variables is accounted for by the relationship that a function $f$ predicts.
* Calculation:
$$R^2=\frac{\text{Var}(\text{mean})-\text{Var}(f(x))}{\text{Var}(\text{mean})}$$
    * where
    $\text{Var}(\text{mean})=\sum_i(y_i-\bar y)^2$
    $\text{mean}=$ the mean of the $y$-axis variable, i.e. $\bar y$
    $\text{Var}(f(x))=\sum_i(y_i-f(x_i))^2$
    (I am still not entirely sure about this!)
> Reference: [StatQuest: R-squared explained](https://www.youtube.com/watch?v=2AQKmw14mHM)

:::warning
Q: Why did I hear from other interns who implement models that "R-squared came out negative"?
Q: Which two are the "two variables" above? $x$ and $y$?
:::

### MAPE = Mean Absolute Percentage Error
Expresses, as a **percentage**, how far the test output deviates from the ground truth.
$$\text{MAPE}=\frac{100\%}{n}\sum^n_{i=1}|(y_i-\hat y_i)/y_i|$$
where $y_i$ is the ground truth and $\hat y_i$ is the value predicted by the machine.
> Reference: [Root Mean Square Error (RMSE) Tutorial + MAE + MSE + MAPE+ MPE | By Dr. Ry @Stemplicity](https://www.youtube.com/watch?v=KzHJXdFJSIQ&t=618s)

## Types of Layers
* My impression: seen in terms of sets, layers are like categories of functions.
    * Ex.
If a function combines all of its input vectors into a single output value through a first-degree (linear) expression, we can somehow classify it as a dense layer.

### Dense Layer
aka **fully connected layer** (全結合層, 完全連接層).
* Idea: "**combine all the parameters into a single value**".
    Ex. For the incoming parameters (variables) $x_i$, write out their first-degree expression: $x_1a_1+x_2a_2+...+x_na_n$.
    - In practice, though, it may instead combine data of shape `(None, n)` into a smaller `(None, m)`, where `n`>`m`.

#### Implementation example: Keras
```python
# Imports assumed for this excerpt; ElmoEmbeddingLayer, build_model and the
# train/test arrays are defined in the source article and are not shown here.
import tensorflow as tf
from tensorflow.keras import layers
from tensorflow.keras.models import Model

input_text = layers.Input(shape=(1,), dtype=tf.string)
embedding = ElmoEmbeddingLayer()(input_text)
dense = layers.Dense(256, activation='relu')(embedding)
pred = layers.Dense(1, activation='sigmoid')(dense)

model = Model(inputs=[input_text], outputs=pred)
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()

model.fit(train_text, train_label, validation_data=(test_text, test_label), epochs=5, batch_size=32)

# (omitted) ...

model.save('ElmoModel.h5')
pre_save_preds = model.predict(test_text[0:100]) # predictions before we clear and reload model

# Clear and load model
model = None
model = build_model()
model.load_weights('ElmoModel.h5')
post_save_preds = model.predict(test_text[0:100]) # predictions after we clear and reload model

all(pre_save_preds == post_save_preds) # Are they the same?
```
```
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
input_2 (InputLayer)         (None, 1)                 0
_________________________________________________________________
elmo_embedding_layer_2 (Elmo (None, 1024)              4
_________________________________________________________________
dense_3 (Dense)              (None, 256)               262400
_________________________________________________________________
dense_4 (Dense)              (None, 1)                 257
=================================================================
Total params: 262,661
Trainable params: 262,661
Non-trainable params: 0
_________________________________________________________________
True
```
> Source: https://towardsdatascience.com/elmo-embeddings-in-keras-with-tensorflow-hub-7eb6f0145440

:::warning
Q: In the Keras code above, what do the `relu` and `sigmoid` settings of the `activation` parameter mean?
- Do they specify which function the dense layer's linear combination is passed through (normalized by) before it is output?
- Seen that way, the dense layer feels like **the punchline of the whole model's training**: once many ($n$) weights have been obtained, they are actually applied to the input data $x$ (possibly already converted to numbers and normalized) to produce the $y$ value that the loss is computed from.
:::

### Input Layer
The first layer of a NN, responsible for receiving the input data.

### Output Layer
The last layer of a NN, responsible for outputting data.

:::warning
Q: If my understanding of the dense layer is right, can't we just use the Dense function with its built-in activation function (if it has one) and return the data ($y$) directly?
- Or is "Output Layer" only a category, so that a dense layer can count as one kind of output layer?
- Or does "data" here not mean the $y$ used to compute the loss, but something that comes after the loss is computed?
- Or is it what outputs the final $n$ weights, obtained after the whole model has finished every iteration, through countless hardships and at the cost of global warming?
:::

### Hidden Layer
aka 隠れ層, 隱含層, or processing layer. In contrast to the input and output layers, it sits between the two and processes the data.

### Convolution Layer
aka 卷積層.
* Ex. Conv1D, Conv2D, Conv3D

:::warning
The core of a CNN, or something like that?
:::

* Resources
    * [哇～ Convolution Neural Network(卷積神經網絡) 這麼特別！](https://medium.com/daai/%E5%93%87-convolution-neural-network-%E5%8D%B7%E7%A9%8D%E7%A5%9E%E7%B6%93%E7%B6%B2%E7%B5%A1-%E9%80%99%E9%BA%BC%E7%89%B9%E5%88%A5-36d02ce8b5fe)
        * Explains the basic concepts
        * Looks fairly detailed at first glance; worth flipping through when stuck

### Recurrent Layer
aka 循環神經層.
* Ex. RNN, LSTM, GRU
* Resources:
    * [Keras - Recurrent layers](https://keras.io/api/layers/recurrent_layers/)

### Max/min Pooling Layer
Application: recognizing how similar symbols are.
* Ex.
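MaxPooling

## Building Blocks of a Model
### Data preprocessing
* Data **reshape**
    Ex.
```python
import numpy as np

# training_set_scaled is assumed to be a scaled 2-D array of shape (1258, 1)
# prepared earlier; it is not defined in this note.
X_train = []  # the 60 days of data before each prediction point
y_train = []  # the prediction point itself
for i in range(60, 1258):  # 1258 is the total size of the training set
    X_train.append(training_set_scaled[i-60:i, 0])
    y_train.append(training_set_scaled[i, 0])
X_train, y_train = np.array(X_train), np.array(y_train)  # convert to numpy arrays so they can be fed to the RNN
```
    - **Reshape**
    `X_train` is currently 2-dimensional, so reshape it to 3 dimensions: `[stock prices, timesteps, indicators]`
```python
X_train = np.reshape(X_train, (X_train.shape[0], X_train.shape[1], 1))
```

:::warning
Q: Why do this?
Q: When reorganizing the data, what should I do if I expected the rows to correspond to one another (the same records, just with different columns) but the rows turn out not to line up?
:::

### ...
### ...

## Choosing an Algorithm
> Refer: [初學者碰上「機器學習」的第一道關卡：我應該使用哪種算法？](https://buzzorange.com/techorange/2017/05/25/which-method-in-ai/)

* Criteria to consider when choosing an algorithm:
    * Accuracy
    * Training time
    * Ease of use

### Categories of algorithms
Also jotting down some keywords along the way.
* **Regression line** v.s. **time series** v.s. ?
    * Time series
        * Seasonal v.s. non-seasonal

:::warning
Q: What is seasonality?
Q: Where do "time series" and "regression line" sit, and at what level, within the overall structure of the machine learning field? (I feel like an ant on a blank sheet of paper)
:::

* **Supervised learning** v.s. **unsupervised learning** v.s. **semi-supervised**
    * Supervised learning
        * SVM classification, SVM regression
        * Random forest
        * Linear regression
        * Comparison of classification algorithms
        * Decision tree
        * Logistic regression
        * ...
    * Unsupervised learning
        * DBSCAN clustering algorithm
        * Clustering, hierarchical clustering
        * Dimensionality reduction
        * Principal component analysis, kernel principal component analysis
        * Locally linear embedding
        * k-means algorithm
        * ...
* **Linear Regression** v.s. **Logistic Regression**
* **Linear SVM** v.s. **Kernel SVM**

:::warning
Q: What is SVM?
> SVM = Support Vector Machine
:::

* **Tree** v.s. **Ensemble Tree**

:::warning
Q: What is ensemble tree?
:::

* **Neural Network** v.s. **Deep Learning**
* **K-means / K-modes, Gaussian mixture model clustering (GMM clustering)**
* **Hierarchical clustering**

### Regression line v.s. time series
* Regression line
    * Idea: ? Picture it as plotting the data points on the $x,y$ plane and finding a straight line that approximates them. Ex. the least squares method.
* Time series
    * Idea: every data point is related to data points other than itself (a functional relationship?).
    * Goal: ??
    * Thoughts while learning:
        * Feels like finding "the regularities of a phenomenon that changes over time".
        * Could I bring in what I learned in "Introduction to Dynamical Systems"? (Damn, my notes are not at hand)
    * Ex.
        * ARIMA

### Limitations of algorithms
* Convex optimization problem

:::warning
?
:::

* Vanishing gradient problem

:::warning
Seems to be a problem that arises during the "Loss Function Optimization (minimization)" step, depending on the choice of activation function?
:::

## Deep Learning
### What is deep learning?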
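> A good explanation (in Japanese): [【深層学習】ディープラーニングとは関数近似器である【ディープラーニングの世界 vol. 1 】](https://www.youtube.com/watch?v=SyWwoMpP_P4&list=PLhDAH9aTfnxKXf__soUoAEOrbLAOnVHCP&index=1)

* Goal: use an algorithm to derive (learn) the parameters of the target function that give the highest goodness score.
* Input: data
* Output: function

Chain of determination:
```mermaid
graph TB;
parameters --> function --> goodness
```

### What does deep learning do?
> Refer: [【深層学習】学習 \- なぜ必要なのか？何をするのか？【ディープラーニングの世界 vol. 2 】](https://www.youtube.com/watch?v=RLlTmbyJORM&list=PLhDAH9aTfnxKXf__soUoAEOrbLAOnVHCP&index=2)

1. **Decide the "reward function"** (or "error function")
    * Goal: decide the function $\Phi(f)$
    * $\Phi$ outputs a "goodness" (or "badness") score for $f$, and our goal is to maximize that goodness. In other words, we want to **maximize (or minimize) the value of $\Phi$**.
    * Ex. (goodness)
        * $\Phi(f_0)=100.\leftarrow$ this one is better!
        * $\Phi(f_1)=50$
2. **Decide the function $f(a_0, a_1, ...,a_n)$ to use**
    * **Which function should serve as the template** that takes in the vector $v$ (meaning data with a large number of entries)?
    * Minimal example: a quadratic in one variable.

    :::warning
    This is the part I don't know! Mathematical intuition says there should be something like "template A suits situation a, template B suits situation b, ...". I need to understand the type of problem in front of me better.
    * If there is no time to understand it, first collect various function templates, install them, and understand them while experimenting.
    > Guess: this step corresponds to **defining the architecture** as described in [5大關鍵步驟！如何構建深度學習模型？](https://read01.com/zh-tw/P5NQ7xM.html):
    >
    > Excerpt from that article:
    >> 1. Computer vision tasks such as image segmentation, image classification, face recognition and similar projects: convolutional neural networks (CNNs, or ConvNets) are preferred.
    >> 2. Natural language processing and text-related data: recurrent neural networks (RNNs), long short-term memory (LSTMs).
    >
    > "Architecture" here = the "function template" above.
    >
    > Deciding criterion: **look at your own needs** and see which of these applies:
    > 1. Sequential Model
    > 2. Functional API
    > 3. User-defined custom architecture
    >
    :::
3. **Search!** (learning)
    * Find "the parameters to fill into the function template"
    * Search for the **parameters** of the **function** that make $\Phi$ output its maximum (or minimum) value.

## NLP
### Resources
* [[NLP] Word Embedding 筆記](https://clay-atlas.com/blog/2019/11/26/nlp-word-embedding-%E7%AD%86%E8%A8%98/)

## CNN = Convolutional Neural Network
aka 卷積神經網路
* Applications
    * Recognizing characters and shapes (using a Convolution Layer whose value distribution is characterized by a straight line) (probably)

### How it works (overview)
* By taking the **inner product** of a small matrix with each region of a large matrix (the image), we obtain the **value-distribution features of local regions** of the large matrix. (Ex. take a 3x3 patch in which the diagonal values are especially large and the other values especially small)
    * Ex. For an image $X\in \mathbb R^{28\times 28}$ (a 28x28 matrix), we can take the "inner product" of a convolution layer $C\in\mathbb R^{3\times3}$ with it at each position, apply an activation function $\phi$ to the result, and arrange the resulting values into another matrix $Z\in \mathbb R^{26\times26}$ (26 = 28 - 3 + 1 when the window slides one step at a time without padding).

> See:
> [【深層学習】畳み込み層の本当の意味、あなたは説明できますか？【ディープラーニングの世界 vol. 5 】](https://www.youtube.com/watch?v=vU-JfZNBdYU&list=PLhDAH9aTfnxKXf__soUoAEOrbLAOnVHCP&index=5)

### VGG
> See: []()

## RNN = Recurrent Neural Network
aka 循環神經網路

### LSTM
LSTM = Long Short-Term Memory

#### Related implementation examples:
* [[Day27] 認識損失函數](https://ithelp.ithome.com.tw/articles/10227349)
* [[實戰系列] 使用 Keras 搭建一個 LSTM 魔法陣（模型）](https://ithelp.ithome.com.tw/articles/10206312)
* [Day 23：銷售量預測 -- LSTM 的另一個應用](https://ithelp.ithome.com.tw/articles/10195400)

#### Theory:
* [人人都能看懂的LSTM](https://zhuanlan.zhihu.com/p/32085405)

#### Other notes:
* [【Python】 ARIMA、sklearn、Keras](/j9hxP9hhTaugp2uCFiLnOg)

### GRU = Gated Recurrent Unit
#### Theory:
* [人人都能看懂的GRU](https://zhuanlan.zhihu.com/p/32481747)

## Jargons
Writing down my own understanding:
* **feature**: the so-called vector $X$, maybe aka 特徵, i.e. the variables that are input to the model.
* **Categorical Data**: data that has been classified using labels. For example: `data['運動番']={'sk8 the infinity', 'Haikyuu'}`, `data['懸疑番']={'The Promised Neverland'}`, `data['戰鬥番']={'Neon Genesis Evangelion'}`
* **Numerical Data**
* Converting Categorical Data to Numerical Data
    * **Integer Encoding / Label Encoding**: encodes labels that were not originally integers (often strings?) as integers.
    * **One-Hot Encoding**
* **Tensor** (張量)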
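
The two entries under "Converting Categorical Data to Numerical Data" can be illustrated with a small sketch. It uses only plain Python, and the names `pets`, `integer_encode` and `one_hot_encode` are hypothetical helpers made up for this note, not functions from any library. Assigning integers in sorted label order is an arbitrary choice made here for reproducibility.

```python
# Illustrative sketch only: `pets`, `integer_encode` and `one_hot_encode`
# are hypothetical names, not part of any library.

def integer_encode(labels):
    """Map each distinct label to an integer, using sorted label order."""
    vocabulary = sorted(set(labels))
    index_of = {label: i for i, label in enumerate(vocabulary)}
    return [index_of[label] for label in labels], vocabulary


def one_hot_encode(codes, num_classes):
    """Turn each integer code into a vector containing a single 1."""
    return [[1 if i == code else 0 for i in range(num_classes)] for code in codes]


pets = ['dog', 'fish', 'fox', 'dog']
codes, vocabulary = integer_encode(pets)
one_hot = one_hot_encode(codes, len(vocabulary))

print(vocabulary)  # ['dog', 'fish', 'fox']
print(codes)       # [0, 1, 2, 0]
print(one_hot)     # [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0]]
```

Integer codes carry an ordering (`fox` > `dog`) that the labels themselves do not have, which is one common reason to prefer one-hot vectors for categories without a natural order.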