<style>
body > #doc { padding-top: 0px !important; }
body > .ui-infobar, body > .ui-toc, body > .ui-affix-toc { display: none !important; }
</style>

# Machine Learning FAQ

# **2/26 Lecture Q&A**

## 1.
```
Q: Do different parameters have to use the same learning rate?
```
No, different parameters can use different learning rates. Later lectures will cover Adagrad, Adam, and other methods that give each parameter its own learning rate.
>by 李宏毅 (Instructor)

## 2.
```
Q: Is there a designated textbook for this course?
```
For deep learning, my personal top recommendation is [https://d2l.ai/index.html](https://d2l.ai/index.html). It is an interactive online textbook with very clear explanations, and most concepts come with example code, so it is well suited for beginners.
>by 李宏毅 (Instructor)

## 3.
```
Q: Why isn't the left side of the sigmoid 0?
```
Our sigmoids are all 0 on the left.

![](https://i.imgur.com/eIpPNy3.png)

If you are referring to the figure above, the blue curve still approaches 0 as it extends to the left, i.e., when x is very small.
>by 李宏毅 (Instructor)

## 4.
```
Q: We just split the data into many batches, and the last batch is the one used for the final update. Does that make the model fit the last batch better?
```
Yes, the network may remember only the batch it has just seen and forget the data it saw earlier. In practice, however, training runs for many epochs, and at the start of each epoch the batches are re-created at random (this is called shuffling), so the data grouped into the same batch differs from epoch to epoch. As a result, forgetting is not a serious problem when training on a single task (it does become serious across tasks, which will be discussed later). For more interesting observations on forgetting during learning, see [https://arxiv.org/abs/1812.05159](https://arxiv.org/abs/1812.05159).
>by 李宏毅 (Instructor)

## 5.
```
Q: Does the loss have to be differentiable? Can it be discrete?
```
![](https://i.imgur.com/CIUaNHH.png)

Both losses L above are non-differentiable at x = 0. The one on the left can still be optimized with gradient descent, but the one on the right has a derivative of 0 everywhere except x = 0, so the gradient provides no information about how to update x, and gradient descent cannot be used to minimize it.
>by 李宏毅 (Instructor)
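As a minimal sketch of this point (plain Python; the 1-D parameter `w`, the learning rate, and the function names are made up for illustration), gradient descent on L(w) = |w| keeps moving `w` toward the minimum because the slope is ±1 away from 0, while on a step-shaped loss the slope is 0 almost everywhere, so the update never changes `w`:

```python
# Gradient descent on two 1-D losses like those in the figure (illustrative sketch).

def slope_abs(w):
    # L(w) = |w|: slope is -1 for w < 0 and +1 for w > 0 (undefined only at w = 0)
    return -1.0 if w < 0 else 1.0

def slope_step(w):
    # Step-shaped loss: flat everywhere except at the jump, so the slope is 0
    return 0.0

def descend(slope_fn, w0=3.0, lr=0.5, steps=10):
    w = w0
    for _ in range(steps):
        w = w - lr * slope_fn(w)   # w_{t+1} = w_t - lr * slope
    return w

print(descend(slope_abs))   # prints 0.0: the updates keep moving w toward the minimum of |w|
print(descend(slope_step))  # prints 3.0: a zero gradient never moves w at all
```

The left-hand case reaches the minimum even though the derivative is undefined at a single point, while the right-hand case never moves because the slope carries no useful information.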
## 6.
```
Q: As long as the number of ReLUs is larger than the number of data points, shouldn't the loss be 0?
```
When you have a very large number of ReLUs, there does indeed exist a set of parameters that makes the loss 0. However, given the many difficulties of optimization (local minima, saddle points, etc., which will be covered later), in practice it is very hard to actually drive the loss to 0 during training.
>by 李宏毅 (Instructor)

## 7.
```
Q: Why does stacking many layers of sigmoid/ReLU (going deep) make predictions more accurate?
```
Each layer of a neural network aims to approximate some function, but a single layer can only approximate the target function to a limited extent. If we stack more layers, each providing a nonlinear transformation, we have a better chance of approximating a more complicated function.
>by 張恆瑞 (TA)

---

The TA has already answered this clearly; I would only add one point: when approximating a complex function, increasing the depth of the network is more efficient than increasing its width (this will be explained later in class).
>by 李宏毅 (Instructor)

```
Q: Is there any theory showing that stacking more layers gives a smaller error, or any result on the convergence rate?
```
Yes. On the fact that deep neural networks can approximate a target function more effectively than shallow ones, you can refer to the videos below. (They only discuss achieving a smaller error; they do not discuss the convergence rate during training.)

links:
[https://youtu.be/FN8jclCrqY0](https://youtu.be/FN8jclCrqY0)
[https://youtu.be/qpuLxXrHQB4](https://youtu.be/qpuLxXrHQB4)

## 8.
```
Q: When tuning a model, can I assume that the more flexible the better?
```
I'm afraid not. Appropriately constraining a model's flexibility is sometimes a way of giving it information. Suppose you already know that the input and output of what you want to train are related in a nearly linear way; in that case, constraining the model's flexibility actually gives better performance. Your training data is only a subset of the real-world data distribution, so if the model is too flexible, it may end up overfitting the training data.
>by 施貽仁 (TA)

## 9.
```
Q: Can the dimensions of a DNN's intermediate layers fluctuate (grow and shrink)? How do you usually decide whether to make them larger or smaller? Is it just trial & error?
```
The layer dimensions can indeed grow and shrink, but how many dimensions each layer should have is a hyperparameter that needs to be tuned. There is still no definitive answer for how best to arrange them; different arrangements have been fashionable at different times. In the first few years after deep learning took off (around 2012), pyramid-shaped networks were popular, where the dimension shrinks layer by layer; in recent years, "high-rise" networks, where every layer has the same dimension, have been more common.
>by 李宏毅 (Instructor)

## 10.
```
Q: Why does the formula here (w1 = w0 - learning rate * slope) use subtraction?
```
![](https://i.imgur.com/XUDncSR.png)

---

If the slope is negative (high on the left, low on the right), we should move right; if the slope is positive (low on the left, high on the right), we should move left.
>by 王韋翰 (student)

---

If a real-valued function F(x) is defined and differentiable at a point a, then F(x) decreases fastest when moving from a in the direction opposite to the gradient, -∇F(a).
The above is taken from Wikipedia: [https://zh.wikipedia.org/wiki/%E6%A2%AF%E5%BA%A6%E4%B8%8B%E9%99%8D%E6%B3%95](https://zh.wikipedia.org/wiki/%E6%A2%AF%E5%BA%A6%E4%B8%8B%E9%99%8D%E6%B3%95)
That is why we subtract: it makes it easier to find a local minimum.
>by 黃冠博 (TA)

## 11.
```
Q: Why does the formula here (w1 = w0 - learning rate * slope) use subtraction?
```
![](https://i.imgur.com/fstTp6X.jpg)

Following the logic of the earlier derivation, I would arrive at the green equation in the figure, but the teacher's slides show the purple one. How can we show that these two are the same, or at least equivalent for training?

---

Each equation has its own mathematical meaning. In the teacher's equation, W is a matrix that maps a j-dimensional vector to an i-dimensional one; in the equation you derived, W maps each x_j to i dimensions. Viewed this way, they are not mathematically equivalent. Optimization searches for the best parameters of a given mathematical expression, and when the expressions mean different things, the best parameters differ as well. So the two equations are not equivalent for training. As for which model works better, you can run experiments to find out.
>by 李威緒 (TA)

---

> [name=b06502162Lu] The following questions were asked by students in the lecture video.

---

## 12.
```
Q: Does the 1 in the green box here mean 1 times b_i?
```
![](https://i.imgur.com/49Yazsj.png)

---

It should mean that the input is 1, which is then multiplied by b_i, because that arrow represents a parameter, just like the input x1 on the right is multiplied by w11 and added to r1.
>by 張恆瑞 (TA)

# **3/5 Lecture Q&A**