# Deep Learning
## Multi-layer Perceptron
### Limitation: XOR
**Feature selection** is a method to reduce the number of input variables by using certain criteria to select the variables that are most useful for the model to predict the target.
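A minimal sketch of one such criterion, absolute correlation with the target (the data and the choice of criterion here are illustrative assumptions, not from the notes):

```python
import numpy as np

# Sketch: score each feature by |correlation with the target| and keep the top k.
# (Correlation is one possible criterion; mutual information etc. also work.)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))                # 100 samples, 10 candidate features
y = 3 * X[:, 2] - 2 * X[:, 7] + 0.1 * rng.normal(size=100)

scores = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
k = 2
selected = np.sort(np.argsort(scores)[-k:])   # indices of the k most useful features
print(selected)                               # expect [2 7]
```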
## Deep Learning
Each layer computes an affine transformation followed by a non-linearity $\sigma$:
$$\textbf{y} = \sigma(\textbf{Ax}+\textbf{b})$$
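A minimal NumPy sketch of one such layer (the sizes and the choice of $\sigma = \tanh$ are assumptions for illustration):

```python
import numpy as np

def layer(x, A, b, sigma=np.tanh):
    # y = sigma(Ax + b): affine map followed by a non-linear activation.
    return sigma(A @ x + b)

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 3))   # assumed sizes: 3 inputs -> 4 outputs
b = np.zeros(4)
x = rng.normal(size=3)

h = layer(x, A, b)            # one layer; a deep network stacks several
y = layer(h, rng.normal(size=(2, 4)), np.zeros(2))
print(y.shape)                # (2,)
```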
### Why non-linearity? Why deep?
Non-linearity and depth let the network model more complex functions: without a non-linear $\sigma$, a stack of linear layers collapses into a single linear map.
Example: the XOR function is not implementable by a single perceptron, since its classes are not linearly separable.
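To see why depth helps, here is a sketch of a two-layer perceptron that computes XOR with hand-set weights (the specific weights are an illustrative assumption; any equivalent choice works):

```python
import numpy as np

def step(z):
    return (z > 0).astype(float)  # perceptron activation

# Hidden layer: h1 = OR(x1, x2), h2 = AND(x1, x2); output = OR AND (NOT AND) = XOR.
W1 = np.array([[1.0, 1.0],   # h1 fires if x1 + x2 > 0.5  (OR)
               [1.0, 1.0]])  # h2 fires if x1 + x2 > 1.5  (AND)
b1 = np.array([-0.5, -1.5])
W2 = np.array([1.0, -1.0])   # y fires if h1 - h2 > 0.5
b2 = -0.5

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    h = step(W1 @ np.array(x) + b1)
    y = step(W2 @ h + b2)
    print(x, int(y))         # 0, 1, 1, 0 -> XOR
```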
## Recurrent Neural Network (RNN)
Used when the labels have a sequential relationship (e.g., in audio, earlier and later frames are related).
Idea: maintain a hidden state $\textbf{h}_t$ such that
$$\textbf{y}_t, \textbf{h}_t = \sigma(\textbf{Ax}_t+\textbf{Bh}_{t-1})$$
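A minimal NumPy sketch of this recurrence (sizes, $\sigma = \tanh$, and random weights are assumptions; here $\textbf{y}_t$ is taken to be $\textbf{h}_t$ itself, as in the formula above):

```python
import numpy as np

def rnn_step(x_t, h_prev, A, B):
    # h_t = sigma(A x_t + B h_{t-1}); the output y_t is h_t itself.
    return np.tanh(A @ x_t + B @ h_prev)

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 3))           # assumed: input dim 3, hidden dim 4
B = rng.normal(size=(4, 4))

h = np.zeros(4)                       # initial hidden state h_0
for x_t in rng.normal(size=(5, 3)):   # a sequence of 5 input vectors
    h = rnn_step(x_t, h, A, B)        # the same weights are reused at every step
print(h.shape)                        # (4,)
```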
## Convolutional Neural Network (CNN)
Used for image recognition.
A feature (receptive field) contains a "pattern" that may appear in "any place" in the picture -> no need to have a fully connected layer; the same filter weights can be shared across positions!
Padding: fill in values where the filter extends beyond the image boundary.
Zero padding: pad with 0.
Pooling: keeping only certain fixed rows/columns does not change the character of the picture -> reduces computation, but accuracy may drop.
Max pooling: taking the maximum value within each window is called max pooling.
Convolutional layer -> feature map
Flatten: stretch the matrix that comes out of the convolutional layers into a vector.
Finally, feed it into a fully-connected network; a sketch of the whole pipeline follows below.
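A NumPy sketch of the pipeline above (the kernel, sizes, and ReLU activation are illustrative assumptions): zero padding, a convolutional layer producing a feature map, max pooling, flatten, then a fully-connected layer.

```python
import numpy as np

def conv2d(image, kernel):
    # Slide one filter over the image; the same weights detect the
    # same "pattern" at any place in the picture.
    H, W = image.shape
    kH, kW = kernel.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kH, j:j + kW] * kernel)
    return out

def max_pool2d(fmap, size=2):
    # Keep the maximum of each size x size window: fewer values to
    # compute on afterwards, possibly at some cost in accuracy.
    H, W = (d - d % size for d in fmap.shape)
    return fmap[:H, :W].reshape(H // size, size, W // size, size).max(axis=(1, 3))

rng = np.random.default_rng(0)
image = np.pad(rng.random((8, 8)), 1)             # zero padding: pad with 0
kernel = rng.normal(size=(3, 3))                  # one assumed 3x3 filter
fmap = np.maximum(conv2d(image, kernel), 0)       # convolution + ReLU -> feature map
flat = max_pool2d(fmap).flatten()                 # pool, then flatten into a vector
logits = rng.normal(size=(10, flat.size)) @ flat  # final fully-connected layer
print(fmap.shape, flat.shape, logits.shape)       # (8, 8) (16,) (10,)
```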
A CNN is not invariant to scaling & rotation: rotating, enlarging, or shrinking an image may make the pattern hard for the CNN to detect.
## Self-Attention
How to handle an input of variable size (a sequence of vectors)?
Attention: map a query and a set of key-value pairs to an output.
$Q$ and $K$ are multiplied to generate the attention score $\alpha$, which measures the correlation between the query and the key; a normalization layer (e.g. softmax) is then applied to get the normalized score $\alpha'$. The normalized scores are then used to weight the values to get the output. In short,
$$\text{Attention}(Q,K,V) = \sigma(QK^T)V$$
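A NumPy sketch of this formula with softmax as the normalization $\sigma$ (the $1/\sqrt{d_k}$ scaling and the random projections are assumptions following common practice, not part of the formula above):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))  # subtract max for stability
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    alpha = Q @ K.T / np.sqrt(K.shape[-1])  # attention scores: query-key correlation
    return softmax(alpha) @ V               # normalized alpha' weights the values

rng = np.random.default_rng(0)
n, d = 5, 8                                 # assumed: 5 tokens of dimension 8
X = rng.normal(size=(n, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

# Self-attention: Q, K, V are all projections of the same input sequence X.
out = attention(X @ Wq, X @ Wk, X @ Wv)
print(out.shape)                            # (5, 8)
```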
Self-attention is a building block for sequence modeling.
###### tags: `machine learning`