Python Machine Learning and Deep Learning #2

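Data Preprocessing

Before feeding data into a model, we usually preprocess it. Machine learning datasets tend to be large, so cleaning up or transforming the data beforehand can improve training results or speed, and it also makes the characteristics of the data easier to see. How to preprocess depends on the characteristics of the data.

Common Preprocessing for Numerical Data

Normalizing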

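Normalizing redefines the numeric range of the data by scaling or shifting it.
- Common target ranges are 0 to 1 or -1 to 1.

Several normalization methods:
- scaling to a range
  - Adjusts the range of the data; the shape of the distribution does not change.
  - $x' = (x - x_{min}) / (x_{max} - x_{min})$
- clipping
  - Caps extreme feature values (outliers).
  - if x > n then x = n
- log scaling
  - Useful when the feature follows a power-law relationship.
  - $x' = \log(x)$
- z-score
  - Use when the feature has no extreme outliers.
  - $x' = (x - μ) / σ$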
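
The four formulas above can be sketched in a few lines of NumPy. This is only an illustration: the sample values in `x` and the clipping cap `n = 10.0` are made up, not taken from any dataset in this note.

```python
import numpy as np

# Hypothetical feature column with one extreme value (120.0).
x = np.array([1.0, 2.0, 4.0, 8.0, 120.0])

# Scaling to a range: maps the minimum to 0 and the maximum to 1.
x_minmax = (x - x.min()) / (x.max() - x.min())

# Clipping: every value above the cap n is replaced by n (n is an assumption).
n = 10.0
x_clipped = np.minimum(x, n)

# Log scaling: compresses values that span several orders of magnitude.
x_log = np.log(x)

# Z-score: subtract the mean, then divide by the standard deviation.
x_z = (x - x.mean()) / x.std()

print(x_minmax)
print(x_clipped)
print(x_log)
print(x_z)
```

Printing the z-score result for this sample shows why that method is suggested only for data without extreme outliers: the single large value dominates the mean and the standard deviation.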

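Bucketing

When you do not want floating-point values in the data, you can use bucketing.
Bucketing splits the value range into intervals and treats every value that falls in an interval as the average of that interval.
- Example: every value between 40 and 50 is treated as 45.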

Buckets with equally spaced boundaries

Every bucket has the same width; this simply compresses the data.
- Example: a teacher groups scores into 90 to 100, 80 to 90, and so on.
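
As a rough sketch of equal-width bucketing, the snippet below assigns each score to a 10-point bucket and replaces it with the bucket midpoint (so 40 to 50 becomes 45). The scores and the bucket width are assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical scores; the 10-point bucket width is an assumption.
scores = np.array([42.0, 47.5, 58.0, 73.2, 89.9, 90.0])
width = 10

# Index of the bucket each score falls into: 40-50 -> 4, 50-60 -> 5, ...
bucket_index = np.floor(scores / width).astype(int)

# Represent every score by the midpoint of its bucket (40-50 -> 45).
bucket_mid = bucket_index * width + width // 2

print(bucket_index)
print(bucket_mid)
```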

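Buckets with quantile boundaries

Bucket widths differ and are chosen from the characteristics of the data or the number of samples; with strict quantile boundaries, each bucket holds roughly the same number of samples.
- Example: ages 0 to 16 are treated as children, 16 to 60 as young adults, and over 60 as elderly.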
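
The age groups above are boundaries picked by hand. If the boundaries are instead computed from the data as quantiles, a minimal sketch could look like this; the `ages` array is invented for illustration, and `np.quantile` with `np.digitize` is just one way to do it.

```python
import numpy as np

# Hypothetical skewed data: most values are small, a few are large.
ages = np.array([1, 2, 3, 4, 5, 6, 7, 8, 20, 40, 60, 80])

# Boundaries at the 25th, 50th and 75th percentiles give 4 buckets.
edges = np.quantile(ages, [0.25, 0.5, 0.75])

# np.digitize returns, for each value, the index of the bucket it falls into.
bucket = np.digitize(ages, edges)

# Count how many samples landed in each of the 4 buckets.
counts = np.bincount(bucket, minlength=4)

print(edges)
print(counts)
```

Each bucket ends up with the same number of samples even though the bucket widths are very different, which is the point of quantile boundaries.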

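Data Transformation

When the data contains non-numeric features, we may need to convert them first so that the model can train effectively.
For example, suppose a flower dataset has a feature column like this: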

ID	Color
…	…
80	Yellow
81	Blue
82	Red

We can first map these strings one-to-one to integers, like this:

ID	Color
…	…
80	1
81	2
82	3

We can then go a step further and express them in one-hot encoding, a format that can also be used to represent output probabilities:

ID	Color
…	…
80	{[1.0],[0.0],[0.0]…}
81	{[0.0],[1.0],[0.0]…}
82	{[0.0],[0.0],[1.0]…}

This turns non-numeric data into numeric data, which is more convenient for training.
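
The two conversion steps can be sketched as follows. The mapping `color_to_id` mirrors the tables above; using `np.eye` to build the one-hot rows is just one convenient option, not the only way.

```python
import numpy as np

# Color column from the example table.
colors = ["Yellow", "Blue", "Red"]

# Step 1: one-to-one mapping from string to integer, as in the second table.
color_to_id = {"Yellow": 1, "Blue": 2, "Red": 3}
ids = np.array([color_to_id[c] for c in colors])

# Step 2: one-hot encoding; row i has a single 1.0 at column ids[i] - 1.
one_hot = np.eye(len(color_to_id))[ids - 1]

print(ids)
print(one_hot)
```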

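Linear Regression

Machine learning tasks can basically be divided into three categories:
1. Classification
2. Regression
3. Clustering

This section introduces linear regression, one of the regression methods.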
[Figure: data points (black dots) and a fitted line (blue)]
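The goal of linear regression is to find a line (the blue line in the figure above) through a set of data points (the black dots) such that every point is as close to the line as possible.
Once the line is found, given a feature X we can read the label y off the line, which gives us a prediction.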

Model Architecture

$$\mathrm{score} = \begin{bmatrix} θ_{0,0} & θ_{1,0} & θ_{2,0} \\ θ_{0,1} & θ_{1,1} & θ_{2,1} \\ θ_{0,2} & θ_{1,2} & θ_{2,2} \end{bmatrix} \begin{bmatrix} X_{0} \\ X_{1} \\ X_{2} \end{bmatrix} + \begin{bmatrix} b_{0} \\ b_{1} \\ b_{2} \end{bmatrix}$$

$θ$ : the weight of each feature.
$X$ : the features.
$b$ : the bias constant.
The expression can also be written compactly as:
$y = h(x; θ) + ϵ$
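
To make the matrix form concrete, here is a small numeric sketch of a weight matrix times a feature vector plus a bias. All numbers are made up for illustration.

```python
import numpy as np

# Hypothetical weights, features and biases, chosen only for illustration.
theta = np.array([[1.0, 0.0, 2.0],
                  [0.5, 1.0, 0.0],
                  [0.0, 0.0, 1.0]])
X = np.array([1.0, 2.0, 3.0])
b = np.array([0.1, 0.2, 0.3])

# score = theta @ X + b: each row of theta produces one output value.
score = theta @ X + b

print(score)
```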

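Training Methods

Finding this line is like solving an optimization problem in engineering mathematics, and there are two ways to solve it.

Analytical

The first approach is to apply the closed-form formula directly:
$θ = (X^{T} X)^{-1} X^{T} y$
It looks perfect because it gives the answer in a single step, but matrix operations are very time-consuming.
In particular, the formula above requires a matrix inverse, whose complexity is $O(n^{3})$, where $n$ is the number of features.
With complexity of this order, the cost blows up as soon as the dimension grows a little, which is why we need the approximation method below.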
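
The closed-form solution can be tried on synthetic data. In this sketch the data-generating weights (2, -3, bias 1), the noise level and the sample count are all assumptions; a column of ones is appended so that the bias is learned as one more weight.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data generated from y = 2*x1 - 3*x2 + 1 plus small noise.
n_samples = 200
features = rng.normal(size=(n_samples, 2))
noise = rng.normal(scale=0.01, size=n_samples)
y = 2.0 * features[:, 0] - 3.0 * features[:, 1] + 1.0 + noise

# Append a column of ones so the bias is learned as the last weight.
X = np.hstack([features, np.ones((n_samples, 1))])

# Closed form: theta = (X^T X)^{-1} X^T y.
theta = np.linalg.inv(X.T @ X) @ X.T @ y

print(theta)
```

The explicit inverse is used here only to mirror the formula. In practice `np.linalg.lstsq(X, y, rcond=None)` is usually the safer choice numerically.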

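Gradient Descent

There are many iterative approximation methods; this section gives a brief introduction.
The idea is to use the slope (gradient) to pick a direction and step toward the target point. If the error is small enough, stop; otherwise keep iterating.
Note that a step size that is too large overshoots the target, while one that is too small takes a long time to converge.
Gradient descent update, with step size $η$ and loss $L$:
$θ_{n + 1} = θ_{n} - η ∇L(θ_{n})$
The closely related Newton's method, which iterates toward a root of $f$ and has no separate step size:
$X_{n + 1} = X_{n} - \frac{f(X_{n})}{f'(X_{n})}$
※ Supplementary note: a commonly used error measure (loss design):
$\frac{1}{2} \sum_{i = 1}^{n} (θ^{T} x_{i} - y_{i})^{2}$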
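
Below is a minimal gradient descent sketch that minimizes the squared-error loss above. The synthetic data, `step_size`, `tolerance` and the iteration cap are assumptions picked so that this small example converges; they are not recommended defaults.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data from y = 2*x1 - 3*x2 + 1 (no noise, for clarity).
features = rng.normal(size=(200, 2))
X = np.hstack([features, np.ones((200, 1))])
y = X @ np.array([2.0, -3.0, 1.0])

def loss(theta):
    # 1/2 * sum_i (theta^T x_i - y_i)^2, the squared-error loss.
    return 0.5 * np.sum((X @ theta - y) ** 2)

theta = np.zeros(3)
step_size = 0.001   # assumed value; too large diverges, too small is slow
tolerance = 1e-10   # assumed stopping threshold on the loss

for iteration in range(10_000):
    if loss(theta) < tolerance:
        break  # the error is small enough, so stop
    gradient = X.T @ (X @ theta - y)  # gradient of the loss w.r.t. theta
    theta = theta - step_size * gradient

print(iteration, theta)
```

Changing `step_size` by a factor of ten in either direction is a quick way to see the overshoot and slow-convergence behavior described above.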

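Pros and Cons

Pros

Fast to train
Fast to run
Only the trained model needs to be stored; the training data does not have to be kept, so little storage is needed

Cons

Training samples must be independent and must not influence each other
A linear model can only express so much

To illustrate the last point, here is a figure.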
[Figure: data points that follow a sine-like curve]
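If the data looks like this, we can see that a sine-like shape should describe it, but if you model it with linear regression (LR), no straight line can represent every part of it well, no matter how you place it.
So what can we do?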

Changing the Model: Raising the Dimension

$$\mathrm{score} = \begin{bmatrix} θ_{0,0}^{0} & θ_{1,0}^{1} & θ_{2,0}^{2} \\ θ_{0,1}^{0} & θ_{1,1}^{1} & θ_{2,1}^{2} \\ θ_{0,2}^{0} & θ_{1,2}^{1} & θ_{2,2}^{2} \end{bmatrix} \begin{bmatrix} X_{0} \\ X_{1} \\ X_{2} \end{bmatrix} + \begin{bmatrix} b_{0} \\ b_{1} \\ b_{2} \end{bmatrix}$$

As shown above, a small change to the weight part is enough to turn the model into a higher-dimensional one.
The finer details will be discussed later.
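
One common way to read "raising the dimension" is polynomial regression, where powers of the input are added as extra terms. That reading is an assumption here, since the details are deferred to a later note. The sketch below fits sine-shaped data with a straight line and with a cubic and compares the training error; the data is synthetic.

```python
import numpy as np

# Hypothetical data sampled from one period of a sine curve.
x = np.linspace(0.0, 2.0 * np.pi, 50)
y = np.sin(x)

def fit_mse(degree):
    # Least-squares polynomial fit of the given degree, then training MSE.
    coefficients = np.polyfit(x, y, degree)
    prediction = np.polyval(coefficients, x)
    return np.mean((prediction - y) ** 2)

mse_line = fit_mse(1)    # straight line
mse_cubic = fit_mse(3)   # adds x^2 and x^3 terms

print(mse_line, mse_cubic)
```

The cubic model is still linear in its weights, so it can be trained with the same methods as before.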
For now, one question to keep in mind: is a higher dimension always better?

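First, the higher the dimension, the slower the training; that should be uncontroversial.
Second, although a higher dimension allows a more complex curve that fits the data points more closely, take a look at the figure below.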
[Figure: three fits of increasing model complexity to the same data points]

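Image source: https://scikit-learn.org/stable/auto_examples/model_selection/plot_underfitting_overfitting.html

When the dimension grows as large as in the third panel, the curve passes almost perfectly through every point, but it is badly distorted everywhere else, which causes problems for later predictions.
This phenomenon is called overfitting. It may be explained in more detail later; for now, just remember that a higher dimension is definitely not always better.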
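
To see this effect in numbers, a rough experiment is to fit polynomials of increasing degree to a few noisy points and compare the error on the training points with the error on a dense grid of the true curve. Everything in this sketch is an assumption: the cosine ground truth, the noise level, the sample count and the degrees 1, 4 and 15.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

def true_curve(x):
    # Hypothetical ground truth used only for this experiment.
    return np.cos(1.5 * np.pi * x)

# 30 noisy training points and a dense noise-free grid for evaluation.
x_train = np.sort(rng.uniform(0.0, 1.0, 30))
y_train = true_curve(x_train) + rng.normal(scale=0.1, size=30)
x_eval = np.linspace(0.0, 1.0, 200)
y_eval = true_curve(x_eval)

train_mse = {}
eval_mse = {}
for degree in (1, 4, 15):
    # Least-squares polynomial fit of the given degree.
    model = Polynomial.fit(x_train, y_train, degree)
    train_mse[degree] = np.mean((model(x_train) - y_train) ** 2)
    eval_mse[degree] = np.mean((model(x_eval) - y_eval) ** 2)
    print(degree, train_mse[degree], eval_mse[degree])
```

The training error can only go down as the degree grows. The error on the evaluation grid is the number to watch: when it climbs while the training error keeps shrinking, the model is overfitting.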

Linear Regression Hands-on Practice

Code

from sklearn import linear_model
from sklearn.datasets import load_iris
from sklearn.metrics import mean_squared_error

# Load the iris dataset: 150 samples with 4 features each.
iris = load_iris()
X = iris['data']
y = iris['target']
features = iris.feature_names

# Show the data shape and the first three samples.
print("---Dataset Data--- \n")
print("Data Shape : ", X.shape)
for index, f in enumerate(X[:3]):
    print(index + 1, "-> ", f)

# List the feature names.
print("\n---Dataset Content---\n")
for index, f in enumerate(features):
    print(index + 1, "). ", f)

# Use the first 90% of the samples for training.
train_num = int(X.shape[0] * 0.9)

x_train = X[:train_num]
y_train = y[:train_num]

# Note: starting at train_num + 1 skips the sample at index train_num,
# so the test set has 14 rows; slice from train_num to keep all 15.
x_test = X[train_num + 1:]
y_test = y[train_num + 1:]

print("Training Data Shape : ", x_train.shape)
print("Training Label Shape : ", y_train.shape)
print("Testing Data Shape : ", x_test.shape)
print("Testing Label Shape : ", y_test.shape)

# Fit ordinary least squares on the training split.
reg = linear_model.LinearRegression()
reg.fit(x_train, y_train)
print('Coefficients: \n', reg.coef_)
# mean_squared_error expects (y_true, y_pred).
print("Mean squared error: %.2f"
      % mean_squared_error(y_test, reg.predict(x_test)))

Output

---Dataset Data--- 

Data Shape :  (150, 4)
1 ->  [5.1 3.5 1.4 0.2]
2 ->  [4.9 3.  1.4 0.2]
3 ->  [4.7 3.2 1.3 0.2]

---Dataset Content---

1 ).  sepal length (cm)
2 ).  sepal width (cm)
3 ).  petal length (cm)
4 ).  petal width (cm)

Training Data Shape :  (135, 4)
Training Label Shape :  (135,)
Testing Data Shape :  (14, 4)
Testing Label Shape :  (14,)
Coefficients: 
 [-0.10962032 -0.03634593  0.2507714   0.53522511]
Mean squared error: 0.06

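Since I have been a bit busy lately, I just tweaked last week's code and reused it.
In principle, once I have time, this part will be revised and improved, including last week's part.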

tags: ML/DL note python
