tags: `machine learning`|`python`

Machine Learning - Data Preprocessing

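Press Ctrl + I to look up information about a method or class.


dataset.info()  # column dtypes, non-null counts (which reveal missing data), and similar details

Good habit:

After running each line of code, check that the variables hold the expected values.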

Get the dataset

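Ways to obtain a dataset

From a project
The Kaggle platform

First step after obtaining a dataset: inspect and analyze its columns and features

Analyzing the dataset

The same table, with the columns Country, Age, Salary, and Purchased, is used as the example throughout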

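Country –> only France, Spain, or Germany; no other country appears. This is a categorical column

Age, Salary –> continuous numeric values

Purchased –> whether the customer purchased, expressed as a boolean

This table can be used to predict whether a customer will purchase the product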
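As a rough illustration of this first inspection step, the sketch below builds a tiny stand-in for the example table and checks its columns. The rows are made up, and storing Purchased as Yes/No strings is an assumption; only the column names come from the text above.

```python
import numpy as np
import pandas as pd

# Hypothetical rows shaped like the example table; the values are made up.
dataset = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain"],
    "Age": [44.0, 27.0, np.nan, 38.0],
    "Salary": [72000.0, 48000.0, 54000.0, np.nan],
    "Purchased": ["No", "Yes", "No", "No"],  # assumed Yes/No encoding
})

dataset.info()                             # dtypes and non-null counts per column
missing_per_column = dataset.isna().sum()  # number of missing values per column
print(missing_per_column)
print(dataset["Country"].unique())         # distinct categories in Country
```

Columns with only a few distinct values (Country) are candidates for categorical encoding, while columns that report missing values (Age, Salary) need the missing-data handling described later.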

Independent and dependent variables

$f(x) = y$

Independent variable:
$x$ (input) –> the features Country, Age, and Salary
Dependent variable:
$y$ (output) –> the result Purchased

Importing the Libraries

Libraries needed



import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

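The name after `as` is an alias

numpy –> provides matrix and other mathematical operations

matplotlib.pyplot –> provides plotting functions (bar charts, line charts, and so on)

pandas –> provides functions for reading datasets, usually CSV files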

Importing the Dataset

Loading the dataset

Select the file path of the dataset
Read it with the read_csv function in pandas


dataset = pd.read_csv('Data.csv')

In Python, a string may be written with either single or double quotes

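Extracting the independent and dependent variables

Use the iloc indexer in pandas to pull out the values needed for the independent and dependent variables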


x = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 3].values

iloc: i as integer, loc as location
`:` selects all rows or all columns
`:-1` selects all rows or columns except the last one
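A small sketch of these slices on a made-up three-row table (the values are hypothetical; only the column layout follows the example). It uses `-1` for the last column, which is the same column as index 3 in a four-column table.

```python
import pandas as pd

# Hypothetical rows with the same column layout as the example table.
dataset = pd.DataFrame({
    "Country": ["France", "Spain", "Germany"],
    "Age": [44, 27, 30],
    "Salary": [72000, 48000, 54000],
    "Purchased": ["No", "Yes", "No"],
})

x = dataset.iloc[:, :-1].values  # all rows, every column except the last
y = dataset.iloc[:, -1].values   # all rows, only the last column (index 3 here)

print(x.shape)  # (3, 3)
print(y)        # ['No' 'Yes' 'No']
```

Writing `-1` instead of `3` keeps the selection correct if feature columns are added later, as long as the target stays in the last column.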

Missing Data

Handling missing data, option 1

Delete the data

Drawbacks:

The cost invested in collecting the data is lost

If 2 of 10 records have missing values, deleting them throws away 20% of the data
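The 20% figure can be reproduced with a quick sketch. The ten records below are made up, and `dropna` is one common way to delete incomplete rows in pandas; the note itself does not name a specific call.

```python
import numpy as np
import pandas as pd

# Ten made-up records; two of them contain a missing value.
ages = [44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37]
salaries = [72000, 48000, 54000, 61000, np.nan, 58000, 52000, 79000, 83000, 67000]
dataset = pd.DataFrame({"Age": ages, "Salary": salaries})

cleaned = dataset.dropna()                       # drop every row that contains NaN
lost_fraction = 1 - len(cleaned) / len(dataset)  # share of rows that were discarded

print(len(dataset), len(cleaned))                # 10 8
print(f"{lost_fraction:.0%} of the rows were discarded")
```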

Handling missing data, option 2


from sklearn.impute import SimpleImputer

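Use scikit-learn (sklearn)
It provides tools for data analysis and data processing

Use the impute module in sklearn
It provides imputation (estimation) tools

Use the SimpleImputer class in sklearn's impute module
It handles missing data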


imputer = SimpleImputer(missing_values=np.nan, strategy="mean", fill_value=None)

Mean (mean) or median (median) –> continuous numeric columns

Most frequent value (most_frequent) –> categorical columns

Constant (constant) –> fill_value must give the value to use
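To compare the strategies side by side, here is a minimal sketch on two made-up columns. The numbers and country names are hypothetical, and the fill value `0.0` is chosen only for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up columns, each with one missing entry in the third row.
ages = np.array([[20.0], [30.0], [np.nan], [70.0]])
countries = np.array([["Spain"], ["France"], [np.nan], ["Spain"]], dtype=object)

mean_filled = SimpleImputer(strategy="mean").fit_transform(ages)
median_filled = SimpleImputer(strategy="median").fit_transform(ages)
frequent_filled = SimpleImputer(strategy="most_frequent").fit_transform(countries)
constant_filled = SimpleImputer(strategy="constant", fill_value=0.0).fit_transform(ages)

print(mean_filled[2, 0])      # 40.0, the mean of 20, 30, 70
print(median_filled[2, 0])    # 30.0, the median of 20, 30, 70
print(frequent_filled[2, 0])  # Spain, the most frequent category
print(constant_filled[2, 0])  # 0.0, the supplied fill_value
```

The mean and median differ here because 70 pulls the mean upward, which is one reason the median is sometimes preferred for skewed columns.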


imputer = imputer.fit(x[:, 1:3])

Call imputer.fit() to fit the imputer, restricted to the range that has missing data
x[:, 1:3] –> x[all rows, columns 1 and 2]
1:3 covers indices 1 up to, but not including, 3 (the 2nd and 3rd columns)


x[:, 1:3] = imputer.transform(x[:, 1:3])

Call imputer.transform() to transform the data, restricted to the same range that was fitted
x[:, 1:3] –> x[all rows, columns 1 and 2]

Assign the transformed result back into the original data
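The fit and transform steps can be traced on a small made-up array. This is a sketch with hypothetical values, shaped like the `x` extracted earlier (Country, Age, Salary). After `fit`, the learned column means are available in `statistics_`.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up feature array: column 0 is Country, columns 1 and 2 are Age and Salary.
x = np.array([
    ["France", 44.0, 72000.0],
    ["Spain", np.nan, 48000.0],
    ["Germany", 30.0, np.nan],
    ["Spain", 38.0, 60000.0],
], dtype=object)

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
imputer.fit(x[:, 1:3])                    # learn the mean of columns 1 and 2
print(imputer.statistics_)                # the learned means, one per column

x[:, 1:3] = imputer.transform(x[:, 1:3])  # fill the gaps and write the result back
print(x)
```

Column 0 is left out of the slice because a mean cannot be computed for country names.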

Categorical Data

Handling categorical columns

Label encoding (0, 1, 2, ...) to represent the categories


from sklearn.preprocessing import LabelEncoder

Use the preprocessing module in sklearn
It provides preprocessing tools

Use the LabelEncoder class in sklearn's preprocessing module
It performs label encoding


labelencorder_x = LabelEncoder()

Create a LabelEncoder object
It label-encodes Country (Country is an input –> $x$)


x[:, 0] = labelencorder_x.fit_transform(x[:, 0])

Use fit_transform() of LabelEncoder to fit and transform the data, restricted to the given range
x[:, 0] –> x[all rows, column 0]

Assign the transformed result back into the original data


labelencorder_y = LabelEncoder()
y = labelencorder_y.fit_transform(y)

The dependent variable is treated as a class label, so it does not need dummy encoding; LabelEncoder alone is enough

Use fit_transform() of LabelEncoder to fit and transform the data
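A short sketch of what LabelEncoder produces, using made-up country values. The integer assigned to each category follows the sorted order of the category names, and `inverse_transform` maps the codes back.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Made-up categorical column.
countries = np.array(["France", "Spain", "Germany", "Spain"], dtype=object)

encoder = LabelEncoder()
codes = encoder.fit_transform(countries)     # learn the categories, then encode them

print(encoder.classes_)                      # ['France' 'Germany' 'Spain']
print(codes)                                 # [0 2 1 2]

restored = encoder.inverse_transform(codes)  # map the integers back to names
print(restored)
```

The codes 0, 1, 2 are only identifiers, yet a model may read them as ordered quantities. That is harmless for a two-class target such as Purchased, but it can mislead a model when applied to an input feature.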

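Dummy encoding

Label-encoded values carry an implied order that the categories do not actually have. Dummy encoding avoids this: Country has three categories in this example, so it is expanded into three columns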


from sklearn.preprocessing import OneHotEncoder

Use the OneHotEncoder class in sklearn's preprocessing module
It performs dummy (one-hot) encoding


from sklearn.compose import ColumnTransformer

Use the compose module in sklearn
It provides tools for combining transformers, such as applying one to selected columns only


ct = ColumnTransformer([("Country", OneHotEncoder(), [0])] , remainder="passthrough")

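The argument specifies which columns get dummy encoding
[0] –> the column at index 0 –> Country

remainder: how to handle the columns that were not specified
remainder = "passthrough" –> the remaining columns are passed through without being transformed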


X = ct.fit_transform(x)

Use fit_transform() of ColumnTransformer to fit and transform the data; the result is returned as an array
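To see the three dummy columns concretely, here is a sketch on made-up rows. Two details are additions for this illustration rather than part of the note's code: `sparse_threshold=0` asks ColumnTransformer to always return a dense array, and the result is cast to float for readable printing.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Made-up feature rows: Country, Age, Salary.
x = np.array([
    ["France", 44.0, 72000.0],
    ["Spain", 27.0, 48000.0],
    ["Germany", 30.0, 54000.0],
], dtype=object)

ct = ColumnTransformer(
    [("Country", OneHotEncoder(), [0])],  # one-hot encode column 0 only
    remainder="passthrough",              # keep Age and Salary unchanged
    sparse_threshold=0,                   # assumption: force a dense result
)
X = np.asarray(ct.fit_transform(x), dtype=float)

print(X.shape)  # (3, 5)
print(X)
```

The one-hot columns come first, ordered by sorted category name (France, Germany, Spain), followed by the passed-through Age and Salary. The first row is therefore 1, 0, 0, 44, 72000.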

Splitting the Dataset into the Training set and Test set

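Splitting into a training set and a test set

Typically 70-80% of the data goes to the training set and 20-30% to the test set
The training set is used to train the model; the test set is used to measure how well it performs (accuracy, error rate, and so on)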


from sklearn.model_selection import train_test_split

Use the train_test_split function in sklearn's model_selection module
It splits the data into a training set and a test set


x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

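Given one array, train_test_split returns two arrays: the training set and the test set, in that order. Given two arrays (here X and y), it returns four

test_size = 0.2 –> the test set is 20% of the data

random_state = 0 –> with a fixed value, every run produces the same split; a different value produces a different split; when the parameter is not set, samples are chosen at random, so the result differs from run to run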
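The effect of test_size and random_state can be checked with a small sketch on made-up arrays of ten samples.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten made-up samples with two features each, plus a binary target.
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])

x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
# Splitting again with the same random_state reproduces the same partition.
x_train_again, x_test_again, _, _ = train_test_split(
    X, y, test_size=0.2, random_state=0
)

print(x_train.shape, x_test.shape)           # (8, 2) (2, 2)
print(np.array_equal(x_test, x_test_again))  # True
```

Fixing random_state is mainly useful for making results repeatable between runs or between people working on the same exercise.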

Feature Scaling

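Feature scaling

In this example, Age and Salary have very different ranges (Age falls roughly within 0-100, while Salary starts in the tens of thousands). If the equation is $Age + Salary$, Age becomes small enough to be ignored

Drawback: the feature is effectively discarded –> the cost of collecting it is lost

Whatever the ranges are, feature scaling is applied so that the model weighs the features more objectively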

Standardization and normalization

Standardization

$$x_{stand} = \frac{x - \mathrm{mean}(x)}{\mathrm{StandardDeviation}(x)}$$

$\mathrm{StandardDeviation}(x)$: standard deviation

$\mathrm{mean}(x)$: mean

Normalization

$$x_{norm} = \frac{x - \min(x)}{\max(x) - \min(x)}$$

The values fall between 0 and 1
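Both formulas can be worked through by hand with NumPy. The sketch below uses four made-up ages. Note that `np.std` defaults to the population standard deviation, which is also what scikit-learn's StandardScaler uses.

```python
import numpy as np

# Made-up ages.
age = np.array([20.0, 30.0, 40.0, 50.0])

# Standardization: subtract the mean, divide by the standard deviation.
age_stand = (age - age.mean()) / age.std()

# Normalization: shift by the minimum, divide by the range.
age_norm = (age - age.min()) / (age.max() - age.min())

print(age_stand)  # centered on 0 with unit standard deviation
print(age_norm)   # values between 0 and 1
```

After standardization the column has mean 0 and standard deviation 1; after normalization the smallest value maps to 0 and the largest to 1.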


from sklearn.preprocessing import StandardScaler

Use the StandardScaler class in sklearn's preprocessing module
It performs standardization



sc_x = StandardScaler()
x_train = sc_x.fit_transform(x_train)
x_test = sc_x.transform(x_test)

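Create a StandardScaler() object to standardize the independent variables $x$

Use fit_transform() of StandardScaler() to fit x_train and transform it

Because sc_x has already been fitted on x_train, the third line does not need fit_transform() again and can call transform() directly. The test set is then scaled with the mean and standard deviation learned from the training set

The dependent variable is a class label (purchased or not), so it keeps its categorical encoding (0, 1) and is not scaled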
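A minimal sketch of why the test set only calls transform(): the scaler keeps the statistics learned from the training data and reuses them. The arrays are made up, and the `_scaled` variable names are introduced here only so that the original arrays remain visible.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up training and test rows: Age, Salary.
x_train = np.array([[20.0, 40000.0], [30.0, 60000.0], [40.0, 80000.0]])
x_test = np.array([[30.0, 100000.0]])

sc_x = StandardScaler()
x_train_scaled = sc_x.fit_transform(x_train)  # learn mean and std from training data
x_test_scaled = sc_x.transform(x_test)        # reuse the training mean and std

print(sc_x.mean_)      # per-column means learned from x_train
print(x_test_scaled)   # test row expressed in training-set units
```

The test age of 30 equals the training mean, so it scales to 0, while the test salary lies well above every training salary and scales to a value greater than 2. Fitting the scaler on the test set instead would hide that difference.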

Exercises

Machine Learning - Assignment 1

# Importing the Libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the Dataset
dataset = pd.read_csv("Data.csv")
x = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 3].values

# Missing Data
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(missing_values=np.nan, strategy="mean", fill_value=None)
imputer = imputer.fit(x[:, 1:3])
x[:, 1:3] = imputer.transform(x[:, 1:3])

# Categorical Data
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer

labelencorder_x = LabelEncoder()
x[:, 0] = labelencorder_x.fit_transform(x[:, 0])

ct = ColumnTransformer([("Country", OneHotEncoder(), [0])] , remainder="passthrough")
X = ct.fit_transform(x)

labelencorder_y = LabelEncoder()
y = labelencorder_y.fit_transform(y)

# Splitting the Dataset into the Training set and Test set
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Feature Scaling
from sklearn.preprocessing import StandardScaler

sc_x = StandardScaler()
x_train = sc_x.fit_transform(x_train)
x_test = sc_x.transform(x_test)

Machine Learning - Assignment 3

# Importing the libraries 
import numpy as np
#import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset
dataset = pd.read_csv("Customers.csv")
x = dataset.iloc[:, [1, 2, 3, 5, 6, 7]].values
y = dataset.iloc[:, 4].values

dataset.info()

# Missing Data
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(missing_values=np.nan, strategy="most_frequent", fill_value=None)
#X = np.reshape(x[:, 3], (-1, 1))
imputer = imputer.fit(x[:, [3]])
x[:, [3]] = imputer.transform(x[:, [3]])
"""
if len(x) == 0 :
    print("list is empty")
else : print("list is not empty")
"""

# Categorical Data
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer

labelencorder_x = LabelEncoder()
x[:, 0] = labelencorder_x.fit_transform(x[:, 0])
x[:, 3] = labelencorder_x.fit_transform(x[:, 3])


ct = ColumnTransformer([("Gender", OneHotEncoder(), [0]), ("Profession", OneHotEncoder(), [3])] , remainder="passthrough")
X = ct.fit_transform(x)

'''
labelencorder_y = LabelEncoder()
y = labelencorder_y.fit_transform(y)
'''

# Splitting the Dataset into the Training set and Test set
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Feature Scaling
from sklearn.preprocessing import StandardScaler

sc_x = StandardScaler()
x_train = sc_x.fit_transform(x_train)
x_test = sc_x.transform(x_test)
