[ML] sklearn / Advanced Examples
===
###### tags: `ML`, `sklearn`, `python`
<br>
[TOC]
<br>
# 1. Role Definition
> [[HackMD][ML] 1. Role Definition](/GQC1vHgUQNWJiu3CN7236w)
<br>
<hr>
<br>
# 2. Dataset
> [[HackMD] Python / matplotlib](/o66X_svJRPqnLZYuggDaVg)
> [[HackMD][ML] 2. Dataset](/39UTxIkHTvWetcIlmNLfyg)
## Generated Datasets
> - [doc](https://scikit-learn.org/stable/datasets/sample_generators.html#sample-generators)
### `sklearn.datasets.make_multilabel_classification`
- [doc](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_multilabel_classification.html)
- [examples](https://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html)
```python=
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.neighbors import KNeighborsClassifier
X, y = make_multilabel_classification(n_classes=3, random_state=0)
print('X.shape:', X.shape)
print('y.shape:', y.shape)
print('y[-2:]:\n', y[-2:], '\n')
clf = KNeighborsClassifier()
clf.fit(X, y)
print('predict(X[-2:]):\n', clf.predict(X[-2:]))
print('score(): ', clf.score(X, y), '\n')
print('get_params():\n', clf.get_params())
```
**Output:**
```
X.shape: (100, 20)
y.shape: (100, 3)
y[-2:]:
[[1 1 1]
[1 0 1]]
predict(X[-2:]):
[[1 1 0]
[1 1 1]]
score(): 0.66
get_params():
{'algorithm': 'auto', 'leaf_size': 30, 'metric': 'minkowski',
'metric_params': None, 'n_jobs': None, 'n_neighbors': 5,
'p': 2, 'weights': 'uniform'}
```
### `sklearn.datasets.make_regression`
- [doc](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_regression.html)
- [External example](https://machinelearningmastery.com/multi-output-regression-models-with-python/)
```python=
# example of multioutput regression test problem
from sklearn.datasets import make_regression
# create datasets
X, y = make_regression(
n_samples=1000, n_features=10, n_informative=5, n_targets=2, random_state=1, noise=0.5)
# summarize dataset
print(X.shape, y.shape)
```
**Output:**
```
(1000, 10) (1000, 2)
```
- 1000 samples
- 10 features
- 2 targets
<br>
## Collected Datasets
<br>
<hr>
<br>
# 3. Data Wrangling
## Data Analysis
### To Read
- [Machine Learning -關聯分析-Apriori演算法-詳細解說啤酒與尿布的背後原理 Python實作-Scikit Learn一步一步教學](https://chwang12341.medium.com/76b7778f8f34)
<br>
## Data Imputation
### To Read
- [Python impute.IterativeImputer方法代碼示例](https://vimsky.com/zh-tw/examples/detail/python-method-sklearn.impute.IterativeImputer.html)
- [Imputing missing values before building an estimator](https://scikit-learn.org/stable/auto_examples/impute/plot_missing_values.html)
- [sklearn.impute.IterativeImputer](https://scikit-learn.org/stable/modules/generated/sklearn.impute.IterativeImputer.html)
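Ahead of digesting the links above, a minimal `IterativeImputer` sketch (the toy matrix is made up here; note the experimental-module import the current API requires). Each feature with missing values is modeled as a function of the other features:

```python=
import numpy as np
# IterativeImputer is still experimental and must be enabled explicitly
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# toy matrix where column 1 is roughly 2x column 0
X = np.array([[1.0, 2.0],
              [3.0, 6.0],
              [4.0, 8.0],
              [np.nan, 10.0],
              [7.0, np.nan]])
imp = IterativeImputer(random_state=0)
X_filled = imp.fit_transform(X)
print(X_filled)  # NaNs replaced with regression-based estimates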
<br>
## Visualization
### To Read
- [深入淺出 Python 視覺化工具 matplotlib、Seaborn 的基礎與架構全指南與教學](https://medium.com/not-only-data/44a47458912)
- [從圖表秒懂機器學習模型的原理:以 matplotlib 視覺化 scikit-learn 的分類器(KNN、邏輯斯迴歸、SVM、決策樹、隨機森林)](https://alankrantas.medium.com/2f01aec48b54)
<br>
## Big Data Processing
> #Vaex, #Dask
### To Read
- [手把手教你如何用 Python和Vaex 在笔记本上分析 100GB 数据](https://zhuanlan.zhihu.com/p/107468779)
<br>
<hr>
<br>
# 4. Feature Engineering
### To Read
- [Dealing with categorical features with high cardinality: Feature Hashing](https://medium.com/flutter-community/7c406ff867cb)
- [[轉載]Scikit-learn介紹幾種常用的特徵選擇方法](https://www.itread01.com/content/1541215204.html)
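A minimal feature-selection sketch to go with the second link (choosing `SelectKBest` with `f_classif` and `k=3` is my own assumption, not taken from the article):

```python=
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# 10 features, only 3 of them informative
X, y = make_classification(n_samples=200, n_features=10,
                           n_informative=3, random_state=0)
selector = SelectKBest(score_func=f_classif, k=3)  # keep the 3 best-scoring features
X_new = selector.fit_transform(X, y)
print('X_new.shape:', X_new.shape)
print('kept mask:', selector.get_support())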
<br>
<hr>
<br>
# 5. Algorithms
> #algorithms, #strategy
## Multioutput
> Terminology:
> - algorithms that do not **natively** support multiple outputs
>   algorithms that do not **inherently** support multiple outputs
> - multiple-output regression
>   multioutput regression
### `sklearn.multioutput.MultiOutputClassifier`
> - [API](https://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html)
> - code: [multioutput.py](https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/multioutput.py#L338)
> > This strategy consists of fitting one classifier per target. This is a simple strategy for extending classifiers that do not natively support multi-target classification.
```python=
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.multioutput import MultiOutputClassifier
X, y = make_multilabel_classification(n_classes=3, random_state=0)
print('X.shape:', X.shape)
print('y.shape:', y.shape)
print('y[-2:]:\n', y[-2:], '\n')
clf = KNeighborsClassifier()
clf = MultiOutputClassifier(clf)
clf.fit(X, y)
print('predict(X[-2:]):\n', clf.predict(X[-2:]))
print('score(): ', clf.score(X, y), '\n')
print('estimators_:\n', clf.estimators_)
print('get_params():\n', clf.get_params())
```
**Output:**
```
X.shape: (100, 20)
y.shape: (100, 3)
y[-2:]:
[[1 1 1]
[1 0 1]]
predict(X[-2:]):
[[1 1 0]
[1 1 1]]
score(): 0.66
estimators_:
[KNeighborsClassifier(), KNeighborsClassifier(), KNeighborsClassifier()]
get_params():
{'estimator__algorithm': 'auto',
'estimator__leaf_size': 30,
'estimator__metric': 'minkowski',
'estimator__metric_params': None,
'estimator__n_jobs': None,
'estimator__n_neighbors': 5,
'estimator__p': 2,
'estimator__weights': 'uniform',
'estimator': KNeighborsClassifier(),
'n_jobs': None}
```
- **Notes:**
    - It looks like **one classifier per target** is used, each predicting its own target
    - Hands-on test below: the results are identical
```python=
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
X, y = make_multilabel_classification(n_classes=3, random_state=0)
print('X.shape:', X.shape)
print('y.shape:', y.shape)
print('y[-2:]:\n', y[-2:], '\n')
clf0 = KNeighborsClassifier()
clf1 = KNeighborsClassifier()
clf2 = KNeighborsClassifier()
clf0.fit(X, y[:,0])
clf1.fit(X, y[:,1])
clf2.fit(X, y[:,2])
y_pred0 = clf0.predict(X)
y_pred1 = clf1.predict(X)
y_pred2 = clf2.predict(X)
y_pred = np.column_stack((y_pred0, y_pred1, y_pred2))
print('y_pred[-2:]:\n', y_pred[-2:], '\n')
print('clf0.score():', clf0.score(X, y[:,0]))
print('clf1.score():', clf1.score(X, y[:,1]))
print('clf2.score():', clf2.score(X, y[:,2]))
print('acc:', accuracy_score(y, y_pred))
```
**Output:**
```
X.shape: (100, 20)
y.shape: (100, 3)
y[-2:]:
[[1 1 1]
[1 0 1]]
y_pred[-2:]:
[[1 1 0]
[1 1 1]]
clf0.score(): 0.84
clf1.score(): 0.9
clf2.score(): 0.86
acc: 0.66
```
- Reason: clf0, clf1, and clf2 all have the same get_params(), so each per-target fit matches its MultiOutputClassifier counterpart
<br>
### `sklearn.multioutput.MultiOutputRegressor`
> - [API](https://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputRegressor.html) | [doc](https://scikit-learn.org/stable/modules/multiclass.html#multioutputregressor)
> - code: [multioutput.py](https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/multioutput.py#L244)
> > This strategy consists of fitting one regressor per target. This is a simple strategy for extending regressors that do not natively support multi-target regression.
> - issue: https://github.com/scikit-learn/scikit-learn/issues/21826
- ### Example 1
```python=
import numpy as np
from sklearn.datasets import load_linnerud
from sklearn.linear_model import Ridge
from sklearn.multioutput import MultiOutputRegressor
X, y = load_linnerud(return_X_y=True)
print('X.shape:', X.shape)
print('y.shape:', y.shape)
print('y[-2:]:\n', y[-2:], '\n')
regr = Ridge(random_state=123)
regr = MultiOutputRegressor(regr)
regr.fit(X, y)
print('predict(X[-2:]):\n', regr.predict(X[-2:]))
print('score(): ', regr.score(X, y), '\n')
print('estimators_:\n', regr.estimators_)
print('get_params():\n', regr.get_params())
```
**Output:**
```
X.shape: (20, 3)
y.shape: (20, 3)
y[-2:]:
[[156. 33. 54.]
[138. 33. 68.]]
predict(X[-2:]):
[[158.91979326 31.51181739 59.36551594]
[187.32842123 37.0873515 55.40215097]]
score(): 0.29687777631731227
estimators_:
[Ridge(random_state=123), Ridge(random_state=123), Ridge(random_state=123)]
get_params():
{'estimator__alpha': 1.0,
'estimator__copy_X': True,
'estimator__fit_intercept': True,
'estimator__max_iter': None,
'estimator__normalize': False,
'estimator__random_state': 123,
'estimator__solver': 'auto',
'estimator__tol': 0.001,
'estimator': Ridge(random_state=123),
'n_jobs': None}
```
- ### Example 2
    - **Using MultiOutputRegressor:**
```python=
from sklearn.linear_model import Ridge
from sklearn.multioutput import MultiOutputRegressor
X=[[1,2,3,4,5],
[2,3,4,5,6],
[3,4,5,6,7],
[4,5,6,7,8],
[5,6,7,8,9],
[6,7,8,9,10],
]
y=[[6,7,8],
[7,8,9],
[8,9,10],
[9,10,11],
[10,11,12],
[11,12,13]]
regr = Ridge()
regr = MultiOutputRegressor(regr)
regr.fit(X, y)
regr.predict(X)
#regr.estimators_
```
**Output:**
```
array([[ 6.02824859, 7.02824859, 8.02824859],
[ 7.01694915, 8.01694915, 9.01694915],
[ 8.00564972, 9.00564972, 10.00564972],
[ 8.99435028, 9.99435028, 10.99435028],
[ 9.98305085, 10.98305085, 11.98305085],
[10.97175141, 11.97175141, 12.97175141]])
```
- **Wiring it up by hand:**
```python=
import numpy as np  # X, y carry over from the previous block
y = np.array(y)
y_pred_list = list()
regr_list = list()
for n in range(y.shape[1]):
regr = Ridge()
regr_list.append(regr)
regr.fit(X, y[:,n])
y_pred = regr.predict(X)
y_pred_list.append(y_pred)
print('y_pred_list:\n', y_pred_list)
#np.vstack(y_pred_list).T
np.column_stack(y_pred_list)
```
**Output:**
```=
y_pred_list:
[array([ 6.02824859, 7.01694915, 8.00564972, 8.99435028, 9.98305085,
10.97175141]),
array([ 7.02824859, 8.01694915, 9.00564972, 9.99435028, 10.98305085,
11.97175141]),
array([ 8.02824859, 9.01694915, 10.00564972, 10.99435028, 11.98305085,
12.97175141])]
array([[ 6.02824859, 7.02824859, 8.02824859],
[ 7.01694915, 8.01694915, 9.01694915],
[ 8.00564972, 9.00564972, 10.00564972],
[ 8.99435028, 9.99435028, 10.99435028],
[ 9.98305085, 10.98305085, 11.98305085],
[10.97175141, 11.97175141, 12.97175141]])
```
- Same results as `MultiOutputRegressor`
<br>
### `sklearn.multioutput.ClassifierChain`
> - [API](https://scikit-learn.org/stable/modules/generated/sklearn.multioutput.ClassifierChain.html)
Diagram:

([Image source](https://www.researchgate.net/figure/Structure-of-a-classifier-chain-The-input-for-the-chain-is-a-document-vector-consisting_fig1_336148903))
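Not run as part of this note yet; a minimal sketch on the same generated dataset as the `MultiOutputClassifier` example above (choosing `LogisticRegression` as the base estimator is my own assumption, not from the API page). Each classifier in the chain sees X plus the predictions of all earlier classifiers:

```python=
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import ClassifierChain

X, y = make_multilabel_classification(n_classes=3, random_state=0)
# chain order 0 -> 1 -> 2: classifier k is trained on X plus labels 0..k-1
chain = ClassifierChain(LogisticRegression(max_iter=1000), order=[0, 1, 2])
chain.fit(X, y)
print('predict(X[-2:]):\n', chain.predict(X[-2:]))
print('score():', chain.score(X, y))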
<br>
### `sklearn.multioutput.RegressorChain`
> - [API](https://scikit-learn.org/stable/modules/generated/sklearn.multioutput.RegressorChain.html)
- ### Example 1
```python=
from sklearn.multioutput import RegressorChain
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression(solver='lbfgs',multi_class='multinomial')
X, Y = [[1, 0], [0, 1], [1, 1]], [[0, 2], [1, 1], [2, 0]]
chain = RegressorChain(base_estimator=logreg, order=[0, 1,]).fit(X, Y)
chain.predict(X)
```
**Output:**
```
array([[0., 2.],
[1., 1.],
[2., 0.]])
```
- Isn't LogisticRegression a classification algorithm?
- **X**
`[[1, 0], [0, 1], [1, 1]]`
- **Y**
`[[0, 2], [1, 1], [2, 0]]`
- **Check:**
```python=
import numpy as np  # X, Y, and logreg carry over from the block above
y = np.array(Y)
print('y[:,0] =', y[:,0], '\n')
y_0 = logreg.fit(X, y[:,0]).predict(X)
print('y_0 =', y_0, '\n')
X_y0 = np.column_stack((X, y_0))
print('X_y0 =\n', X_y0, '\n')
y_1 = logreg.fit(X_y0, y[:,1]).predict(X_y0)
print('y_1 =', y_1, '\n')
#y_01 = np.vstack((y_0, y_1)).T
y_01 = np.column_stack((y_0, y_1))
print('y_01 =\n', y_01, '\n')
```
**Output:**
```=
y[:,0] = [0 1 2]
y_0 = [0 1 2]
X_y0 =
[[1 0 0]
[0 1 1]
[1 1 2]]
y_1 = [2 1 0]
y_01 =
[[0 2]
[1 1]
[2 0]]
```
- ### [Experiment] Why do RegressorChain and MultiOutputRegressor give the same results?
```python=
from sklearn.linear_model import Ridge
from sklearn.multioutput import MultiOutputRegressor
from sklearn.multioutput import RegressorChain
X=[[1,2,3,4,5],
[2,3,4,5,6],
[3,4,5,6,7],
[4,5,6,7,8],
[5,6,7,8,9],
[6,7,8,9,10],
]
y=[[6,7,8],
[7,8,9],
[8,9,10],
[9,10,11],
[10,11,12],
[11,12,13]]
regr = Ridge()
regr1 = MultiOutputRegressor(regr)
regr1.fit(X, y)
print('MultiOutputRegressor:\n', regr1.predict(X))
print()
regr2 = RegressorChain(regr, order=[0,1,2])
regr2.fit(X, y)
print('RegressorChain:\n', regr2.predict(X))
```
**Output:**
```=
MultiOutputRegressor:
[[ 6.02824859 7.02824859 8.02824859]
[ 7.01694915 8.01694915 9.01694915]
[ 8.00564972 9.00564972 10.00564972]
[ 8.99435028 9.99435028 10.99435028]
[ 9.98305085 10.98305085 11.98305085]
[10.97175141 11.97175141 12.97175141]]
RegressorChain:
[[ 6.02824859 7.02824859 8.02824859]
[ 7.01694915 8.01694915 9.01694915]
[ 8.00564972 9.00564972 10.00564972]
[ 8.99435028 9.99435028 10.99435028]
[ 9.98305085 10.98305085 11.98305085]
[10.97175141 11.97175141 12.97175141]]
```
- ### [Experiment] The data length changes, so how can predictions be fed into the next regressor? Shouldn't the window be fixed?
```python=
import numpy as np
from sklearn.linear_model import Ridge
X_data=[[1,2,3,4,5],
[2,3,4,5,6],
[3,4,5,6,7],
[4,5,6,7,8],
[5,6,7,8,9],
[6,7,8,9,10],
]
y_data=[[6,7,8],
[7,8,9],
[8,9,10],
[9,10,11],
[10,11,12],
[11,12,13]]
X = np.array(X_data)[:,:-1]
y = np.column_stack((np.array(X_data)[:,-1], np.array(y_data)))
y_pred_list = list()
regr = Ridge()
print('X_init:\n', X)
print('y_init:', y[:,0])
regr.fit(X, y[:,0])
y_pred = y[:,0]
for n in range(1, y.shape[1]):
X = np.column_stack((X[:,1:], y_pred))
print()
print('features:\n', X)
print('target:', y[:,n])
y_pred = regr.predict(X)
y_pred_list.append(y_pred)
print('y_pred:')
print(y_pred)
regr = Ridge()
regr.fit(X, y_pred)
print('-' * 30)
print('y_pred_list:\n', y_pred_list)
#np.vstack(y_pred_list).T
np.column_stack(y_pred_list)
```
**Output:**
```
X_init:
[[1 2 3 4]
[2 3 4 5]
[3 4 5 6]
[4 5 6 7]
[5 6 7 8]
[6 7 8 9]]
y_init: [ 5 6 7 8 9 10]
features:
[[ 2 3 4 5]
[ 3 4 5 6]
[ 4 5 6 7]
[ 5 6 7 8]
[ 6 7 8 9]
[ 7 8 9 10]]
target: [ 6 7 8 9 10 11]
y_pred:
[ 6.02112676 7.00704225 7.99295775 8.97887324 9.96478873 10.95070423]
------------------------------
features:
[[ 3. 4. 5. 6.02112676]
[ 4. 5. 6. 7.00704225]
[ 5. 6. 7. 7.99295775]
[ 6. 7. 8. 8.97887324]
[ 7. 8. 9. 9.96478873]
[ 8. 9. 10. 10.95070423]]
target: [ 7 8 9 10 11 12]
y_pred:
[ 7.03300541 8.00161213 8.97021885 9.93882557 10.90743229 11.87603902]
------------------------------
features:
[[ 4. 5. 6.02112676 7.03300541]
[ 5. 6. 7.00704225 8.00161213]
[ 6. 7. 7.99295775 8.97021885]
[ 7. 8. 8.97887324 9.93882557]
[ 8. 9. 9.96478873 10.90743229]
[ 9. 10. 10.95070423 11.87603902]]
target: [ 8 9 10 11 12 13]
y_pred:
[ 8.03345015 8.98083153 9.9282129 10.87559428 11.82297566 12.77035703]
------------------------------
y_pred_list:
[array([ 6.02112676, 7.00704225, 7.99295775, 8.97887324, 9.96478873,
10.95070423]),
array([ 7.03300541, 8.00161213, 8.97021885, 9.93882557, 10.90743229,
11.87603902]),
array([ 8.03345015, 8.98083153, 9.9282129 , 10.87559428, 11.82297566,
12.77035703])]
```
<br>
### References
- ### [4 Strategies for Multi-Step Time Series Forecasting](https://machinelearningmastery.com/multi-step-time-series-forecasting/)
- ### [How to Develop Multi-Output Regression Models with Python](https://machinelearningmastery.com/multi-output-regression-models-with-python/)
- Has code :+1: :100:
- **Not all regression algorithms support multioutput regression.**
- Typically the error message looks like:
> ValueError: y should be a 1d array, got an array of shape (999, 3) instead.
- Natively support multioutput:
- sklearn.ensemble.RandomForestRegressor
- sklearn.ensemble.BaggingRegressor
- sklearn.ensemble.ExtraTreesRegressor
- sklearn.linear_model.LinearRegression
- sklearn.neighbors.KNeighborsRegressor
- sklearn.tree.DecisionTreeRegressor
- Do not natively support multioutput:
- sklearn.ensemble.AdaBoostRegressor
- sklearn.ensemble.GradientBoostingRegressor
- Workaround
  A workaround for using regression models designed for predicting one value for multioutput regression is to divide the multioutput regression problem into multiple sub-problems.
  The most obvious way to do this is to split a multioutput regression problem into multiple single-output regression problems.
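A minimal sketch of both halves of that claim, using `GradientBoostingRegressor` from the unsupported list above (the dataset parameters are arbitrary choices of mine):

```python=
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.multioutput import MultiOutputRegressor

X, y = make_regression(n_samples=200, n_features=10, n_targets=2, random_state=1)

# a direct fit on a 2-column y raises the ValueError quoted above
try:
    GradientBoostingRegressor(random_state=1).fit(X, y)
except ValueError as e:
    print('direct fit failed:', e)

# the wrapper splits the problem into one single-output regressor per target
regr = MultiOutputRegressor(GradientBoostingRegressor(random_state=1))
regr.fit(X, y)
print('predict shape:', regr.predict(X[:2]).shape)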
- ### [www.sktime.org / Tutorials / Forecasting with sktime](https://www.sktime.org/en/stable/examples/01_forecasting.html)
- **The `strategy` parameter**
    - recursive (default)
- direct
- dirrec (direct + recursive)
- multioutput
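These strategies can be imitated with plain sklearn rather than sktime; a rough sketch of recursive vs. direct forecasting (the toy series, window sizes, and `Ridge` are all my own choices, not sktime internals):

```python=
import numpy as np
from sklearn.linear_model import Ridge

series = np.arange(1.0, 21.0)  # toy series: 1, 2, ..., 20
LAGS, HORIZON = 4, 2

# sliding windows: 4 lag values -> the next 2 values
X = np.array([series[i:i + LAGS] for i in range(len(series) - LAGS - HORIZON + 1)])
Y = np.array([series[i + LAGS:i + LAGS + HORIZON] for i in range(len(X))])

# direct: fit one model per forecast-horizon step
direct = [Ridge().fit(X, Y[:, h]) for h in range(HORIZON)]
x_last = series[-LAGS:].reshape(1, -1)
direct_pred = [m.predict(x_last)[0] for m in direct]

# recursive: fit a single one-step model, feed its predictions back in
one_step = Ridge().fit(X, Y[:, 0])
window = list(series[-LAGS:])
rec_pred = []
for _ in range(HORIZON):
    p = one_step.predict(np.array(window[-LAGS:]).reshape(1, -1))[0]
    rec_pred.append(p)
    window.append(p)

print('direct:   ', np.round(direct_pred, 2))
print('recursive:', np.round(rec_pred, 2))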
<br>
<hr>
<br>
# 6. Training
> #hyperparameter search, #hyperparameter tuning, #hyperparameter optimization, #grid search, #cross validation
## To Read
- [自動化調整超參數方法介紹(使用python)](https://medium.com/jackys-blog/自動化調整超參數方法介紹-使用python-40edb9f0b462)
- [Grid SearchCV(網格搜尋)與RandomizedSearchCV (隨機搜尋)](https://tw511.com/a/01/8581.html)
- [調參必備--Grid Search網格搜索](https://kknews.cc/code/aaaxmen.html)
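Before digesting those links, a minimal `GridSearchCV` sketch (the parameter grid and `cv=3` are arbitrary choices of mine), reusing the multilabel dataset from the examples above:

```python=
from sklearn.datasets import make_multilabel_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = make_multilabel_classification(n_classes=3, random_state=0)
param_grid = {
    'n_neighbors': [3, 5, 7],
    'weights': ['uniform', 'distance'],
}
# exhaustively tries all 6 combinations with 3-fold cross-validation
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=3)
search.fit(X, y)
print('best_params_:', search.best_params_)
print('best_score_: ', round(search.best_score_, 3))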
<br>
<hr>
<br>
# 7. Metrics
<br>
<hr>
<br>
# 8. Process Control, Pipeline Automation, MLOps
> pipeline
<br>
<hr>
<br>
# 9. Terminology Discussion
> [[HackMD][ML] 9. Terminology Discussion](/2TiOHA2vRuKAZTVOcq4pkg)