[ML] sklearn / Advanced Examples
===
###### tags: `ML`, `sklearn`, `python`
<br>
[TOC]
<br>
# 1. Role Definition
> [[HackMD][ML] 1. Role Definition](/GQC1vHgUQNWJiu3CN7236w)
<br>
<hr>
<br>
# 2. Dataset
> [[HackMD] Python / matplotlib](/o66X_svJRPqnLZYuggDaVg)
> [[HackMD][ML] 2. Dataset](/39UTxIkHTvWetcIlmNLfyg)
## Generated Datasets
> - [doc](https://scikit-learn.org/stable/datasets/sample_generators.html#sample-generators)
### `sklearn.datasets.make_multilabel_classification`
- [doc](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_multilabel_classification.html)
- [examples](https://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html)
```python=
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.neighbors import KNeighborsClassifier
X, y = make_multilabel_classification(n_classes=3, random_state=0)
print('X.shape:', X.shape)
print('y.shape:', y.shape)
print('y[-2:]:\n', y[-2:], '\n')
clf = KNeighborsClassifier()
clf.fit(X, y)
print('predict(X[-2:]):\n', clf.predict(X[-2:]))
print('score(): ', clf.score(X, y), '\n')
print('get_params():\n', clf.get_params())
```
**Output:**
```
X.shape: (100, 20)
y.shape: (100, 3)
y[-2:]:
[[1 1 1]
[1 0 1]]
predict(X[-2:]):
[[1 1 0]
[1 1 1]]
score(): 0.66
get_params():
{'algorithm': 'auto', 'leaf_size': 30, 'metric': 'minkowski',
'metric_params': None, 'n_jobs': None, 'n_neighbors': 5,
'p': 2, 'weights': 'uniform'}
```
### `sklearn.datasets.make_regression`
- [doc](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_regression.html)
- [External example](https://machinelearningmastery.com/multi-output-regression-models-with-python/)
```python=
# example of multioutput regression test problem
from sklearn.datasets import make_regression
# create datasets
X, y = make_regression(
n_samples=1000, n_features=10, n_informative=5, n_targets=2, random_state=1, noise=0.5)
# summarize dataset
print(X.shape, y.shape)
```
**Output:**
```
(1000, 10) (1000, 2)
```
- 1000 samples
- 10 features
- 2 targets
<br>
## Collected Datasets
<br>
<hr>
<br>
# 3. Data Wrangling
## Data Analysis
### To Read
- [Machine Learning -關聯分析-Apriori演算法-詳細解說啤酒與尿布的背後原理 Python實作-Scikit Learn一步一步教學](https://chwang12341.medium.com/76b7778f8f34)
<br>
## Data Imputation
### To Read
- [Python impute.IterativeImputer方法代碼示例](https://vimsky.com/zh-tw/examples/detail/python-method-sklearn.impute.IterativeImputer.html)
- [Imputing missing values before building an estimator](https://scikit-learn.org/stable/auto_examples/impute/plot_missing_values.html)
- [sklearn.impute.IterativeImputer](https://scikit-learn.org/stable/modules/generated/sklearn.impute.IterativeImputer.html)
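Ahead of digesting the links above, a minimal `IterativeImputer` sketch (the toy matrix is made up here; note the experimental-module import the current API requires). Each feature with missing values is modeled as a function of the other features:

```python=
import numpy as np
# IterativeImputer is still experimental and must be enabled explicitly
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# toy matrix where column 1 is roughly 2x column 0
X = np.array([[1.0, 2.0],
              [3.0, 6.0],
              [4.0, 8.0],
              [np.nan, 10.0],
              [7.0, np.nan]])
imp = IterativeImputer(random_state=0)
X_filled = imp.fit_transform(X)
print(X_filled)  # NaNs replaced with regression-based estimates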
<br>
## Visualization
### To Read
- [深入淺出 Python 視覺化工具 matplotlib、Seaborn 的基礎與架構全指南與教學](https://medium.com/not-only-data/44a47458912)
- [從圖表秒懂機器學習模型的原理:以 matplotlib 視覺化 scikit-learn 的分類器(KNN、邏輯斯迴歸、SVM、決策樹、隨機森林)](https://alankrantas.medium.com/2f01aec48b54)
<br>
## Big Data Processing
> #Vaex, #Dask
### To Read
- [手把手教你如何用 Python和Vaex 在笔记本上分析 100GB 数据](https://zhuanlan.zhihu.com/p/107468779)
<br>
<hr>
<br>
# 4. Feature Engineering
### To Read
- [Dealing with categorical features with high cardinality: Feature Hashing](https://medium.com/flutter-community/7c406ff867cb)
- [[轉載]Scikit-learn介紹幾種常用的特徵選擇方法](https://www.itread01.com/content/1541215204.html)
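A minimal feature-selection sketch to go with the second link (choosing `SelectKBest` with `f_classif` and `k=3` is my own assumption, not taken from the article):

```python=
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# 10 features, only 3 of them informative
X, y = make_classification(n_samples=200, n_features=10,
                           n_informative=3, random_state=0)
selector = SelectKBest(score_func=f_classif, k=3)  # keep the 3 best-scoring features
X_new = selector.fit_transform(X, y)
print('X_new.shape:', X_new.shape)
print('kept mask:', selector.get_support())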
<br>
<hr>
<br>
# 5. Algorithms
> #algorithms, #strategy
## Multioutput
> Terminology:
> - algorithms that do not **natively** support multiple outputs
>   algorithms that do not **inherently** support multiple outputs
> - multiple-output regression
>   multioutput regression
### `sklearn.multioutput.MultiOutputClassifier`
> - [API](https://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html)
> - code: [multioutput.py](https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/multioutput.py#L338)
> > This strategy consists of fitting one classifier per target. This is a simple strategy for extending classifiers that do not natively support multi-target classification.
```python=
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.multioutput import MultiOutputClassifier
X, y = make_multilabel_classification(n_classes=3, random_state=0)
print('X.shape:', X.shape)
print('y.shape:', y.shape)
print('y[-2:]:\n', y[-2:], '\n')
clf = KNeighborsClassifier()
clf = MultiOutputClassifier(clf)
clf.fit(X, y)
print('predict(X[-2:]):\n', clf.predict(X[-2:]))
print('score(): ', clf.score(X, y), '\n')
print('estimators_:\n', clf.estimators_)
print('get_params():\n', clf.get_params())
```
**Output:**
```
X.shape: (100, 20)
y.shape: (100, 3)
y[-2:]:
[[1 1 1]
[1 0 1]]
predict(X[-2:]):
[[1 1 0]
[1 1 1]]
score(): 0.66
estimators_:
[KNeighborsClassifier(), KNeighborsClassifier(), KNeighborsClassifier()]
get_params():
{'estimator__algorithm': 'auto',
'estimator__leaf_size': 30,
'estimator__metric': 'minkowski',
'estimator__metric_params': None,
'estimator__n_jobs': None,
'estimator__n_neighbors': 5,
'estimator__p': 2,
'estimator__weights': 'uniform',
'estimator': KNeighborsClassifier(),
'n_jobs': None}
```
- **Notes:**
    - It looks like **one classifier per target** is used, each predicting its own target
    - Hands-on test below: the results are identical
```python=
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
X, y = make_multilabel_classification(n_classes=3, random_state=0)
print('X.shape:', X.shape)
print('y.shape:', y.shape)
print('y[-2:]:\n', y[-2:], '\n')
clf0 = KNeighborsClassifier()
clf1 = KNeighborsClassifier()
clf2 = KNeighborsClassifier()
clf0.fit(X, y[:,0])
clf1.fit(X, y[:,1])
clf2.fit(X, y[:,2])
y_pred0 = clf0.predict(X)
y_pred1 = clf1.predict(X)
y_pred2 = clf2.predict(X)
y_pred = np.column_stack((y_pred0, y_pred1, y_pred2))
print('y_pred[-2:]:\n', y_pred[-2:], '\n')
print('clf0.score():', clf0.score(X, y[:,0]))
print('clf1.score():', clf1.score(X, y[:,1]))
print('clf2.score():', clf2.score(X, y[:,2]))
print('acc:', accuracy_score(y, y_pred))
```
**Output:**
```
X.shape: (100, 20)
y.shape: (100, 3)
y[-2:]:
[[1 1 1]
[1 0 1]]
y_pred[-2:]:
[[1 1 0]
[1 1 1]]
clf0.score(): 0.84
clf1.score(): 0.9
clf2.score(): 0.86
acc: 0.66
```
- Reason: clf0, clf1, and clf2 all have the same get_params(), so each per-target fit matches its MultiOutputClassifier counterpart
<br>
### `sklearn.multioutput.MultiOutputRegressor`
> - [API](https://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputRegressor.html) | [doc](https://scikit-learn.org/stable/modules/multiclass.html#multioutputregressor)
> - code: [multioutput.py](https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/multioutput.py#L244)
> > This strategy consists of fitting one regressor per target. This is a simple strategy for extending regressors that do not natively support multi-target regression.
> - issue: https://github.com/scikit-learn/scikit-learn/issues/21826
- ### Example 1
```python=
import numpy as np
from sklearn.datasets import load_linnerud
from sklearn.linear_model import Ridge
from sklearn.multioutput import MultiOutputRegressor
X, y = load_linnerud(return_X_y=True)
print('X.shape:', X.shape)
print('y.shape:', y.shape)
print('y[-2:]:\n', y[-2:], '\n')
regr = Ridge(random_state=123)
regr = MultiOutputRegressor(regr)
regr.fit(X, y)
print('predict(X[-2:]):\n', regr.predict(X[-2:]))
print('score(): ', regr.score(X, y), '\n')
print('estimators_:\n', regr.estimators_)
print('get_params():\n', regr.get_params())
```
**Output:**
```
X.shape: (20, 3)
y.shape: (20, 3)
y[-2:]:
[[156. 33. 54.]
[138. 33. 68.]]
predict(X[-2:]):
[[158.91979326 31.51181739 59.36551594]
[187.32842123 37.0873515 55.40215097]]
score(): 0.29687777631731227
estimators_:
[Ridge(random_state=123), Ridge(random_state=123), Ridge(random_state=123)]
get_params():
{'estimator__alpha': 1.0,
'estimator__copy_X': True,
'estimator__fit_intercept': True,
'estimator__max_iter': None,
'estimator__normalize': False,
'estimator__random_state': 123,
'estimator__solver': 'auto',
'estimator__tol': 0.001,
'estimator': Ridge(random_state=123),
'n_jobs': None}
```
- ### Example 2
    - **Using MultiOutputRegressor:**
```python=
from sklearn.linear_model import Ridge
from sklearn.multioutput import MultiOutputRegressor
X=[[1,2,3,4,5],
[2,3,4,5,6],
[3,4,5,6,7],
[4,5,6,7,8],
[5,6,7,8,9],
[6,7,8,9,10],
]
y=[[6,7,8],
[7,8,9],
[8,9,10],
[9,10,11],
[10,11,12],
[11,12,13]]
regr = Ridge()
regr = MultiOutputRegressor(regr)
regr.fit(X, y)
regr.predict(X)
#regr.estimators_
```
**Output:**
```
array([[ 6.02824859, 7.02824859, 8.02824859],
[ 7.01694915, 8.01694915, 9.01694915],
[ 8.00564972, 9.00564972, 10.00564972],
[ 8.99435028, 9.99435028, 10.99435028],
[ 9.98305085, 10.98305085, 11.98305085],
[10.97175141, 11.97175141, 12.97175141]])
```
- **Wiring it up by hand:**
```python=
import numpy as np  # X, y carry over from the previous block
y = np.array(y)
y_pred_list = list()
regr_list = list()
for n in range(y.shape[1]):
regr = Ridge()
regr_list.append(regr)
regr.fit(X, y[:,n])
y_pred = regr.predict(X)
y_pred_list.append(y_pred)
print('y_pred_list:\n', y_pred_list)
#np.vstack(y_pred_list).T
np.column_stack(y_pred_list)
```
**Output:**
```=
y_pred_list:
[array([ 6.02824859, 7.01694915, 8.00564972, 8.99435028, 9.98305085,
10.97175141]),
array([ 7.02824859, 8.01694915, 9.00564972, 9.99435028, 10.98305085,
11.97175141]),
array([ 8.02824859, 9.01694915, 10.00564972, 10.99435028, 11.98305085,
12.97175141])]
array([[ 6.02824859, 7.02824859, 8.02824859],
[ 7.01694915, 8.01694915, 9.01694915],
[ 8.00564972, 9.00564972, 10.00564972],
[ 8.99435028, 9.99435028, 10.99435028],
[ 9.98305085, 10.98305085, 11.98305085],
[10.97175141, 11.97175141, 12.97175141]])
```
- Same results as `MultiOutputRegressor`
<br>
### `sklearn.multioutput.ClassifierChain`
> - [API](https://scikit-learn.org/stable/modules/generated/sklearn.multioutput.ClassifierChain.html)
Diagram:

([Image source](https://www.researchgate.net/figure/Structure-of-a-classifier-chain-The-input-for-the-chain-is-a-document-vector-consisting_fig1_336148903))
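Not run as part of this note yet; a minimal sketch on the same generated dataset as the `MultiOutputClassifier` example above (choosing `LogisticRegression` as the base estimator is my own assumption, not from the API page). Each classifier in the chain sees X plus the predictions of all earlier classifiers:

```python=
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import ClassifierChain

X, y = make_multilabel_classification(n_classes=3, random_state=0)
# chain order 0 -> 1 -> 2: classifier k is trained on X plus labels 0..k-1
chain = ClassifierChain(LogisticRegression(max_iter=1000), order=[0, 1, 2])
chain.fit(X, y)
print('predict(X[-2:]):\n', chain.predict(X[-2:]))
print('score():', chain.score(X, y))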
<br>
### `sklearn.multioutput.RegressorChain`
> - [API](https://scikit-learn.org/stable/modules/generated/sklearn.multioutput.RegressorChain.html)
- ### Example 1
```python=
from sklearn.multioutput import RegressorChain
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression(solver='lbfgs',multi_class='multinomial')
X, Y = [[1, 0], [0, 1], [1, 1]], [[0, 2], [1, 1], [2, 0]]
chain = RegressorChain(base_estimator=logreg, order=[0, 1,]).fit(X, Y)
chain.predict(X)
```
**Output:**
```
array([[0., 2.],
[1., 1.],
[2., 0.]])
```
- Isn't LogisticRegression a classification algorithm?
- **X**
`[[1, 0], [0, 1], [1, 1]]`
- **Y**
`[[0, 2], [1, 1], [2, 0]]`
- **Check:**
```python=
import numpy as np  # X, Y, and logreg carry over from the block above
y = np.array(Y)
print('y[:,0] =', y[:,0], '\n')
y_0 = logreg.fit(X, y[:,0]).predict(X)
print('y_0 =', y_0, '\n')
X_y0 = np.column_stack((X, y_0))
print('X_y0 =\n', X_y0, '\n')
y_1 = logreg.fit(X_y0, y[:,1]).predict(X_y0)
print('y_1 =', y_1, '\n')
#y_01 = np.vstack((y_0, y_1)).T
y_01 = np.column_stack((y_0, y_1))
print('y_01 =\n', y_01, '\n')
```
**Output:**
```=
y[:,0] = [0 1 2]
y_0 = [0 1 2]
X_y0 =
[[1 0 0]
[0 1 1]
[1 1 2]]
y_1 = [2 1 0]
y_01 =
[[0 2]
[1 1]
[2 0]]
```
- ### [Experiment] Why do RegressorChain and MultiOutputRegressor give the same results?
```python=
from sklearn.linear_model import Ridge
from sklearn.multioutput import MultiOutputRegressor
from sklearn.multioutput import RegressorChain
X=[[1,2,3,4,5],
[2,3,4,5,6],
[3,4,5,6,7],
[4,5,6,7,8],
[5,6,7,8,9],
[6,7,8,9,10],
]
y=[[6,7,8],
[7,8,9],
[8,9,10],
[9,10,11],
[10,11,12],
[11,12,13]]
regr = Ridge()
regr1 = MultiOutputRegressor(regr)
regr1.fit(X, y)
print('MultiOutputRegressor:\n', regr1.predict(X))
print()
regr2 = RegressorChain(regr, order=[0,1,2])
regr2.fit(X, y)
print('RegressorChain:\n', regr2.predict(X))
```
**Output:**
```=
MultiOutputRegressor:
[[ 6.02824859 7.02824859 8.02824859]
[ 7.01694915 8.01694915 9.01694915]
[ 8.00564972 9.00564972 10.00564972]
[ 8.99435028 9.99435028 10.99435028]
[ 9.98305085 10.98305085 11.98305085]
[10.97175141 11.97175141 12.97175141]]
RegressorChain:
[[ 6.02824859 7.02824859 8.02824859]
[ 7.01694915 8.01694915 9.01694915]
[ 8.00564972 9.00564972 10.00564972]
[ 8.99435028 9.99435028 10.99435028]
[ 9.98305085 10.98305085 11.98305085]
[10.97175141 11.97175141 12.97175141]]
```
- ### [Experiment] The data length changes, so how can predictions be fed into the next regressor? Shouldn't the window be fixed?
```python=
import numpy as np
from sklearn.linear_model import Ridge
X_data=[[1,2,3,4,5],
[2,3,4,5,6],
[3,4,5,6,7],
[4,5,6,7,8],
[5,6,7,8,9],
[6,7,8,9,10],
]
y_data=[[6,7,8],
[7,8,9],
[8,9,10],
[9,10,11],
[10,11,12],
[11,12,13]]
X = np.array(X_data)[:,:-1]
y = np.column_stack((np.array(X_data)[:,-1], np.array(y_data)))
y_pred_list = list()
regr = Ridge()
print('X_init:\n', X)
print('y_init:', y[:,0])
regr.fit(X, y[:,0])
y_pred = y[:,0]
for n in range(1, y.shape[1]):
X = np.column_stack((X[:,1:], y_pred))
print()
print('features:\n', X)
print('target:', y[:,n])
y_pred = regr.predict(X)
y_pred_list.append(y_pred)
print('y_pred:')
print(y_pred)
regr = Ridge()
regr.fit(X, y_pred)
print('-' * 30)
print('y_pred_list:\n', y_pred_list)
#np.vstack(y_pred_list).T
np.column_stack(y_pred_list)
```
**Output:**
```
X_init:
[[1 2 3 4]
[2 3 4 5]
[3 4 5 6]
[4 5 6 7]
[5 6 7 8]
[6 7 8 9]]
y_init: [ 5 6 7 8 9 10]
features:
[[ 2 3 4 5]
[ 3 4 5 6]
[ 4 5 6 7]
[ 5 6 7 8]
[ 6 7 8 9]
[ 7 8 9 10]]
target: [ 6 7 8 9 10 11]
y_pred:
[ 6.02112676 7.00704225 7.99295775 8.97887324 9.96478873 10.95070423]
------------------------------
features:
[[ 3. 4. 5. 6.02112676]
[ 4. 5. 6. 7.00704225]
[ 5. 6. 7. 7.99295775]
[ 6. 7. 8. 8.97887324]
[ 7. 8. 9. 9.96478873]
[ 8. 9. 10. 10.95070423]]
target: [ 7 8 9 10 11 12]
y_pred:
[ 7.03300541 8.00161213 8.97021885 9.93882557 10.90743229 11.87603902]
------------------------------
features:
[[ 4. 5. 6.02112676 7.03300541]
[ 5. 6. 7.00704225 8.00161213]
[ 6. 7. 7.99295775 8.97021885]
[ 7. 8. 8.97887324 9.93882557]
[ 8. 9. 9.96478873 10.90743229]
[ 9. 10. 10.95070423 11.87603902]]
target: [ 8 9 10 11 12 13]
y_pred:
[ 8.03345015 8.98083153 9.9282129 10.87559428 11.82297566 12.77035703]
------------------------------
y_pred_list:
[array([ 6.02112676, 7.00704225, 7.99295775, 8.97887324, 9.96478873,
10.95070423]),
array([ 7.03300541, 8.00161213, 8.97021885, 9.93882557, 10.90743229,
11.87603902]),
array([ 8.03345015, 8.98083153, 9.9282129 , 10.87559428, 11.82297566,
12.77035703])]
```
<br>
### References
- ### [4 Strategies for Multi-Step Time Series Forecasting](https://machinelearningmastery.com/multi-step-time-series-forecasting/)
- ### [How to Develop Multi-Output Regression Models with Python](https://machinelearningmastery.com/multi-output-regression-models-with-python/)
- Has code :+1: :100:
- **Not all regression algorithms support multioutput regression.**
- Typically the error message looks like:
> ValueError: y should be a 1d array, got an array of shape (999, 3) instead.
- Natively support multioutput:
- sklearn.ensemble.RandomForestRegressor
- sklearn.ensemble.BaggingRegressor
- sklearn.ensemble.ExtraTreesRegressor
- sklearn.linear_model.LinearRegression
- sklearn.neighbors.KNeighborsRegressor
- sklearn.tree.DecisionTreeRegressor
- Do not natively support multioutput:
- sklearn.ensemble.AdaBoostRegressor
- sklearn.ensemble.GradientBoostingRegressor
- Workaround
  A workaround for using regression models designed for predicting one value for multioutput regression is to divide the multioutput regression problem into multiple sub-problems.
  The most obvious way to do this is to split a multioutput regression problem into multiple single-output regression problems.
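A minimal sketch of both halves of that claim, using `GradientBoostingRegressor` from the unsupported list above (the dataset parameters are arbitrary choices of mine):

```python=
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.multioutput import MultiOutputRegressor

X, y = make_regression(n_samples=200, n_features=10, n_targets=2, random_state=1)

# a direct fit on a 2-column y raises the ValueError quoted above
try:
    GradientBoostingRegressor(random_state=1).fit(X, y)
except ValueError as e:
    print('direct fit failed:', e)

# the wrapper splits the problem into one single-output regressor per target
regr = MultiOutputRegressor(GradientBoostingRegressor(random_state=1))
regr.fit(X, y)
print('predict shape:', regr.predict(X[:2]).shape)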
- ### [www.sktime.org / Tutorials / Forecasting with sktime](https://www.sktime.org/en/stable/examples/01_forecasting.html)
- **The `strategy` parameter**
    - recursive (default)
- direct
- dirrec (direct + recursive)
- multioutput
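These strategies can be imitated with plain sklearn rather than sktime; a rough sketch of recursive vs. direct forecasting (the toy series, window sizes, and `Ridge` are all my own choices, not sktime internals):

```python=
import numpy as np
from sklearn.linear_model import Ridge

series = np.arange(1.0, 21.0)  # toy series: 1, 2, ..., 20
LAGS, HORIZON = 4, 2

# sliding windows: 4 lag values -> the next 2 values
X = np.array([series[i:i + LAGS] for i in range(len(series) - LAGS - HORIZON + 1)])
Y = np.array([series[i + LAGS:i + LAGS + HORIZON] for i in range(len(X))])

# direct: fit one model per forecast-horizon step
direct = [Ridge().fit(X, Y[:, h]) for h in range(HORIZON)]
x_last = series[-LAGS:].reshape(1, -1)
direct_pred = [m.predict(x_last)[0] for m in direct]

# recursive: fit a single one-step model, feed its predictions back in
one_step = Ridge().fit(X, Y[:, 0])
window = list(series[-LAGS:])
rec_pred = []
for _ in range(HORIZON):
    p = one_step.predict(np.array(window[-LAGS:]).reshape(1, -1))[0]
    rec_pred.append(p)
    window.append(p)

print('direct:   ', np.round(direct_pred, 2))
print('recursive:', np.round(rec_pred, 2))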
<br>
<hr>
<br>
# 6. Training
> #hyperparameter search, #hyperparameter tuning, #hyperparameter optimization, #grid search, #cross validation
## To Read
- [自動化調整超參數方法介紹(使用python)](https://medium.com/jackys-blog/自動化調整超參數方法介紹-使用python-40edb9f0b462)
- [Grid SearchCV(網格搜尋)與RandomizedSearchCV (隨機搜尋)](https://tw511.com/a/01/8581.html)
- [調參必備--Grid Search網格搜索](https://kknews.cc/code/aaaxmen.html)
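Before digesting those links, a minimal `GridSearchCV` sketch (the parameter grid and `cv=3` are arbitrary choices of mine), reusing the multilabel dataset from the examples above:

```python=
from sklearn.datasets import make_multilabel_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = make_multilabel_classification(n_classes=3, random_state=0)
param_grid = {
    'n_neighbors': [3, 5, 7],
    'weights': ['uniform', 'distance'],
}
# exhaustively tries all 6 combinations with 3-fold cross-validation
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=3)
search.fit(X, y)
print('best_params_:', search.best_params_)
print('best_score_: ', round(search.best_score_, 3))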
<br>
<hr>
<br>
# 7. Metrics
<br>
<hr>
<br>
# 8. Process Control, Pipeline Automation, MLOps
> pipeline
<br>
<hr>
<br>
# 9. Terminology Discussion
> [[HackMD][ML] 9. Terminology Discussion](/2TiOHA2vRuKAZTVOcq4pkg)