---
breaks: false # false means soft line breaks are used, i.e. a newline character does not start a new line.
---

# Scikit-Learn sklearn.ensemble.StackingClassifier
###### tags: `scikit-learn` `sklearn` `python` `machine learning` `ensemble`

>[name=Marty.chen]
>[time=Wed, Dec 25, 2019 11:32 AM]
>All examples below are taken from the official documentation

:::danger
Official documentation:
* [API](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.StackingClassifier.html#sklearn.ensemble.StackingClassifier)
* [User Guide](https://scikit-learn.org/stable/modules/ensemble.html#stacking)
:::

## Description
Stacked generalization was added in version 0.22. It combines several classifiers or regressors into a single ensemble. According to the official documentation, the stacked estimators are trained on the full training set, while `final_estimator` takes the outputs of those estimators as its input and is trained on their cross-validated predictions.

The `estimators` parameter takes a list that stacks the different estimators:
```python
>>> from sklearn.linear_model import RidgeCV, LassoCV
>>> from sklearn.svm import SVR
>>> estimators = [('ridge', RidgeCV()),
...               ('lasso', LassoCV(random_state=42)),
...               ('svr', SVR(C=1, gamma=1e-6))]
```

Depending on whether the task is classification or regression, import `StackingClassifier` or `StackingRegressor` and pass it a `final_estimator`:
```python
>>> from sklearn.ensemble import GradientBoostingRegressor
>>> from sklearn.ensemble import StackingRegressor
>>> reg = StackingRegressor(
...     estimators=estimators,
...     final_estimator=GradientBoostingRegressor(random_state=42))
```

Finally, fit the resulting `StackingClassifier` or `StackingRegressor` like any other estimator:
```python
>>> # Note: load_boston was removed in scikit-learn 1.2; this snippet requires an older release.
>>> from sklearn.datasets import load_boston
>>> X, y = load_boston(return_X_y=True)
>>> from sklearn.model_selection import train_test_split
>>> X_train, X_test, y_train, y_test = train_test_split(X, y,
...                                                     random_state=42)
>>> reg.fit(X_train, y_train)
```

During training, as described above, the estimators in `estimators` are fitted on the whole training set. To generalize and avoid overfitting, `final_estimator` is trained on out-of-sample predictions, which are produced internally with [sklearn.model_selection.cross_val_predict](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_predict.html#sklearn.model_selection.cross_val_predict). For `StackingClassifier`, the output of each estimator in `estimators` is controlled by the `stack_method` parameter.

```python
>>> y_pred = reg.predict(X_test)
>>> from sklearn.metrics import r2_score
>>> print('R2 score: {:.2f}'.format(r2_score(y_test, y_pred)))
R2 score: 0.81
```

You can also use `transform` to get the outputs of the estimators in `estimators`:
```python
>>> reg.transform(X_test[:5])
array([[28.78..., 28.43..., 22.62...],
       [35.96..., 32.58..., 23.68...],
       [14.97..., 14.05..., 16.45...],
       [25.19..., 25.54..., 22.92...],
       [18.93..., 19.26..., 17.03...]])
```

When an estimator's `stack_method_` is `'predict_proba'` and the problem is binary classification, the first column is dropped during training. The two probability columns are perfectly collinear, so keeping both adds no information.

You can also nest stacked estimators, that is, `final_estimator` can itself be a `StackingClassifier` or `StackingRegressor`:
```python
>>> from sklearn.ensemble import RandomForestRegressor
>>> final_layer = StackingRegressor(
...     estimators=[('rf', RandomForestRegressor(random_state=42)),
...                 ('gbrt', GradientBoostingRegressor(random_state=42))],
...     final_estimator=RidgeCV()
...     )
>>> multi_layer_regressor = StackingRegressor(
...     estimators=[('ridge', RidgeCV()),
...                 ('lasso', LassoCV(random_state=42)),
...                 ('svr', SVR(C=1, gamma=1e-6, kernel='rbf'))],
...     final_estimator=final_layer
... )
>>> multi_layer_regressor.fit(X_train, y_train)
StackingRegressor(...)
>>> print('R2 score: {:.2f}'
...       .format(multi_layer_regressor.score(X_test, y_test)))
R2 score: 0.82
```

## Usage
```python
from sklearn.ensemble import StackingClassifier
```

### class
```python
sklearn.ensemble.StackingClassifier(estimators, final_estimator=None, cv=None, stack_method='auto', n_jobs=None, passthrough=False, verbose=0)
```

#### parameters
* estimators:
    * type: list of (str, estimator)
    * note: every element of the list must be a (name, estimator) tuple
* final_estimator:
    * type: estimator
    * default: None
    * note: when left as None, the classifier used is `LogisticRegression`
* cv:
    * type: int, cross-validation generator, or an iterable
    * default: None
    * options:
        * None uses the default 5-fold cross-validation
        * an integer sets the number of folds
        * a cross-validation generator object can be passed
        * an iterable can be passed
    * note:
        * For integer or None inputs, if the estimator is a classifier and `y` is binary or multiclass, `StratifiedKFold` is used; in all other cases `KFold` is used
        * See the [User Guide](https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation) for the other available cross-validation strategies
        * When the training set is large enough, a larger number of splits brings no benefit and only increases training time. `cv` is not used for model evaluation but for prediction
* stack_method:
    * type: str
    * default: auto
    * options:
        * auto: tries the following methods on each estimator, in this order, and uses the first one available
            * predict_proba
            * decision_function
            * predict
* n_jobs:
    * type: int
    * default: None
    * note:
        * number of jobs run in parallel while fitting
        * None means 1
        * -1 means use all processors
* passthrough:
    * type: bool
    * default: False
    * options:
        * False: `final_estimator` takes only the predictions of the classifiers in `estimators` as input
        * True: `final_estimator` takes the predictions of the classifiers in `estimators` together with the original training data as input

#### attributes
* estimators_: list of estimators
    * note: the elements of the `estimators` parameter, fitted on the training data. An estimator that was set to `'drop'` does not appear in this list
* named_estimators_: Bunch
    * note: access the fitted classifiers in `estimators_` by name
* final_estimator_: estimator
    * note: the classifier that predicts from the outputs of `estimators_`
* stack_method_: list of str
    * note: the method used by each estimator

#### methods
* decision_function
    * note: predicts the decision function using `final_estimator_.decision_function`
* fit
    * note: fits the estimators
* fit_transform
    * note: fits, then transforms
* get_params
    * note: gets the estimator's parameters
* predict
    * note: predicts the target class
* predict_proba
    * note: predicts class probabilities using `final_estimator_.predict_proba`
* score
    * note: returns the mean accuracy on the given data
* transform
    * note: returns the class labels or probabilities produced by each estimator
* set_params
    * note: sets the estimator's parameters

## Example
The example is taken from the official documentation:
```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import StackingClassifier
X, y = load_iris(return_X_y=True)
estimators = [
    ('rf', RandomForestClassifier(n_estimators=10, random_state=42)),
    ('svr', make_pipeline(StandardScaler(),
                          LinearSVC(random_state=42)))
]
clf = StackingClassifier(
    estimators=estimators, final_estimator=LogisticRegression()
)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)
clf.fit(X_train, y_train).score(X_test, y_test)
```
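
To see how `stack_method='auto'` and `passthrough` affect what `final_estimator` receives, the fitted attribute `stack_method_` and the shape returned by `transform` can be inspected. The following is a minimal sketch, assuming the same iris setup as above. The helper `build_stack`, the estimator name `'svc'`, and the `max_iter=1000` setting are illustrative choices that do not come from the official example.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)


def build_stack(passthrough):
    """Hypothetical helper: build the same stack with or without passthrough."""
    estimators = [
        ('rf', RandomForestClassifier(n_estimators=10, random_state=42)),
        ('svc', make_pipeline(StandardScaler(), LinearSVC(random_state=42))),
    ]
    return StackingClassifier(
        estimators=estimators,
        # max_iter raised (an assumption) to avoid convergence warnings
        final_estimator=LogisticRegression(max_iter=1000),
        cv=5,
        passthrough=passthrough,
    )


plain = build_stack(passthrough=False).fit(X_train, y_train)
wide = build_stack(passthrough=True).fit(X_train, y_train)

# The random forest exposes predict_proba; LinearSVC does not,
# so 'auto' falls back to decision_function for the pipeline.
print(plain.stack_method_)

# iris has 3 classes, so each base estimator contributes 3 columns.
print(plain.transform(X_test).shape)

# With passthrough=True the 4 original features are appended.
print(wide.transform(X_test).shape)
```

The column count returned by `transform` is the width of the input that `final_estimator_` was trained on, which makes it a quick check when mixing estimators with different `stack_method_` values.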
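
The training scheme described earlier (base estimators fitted on the full training set, final estimator trained on out-of-sample predictions) can be reproduced by hand with `cross_val_predict`. This is a simplified sketch of the idea, not scikit-learn's internal implementation. The choice of base models and the names `base_models`, `meta_train`, and `meta_test` are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

# Illustrative base models; both support predict_proba.
base_models = [
    RandomForestClassifier(n_estimators=10, random_state=42),
    KNeighborsClassifier(),
]

# Step 1: out-of-fold probabilities on the training set.
# cross_val_predict fits clones, so base_models stay unfitted here.
meta_train = np.hstack([
    cross_val_predict(model, X_train, y_train, cv=5, method='predict_proba')
    for model in base_models
])

# Step 2: refit every base model on the full training set.
for model in base_models:
    model.fit(X_train, y_train)

# Step 3: train the final estimator on the out-of-fold predictions only.
final = LogisticRegression(max_iter=1000).fit(meta_train, y_train)

# Prediction: base models produce features, the final estimator decides.
meta_test = np.hstack([model.predict_proba(X_test) for model in base_models])
accuracy = final.score(meta_test, y_test)
print('Manual stacking accuracy: {:.2f}'.format(accuracy))
```

Because the final estimator never sees predictions that a base model made on its own training samples, it cannot simply learn to trust an overfitted base model, which is the motivation for the cross-validated scheme.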