# Release Highlights for scikit-learn 0.23 (Translation)

###### tags: `scikit-learn` `sklearn` `python` `machine learning` `Release Highlights` `translation`

[Original article](https://scikit-learn.org/stable/auto_examples/release_highlights/plot_release_highlights_0_23_0.html)

We are pleased to announce the release of scikit-learn 0.23, which comes with many bug fixes and new features! Below we describe some of the major features of this release. For an exhaustive list of all the changes, please refer to the [release notes](https://scikit-learn.org/stable/whats_new/v0.23.html#changes-0-23).

To install the latest version with pip:

```shell
pip install --upgrade scikit-learn
```

or with conda:

```shell
conda install scikit-learn
```

## Generalized Linear Models, and Poisson loss for gradient boosting

Long-awaited Generalized Linear Models with non-normal loss functions are now available. In particular, three new regressors were added: [PoissonRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.PoissonRegressor.html#sklearn.linear_model.PoissonRegressor), [GammaRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.GammaRegressor.html#sklearn.linear_model.GammaRegressor) and [TweedieRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.TweedieRegressor.html#sklearn.linear_model.TweedieRegressor). The Poisson regressor can be used to model positive integer counts or relative frequencies. Read more in the [User Guide](https://scikit-learn.org/stable/modules/linear_model.html#generalized-linear-regression). Additionally, [HistGradientBoostingRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.HistGradientBoostingRegressor.html#sklearn.ensemble.HistGradientBoostingRegressor) supports a new `'poisson'` loss as well.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import PoissonRegressor
from sklearn.experimental import enable_hist_gradient_boosting  # noqa
from sklearn.ensemble import HistGradientBoostingRegressor

# generate dummy data
n_samples, n_features = 1000, 20
rng = np.random.RandomState(0)
X = rng.randn(n_samples, n_features)
# positive integer target correlated with X[:, 5] with many zeros:
y = rng.poisson(lam=np.exp(X[:, 5]) / 2)

# split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=rng)

# instantiate the regressors
glm = PoissonRegressor()
gbdt = HistGradientBoostingRegressor(loss='poisson', learning_rate=.01)

# fit both models
glm.fit(X_train, y_train)
gbdt.fit(X_train, y_train)
print(glm.score(X_test, y_test))
print(gbdt.score(X_test, y_test))
```
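The example above only exercises `PoissonRegressor`. For completeness, here is a minimal sketch (not part of the original post) of the other two new estimators, `GammaRegressor` and `TweedieRegressor`, on a synthetic strictly positive target; the data-generating process and the `power=1.5` setting are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import GammaRegressor, TweedieRegressor

# illustrative strictly positive target, roughly log-linear in X[:, 0]
rng = np.random.RandomState(0)
X = rng.randn(1000, 5)
y = rng.gamma(shape=2., scale=np.exp(X[:, 0]) / 2)

gamma = GammaRegressor().fit(X, y)
# power=1.5 selects a compound Poisson-Gamma distribution, sitting
# between Poisson (power=1) and Gamma (power=2)
tweedie = TweedieRegressor(power=1.5).fit(X, y)
print(gamma.score(X, y))
print(tweedie.score(X, y))
```

With these settings both models use a log link, which matches the multiplicative structure of the synthetic target above.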
## Rich visual representation of estimators

Estimators can now be visualized in notebooks by enabling the `display='diagram'` option. This is particularly useful to summarize the structure of pipelines and other composite estimators, with interactivity to provide detail. Click on the image below to expand the elements of the Pipeline <sub>(the interactive diagram is available in the official documentation)</sub>. See [Visualizing Composite Estimators](https://scikit-learn.org/stable/modules/compose.html#visualizing-composite-estimators) for how to use this feature.

```python
from sklearn import set_config
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LogisticRegression

set_config(display='diagram')

num_proc = make_pipeline(SimpleImputer(strategy='median'), StandardScaler())

cat_proc = make_pipeline(
    SimpleImputer(strategy='constant', fill_value='missing'),
    OneHotEncoder(handle_unknown='ignore'))

preprocessor = make_column_transformer((num_proc, ('feat1', 'feat3')),
                                       (cat_proc, ('feat0', 'feat2')))

clf = make_pipeline(preprocessor, LogisticRegression())
clf
```

## Scalability and stability improvements to KMeans

The KMeans estimator was entirely re-worked, and it is now significantly faster and more stable. In addition, the Elkan algorithm is now compatible with sparse matrices. The estimator uses OpenMP-based parallelism instead of relying on joblib, so the `n_jobs` parameter has no effect anymore. For more details on how to control the number of threads, please refer to [Parallelism](https://scikit-learn.org/stable/modules/computing.html#parallelism).

```python
import scipy
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import completeness_score

rng = np.random.RandomState(0)
X, y = make_blobs(random_state=rng)
X = scipy.sparse.csr_matrix(X)
X_train, X_test, _, y_test = train_test_split(X, y, random_state=rng)
kmeans = KMeans(algorithm='elkan').fit(X_train)
print(completeness_score(kmeans.predict(X_test), y_test))
```

## Improvements to the histogram-based Gradient Boosting estimators

Various improvements were made to [HistGradientBoostingClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.HistGradientBoostingClassifier.html#sklearn.ensemble.HistGradientBoostingClassifier) and [HistGradientBoostingRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.HistGradientBoostingRegressor.html#sklearn.ensemble.HistGradientBoostingRegressor). On top of the Poisson loss mentioned above, these estimators now support [sample weights](https://scikit-learn.org/stable/modules/ensemble.html#sw-hgbdt). Also, an automatic early-stopping criterion was added: early stopping is enabled by default when the number of samples exceeds 10k. Finally, users can now define [monotonic constraints](https://scikit-learn.org/stable/modules/ensemble.html#monotonic-cst-gbdt) to constrain the predictions based on the variations of specific features. In the following example, we construct a target that is generally positively correlated with the first feature, with some noise. Applying monotonic constraints allows the predictions to capture the global effect of the first feature, instead of fitting the noise.

```python
import numpy as np
from matplotlib import pyplot as plt
from sklearn.inspection import plot_partial_dependence
from sklearn.experimental import enable_hist_gradient_boosting  # noqa
from sklearn.ensemble import HistGradientBoostingRegressor

# generate dummy data
n_samples = 500
rng = np.random.RandomState(0)
X = rng.randn(n_samples, 2)
noise = rng.normal(loc=0.0, scale=0.01, size=n_samples)
y = (5 * X[:, 0] + np.sin(10 * np.pi * X[:, 0]) - noise)

# without monotonic constraints
gbdt_no_cst = HistGradientBoostingRegressor().fit(X, y)
# with a positive monotonic constraint on the first feature
gbdt_cst = HistGradientBoostingRegressor(monotonic_cst=[1, 0]).fit(X, y)

disp = plot_partial_dependence(
    gbdt_no_cst, X, features=[0], feature_names=['feature 0'],
    line_kw={'linewidth': 4, 'label': 'unconstrained'})
plot_partial_dependence(gbdt_cst, X, features=[0],
                        line_kw={'linewidth': 4, 'label': 'constrained'},
                        ax=disp.axes_)
disp.axes_[0, 0].plot(X[:, 0], y, 'o', alpha=.5, zorder=-1, label='samples')
disp.axes_[0, 0].set_ylim(-3, 3)
disp.axes_[0, 0].set_xlim(-1, 1)
plt.legend()
plt.show()
```

## Sample-weight support for Lasso and ElasticNet

The two linear regressors Lasso and ElasticNet now support sample weights as well.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# set the number of samples and features
n_samples, n_features = 1000, 20

# define a random number generator
rng = np.random.RandomState(0)
X, y = make_regression(n_samples, n_features, random_state=rng)
print(X.shape, y.shape)

# generate random sample weights
sample_weight = rng.rand(n_samples)
print(sample_weight.shape)

# split the dataset, splitting the sample weights along with it
X_train, X_test, y_train, y_test, sw_train, sw_test = train_test_split(
    X, y, sample_weight, random_state=rng)
print(X_train.shape, X_test.shape, sw_train.shape)

# instantiate the Lasso regressor
reg = Lasso()

# pass the sample weights when fitting
reg.fit(X_train, y_train, sample_weight=sw_train)
print(reg.score(X_test, y_test, sw_test))
```
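The example demonstrates Lasso only, but `ElasticNet` accepts `sample_weight` in exactly the same way. Here is a minimal self-contained sketch (the `alpha=0.1` value is an illustrative assumption, not from the original post):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X, y = make_regression(n_samples=1000, n_features=20, random_state=rng)
sample_weight = rng.rand(1000)

# keep the weights aligned with the train/test split
X_train, X_test, y_train, y_test, sw_train, sw_test = train_test_split(
    X, y, sample_weight, random_state=rng)

reg = ElasticNet(alpha=0.1)
reg.fit(X_train, y_train, sample_weight=sw_train)
print(reg.score(X_test, y_test, sample_weight=sw_test))
```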