# Release Highlights for scikit-learn 0.23 (translation)
###### tags: `scikit-learn` `sklearn` `python` `machine learning` `Release Highlights` `translation`
[Original article](https://scikit-learn.org/stable/auto_examples/release_highlights/plot_release_highlights_0_23_0.html)
We are pleased to announce the release of scikit-learn 0.23, which includes many bug fixes and new features! Below we detail some of the major features of this release. For the complete list of changes, see the [release notes](https://scikit-learn.org/stable/whats_new/v0.23.html#changes-0-23).
To install the latest version with pip:
```shell
pip install --upgrade scikit-learn
```
Or with conda:
```shell
conda install scikit-learn
```
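After upgrading, it can help to confirm which version the interpreter actually imports. The following is a minimal check, assuming scikit-learn is importable in the active environment; the variable name `has_023_features` is only illustrative.

```python
import re

import sklearn

# Print the installed version; the features described below need 0.23 or later.
print(sklearn.__version__)

# Compare the (major, minor) components numerically rather than as strings.
match = re.match(r"(\d+)\.(\d+)", sklearn.__version__)
major, minor = int(match.group(1)), int(match.group(2))
has_023_features = (major, minor) >= (0, 23)
print(has_023_features)
```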
## Generalized Linear Models, and Poisson loss for gradient boosting
The long-awaited Generalized Linear Models with non-normal loss functions are now available. In particular, three new regressors were implemented: [PoissonRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.PoissonRegressor.html#sklearn.linear_model.PoissonRegressor), [GammaRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.GammaRegressor.html#sklearn.linear_model.GammaRegressor), and [TweedieRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.TweedieRegressor.html#sklearn.linear_model.TweedieRegressor). The Poisson regressor can be used to model positive integer counts or relative frequencies. Read more in the [User Guide](https://scikit-learn.org/stable/modules/linear_model.html#generalized-linear-regression). In addition, [HistGradientBoostingRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.HistGradientBoostingRegressor.html#sklearn.ensemble.HistGradientBoostingRegressor) now supports a new 'poisson' loss.
```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import PoissonRegressor
from sklearn.experimental import enable_hist_gradient_boosting # noqa
from sklearn.ensemble import HistGradientBoostingRegressor
# Generate synthetic data
n_samples, n_features = 1000, 20
rng = np.random.RandomState(0)
X = rng.randn(n_samples, n_features)
# positive integer target correlated with X[:, 5] with many zeros:
y = rng.poisson(lam=np.exp(X[:, 5]) / 2)
# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=rng)
# Instantiate the regressors
glm = PoissonRegressor()
gbdt = HistGradientBoostingRegressor(loss='poisson', learning_rate=.01)
# Fit both models, then score them on the held-out test set
glm.fit(X_train, y_train)
gbdt.fit(X_train, y_train)
print(glm.score(X_test, y_test))
print(gbdt.score(X_test, y_test))
```
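The example above only exercises `PoissonRegressor`. As a rough sketch of how the other two regressors relate to it, the snippet below fits a `TweedieRegressor` with `power=1` and a log link, which selects the Poisson deviance, and a `GammaRegressor` on a strictly positive target. The synthetic data and the variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import GammaRegressor, PoissonRegressor, TweedieRegressor

rng = np.random.RandomState(0)
X = rng.randn(500, 5)

# Count target for the Poisson family (non-negative integers).
y_counts = rng.poisson(lam=np.exp(0.5 * X[:, 0]))
# Strictly positive continuous target for the Gamma family.
y_positive = rng.gamma(shape=2.0, scale=np.exp(0.5 * X[:, 0]) / 2.0)

poisson = PoissonRegressor(alpha=1.0).fit(X, y_counts)
# power=1 selects the Poisson deviance; link='log' matches PoissonRegressor.
tweedie_as_poisson = TweedieRegressor(power=1, alpha=1.0, link="log").fit(X, y_counts)
gamma = GammaRegressor(alpha=1.0).fit(X, y_positive)

# The two Poisson-type fits should agree up to solver tolerance.
print(np.abs(poisson.coef_ - tweedie_as_poisson.coef_).max())
# The log link keeps every prediction strictly positive.
print(gamma.predict(X[:5]))
```

Other values of `power` move between families (for example, `power=2` corresponds to the Gamma deviance), so `TweedieRegressor` can be treated as the general entry point when the right family is not known in advance.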
## Rich visual representation of estimators
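Estimators can now be visualized in notebooks by enabling the `display='diagram'` option. This is particularly useful for summarizing the structure of pipelines and other composite estimators, and the diagram is interactive so that elements can be expanded to show details. In the official documentation, clicking the rendered diagram expands the elements of the Pipeline. See [Visualizing Composite Estimators](https://scikit-learn.org/stable/modules/compose.html#visualizing-composite-estimators) for how to use this feature.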
```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LogisticRegression
from sklearn import set_config

# Render estimators as HTML diagrams in notebooks
set_config(display='diagram')
num_proc = make_pipeline(SimpleImputer(strategy='median'), StandardScaler())
cat_proc = make_pipeline(
    SimpleImputer(strategy='constant', fill_value='missing'),
    OneHotEncoder(handle_unknown='ignore'))
preprocessor = make_column_transformer((num_proc, ('feat1', 'feat3')),
                                       (cat_proc, ('feat0', 'feat2')))
clf = make_pipeline(preprocessor, LogisticRegression())
clf
```
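The diagram is produced by the notebook's rich display, so nothing appears when the code runs as a plain script. A possible workaround, sketched here with a smaller pipeline chosen for illustration, is to call `sklearn.utils.estimator_html_repr` and handle the returned HTML string yourself.

```python
from sklearn import set_config
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.utils import estimator_html_repr

pipe = make_pipeline(StandardScaler(), LogisticRegression())

# Render the diagram to an HTML string; no notebook is required for this call.
html = estimator_html_repr(pipe)
print(len(html))

# Switch the notebook display back to the plain-text repr.
set_config(display="text")
print(pipe)
```

The string can then be written to a file and opened in a browser, which is one way to share a pipeline's structure outside a notebook.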
## Scalability and stability improvements to KMeans
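The KMeans estimator was entirely reworked, and it is now significantly faster and more stable. In addition, the Elkan algorithm is now compatible with sparse matrices. The estimator uses OpenMP-based parallelism instead of relying on joblib, so the `n_jobs` parameter no longer has any effect. For details on how to control the number of threads, see [Parallelism](https://scikit-learn.org/stable/modules/computing.html#parallelism).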
```python
import scipy.sparse
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import completeness_score
rng = np.random.RandomState(0)
X, y = make_blobs(random_state=rng)
X = scipy.sparse.csr_matrix(X)
X_train, X_test, _, y_test = train_test_split(X, y, random_state=rng)
kmeans = KMeans(algorithm='elkan').fit(X_train)
# completeness_score expects (labels_true, labels_pred), in that order
print(completeness_score(y_test, kmeans.predict(X_test)))
```
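Since `n_jobs` no longer applies, the thread count has to be controlled at the OpenMP level. One option is the `OMP_NUM_THREADS` environment variable; another is the `threadpoolctl` package, as in the sketch below. It assumes `threadpoolctl` is installed, and the limit of two threads is an arbitrary illustrative value.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from threadpoolctl import threadpool_limits

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# Limit OpenMP to two threads for everything inside this block.
with threadpool_limits(limits=2, user_api="openmp"):
    kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print(kmeans.cluster_centers_.shape)
```

The context manager restores the previous limits on exit, so the restriction only applies to the fit inside the `with` block.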
## Improvements to the histogram-based Gradient Boosting estimators
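Various improvements were made to [HistGradientBoostingClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.HistGradientBoostingClassifier.html#sklearn.ensemble.HistGradientBoostingClassifier) and [HistGradientBoostingRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.HistGradientBoostingRegressor.html#sklearn.ensemble.HistGradientBoostingRegressor). On top of the Poisson loss mentioned above, these estimators now support [sample weights](https://scikit-learn.org/stable/modules/ensemble.html#sw-hgbdt). An automatic early-stopping criterion was also added: early stopping is enabled by default when the number of samples exceeds 10k. Finally, users can now define [monotonic constraints](https://scikit-learn.org/stable/modules/ensemble.html#monotonic-cst-gbdt) to constrain the predictions based on the variations of specific features. In the following example, we construct a target that is generally positively correlated with the first feature, with some noise. Applying monotonic constraints allows the prediction to capture the global effect of the first feature instead of fitting the noise.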
```python
import numpy as np
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.inspection import plot_partial_dependence
from sklearn.experimental import enable_hist_gradient_boosting # noqa
from sklearn.ensemble import HistGradientBoostingRegressor
# Generate synthetic data
n_samples = 500
rng = np.random.RandomState(0)
X = rng.randn(n_samples, 2)
noise = rng.normal(loc=0.0, scale=0.01, size=n_samples)
y = (5 * X[:, 0] + np.sin(10 * np.pi * X[:, 0]) - noise)
# Without monotonic constraints
gbdt_no_cst = HistGradientBoostingRegressor().fit(X, y)
# With monotonic constraints
gbdt_cst = HistGradientBoostingRegressor(monotonic_cst=[1, 0]).fit(X, y)
disp = plot_partial_dependence(
    gbdt_no_cst, X, features=[0], feature_names=['feature 0'],
    line_kw={'linewidth': 4, 'label': 'unconstrained'})
plot_partial_dependence(gbdt_cst, X, features=[0],
    line_kw={'linewidth': 4, 'label': 'constrained'}, ax=disp.axes_)
disp.axes_[0, 0].plot(X[:, 0], y, 'o', alpha=.5, zorder=-1, label='samples')
disp.axes_[0, 0].set_ylim(-3, 3); disp.axes_[0, 0].set_xlim(-1, 1)
plt.legend()
plt.show()
```
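The plot shows the effect of the constraint visually. The sketch below checks it numerically and also exercises the two other additions described above, sample weights and early stopping. The weighting scheme and the hyperparameter values are assumptions chosen for illustration, and the guarded import is only there so the snippet also runs on versions where the experimental flag is no longer required.

```python
import numpy as np

try:
    # Needed on scikit-learn 0.23, where these estimators were still experimental.
    from sklearn.experimental import enable_hist_gradient_boosting  # noqa
except ImportError:
    pass
from sklearn.ensemble import HistGradientBoostingRegressor

rng = np.random.RandomState(0)
X = rng.randn(500, 2)
y = 5 * X[:, 0] + np.sin(10 * np.pi * X[:, 0]) + rng.normal(scale=0.01, size=500)
# Hypothetical weights: trust samples near the origin more than outlying ones.
sample_weight = 1.0 / (1.0 + np.abs(X[:, 0]))

gbdt = HistGradientBoostingRegressor(
    monotonic_cst=[1, 0],     # predictions must not decrease with feature 0
    early_stopping=True,      # force early stopping even below 10k samples
    validation_fraction=0.2,
    n_iter_no_change=5,
    max_iter=200,
    random_state=0,
).fit(X, y, sample_weight=sample_weight)

# Number of boosting iterations actually run (at most max_iter).
print(gbdt.n_iter_)

# Sweep feature 0 while holding feature 1 fixed at zero.
grid = np.column_stack([np.linspace(-2, 2, 200), np.zeros(200)])
pred = gbdt.predict(grid)
print(bool(np.all(np.diff(pred) >= 0)))
```

Setting `early_stopping=True` explicitly is what makes the criterion apply here, since the dataset is far below the 10k-sample threshold at which it would turn on by default.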
## Sample-weight support for Lasso and ElasticNet
The two linear regressors Lasso and ElasticNet now support sample weights.
```python
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
import numpy as np
# Set the number of samples and features
n_samples, n_features = 1000, 20
# Create a random number generator
rng = np.random.RandomState(0)
X, y = make_regression(n_samples, n_features, random_state=rng)
print(X.shape, y.shape)
# Generate random sample weights
sample_weight = rng.rand(n_samples)
print(sample_weight.shape)
# Split the dataset; the weights are split along with X and y
X_train, X_test, y_train, y_test, sw_train, sw_test = train_test_split(
    X, y, sample_weight, random_state=rng)
print(X_train.shape, X_test.shape, sw_train.shape)
# Instantiate Lasso
reg = Lasso()
# Pass the sample weights when fitting
reg.fit(X_train, y_train, sample_weight=sw_train)
print(reg.score(X_test, y_test, sw_test))
```
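The example above covers Lasso only. ElasticNet accepts `sample_weight` in the same way, as the sketch below suggests; the data, the weights, and the `alpha` and `l1_ratio` values are assumptions picked for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet

rng = np.random.RandomState(0)
X, y = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)
# Hypothetical weights for illustration only.
sample_weight = rng.rand(200)

# l1_ratio mixes the L1 and L2 penalties; 0.5 weighs them equally.
weighted = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y, sample_weight=sample_weight)
unweighted = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)

# Largest coefficient difference between the weighted and unweighted fits.
print(np.abs(weighted.coef_ - unweighted.coef_).max())
print(weighted.score(X, y, sample_weight=sample_weight))
```

Comparing against an unweighted fit is a quick way to see how much the weights actually influence the model on a given dataset.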