# Scikit-Learn sklearn.preprocessing.RobustScaler
###### tags: `scikit-learn` `sklearn` `python` `machine learning` `preprocessing`

>[name=Marty.chen ] [time=Wed, May 13]

>All example data below comes from the official documentation

>[HackMD hyperlink](https://hackmd.io/@shaoeChen/r1CQ9VY98)

:::danger
Official documentation:
* [API](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.RobustScaler.html)
* [Example](https://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html#sphx-glr-auto-examples-preprocessing-plot-all-scaling-py)
* [Interquartile range](https://zh.wikipedia.org/wiki/%E5%9B%9B%E5%88%86%E4%BD%8D%E8%B7%9D)
:::

## Description
When building a machine learning model, feature preprocessing is an unavoidable step. Each feature has its own range and unit, some large and some small, so the features do not live on the same scale, and a feature with a larger range can end up with too much influence on the model.

For example, consider a house with three bedrooms, two living rooms, and two bathrooms, 30 years old, priced at 20,000 per ping (坪, a unit of floor area), with postal code 200. The price of 20,000 per ping has a far larger range than any of the other features, which has a negative effect on the model. Every feature therefore needs to be preprocessed so that all of them are scaled to a comparable size.

For the same reason, in practice we compute the scaling statistics of each feature from the whole training set, and at test time we apply that same transformation to the test data so that it ends up on the same scale. Put simply, your 100 is not my 100: you may be talking about 100 Japanese yen while I mean 100 New Taiwan dollars, but once both are converted to US dollars with the same rule, we are talking about prices on the same scale.

> If your data contains many outliers, scaling using the mean and variance of the data is likely to not work very well. In these cases, you can use robust_scale and RobustScaler as drop-in replacements instead. They use more robust estimates for the center and range of your data.

This passage from [the documentation](https://scikit-learn.org/stable/modules/preprocessing.html#scaling-data-with-outliers) explains that this approach is more robust to outliers.

### Formula
RobustScaler scales data using the interquartile range (IQR): each value becomes `(value - median) / IQR`, where `IQR = Q3 - Q1`. The computational details are covered in the examples.

## Usage
```python
from sklearn.preprocessing import RobustScaler
```

### class
```python
sklearn.preprocessing.RobustScaler(*, with_centering=True, with_scaling=True, quantile_range=(25.0, 75.0), copy=True)
```

#### parameters
* with_centering:
    * type: boolean
    * default: True
    * note: if True, the data is centered by subtracting the median from each value
* with_scaling:
    * type: boolean
    * default: True
    * note: if True, the data is scaled by dividing each value by the interquartile range
* quantile_range:
    * type: tuple
    * default: (25.0, 75.0)
    * notes: the pair of percentiles used as the first and third quartiles when computing the range; changing it changes which quantiles define the range, see the examples

#### attributes
* center_: ndarray or None, shape (n_features,)
    * note: the median of each feature in the training set. `None` when `with_centering=False`
* scale_: ndarray or None, shape (n_features,)
    * note: the interquartile range (as defined by `quantile_range`) of each feature in the training set. `None` when `with_scaling=False`

## Examples
See the notebook on [github](https://github.com/shaoeChen/sklearn_api_resource/blob/master/sklearn_example/Scikit-Learn%20sklearn.preprocessing.RobustScaler.ipynb)
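
As a small complement to the notebook, here is a minimal sketch of the `(value - median) / (Q3 - Q1)` computation checked against `RobustScaler`. The toy arrays `X_train` and `X_test` and the variable names are made up for illustration and do not come from the official documentation; the manual computation assumes NumPy's default (linear interpolation) percentiles.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Hypothetical toy data: a single feature with one obvious outlier (100.0).
X_train = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])
X_test = np.array([[3.0], [5.0]])

# Fit on the training set only; defaults use quantile_range=(25.0, 75.0).
scaler = RobustScaler()
X_train_scaled = scaler.fit_transform(X_train)

# The same computation by hand: (value - median) / (Q3 - Q1).
median = np.median(X_train, axis=0)                    # [3.]
q1, q3 = np.percentile(X_train, [25.0, 75.0], axis=0)  # [2.], [4.]
manual = (X_train - median) / (q3 - q1)

print(scaler.center_)          # median per feature: [3.]
print(scaler.scale_)           # IQR per feature: [2.]
print(X_train_scaled.ravel())  # [-1.  -0.5  0.   0.5 48.5]

# Reuse the statistics learned from the training set on the test set.
X_test_scaled = scaler.transform(X_test)
print(X_test_scaled.ravel())   # [0. 1.]

# A wider quantile_range changes which percentiles define the range.
wide = RobustScaler(quantile_range=(10.0, 90.0)).fit(X_train)
q10, q90 = np.percentile(X_train, [10.0, 90.0], axis=0)
print(wide.scale_, q90 - q10)
```

The outlier still maps to a large value (48.5), but it barely affects the median and IQR, so the remaining points keep a sensible spread. The test set is transformed with the statistics fitted on the training set and is never refitted.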