〈AI Study Notes〉 An Introduction to Automated Machine Learning and an Auto-Sklearn Walkthrough

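# 〈AI Study Notes〉 An Introduction to Automated Machine Learning and an Auto-Sklearn Walkthrough

## Introduction to Automated Machine Learning

Automated Machine Learning (AutoML), as the name suggests, automates the steps of training a model: data preprocessing, feature processing, model selection (or neural architecture search, NAS, when the models are neural networks), and hyperparameter optimization. The aim is to make the whole workflow as painless as possible.

## Hands-on with Auto-Sklearn and the Fetal Health dataset

Fetal Health is a dataset available on Kaggle with 21 feature columns and one target column, `['fetal_health']`. The model's goal is to classify three health states: 1-Normal; 2-Suspect; 3-Pathological.

Auto-Sklearn is an AutoML package; this walkthrough uses version 1. It does not seem to support operating systems other than Linux, so for convenience I use Google Colab. First, install the package with pip install. The errors that show up occur because the system cannot obtain the meta-feature data; this is dealt with later.

```python
!pip install auto-sklearn
# Restart the runtime after the installation finishes
```

Next, import the basic data-handling packages and read in the dataset.

```python
import numpy as np
import pandas as pd

df = pd.read_csv('/content/fetal_health.csv')
df  # 'fetal_health' encoded as 1-Normal; 2-Suspect; 3-Pathological.
```

Then comes preprocessing. All that is needed is to separate the feature columns from the target column; Auto-Sklearn automatically handles simple preprocessing such as missing-value imputation and one-hot encoding.

```python
from sklearn.model_selection import train_test_split

# Drop the target column 'fetal_health' and keep the remaining columns as features
X = df.drop(labels=['fetal_health'], axis=1).values
y = df['fetal_health'].values

# Stratified 80/20 split so both sets keep the same class proportions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print('train shape:', X_train.shape)
print('test shape:', X_test.shape)
```

Output:

```
train shape: (1700, 21)
test shape: (426, 21)
```

Build and fit the model:

```python
import autosklearn.classification

autoclassifier = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=180,   # total search budget, in seconds
    per_run_time_limit=40,         # budget for a single model run, in seconds
    resampling_strategy='cv',
    resampling_strategy_arguments={'folds': 5},
    include={
        'classifier': ['random_forest', 'sgd', 'adaboost', 'bernoulli_nb', 'gradient_boosting'],
        'feature_preprocessor': ['pca', 'kernel_pca', 'fast_ica', 'liblinear_svc_preprocessor'],
    },
    initial_configurations_via_metalearning=0,  # skip the meta-learning warm start
)
autoclassifier.fit(X_train, y_train)
```

As mentioned above, Google Colab may be unable to obtain the meta-feature data, so `fit` keeps printing error messages. Setting `initial_configurations_via_metalearning=0` removes them; it means the search is not initialized from meta-features.

We can use `include` to define the search space, restricting it to the classifiers, regressors, or preprocessors we want. The available options are listed here:

- [classifier](https://github.com/automl/auto-sklearn/tree/master/autosklearn/pipeline/components/classification)
- [regressor](https://github.com/automl/auto-sklearn/tree/master/autosklearn/pipeline/components/regression)
- [feature preprocessor](https://github.com/automl/auto-sklearn/tree/master/autosklearn/pipeline/components/feature_preprocessing)

Also pay attention to the type of the input data. A pandas DataFrame carries a per-column dtype, so the column types do not have to be declared to the model (declaring them is still recommended, so that the model can tell numerical columns from categorical ones). Here, however, `X` was converted with `.values` into a NumPy array, which has no per-column dtype, so we pass `feat_type` to `fit` to record the type of each column.

```python
feat_type = ['numerical'] * 21  # one entry per feature column
# X_test / y_test are optional; they let auto-sklearn record test performance during the search
autoclassifier.fit(X_train, y_train, X_test, y_test, feat_type=feat_type)
```

Finally, let's look at the results.

```python
print('Training set: ', autoclassifier.score(X_train, y_train))
print('Test set: ', autoclassifier.score(X_test, y_test))
```

Output:

```
Training set: 0.9994117647058823
Test set: 0.903755868544601
```

To inspect the trained models, use `autoclassifier.leaderboard()`:

```python
autoclassifier.leaderboard(detailed=True, ensemble_only=True)
```

## Conclusion

Because this is a simple example with a short search budget, the overall accuracy leaves room for improvement: the near-perfect training score against a test score of about 0.90 points to some overfitting. Even so, Auto-Sklearn is noticeably more convenient to use than a hand-built machine learning workflow. The newer Auto-Sklearn 2 is reported to reach higher accuracy; I will write about it when I get the chance.

###### tags: `AI` `Machine Learning`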
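As a supplement to the `feat_type` step above: when a dataset mixes numerical and categorical columns, writing the list by hand gets error-prone. The following is a minimal sketch of one way to build it from column names. The helper `build_feat_type` and the column names are hypothetical and not part of auto-sklearn; the only assumption about the library is that `feat_type` takes one `'numerical'` or `'categorical'` label per feature column, in column order, as in the call shown earlier.

```python
def build_feat_type(columns, categorical=()):
    """Return one 'numerical' or 'categorical' label per column, in column order.

    Hypothetical helper for illustration; not part of auto-sklearn.
    """
    columns = list(columns)
    categorical = set(categorical)
    # Fail early on typos instead of silently treating a column as numerical
    unknown = categorical - set(columns)
    if unknown:
        raise ValueError(f"unknown categorical columns: {sorted(unknown)}")
    return ['categorical' if c in categorical else 'numerical' for c in columns]


# Made-up column names, for illustration only
columns = ['feature_a', 'feature_b', 'feature_c']
feat_type = build_feat_type(columns, categorical=['feature_b'])
print(feat_type)  # ['numerical', 'categorical', 'numerical']
```

For the Fetal Health data, where all 21 features are treated as numerical, calling the helper on the 21 feature column names with no `categorical` argument yields the same list as `['numerical'] * 21`.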