
Auto-sklearn
===
###### tags: `ML / ensemble`
###### tags: `ML`, `AutoML`, `sklearn`, `autosklearn`, `AI Maker`

[TOC]

## Official resources
- [Manual](https://automl.github.io/auto-sklearn/master/manual.html)
- [Classifiers currently implemented](https://github.com/automl/auto-sklearn/tree/master/autosklearn/pipeline/components/classification)
- [Regressors currently implemented](https://github.com/automl/auto-sklearn/tree/master/autosklearn/pipeline/components/regression)

## Third-party resources
- [[Introduction] Auto-Sklearn: An AutoML tool based on Bayesian Optimization](https://towardsdatascience.com/auto-sklearn-an-automl-tool-based-on-bayesian-optimization-91a8e1b26c22)
  ![](https://i.imgur.com/45QvkEj.png)
- [4 Python AutoML Libraries Every Data Scientist Should Know](https://towardsdatascience.com/4-python-automl-libraries-every-data-scientist-should-know-680ff5d6ad08)
    1. auto-sklearn
    2. TPOT
    3. HyperOpt
    4. AutoKeras
    - Comparison: which one should I use?
- What is AutoML
    - [Create, review, and deploy automated machine learning models with Azure Machine Learning](https://docs.microsoft.com/zh-tw/azure/machine-learning/how-to-use-automated-ml-for-ml-models)
        - ==**Automated machine learning**== is the process of selecting the best machine learning algorithm for a specific dataset. This process lets you produce machine learning models quickly.
    - [For a Python code-based experience, configure automated machine learning experiments with the Azure Machine Learning SDK](https://docs.microsoft.com/zh-tw/azure/machine-learning/how-to-configure-auto-train)
        - The automated machine learning service picks the algorithm and hyperparameters for you and produces a model that is ready to deploy.

## [Parameters](https://automl.github.io/auto-sklearn/master/api.html)

### time_left_for_this_task (total time for the task) : int, optional
> In plain terms: the **total time limit** for searching hyperparameters and the best model
> default=3600
> The default is ++**3600 seconds = 60 minutes = 1 hour**++
> Time limit in seconds for the search of appropriate models. By increasing this value, *auto-sklearn* has a higher chance of finding better models.

### per_run_time_limit (time limit for a single run) : int, optional
> In plain terms: the **time limit** for one trial of one chosen model with one set of parameters
> default=1/10 of time_left_for_this_task
> The default is one tenth of time_left_for_this_task, which is ++**360 seconds (6 minutes)**++ when time_left_for_this_task keeps its default of 3600
> Time limit for a single call to the machine learning model. Model fitting will be terminated if the machine learning algorithm runs over the time limit. Set this value high enough so that typical machine learning algorithms can be fit on the training data.

### initial_configurations_via_metalearning (warm-start the search via meta-learning): int, optional
> In plain terms: start the hyperparameter search from configurations that already performed well on previously seen datasets
> default=25
> Initialize the hyperparameter optimization algorithm with this many configurations which worked well on previously seen datasets. Disable if the hyperparameter optimization algorithm should start from scratch.
>
> To disable: `initial_configurations_via_metalearning=0`

### ensemble_size : int, optional (default=50)
> Number of models added to the ensemble built by *Ensemble selection from libraries of models*. Models are drawn with replacement.

### ensemble_nbest : int, optional (default=50)
> Only consider the ``ensemble_nbest`` models when building an ensemble.

### max_models_on_disc: int, optional (default=50)
> Defines the maximum number of models that are kept in the disc.
> The additional number of models are permanently deleted. Due to the nature of this variable, it sets the upper limit on how many models can be used for an ensemble.
> It must be an integer greater or equal than 1.
> If set to None, all models are kept on the disc.
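
The parameters above are passed as keyword arguments to the estimator's constructor. The following is a minimal sketch rather than a recommended configuration: `build_search_params` and `make_classifier` are hypothetical helpers (not part of auto-sklearn) that make the documented default `per_run_time_limit = time_left_for_this_task / 10` explicit. The `autosklearn` import is deferred into `make_classifier`, so the helper that builds the dictionary can be run without auto-sklearn installed.

```python
def build_search_params(time_left_for_this_task=3600,
                        per_run_time_limit=None,
                        initial_configurations_via_metalearning=25,
                        ensemble_nbest=50,
                        max_models_on_disc=50):
    """Hypothetical helper: collect the keyword arguments described above."""
    if per_run_time_limit is None:
        # Mirror the documented default: one tenth of the total budget
        per_run_time_limit = time_left_for_this_task // 10
    return {
        "time_left_for_this_task": time_left_for_this_task,
        "per_run_time_limit": per_run_time_limit,
        "initial_configurations_via_metalearning": initial_configurations_via_metalearning,
        "ensemble_nbest": ensemble_nbest,
        "max_models_on_disc": max_models_on_disc,
    }


def make_classifier(**overrides):
    """Hypothetical helper: build a classifier from the parameters above."""
    # Deferred import: this line requires auto-sklearn to be installed
    from autosklearn.classification import AutoSklearnClassifier
    return AutoSklearnClassifier(**build_search_params(**overrides))


# A 120-second search with meta-learning disabled (starts from scratch)
params = build_search_params(time_left_for_this_task=120,
                             initial_configurations_via_metalearning=0)
print(params["per_run_time_limit"])
```

With auto-sklearn installed, `make_classifier(time_left_for_this_task=120)` would return an unfitted `AutoSklearnClassifier` using the same values.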
# [API reference](https://automl.github.io/auto-sklearn/master/api.html)

## Summary of training results

### [.sprint_statistics()](https://automl.github.io/auto-sklearn/master/api.html#autosklearn.classification.AutoSklearnClassifier.sprint_statistics)
> Return the following statistics of the training result:
>
> - dataset name
> - metric used
> - best validation score
> - number of target algorithm runs
> - number of successful target algorithm runs
> - number of crashed target algorithm runs
> - number of target algorithm runs that exceeded the memory limit
> - number of target algorithm runs that exceeded the time limit

```
target algorithm runs = successful + crashed + exceeded memory limit + exceeded time limit
```

Example:
```
auto-sklearn results:
  Dataset name: 18bcd0902df870e04d5fd1d456d109cf
  Metric: accuracy
  Best validation score: 0.797441
  Number of target algorithm runs: 51
  Number of successful target algorithm runs: 31
  Number of crashed target algorithm runs: 6
  Number of target algorithms that exceeded the time limit: 9
  Number of target algorithms that exceeded the memory limit: 5
```
```
target algorithm runs 51 = successful 31 + crashed 6 + timed out 9 + out of memory 5
```

## Models in the trained ensemble

### [.show_models()](https://automl.github.io/auto-sklearn/master/api.html#autosklearn.classification.AutoSklearnClassifier.show_models)

### [.get_models_with_weights()](https://automl.github.io/auto-sklearn/master/api.html#autosklearn.experimental.askl2.AutoSklearn2Classifier.get_models_with_weights)
- .get_models_with_weights()[0] accesses the 1st entry, a (weight, model) pair
- .get_models_with_weights()[1] accesses the 2nd entry

## Hyperparameter space

### .get_configuration_space()

## Q&A
- ### Are the results reproducible?
    - keywords
        - the reproducibility of the results
    - [[auto-sklearn] api](https://automl.github.io/auto-sklearn/master/api.html)
        - seed
        - smac_scenario_args
    - [SMAC3 documentation!](https://automl.github.io/SMAC3/master/index.html)
      ![](https://i.imgur.com/E0AyaZm.png)
    - [Modify Stopping Criterion #451](https://github.com/automl/auto-sklearn/issues/451)
    - [[auto-sklearn] Random Search](https://automl.github.io/auto-sklearn/master/examples/60_search/example_random_search.html)
        > A crucial feature of auto-sklearn is automatically optimizing the hyperparameters through SMAC, introduced [here](http://ml.informatik.uni-freiburg.de/papers/11-LION5-SMAC.pdf).
        > - [Sequential Model-Based Optimization for General Algorithm Configuration](http://ml.informatik.uni-freiburg.de/papers/11-LION5-SMAC.pdf)
    - Stack Overflow
        - [How to get absolutely reproducible results with Scikit Learn?](https://stackoverflow.com/questions/52746279)
        - [Python sklearn RandomForestClassifier non-reproducible results](https://stackoverflow.com/questions/47433920)
    - Code currently under test
      ```python
      import autosklearn.classification
      import numpy as np
      from sklearn.datasets import make_classification

      # Synthetic binary classification data; random_state fixes the dataset itself
      X, y = make_classification(n_samples=1000, n_features=4,
                                 n_informative=2, n_redundant=0,
                                 random_state=0, shuffle=False)

      # First run: seed NumPy's global RNG, then search for 120 seconds
      np.random.seed(1234)
      clf1 = autosklearn.classification.AutoSklearnClassifier(
          time_left_for_this_task=120,
          memory_limit=1024*12  # in MB
      )
      clf1.fit(X, y, dataset_name='random')
      print('done-1')

      # Second run: same NumPy seed and same settings
      np.random.seed(1234)
      clf2 = autosklearn.classification.AutoSklearnClassifier(
          time_left_for_this_task=120,
          memory_limit=1024*12  # in MB
      )
      clf2.fit(X, y, dataset_name='random')
      print('done-2')

      # Compare the predictions and the search statistics of the two runs
      print("same results:", all(clf1.predict(X) == clf2.predict(X)))
      print(clf1.sprint_statistics())
      print(clf2.sprint_statistics())
      ```
      Test result (the two runs differ, both in predictions and in the number of successful and timed-out runs):
      ```
      same results: False
      auto-sklearn results:
        Dataset name: random
        Metric: accuracy
        Best validation score: 0.966667
        Number of target algorithm runs: 26
        Number of successful target algorithm runs: 20
        Number of crashed target algorithm runs: 5
        Number of target algorithms that exceeded the time limit: 1
        Number of target algorithms that exceeded the memory limit: 0

      auto-sklearn results:
        Dataset name: random
        Metric: accuracy
        Best validation score: 0.966667
        Number of target algorithm runs: 26
        Number of successful target algorithm runs: 19
        Number of crashed target algorithm runs: 5
        Number of target algorithms that exceeded the time limit: 2
        Number of target algorithms that exceeded the memory limit: 0
      ```

<hr>

# AI Maker tests

## [01] oai.comp.c2m4
- The container has 4 GB of memory
- but the code sets the memory limit to 12 GB
  ```
  params['memory_limit'] = 1024 * 12 # (in MB) (default: 3072 MB)
  ```
- Environment setup
  ```bash
  mkdir /input
  mkdir /output
  export INPUT=/input
  export OUTPUT=/output
  export DEBUG=True
  export TASK=classification
  export AutoSklearnClassifier_param_time_left_for_this_task=120
  export AutoSklearnClassifier_param_per_run_time_limit=60
  ```
- The fit step did stop within the time limit
  ```
  [debug] is saving to the model_file: /output/model.pkl
  [debug][fit] elapsed time: 119.5 (sec)
  ```
- The run had still not finished after 10 minutes, so it was interrupted manually (the traceback shows it was inside the prediction call of the report step)
  ```
  Traceback (most recent call last):
    File "/workspace/main.py", line 139, in <module>
      main()
    File "/workspace/main.py", line 39, in main
      train(context)
    File "/workspace/trainer.py", line 59, in train
      model_instance.report()
    File "/workspace/model/classification_model.py", line 87, in report
      super(AutoSklearnClassifier, self).report()
    File "/workspace/model/abstract_model.py", line 289, in report
      self.dump_metrics(x, y)
    File "/workspace/model/abstract_model.py", line 292, in dump_metrics
      metric_dict = self.get_metrics(x, y_true)
    File "/workspace/model/abstract_model.py", line 304, in get_metrics
      y_pred = self.predict(x)
    File "/workspace/model/abstract_model.py", line 233, in predict
      return self._model.predict(x)
    File "/usr/local/miniconda3/envs/autosklearn/lib/python3.9/site-packages/sklearn/utils/metaestimators.py", line 120, in <lambda>
      out = lambda *args, **kwargs: self.fn(obj, *args, **kwargs)
    File "/usr/local/miniconda3/envs/autosklearn/lib/python3.9/site-packages/sklearn/pipeline.py", line 419, in predict
      return self.steps[-1][-1].predict(Xt, **predict_params)
    File "/usr/local/miniconda3/envs/autosklearn/lib/python3.9/site-packages/autosklearn/estimators.py", line 1476, in predict
      return super().predict(X, batch_size=batch_size, n_jobs=n_jobs)
    File "/usr/local/miniconda3/envs/autosklearn/lib/python3.9/site-packages/autosklearn/estimators.py", line 798, in predict
      return self.automl_.predict(X, batch_size=batch_size, n_jobs=n_jobs)
    File "/usr/local/miniconda3/envs/autosklearn/lib/python3.9/site-packages/autosklearn/automl.py", line 2347, in predict
      predicted_probabilities = super().predict(
    File "/usr/local/miniconda3/envs/autosklearn/lib/python3.9/site-packages/autosklearn/automl.py", line 1466, in predict
      all_predictions = joblib.Parallel(n_jobs=n_jobs)(
    File "/usr/local/miniconda3/envs/autosklearn/lib/python3.9/site-packages/joblib/parallel.py", line 1088, in __call__
      while self.dispatch_one_batch(iterator):
    File "/usr/local/miniconda3/envs/autosklearn/lib/python3.9/site-packages/joblib/parallel.py", line 901, in dispatch_one_batch
      self._dispatch(tasks)
    File "/usr/local/miniconda3/envs/autosklearn/lib/python3.9/site-packages/joblib/parallel.py", line 819, in _dispatch
      job = self._backend.apply_async(batch, callback=cb)
    File "/usr/local/miniconda3/envs/autosklearn/lib/python3.9/site-packages/joblib/_parallel_backends.py", line 208, in apply_async
      result = ImmediateResult(func)
    File "/usr/local/miniconda3/envs/autosklearn/lib/python3.9/site-packages/joblib/_parallel_backends.py", line 597, in __init__
      self.results = batch()
    File "/usr/local/miniconda3/envs/autosklearn/lib/python3.9/site-packages/joblib/parallel.py", line 288, in __call__
      return [func(*args, **kwargs)
    File "/usr/local/miniconda3/envs/autosklearn/lib/python3.9/site-packages/joblib/parallel.py", line 288, in <listcomp>
      return [func(*args, **kwargs)
    File "/usr/local/miniconda3/envs/autosklearn/lib/python3.9/site-packages/autosklearn/automl.py", line 194, in _model_predict
      prediction = predict_func(X_)
    File "/usr/local/miniconda3/envs/autosklearn/lib/python3.9/site-packages/autosklearn/pipeline/classification.py", line 144, in predict_proba
      return super().predict_proba(X)
    File "/usr/local/miniconda3/envs/autosklearn/lib/python3.9/site-packages/sklearn/utils/metaestimators.py", line 120, in <lambda>
      out = lambda *args, **kwargs: self.fn(obj, *args, **kwargs)
    File "/usr/local/miniconda3/envs/autosklearn/lib/python3.9/site-packages/sklearn/pipeline.py", line 475, in predict_proba
      return self.steps[-1][-1].predict_proba(Xt)
    File "/usr/local/miniconda3/envs/autosklearn/lib/python3.9/site-packages/autosklearn/pipeline/components/classification/__init__.py", line 151, in predict_proba
      return self.choice.predict_proba(X)
    File "/usr/local/miniconda3/envs/autosklearn/lib/python3.9/site-packages/autosklearn/pipeline/components/classification/gradient_boosting.py", line 170, in predict_proba
      return self.estimator.predict_proba(X)
    File "/usr/local/miniconda3/envs/autosklearn/lib/python3.9/site-packages/sklearn/ensemble/_hist_gradient_boosting/gradient_boosting.py", line 1379, in predict_proba
      raw_predictions = self._raw_predict(X)
    File "/usr/local/miniconda3/envs/autosklearn/lib/python3.9/site-packages/sklearn/ensemble/_hist_gradient_boosting/gradient_boosting.py", line 750, in _raw_predict
      self._predict_iterations(
    File "/usr/local/miniconda3/envs/autosklearn/lib/python3.9/site-packages/sklearn/ensemble/_hist_gradient_boosting/gradient_boosting.py", line 773, in _predict_iterations
      raw_predictions[k, :] += predict(X)
    File "/usr/local/miniconda3/envs/autosklearn/lib/python3.9/site-packages/sklearn/ensemble/_hist_gradient_boosting/predictor.py", line 66, in predict
      _predict_from_raw_data(self.nodes, X, self.raw_left_cat_bitsets,
  KeyboardInterrupt
  Traceback (most recent call last):
    File "/workspace/main.py", line 139, in <module>
      main()
    File "/workspace/main.py", line 33, in main
      switch_to_conda_env_if_needed(args, context)
    File "/workspace/main.py", line 102, in switch_to_conda_env_if_needed
      process = subprocess.run(cmd, shell=True, capture_output=False)
    File "/usr/local/miniconda3/lib/python3.9/subprocess.py", line 507, in run
      stdout, stderr = process.communicate(input, timeout=timeout)
    File "/usr/local/miniconda3/lib/python3.9/subprocess.py", line 1126, in communicate
      self.wait()
    File "/usr/local/miniconda3/lib/python3.9/subprocess.py", line 1189, in wait
      return self._wait(timeout=timeout)
    File "/usr/local/miniconda3/lib/python3.9/subprocess.py", line 1917, in _wait
      (pid, sts) = self._try_wait(0)
    File "/usr/local/miniconda3/lib/python3.9/subprocess.py", line 1875, in _try_wait
      (pid, sts) = os.waitpid(self.pid, wait_flags)
  KeyboardInterrupt
  ```

### If the code does not set memory_limit
When the following line is left out, so that the 3072 MB default applies:
```
params['memory_limit'] = 1024 * 12 # (in MB) (default: 3072 MB)
```
the following error occurs:
```
[ERROR] [2023-08-04 10:11:11,847:Client-AutoML(1):2d5c4e56-326c-11ee-a32d-061bb0d7f1e3] (" Dummy prediction failed with run state StatusType.MEMOUT and additional output: {'error': 'Memout (used more than 3072 MB).', 'configuration_origin': 'DUMMY'}.",)
[ERROR] [2023-08-04 10:11:11,847:Client-AutoML(1):2d5c4e56-326c-11ee-a32d-061bb0d7f1e3] (" Dummy prediction failed with run state StatusType.MEMOUT and additional output: {'error': 'Memout (used more than 3072 MB).', 'configuration_origin': 'DUMMY'}.",)
Traceback (most recent call last):
  File "/usr/local/miniconda3/envs/autosklearn/lib/python3.9/site-packages/autosklearn/automl.py", line 765, in fit
    self._do_dummy_prediction()
  File "/usr/local/miniconda3/envs/autosklearn/lib/python3.9/site-packages/autosklearn/automl.py", line 489, in _do_dummy_prediction
    raise ValueError(msg)
ValueError: (" Dummy prediction failed with run state StatusType.MEMOUT and additional output: {'error': 'Memout (used more than 3072 MB).', 'configuration_origin': 'DUMMY'}.",)
Traceback (most recent call last):
  File "/workspace/main.py", line 139, in <module>
    main()
  File "/workspace/main.py", line 39, in main
    train(context)
  File "/workspace/trainer.py", line 55, in train
    model_instance.fit()
  File "/workspace/model/abstract_model.py", line 196, in fit
    self._model.fit(self._x, self._y)
  File "/usr/local/miniconda3/envs/autosklearn/lib/python3.9/site-packages/sklearn/pipeline.py", line 346, in fit
    self._final_estimator.fit(Xt, y, **fit_params_last_step)
  File "/usr/local/miniconda3/envs/autosklearn/lib/python3.9/site-packages/autosklearn/estimators.py", line 1448, in fit
    super().fit(
  File "/usr/local/miniconda3/envs/autosklearn/lib/python3.9/site-packages/autosklearn/estimators.py", line 540, in fit
    self.automl_.fit(load_models=self.load_models, **kwargs)
  File "/usr/local/miniconda3/envs/autosklearn/lib/python3.9/site-packages/autosklearn/automl.py", line 2304, in fit
    return super().fit(
  File "/usr/local/miniconda3/envs/autosklearn/lib/python3.9/site-packages/autosklearn/automl.py", line 962, in fit
    raise e
  File "/usr/local/miniconda3/envs/autosklearn/lib/python3.9/site-packages/autosklearn/automl.py", line 765, in fit
    self._do_dummy_prediction()
  File "/usr/local/miniconda3/envs/autosklearn/lib/python3.9/site-packages/autosklearn/automl.py", line 489, in _do_dummy_prediction
    raise ValueError(msg)
ValueError: (" Dummy prediction failed with run state StatusType.MEMOUT and additional output: {'error': 'Memout (used more than 3072 MB).', 'configuration_origin': 'DUMMY'}.",)
[2023-08-04 10:11:17][INFO][main.py#103] process: CompletedProcess(args='bash -c "echo process_id:$$ && source activate autosklearn && python /workspace/main.py train;"', returncode=1)
Traceback (most recent call last):
  File "/workspace/main.py", line 139, in <module>
    main()
  File "/workspace/main.py", line 33, in main
    switch_to_conda_env_if_needed(args, context)
  File "/workspace/main.py", line 106, in switch_to_conda_env_if_needed
    process.check_returncode()
  File "/usr/local/miniconda3/lib/python3.9/subprocess.py", line 460, in check_returncode
    raise CalledProcessError(self.returncode, self.args, self.stdout,
subprocess.CalledProcessError: Command 'bash -c "echo process_id:$$ && source activate autosklearn && python /workspace/main.py train;"' returned non-zero exit status 1.
Fri Aug 4 10:11:17 CST 2023
```
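
Test [01] pairs a 4 GB container with `memory_limit = 1024 * 12` MB. As a minimal sketch of a pre-flight check for that kind of mismatch, the hypothetical helper `memory_limit_fits` below (not part of auto-sklearn or AI Maker) compares a configured limit with the memory the container is known to have. Both numbers are supplied by hand here, and these notes do not establish whether the mismatch is what caused the hang in test [01].

```python
DEFAULT_MEMORY_LIMIT_MB = 3072  # default quoted in the code comment of test [01]


def memory_limit_fits(memory_limit_mb, container_memory_gb):
    """Hypothetical check: True if the limit does not exceed the container's memory."""
    container_memory_mb = container_memory_gb * 1024
    return memory_limit_mb <= container_memory_mb


params = {"memory_limit": 1024 * 12}  # in MB, as configured in test [01]

# Compare the configured limit and the default against a 4 GB container
for limit_mb in (params["memory_limit"], DEFAULT_MEMORY_LIMIT_MB):
    verdict = "fits in" if memory_limit_fits(limit_mb, container_memory_gb=4) else "exceeds"
    print(f"memory_limit={limit_mb} MB {verdict} a 4 GB container")
```

A check like this only flags a limit larger than the container; it says nothing about how much memory the dataset or the models actually need, which is what the `MEMOUT` error above reports against the 3072 MB default.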