[ML] sklearn / AIMaker
===
###### tags: `ML / sklearn`
###### tags: `ML`, `sklearn`, `python`

<br>

[TOC]

<br>

## Source code
- http://10.78.26.44:30000/Tj_Tsai/ml_sklearn
- Data-interface questions
    - A dataset such as iris ships as a single dataset and has to be split into X and y
    - How exactly should the input data be hooked up?
    - If no validation data is supplied, offer an option that lets the user decide whether to split the data

<br>

## AI Maker reference manuals
- [AI Maker](http://10.78.26.20:31012/s/QFn7N5R-H#%E6%A8%A1%E6%9D%BF)
- [[AI-Maker] Deploying an image classification model](https://hackmd.io/@Cynthia-Chuang/AIMake-Image-Classifier), 2021/01/12
- [[gitlab][Cynthia] AI_Maker_Template / ML_AI_Maker](http://10.78.26.44:30000/ai_maker_template/ml_ai_maker/tree/master)

<br>

## Algorithm list
- [ ] Regression
    - [x] [compose](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.compose)
        - [x] [TransformedTargetRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.compose.TransformedTargetRegressor.html)
    - [x] [dummy](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.dummy)
        - [x] [DummyRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.dummy.DummyRegressor.html)
    - [ ] [ensemble](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.ensemble)
        - [x] [AdaBoostRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostRegressor.html)
        - [x] [BaggingRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingRegressor.html)
        - [x] [ExtraTreesRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesRegressor.html)
        - [x] [GradientBoostingRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html)
        - [x] [RandomForestRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html)
        - [ ] [StackingRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.StackingRegressor.html)
        - [ ] [VotingRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.VotingRegressor.html)
        - [ ] [HistGradientBoostingRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.HistGradientBoostingRegressor.html)
            - Not available in the sklearn version currently in use
    - [x] [gaussian_process](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.gaussian_process)
        - [x] [GaussianProcessRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.gaussian_process.GaussianProcessRegressor.html)
    - [x] [kernel_ridge](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.kernel_ridge)
        - [x] [KernelRidge](https://scikit-learn.org/stable/modules/generated/sklearn.kernel_ridge.KernelRidge.html)
    - [ ] [Linear Models](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.linear_model)
        - [x] [Classical linear regressors](https://scikit-learn.org/stable/modules/classes.html#classical-linear-regressors)
            - [x] [LinearRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html)
            - [x] [LinearRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html) / [Polynomial Regression](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html)
            - [x] [SGDRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDRegressor.html#sklearn.linear_model.SGDRegressor)
            - [x] [Ridge](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html#sklearn.linear_model.Ridge)
            - [x] [RidgeCV](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.RidgeCV.html)
        - [ ] [Regressors with variable selection](https://scikit-learn.org/stable/modules/classes.html#regressors-with-variable-selection)
        - [x] [Bayesian regressors](https://scikit-learn.org/stable/modules/classes.html#bayesian-regressors)
            - [x] [linear_model.ARDRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.ARDRegression.html#sklearn.linear_model.ARDRegression)
            - [x] [linear_model.BayesianRidge](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.BayesianRidge.html#sklearn.linear_model.BayesianRidge)
        - [ ] [Multi-task linear regressors with variable selection](https://scikit-learn.org/stable/modules/classes.html#multi-task-linear-regressors-with-variable-selection)
        - [x] [Outlier-robust regressors](https://scikit-learn.org/stable/modules/classes.html#outlier-robust-regressors)
            - [x] [HuberRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.HuberRegressor.html#sklearn.linear_model.HuberRegressor)
            - [x] [RANSACRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.RANSACRegressor.html#sklearn.linear_model.RANSACRegressor)
            - [x] [TheilSenRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.TheilSenRegressor.html#sklearn.linear_model.TheilSenRegressor)
        - [x] [Generalized linear models (GLM) for regression](https://scikit-learn.org/stable/modules/classes.html#generalized-linear-models-glm-for-regression)
            - [x] [PoissonRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.PoissonRegressor.html#sklearn.linear_model.PoissonRegressor)
                - the best for Iris
            - [x] [TweedieRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.TweedieRegressor.html#sklearn.linear_model.TweedieRegressor)
            - [x] [GammaRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.GammaRegressor.html#sklearn.linear_model.GammaRegressor)
        - [ ] [Miscellaneous](https://scikit-learn.org/stable/modules/classes.html#miscellaneous)
            - [x] [PassiveAggressiveRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.PassiveAggressiveRegressor.html#sklearn.linear_model.PassiveAggressiveRegressor)
    - [ ] [neighbors](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.neighbors)
        - [x] [KNeighborsRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html#sklearn.neighbors.KNeighborsRegressor)
        - [ ] [RadiusNeighborsRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.RadiusNeighborsRegressor.html#sklearn.neighbors.RadiusNeighborsRegressor)
    - [x] [neural_network](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.neural_network)
        - [x] [MLPRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPRegressor.html#sklearn.neural_network.MLPRegressor)
    - [x] [tree](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.tree)
        - [x] [DecisionTreeRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html)
        - [x] [ExtraTreeRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.tree.ExtraTreeRegressor.html)
    - [x] [SVM](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.svm)
        - [x] [SVM / SVR](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html)
        - [x] [SVM / NuSVR](https://scikit-learn.org/stable/modules/generated/sklearn.svm.NuSVR.html)
        - [x] [SVM / LinearSVR](https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVR.html)

<br>

- [ ] Classification
    - [x] [dummy](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.dummy)
        - [x] [DummyClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.dummy.DummyClassifier.html)
    - [ ] [ensemble](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.ensemble)
        - [x] [AdaBoostClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html)
        - [x] [BaggingClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingClassifier.html)
        - [x] [ExtraTreesClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesClassifier.html)
        - [x] [GradientBoostingClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html)
        - [x] [RandomForestClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)
        - [ ] [StackingClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.StackingClassifier.html)
        - [ ] [VotingClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.VotingClassifier.html)
        - [ ] [HistGradientBoostingClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.HistGradientBoostingClassifier.html)
            - for big datasets
            - Not available in the sklearn version currently in use
    - [x] [gaussian_process](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.gaussian_process)
        - [x] [GaussianProcessClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.gaussian_process.GaussianProcessClassifier.html)
    - [ ] [Linear Models](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.linear_model)
        - [x] [PassiveAggressiveClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.PassiveAggressiveClassifier.html)
        - [x] [RidgeClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.RidgeClassifier.html)
        - [x] [RidgeClassifierCV](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.RidgeClassifierCV.html)
        - [x] [SGDClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html#sklearn.linear_model.SGDClassifier)
    - [ ] [naive_bayes](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.naive_bayes)
        - [x] [BernoulliNB](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.BernoulliNB.html)
        - [x] [CategoricalNB](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.CategoricalNB.html)
        - [x] [ComplementNB](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.ComplementNB.html)
        - [x] [GaussianNB](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html)
        - [x] [MultinomialNB](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html)
    - [ ] [neighbors](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.neighbors)
        - [ ] [KNeighborsClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html)
        - [ ] [RadiusNeighborsClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.RadiusNeighborsClassifier.html)
    - [x] [neural_network](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.neural_network)
        - [x] [MLPClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html#sklearn.neural_network.MLPClassifier)
    - [ ] [Semi-Supervised](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.semi_supervised)
        - [ ] [SelfTrainingClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.semi_supervised.SelfTrainingClassifier.html#sklearn.semi_supervised.SelfTrainingClassifier)
            - Not available in the sklearn version currently in use
    - [ ] [svm](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.svm)
        - [x] [SVM / SVC](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html)
        - [x] [SVM / NuSVC](https://scikit-learn.org/stable/modules/generated/sklearn.svm.NuSVC.html)
        - [x] [SVM / LinearSVC](https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html)
    - [ ] [tree](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.tree)
        - [x] [DecisionTreeClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html)
        - [x] [ExtraTreeClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.tree.ExtraTreeClassifier.html)

- References
    - [TOP MACHINE LEARNING ALGORITHMS YOU SHOULD KNOW](https://builtin.com/data-science/tour-top-10-algorithms-machine-learning-newbies)
        - Linear Regression
        - Logistic Regression
        - Linear Discriminant Analysis
        - Classification and Regression Trees
        - Naive Bayes
        - K-Nearest Neighbors (KNN)
        - Learning Vector Quantization (LVQ)
        - Support Vector Machines (SVM)
        - Random Forest
        - Boosting
        - AdaBoost

<br>

## [Model metrics](https://hackmd.io/DLAReu9qSzud3e9NWfJIJw#metrics)

<br>

## auto-sklearn
- If no algorithm is passed in,
    - the system decides automatically
        - If the task is specified as regression, should the system run every regression algorithm and pick the best one?
        - If the task is specified as classification, should the system run every classification algorithm and pick the best one?
- References
    - [Official manual](https://automl.github.io/auto-sklearn/master/manual.html)
    - [metrics](https://automl.github.io/auto-sklearn/master/examples/40_advanced/example_calc_multiple_metrics.html#sphx-glr-examples-40-advanced-example-calc-multiple-metrics-py)

<br>

## Things to improve
- Environment-variable reading should not be tied to the abstract model
- How to define the data interface, and how to check that an input conforms to the data specification, for example:
    - X : array-like, shape = (n_samples, n_features)
    - y : array-like, shape = (n_samples,)

<br>

<hr>

### 2021/04/16 - Evaluation metrics
- [RMSE comes in two forms: plain RMSE and normalized RMSE](https://zh.wikipedia.org/wiki/%E5%9D%87%E6%96%B9%E6%A0%B9%E8%AF%AF%E5%B7%AE)

### 2021/04/06 - A good guided tour
- [Day 8: Introduction to Scikit-learn (1)](https://ithelp.ithome.com.tw/articles/10204845)
  [![](https://i.imgur.com/qcEvSsI.png)](https://i.imgur.com/qcEvSsI.png)

### 2021/04/06 - Comprehensive tool collection (worth reading for ML2.0)
- [EthicalML / awesome-production-machine-learning](https://github.com/EthicalML/awesome-production-machine-learning#data-stream-processing)
  Awesome production machine learning
  ![](https://i.imgur.com/fsf7ylu.png)

### 2021/03/22 - Python interpretability package
- [Use the interpretability package to explain ML models & predictions in Python (preview)](https://docs.microsoft.com/zh-tw/azure/machine-learning/how-to-machine-learning-interpretability-aml)

### 2021/03/19 - PyCaret
- ### https://towardsdatascience.com/pycaret-2-1-is-here-whats-new-4aae6a7f636a
    - Introduction to PyCaret (the latest version is 2.3)
        - PyCaret is an open-source, low-code machine learning library in Python that automates the machine learning workflow
    - **This article covers what is new in 2.1. It is worth a read, and PyCaret could perhaps be integrated into our ML Template**
        - Hyperparameter Tuning on GPU.
          Supports GPU-based hyperparameter tuning for three models: XGBoost, LightGBM and Catboost
        - Model Deployment
          deploy_model() deploys a model to Azure or GCP
        - MLFlow Deployment
          Supports deployment through MLFlow
        - MLFlow Model Registry
          Supports registering models in MLFlow
        - High-Resolution Plotting
        - User-Defined Loss Function
        - Feature Selection
          Uses the Boruta algorithm for feature selection. Check whether this is an automatic feature-selection capability
    - Diagram
      ![](https://i.imgur.com/K0MibzH.png)
    - https://pycaret.org/interpret-model/
    - https://pycaret.org/automl/
    - https://pycaret.org/missing-values/
    - Provides Interpret Model, Data preparation, and AutoML features
    - > This library looks well worth studying
      > Once you finish the first version of the ML Template, let's dig into it (laugh)
      > ML Template v1.0 -> uses scikit-learn + auto-sklearn
      > ML Template v2.0 -> brings in PyCaret to offer more features
- ### [Evaluating the Model Using All Plots](https://towardsdatascience.com/how-to-use-pycaret-the-library-for-lazy-data-scientists-91343f960bd2) (this feature is also very nice)
  ![](https://i.imgur.com/JYThX5a.png)
- ### [[Python alchemy furnace, upgraded again] Nothing PyCaret 2.0 cannot do! Automate all data preparation and model deployment in minutes](https://buzzorange.com/techorange/2020/08/03/pycaret-2-0/)
- ### [Someone on Kaggle wrote a simple tutorial on using PyCaret](https://www.kaggle.com/frtgnn/pycaret-introduction-classification-regression)

<br>

<hr>

### 2021/03/19 - ~~Algorithms~~
AdaBoost (both)
Extra Tree (both)
Decision Tree (both)
Gradient Boosting (both)
KNeighbors (both)
Logistic Regression (cls)
Linear Regression (reg)
Random Forest (both)
SGD (cls)
SVM (both)
XGBoost (both)
LightGBM (both)

<br>

<hr>

### 2021/03/18 - Slide summary
- The data scientist's pain

https://www.kaggle.com/binuthomasphilip/lin-reg-bikes-ride-in-sharing-economy-9
A Kaggle notebook picked at random; it trains on the bike-sharing dataset
https://www.kaggle.com/c/bike-sharing-demand/code

https://www.kaggle.com/ishantkukreti/eda-model-selection
This one also has a good deal of data processing and statistical analysis up front

One idea: show everyone a few Kaggle notebooks so they can see how typical developers build ML models,
then walk through the Azure ML screens and explain what benefits Azure ML brings

Those Kaggle notebooks show that almost everyone opens the same way: look at the data first,
with summary statistics, class balance, column types, and so on
Understand the data first, then process it (format conversion, feature selection)
Only at the end is an algorithm chosen and trained
Right now nearly everyone simply starts by running XGBoost,
which is why Howard says the algorithm is comparatively less critical:
the most important grunt work is in the data-processing stage
![](https://i.imgur.com/khLKdGf.png)
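
https://www.kaggle.com/khaledelhasafy/no-show-data-anlysis
The no-show notebook we looked at earlier
It checks whether the columns that seem important are related to the target column
![](https://i.imgur.com/id4YgMv.png)

It lays out all 89 clinic locations in the no-show data
to see whether clinic location has a clear relationship with no-show

Slides
- ~~Page numbers~~
- ~~AMAX -> AICS~~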
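
On the data-interface questions listed under Source code (a single dataset such as iris has to be split into X and y, and the user should be able to choose whether validation data is split off), here is a minimal sketch using plain scikit-learn. The helper name `prepare_data` and its `validation_size` option are hypothetical and are not taken from the `ml_sklearn` repository.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split


def prepare_data(X, y, validation_size=None, random_state=0):
    """Hypothetical helper: return (X_train, y_train, X_val, y_val).

    When validation_size is None the user has opted out of splitting,
    so everything is used for training and the validation parts are None.
    """
    if validation_size is None:
        return X, y, None, None
    # stratify=y keeps class proportions equal in both parts; this suits
    # classification targets and should be dropped for regression targets.
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=validation_size, random_state=random_state, stratify=y
    )
    return X_train, y_train, X_val, y_val


# iris arrives as one dataset; return_X_y=True separates features and target.
X, y = load_iris(return_X_y=True)
X_train, y_train, X_val, y_val = prepare_data(X, y, validation_size=0.2)
print(X_train.shape, X_val.shape)
```

Calling `prepare_data(X, y)` without `validation_size` leaves the data unsplit, which corresponds to the "let the user decide" option.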
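
The data specification noted under Things to improve (X with shape `(n_samples, n_features)`, y with shape `(n_samples,)`) could be checked along the following lines. The function name `validate_xy` is an assumption made for illustration; scikit-learn also ships `sklearn.utils.check_X_y`, which performs a stricter version of the same checks.

```python
import numpy as np


def validate_xy(X, y):
    """Hypothetical check that X is (n_samples, n_features) and y is (n_samples,)."""
    X = np.asarray(X)
    y = np.asarray(y)
    if X.ndim != 2:
        raise ValueError(f"X must be 2-D (n_samples, n_features), got shape {X.shape}")
    if y.ndim != 1:
        raise ValueError(f"y must be 1-D (n_samples,), got shape {y.shape}")
    if X.shape[0] != y.shape[0]:
        raise ValueError(
            f"X and y disagree on n_samples: {X.shape[0]} vs {y.shape[0]}"
        )
    return X, y


# Two samples with two features each, and one target value per sample.
X_ok, y_ok = validate_xy([[5.1, 3.5], [4.9, 3.0]], [0, 1])
print(X_ok.shape, y_ok.shape)
```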
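
The open question in the auto-sklearn section (when only the task type is given, run every algorithm of that type and keep the best) can be illustrated with a small cross-validation loop. This is a sketch in plain scikit-learn, not the auto-sklearn API; the function `pick_best` and the three candidates are assumptions chosen for brevity, and a real run would cover the full checklist above.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier


def pick_best(candidates, X, y, cv=5):
    """Hypothetical helper: score each estimator with k-fold CV, return the best name and all scores."""
    scores = {
        name: cross_val_score(estimator, X, y, cv=cv).mean()
        for name, estimator in candidates.items()
    }
    best_name = max(scores, key=scores.get)
    return best_name, scores


# A small, illustrative subset of the classification candidates.
candidates = {
    "GaussianNB": GaussianNB(),
    "KNeighborsClassifier": KNeighborsClassifier(),
    "DecisionTreeClassifier": DecisionTreeClassifier(random_state=0),
}

X, y = load_iris(return_X_y=True)
best_name, scores = pick_best(candidates, X, y)
print(best_name, scores)
```

For classifiers `cross_val_score` reports mean accuracy by default; for a regression task the same loop works with regressors, where the default score is R².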
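
For the 2021/04/16 note on plain versus normalized RMSE, here is a short worked sketch. Normalizing by the range of the observed values is only one common convention (dividing by the mean is another), so the choice made in `nrmse` below is an assumption, not something these notes specify.

```python
import math


def rmse(y_true, y_pred):
    """Root-mean-square error: sqrt(mean((y_true - y_pred)^2))."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def nrmse(y_true, y_pred):
    """RMSE divided by the range of y_true (one common normalization)."""
    value_range = max(y_true) - min(y_true)
    if value_range == 0:
        raise ValueError("y_true has zero range; range-normalized RMSE is undefined")
    return rmse(y_true, y_pred) / value_range


y_true = [1, 2, 3, 4, 5]
y_pred = [2, 3, 4, 5, 6]  # every prediction is off by exactly 1
print(rmse(y_true, y_pred))   # 1.0
print(nrmse(y_true, y_pred))  # 0.25
```

Plain RMSE is in the units of the target, while the normalized form is unitless and easier to compare across datasets with different scales.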