[ML] sklearn / AIMaker
===
###### tags: `ML / sklearn`
###### tags: `ML`, `sklearn`, `python`
<br>
[TOC]
<br>
## Source code
- http://10.78.26.44:30000/Tj_Tsai/ml_sklearn
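- Data-interface issues
- For example, iris ships as a single dataset and needs to be split into X and y
- How exactly should the input data be connected?
- If no validation data is provided, offer an option so the user can decide whether to split the data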
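
A minimal sketch of the data-interface idea above, assuming scikit-learn is available. The helper `prepare_data` and its `split` / `val_size` parameters are hypothetical names introduced only for illustration; they are not part of the repository.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split


def prepare_data(X, y, X_val=None, y_val=None, split=True, val_size=0.2, seed=0):
    """Return (X_train, y_train, X_val, y_val).

    Validation data supplied by the caller is used as-is. Otherwise the
    hypothetical `split` flag lets the user decide whether part of the
    training data is held out for validation.
    """
    if X_val is not None and y_val is not None:
        return X, y, X_val, y_val
    if not split:
        return X, y, None, None
    X_train, X_holdout, y_train, y_holdout = train_test_split(
        X, y, test_size=val_size, random_state=seed
    )
    return X_train, y_train, X_holdout, y_holdout


# iris is one dataset; return_X_y=True separates features (X) from target (y).
X, y = load_iris(return_X_y=True)
X_train, y_train, X_val, y_val = prepare_data(X, y, split=True)
print(X_train.shape, X_val.shape)
```

Calling `prepare_data(X, y, split=False)` returns the full data with `None` in the validation slots, which is one possible way to represent "the user chose not to split".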
<br>
## AI Maker reference manual
- [AI Maker](http://10.78.26.20:31012/s/QFn7N5R-H#%E6%A8%A1%E6%9D%BF)
- [[AI-Maker] Deploying an image classification model](https://hackmd.io/@Cynthia-Chuang/AIMake-Image-Classifier), 2021/01/12
- [[gitlab][Cynthia] AI_Maker_Template / ML_AI_Maker](http://10.78.26.44:30000/ai_maker_template/ml_ai_maker/tree/master)
<br>
## Algorithm list
- [ ] Regression
- [x] [compose](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.compose)
- [x] [TransformedTargetRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.compose.TransformedTargetRegressor.html)
- [x] [dummy](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.dummy)
- [x] [DummyRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.dummy.DummyRegressor.html)
- [ ] [ensemble](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.ensemble)
- [x] [AdaBoostRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostRegressor.html)
- [x] [BaggingRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingRegressor.html)
- [x] [ExtraTreesRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesRegressor.html)
- [x] [GradientBoostingRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html)
- [x] [RandomForestRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html)
- [ ] [StackingRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.StackingRegressor.html)
- [ ] [VotingRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.VotingRegressor.html)
- [ ] [HistGradientBoostingRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.HistGradientBoostingRegressor.html)
- Not available in the current sklearn version
- [x] [gaussian_process](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.gaussian_process)
- [x] [GaussianProcessRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.gaussian_process.GaussianProcessRegressor.html)
- [x] [kernel_ridge](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.kernel_ridge)
- [x] [KernelRidge](https://scikit-learn.org/stable/modules/generated/sklearn.kernel_ridge.KernelRidge.html)
- [ ] [Linear Models](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.linear_model)
- [x] [Classical linear regressors](https://scikit-learn.org/stable/modules/classes.html#classical-linear-regressors)
- [x] [LinearRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html)
- [x] [LinearRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html) / [Polynomial Regression](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html)
- [x] [SGDRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDRegressor.html#sklearn.linear_model.SGDRegressor)
- [x] [Ridge](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html#sklearn.linear_model.Ridge)
- [x] [RidgeCV](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.RidgeCV.html)
- [ ] [Regressors with variable selection](https://scikit-learn.org/stable/modules/classes.html#regressors-with-variable-selection)
- [x] [Bayesian regressors](https://scikit-learn.org/stable/modules/classes.html#bayesian-regressors)
- [x] [linear_model.ARDRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.ARDRegression.html#sklearn.linear_model.ARDRegression)
- [x] [linear_model.BayesianRidge](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.BayesianRidge.html#sklearn.linear_model.BayesianRidge)
- [ ] [Multi-task linear regressors with variable selection](https://scikit-learn.org/stable/modules/classes.html#multi-task-linear-regressors-with-variable-selection)
- [x] [Outlier-robust regressors](https://scikit-learn.org/stable/modules/classes.html#outlier-robust-regressors)
- [x] [HuberRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.HuberRegressor.html#sklearn.linear_model.HuberRegressor)
- [x] [RANSACRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.RANSACRegressor.html#sklearn.linear_model.RANSACRegressor)
- [x] [TheilSenRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.TheilSenRegressor.html#sklearn.linear_model.TheilSenRegressor)
- [x] [Generalized linear models (GLM) for regression](https://scikit-learn.org/stable/modules/classes.html#generalized-linear-models-glm-for-regression)
- [x] [PoissonRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.PoissonRegressor.html#sklearn.linear_model.PoissonRegressor)
- the best for Iris
- [x] [TweedieRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.TweedieRegressor.html#sklearn.linear_model.TweedieRegressor)
- [x] [GammaRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.GammaRegressor.html#sklearn.linear_model.GammaRegressor)
- [ ] [Miscellaneous](https://scikit-learn.org/stable/modules/classes.html#miscellaneous)
- [x] [PassiveAggressiveRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.PassiveAggressiveRegressor.html#sklearn.linear_model.PassiveAggressiveRegressor)
- [ ] [neighbors](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.neighbors)
- [x] [KNeighborsRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html#sklearn.neighbors.KNeighborsRegressor)
- [ ] [RadiusNeighborsRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.RadiusNeighborsRegressor.html#sklearn.neighbors.RadiusNeighborsRegressor)
- [x] [neural_network](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.neural_network)
- [x] [MLPRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPRegressor.html#sklearn.neural_network.MLPRegressor)
- [x] [tree](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.tree)
- [x] [DecisionTreeRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html)
- [x] [ExtraTreeRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.tree.ExtraTreeRegressor.html)
- [x] [SVM](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.svm)
- [x] [SVM / SVR](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html)
- [x] [SVM / NuSVR](https://scikit-learn.org/stable/modules/generated/sklearn.svm.NuSVR.html)
- [x] [SVM / LinearSVR](https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVR.html)
<br>
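
The "LinearRegression / Polynomial Regression" entry above combines `PolynomialFeatures` with `LinearRegression`. A minimal sketch of that combination follows; the synthetic data and the degree are assumptions chosen only so the fitted coefficients are easy to check.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic, noise-free data (an assumption for illustration): y = 1 + 2x + 3x^2.
x = np.linspace(-3, 3, 50).reshape(-1, 1)
y = 1 + 2 * x.ravel() + 3 * x.ravel() ** 2

# PolynomialFeatures expands x into [x, x^2]; LinearRegression then fits
# ordinary least squares on the expanded features.
poly_reg = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    LinearRegression(),
)
poly_reg.fit(x, y)

# make_pipeline names each step after its lowercased class name.
lin = poly_reg.named_steps["linearregression"]
print(lin.intercept_, lin.coef_)
```

Wrapping both steps in one pipeline keeps the feature expansion and the regressor together, so the same object can be passed to cross-validation or saved as a single model.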
- [ ] Classification
- [x] [dummy](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.dummy)
- [x] [DummyClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.dummy.DummyClassifier.html)
- [ ] [ensemble](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.ensemble)
- [x] [AdaBoostClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html)
- [x] [BaggingClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingClassifier.html)
- [x] [ExtraTreesClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesClassifier.html)
- [x] [GradientBoostingClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html)
- [x] [RandomForestClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)
- [ ] [StackingClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.StackingClassifier.html)
- [ ] [VotingClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.VotingClassifier.html)
- [ ] [HistGradientBoostingClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.HistGradientBoostingClassifier.html)
- for big datasets
- Not available in the current sklearn version
- [x] [gaussian_process](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.gaussian_process)
- [x] [GaussianProcessClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.gaussian_process.GaussianProcessClassifier.html)
- [ ] [Linear Models](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.linear_model)
- [x] [PassiveAggressiveClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.PassiveAggressiveClassifier.html)
- [x] [RidgeClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.RidgeClassifier.html)
- [x] [RidgeClassifierCV](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.RidgeClassifierCV.html)
- [x] [SGDClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html#sklearn.linear_model.SGDClassifier)
- [ ] [naive_bayes](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.naive_bayes)
- [x] [BernoulliNB](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.BernoulliNB.html)
- [x] [CategoricalNB](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.CategoricalNB.html)
- [x] [ComplementNB](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.ComplementNB.html)
- [x] [GaussianNB](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html)
- [x] [MultinomialNB](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html)
- [ ] [neighbors](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.neighbors)
- [ ] [KNeighborsClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html)
- [ ] [RadiusNeighborsClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.RadiusNeighborsClassifier.html)
- [x] [neural_network](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.neural_network)
- [x] [MLPClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html#sklearn.neural_network.MLPClassifier)
- [ ] [Semi-Supervised](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.semi_supervised)
- [ ] [SelfTrainingClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.semi_supervised.SelfTrainingClassifier.html#sklearn.semi_supervised.SelfTrainingClassifier)
- Not available in the current sklearn version
- [ ] [svm](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.svm)
- [x] [SVM / SVC](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html)
- [x] [SVM / NuSVC](https://scikit-learn.org/stable/modules/generated/sklearn.svm.NuSVC.html)
- [x] [SVM / LinearSVC](https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html)
- [ ] [tree](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.tree)
- [x] [DecisionTreeClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html)
- [x] [ExtraTreeClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.tree.ExtraTreeClassifier.html)
- References
- [TOP MACHINE LEARNING ALGORITHMS YOU SHOULD KNOW](https://builtin.com/data-science/tour-top-10-algorithms-machine-learning-newbies)
- Linear Regression
- Logistic Regression
- Linear Discriminant Analysis
- Classification and Regression Trees
- Naive Bayes
- K-Nearest Neighbors (KNN)
- Learning Vector Quantization (LVQ)
- Support Vector Machines (SVM)
- Random Forest
- Boosting
- AdaBoost
<br>
## [Model metrics](https://hackmd.io/DLAReu9qSzud3e9NWfJIJw#metrics)
<br>
## auto-sklearn
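- If no algorithm is passed in,
- the system decides automatically
- If the task is specified as regression, should the system run all regression algorithms and pick the best one?
- If the task is specified as classification, should the system run all classification algorithms and pick the best one?
- References
- [Official manual](https://automl.github.io/auto-sklearn/master/manual.html)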
- [metrics](https://automl.github.io/auto-sklearn/master/examples/40_advanced/example_calc_multiple_metrics.html#sphx-glr-examples-40-advanced-example-calc-multiple-metrics-py)
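
The "run every algorithm and pick the best one" question above can be sketched with plain scikit-learn cross-validation. This is not auto-sklearn's API; the candidate dictionary, the `select_best` helper, and the choice of accuracy as the metric are all assumptions made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Hypothetical candidate set; a regression task would use a dictionary of
# regressors and a regression metric such as "r2" instead.
CLASSIFIERS = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "svc": SVC(),
}


def select_best(candidates, X, y, scoring="accuracy", cv=5):
    """Cross-validate every candidate and return (best_name, all_scores)."""
    scores = {
        name: cross_val_score(est, X, y, scoring=scoring, cv=cv).mean()
        for name, est in candidates.items()
    }
    best_name = max(scores, key=scores.get)
    return best_name, scores


X, y = load_iris(return_X_y=True)
best_name, scores = select_best(CLASSIFIERS, X, y)
print(best_name, scores)
```

Running every candidate is only practical for small candidate sets and small data; auto-sklearn additionally searches hyperparameters, which this sketch does not attempt.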
<br>
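## Things to improve
- Reading environment variables should not be tied to the abstract model
- How should the data interface be defined, and how do we check that inputs conform to the data spec? For example: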
- X : array-like, shape = (n_samples, n_features)
- y : array-like, shape = (n_samples,)
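
One possible way to check the shape contract above is scikit-learn's own `check_X_y` utility. The wrapper name `validate_inputs` is hypothetical; this is a sketch of the idea, not the template's actual validation code.

```python
import numpy as np
from sklearn.utils import check_X_y


def validate_inputs(X, y):
    """Return (X, y) as validated arrays, or raise ValueError.

    With its default arguments, check_X_y requires a 2-D X of shape
    (n_samples, n_features), a 1-D y of shape (n_samples,), matching
    sample counts, and finite numeric values.
    """
    return check_X_y(X, y)


X_ok, y_ok = validate_inputs([[5.1, 3.5], [4.9, 3.0], [6.2, 2.9]], [0, 0, 1])
print(X_ok.shape, y_ok.shape)

try:
    # Three labels for two samples violates the contract.
    validate_inputs([[5.1, 3.5], [4.9, 3.0]], [0, 0, 1])
except ValueError as err:
    print("rejected:", err)
```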
<br> <hr>
### 2021/04/16 - Evaluation metrics
- [RMSE comes in two forms: plain RMSE and normalized RMSE](https://zh.wikipedia.org/wiki/%E5%9D%87%E6%96%B9%E6%A0%B9%E8%AF%AF%E5%B7%AE)
### 2021/04/06 - A good introductory guide
- [Day 8 - Introduction to Scikit-learn (1)](https://ithelp.ithome.com.tw/articles/10204845)
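
A small sketch of the two variants, assuming NumPy. Normalized RMSE has more than one convention; dividing by the range or by the mean of the observed values are the two shown here, and the `norm` parameter name is an assumption.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root-mean-square error, in the same units as y."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def nrmse(y_true, y_pred, norm="range"):
    """RMSE divided by the range or the mean of y_true (unitless)."""
    y_true = np.asarray(y_true, dtype=float)
    if norm == "range":
        denom = y_true.max() - y_true.min()
    elif norm == "mean":
        denom = y_true.mean()
    else:
        raise ValueError("norm must be 'range' or 'mean'")
    if denom == 0:
        raise ValueError("cannot normalize: denominator is zero")
    return rmse(y_true, y_pred) / float(denom)


y_true = [0.0, 2.0, 4.0]
y_pred = [1.0, 3.0, 5.0]
print(rmse(y_true, y_pred))            # every error is 1, so RMSE is 1.0
print(nrmse(y_true, y_pred, "range"))  # 1.0 / (4 - 0) = 0.25
print(nrmse(y_true, y_pred, "mean"))   # 1.0 / 2.0 = 0.5
```

Because normalized RMSE is unitless, it is the variant that can be compared across targets with different scales, provided the same normalization is used throughout.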
![](https://i.imgur.com/qcEvSsI.png)
### 2021/04/06 - Comprehensive tool collection (must-read for ML 2.0)
- [EthicalML / awesome-production-machine-learning](https://github.com/EthicalML/awesome-production-machine-learning#data-stream-processing)
Awesome production machine learning

### 2021/03/22 - Python interpretability packages
- [Use the interpretability package to explain ML models & predictions in Python (preview)](https://docs.microsoft.com/zh-tw/azure/machine-learning/how-to-machine-learning-interpretability-aml)
### 2021/03/19 - PyCaret
- ### https://towardsdatascience.com/pycaret-2-1-is-here-whats-new-4aae6a7f636a
- Introduction to PyCaret (the latest version is 2.3)
- PyCaret is an open-source, low-code machine learning library in Python that automates the machine learning workflow
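- **This article covers the new features in 2.1 and is worth a look; we may be able to integrate PyCaret into our ML Template**
- Hyperparameter Tuning on GPU.
Supports GPU-based hyperparameter tuning for three models: XGBoost, LightGBM, and CatBoost
- Model Deployment
Deploys models to Azure or GCP via deploy_model()
- MLFlow Deployment
Supports MLflow deployment
- MLFlow Model Registry
Supports the MLflow model registry
- High-Resolution Plotting
- User-Defined Loss Function
- Feature Selection
Uses the Boruta algorithm for feature selection. Check whether this is the automatic feature-picking capability.
- Diagrams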

- https://pycaret.org/interpret-model/
- https://pycaret.org/automl/
- https://pycaret.org/missing-values/
- Offers Interpret Model, Data preparation, and AutoML features
- > This library looks worth studying.
> Once you finish the first version of the ML Template, let's look into it (laugh)
> ML Template v1.0 -> uses scikit-learn + auto-sklearn
> ML Template v2.0 -> adopts PyCaret, providing more features
- ### [Evaluating the Model Using All Plots](https://towardsdatascience.com/how-to-use-pycaret-the-library-for-lazy-data-scientists-91343f960bd2) <- this feature is also quite good

- ### [[Python "alchemy furnace" upgraded] Nothing PyCaret 2.0 can't do! Automate all data preparation and model deployment in minutes](https://buzzorange.com/techorange/2020/08/03/pycaret-2-0/)
- ### [Someone on Kaggle wrote a simple tutorial on using PyCaret](https://www.kaggle.com/frtgnn/pycaret-introduction-classification-regression)
<br> <hr>
### 2021/03/19 - ~~Algorithms~~
AdaBoost (both)
Extra Tree (both)
Decision Tree (both)
Gradient Boosting (both)
KNeighbors (both)
Logistic Regression (cls)
Linear Regression (reg)
Random Forest (both)
SGD (cls)
SVM (both)
XGBoost (both)
LightGBM (both)
<br> <hr>
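### 2021/03/18 - Slide summary - The pain of data scientists
https://www.kaggle.com/binuthomasphilip/lin-reg-bikes-ride-in-sharing-economy-9
Picked a random notebook on Kaggle; this one trains on the bike-sharing dataset.
https://www.kaggle.com/c/bike-sharing-demand/code
https://www.kaggle.com/ishantkukreti/eda-model-selection
This one also has quite a lot of up-front data processing and statistical analysis.
It just occurred to me: we could show everyone a few Kaggle notebooks to illustrate how typical developers build ML.
Then walk through the Azure ML screens and explain what benefits Azure ML brings.
Judging from those Kaggle notebooks, everyone's basic opening move is to look at the data first:
run some statistical analysis, check the class balance, the column content types, and so on.
Understand the data first, then process it (format conversion, feature selection).
Only at the very end is an algorithm thrown at it for training.
Right now, almost everyone simply starts by running XGBoost.
That is why Howard says the algorithm is relatively less critical,
because the most important grunt work is in the data-processing stage.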
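
The "opening moves" described above (summary statistics, class balance, column types) can be sketched in a few lines of pandas. The tiny DataFrame and its column names are made up for illustration and do not come from any of the linked notebooks.

```python
import pandas as pd

# Hypothetical toy data standing in for a real dataset.
df = pd.DataFrame(
    {
        "age": [23, 35, 41, 29, 52, 37],
        "city": ["A", "B", "A", "C", "B", "A"],
        "no_show": [0, 1, 0, 0, 1, 0],
    }
)

summary = df.describe()                               # statistics for numeric columns
dtypes = df.dtypes                                    # column content types
balance = df["no_show"].value_counts(normalize=True)  # class balance of the target
missing = df.isna().sum()                             # missing values per column

print(summary)
print(dtypes)
print(balance)
print(missing)
```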

https://www.kaggle.com/khaledelhasafy/no-show-data-anlysis
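The no-show notebook we looked at earlier.
Check whether the columns we consider important are actually related to the target column.

Expand all 89 clinic locations in the no-show data
and see whether clinic location has a clear association with no-show.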
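
A rough sketch of that check, assuming pandas: compute the no-show rate per location and a contingency table. The column names `clinic` and `no_show` and the toy rows are hypothetical; the real dataset has 89 locations and its own column names.

```python
import pandas as pd

# Hypothetical toy data: three clinic locations, binary no-show target.
df = pd.DataFrame(
    {
        "clinic": ["A", "A", "A", "A", "B", "B", "B", "B", "C", "C", "C", "C"],
        "no_show": [0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1],
    }
)

# No-show rate per location, highest first; large gaps between locations
# hint at an association worth testing further.
rate_by_clinic = (
    df.groupby("clinic")["no_show"].mean().sort_values(ascending=False)
)

# Raw counts per location and outcome, useful input for a later
# statistical test (for example a chi-square test).
table = pd.crosstab(df["clinic"], df["no_show"])

print(rate_by_clinic)
print(table)
```

With 89 locations, some will have very few rows, so the per-location rate should be read together with the counts in the contingency table.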
Slides
- ~~Page numbers~~
- ~~AMAX -> AICS~~