# Building Your Own scikit-learn Estimator
###### tags: `python` `scikit-learn`
:::info
I decided to create my own estimator using scikit-learn and then use Pipeline and GridSearchCV for automating the whole process and the parameter tuning.
:::
## Building an object
1. Choose the type of object you want to build:
    * Classifier
    * Clustering
    * Regressor
    * Transformer
2. Inherit from BaseEstimator plus the mixin class that matches the chosen object type:
    * ClassifierMixin
    * ClusterMixin
    * RegressorMixin
    * TransformerMixin

With that, your own estimator class is set up; what remains is to decide on its input parameters and outputs (a minimal skeleton is sketched below).
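For instance, a minimal skeleton for a transformer-type estimator might look like this (the `MyTransformer` name, the `factor` parameter, and the scaling logic are made up for illustration):
```python=
from sklearn.base import BaseEstimator, TransformerMixin

class MyTransformer(BaseEstimator, TransformerMixin):
    """Minimal skeleton: inherit BaseEstimator plus the matching mixin."""

    def __init__(self, factor=1.0):
        # only store the parameters here, no other logic
        self.factor = factor

    def fit(self, X, y=None):
        # all the real work belongs here
        return self

    def transform(self, X):
        # a trivial placeholder transformation
        return [x * self.factor for x in X]
```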
## Following the scikit-learn rules
* All parameters of `__init__` must have default values, so that simply typing `MyClassifier()` is enough to initialize the classifier.
* Do not check or validate the parameters inside `__init__`! That logic belongs to the `fit` method.
* Every parameter of `__init__` should be stored on the created object as an attribute with exactly the same name.
* Do not pass the data as an `__init__` parameter! It belongs to the `fit` method (a short sketch of these rules follows this list).
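A short sketch of these rules (the class name and its `alpha`/`method` parameters are hypothetical):
```python=
from sklearn.base import BaseEstimator, ClassifierMixin

class MyClassifier(BaseEstimator, ClassifierMixin):
    def __init__(self, alpha=1.0, method="simple"):
        # every parameter has a default, so MyClassifier() alone works
        # no data and no validation here -- both belong to fit
        self.alpha = alpha        # attribute name == parameter name
        self.method = method
```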
## get_params & set_params
Every estimator must have `get_params` and `set_params` methods. They are inherited when you inherit from `BaseEstimator`, and I suggest **not overriding** them (that is, do not redefine them in your classifier).
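For example (continuing the hypothetical `MyClassifier` sketched above), the inherited methods already do the right thing, and `set_params` is exactly what GridSearchCV uses to set each candidate's parameters:
```python=
clf = MyClassifier(alpha=0.5)
print(clf.get_params())     # {'alpha': 0.5, 'method': 'simple'}
clf.set_params(alpha=2.0)   # GridSearchCV calls this to try each candidate value
print(clf.alpha)            # 2.0
```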
## fit method
**Implement everything you want the estimator to do here!** In addition, follow these rules (a short sketch follows this list).
* Check the parameters.
* Take the data and do the processing.
* Any attribute you add should have a name ending with an underscore, e.g. `self.fitted_`.
* Return `self`.
* Even if you do not need a target, accept it with the default `y=None`; otherwise GridSearch cannot be used later.
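A minimal sketch of a `fit` that follows these rules, using an exception for the parameter check (the class name and the `intValue` parameter are illustrative):
```python=
from sklearn.base import BaseEstimator, ClassifierMixin

class SketchClassifier(BaseEstimator, ClassifierMixin):
    def __init__(self, intValue=0):
        self.intValue = intValue

    def fit(self, X, y=None):   # y=None so GridSearchCV can call fit without a target
        # 1. check the parameters
        if not isinstance(self.intValue, int):
            raise ValueError("intValue parameter must be an integer")
        # 2. process the data; learned attributes end with an underscore
        self.threshold_ = sum(X) / len(X) + self.intValue
        # 3. always return self
        return self
```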
## Additional requirements, score and GridSearch
* Anything that should be hidden from the user should have a name starting with an underscore (it can still be called from outside).
* To make GridSearch work properly, override the `score` method as needed so it can tell which model is best. GridSearch then simply calls `score` and, following the convention that a larger value means a better model, finds the best parameters.
* Override the `score` method: put the model-evaluation logic here and return a **numeric** metric (see the sketch below).
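For a supervised classifier, `ClassifierMixin` already provides an accuracy-based `score`; if you do override it, keep the larger-is-better convention. A minimal sketch (the class and its parameter are hypothetical):
```python=
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.metrics import accuracy_score

class ThresholdClassifier(BaseEstimator, ClassifierMixin):
    def __init__(self, threshold=0.0):
        self.threshold = threshold

    def fit(self, X, y=None):
        return self

    def predict(self, X):
        return [1 if x >= self.threshold else 0 for x in X]

    def score(self, X, y=None):
        # larger is better: GridSearchCV keeps the parameters maximizing this value
        return accuracy_score(y, self.predict(X))
```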
## An example of MeanClassifier
Build a classifier that assigns each input x to one of two classes; the threshold is determined from the training data.
* class 0: x < threshold
* class 1: x >= threshold
* threshold = mean + intValue: take the mean of the input list X (which contains 20 values here) and add the user-specified intValue.
### build the classifier class
```python=
from sklearn.base import BaseEstimator, ClassifierMixin

class MeanClassifier(BaseEstimator, ClassifierMixin):
    """An example of classifier"""

    def __init__(self, intValue=0, stringParam="defaultValue", otherParam=None):
        """
        Called when initializing the classifier
        """
        self.intValue = intValue
        self.stringParam = stringParam
        # THIS IS WRONG! Parameters should have same name as attributes
        self.differentParam = otherParam

    def fit(self, X, y=None):
        """
        This should fit classifier. All the "work" should be done here.
        Note: assert is not a good choice here and you should rather
        use try/except blocks with exceptions. This is just for short syntax.
        """
        assert (type(self.intValue) == int), "intValue parameter must be integer"
        assert (type(self.stringParam) == str), "stringParam parameter must be string"
        self.threshold_ = (sum(X) / len(X)) + self.intValue  # mean + intValue
        return self

    def _meaning(self, x):
        # returns True/False according to fitted classifier
        # notice underscore on the beginning
        return True if x >= self.threshold_ else False

    def predict(self, X, y=None):
        try:
            getattr(self, "threshold_")
        except AttributeError:
            raise RuntimeError("You must train classifier before predicting data!")
        return [self._meaning(x) for x in X]

    def score(self, X, y=None):
        # counts number of values bigger than mean
        return sum(self.predict(X))
```
### use the classifier & GridSearch
```python=
from sklearn.model_selection import GridSearchCV
X_train = [i for i in range(0, 100, 5)]
X_test = [i + 3 for i in range(-5, 95, 5)]
tuned_params = {"intValue" : [-10,-1,0,1,10]}
gs = GridSearchCV(MeanClassifier(), tuned_params)
gs.fit(X_train)
print(gs.best_params_) # {'intValue': -10} # and that is what we expect :)
```
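Continuing the snippet above: with the default `refit=True`, the best model is refitted on the whole of `X_train` and exposed as `gs.best_estimator_`, so the otherwise unused `X_test` can be fed to it:
```python=
best_clf = gs.best_estimator_       # MeanClassifier refitted with intValue=-10
print(best_clf.predict(X_test))     # one True/False per element of X_test
print(best_clf.score(X_test))       # number of values at or above the fitted threshold
```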
## Simplification of init method
When there are many initialization parameters, you end up with long-winded code like this:
```python=
def __init__(self, arg1, arg2, arg3, ..., argN):
    self.arg1 = arg1
    self.arg2 = arg2
    # ...
    self.argN = argN
```
This can be shortened with the `inspect` module and `setattr`:
```python=
import inspect

def __init__(self, arg1, arg2, arg3, ..., argN):
    # print("Initializing classifier:\n")
    args, _, _, values = inspect.getargvalues(inspect.currentframe())
    values.pop("self")
    for arg, val in values.items():
        setattr(self, arg, val)
        # print("{} = {}".format(arg, val))
```
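A concrete, runnable version of the same trick, with three made-up parameters (`alpha`, `beta`, `gamma`):
```python=
import inspect

from sklearn.base import BaseEstimator, ClassifierMixin

class VerboseClassifier(BaseEstimator, ClassifierMixin):
    def __init__(self, alpha=1.0, beta=2.0, gamma=3.0):
        # grab every argument of __init__ (except self) and store it as an attribute
        args, _, _, values = inspect.getargvalues(inspect.currentframe())
        values.pop("self")
        for arg, val in values.items():
            setattr(self, arg, val)

print(VerboseClassifier(beta=5).get_params())   # {'alpha': 1.0, 'beta': 5, 'gamma': 3.0}
```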
## References
* [Blog post](http://danielhnyk.cz/creating-your-own-estimator-scikit-learn/)
* [scikit-learn](https://scikit-learn.org/dev/developers/contributing.html#rolling-your-own-estimator)