# Building Your Own scikit-learn Estimator
###### tags: `python` `scikit-learn`
:::info
I decided to create my own estimator using scikit-learn and then use Pipeline and GridSearchCV for automating the whole process and the parameter tuning.
:::
## Building an object
1. Choose the type of object you want to build:
    * Classifier
    * Clustering
    * Regressor
    * Transformer
2. Inherit from BaseEstimator plus the mixin class that matches the chosen object type:
    * ClassifierMixin
    * ClusterMixin
    * RegressorMixin
    * TransformerMixin

With that, your own estimator class is set up; what remains is to decide on its input parameters and outputs (a minimal skeleton is sketched below).
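For instance, a minimal skeleton for a transformer-type estimator might look like this (the `MyTransformer` name, the `factor` parameter, and the scaling logic are made up for illustration):
```python=
from sklearn.base import BaseEstimator, TransformerMixin

class MyTransformer(BaseEstimator, TransformerMixin):
    """Minimal skeleton: inherit BaseEstimator plus the matching mixin."""

    def __init__(self, factor=1.0):
        # only store the parameters here, no other logic
        self.factor = factor

    def fit(self, X, y=None):
        # all the real work belongs here
        return self

    def transform(self, X):
        # a trivial placeholder transformation
        return [x * self.factor for x in X]
```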
## Following the scikit-learn rules
* All parameters of `__init__` must have default values, so that simply typing `MyClassifier()` is enough to initialize the classifier.
* Do not check or validate the parameters inside `__init__`! That logic belongs to the `fit` method.
* Every parameter of `__init__` should be stored on the created object as an attribute with exactly the same name.
* Do not pass the data as an `__init__` parameter! It belongs to the `fit` method (a short sketch of these rules follows this list).
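A short sketch of these rules (the class name and its `alpha`/`method` parameters are hypothetical):
```python=
from sklearn.base import BaseEstimator, ClassifierMixin

class MyClassifier(BaseEstimator, ClassifierMixin):
    def __init__(self, alpha=1.0, method="simple"):
        # every parameter has a default, so MyClassifier() alone works
        # no data and no validation here -- both belong to fit
        self.alpha = alpha        # attribute name == parameter name
        self.method = method
```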
## get_params & set_params
Every estimator must have `get_params` and `set_params` methods. They are inherited when you inherit from `BaseEstimator`, and I suggest **not overriding** them (that is, do not redefine them in your classifier).
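For example (continuing the hypothetical `MyClassifier` sketched above), the inherited methods already do the right thing, and `set_params` is exactly what GridSearchCV uses to set each candidate's parameters:
```python=
clf = MyClassifier(alpha=0.5)
print(clf.get_params())     # {'alpha': 0.5, 'method': 'simple'}
clf.set_params(alpha=2.0)   # GridSearchCV calls this to try each candidate value
print(clf.alpha)            # 2.0
```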
## fit method
**Implement everything you want the estimator to do here!** In addition, follow these rules (a short sketch follows this list).
* Check the parameters.
* Take the data and do the processing.
* Any attribute you add should have a name ending with an underscore, e.g. `self.fitted_`.
* Return `self`.
* Even if you do not need a target, accept it with the default `y=None`; otherwise GridSearch cannot be used later.
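A minimal sketch of a `fit` that follows these rules, using an exception for the parameter check (the class name and the `intValue` parameter are illustrative):
```python=
from sklearn.base import BaseEstimator, ClassifierMixin

class SketchClassifier(BaseEstimator, ClassifierMixin):
    def __init__(self, intValue=0):
        self.intValue = intValue

    def fit(self, X, y=None):   # y=None so GridSearchCV can call fit without a target
        # 1. check the parameters
        if not isinstance(self.intValue, int):
            raise ValueError("intValue parameter must be an integer")
        # 2. process the data; learned attributes end with an underscore
        self.threshold_ = sum(X) / len(X) + self.intValue
        # 3. always return self
        return self
```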
## Additional requirements, score and GridSearch
* Anything that should be hidden from the user should have a name starting with an underscore (it can still be called from outside).
* To make GridSearch work properly, override the `score` method as needed so it can tell which model is best. GridSearch then simply calls `score` and, following the convention that a larger value means a better model, finds the best parameters.
* Override the `score` method: put the model-evaluation logic here and return a **numeric** metric (see the sketch below).
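For a supervised classifier, `ClassifierMixin` already provides an accuracy-based `score`; if you do override it, keep the larger-is-better convention. A minimal sketch (the class and its parameter are hypothetical):
```python=
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.metrics import accuracy_score

class ThresholdClassifier(BaseEstimator, ClassifierMixin):
    def __init__(self, threshold=0.0):
        self.threshold = threshold

    def fit(self, X, y=None):
        return self

    def predict(self, X):
        return [1 if x >= self.threshold else 0 for x in X]

    def score(self, X, y=None):
        # larger is better: GridSearchCV keeps the parameters maximizing this value
        return accuracy_score(y, self.predict(X))
```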
## An example of MeanClassifier
Build a classifier that assigns each input x to one of two classes; the threshold is determined from the training data.
* class 0: x < threshold
* class 1: x >= threshold
* threshold = mean + intValue: take the mean of the input list X (which contains 20 values here) and add the user-specified intValue.
### build the classifier class
```python=
from sklearn.base import BaseEstimator, ClassifierMixin

class MeanClassifier(BaseEstimator, ClassifierMixin):
    """An example of classifier"""

    def __init__(self, intValue=0, stringParam="defaultValue", otherParam=None):
        """
        Called when initializing the classifier
        """
        self.intValue = intValue
        self.stringParam = stringParam
        # THIS IS WRONG! Parameters should have same name as attributes
        self.differentParam = otherParam

    def fit(self, X, y=None):
        """
        This should fit classifier. All the "work" should be done here.
        Note: assert is not a good choice here and you should rather
        use try/except blocks with exceptions. This is just for short syntax.
        """
        assert (type(self.intValue) == int), "intValue parameter must be integer"
        assert (type(self.stringParam) == str), "stringParam parameter must be string"
        self.threshold_ = (sum(X) / len(X)) + self.intValue  # mean + intValue
        return self

    def _meaning(self, x):
        # returns True/False according to fitted classifier
        # notice underscore on the beginning
        return True if x >= self.threshold_ else False

    def predict(self, X, y=None):
        try:
            getattr(self, "threshold_")
        except AttributeError:
            raise RuntimeError("You must train classifier before predicting data!")
        return [self._meaning(x) for x in X]

    def score(self, X, y=None):
        # counts number of values bigger than mean
        return sum(self.predict(X))
```
### use the classifier & GridSearch
```python=
from sklearn.model_selection import GridSearchCV
X_train = [i for i in range(0, 100, 5)]
X_test = [i + 3 for i in range(-5, 95, 5)]
tuned_params = {"intValue" : [-10,-1,0,1,10]}
gs = GridSearchCV(MeanClassifier(), tuned_params)
gs.fit(X_train)
print(gs.best_params_) # {'intValue': -10} # and that is what we expect :)
```
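Continuing the snippet above: with the default `refit=True`, the best model is refitted on the whole of `X_train` and exposed as `gs.best_estimator_`, so the otherwise unused `X_test` can be fed to it:
```python=
best_clf = gs.best_estimator_       # MeanClassifier refitted with intValue=-10
print(best_clf.predict(X_test))     # one True/False per element of X_test
print(best_clf.score(X_test))       # number of values at or above the fitted threshold
```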
## Simplification of init method
When there are many initialization parameters, you end up with long-winded code like this:
```python=
def __init__(self, arg1, arg2, arg3, ..., argN):
    self.arg1 = arg1
    self.arg2 = arg2
    # ...
    self.argN = argN
```
This can be shortened with the `inspect` module and `setattr`:
```python=
import inspect

def __init__(self, arg1, arg2, arg3, ..., argN):
    # print("Initializing classifier:\n")
    args, _, _, values = inspect.getargvalues(inspect.currentframe())
    values.pop("self")
    for arg, val in values.items():
        setattr(self, arg, val)
        # print("{} = {}".format(arg, val))
```
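A concrete, runnable version of the same trick, with three made-up parameters (`alpha`, `beta`, `gamma`):
```python=
import inspect

from sklearn.base import BaseEstimator, ClassifierMixin

class VerboseClassifier(BaseEstimator, ClassifierMixin):
    def __init__(self, alpha=1.0, beta=2.0, gamma=3.0):
        # grab every argument of __init__ (except self) and store it as an attribute
        args, _, _, values = inspect.getargvalues(inspect.currentframe())
        values.pop("self")
        for arg, val in values.items():
            setattr(self, arg, val)

print(VerboseClassifier(beta=5).get_params())   # {'alpha': 1.0, 'beta': 5, 'gamma': 3.0}
```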
## References
* [Blog post](http://danielhnyk.cz/creating-your-own-estimator-scikit-learn/)
* [scikit-learn](https://scikit-learn.org/dev/developers/contributing.html#rolling-your-own-estimator)