# EvalML Pipeline Design v2.0
---
Misc. todos:
- Add estimator-less feature selection?
## Goals
PR Comments Round 2:
- COMPONENTS / utils to get components
- rename encoder --> categorical encoder
- remove problem_type parameter
- get_names for feature selector
- move input_feature_names
- automatically determine problem type and model type
- feature_importances as subclass method / property
- remove name, component_type, needs_fitting
- return_dict
PR Comments Round 1:
- Add in update_parameters()
- Implement get_item with model types?
- Abstract out feature importance
    - Move out of the fixed pipeline into PipelineBase
    - Add it to Estimator
    - If there is a SelectFromModel, take it from there
- Make Transformer subclasses
    - Feature Selector
    - Potentially think about others
- Check default arguments - done
    - Lists and dicts
- Remove init - done
- Logger - done
- Rename hyperparameter - done
- Make parameter required - done
- SelectFromModel
    - Make it less generic
    - Accept type/name as a string in the component list
- Add model type - done
1. Explain, expose, describe the pipeline
    * Structure + parameters
2. How can users control the structure of the pipeline?
    * Submit their own pipeline?
3. How can users control the parameters?
    * Specific parameters
    * Ranges
4. Inspect, understand, and evaluate the fitted pipeline
---
## Implementation Outline
1. Create Component class and relevant subclasses
    1. Transformers
        a. SimpleImputer
        b. OneHotEncoder
        c. StandardScaler
        d. SelectFromModel (feature selection)
    2. Estimators
        a. Classification
            a. Logistic Regression
            b. RF
            c. XGBoost
        b. Regression
            a. RF Regression
            b. Linear Regression
    3. Subclasses
        a. Enums for model types and component types (see the sketch after this list)
            a. Imputer
            b. Encoder
            c. Scaler
            d. Feature Selection
            e. Classifier
            f. Regressor
2. Refactor current calls from external libraries to Component objects
3. Create Pipeline class
    a. Implement methods
4. Come up with a more complex pipeline
    a. Can our new abstraction handle this complex case well?
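
A rough sketch of the enums from step 1.3 (member names are placeholders, not final API):

```
from enum import Enum

class ModelTypes(Enum):
    LINEAR_MODEL = "linear_model"
    RANDOM_FOREST = "random_forest"
    XGBOOST = "xgboost"

class ComponentTypes(Enum):
    IMPUTER = "imputer"
    ENCODER = "encoder"
    SCALER = "scaler"
    FEATURE_SELECTION = "feature_selection"
    CLASSIFIER = "classifier"
    REGRESSOR = "regressor"
```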
---
## Design
### Explain, expose, describe pipeline
* Objective(s)
* Components and their parameters
* Results:
    * Training time
    * Scores
    * high_variance_cv
    * all_objective_scores
* Features:
    * Top features
    * Number of features
### Save and export pipeline
* Pickle
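
A minimal sketch of what pickle-based save/export could look like (the helper names are hypothetical):

```
import pickle

def save_pipeline(pipeline, file_path):
    # serialize a fitted Pipeline to disk
    with open(file_path, "wb") as f:
        pickle.dump(pipeline, f)

def load_pipeline(file_path):
    # restore a previously saved Pipeline
    with open(file_path, "rb") as f:
        return pickle.load(f)
```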
---
## Implementation / API
### Pipeline
A Pipeline can be described as a series (list) of Component objects. The last Component must implement a fit() method, and all Components before the last must implement a transform() method.
* Holds reference to Components
* Holds after-training information
#### Fields
```
name: string
model_type: enum
results: dict
    score, high_variance_cv, scores, all_objective_scores, training_time
    would resolve https://github.com/FeatureLabs/evalml/issues/51
pipeline_components: list
    list of Components that make up this Pipeline
problem_type: list of ProblemTypes
    is this necessary? should it be outside the scope of the class?
objective: ObjectiveBase
random_state: int
n_jobs: int
```
#### Methods
```
__init__(component_list, hyperparameters)
    component_list: list
        list of Components or str (of component names)
        (save for later?) also accept a dict, enabling users to choose what
        they want for specific steps along the pipeline by passing in component_types
    hyperparameters: dict {str: dict {str: value}}
        dict mapping Component name to a dictionary of its hyperparameters
    also checks that component_list is valid
    (has an Estimator as the last Component)
get_component(i)
    i: int
    gets the Component object at index i in pipeline_components
    name vs. component type?
    override list indexing behavior
describe()
    move from describe_pipeline, + more structured
feature_importance()
    delegates to the Estimator's feature_importances
fit(X, y, objective_fit_size)
predict(X)
score(X, y, other_objectives)
predict_proba(X)
```
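
To make the method list concrete, a hypothetical end-to-end call sequence (assumes a constructed `pipeline` and training data `X`, `y`; the objective names are illustrative):

```
pipeline.fit(X, y, objective_fit_size=0.2)   # train, holding out data to tune for the objective
y_hat = pipeline.predict(X)
proba = pipeline.predict_proba(X)
score = pipeline.score(X, y, other_objectives=["auc", "f1"])
imputer = pipeline.get_component(0)          # Component at index 0 in pipeline_components
pipeline.describe()                          # structure, parameters, and results
```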
---
### Component (base class)
#### Fields
```
name: string (maybe enum)
component_type: str or enum: "encoder", "imputer", "scaler", "normalizer", "classifier", "regressor"
_needs_fitting: bool
objective (?)
_component_object: obj or None
random_state: int (?)
```
#### Methods
```
Handle _component_object errors in the following methods:
__init__(self, hyperparameters, _needs_fitting=False, _component_object=None)
fit(self, X, y=None, objective_fit_size): uses the fit function of the _component_object, or has its own fit function for a custom component
transform(self, X): uses the transform function of the _component_object, or has its own transform function
fit_transform(self, X, y=None): combines fit and transform in one call
predict(self, X): given X, produce list y_hat
score(self, X, y, other_objectives): produce a scalar score given X and y
predict_proba(self, X): given X, produce a list of p(y_hat)
```
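
A minimal sketch of the delegation pattern above: methods forward to `_component_object`, erroring out when the wrapped object lacks the method (the helper name and error type are assumptions):

```
class ComponentBase:
    def __init__(self, name, component_type, hyperparameters,
                 needs_fitting=False, component_object=None, random_state=0):
        self.name = name
        self.component_type = component_type
        self.hyperparameters = hyperparameters
        self._needs_fitting = needs_fitting
        self._component_object = component_object
        self.random_state = random_state

    def _call_component(self, method, *args, **kwargs):
        # error out if the wrapped object does not implement the method
        if self._component_object is None or not hasattr(self._component_object, method):
            raise RuntimeError("Component %s has no %s() implementation" % (self.name, method))
        return getattr(self._component_object, method)(*args, **kwargs)

    def fit(self, X, y=None):
        return self._call_component("fit", X, y)

    def transform(self, X):
        return self._call_component("transform", X)

    def fit_transform(self, X, y=None):
        self.fit(X, y)
        return self.transform(X)
```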
### Estimator (subclass)
#### Fields
```
name: string
component_type: str or enum
_needs_fitting: bool
_component_object: obj
+ its own parameters
```
#### Methods
```
fit(self, X, y=None, objective_fit_size)
predict(self, X)
score(self, X, y, other_objectives)
predict_proba(self, X)
```
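
As one possible concrete Estimator, wrapping scikit-learn's LogisticRegression as the `_component_object` (builds on the ComponentBase sketch above; class and parameter names are illustrative):

```
from sklearn.linear_model import LogisticRegression as SkLogisticRegression

class LogisticRegressionClassifier(ComponentBase):  # would subclass Estimator
    def __init__(self, penalty="l2", C=1.0, random_state=0):
        super().__init__(
            name="Logistic Regression Classifier",
            component_type="classifier",
            hyperparameters={"penalty": penalty, "C": C},
            needs_fitting=True,
            component_object=SkLogisticRegression(penalty=penalty, C=C,
                                                  random_state=random_state),
        )

    def predict(self, X):
        return self._call_component("predict", X)

    def predict_proba(self, X):
        return self._call_component("predict_proba", X)
```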
### Transformer (subclass)
#### Fields
```
name: string
component_type: str or enum
_needs_fitting: bool
_component_object: obj
+ its own parameters
```
#### Methods
```
fit(self, X, y=None, objective_fit_size)
transform(self, X)
fit_transform(self, X, y=None)
```
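
Similarly, a concrete Transformer could wrap scikit-learn's SimpleImputer (again building on the ComponentBase sketch; names are illustrative):

```
from sklearn.impute import SimpleImputer as SkSimpleImputer

class SimpleImputer(ComponentBase):  # would subclass Transformer
    def __init__(self, impute_strategy="most_frequent"):
        super().__init__(
            name="Simple Imputer",
            component_type="imputer",
            hyperparameters={"impute_strategy": impute_strategy},
            needs_fitting=True,
            component_object=SkSimpleImputer(strategy=impute_strategy),
        )
```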
---
## Questions and Other Notes
* What should describe_pipeline look like in different states (pre-initialized, during training, post-training, etc.)?
    * Flag for CV training vs. the full dataset?
    * Testing vs. final pipeline
* Should Pipeline hold a reference to AutoML objects?
    * Make it the job of the AutoML object to update results after training
* Pipeline parameters
    * Since a pipeline should represent one instance of an ML pipeline, I don't believe it should have the ability to take in ranges of parameters. Keeping track of ranges should be the responsibility of the AutoML object.
* Distinction between AutoML and Pipelines
    * AutoML should:
        * Select and optimize hyperparameters
        * Compare pipelines
        * Build different pipelines
    * Pipelines should:
        * Train on data
        * ~~Test using cross validation~~
        * Predict given new data
        * Be initialized given different components
* Is there any merit in having subclasses for components?
    * Could be useful if we used subclasses to determine component-specific functionality / fields?
* Hyperparameter ranges: store in Components/Pipeline but enable override
* Organization of hyperparameters
    * For the first pass, stick with the current approach (SKOpt)
* Cross-validation:
    * Pull it out from AutoML
    * Have it as a separate piece that can stand on its own
* User stories / workflow:
    * User wants to create a new pipeline with specific hyperparameters for a component:
        1. User initializes the pipeline via Pipeline(list, hyperparameters), as sketched below
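
For example (component names and the hyperparameter dict shape follow the `__init__` section above; all values are illustrative):

```
# pin hyperparameters for one component; the rest use their defaults
pipeline = Pipeline(
    component_list=["Simple Imputer", "One Hot Encoder", "Random Forest Classifier"],
    hyperparameters={"Random Forest Classifier": {"n_estimators": 500}},
)
```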
---
# TODO:
- Cross-validation
- How to deal with hyperparameter ranges, etc.:
    - Ranges
    - Distinct values
- Test pipelines and components
    - Test each component [Done]
    - Feature importance [Done]
    - Actually test holistically
    - Test use cases
- "guardrails" [Done]
- Validate estimator at end
- Multiple of each type of component
    - Order of components?
    - Imputer and encoder first?
- file/folder structure [Done]
- Most complex structure:
- ComponentBase, Transformer, Estimator: if component_obj doesn't have a method --> error out

Test use cases:
- needs_fitting = False
- LinearRegressor
- Test when the estimator is not in last place (I'll take this too -Ange)
- Write tests with our own custom estimators
- Classification:
    - binary
    - multiclass
- Regression:
# Done:
- Feature importance for each pipeline
- Generate pipeline name
- Describe pipelines
    - CV information
- describe_component() - AL
    - pull out hyperparameters
- Add validation (?) - AL
    - Validate that the estimator is last, or exists
- Add docstrings - JS
- Component name vs. user-defined name vs. component type - JS
    - overwrite indexing
- Organization of components - JS
- Serialization (?) - JS
- Move hyperparameters from pipeline to components - AL
- remedy AutoML --> hold off until later