# EvalML Pipeline Design v2.0
---
Misc. todos:
- Add estimator-less feature selection?
## Goals
PR Comments Round 2:
- COMPONENTS / utils to get components
- rename encoder --> categorical encoder
- remove problem_type parameter
- get_names for feature selector
- move input_feature_names
- automatically determine problem type and model type
- feature_importances as subclass method / property
- remove name, component_type, needs_fitting
- return_dict
PR Comments Round 1:
- Add in update_parameters()
- Implement get_item with model types?
- Abstract out feature importance
    - Move out of the fixed pipeline into PipelineBase
    - Add it to Estimator
    - If there is a SelectFromModel, take it from there
- Make Transformer subclasses
    - Feature Selector
    - Potentially think about others
- Check default arguments - done
    - Lists and dicts
- Remove init - done
- Logger - done
- Rename hyperparameter - done
- Make parameter required - done
- SelectFromModel
    - Make it less generic
    - Accept type/name as a string in the component list
- Add model type - done
1. Explain, expose, describe the pipeline
    * Structure + parameters
2. How can users control the structure of the pipeline?
    * Submit their own pipeline?
3. How can users control the parameters?
    * Specific parameters
    * Ranges
4. Inspect, understand, and evaluate the fitted pipeline
---
## Implementation Outline
1. Create Component class and relevant subclasses
    1. Transformers
        a. SimpleImputer
        b. OneHotEncoder
        c. StandardScaler
        d. SelectFromModel (feature selection)
    2. Estimators
        a. Classification
            a. Logistic Regression
            b. RF
            c. XGBoost
        b. Regression
            a. RF Regression
            b. Linear Regression
    3. Subclasses
        a. Enums for model types and component types (see the sketch after this list)
            a. Imputer
            b. Encoder
            c. Scaler
            d. Feature Selection
            e. Classifier
            f. Regressor
2. Refactor current calls from external libraries to Component objects
3. Create Pipeline class
    a. Implement methods
4. Come up with a more complex pipeline
    a. Can our new abstraction handle this complex case well?
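
A rough sketch of the enums from step 1.3 (member names are placeholders, not final API):

```
from enum import Enum

class ModelTypes(Enum):
    LINEAR_MODEL = "linear_model"
    RANDOM_FOREST = "random_forest"
    XGBOOST = "xgboost"

class ComponentTypes(Enum):
    IMPUTER = "imputer"
    ENCODER = "encoder"
    SCALER = "scaler"
    FEATURE_SELECTION = "feature_selection"
    CLASSIFIER = "classifier"
    REGRESSOR = "regressor"
```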
---
## Design
### Explain, expose, describe pipeline
* Objective(s)
* Components and their parameters
* Results:
    * Training time
    * Scores
    * high_variance_cv
    * all_objective_scores
* Features:
    * Top features
    * Number of features
### Save and export pipeline
* Pickle
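
A minimal sketch of what pickle-based save/export could look like (the helper names are hypothetical):

```
import pickle

def save_pipeline(pipeline, file_path):
    # serialize a fitted Pipeline to disk
    with open(file_path, "wb") as f:
        pickle.dump(pipeline, f)

def load_pipeline(file_path):
    # restore a previously saved Pipeline
    with open(file_path, "rb") as f:
        return pickle.load(f)
```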
---
## Implementation / API
### Pipeline
A Pipeline can be described as a series (list) of Component objects. The last Component must implement a fit() method, and all Components before the last must implement a transform() method.
* Holds reference to Components
* Holds after-training information
#### Fields
```
name: string
model_type: enum
results: dict
    score, high_variance_cv, scores, all_objective_scores, training_time
    would resolve https://github.com/FeatureLabs/evalml/issues/51
pipeline_components: list
    list of Components that make up this Pipeline
problem_type: list of ProblemTypes
    is this necessary? should it be outside the scope of the class?
objective: ObjectiveBase
random_state: int
n_jobs: int
```
#### Methods
```
__init__(component_list, hyperparameters)
    component_list: list
        list of Components or str (of component names)
        (save for later?) also accept a dict, enabling users to choose what
        they want for specific steps along the pipeline by passing in component_types
    hyperparameters: dict {str: dict {str: value}}
        dict mapping Component name to a dictionary of its hyperparameters
    also checks that component_list is valid
    (has an Estimator as the last Component)
get_component(i)
    i: int
    gets the Component object at index i in pipeline_components
    name vs. component type?
    override list indexing behavior
describe()
    move from describe_pipeline, + more structured
feature_importance()
    delegates to the Estimator's feature_importances
fit(X, y, objective_fit_size)
predict(X)
score(X, y, other_objectives)
predict_proba(X)
```
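
To make the method list concrete, a hypothetical end-to-end call sequence (assumes a constructed `pipeline` and training data `X`, `y`; the objective names are illustrative):

```
pipeline.fit(X, y, objective_fit_size=0.2)   # train, holding out data to tune for the objective
y_hat = pipeline.predict(X)
proba = pipeline.predict_proba(X)
score = pipeline.score(X, y, other_objectives=["auc", "f1"])
imputer = pipeline.get_component(0)          # Component at index 0 in pipeline_components
pipeline.describe()                          # structure, parameters, and results
```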
---
### Component (base class)
#### Fields
```
name: string (maybe enum)
component_type: str or enum: "encoder", "imputer", "scaler", "normalizer", "classifier", "regressor"
_needs_fitting: bool
objective (?)
_component_object: obj or None
random_state: int (?)
```
#### Methods
```
Handle _component_object errors in the following methods:
__init__(self, hyperparameters, _needs_fitting=False, _component_object=None)
fit(self, X, y=None, objective_fit_size): uses the fit function of the _component_object, or has its own fit function for a custom component
transform(self, X): uses the transform function of the _component_object, or has its own transform function
fit_transform(self, X, y=None): combines fit and transform in one call
predict(self, X): given X, produce list y_hat
score(self, X, y, other_objectives): produce a scalar score given X and y
predict_proba(self, X): given X, produce a list of p(y_hat)
```
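
A minimal sketch of the delegation pattern above: methods forward to `_component_object`, erroring out when the wrapped object lacks the method (the helper name and error type are assumptions):

```
class ComponentBase:
    def __init__(self, name, component_type, hyperparameters,
                 needs_fitting=False, component_object=None, random_state=0):
        self.name = name
        self.component_type = component_type
        self.hyperparameters = hyperparameters
        self._needs_fitting = needs_fitting
        self._component_object = component_object
        self.random_state = random_state

    def _call_component(self, method, *args, **kwargs):
        # error out if the wrapped object does not implement the method
        if self._component_object is None or not hasattr(self._component_object, method):
            raise RuntimeError("Component %s has no %s() implementation" % (self.name, method))
        return getattr(self._component_object, method)(*args, **kwargs)

    def fit(self, X, y=None):
        return self._call_component("fit", X, y)

    def transform(self, X):
        return self._call_component("transform", X)

    def fit_transform(self, X, y=None):
        self.fit(X, y)
        return self.transform(X)
```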
### Estimator (subclass)
#### Fields
```
name: string
component_type: str or enum
_needs_fitting: bool
_component_object: obj
+ its own parameters
```
#### Methods
```
fit(self, X, y=None, objective_fit_size)
predict(self, X)
score(self, X, y, other_objectives)
predict_proba(self, X)
```
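
As one possible concrete Estimator, wrapping scikit-learn's LogisticRegression as the `_component_object` (builds on the ComponentBase sketch above; class and parameter names are illustrative):

```
from sklearn.linear_model import LogisticRegression as SkLogisticRegression

class LogisticRegressionClassifier(ComponentBase):  # would subclass Estimator
    def __init__(self, penalty="l2", C=1.0, random_state=0):
        super().__init__(
            name="Logistic Regression Classifier",
            component_type="classifier",
            hyperparameters={"penalty": penalty, "C": C},
            needs_fitting=True,
            component_object=SkLogisticRegression(penalty=penalty, C=C,
                                                  random_state=random_state),
        )

    def predict(self, X):
        return self._call_component("predict", X)

    def predict_proba(self, X):
        return self._call_component("predict_proba", X)
```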
### Transformer (subclass)
#### Fields
```
name: string
component_type: str or enum
_needs_fitting: bool
_component_object: obj
+ its own parameters
```
#### Methods
```
fit(self, X, y=None, objective_fit_size)
transform(self, X)
fit_transform(self, X, y=None)
```
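
Similarly, a concrete Transformer could wrap scikit-learn's SimpleImputer (again building on the ComponentBase sketch; names are illustrative):

```
from sklearn.impute import SimpleImputer as SkSimpleImputer

class SimpleImputer(ComponentBase):  # would subclass Transformer
    def __init__(self, impute_strategy="most_frequent"):
        super().__init__(
            name="Simple Imputer",
            component_type="imputer",
            hyperparameters={"impute_strategy": impute_strategy},
            needs_fitting=True,
            component_object=SkSimpleImputer(strategy=impute_strategy),
        )
```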
---
## Questions and Other Notes
* What should describe_pipeline look like in different states (pre-initialized, during training, post-training, etc.)?
    * Flag for CV training vs. the full dataset?
    * Testing vs. final pipeline
* Should Pipeline hold a reference to AutoML objects?
    * Make it the job of the AutoML object to update results after training
* Pipeline parameters
    * Since a pipeline should represent one instance of an ML pipeline, I don't believe it should have the ability to take in ranges of parameters. Keeping track of ranges should be the responsibility of the AutoML object.
* Distinction between AutoML and Pipelines
    * AutoML should:
        * Select and optimize hyperparameters
        * Compare pipelines
        * Build different pipelines
    * Pipelines should:
        * Train on data
        * ~~Test using cross validation~~
        * Predict given new data
        * Be initialized given different components
* Is there any merit in having subclasses for components?
    * Could be useful if we used subclasses to determine component-specific functionality / fields?
* Hyperparameter ranges: store in Components/Pipeline but enable override
* Organization of hyperparameters
    * For the first pass, stick with the current approach (SKOpt)
* Cross-validation:
    * Pull it out from AutoML
    * Have it as a separate piece that can stand on its own
* User stories / workflow:
    * User wants to create a new pipeline with specific hyperparameters for a component:
        1. User initializes the pipeline via Pipeline(list, hyperparameters), as sketched below
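
For example (component names and the hyperparameter dict shape follow the `__init__` section above; all values are illustrative):

```
# pin hyperparameters for one component; the rest use their defaults
pipeline = Pipeline(
    component_list=["Simple Imputer", "One Hot Encoder", "Random Forest Classifier"],
    hyperparameters={"Random Forest Classifier": {"n_estimators": 500}},
)
```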
---
# TODO:
- Cross-validation
- How to deal with hyperparameter ranges, etc.:
    - Ranges
    - Distinct values
- Test pipelines and components
    - Test each component [Done]
    - Feature importance [Done]
    - Actually test holistically
    - Test use cases
- "guardrails" [Done]
- Validate estimator at end
- Multiple of each type of component
    - Order of components?
    - Imputer and encoder first?
- file/folder structure [Done]
- Most complex structure:
- ComponentBase, Transformer, Estimator: if component_obj doesn't have a method --> error out

Test use cases:
- needs_fitting = False
- LinearRegressor
- Test when the estimator is not in last place (I'll take this too -Ange)
- Write tests with our own custom estimators
- Classification:
    - binary
    - multiclass
- Regression:
# Done:
- Feature importance for each pipeline
- Generate pipeline name
- Describe pipelines
    - CV information
- describe_component() - AL
    - pull out hyperparameters
- Add validation (?) - AL
    - Validate that the estimator is last, or exists
- Add docstrings - JS
- Component name vs. user-defined name vs. component type - JS
    - overwrite indexing
- Organization of components - JS
- Serialization (?) - JS
- Move hyperparameters from pipeline to components - AL
- remedy AutoML --> hold off until later