# Graph pipeline
###### tags: `sktime`
# 2023-08-11
## Agenda
* Test issue
## Test Issue
* StepInformation needs to be adapted, although it is set in init
* Possible Fixes:
* add_step is not mutating but returns a fresh object.
* add_step calls init, and init does the checking!
* add_step then returns the fresh object.
* All objects are checked in init.
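A minimal sketch of the proposed fix, with assumed names (`Pipeline`, `add_step(skobject, name, edges)` are illustrative, not the final API): `add_step` builds a fresh object, and all validation lives in init.

```python
# Illustrative sketch only; class name and add_step signature are assumptions.
class Pipeline:
    def __init__(self, steps=None):
        self.steps = list(steps) if steps is not None else []
        self._validate()  # all objects are checked in __init__

    def _validate(self):
        for step in self.steps:
            if not hasattr(step["skobject"], "fit"):
                raise TypeError(f"step {step['name']!r} does not implement fit")

    def add_step(self, skobject, name, edges):
        # not mutating: construct a fresh Pipeline, which re-runs the checks in __init__
        new_steps = self.steps + [{"skobject": skobject, "name": name, "edges": edges}]
        return Pipeline(steps=new_steps)
```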
# 2023-07-21
MVP graph pipeline for presentation
## FK thoughts
* must have: forecasting pipeline
* examples from pydata global 2022 in the "outlook"
* incl proba forecast
* should have: index-based features like holiday, season
* must have: feature union, column transformer, sequential pipelines all expressible as graphical
* nice to have: conversion of nested into graphical (syntactic or eager)
* must have: example with grid search
* grid search "inside" the pipeline
* grid search "around" the pipeline
* should have: grid search integrates "nicely"
* must have: example with reduction
* should have: reduction integrates "nicely"
* should have: also works for time series classification
* must have (for tutorial) - notebook with vignettes, tested
* nice to have: distributed/parallel
* nice to have: inspection of partial results, e.g., after transformers
* nice to have: inspection of variable names expected in output pandas
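For the grid search "around" the pipeline item: a runnable reference using today's linear pipeline and the existing sktime tuning classes; the assumption is that the graph pipeline will slot into `ForecastingGridSearchCV` in the same way.

```python
# Grid search "around" a pipeline with current sktime classes; the graph
# pipeline is assumed to drop in wherever the linear pipeline is used here.
from sktime.datasets import load_airline
from sktime.forecasting.compose import TransformedTargetForecaster
from sktime.forecasting.model_selection import (
    ExpandingWindowSplitter,
    ForecastingGridSearchCV,
)
from sktime.forecasting.naive import NaiveForecaster
from sktime.transformations.series.detrend import Deseasonalizer

y = load_airline()
pipe = TransformedTargetForecaster(
    steps=[("deseasonalize", Deseasonalizer(sp=12)), ("forecast", NaiveForecaster())]
)
gscv = ForecastingGridSearchCV(
    forecaster=pipe,
    cv=ExpandingWindowSplitter(fh=[1, 2, 3], initial_window=36),
    param_grid={"forecast__strategy": ["last", "mean", "drift"]},
)
gscv.fit(y)
y_pred = gscv.predict(fh=[1, 2, 3])
```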
## BH thoughts
* BigDEAL Competition Pipeline? - as an advanced example
* Would require implementing additional models?
* Not sure if we are allowed to push the data to a repo...
* Complicated Grid Search Example?
* Multiple feature extraction paths.
* Custom/simple Ensemble that aggregates multiple forecasts (no composition!)
## State of the Graph-Pipeline
### Already Working
[Regression tests implementing simple pipelines with a graph pipeline](https://github.com/sktime/sktime/blob/fbad37c3dddcb7ddf4e6d38b1da3d9c9bb64fd61/sktime/pipeline/tests/regression_tests/test_pipeline_regression.py)
* Forecasting pipeline
* Classification pipeline
* Feature Union
* Probabilistic forecasts
### Needs to be tested - But will probably work (fingers crossed)
* index-based features
* grid search around pipeline
### Needs to be tested - Unsure if it works
* grid search inside of a pipeline
* complicated grid search example
### Needs to be implemented
* conversion of nested pipelines into the graph pipeline
* BigDEAL model (probably multiple models need to be added.)
* might be worth an issue to spell this out
* doesn't need to be just one person implementing
* Notebook with examples!
### Other open todos
* Documentation
* Unit testing of steps
# 2023-04-28
Goal:
* Go through the detailed open questions of the STEP.
https://github.com/sktime/enhancement-proposals/pull/33
Agenda:
* going through open design questions
* planning implementation
### open questions from STEP
* Should we require the user to state explicitly that they want to execute inverse_transform, and to specify at which position (by adding the step a second time to the pipeline)?
* FK - this is a hard question design-wise, i.e., if and how the user provides that information
* for this reason, FK would suggest to completely ignore `inverse_transform` in the initial draft
* let's make sure we don't lock ourselves out from options, but not deal with the logic in the first iteration
Example for Later:
* Scaler -> Forecaster1
* Forecaster -> Postprocessing
* Scaler -> Postprocessing
* Do we call inverse_transform, and where?
* FK idea: calling `add_step` with the name of an existing estimator tells the framework to use the identity. So in principle this can be covered by extending existing methods and may not need deeper changes to the framework if we decide to work on this later.
* decision: deal with `inverse_transform` in the second iteration only
* Testing Strategy, Decision Needed!
* TDD or partial TDD using the simple examples?
* FK: preference is for partial TDD and then adding coverage at the end - probably faster than full TDD? no strong preference
* **Do we allow Mocking?**
* BH - are we using mocking in sktime already?
* FK: not to the extent one would like, but there are mock classes that are and can be used in testing, in `sktime.utils.estimators`
* originally, we wanted to use these to test the boilerplate layer
* BH - in pywatts, we used magic mocks
* FK: preference is for using both mocks and integration tests with actual pipeline elements, i.e., mocks and examples
* using mocks might be quite some work, so perhaps starting with examples only and then increasing complexity of tests might get us somewhere quicker?
* especially since the architecture might change
* BH - advantage of mocks is that multiple people can implement, one based on mock classes
* FK - should we go with what the pywatts framework is doing
* -> use mocks (see the test sketch below)
* Development:
* Where do we implement?
* Dev branch for the pipeline, into which we merge additional branches.
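Following the "-> use mocks" decision above, a sketch of what a mock-based unit test could look like (the `Pipeline`/`add_step` API is an assumption, as in the toy sketch near the top of this note):

```python
# Sketch of a mock-based unit test; Pipeline and add_step are assumed APIs.
import pandas as pd
from unittest.mock import MagicMock


def test_pipeline_fits_added_step():
    mock_step = MagicMock()              # stands in for a transformer/forecaster
    y = pd.Series([1.0, 2.0, 3.0])

    pipeline = Pipeline()
    pipeline = pipeline.add_step(mock_step, name="step", edges={"y": "y"})
    pipeline.fit(y=y)

    mock_step.fit.assert_called_once()   # the step was fitted exactly once
```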
Work plan
* action FK: consolidate https://github.com/sktime/sktime/issues/3023 into test examples
* should be "simplest examples"
* covering current sktime pipelines (e.g., TransformedTargetForecaster, ForecastingPipeline, feature union) as well as pipelines not possible with sktime
* should be PR to the STEP or additions to the STEP
* actually write code for the examples
* action BH: create a folder structure for the dev branch.
* folders for pipeline, tests, mocks, and test examples for comparison (current sktime vs graph pipeline)
* pipeline stub (stage 1)
# 2023-03-31
Agenda
1. presentation of the partial STEP by benheid
2. planning of STEP creation/writing
Notes
planning of STEP writing
Scope
* BH: what should the STEP cover? (technical solution)
* FK: better to start small, smallest non-trivial increment/improvement that covers the design intention
* ? pipeline with forecasters and transformers, nothing more
* functional requirements
* inclusive of `TransformedTargetForecaster` linear pipeline
* inclusive of `ForecastingPipeline` linear pipeline
* inclusive of `TransformerPipeline`
* inclusive of `FeatureUnion` and variable subsetting
* plus "triangle" graphical situation ->?
* BH: this is more than the linear cases; requires execution order as in pywatts
* FK: but without it would not be truly graphical?
* only one output step! (must be transformer or forecaster)
* FK: agreed
* implications on functionality
* this would be polymorphic only as a single transformer or forecaster
* covers simple truly graphical situations beyond current sktime functionality
* at most one forecaster in the pipeline, rest is transformers
* can use pywatts graph resolution algos
* artefacts and functional units to define
* top-level pipeline object
* steps drafting, `add_step`
* maybe: step information?
* type inference (forecaster or transformer?)
* graph representation
* Perhaps an indirection layer is necessary to facilitate the implementation.
* graph resolution
* must include whether transformers are pre or post
* transformer outputs "skipping" the forecaster, e.g., a lagger
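A toy skeleton relating the units above; all names, fields, and the type inference rule are illustrative assumptions, not the final design.

```python
# Toy skeleton of the functional units listed above; everything here is illustrative.
from dataclasses import dataclass, field

from sktime.forecasting.base import BaseForecaster


@dataclass
class StepInformation:
    skobject: object   # the wrapped sktime estimator
    name: str
    edges: dict        # input edges, e.g. {"X": "lagger", "y": "y"}


@dataclass
class Pipeline:
    # graph representation: step name -> StepInformation
    steps: dict = field(default_factory=dict)

    def add_step(self, skobject, name, edges):
        self.steps[name] = StepInformation(skobject, name, edges)
        return self

    def _infer_type(self):
        # type inference: forecaster if the graph contains a forecaster, else transformer
        has_forecaster = any(
            isinstance(s.skobject, BaseForecaster) for s in self.steps.values()
        )
        return "forecaster" if has_forecaster else "transformer"

    def _resolve_order(self):
        # graph resolution: topological order over the edges (to be implemented,
        # e.g. re-using the pywatts resolution logic)
        raise NotImplementedError
```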
Content
* high-level code design
* class/code architecture
* dispatch mechanism
* indirection layer
* component handling
* re-use
* sktime base class layer
* pywatts graph logic
* interface compatibility
* compliance
* testing (e.g., as a transformer)
* user journey design
* UX considerations, principles
* simplicity vs expressivity
* intuitiveness (specification)
* compare to `make_pipeline` or dunder
* specification incl `add_step`
* graph resolution
* state mutation?
* dunders?
* dunder resolution?
* simple examples
* linear examples
* feature union
* variable subsetting
* "skipping the forecaster" (e.g., with lags)
* triangle
* comparison to linear pipelines
* 1:1 comparison of specification code (see the sketch after this list)
* architectural
* implementation plan
* logical units
* can be worked on separately?
* tests of sub-units
* interface compatibility
* re-use or import of existing functionality
* TDD or partial TDD using the simple examples
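For the 1:1 specification comparison: the linear side below is current sktime; the graphical side uses an assumed `Pipeline().add_step(skobject, name, edges)` API with assumed edge semantics.

```python
# 1:1 comparison sketch: current linear specification vs assumed graphical one.
from sktime.forecasting.compose import TransformedTargetForecaster
from sktime.forecasting.naive import NaiveForecaster
from sktime.transformations.series.detrend import Deseasonalizer

# current sktime: linear TransformedTargetForecaster
linear = TransformedTargetForecaster(
    steps=[("deseasonalize", Deseasonalizer(sp=12)), ("forecast", NaiveForecaster())]
)

# graph pipeline (assumed API, the class under design here): same pipeline as a graph
graph = (
    Pipeline()
    .add_step(Deseasonalizer(sp=12), name="deseasonalize", edges={"X": "y"})
    .add_step(NaiveForecaster(), name="forecast", edges={"y": "deseasonalize"})
)
```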
Design questions
* high-level
* layer design
* components - classes, methods
* separate graph from estimator logic?
* validation
* re-use
* pywatts, sktime
* DAG logic
* extensibility
* without actually specifying, but thinking whether it is easy
* to more estimator types
* to special composition patterns
* e.g., "use index as features" = "different types of vertices"
* e.g., parameter plugin via `ParamPluginForecaster` = "different types of edges"
* Sklearn Compatibility? I.e. should the pipeline also work for tabular data?
* FK: good idea
* so type inference etc already exists for sklearn
* -> reuse?
* In which repository do we work?
* Depends on the extensibility
* Should we aim to support graph pipelining for sklearn?
* FK: good idea! sklearn is a much bigger use case!
* problem is, sklearn has worse architecture and will be more fiddly
* we could first try sktime then sklearn? (maybe by then sklearn adopts skbase :-) after our April presentation)
* graph pipelines for sklearn must exist though??
* FK: happy with either solution, i.e., separate module or directly in sktime - your preference
### actions
FK -> https://github.com/sktime/sktime/issues/4413
BH -> STEP cont
next meeting: Apr 21
# 2023-03-24
Agenda
1. high-level points
* new design - "compile" design
* what to do with this?
* pro/con point - interface compatibility of result
* can we use the graph pipeline as an `sktime` component?
* FK: do we rule out duck typing alt?
2. lower-level points
* any details need clarification? (from pros/cons)
* writing a STEP
* but which design to focus on? -> make a call
* try consensus
* if disagreed, Benedikt decides
* then rough implementation plan
* can start off existing code snippets
Notes
1. high-level points
* new design - "compile" design
* what to do with this?
* BH: extension or variant of dyn inheritance
* because dyn inheritance happens via `compile`
* would still require "translation" step at the end
* same principle as with dyn inheritance, as it would produce a base class descendant
* FK: agree, not that new, so park it and reconsider if we go with dyn inh
* pro/con point - interface compatibility of result
* can we use the graph pipeline as an `sktime` component?
* in the three cases
* FK: context, some (heterogeneous, non-polymorphic) composites do type checks via the `registry.scitype` utility, this is based on inheritance from base class
* example would be `TransformedTargetForecaster` that needs to identify which of its components is the forecaster
* e.g., if it is inheriting from `BaseForecaster`, we know it's a forecaster, etc
* would require some way to do this check for heterogeneous composites
* `sklearn` solves this by having a tag/attribute that carries the type, instead of checking inheritance (they use the duck typing approach for pipeline)
* question: does approach violate implicit contract assumption on `sktime` estimators?
* we can of course change the `sktime` contract (will affect composites only), but we should be explicit which options require a contract change for graph pipeline to be compatible!
* corresponding interface query is, to the pipeline: "what are you?"
* BH: none of the current solutions seem convincing
* FK: but as soon as it inherits from a base class, it complies with the contract expectation for composites
* BH: there are intermediate cases as well, no clear type to it
* FK: could ignore for a moment and accept we have one potentially non-composition compliant class - graph pipeline is anyway the likely "top-level composite"?
* BH: good idea, let's revisit later, seems a special point
* FK: there will probably be consequences to the choice, but they will be made more explicit
* agreed that we have the problem with all designs - either non-compliance, or dealing with intermediate/mixed pipelines, or both
* FK: do we rule out duck typing alt?
* BH: agreed, uses deep inspection/python
2. lower-level points
* any details need clarification? (from pros/cons)
* no
* writing a STEP
* but which design to focus on? -> make a call
* try consensus; if disagreed, Benedikt decides
* FK would go for duck-typing based on pragmatic points
* less complex to understand and maintain
* can implement directly, plan is clear
* dyn inheritance is the "principled" way, but unclear how exactly to do it right; will incur complexity
* duck-typing is how sklearn does it, people familiar with the concept
* only requires args/kwargs knowledge
* who else than us could maintain it?
* unless we can hide dyn behind package
* solves `add_steps` and untyped object
* BH
* agrees on point of complexity, maintainability
* `add_steps` would also be ok in dyn programming, could be hidden behind call dunder
* duck-typing is easier to implement
* prefers duck-typing
* -> agreed, duck-typing is the plan
* then rough implementation plan
* STEP - based on BH prototype -> action BH
* implementation plan -> FK input
* FK to take care of interface compatibility & type/integration functionality
* BH graph pipeline logic (from or wraps pywatts)
* can start off existing code snippets
* two-step approach:
* first linear pipeline, ensure class dispatch works
* try to layer out the pipeline logic!
* i.e., can be switched out to graphical logic
* then graph pipeline, using internal layer
* review proposed plan in meeting 2023-03-31
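A toy sketch of the layering idea above: the pipeline logic sits behind an internal layer, so the linear resolution can later be switched out for graphical resolution (all names illustrative).

```python
# Toy sketch of "layering out" the pipeline logic; all names are illustrative.
class _LinearLogic:
    def resolve(self, steps):
        # linear case: execution order is simply the order of the steps
        return list(steps)


class _GraphLogic:
    def resolve(self, steps):
        # graphical case: topological order over the step edges (to be implemented)
        raise NotImplementedError


class Pipeline:
    def __init__(self, steps, logic=None):
        self.steps = steps
        self._logic = logic or _LinearLogic()   # internal layer, switchable later

    def fit(self, y=None, X=None):
        for step in self._logic.resolve(self.steps):
            ...  # dispatch to the step's fit/transform according to its type
        return self
```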
#### FK opinions on 2023-03-17 post discussion
* would rule out duck typing alternative design (not duck typing prime) because of "hacking the language"
* FK would base this on this single contra point having weight
* this is manual implementation of inheritance
* copying methods in manually seems brittle
* new design idea: "graph pipeline class" - entirely outside the base class framework - with a `compile` method that results in a base class descendant. From discussion on 2023-03-17.
* separates construction from specification
* construction happens separate from specification, and after specification
* solves the `add_step` problem?
* familiar to users with `keras`/`tensorflow` background
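A self-contained toy sketch of the "compile" idea (stand-in base classes, all names illustrative): specification happens on a plain drafting object, and construction of a base-class descendant happens only in `compile`.

```python
# Toy sketch of the "compile" design; base classes are stand-ins, not sktime's.
class BaseForecaster: ...
class BaseTransformer: ...


class _ForecasterPipeline(BaseForecaster):   # hypothetical concrete result classes
    def __init__(self, steps):
        self.steps = steps


class _TransformerPipeline(BaseTransformer):
    def __init__(self, steps):
        self.steps = steps


class GraphPipelineDraft:
    """Drafting class, deliberately outside the base class framework."""

    def __init__(self):
        self.steps = []

    def add_step(self, skobject, name, edges):
        # free to mutate: the draft is not yet an estimator
        self.steps.append((skobject, name, edges))
        return self

    def compile(self):
        # construction is separated from (and happens after) specification
        has_forecaster = any(isinstance(s, BaseForecaster) for s, _, _ in self.steps)
        cls = _ForecasterPipeline if has_forecaster else _TransformerPipeline
        return cls(self.steps)
```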
# 2023-03-17
Plan for 2023-03-24 - fill in / complete the below
prepare for the design discussion; the aim is to fix a design direction for further implementation and STEP/design concretisation
### reference material
design prototype ducktyping https://github.com/sktime/sktime/pull/4341
there are two designs in PR 4341
* base design: duck-typing using dispatch and kwargs
* alt design: methods are copied in `__init__`, "manual dynamic inheritance"
* FK: is this "hacking the language"?
design prototype polymorphism https://github.com/sktime/sktime/pull/3108
### pros/cons points
* user experience (user journey, simplicity, similarity to sklearn)
* compatibility with "graph pipeline" designs
* maintenance, maintainability
* cognitive complexity of the code (design, structure of code)
* on face value; oo "patterns"
* how "pythonic" it is
* using python native patterns vs "language hacking"
* extension or change cases
1. new base class
2. new method for existing base class
3. method signature changes for existing base class & method
FK: new point! important, forgot this
* compatibility with `sktime` interface contract
### user experience
-> looks like exactly the same syntax? therefore identical UX
```python
p = Pipeline(steps, stuff)
p.fit(data, params)
p.method(more_data)
```
in both cases, "which methods the pipeline has" is only clear after construction (for the object), not for the class
### compatibility "graph pipeline"
`add_step` method is the key element.
* design may require this to be self-mutating
* FK Q: do we want `add_step` to be mutating, design-wise?
type inference of the "final" type seems to be the same in both
* likely a private method that produces an output of the inferred type
#### dyn inheritance
* dynamic inheritance: parent is constructed during `__new__`, so `add_step` would have to mutate the parent class of `self`, which may not be fixable
* FK, option: delegate held as an attribute, reconstructed each time `add_step` is executed. But then the pipeline would not inherit from the "right base class"; only the wrapped class would
* FK, option: `add_step` is not mutating (but pure/functional), i.e., it does not return `self` but a new object
* BH, option: introducing a `compile` method (like in `keras`), then this compile method can return the correct new object.
* FK: this is "drafting class", and polymorphic construction happens after `compile` (but do we really need a polymorphic class then, as we could simply construct a "right type class")
* -> 3rd alt design?
* locked in to being one of the existing types, unclear how to deal with custom heterogeneous type (multiple outputs)
* FK: could be "extra base class" or behave like an "other" dynamic parent
#### duck-typing
* duck-typing: `add_step` can be mutating
* duck-typing does *not* inherit from "the right base class", it lives a separate life but is interface compliant
* alt design: has elements of inheritance, but not explicit
* seems easier to deal with heterogeneous graph pipelines, e.g., forecaster and classifier as part of the same pipeline
* e.g., custom logic where `predict` returns a tuple of returns, one per ending point
#### option 3 (duck-typing alt)
### Maintenance
#### cognitive complexity of the code (design, structure of code)
**Ducktyping**
* BH: Requires knowledge about inspection.
* BH: The alternative design is complex and potentially confusing due to the inspection combined with copying.
* BH: Self-implemented partial inheritance
* FK: yes, I think it is horrific in this respect. Good idea, but it "hacks the language", it goes "too deep". If Python supported this natively, it would be nice
**Dynamic Inheritance**
* BH: Needs expertise in object orientation.
* BH: Needs detailed knowledge about how an object in Python is built (first `__new__` is called, ...). Needs knowledge about how to construct a new type.
* BH: Rather unknown concept?
* BH: Potentially less error prone than the ducktyping approach if the types are strictly checked?
* FK: yes, re "needs oo experience", but: oo is central to python
* so, assuming someone with oop background it seems simple
* cognitive complexity of duck typing requires to understand extension case
* extension case seems hard to understand, more steps to follow
* indeed, would agree that `__new__` is very arcane
* but the extension case is simpler?
#### how "pythonic" it is?
**duck-typing**
* BH: Is very pythonic.
* FK: agreed about the prime variant, although that's about the *external* interface rather than the *internal* interface
* FK: not the alt variant, see above
**dynamic inheritance**
* FK: not sure - seems to be unusual
#### Extension cases:
##### 1 new base class
**duck-typing**
* requires add to base class register & coverage in ?
* BH: Not sure if this is true. If the new base class is compliant with the existing interface and does not introduce new methods that the ducktyping pipeline does not look for, it should work.
* Reason: if a new class implements fit and predict with a new signature, the ducktyping approach checks whether the new class has fit and predict methods, and verifies at run time whether the new signature is satisfied by the arguments passed to the pipeline.
* FK: yes, but the signature could be different. I was talking about a general case, let's think `fit(X,y,Z)` in a new class `BaseNewClass`.
* oh, I see, that would be covered by `kwargs` as long as the call situation is same as the signature.
* so it would only be a problem if the new base class has a new method, as you say
**polymorphism:**
* requires add to base class register & class inheritance parents
FK: unclear which is better
##### 2 new method for existing base class
**duck-typing prime:**
* requires manual add of an entirely new method and logic for pipeline
**duck-typing alt:**
* requires manual add to "allow list" of methods
**polymorphism:**
* propagates through via inheritance
Polymorphism comes out better here.
##### 3 method signature changes for existing base class & method
**duck-typing prime:**
* propagates through via inspection.
**duck-typing alt:**
* propagates through via copying and inspection.
**polymorphism:**
* propagates through via inheritance
FK: looks comparable, no pros/cons here?
# 2023-03-07
## Ducktyping vs Polymorphism
### Polymorphism [15min]
prototype in
https://github.com/sktime/sktime/pull/3108
issues:
* type in constructor, cannot be an object variable
* super calls get messed up
* not compatible with abc
* probably not possible to change type via method, e.g., via `add_step` that would change, say, from trafo to forecaster
* because `reset` calls `__init__` but not `__new__` (to be verified)
* can an object replace itself (mutating, reference constant) with an object of another class??
* `reset` does a pseudo-init, but works only if object is of *same* class
* could also be solved by `add_step` not being mutating, but returning a new object (constructed from scratch)
advantages:
* type is linked to class, boilerplate is directly inherited, no dispatch needed
* polymorphism also useful for delegators
### Duck typing [15min]
same approach as `sklearn` `Pipeline` - switches on methods on demand, i.e., if the estimator thinks it should have the method
issues:
* not extensible - adding one more base class needs changes in the duck typed class
* how would inheritance of templated methods work? the `fit` body is different depending on whether we are dealing with `BaseTransformer` vs `BaseForecaster`
* `fit` would have `args` and `kwargs` and dispatch to the template `fit`-s of base classes
* same for any method that exists in at least two base classes
* could also address this with unified method dispatch logic, needs to read the register of base classes (and inspect their methods) - see the sketch at the end of this subsection
* pipeline also needs to be usable in composition, so inspectability of "what type are you" needs to be thought about
* naive implementation breaks inspectability, as we currently check for type via "what base class do you inherit from"?
* and this approach would not inherit from a specific base class (but instead dispatch to base classes and its steps)
advantages:
* can deal with pipelines that have two or more "output levels"
* i.e., pipelines or partial pipelines that do not have a clear base class, e.g., m-in, n-out networks
* duck-typing is pythonic
* works with `add_step` being mutating
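A toy sketch of the args/kwargs dispatch idea from the issues list above (stand-in base classes, illustrative only):

```python
# Toy sketch of unified method dispatch in a duck-typed pipeline; stand-in classes.
class BaseTransformer:
    def fit(self, X, y=None): ...


class BaseForecaster:
    def fit(self, y, X=None, fh=None): ...


class DuckPipeline:
    def __init__(self, steps):
        self.steps = steps

    def _final_type(self):
        # "what type are you": decided by the steps, not by inheritance
        if any(isinstance(s, BaseForecaster) for s in self.steps):
            return BaseForecaster
        return BaseTransformer

    def fit(self, *args, **kwargs):
        # dispatch on the inferred type; the concrete signature is only
        # resolved at call time via *args/**kwargs
        if self._final_type() is BaseForecaster:
            return self._fit_as_forecaster(*args, **kwargs)
        return self._fit_as_transformer(*args, **kwargs)

    def _fit_as_forecaster(self, y, X=None, fh=None):
        ...  # forecaster-template boilerplate would run here
        return self

    def _fit_as_transformer(self, X, y=None):
        ...  # transformer-template boilerplate would run here
        return self
```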
## actions:
work out prototypes for linear pipeline (3 scitypes - transformer, forecaster, classifier)
https://github.com/sktime/sktime/issues/4281
https://github.com/sktime/sktime/issues/4282
## next meeting
Mar 17, 12noon UTC