skrub
Last update: 2024-02-27
Expected release cycle: every 3-4 months for the first year
General Objectives
skrub intends to bridge the gap between the database and machine learning worlds by bringing tabular data sources closer to predictive models. skrub embraces the scikit-learn philosophy, with seamless integration with objects respecting the scikit-learn API. skrub provides a high-level API with sensible defaults, sparing users from crafting each individual preprocessing step of a machine learning pipeline.
To achieve this vision, we will work in the following areas:
- Go beyond data located in a single file or a single table; allow preparing tables coming from different data sources.
- Be compatible with a wider range of database backends while keeping a unified API.
- Allow for eager, even partial, execution to easily design and debug data preprocessing.
- Develop preprocessing building blocks widely used in data science to prepare data, exposed through the highest-possible-level API with sensible defaults.
- Tight integration with the predictive modelling offered by scikit-learn
- Enable a more “imperative” feeling, more natural to users of pandas or polars
Roadmap
Short term (next release / 3 months)
- Have a robust implementation of TableVectorizer, the highest-level API transformer (a minimal usage sketch is given after this list)
  - Internally, this requires some refactoring
  - It looks judicious to create many more examples covering the common use cases where this transformer is expected to work
- Improve the available joiners (an aggregate-then-join sketch is given after this list)
  - An AggJoiner across multiple tables
  - Add supervised screening of joins and aggregations, to drop columns or to avoid joins and aggregations when the resulting columns are not linked to the target
- Backend interoperability
  - A first step would be a dispatching mechanism for at least pandas and polars (an illustrative sketch is given after this list)
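As a reference point for the TableVectorizer item above, here is a minimal usage sketch; the toy dataframe, column names, and target values are made up for illustration.

```python
# Minimal sketch of TableVectorizer as the highest-level entry point: it picks
# a sensible encoder per column (numeric, categorical, datetime, text).
# The toy dataframe, column names, and target values are made up.
import pandas as pd
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.pipeline import make_pipeline
from skrub import TableVectorizer

df = pd.DataFrame({
    "employee_position_title": ["Office Aide", "Master Police Officer", "Bus Operator"],
    "date_first_hired": ["09/12/1988", "06/26/2006", "01/13/2014"],
    "year_first_hired": [1988, 2006, 2014],
})
y = [69222.18, 97392.47, 42053.83]

# Standalone: turn the heterogeneous dataframe into a numeric feature matrix.
X = TableVectorizer().fit_transform(df)

# Or as the preprocessing step of a scikit-learn pipeline.
model = make_pipeline(TableVectorizer(), HistGradientBoostingRegressor())
model.fit(df, y)
```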
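The AggJoiner item targets the common aggregate-then-join pattern. The plain-pandas sketch below (with made-up tables and column names) shows the operation such a transformer is meant to encapsulate.

```python
# Plain-pandas sketch of the aggregate-then-join pattern that an AggJoiner-style
# transformer would encapsulate (tables and column names are made up).
import pandas as pd

orders = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2],
    "amount": [10.0, 35.0, 8.0, 12.0, 5.0],
})
customers = pd.DataFrame({"customer_id": [1, 2], "segment": ["pro", "retail"]})

# Aggregate the auxiliary table on the join key...
per_customer = (
    orders.groupby("customer_id")["amount"]
    .agg(["mean", "sum"])
    .add_prefix("amount_")
    .reset_index()
)

# ...then join the aggregates back onto the main table.
enriched = customers.merge(per_customer, on="customer_id", how="left")
print(enriched)
```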
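For the dispatching item, the sketch below illustrates the general idea of routing one operation to backend-specific implementations; it uses functools.singledispatch for brevity and is not a proposal for skrub's actual mechanism.

```python
# Illustrative sketch (not skrub's actual mechanism) of dispatching one
# operation to backend-specific implementations for pandas and polars.
from functools import singledispatch

import pandas as pd
import polars as pl


@singledispatch
def n_unique(column):
    raise TypeError(f"unsupported column type: {type(column)!r}")


@n_unique.register
def _(column: pd.Series) -> int:
    return int(column.nunique())


@n_unique.register
def _(column: pl.Series) -> int:
    return int(column.n_unique())


print(n_unique(pd.Series(["a", "b", "a"])))  # 2
print(n_unique(pl.Series(["a", "b", "a"])))  # 2
```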
Mid term vision (next-next release / 6 months)
- Backend interoperability
  - Make sure that the current preprocessing blocks work properly for the different supported backends
  - Support for ibis
- Handle time series
  - Allow for strategies to perform aggregations when dealing with datetime columns
  - Investigate a transformer allowing sessionization (an illustrative pandas sketch is given after this list)
- Visualisation
  - Integration and development of skrubview
  - In-terminal visualisation for the “hard-core” scientist via rich → probably not a priority at first
- Schema discovery
  - Semantic column-type discovery (datetimes from strings, postal codes, phone numbers, money with currencies, IP addresses for GeoIP coding and spatial joins)
    - Mid term: rule-based (e.g. time); a toy sketch is given after this list
    - ML-based => requires cross-version scikit-learn persistence
      - Required for knowledge bases / entity linking
- Benchmark datasets collection to drive design decisions
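To make the datetime aggregation and sessionization items concrete, here is an illustrative pandas sketch; the event data and the 30-minute gap threshold are arbitrary assumptions.

```python
# Illustrative pandas sketch of datetime aggregation and sessionization
# (event data and the 30-minute gap threshold are made up).
import pandas as pd

events = pd.DataFrame({
    "user": ["a", "a", "a", "b", "b"],
    "timestamp": pd.to_datetime([
        "2024-01-01 09:00", "2024-01-01 09:10", "2024-01-01 11:00",
        "2024-01-01 09:05", "2024-01-01 09:20",
    ]),
    "value": [1.0, 2.0, 3.0, 4.0, 5.0],
}).sort_values(["user", "timestamp"])

# Start a new session when the gap since the previous event exceeds 30 minutes.
gap = events.groupby("user")["timestamp"].diff() > pd.Timedelta(minutes=30)
events["session"] = gap.groupby(events["user"]).cumsum()

# Aggregate per user and session: count, total value, session duration.
sessions = events.groupby(["user", "session"]).agg(
    n_events=("value", "size"),
    total_value=("value", "sum"),
    duration=("timestamp", lambda t: t.max() - t.min()),
)
print(sessions)
```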
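For the rule-based flavour of semantic column-type discovery, here is a toy sketch of what such rules could look like; the regexes and the 90% match threshold are arbitrary, and this is not a proposed skrub API.

```python
# Toy sketch of rule-based semantic column-type detection (regexes and the
# 90% match threshold are arbitrary; this is not a proposed skrub API).
import pandas as pd

RULES = {
    "datetime": r"^\d{4}-\d{2}-\d{2}([ T]\d{2}:\d{2}(:\d{2})?)?$",
    "postal_code": r"^\d{5}(-\d{4})?$",
    "phone_number": r"^\+?[\d\s().-]{7,20}$",
    "ipv4_address": r"^(\d{1,3}\.){3}\d{1,3}$",
}


def guess_semantic_type(column: pd.Series, threshold: float = 0.9) -> str:
    """Return the first rule matching at least `threshold` of non-null values."""
    values = column.dropna().astype(str)
    for name, pattern in RULES.items():
        if len(values) and values.str.match(pattern).mean() >= threshold:
            return name
    return "unknown"


print(guess_semantic_type(pd.Series(["2024-02-27", "2023-12-01 08:30"])))  # datetime
print(guess_semantic_type(pd.Series(["75013", "10115", None])))            # postal_code
```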
Longer term vision (design needs to be challenged)
- Design a new “pipeline” (sc(k?)affolding, valve, flow, recipe) enabling imperative programming when designing ML pipelines, from the data source to the predictive modelling, including evaluation and tuning. It should allow users to hyper-parameter tune schema decisions, but also to sanitize and enforce them (fix schema decisions), which enables more lazy behavior.
  - Imperative design
  - Compatibility with scikit-learn
  - Introduce data source objects
  - On-the-fly sampler for debugging and schema discovery
  - Visual debugging of subsets at each pipeline step via skrubview
  - For feature engineering at predict time
  - Only at train time (hyper-parameter choices, such as target engineering vs feature engineering)
  - Design discussion: https://hackmd.io/mzsv1km3SCCsViU-DGi1dw
- Integration and impact of neural-network-based models within the preprocessing stages
  - What is the place of embeddings and language models (not LLMs)?
  - Can skorch play the role of an interface to use pre-trained embeddings for some encoding steps?
  - Adding deep-learning-based sentence embeddings for "diverse entries" in the table vectorizer, following https://arxiv.org/abs/2312.09634
- Feature screening:
  - Select the k best features for a supervised task
  - How to deal with mixed data types (categorical values, engineered/expanded numerical values, missing values)?
  - How to do it early enough in the feature-engineering pipeline (e.g. before aggregation)?
- Discover fuzzily joinable tables from a lake (swamp) containing a large number of tables
  - Hashing-based statistical estimators
    - Screening for fuzzy joins (probably MinHash); a self-contained sketch is given at the end of this section
    - MinHashEncoder on n-grams from string labels for categorical variables
    - Bloom vectorizers: the scikit-bloom package
    - HyperLogLog heuristic cardinality estimation: probably not in skrub, should be done by polars/duckdb
  - Schema discovery, linking
- Ensemble of tables
  - Ideally, control the resources
  - Long term / dream: versioning might be great (deltalake / iceberg); try to inherit it from the backend (ibis/polars)
  - Long term / dream: user-based access control; try to inherit it from the backend
- Think about which transformers currently in scikit-learn could be transferred to skrub as new citizens
  - Transformers that perform per-column computation
  - Transformers that would benefit from efficient computation in some backends (e.g. laziness)
- Investigate the impact of being laziness-compatible (a polars sketch is given at the end of this section)
  - Is it possible to fit a lazy frame?
  - We can leverage laziness for the transform part
- Backend integration
  - Feature synthesis in databases, building on assembling features
- Investigate how to efficiently handle data structures
  - How to handle heterogeneous chunks of data (rich columns vs. numerical matrices vs. sparse arrays)
  - Sparse feature interactions for manually crafted glassbox models
- Target engineering without data leaks
- (Bi)temporal modeling
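To make the MinHash-based screening idea concrete, here is a small self-contained sketch (independent of skrub's MinHashEncoder): it hashes character 3-grams under several seeds and uses the fraction of matching signature slots to approximate the Jaccard similarity between two strings.

```python
# Self-contained MinHash sketch over character n-grams, useful as a cheap
# screen for fuzzy-joinable string columns (seeds and n-gram size are arbitrary).
import zlib


def ngrams(s, n=3):
    s = f"  {s.lower()}  "
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def minhash_signature(s, n_hashes=32, n=3):
    # One signature slot per seed: keep the minimum hash over all n-grams.
    grams = ngrams(s, n)
    return [
        min(zlib.crc32(f"{seed}:{g}".encode()) for g in grams)
        for seed in range(n_hashes)
    ]


def estimated_jaccard(sig_a, sig_b):
    # The fraction of matching slots approximates the Jaccard similarity
    # of the two n-gram sets.
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


print(estimated_jaccard(
    minhash_signature("Los Angeles"),
    minhash_signature("Los Angeles County"),
))
```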
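On laziness compatibility, the polars sketch below illustrates the fit-eagerly / transform-lazily split discussed above; the LazyStandardizer class is a hypothetical stand-in, not an existing skrub transformer.

```python
# Minimal polars sketch of the fit-eagerly / transform-lazily idea.
# LazyStandardizer is a hypothetical stand-in, not an existing skrub API.
import polars as pl


class LazyStandardizer:
    def fit(self, frame: pl.DataFrame, column: str) -> "LazyStandardizer":
        # Fitting needs concrete statistics, so it works on a materialised frame.
        self.column = column
        self.mean_ = frame[column].mean()
        self.std_ = frame[column].std()
        return self

    def transform(self, lazy_frame: pl.LazyFrame) -> pl.LazyFrame:
        # Transform only extends the query plan; nothing is computed yet.
        return lazy_frame.with_columns(
            ((pl.col(self.column) - self.mean_) / self.std_)
            .alias(f"{self.column}_scaled")
        )


df = pl.DataFrame({"amount": [10.0, 20.0, 30.0]})
scaler = LazyStandardizer().fit(df, "amount")

# The lazy plan is only executed when .collect() is called.
result = scaler.transform(df.lazy()).collect()
print(result)
```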