skrub
Last update: 2024-02-27
Expected release cycle: every 3-4 months for the first year
General Objectives
skrub intends to bridge the gap between the database and machine learning worlds by bringing tabular data sources closer to predictive models. skrub embraces the scikit-learn philosophy, with seamless integration with objects respecting the scikit-learn API. skrub provides a high-level API with sensible defaults, sparing users from crafting each individual preprocessing step of a machine learning pipeline.
To achieve this vision, we will work in the following areas:
- Go beyond data located in a single file or a single table; allow preparing tables coming from different data sources.
- Be compatible with a wider range of database backends while keeping a unified API.
- Allow for eager, even partial, execution to easily design and debug data preprocessing.
- Develop preprocessing building blocks widely used in data science to prepare data, exposed through the highest-possible-level API with sensible defaults.
- Tight integration with the predictive modelling offered by scikit-learn
- Enable a more “imperative” feeling, more natural to users of pandas or polars
Roadmap
Short term (next release / 3 months)
- Have a robust implementation of TableVectorizer, the highest-level API transformer (a minimal usage sketch is given after this list)
  - Internally, this requires some refactoring
  - It looks judicious to create many more examples covering the common use cases where this transformer is expected to work
- Improve the available joiners (an aggregate-then-join sketch is given after this list)
  - An AggJoiner across multiple tables
  - Add supervised screening of joins and aggregations, to drop columns or to avoid joins and aggregations when the resulting columns are not linked to the target
- Backend interoperability
  - A first step would be a dispatching mechanism for at least pandas and polars (an illustrative sketch is given after this list)
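As a reference point for the TableVectorizer item above, here is a minimal usage sketch; the toy dataframe, column names, and target values are made up for illustration.

```python
# Minimal sketch of TableVectorizer as the highest-level entry point: it picks
# a sensible encoder per column (numeric, categorical, datetime, text).
# The toy dataframe, column names, and target values are made up.
import pandas as pd
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.pipeline import make_pipeline
from skrub import TableVectorizer

df = pd.DataFrame({
    "employee_position_title": ["Office Aide", "Master Police Officer", "Bus Operator"],
    "date_first_hired": ["09/12/1988", "06/26/2006", "01/13/2014"],
    "year_first_hired": [1988, 2006, 2014],
})
y = [69222.18, 97392.47, 42053.83]

# Standalone: turn the heterogeneous dataframe into a numeric feature matrix.
X = TableVectorizer().fit_transform(df)

# Or as the preprocessing step of a scikit-learn pipeline.
model = make_pipeline(TableVectorizer(), HistGradientBoostingRegressor())
model.fit(df, y)
```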
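The AggJoiner item targets the common aggregate-then-join pattern. The plain-pandas sketch below (with made-up tables and column names) shows the operation such a transformer is meant to encapsulate.

```python
# Plain-pandas sketch of the aggregate-then-join pattern that an AggJoiner-style
# transformer would encapsulate (tables and column names are made up).
import pandas as pd

orders = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2],
    "amount": [10.0, 35.0, 8.0, 12.0, 5.0],
})
customers = pd.DataFrame({"customer_id": [1, 2], "segment": ["pro", "retail"]})

# Aggregate the auxiliary table on the join key...
per_customer = (
    orders.groupby("customer_id")["amount"]
    .agg(["mean", "sum"])
    .add_prefix("amount_")
    .reset_index()
)

# ...then join the aggregates back onto the main table.
enriched = customers.merge(per_customer, on="customer_id", how="left")
print(enriched)
```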
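For the dispatching item, the sketch below illustrates the general idea of routing one operation to backend-specific implementations; it uses functools.singledispatch for brevity and is not a proposal for skrub's actual mechanism.

```python
# Illustrative sketch (not skrub's actual mechanism) of dispatching one
# operation to backend-specific implementations for pandas and polars.
from functools import singledispatch

import pandas as pd
import polars as pl


@singledispatch
def n_unique(column):
    raise TypeError(f"unsupported column type: {type(column)!r}")


@n_unique.register
def _(column: pd.Series) -> int:
    return int(column.nunique())


@n_unique.register
def _(column: pl.Series) -> int:
    return int(column.n_unique())


print(n_unique(pd.Series(["a", "b", "a"])))  # 2
print(n_unique(pl.Series(["a", "b", "a"])))  # 2
```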
Mid term vision (next-next release / 6 months)
- Backend interoperability
  - Make sure that the current preprocessing blocks work properly for the different supported backends
  - Support for ibis
- Handle time series
  - Allow for strategies to perform aggregations when dealing with datetime columns
  - Investigate a transformer allowing sessionization (an illustrative pandas sketch is given after this list)
- Visualisation
  - Integration and development of skrubview
  - In-terminal visualisation for the “hard-core” scientist via rich → probably not a priority at first
- Schema discovery
  - Semantic column-type discovery (datetimes from strings, postal codes, phone numbers, money with currencies, IP addresses for GeoIP coding and spatial joins)
    - Mid term: rule-based (e.g. time); a toy sketch is given after this list
    - ML-based => requires cross-version scikit-learn persistence
      - Required for knowledge bases / entity linking
- Benchmark datasets collection to drive design decisions
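To make the datetime aggregation and sessionization items concrete, here is an illustrative pandas sketch; the event data and the 30-minute gap threshold are arbitrary assumptions.

```python
# Illustrative pandas sketch of datetime aggregation and sessionization
# (event data and the 30-minute gap threshold are made up).
import pandas as pd

events = pd.DataFrame({
    "user": ["a", "a", "a", "b", "b"],
    "timestamp": pd.to_datetime([
        "2024-01-01 09:00", "2024-01-01 09:10", "2024-01-01 11:00",
        "2024-01-01 09:05", "2024-01-01 09:20",
    ]),
    "value": [1.0, 2.0, 3.0, 4.0, 5.0],
}).sort_values(["user", "timestamp"])

# Start a new session when the gap since the previous event exceeds 30 minutes.
gap = events.groupby("user")["timestamp"].diff() > pd.Timedelta(minutes=30)
events["session"] = gap.groupby(events["user"]).cumsum()

# Aggregate per user and session: count, total value, session duration.
sessions = events.groupby(["user", "session"]).agg(
    n_events=("value", "size"),
    total_value=("value", "sum"),
    duration=("timestamp", lambda t: t.max() - t.min()),
)
print(sessions)
```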
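For the rule-based flavour of semantic column-type discovery, here is a toy sketch of what such rules could look like; the regexes and the 90% match threshold are arbitrary, and this is not a proposed skrub API.

```python
# Toy sketch of rule-based semantic column-type detection (regexes and the
# 90% match threshold are arbitrary; this is not a proposed skrub API).
import pandas as pd

RULES = {
    "datetime": r"^\d{4}-\d{2}-\d{2}([ T]\d{2}:\d{2}(:\d{2})?)?$",
    "postal_code": r"^\d{5}(-\d{4})?$",
    "phone_number": r"^\+?[\d\s().-]{7,20}$",
    "ipv4_address": r"^(\d{1,3}\.){3}\d{1,3}$",
}


def guess_semantic_type(column: pd.Series, threshold: float = 0.9) -> str:
    """Return the first rule matching at least `threshold` of non-null values."""
    values = column.dropna().astype(str)
    for name, pattern in RULES.items():
        if len(values) and values.str.match(pattern).mean() >= threshold:
            return name
    return "unknown"


print(guess_semantic_type(pd.Series(["2024-02-27", "2023-12-01 08:30"])))  # datetime
print(guess_semantic_type(pd.Series(["75013", "10115", None])))            # postal_code
```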
Longer term vision (design needs to be challenged)
- Design a new “pipeline” (sc(k?)affolding, valve, flow, recipe) enabling imperative programming when designing ML pipelines, from the data source to the predictive modelling, including evaluation and tuning. It should allow users to hyper-parameter tune schema decisions, but also to sanitize and enforce them (fix schema decisions), which enables more lazy behavior.
  - Imperative design
  - Compatibility with scikit-learn
  - Introduce data source objects
  - On-the-fly sampler for debugging and schema discovery
  - Visual debugging of subsets at each pipeline step via skrubview
  - For feature engineering at predict time
  - Only at train time (hyper-parameter choices, such as target engineering vs feature engineering)
  - Design discussion: https://hackmd.io/mzsv1km3SCCsViU-DGi1dw
- Integration and impact of neural-network-based models within the preprocessing stages
  - What is the place of embeddings and language models (not LLMs)?
  - Can skorch play the role of an interface to use pre-trained embeddings for some encoding steps?
  - Adding deep-learning-based sentence embeddings for "diverse entries" in the table vectorizer, following https://arxiv.org/abs/2312.09634
- Feature screening:
  - Select the k best features for a supervised task
  - How to deal with mixed data types (categorical values, engineered/expanded numerical values, missing values)?
  - How to do it early enough in the feature-engineering pipeline (e.g. before aggregation)?
- Discover fuzzily joinable tables from a lake (swamp) containing a large number of tables
  - Hashing-based statistical estimators
    - Screening for fuzzy joins (probably MinHash); a self-contained sketch is given at the end of this section
    - MinHashEncoder on n-grams from string labels for categorical variables
    - Bloom vectorizers: the scikit-bloom package
    - HyperLogLog heuristic cardinality estimation: probably not in skrub, should be done by polars/duckdb
  - Schema discovery, linking
- Ensemble of tables
  - Ideally, control the resources
  - Long term / dream: versioning might be great (deltalake / iceberg); try to inherit it from the backend (ibis/polars)
  - Long term / dream: user-based access control; try to inherit it from the backend
- Think about which transformers currently in scikit-learn could be transferred to skrub as new citizens
  - Transformers that perform per-column computation
  - Transformers that would benefit from efficient computation in some backends (e.g. laziness)
- Investigate the impact of being laziness-compatible (a polars sketch is given at the end of this section)
  - Is it possible to fit a lazy frame?
  - We can leverage laziness for the transform part
- Backend integration
  - Feature synthesis in databases, building on assembling features
- Investigate how to efficiently handle data structures
  - How to handle heterogeneous chunks of data (rich columns vs. numerical matrices vs. sparse arrays)
  - Sparse feature interactions for manually crafted glassbox models
- Target engineering without data leaks
- (Bi)temporal modeling
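To make the MinHash-based screening idea concrete, here is a small self-contained sketch (independent of skrub's MinHashEncoder): it hashes character 3-grams under several seeds and uses the fraction of matching signature slots to approximate the Jaccard similarity between two strings.

```python
# Self-contained MinHash sketch over character n-grams, useful as a cheap
# screen for fuzzy-joinable string columns (seeds and n-gram size are arbitrary).
import zlib


def ngrams(s, n=3):
    s = f"  {s.lower()}  "
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def minhash_signature(s, n_hashes=32, n=3):
    # One signature slot per seed: keep the minimum hash over all n-grams.
    grams = ngrams(s, n)
    return [
        min(zlib.crc32(f"{seed}:{g}".encode()) for g in grams)
        for seed in range(n_hashes)
    ]


def estimated_jaccard(sig_a, sig_b):
    # The fraction of matching slots approximates the Jaccard similarity
    # of the two n-gram sets.
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


print(estimated_jaccard(
    minhash_signature("Los Angeles"),
    minhash_signature("Los Angeles County"),
))
```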
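On laziness compatibility, the polars sketch below illustrates the fit-eagerly / transform-lazily split discussed above; the LazyStandardizer class is a hypothetical stand-in, not an existing skrub transformer.

```python
# Minimal polars sketch of the fit-eagerly / transform-lazily idea.
# LazyStandardizer is a hypothetical stand-in, not an existing skrub API.
import polars as pl


class LazyStandardizer:
    def fit(self, frame: pl.DataFrame, column: str) -> "LazyStandardizer":
        # Fitting needs concrete statistics, so it works on a materialised frame.
        self.column = column
        self.mean_ = frame[column].mean()
        self.std_ = frame[column].std()
        return self

    def transform(self, lazy_frame: pl.LazyFrame) -> pl.LazyFrame:
        # Transform only extends the query plan; nothing is computed yet.
        return lazy_frame.with_columns(
            ((pl.col(self.column) - self.mean_) / self.std_)
            .alias(f"{self.column}_scaled")
        )


df = pl.DataFrame({"amount": [10.0, 20.0, 30.0]})
scaler = LazyStandardizer().fit(df, "amount")

# The lazy plan is only executed when .collect() is called.
result = scaler.transform(df.lazy()).collect()
print(result)
```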