# A high-level interface for building a dataframe transformation pipeline
See also:
The goal is to have a more high-level and interactive interface for building a scikit-learn pipeline. In particular, it should offer interactive previews of the transformations, easy hyperparameter tuning, and simple ways to plug in estimators such as `Ridge` or `HistGradientBoostingRegressor`.

The prototype used for the examples here (which will be updated as we make decisions) is in this branch.
Here is some toy data:
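For instance, something like the following; only column "C", which holds dates stored as strings, matters for the examples below, and the other columns (including the target `y`) are illustrative assumptions:

```python
import pandas as pd

# Toy data: column "C" holds dates stored as strings; the other columns
# and the regression target "y" are illustrative assumptions.
df = pd.DataFrame(
    {
        "A": [1.5, 2.0, 3.5, 4.0],
        "B": ["one", "two", "three", "four"],
        "C": ["2020-01-02", "2021-04-03", "2022-12-05", "2023-05-10"],
    }
)
y = pd.Series([1.0, 2.0, 3.0, 4.0], name="y")
```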
## Applying some transformations
The pipeline is instantiated with a dataset so we can get the previews.
Note: don't pay attention to the skrub imports for now; anything we decide to put in the public API will be importable directly from `skrub`.

Roughly 2 APIs for adding steps are being considered; other suggestions welcome.
ATM it seems Option 1 is the main candidate, and Option 2 may or may not be added as a more "advanced" interface in the future.
## Option 1: with `Pipe.use()`
In this option, the pipeline has a `use` method (it could also be named `apply`, for example, if that's not too confusing with pandas' `apply`). We pass it the transformer to use, and optional configuration, such as the columns on which to use it or a name for the step, as kwargs. By default the preview is a random sample; we can also see the first few rows:
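A sketch of how this could look; the import paths and the preview methods are assumptions based on the prototype:

```python
from skrub import Pipe, ToDatetime  # provisional import paths

# The Pipe is created with the data so every step can be previewed.
pipe = Pipe(df)
pipe = pipe.use(ToDatetime(), cols="C", name="to_datetime")

pipe.sample()   # preview the result on a random sample of rows
pipe.head()     # assumed method for previewing the first few rows instead
```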
Notes:

- The `name` parameter sets the step name in the scikit-learn pipeline. It could be something more explicit like `step_name`.
- Implicitly selecting columns to which a transformer applies: a transformer can reject columns to which it doesn't apply, in which case they are passed through. For example, whether a string column contains dates can only be discovered by trying to parse them. If we didn't know in advance which columns contain dates, instead of `pipe.use(ToDatetime(), cols="C")` we could have written `pipe.use(ToDatetime())` and the result would be the same. We can have a "strict" mode (which can be the default) where that would result in an error and we would be forced to specify `pipe.use(ToDatetime(), cols="C")`. See [#877](https://github.com/skrub-data/skrub/pull/877) for more discussion.

We can then extract a scikit-learn `Pipeline` that we can cross-validate, etc.
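For example (the accessor name here is an assumption):

```python
# Assumed accessor name; returns a regular scikit-learn Pipeline.
pipeline = pipe.get_pipeline()
pipeline.fit_transform(df)

# With a final predictor step added, it can be cross-validated as usual:
# from sklearn.model_selection import cross_validate
# cross_validate(pipeline, df, y)
```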
This is a regular scikit-learn `Pipeline`, with `fit` and `transform` or `predict` methods. We can also see a more human-readable summary of the steps.

If the transformation fails, we see at which step it failed and the input data for the failing step: `.sample()` doesn't catch the exception, so it can be inspected.

We can also ask to see only the part of the output that was created by the last step:
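For example (the parameter name is a placeholder; see the TODO below):

```python
# Show only the columns produced by the last step; placeholder parameter name.
pipe.sample(last_step_only=True)
```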
TODO: give that parameter a better name, or add other methods instead, e.g. `.sample_last_step()`.
## Option 2: with `Selector.to_datetime()`, `Selector.use()`
We can also have the `use` method directly on the selectors, as well as methods for commonly used estimators. This last point avoids having to import estimators and provides tab-completion on their names in interactive shells. Another difference is that configuration like the step name is added with an additional method call rather than with additional kwargs.
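A sketch, with every name provisional:

```python
from skrub import selectors as s
from sklearn.ensemble import HistGradientBoostingRegressor

pipe = Pipe(df).chain(
    s.cols("C").to_datetime().name("to_datetime"),  # step name via a method call
    s.all().use(HistGradientBoostingRegressor()),
)
```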
(Instead of `chain` it could be `apply`, `use`, `transform`, `with_steps`, …)
We can also pass an estimator directly (in which case the column selection is `s.all()`), and the selectors have a `.use()` method for using estimators that haven't been registered as methods. So this is equivalent to the above:
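(Again a sketch with provisional names:)

```python
pipe = Pipe(df).chain(
    s.cols("C").use(ToDatetime()).name("to_datetime"),
    HistGradientBoostingRegressor(),  # bare estimator: column selection is s.all()
)
```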
## Discarded options
### Option 3: with `Pipe.cols().to_datetime()`, `Pipe.cols().use()`
The third option adds a method `.cols` (or maybe `.on_cols`) to the pipeline, to which we pass the selector. That returns an object that is used to configure the next step, as in the sketch below.
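A sketch of this discarded style; every name here is provisional:

```python
# .cols() returns a helper object that configures the next step.
pipe = (
    Pipe(df)
    .cols(s.cols("C")).to_datetime()
    .cols(s.all()).use(HistGradientBoostingRegressor())
)
```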
Notes:

- Methods that add an estimator (e.g. `encode_datetime()`) have to return the `Pipe` object itself, so it's not clear where we should provide configuration such as the step name. That may not be very important: ATM I don't see anything to configure other than the step name (there could be a `param_grid`, but the other way of specifying it described later seems better), and the step name may not be that important.
- We could also say that there is a `.name()` method on the `Pipe` itself that implicitly applies to the last step.
- We cannot pass an estimator directly, but the result of `cols` has a `use()` (or `apply()`, or …) method.
- As `.cols()` looks like we are indexing the data, it may be a bit surprising if someone expects the result of the transformation on just those columns to be returned: a user could be surprised to see "C", "D", "E", and "F" in the output.
### Option 4
Having the estimator methods directly on the `Pipe` rather than on `pipe.cols`.
Note that this one would require having methods on `Pipe`, such as `on_cols`, that implicitly apply to the last step.

## Choosing hyperparameters
It is important to be able to tune hyperparameters, and thus to provide a parameter grid to scikit-learn's `GridSearchCV`, `RandomizedSearchCV`, or successive halving.

Manually specifying a large list of dicts all at once is not very easy.
Instead, we can have a `choose()` function that wraps the hyperparameter values, so the choice can be passed directly to the estimator. The `Choice` object returned by `choose` has a `.name()` method, which we can use to give a more human-friendly name to that hyperparameter choice. That could be used when displaying cross-validation results; otherwise we always have the usual `step_name__param_name` grid-search name.
### Example with `Pipe.use`
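A sketch; the import path and the exact `choose` signature are assumptions:

```python
from skrub import choose  # provisional import path
from sklearn.linear_model import Ridge

# Wrap the candidate values in choose() and pass the Choice directly
# as an estimator parameter.
pipe = pipe.use(
    Ridge(alpha=choose(0.01, 0.1, 1.0).name("regularization")),
    name="ridge",
)
```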
We can see a summary of the hyperparameter grid (and of the steps).
And we can obtain a scikit-learn `GridSearchCV` or `RandomizedSearchCV` that we can use to tune hyperparameters. This is not yet in the prototype, but we should also have methods (or a parameter) to get a successive-halving object as well.
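For example (the accessor names are assumptions):

```python
search = pipe.get_grid_search()  # or pipe.get_randomized_search(); assumed names
search.fit(df, y)                # behaves like any scikit-learn search estimator
```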
### Hyperparameter choice with the alternative APIs
#### `Selector.use` (option 2)

### Choices in nested estimators
Using `choose` for sub-estimators or their hyperparameters works as expected.
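For example, a sketch in which a choice over sub-estimators contains a nested hyperparameter choice:

```python
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

# choose() wraps whole sub-estimators, and can be nested inside them.
pipe = pipe.use(
    BaggingRegressor(
        estimator=choose(
            Ridge(alpha=choose(0.1, 1.0)),
            DecisionTreeRegressor(),
        )
    )
)
```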
### Naming options
If we want to give a name to individual choices, we can pass keyword arguments to `choose`. This can be useful to get more human-readable descriptions of pipelines and parameters. The example above can be adapted:
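(A sketch; naming options via keyword arguments is the behavior described above:)

```python
# Keyword arguments name the individual options.
pipe = pipe.use(
    Ridge(alpha=choose(strong=0.1, weak=10.0).name("regularization")),
    name="ridge",
)
```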
(Note: if we want to use names that are not valid Python identifiers, we can always use the dict unpacking syntax `choose(**{'my name': 10})`.)

### Choosing among several estimators
We may also want to choose among several estimators, i.e. have a choice for the whole step. We can pass a `Choice` to `use`: `pipe.use(choose(RidgeClassifier(), LogisticRegression()))`.

`optional` is a shorthand for choosing between a step and passthrough. We also have `choose_int` and `choose_float` to get ints or floats within a range, on a linear or log scale, possibly discretized.
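A sketch of these helpers (the signatures are assumptions):

```python
from skrub import choose_int, choose_float  # provisional import paths

n_bins = choose_int(2, 20).name("n_bins")                 # ints, linear scale
alpha = choose_float(1e-4, 10.0, log=True).name("alpha")  # floats, log scale
```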
### With the alternative APIs

#### `Selector.use` (option 2)

## Keeping the original columns and renaming output columns
Sometimes we want to transform a column but still keep the original one in the output, maybe to transform it in a different way. We can do it with `keep_original`:
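For example (the parameter placement is an assumption):

```python
# Keep the raw "C" column in the output alongside the parsed datetime version.
pipe = pipe.use(ToDatetime(), cols="C", keep_original=True)
```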
We can also rename the output columns. For example, this can be a way to insert a tag by which we can select them later:
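For example (the parameter name and "{}" template syntax are assumptions):

```python
# Tag the new columns so they can be selected later by their suffix.
pipe = pipe.use(ToDatetime(), cols="C", rename_columns="{}__dt")
```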
## Hyperparam tuning example