A proposal for improvements to Tree submodule
=============================================
:Author: Adam Li (adam2392@gmail.com)
:Status: Draft
:Type: Process
:Created: 2022-10-06
Abstract
--------
The goal of this proposal is to suggest improvements to the `tree` module that will i) simplify the tree code and make it more maintainable and ii) enable 3rd party packages to more easily leverage the optimized tree Python/Cython codebase.
Detailed description
--------------------
### Motivation
Currently, the tree submodule is not only difficult to maintain (according to the dev team), but also very challenging to improve and extend. This hinders progress not only for tree development within scikit-learn, but also for 3rd party packages that seek to leverage the API of scikit-learn and the highly optimized performance of the underlying trees.
To improve the trees, we must start with a bottom-up approach: analyzing, improving and simplifying the underlying Cython code. In addition, we will most likely need a top-down approach that simplifies and modularizes the Python tree API to fit the semantics of any possible decision tree. That is, the proposed API abstractions should make as few assumptions as possible, while not sacrificing the speed and optimizations that the tree code has achieved to date.
Moreover, these improvements would enable scikit-learn to be able to support highly cited tree methods, such as quantile trees, survival trees, honest trees and oblique trees, all of which have been cited over 300 times and are highly used in other fields of research and industry.
Some explicit examples that motivate the proposed changes are the enablement of:
- honest trees (Athey 2016; 1000+ citations across multiple papers)
- quantile trees (Meinshausen, 2006; 1465 citations)
- oblique trees (Breiman 2001; 97697 citations)
- unsupervised trees
- canonical correlation trees
- [name=jjerphan] Some scikit-survival models in [`sksurv.ensemble`](https://scikit-survival.readthedocs.io/en/stable/api/ensemble.html) relying on a [`LogrankCriterion`](https://github.com/sebp/scikit-survival/blob/488bf1a03524beb5728b6fb62373be23e350b396/sksurv/tree/_criterion.pyx#L92)
In general, each example tree here differs from the original decision tree in how it i) uses the leaf nodes, ii) computes the split feature value, and iii) defines the split criterion. Of course, not all of these trees necessarily need to go into scikit-learn itself.
Notes
- show Thomas's PR fits nicely with oblique trees PR [x]
- demonstrate that quantile trees (or scikit-survival) can subclass the new API
- Vincent Maladière to help regarding the above
Implementation
--------------
#### Python API
Currently, the Python API for `BaseDecisionTree` has a few issues: it makes extra assumptions that break the sklearn API contract or make extensions difficult.
1. `BaseDecisionTree` should not assume supervised learning:
BaseDecisionTree is simply a BaseEstimator, yet its API assumes it is a supervised tree. This is unnecessary, and the code that interfaces with the y-labels can be extracted to the downstream classifier and regressor trees, where it belongs. Other classes that inherit from BaseEstimator do not make this assumption (e.g. `NeighborsBase`, which underpins both the supervised KNN estimators and the unsupervised `NearestNeighbors`).
This would most likely introduce a backwards-incompatible change, since the inner API of `BaseDecisionTree` would assume `y=None` and the error-checking code relevant to y-labels would migrate downstream. However, it is a sensible change that makes `BaseDecisionTree` more semantically coherent.
This would enable unsupervised decision trees in 3rd party extensions, and would allow scikit-learn to implement them itself if an algorithm is deemed worthy of inclusion [1, 2, 3].
[1] https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.128.8686&rep=rep1&type=pdf
[2] https://www.cs.jhu.edu/~jason/papers/karakos+khudanpur+eisner+priebe.icassp05.pdf
[3] https://arxiv.org/pdf/1907.02844.pdf
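To make the proposed separation concrete, here is a minimal, hypothetical Python sketch — the names `BaseTree`, `_validate_targets` and `_build` are illustrative, not the actual scikit-learn API — of a base tree that defaults to `y=None` and pushes label validation into its supervised subclasses:

```python
from abc import ABC, abstractmethod

import numpy as np


class BaseTree(ABC):
    """Hypothetical base class that makes no supervision assumption."""

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=np.float64)
        y = self._validate_targets(X, y)  # no-op for unsupervised trees
        self._build(X, y)
        return self

    def _validate_targets(self, X, y):
        # Unsupervised default: y is ignored entirely.
        return None

    @abstractmethod
    def _build(self, X, y):
        ...


class SupervisedTree(BaseTree):
    # Only supervised subclasses know about y-labels.
    def _validate_targets(self, X, y):
        if y is None:
            raise ValueError("y is required for supervised trees")
        y = np.asarray(y)
        if y.shape[0] != X.shape[0]:
            raise ValueError("X and y have inconsistent numbers of samples")
        return y

    def _build(self, X, y):
        self.n_features_in_ = X.shape[1]


class UnsupervisedTree(BaseTree):
    def _build(self, X, y):
        self.n_features_in_ = X.shape[1]
```

With this layout, `BaseTree` never touches y, and an unsupervised tree needs no workaround such as passing a dummy target.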
2. `BaseDecisionTree` should modularize the setting of Cython `Tree`, Cython `Splitter` and Cython `Criterion`:
This would enable 3rd party extensions to easily leverage BaseDecisionTree as their platform for any kind of decision tree, and implement their own Cython API. See [Oblique Trees PR](https://github.com/scikit-learn/scikit-learn/pull/22754).
#### Cython API
1. Criterion should not assume supervised learning: I have outlined a proposal for this in #[24577](https://github.com/scikit-learn/scikit-learn/issues/24577). Please see there for more details.
2. Criterion init should handle data and meta-parameter initializations separately:
Right now, `criterion.init()` and `splitter.init()` both initialize X and y assuming supervised learning. Moreover, they pass in the view to the X and y arrays while passing in other metadata, such as sample weight, and index pointers.
Firstly y is not changing, but criterion.init(y) is being called repeatedly, whereas that function is meant to be called to reset mainly the start and end index pointers.
3. Splitter should not assume `y` variable:
Splitter should only require X array. y is simply passed in so that way it can be passed to Criterion. That is, it is only used in literally one place.
Instead, Splitter should follow the semantics of a transformer. In essence, it is fed a 2D array of (n_samples, n_features) and it returns another 2D array of (n_samples, m_try_features). Y labels are not involved.
Note this change will also require changing the calls via Tree/TreeBuilder.
4. Tree and TreeBuilder should separate adding split nodes vs leaf nodes:
Split nodes inherently are different from leaf nodes. Although in-code they are mixed up and it works, semantically they should be separate. This would enable honest trees and quantile trees out of the box because the main difference between those trees and the current Tree code is that the leaves are set using other information.
- in honest trees, the leaves are set using a out-of-bag dataset
- in quantile trees, the leaves need to store all y-labels
5. Splitter should separate computing the candidate splitting feature value: I.e. see [oblique trees PR](https://github.com/scikit-learn/scikit-learn/pull/22754) and item 1 in this #[24000](https://github.com/scikit-learn/scikit-learn/issues/24000)
6. Tree and TreeBuilder should separate how it sets node values and how it computes feature values:
See [oblique trees pr](https://github.com/scikit-learn/scikit-learn/pull/22754). This would also enable generalizations for how one stores node values.
7. Enable Tree/TreeBuilder to work with different `Splitter` and `SplitRecord` and `Nodes`
This is hard, but can be done with pointer casting.
### Proposed implementation sequence
1. Modularize `Criterion` without breaking backward compatibility.
2. ** Simplify `Criterion` and `Splitter` inits and working with X/y data.
3. ** Modularize `Splitter` and `Tree` and `TreeBuilder` together to allow arbitrary split values, which introduces incompatible changes in the Cython code
4. Python API for `BaseDecisionTree` without breaking backward compatibility.
5. ** Introduce separation of "leaf" and "split" nodes.
Backward compatibility
----------------------
In the previous section, the implementation sequence marked with "(**)" introduce backwards-incompatible changes.
Alternatives
------------
Although scikit-learn does not "support" a public Cython API, many packages either i) leverage the API directly by `cimport` from their `.pxd` files included in distro, or ii) copy/paste and modify. The copy/paste and modify approach is really just a short-term fix that eventually makes a downstream package unmaintainable and also diverge from the API of scikit-learn.
By modularizing code in Cython and introducing more simple semantics at the `Criterion`, `Splitter`, `Tree` and `TreeBuilder`, scikit-learn can easily introduce more functionality through Python API entrypoints, while 3rd party packages would benefit from the highly optimized Cython codebase without creating a fork.
Discussion
----------
- [Oblique tree PR](https://github.com/scikit-learn/scikit-learn/pull/22754)
- [Criterion issue](https://github.com/scikit-learn/scikit-learn/issues/24577)
- [Modularization of Tree/TreeBuilder/Splitter issue](https://github.com/scikit-learn/scikit-learn/issues/24000)
Design as of December 9th 2022
------------------------------
>[name=Julien Jerphanion] Still WIP
```plantuml
package sklearn.base {
abstract class BaseEstimator
interface MultiOutputMixin
interface ClassifierMixin
interface RegressorMixin
}
package sklearn.tree {
abstract class BaseDecisionTree {
criterion: None
splitter: None
max_depth: None
min_samples_split: None
min_samples_leaf: None
min_weight_fraction_leaf: None
max_features: None
max_leaf_nodes: None
random_state: None
min_impurity_decrease: None
class_weight: None
ccp_alpha: None
}
BaseDecisionTree <|-left- DecisionTreeClassifier
BaseDecisionTree <|-right- DecisionTreeRegressor
package cython {
abstract class Splitter {}
class Tree {}
class TreeBuilder {}
class Criterion {}
Splitter <|-- BestFirstDenseSplitter
Splitter <|-- DepthFirstDenseSplitter
Splitter *-- Criterion
}
BaseDecisionTree *-- Tree
}
note top of sklearn.tree: ExtraTrees not represented\nas their logic is identical.
ClassifierMixin -[hidden]left- BaseEstimator
BaseEstimator <|-- BaseDecisionTree
MultiOutputMixin <|-- BaseDecisionTree
ClassifierMixin <|-- DecisionTreeClassifier
RegressorMixin <|-- DecisionTreeRegressor
```
Copyright
---------
This document has been placed in the public domain [1]_.
References and Footnotes
------------------------
.. [1] Open Publication License: https://www.opencontent.org/openpub/