# Dataframe API standard design - why/how did we take a wrong turn with lazy/eager and column ownership?
_This summary is Ralf's perspective on what happened and how we ended up with a fairly problematic API design_
tl;dr we started with something that looked a lot like a clean subset of Pandas/cuDF/Modin (all eager) with no row index. We then tried extending it to lazy libraries, ran into a host of problems, and took a few wrong turns trying to shoehorn in support for lazy execution.
## What happened with API evolution over the past ~6 months?
0. We tried to incorporate support for dataframe libraries with lazy execution - Polars' LazyFrame most prominently, though the same issues would in some cases also affect Ibis (in particular when used with an SQL backend), Dask, and other libraries.
1. We dropped support for element-wise operations (in [gh-242](https://github.com/data-apis/dataframe-api/pull/242))
2. We attempted to add support for expressions - this was rejected.
3. We added a `.persist()` method as an execution hint for lazy libraries (see [gh-307](https://github.com/data-apis/dataframe-api/pull/307)), to help bridge some of the issues with lazy execution and avoid calculations that might otherwise be repeated.
4. Instead, we then added a "column ownership" concept, where column instances are owned by the dataframe they came from. Column-column and column-dataframe operations are only allowed if there is no more than one parent dataframe (see [gh-310](https://github.com/data-apis/dataframe-api/pull/310)).
5. We then added several rules around column ownership, and discovered that the end result was quite cumbersome to use.
One conclusion from the above was that issues with combining columns/dataframes that don't yet have a relationship between them - which is likely problematic for libraries that support lazy execution - can be resolved by doing an explicit `join` first.
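As a rough illustration, here is a hedged sketch of that resolution (method names like `col` and the `join` signature are assumptions loosely based on the draft standard, not authoritative):

```python
def add_columns_from_two_sources(df1, df2):
    # Disallowed under the column-ownership rules: the two columns would
    # have different parent dataframes, so a lazy engine has no way to
    # relate their rows.
    #     result = df1.col("a") + df2.col("b")   # may raise

    # Establish the row relationship explicitly first, then combine.
    joined = df1.join(df2, how="inner", left_on="id", right_on="id")
    return joined.col("a") + joined.col("b")
```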
## Context: kinds of dataframe API usage
There are two quite different kinds of dataframe library usage:
1. The main one: SQL-like, with lots of high-level operations like merge/join/select/pivot/apply;
2. A historically common (but waning in popularity) one: "2-D arrays with axis labels and masks".
Element-wise operations like `==` and `+` are for (2) rather than (1). This is the kind of usage scikit-learn has - essentially it treats dataframes like arrays with labels, and internally it still uses arrays and then re-attaches labels before returning a dataframe.
That use case doesn't seem to be core to today's dataframe libraries, and scikit-learn & co don't actually need it, since it's still the norm (and will probably always be more efficient) to unpack the data and then use an array library for any heavy lifting.
Therefore we concluded that it would be okay to drop (2) as a mode of operation - and hence to drop `==` and other element-wise operations.
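For illustration, the "unpack, compute, re-attach labels" pattern looks roughly like this (a minimal sketch with pandas and NumPy; not scikit-learn's actual code):

```python
import numpy as np
import pandas as pd

def standardize(df: pd.DataFrame) -> pd.DataFrame:
    # Unpack the data and do the heavy lifting with an array library ...
    X = df.to_numpy()
    X = (X - X.mean(axis=0)) / X.std(axis=0)
    # ... then re-attach the labels before returning a dataframe.
    return pd.DataFrame(X, columns=df.columns, index=df.index)
```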
## Expectations from the API standard
Expectations for the standard are misaligned between two positions:
1. There is value in a uniform API, even when implementing that API in a given library would not yield optimal performance.
2. The API should only be a zero-cost abstraction over the APIs of the various libraries.
cuDF has done (1) for a long time, being constrained by its goal of matching the Pandas API. Polars only wants to consider (2).
## Considering separate APIs for eager vs. lazy implementations?
This was discussed several times. Two problems are:
1. This breaks one of the key goals: an API design that is independent of implementation choices and mode of execution.
2. Lazy libraries all differ in the optimizations they implement, and explicit triggering of computation puts the burden on the user to figure out where to do so - what's optimal for one library may be very sub-optimal for another.
Related: implicit triggering of computation is currently explicitly forbidden by several libraries as a design principle. But to keep an API execution-independent it should be considered, in particular for methods which can _in principle_ be implemented in a lazy fashion just fine but _currently_ are not available lazily. An example is `.shape`, which could return a lazy object that duck-types as a `tuple`. This contrasts with a small handful of methods where laziness is not possible even in principle; the canonical example is `bool()`, which _must_ return a Python scalar (this is prescribed by the Python language).
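As an illustration of the `.shape` case (a minimal sketch, not any library's actual API), the row count can be deferred behind a tuple-like proxy, while `__bool__` has no such escape hatch because Python requires it to return a real `bool`:

```python
class LazyShape:
    """Duck-types as a 2-tuple ``(n_rows, n_columns)``; the row count is
    only computed (triggering execution) when actually accessed."""

    def __init__(self, count_rows, n_columns):
        self._count_rows = count_rows  # callable that triggers execution
        self._n_columns = n_columns

    def __len__(self):
        return 2

    def __iter__(self):
        yield self._count_rows()
        yield self._n_columns

    def __getitem__(self, i):
        return (self._count_rows(), self._n_columns)[i]
```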
Relevant discussions:
- [gh-195](https://github.com/data-apis/dataframe-api/issues/195)
- [gh-224](https://github.com/data-apis/dataframe-api/issues/224)
- [Discussion on 31 Aug 2023](https://hackmd.io/YAuPZ3aFTByq6jzTXVTTNg?view#Lazy-vs-eager-APIs), which covered most of the lazy/eager topics, including implicitly triggering execution
A quote from the first issue:
> A lot of discussion and thought has been put towards lazy / asynchronous execution and making sure the API that is being built out doesn't force unnecessary or inefficient synchronization / materialization. If there's places that are going against that spirit we should generally treat them as mistakes and work towards correcting them.
## Potential problems for lazy implementations
[This comment in gh-229](https://github.com/data-apis/dataframe-api/issues/229#issuecomment-1694743613) contains a list of dataframe and API methods/attributes that may be problematic for lazy libraries.
...
### Number of columns (or the schema) must be known statically
Mostly the schema can be determined by lazy libraries even before execution. There are exceptions though: `pivot` is a prominent one (see the sketch below).
The number of rows, on the other hand, may be unknown - hence methods like `unique`, whose output length depends on the data, are fine.
Note: needing to know the schema statically isn't fundamentally required for implementing a lazy library, but it is so deeply ingrained as an assumption in the implementation of dataframe libraries that in practice it's a hard requirement.
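To make the `pivot` exception concrete (illustrated with pandas): the output columns come from the *values* in a column, so the schema can't be known without looking at the data.

```python
import pandas as pd

df = pd.DataFrame({
    "date": ["d1", "d1", "d2"],
    "city": ["NY", "LA", "NY"],
    "temp": [20, 25, 22],
})

# The result has one column per unique value of "city" -- a lazy engine
# cannot know the output schema without executing the query.
wide = df.pivot(index="date", columns="city", values="temp")
```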
### Row ordering cannot be assumed
Eager libraries typically have a fixed row order; lazy libraries (and SQL engines) usually do not. No APIs that assume or depend on row order should be used.
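For example (illustrated with pandas): positional access bakes in an assumed row order, while an explicit sort makes the result well-defined regardless of how the engine orders rows.

```python
import pandas as pd

df = pd.DataFrame({"timestamp": [3, 1, 2], "value": [30.0, 10.0, 20.0]})

# Depends on whatever row order the engine happens to produce --
# well-defined eagerly, undefined for a lazy/SQL engine:
first = df.iloc[0]

# Order-independent: state the desired ordering explicitly first.
first = df.sort_values("timestamp").head(1)
```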
### Other API removals
Several other methods were removed after they were identified as problematic for lazy implementations, e.g. `.get_value` and `.get_rows_by_mask`.
### Fully lazy implementations
For the few operations that must be eager, like `__bool__()`, a library that is fully lazy (e.g., one that exports to an ONNX graph) has no way of supporting them, and hence it _must_ be allowed to raise in such cases. [gh-224#comment](https://github.com/data-apis/dataframe-api/issues/224#issuecomment-1678596739) has relevant context.
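A minimal sketch of what "allowed to raise" means here (illustrative only, not any library's actual code):

```python
class FullyLazyColumn:
    def __bool__(self):
        # A graph-building implementation has nothing it can return here:
        # Python requires an actual `bool`, which would force execution
        # that this library cannot perform.
        raise RuntimeError("bool() is not supported by this fully lazy library")
```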
## Ralf's perspective: comparison with how array libraries handle eager/lazy
After a lot of experience with TensorFlow/PyTorch/JAX/MXNet/NumPy+Numba, it looks like array libraries have converged on these lessons:
1. Defaulting to eager execution is the better developer experience
2. Lazy (or graph-based) execution & JIT compilation can be done for a strict subset of the eager API. There are no separate objects or functions/methods; if some eager API isn't supported in lazy mode it raises.
- This can be taken further by transparently falling back to eager execution as TorchDynamo does. There are pros and cons to doing so - too early to tell right now.
- For the few things that cannot be kept lazy, specific syntax can be introduced (e.g., the `bool()` builtin -> a `cond()` function; see the sketch below)
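JAX is a concrete example of this pattern: under `jax.jit`, calling `bool()` on a traced value raises, and `jax.lax.cond` is the dedicated syntax for data-dependent branching.

```python
import jax
import jax.numpy as jnp
from jax import lax

@jax.jit
def f(x):
    # `if jnp.sum(x) > 0:` would call bool() on a traced value and raise;
    # lax.cond expresses the branch in a way that stays lazy/traceable.
    return lax.cond(jnp.sum(x) > 0, lambda v: v + 1, lambda v: v - 1, x)

f(jnp.array([1.0, -2.0, 3.0]))
```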
Making users manually trigger execution ultimately doesn't scale, nor does it result in a design that generalizes well. What array libraries do today generalizes much better.
## What's the deal with Expressions?
I think the story is simple here:
1. Expressions are generally considered a nicer API than Columns (much easier to keep lazy, and more powerful).
2. _However_, many libraries currently do not have any such concept, and for users of the likes of Pandas, cuDF, Modin and Dask it may be a much bigger lift to introduce expressions - hence the two attempts to standardize expressions (and remove columns) were ultimately rejected.
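For a concrete sense of the difference, here is what expressions look like in Polars: an expression is a description of a computation, not tied to any one dataframe, which makes it trivially lazy and optimizable as part of the whole query plan.

```python
import polars as pl

lf = pl.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]}).lazy()

# `pl.col("a") + pl.col("b")` is an expression: a recipe, not a column
# owned by some dataframe. It only gets bound to data inside a query.
total = (pl.col("a") + pl.col("b")).alias("total")

result = lf.with_columns(total).collect()
```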