# Expressions and column ownership

Why was column ownership introduced in the first place? What problems does it pose? I'll explain why Polars has expressions, what problems they solve, and how not explicitly allowing them in the Consortium's API led to the strange "parent dataframe" rules.

## Background: why does Polars even have expressions?

pandas / cuDF / Modin / etc. are unable to perform non-elementary groupby operations efficiently - their API just doesn't allow them to be expressed without `lambda` function UDFs.

Let's take Q1 from the TPC-H benchmarks (simplified, so we can focus on the essentials) as an example:

```sql
select
    sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
    sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge
from lineitem
group by l_returnflag, l_linestatus
```

How would we translate this to pandas syntax?

### pandas syntax

If we express this in pandas syntax, without manual optimisation, we have to write something like

```python
def func(sub_df):
    return pd.Series(
        [
            (sub_df['l_extendedprice'] * (1 - sub_df['l_discount'])).sum(),
            (sub_df['l_extendedprice'] * (1 - sub_df['l_discount']) * (1 + sub_df['l_tax'])).sum(),
            # other aggs go here
        ],
        # output column names go in the index (!)
        index=['sum_disc_price', 'sum_charge'],
    )

result = df.groupby(['l_returnflag', 'l_linestatus']).apply(func, include_groups=False)
result
```

This is "terrible" for a couple of reasons:

- `apply` with a Python UDF is inefficient, and the pandas groupby API just has no way of expressing anything non-trivial without one;
- `sub_df['l_extendedprice'] * (1 - sub_df['l_discount'])` had to be computed twice!

Note: in this case, you might be able to spot a simple manual optimisation which allows you to avoid the lambda. That's not always possible: see TPC-H Q8 for a much harder example!
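For reference, here's roughly what that manual optimisation could look like for Q1 - a sketch using made-up toy data (the column values are illustrative, not real TPC-H rows):

```python
import pandas as pd

# Toy lineitem data (illustrative values only).
lineitem = pd.DataFrame({
    "l_returnflag": ["A", "A", "N"],
    "l_linestatus": ["F", "F", "O"],
    "l_extendedprice": [100.0, 200.0, 300.0],
    "l_discount": [0.05, 0.10, 0.00],
    "l_tax": [0.02, 0.04, 0.06],
})

# Compute the shared intermediate column once, up front...
disc_price = lineitem["l_extendedprice"] * (1 - lineitem["l_discount"])
charge = disc_price * (1 + lineitem["l_tax"])

# ...then use named aggregation, so no per-group Python UDF is needed.
result = (
    lineitem.assign(disc_price=disc_price, charge=charge)
    .groupby(["l_returnflag", "l_linestatus"], as_index=False)
    .agg(sum_disc_price=("disc_price", "sum"), sum_charge=("charge", "sum"))
)
```

This sidesteps both problems for Q1 specifically - but only because we spotted the rewrite by hand; the engine gets no chance to do it for us.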
### Polars syntax

The Polars API, on the other hand, lets us write:

```python
lineitem.group_by("l_returnflag", "l_linestatus").agg(
    sum_disc_price=(pl.col("l_extendedprice") * (1 - pl.col("l_discount"))).sum(),
    sum_charge=(
        pl.col("l_extendedprice")
        * (1 - pl.col("l_discount"))
        * (1 + pl.col("l_tax"))
    ).sum(),
)
```

This solves both of the issues above:

- we have a way of expressing arbitrarily complex aggregations within a groupby context without using lambda functions, so the aggregation can be evaluated efficiently;
- because expressions are just functions (and _not_ lazy columns!), they can be inspected in isolation outside of queries. In particular, in this case, a good query optimiser should be able to figure out that `pl.col("l_extendedprice") * (1 - pl.col("l_discount"))` is repeated, and only evaluate it once. Indeed, that's exactly what Polars and DuckDB do!

Note that the above is not a "lazy vs eager" thing. `lineitem` could well be an eager dataframe, and the same principles would apply.

## Making expressions fit into the dataframe API

Writing Polars without expressions would mean giving up on its innovations, so let's not consider it. What happens if we try to shoehorn expressions into the DataFrame API? Consider the following DataFrame API syntax:

```python
df: DataFrame
df = df.filter(df.get_column('a') > 3)
```

What should `df.get_column('a')` return in Polars' case?

- it could return a `Series`, but then that's a missed opportunity for Polars;
- it could return the expression `pl.col('a')`.

The latter may seem like a good idea - however, consider this:

```python
df_0: DataFrame
df_1: DataFrame
df = df_0.filter(df_1.get_column('a') > 3)
```

If `df_1.get_column('a')` were backed by the expression `pl.col('a')`, then this would evaluate to

```python
df_0.filter(pl.col('a') > 3)
```

But that's not correct! Expressions are just functions, and are only evaluated within a dataframe context.
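To make "expressions are just functions" concrete, here's a toy model in plain Python - `col` and `gt` are hypothetical helpers for illustration, not Polars internals or any real API:

```python
import pandas as pd

# Toy model: an "expression" is a function from a dataframe to a column,
# built up by composition. (Illustrative only - not how Polars is implemented.)
def col(name):
    return lambda df: df[name]

def gt(expr, value):
    return lambda df: expr(df) > value

# Plays the role of pl.col("a") > 3.
mask = gt(col("a"), 3)

df_0 = pd.DataFrame({"a": [1, 5, 10]})
df_1 = pd.DataFrame({"a": [0, 0, 0]})

# The expression carries no reference to df_1: evaluated in df_0's context,
# it simply reads df_0's column "a".
print(mask(df_0).tolist())  # → [False, True, True]
```

If `df_1.get_column('a')` were backed by such a function, filtering `df_0` with it would silently consult `df_0`'s data - `df_1` never enters the picture.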
For example, `pl.col('a') > 3` can be thought of as:

```python
lambda df: (lambda df: df['a'])(df) > 3
```

So in this case, we would end up keeping rows of `df_0` where `df_0['a']` is greater than 3 - even though the user asked to filter by `df_1`'s column! Naively backing columns with expressions leads to wrong results.

## The "parent dataframe" compromise

In order to be able to make `DataFrame.get_column` return an expression, we need to make sure that the expression is only used in a dataframe context belonging to the dataframe it was created from. For example:

- ✅ `df_0.filter(df_0.get_column('a') > 3)` is allowed;
- ❌ `df_0.filter(df_1.get_column('a') > 3)` is not.

This is hardly desirable, though. People are very likely to run into mysterious errors such as "cannot compare column with dataframe it was not derived from", which makes for a poor user experience.

It gets even worse when we consider scalars. What is `df_0.get_column('a').mean()` supposed to return? If `df_0.get_column('a')` is an expression (i.e. a function), then so is `df_0.get_column('a').mean()`. And that means that even scalars need a concept of a "parent dataframe"! Given that users expect to be able to use scalars as free-standing objects, this gets incredibly confusing and frustrating to develop against.

## Conclusion

Writing efficient Polars means using expressions. The Consortium doesn't want to distinguish columns from expressions. Having to squash the two into a single class has led to very unfriendly rules surrounding "parent dataframes".