# Floating dtype with missing value

## Introduction

Assumption: regardless of the actual storage mechanism (e.g. it could still be np.nan sentinel values), we are going to use pd.NA as missing value indicator for a new floating dtype.

This means that 1) missing values behave as "missing" in operations, following the semantics we defined earlier for the new nullable dtypes (eg propagate for comparisons instead of evaluating to False), and 2) the user gets `pd.NA` when accessing a single missing value.

So instead of the current behaviour like this:

```python
>>> s = pd.Series([1.5, None])
>>> s
0    1.5
1    NaN
dtype: float64
>>> s > 1
0     True
1    False
dtype: bool
```

we will eventually want behaviour like this (following the new nullable dtype semantics):

```python
>>> s = pd.Series([1.5, None])
>>> s
0     1.5
1    <NA>
dtype: Float64
>>> s > 1
0     True
1     <NA>
dtype: boolean
```

The above shows how we use `NA` instead of `NaN` as the standard missing value. However, that still leaves the question of whether `NaN` values can also be present in addition to `NA`s.

So basically, we have these questions:

**1. What user-facing behaviour do we want regarding NaNs?**

- 1a. Only NA
- 1b. Both NaN and NA

**2. How to store the data?**

- 2a. Using a masked-array approach
- 2b. Using a masked-array approach, but ensuring that masked values are np.nan
- 2c. Using a single numpy array, with np.nan as sentinel for pd.NA

## User facing behaviour: NaN and NA vs only NA

Since `NaN` is a specific float value, and we now use the general `pd.NA` as missing value indicator, it is possible to still have actual `NaN` values in your data as well, in addition to `pd.NA` missing values. In principle, both could serve a different purpose: NA meaning "data missing" versus NaN meaning "not a number" / "bad computational result".

On the other hand, having both NA and NaN can also be confusing. So, alternatively, NaNs present in the data upon construction can be indicated in the mask, and their presence can be ignored from then on. Similarly, we could check for NaNs in certain operations that can introduce them (division, log, ...) and convert them to NA. This basically means that we disallow NaNs being present in the floating ExtensionArray.

Some non-conclusive notes about this:

- There are proponents of both approaches: some are in favor of being able to make the distinction (eg treating NaN as NA could hide bugs, see also [this](https://discourse.julialang.org/t/missing-or-nan/12729/2) Julia discourse topic), others argue that there is no clear distinction so we shouldn't make one in the code (e.g. the typical case of 0/0 can also occur because no data is available for a certain group, for example after summing or counting, so why should that be NaN instead of NA?).
- There are several levels of "NaN support" (a sketch of the last two options follows below):
  - "Full" / "native" support (fully distinguishing from NA, following how the compute engine (eg numpy) handles them): NaNs can be introduced by operations (eg 0/0), are not regarded as missing (eg in `isna`) and are not skipped in reductions.
  - Allow presence, but treat as NA: NaNs can be introduced, but are regarded as missing.
  - Disallow presence: this means that we add checks to operations that can produce NaNs (eg division) to ensure that either any produced NaN gets converted to NA, or an error is raised.
- Comparing with other data science frameworks and languages doesn't make it easier, given the large variability in NaN handling across systems. In the last section of this document, a brief overview of the NaN handling of several systems is given.
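To make those last two options concrete, here is a minimal sketch in plain numpy (the `fold_nan_into_mask` helper is hypothetical, not an existing pandas function) of how NaNs produced by an operation could be folded into the mask:

```python
import numpy as np

def fold_nan_into_mask(values, mask):
    # Hypothetical helper: fold NaNs produced by an operation into the
    # NA mask, so the user only ever sees pd.NA ("treat as NA" / "disallow").
    return values, mask | np.isnan(values)

values = np.array([1.0, 0.0, 2.0])
mask = np.array([False, False, True])   # third element is already NA

with np.errstate(divide="ignore", invalid="ignore"):
    result = values / np.array([0.0, 0.0, 1.0])   # -> [inf, nan, 2.0]

result, result_mask = fold_nan_into_mask(result, mask)
# result_mask is now [False, True, True]: the NaN from 0/0 became NA,
# while 1/0 stays inf (not a NaN). Under option 1b ("both NaN and NA"),
# this folding step would simply not happen.
```

Whether this counts as "treat as NA" or "disallow" then mostly depends on whether NaNs can still be observed in the values afterwards, or whether an error is raised instead of converting.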
## How to store the data: mask vs sentinel

For how to store the data, we basically have two options:

- Use a mask approach for the missing values: a separate boolean array that keeps track of valid vs missing values in the main float array of values.
- Use NaN as sentinel: we store a single float array where NaNs signal missing values.

Practically speaking, the second approach only allows the case of not distinguishing NA and NaN (and thus NaN would be a sentinel for NA). In theory, a *different* NaN (e.g. a signaling NaN versus the default quiet NaN) could be used as NA sentinel, but I didn't investigate whether this would be easily supported with Python / numpy.

Advantages of the sentinel approach:

- It's basically the same as what pandas does now for the float dtype, and thus existing implementations (eg of nanmean, nansum, etc) can be reused.
- It uses less memory.
- It gives cheap conversion from/to numpy (assuming you are fine with NaN as missing value indicator, eg like scikit-learn now does).

Advantages of the mask approach:

- It follows the other new nullable dtypes (integer, boolean), and thus ensures consistency within pandas (the same mechanism to store NAs across dtypes) and maximal code sharing for the nullable dtypes.
- It follows Apache Arrow (with the difference of using a boolean mask vs a bitmask, but moving to bitmasks is a potential enhancement for pandas in general). This potentially enables zero-copy exchange with systems using Apache Arrow (file formats such as Parquet, memory mapping Feather files, pyspark, ...).

A disadvantage of the mask approach is that it loses direct compatibility with numpy. Roundtripping from / to numpy is of course an important use case, but I think we could have mechanisms to keep this efficient. One idea might be that, when a FloatArray is created from a numpy array, it can track this information (in a similar way to a "cached" property that gets reset when the data is modified in pandas). When converting back to numpy, this would allow zero-copy conversion in situations where numpy data is wrapped in a DataFrame as-is (eg in scikit-learn pipelines).
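As a rough illustration of the trade-off, here is a sketch of the two layouts (illustrative only, not the actual pandas internals) for the example series from the introduction:

```python
import numpy as np

# A series like pd.Series([1.5, None]) with the new dtype could be stored as:

# (2a/2b) masked-array approach: a float array plus a boolean mask
# (option 2b additionally guarantees np.nan behind masked positions).
values = np.array([1.5, np.nan])
mask = np.array([False, True])    # True = missing (pd.NA)

# (2c) sentinel approach: a single float array, where NaN *is* the NA.
sentinel = np.array([1.5, np.nan])

print(values.nbytes + mask.nbytes)   # 18: the mask costs 1 extra byte/element
print(sentinel.nbytes)               # 16
```

The sentinel array can also be handed to numpy as-is, while the masked variant generally needs a copy (filling the masked positions with NaN), unless a tracking mechanism as described above applies.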
## Overview of other data science frameworks / languages

The general conclusion is that there is no clear line in how NaNs are handled; many details differ between the different systems. One could maybe distinguish the following groups regarding comparison and sorting:

- Numpy, Julia, and databases like ClickHouse: follow the standard for element-wise comparison, but sort NaNs last.
- Traditional databases (PostgreSQL), Spark: NaNs compare equal and are larger than any other value (from which it also follows that NaNs sort last).

But those still differ regarding whether NaNs are introduced in operations (eg 0/0), or how NaNs are handled in certain null-related functions. Additionally, more unique behaviour exists in R (has both NA and NaN, but basically treats NaN as NA, eg when it comes to skipping NAs) or DuckDB (disallows NaNs, only has NULL).

A notebook with some examples illustrating the summaries below is available here: https://gist.github.com/jorisvandenbossche/ddc824d4f5fc4e4efb575c3bd59ea201

**NumPy**

Numpy has no support for NA / NULL values, but just to re-iterate its behaviour regarding NaNs: numpy follows the IEEE 754 standard in these aspects:

- Comparison operations: NaNs evaluate as unequal (`NaN != NaN`) and compare False to any other value.
- Certain arithmetic operations introduce NaNs (eg 0/0).

But numpy deviates when it comes to sorting:

- When sorting, NaNs are sorted last to have deterministic results (https://numpy.org/doc/stable/reference/generated/numpy.sort.html).
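These rules are easy to verify; the following snippet just demonstrates current numpy behaviour:

```python
import numpy as np

print(np.nan == np.nan)   # False: NaN compares unequal to any value (IEEE 754)
print(np.nan != np.nan)   # True
print(np.nan > 1.0)       # False: ordered comparisons with NaN are False

with np.errstate(invalid="ignore"):
    print(np.array(0.0) / np.array(0.0))   # nan: 0/0 introduces a NaN

print(np.sort(np.array([2.0, np.nan, 1.0])))   # [ 1.  2. nan]: NaNs sorted last
```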
All other languages or frameworks mentioned below support a concept of NULL / NA.

**R**

R supports both NaN and NA in its numeric data type (https://stat.ethz.ch/R-manual/R-devel/library/base/html/is.finite.html). R partly follows the IEEE 754 standard (eg 0/0 gives NaN), but also deviates in several ways:

- Comparison with NaN results in NA (instead of False).
- Sorting: first values, then NaNs, then NA.
- Reductions with `na.rm = TRUE` skip both NA and NaN.
- The `is.na` function returns TRUE for both NA and NaN (but `is.nan` only checks for NaNs).

**Julia**

Julia has both as separate concepts, with missing values for float arrays provided by the `Missing` type (https://docs.julialang.org/en/v1/manual/missing/); the `ismissing` function checks for missing values. Similarly to NumPy, Julia mostly follows IEEE 754 when it comes to element-wise comparisons (always false, except for `!=`) and arithmetic operations like 0/0 producing NaN, but for sorting it deviates from those rules to provide deterministic results (NaNs always last, but before missing).

**Arrow / Dremio**

Arrow itself also distinguishes both concepts, since it uses bitmasks consistently for missing values. This is just storage, though, and there are not yet many actual computational operations provided in the `pyarrow` package itself. One part of Apache Arrow that already provides computational functionality is [Gandiva](https://www.dremio.com/announcing-gandiva-initiative-for-apache-arrow/) (an LLVM-based execution kernel for SQL-like expressions), which is eg used in Dremio's data services. One operation I could check is division by 0, which raises an error instead of producing a NaN.

**PostgreSQL**

Note from https://www.postgresql.org/docs/current/datatype-numeric.html:

> In most implementations of the “not-a-number” concept, NaN is not considered equal to any other numeric value (including NaN). In order to allow numeric values to be sorted and used in tree-based indexes, PostgreSQL treats NaN values as equal, and greater than all non-NaN values.

Other notes:

- Division by 0 raises an error (and thus does not introduce NaNs).
- NaNs propagate in reductions (while NULLs are skipped).

**Spark**

Since version 1.5, Spark handles NaNs (in addition to nulls), in a rather specific way (https://spark.apache.org/docs/latest/sql-reference.html#nan-semantics, https://issues.apache.org/jira/browse/SPARK-9076):

- NaNs are regarded as equal (`NaN == NaN`) and are larger than all other values.
- This ensures internal consistency with sorting and grouping (where NaNs are grouped together).
- Further, NaNs are also filled/dropped when filling / dropping nulls.

The first two points are thus consistent with PostgreSQL.

**Snowflake**

From https://docs.snowflake.com/en/sql-reference/data-types-numeric.html, comparison semantics for 'NaN' differ from the IEEE 754 standard in a similar way to PostgreSQL (NaN values evaluate as equal, and compare larger than any other value).

**ClickHouse**

ClickHouse supports NaN and Inf values (https://clickhouse.tech/docs/en/sql-reference/data-types/float/):

- In contrast to standard SQL, ClickHouse generates NaN for eg 0/0.
- When sorting, it has the "nulls last" default (first values, then NaNs, then Nulls) (https://clickhouse.tech/docs/en/sql-reference/statements/select/#select-order-by).

In this way, it seems similar to NumPy and Julia.

**DuckDB**

DuckDB recently changed to disallow the presence of NaN/Inf values (https://github.com/cwida/duckdb/issues/530, https://github.com/cwida/duckdb/pull/541). Operations that originally returned NaN now either return NULL or raise an error.