# Nullable types by default
Nullable types are pretty neat, as they allow for missing values via `pd.NA` without having to cast to `float64`. Currently, they need to be opted into. What steps would need to be taken to enable them by default?
## Issue #1: converting to NumPy
### Problem statement
Converting pandas DataFrames to NumPy is a common operation, and people do it for a variety of reasons. Without nullable types, the conversion is to the same NumPy dtype (for `Series`), or to a dtype which can hold all columns' dtypes (for `DataFrame`):
- DataFrame (`int64` and `float64` become `float64`):
```
In [6]: df = pd.DataFrame({'a': [1,2,3], 'b': [4., 5., 6.]}).astype({'a': 'int64', 'b': 'float64'})
In [7]: df.to_numpy()
Out[7]:
array([[1., 4.],
[2., 5.],
[3., 6.]])
In [8]: df.to_numpy().dtype
Out[8]: dtype('float64')
```
- Series (`int64` becomes `int64`):
```
In [9]: df['a'].to_numpy()
Out[9]: array([1, 2, 3])
In [10]: df['a'].to_numpy().dtype
Out[10]: dtype('int64')
```
None of this should come as a surprise to users.
For nullable types, the situation is different. Currently, both `Series` and `DataFrame` are cast to `object`, regardless of whether they contain missing values:
```
In [11]: df.convert_dtypes().to_numpy()
Out[11]:
array([[1, 4],
[2, 5],
[3, 6]], dtype=object)
In [12]: df.convert_dtypes()['a'].to_numpy()
Out[12]: array([1, 2, 3], dtype=object)
```
This is an issue because NumPy discourages `object` dtype, and it's probably not what users expect.
The main barrier to "just" converting `Int64` to `int64` is that the former can hold missing values, whilst the latter can't.
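To illustrate the barrier (a minimal example; the failure is NumPy's, nothing pandas-specific):

```python
import numpy as np
import pandas as pd

# Int64 stores values plus a validity mask, so it can represent pd.NA...
s = pd.Series([1, 2, pd.NA], dtype="Int64")
print(s.dtype)  # Int64

# ...but a plain int64 ndarray has no slot for a missing value:
try:
    np.array([1, 2, pd.NA], dtype="int64")
except (TypeError, ValueError) as exc:
    print(exc)
```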
### Potential solution #1: value-dependent conversion
The most obvious solution would be value-dependent conversion:
- if an `Int64` Series contains missing values, convert to `object`
- else, convert to `int64`
However, there is some desire among pandas devs to avoid such value-dependent behaviour, see [this comment](https://github.com/pandas-dev/pandas/issues/30038#issuecomment-613427298).
Users may also be surprised.
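As a rough sketch (`value_dependent_to_numpy` is a hypothetical helper, not a real pandas API), solution #1 could look like:

```python
import numpy as np
import pandas as pd

def value_dependent_to_numpy(s: pd.Series) -> np.ndarray:
    # Hypothetical sketch of value-dependent conversion.
    if s.isna().any():
        # Missing values present: fall back to object.
        return s.to_numpy(dtype="object")
    # No missing values: safe to use the corresponding numpy dtype.
    return s.to_numpy(dtype=s.dtype.numpy_dtype)

print(value_dependent_to_numpy(pd.Series([1, 2, 3], dtype="Int64")).dtype)   # int64
print(value_dependent_to_numpy(pd.Series([1, pd.NA], dtype="Int64")).dtype)  # object
```

The same input dtype can thus produce two different output dtypes, which is exactly the kind of behaviour the linked comment argues against.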
### Potential solution #2: convert to the corresponding NumPy dtype, raise if can't
If an `Int64` Series doesn't contain missing values, then just convert:
```
In [4]: pd.Series([1,2,3], dtype='Int64').to_numpy()
Out[4]: array([1, 2, 3])
```
If it does have missing values, then raise an informative error message:
```
In [8]: pd.Series([1,2,pd.NA], dtype='Int64').to_numpy()
---------------------------------------------------------------------------
ValueError: cannot convert to '<class 'int'>'-dtype NumPy array with missing values.
Please either:
- pass `dtype='float64'`
- pass `dtype='object'`
- specify an appropriate 'na_value' for this dtype
```
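All three escape hatches already exist on `Series.to_numpy`, so users who hit the error can do, for example:

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, pd.NA], dtype="Int64")

# The three options from the error message above:
print(s.to_numpy(dtype="float64", na_value=np.nan))  # [ 1.  2. nan]
print(s.to_numpy(dtype="object"))                    # [1 2 <NA>]
print(s.to_numpy(dtype="int64", na_value=-1))        # [ 1  2 -1]
```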
One major difficulty is that pandas sometimes converts to numpy internally, and so this error message might show up to users in places where they can't do anything about it. This would need addressing if we were to go with this solution.
### Challenges
One big challenge in the above is that `np.asarray` is used a lot internally in pandas. For example, in
```python
df = pd.Series([1, 2, 3], dtype='int64').to_frame()
df.where(df>1, pd.Series([1, 2, pd.NA], dtype='Int64'), axis=0)
```
we hit `np.broadcast_to(other)` in
https://github.com/pandas-dev/pandas/blob/5344107b8c53d1f176717a257e4df9644364a7d5/pandas/core/generic.py#L9740
which triggers `np.asarray` under the hood. Here, `pd.NA` can't be cast to `int64`, but there's no option to pass a dtype to `broadcast_to`, so nothing gets handed down to `np.asarray` either.
It's not easy to get around that, as pandas arrays don't necessarily have 2D support https://github.com/pandas-dev/pandas/pull/49055#discussion_r994794095.
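The `object` fallback can be seen directly: `np.asarray` offers no `dtype`/`na_value` escape hatch, so (at the time of writing) the extension array's `__array__` falls back to `object`:

```python
import numpy as np
import pandas as pd

arr = pd.array([1, 2, pd.NA], dtype="Int64")

# With no way to specify how pd.NA should be represented,
# the conversion produces an object-dtype ndarray:
print(np.asarray(arr).dtype)  # object
```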
Note that filling a Series with an ExtensionArray upcasts to ExtensionArray, whereas doing the same with a DataFrame just results in `object` dtype.
- For `Series`, how / where does the upcasting happen? In `pandas/core/internals/blocks.py`, via `block = self.coerce_to_target_dtype(other)`.
- Why can't it happen in the `DataFrame` case? There, we don't even get that far. We get stuck on
```
kwargs[k] = obj.iloc[:, b.mgr_locs.indexer]._values
```
In the `Series` case, this works out fine, because `._values` returns an `IntegerArray`. But in the 2D case, it fails, because `._values` tries to return a NumPy array, which might not be able to hold missing values.
Could we just return `object` and coerce later? Maybe - the coercion could happen within `blocks.py`, in `block = self.coerce_to_target_dtype(other)`. The issue is that that's where it currently happens, and since `other` is of `object` dtype, the result just ends up being `int64`.
Could `._values` return a 2d pandas array?
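The asymmetry is easy to observe with the same private attributes discussed above (behaviour as of current pandas):

```python
import pandas as pd

s = pd.Series([1, 2, pd.NA], dtype="Int64")
df = s.to_frame("a")

# 1D case: ._values hands back the extension array itself.
print(type(s._values).__name__)  # IntegerArray

# 2D case: ._values must squeeze everything into a single ndarray,
# and the only numpy dtype that can hold pd.NA is object.
print(df._values.dtype)  # object
```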
Another issue is that although it's possible to reshape a pandas array to be 2D, many operations don't allow it. E.g.:
```
In [7]: pd.Series([1,2,3])._mgr.blocks[0].astype('Int64')
Out[7]: ExtensionBlock: 3 dtype: Int64
In [11]: pd.DataFrame({'a': [1,2,3]})._mgr.blocks[0].astype('Int64')
---------------------------------------------------------------------------
File ~/pandas-dev/pandas/core/arrays/numeric.py:189, in _coerce_to_data_and_mask(values, mask, dtype, copy, dtype_cls, default_dtype)
186 raise TypeError(f"{values.dtype} cannot be converted to {name}")
188 if values.ndim != 1:
--> 189 raise TypeError("values must be a 1D list-like")
191 if mask is None:
192 if is_integer_dtype(values):
193 # fastpath
TypeError: values must be a 1D list-like
```
For `DataFrame.to_numpy()`, in `pandas/core/internals/managers.py`, pandas distinguishes between the single-block and multiple-block cases, filling in missing values along the way.
In the multiple-block case, `interleave_dtypes` is called, which finds a common dtype. Currently, it's hardcoded to return `object` for nullable types. This could well be changed to return a NumPy dtype, provided the same were done for the single-block case.
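Expressed through the public API, here is today's behaviour and what the change would roughly amount to:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]}).convert_dtypes()

# Today, interleaving nullable blocks is hardcoded to object:
print(df.to_numpy().dtype)  # object

# The change would make this the default instead (no missing values here,
# so the conversion is unambiguous):
print(df.to_numpy(dtype="float64").dtype)  # float64
```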
## Impact on downstream libraries
### Seaborn / matplotlib
Right now, Seaborn doesn't handle missing values in nullable dtypes:
```ipython
In [2]: sns.barplot(x=pd.Series([1,2,3]), y=pd.Series([1,2,np.nan]))
Out[2]: <AxesSubplot: >
In [3]: sns.barplot(x=pd.Series([1,2,3]), y=pd.Series([1,2,np.nan]).convert_dtypes())
---------------------------------------------------------------------------
TypeError: float() argument must be a string or a number, not 'NAType'
```
Nor by `matplotlib`, see https://github.com/matplotlib/matplotlib/issues/23991#issuecomment-1256438568
### Scikit-learn
Scikit-learn converts to `float64` internally, so it already works fine:
```ipython
In [11]: feature = pd.Series([1, 4, 5]).convert_dtypes()
In [12]: target = pd.Series([1, 4, 5]).convert_dtypes()
In [13]: from sklearn.linear_model import LinearRegression
In [14]: model = LinearRegression().fit(feature.to_numpy().reshape(-1, 1), target.to_numpy())
In [15]: model.predict(feature.to_numpy().reshape(-1, 1))
Out[15]: array([1., 4., 5.])
```
## Resources
- [API: distinguish NA vs NaN in floating dtypes](https://github.com/pandas-dev/pandas/issues/32265)
- [float(pd.NA)](https://github.com/pandas-dev/pandas/issues/48864)