# Pandas String Dtype

This enhancement proposes a new extension type for text data. It does *not* propose changing the in-memory data format (a NumPy ndarray of objects). Rather, it attempts to define a stable user API for text data, whose implementation may later be optimized.

```
>>> s = pd.Series(['a', 'b', 'c'], dtype='string')
>>> s
0    a
1    b
2    c
dtype: string

>>> s.array
StringArray(['a', 'b', 'c'])
```

## Motivation

Currently, pandas stores text data as an object-dtype NumPy array of Python strings.

```python
>>> s = pd.Series(['a', 'b', 'c'])
>>> s.values
array(['a', 'b', 'c'], dtype=object)
```

Thinking just about the user API, this is unfortunate for a few reasons:

1. It overloads the meaning of `object` dtype in pandas. Most of the time it means "text" data, but as far as the dtype is concerned it can mean any Python object.
2. It's confusing for newcomers. While we try to document this, it's not exactly intuitive.

In addition to the API concerns, a dedicated `StringArray` extension type allows for using an alternative in-memory container that is better suited to storing variable-width text data (e.g. Arrow). We can keep the actual implementation private, as we do with `IntegerArray`, and expose methods for converting to NumPy if users want that.

## Detailed Design

We'll add

1. `StringDtype`, a subclass of `ExtensionDtype`
2. `StringArray`, a subclass of `ExtensionArray`

`StringDtype.type` will be `str` (the scalar unit for `StringDtype` is a Python `str`). Because we're not changing the actual storage mechanism, the initial `StringArray` implementation will be quite similar to `PandasArray`, with the restriction that only `object` dtype is allowed for the underlying container.

```python
>>> arr = pd.array(['a', 'b', 'c'], dtype='string')
>>> arr
StringArray(['a', 'b', 'c'])
```

It's not yet clear what the scalar missing value should be for this type. Some options include `None`, `np.nan`, or a new `NaS` ("not a string") type.
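For context on why the choice matters: with today's object dtype, several distinct Python objects can already act as the missing-value scalar in the same Series. A small illustration using only existing pandas behavior:

```python
import numpy as np
import pandas as pd

# With object dtype, different Python objects can all act as "missing":
# None and np.nan are both treated as NA, but remain distinct scalars.
s = pd.Series(['a', None, np.nan], dtype='object')

print(s.isna().tolist())   # [False, True, True]
print(type(s[1]).__name__)  # NoneType
print(type(s[2]).__name__)  # float
```

A dedicated missing-value scalar for `StringDtype` would remove this ambiguity: code inspecting individual elements would see a single, well-defined NA type.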
The empty string is not an acceptable missing value marker.

We might consider stricter validation on methods like `StringArray.__setitem__` and `fillna` to prevent inserting non-string data into a `StringArray`:

```python
In [16]: s = pd.Series(['a', 'b', 'c'])

In [17]: s[0] = 1

In [18]: type(s[0])
Out[18]: int
```

```python
In [19]: s = pd.Series(['a', 'b', None], dtype='object')

In [20]: type(s.fillna(1)[2])
Out[20]: int
```

We would expect `In [17]` and `In [20]` to raise. However, when the `value` is an array, that might require an elementwise scan of the values to check for invalid elements. We'll need to benchmark the cost of this.

## API Changes

Initially, this should be opt-in. In code,

```
>>> pd.Series(['a', 'b', 'c'])
```

will continue to return a Series backed by an object-dtype ndarray, and IO methods like `read_csv` will continue to return columns backed by object-dtype ndarrays. By specifying `dtype='string'`, users can opt into the new behavior. We might add convenience keywords to readers / writers, like `text_as_strings=True/False`, to enable using the new string dtype for all string columns.

In a later release, we'll update places where we infer string dtype (using `lib.infer_dtype == 'string'`) to warn, stating that users should specify `dtype='string'` explicitly. We might consider adding an option to the pandas config to control this, if we deem updating a setting preferable to updating every call site.
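The stricter `__setitem__` / `fillna` validation discussed earlier could look something like the following sketch. `validate_string_values` is a hypothetical standalone helper, not existing pandas API, and it assumes `None` and `np.nan` are both accepted as missing-value markers pending the NA-scalar decision above:

```python
import numpy as np

def validate_string_values(value):
    """Hypothetical helper: raise TypeError unless ``value`` is a str,
    an accepted missing-value marker, or an array-like of only those."""
    def _ok(scalar):
        # Accept str plus the candidate NA markers discussed above.
        return (
            isinstance(scalar, str)
            or scalar is None
            or (isinstance(scalar, float) and np.isnan(scalar))
        )

    if np.ndim(value) == 0:
        # Scalar case: a single cheap check.
        if not _ok(value):
            raise TypeError(f"Cannot store {value!r} in a StringArray")
    else:
        # Array case: requires the elementwise scan whose cost
        # the proposal says would need benchmarking.
        for scalar in np.asarray(value, dtype=object).ravel():
            if not _ok(scalar):
                raise TypeError(f"Cannot store {scalar!r} in a StringArray")

validate_string_values('a')                # accepted
validate_string_values(['a', None])        # accepted
try:
    validate_string_values(1)              # rejected: would make In [17] raise
except TypeError as exc:
    print(exc)
```

The scalar path is O(1), so the benchmarking question only concerns the array path, where the scan is linear in the length of `value`.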