# Pandas String Dtype
This enhancement proposes a new extension type for text data. It does *not*
propose changing the in-memory data format (a NumPy ndarray of objects). Rather,
it attempts to define a stable user-API for text data, whose implementation may
later be optimized.
```python
>>> s = pd.Series(['a', 'b', 'c'], dtype='string')
>>> s
0    a
1    b
2    c
dtype: string
>>> s.array
StringArray(['a', 'b', 'c'])
```
## Motivation
Currently, pandas stores text data as an object-dtype NumPy array of Python strings.
```python
>>> s = pd.Series(['a', 'b', 'c'])
>>> s.values
array(['a', 'b', 'c'], dtype=object)
```
Thinking just about the user API, this is unfortunate for a few reasons:
1. It overloads the meaning of `object`-dtype in pandas. Most of the time, it
means "text" data, but as far as the dtype is concerned it means any Python
object.
2. It's confusing for newcomers. While we try to document this, it's
not exactly intuitive.
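The ambiguity in point 1 can be seen directly: at the dtype level, an array of strings is indistinguishable from an array of arbitrary Python objects.

```python
import numpy as np

# object dtype says nothing about the contents: strings, numbers,
# and arbitrary objects all look the same at the dtype level.
texts = np.array(["a", "b", "c"], dtype=object)
mixed = np.array(["a", 1, object()], dtype=object)

assert texts.dtype == mixed.dtype  # both are just 'object'
```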
In addition to the API concerns, I'll note that a dedicated `StringArray`
extension type allows for using an alternative in-memory container that is
better suited to storing variable-width text data (e.g. Arrow). We can keep the
actual implementation private, as we do with `IntegerArray`, and expose methods
for converting to NumPy if users want that.
## Detailed Design
We'll add
1. `StringDtype`, a subclass of `ExtensionDtype`
2. `StringArray`, a subclass of `ExtensionArray`
`StringDtype.type` will be `str` (the scalar unit for `StringDtype` is a Python
`str`).
Because we're not changing the actual storage mechanism, the initial
`StringArray` implementation will be quite similar to `PandasArray` with the
restriction that only `object`-dtype is allowed for the underlying container.
```python
>>> arr = pd.array(['a', 'b', 'c'], dtype='string')
>>> arr
StringArray(['a', 'b', 'c'])
```
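To make the intended restriction concrete, here is a minimal sketch of the idea (the class name and details are hypothetical; the real `StringArray` would implement the full `ExtensionArray` interface):

```python
import numpy as np

class StringArraySketch:
    """Hypothetical sketch: object-dtype storage restricted to strings.

    This only illustrates the storage restriction; it is not the
    proposed implementation.
    """

    def __init__(self, values):
        arr = np.asarray(values, dtype=object)
        for v in arr:
            if not isinstance(v, str):
                raise TypeError(
                    f"StringArray requires str values, got {type(v).__name__}"
                )
        # Storage stays an object-dtype ndarray, as with PandasArray.
        self._ndarray = arr

    def __repr__(self):
        return f"StringArray({list(self._ndarray)!r})"

arr = StringArraySketch(['a', 'b', 'c'])
print(arr)  # StringArray(['a', 'b', 'c'])
```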
It's not yet clear to me what the scalar missing value should be for this type.
Options include `None`, `np.nan`, or a new `NaS` ("not a string") singleton. The
empty string is not an acceptable missing value marker.
We might consider stricter validation on methods like `StringArray.__setitem__`
and `fillna` to prevent inserting non-string data into a `StringArray`.
```python
In [16]: s = pd.Series(['a', 'b', 'c'])
In [17]: s[0] = 1
In [18]: type(s[0])
Out[18]: int
```
```python
In [19]: s = pd.Series(['a', 'b', None], dtype='object')
In [20]: type(s.fillna(1)[2])
Out[20]: int
```
We would expect `In [17]` and `In [20]` to raise for a `StringArray`. However,
when the `value` is an array, validation might require an elementwise scan of
the values to check for invalid elements. We'll need to benchmark the cost of
this check.
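As a sketch of what that validation could look like (hypothetical names, assuming object-dtype storage; `None` is accepted here only as a placeholder, since the missing value marker is still undecided):

```python
import numpy as np

class ValidatingStringArray:
    """Hypothetical sketch of stricter ``__setitem__`` validation."""

    def __init__(self, values):
        self._ndarray = np.asarray(values, dtype=object)

    @staticmethod
    def _check(value):
        # Missing value marker is TBD; accept None for illustration.
        if not (value is None or isinstance(value, str)):
            raise TypeError(
                f"Cannot set non-string value {value!r} into a StringArray"
            )

    def __setitem__(self, key, value):
        if np.ndim(value) == 0:
            # Scalar case: a single cheap isinstance check.
            self._check(value)
        else:
            # Array case: the elementwise scan whose cost we'd benchmark.
            for v in value:
                self._check(v)
        self._ndarray[key] = value

arr = ValidatingStringArray(['a', 'b', 'c'])
arr[0] = 'z'          # fine
arr[1:] = ['x', 'y']  # fine, after an elementwise scan
```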
## API Changes
Initially, this should be opt-in. In code,
```python
>>> pd.Series(['a', 'b', 'c'])
```
will continue to return a Series backed by an object-dtype ndarray. IO methods
like `read_csv` will continue to return columns backed by object-dtype ndarrays.
By specifying `dtype='string'`, users can opt into the new behavior. We might
add a convenience keyword to readers / writers, like `text_as_strings=True/False`,
to enable using the new string dtype for all string columns.
In a later release, we'll update the places where we infer string dtype
(using `lib.infer_dtype == 'string'`) to warn, stating that users should specify
`dtype='string'` explicitly. We might consider adding an option to the pandas
config to control this, if we deem updating a single setting preferable to
updating every call site.