# Numpy 2.0 lightning talk: Upstreaming `StringDType`
Peyton Murray and I (both at Quansight Labs) are working on a new variable-width string dtype built on the experimental NEP 42 dtype API, as part of a NASA ROSES grant to improve interoperability in the SciPy ecosystem.
https://github.com/numpy/numpy-user-dtypes/tree/main/stringdtype
The primary motivation is to improve string support in pandas, which currently uses either object arrays or PyArrow string arrays, both of which have downsides.
It would be much better if Numpy had support for variable-width string arrays.
See also https://numpy.org/neps/roadmap.html#extensibility where a variable-width unicode string dtype is called out on the project roadmap.
## Design
The array buffer holds a C struct like this:
```clike
typedef struct static_string {
    size_t len;
    char *buf;
} static_string;
```
String data are stored UTF-8 encoded in the `buf` field.
Currently the actual string data are manually managed with `malloc`/`free` on the heap, although we may explore different allocation strategies (e.g. a memory pool).
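To make the layout concrete, here is a Python mirror of the struct above using `ctypes`; this is only an illustrative sketch (the real dtype manages these structs in C), and it assumes `len` holds the UTF-8 byte length of `buf` rather than the character count:

```python
import ctypes

# Illustrative Python mirror of the C static_string struct.
class static_string(ctypes.Structure):
    _fields_ = [
        ("len", ctypes.c_size_t),  # assumed: number of UTF-8 bytes in buf
        ("buf", ctypes.c_char_p),  # pointer to the UTF-8 string data
    ]

data = "héllo".encode("utf-8")  # 6 bytes for 5 characters
s = static_string(len(data), data)
print(s.len)                    # 6, the byte length, not the character count
print(s.buf.decode("utf-8"))    # héllo
```

Because each array element is a fixed-size struct, the array buffer itself stays contiguous while the variable-width string data lives on the heap.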
Some early pandas benchmark wins:
```
[41.67%] ··· strings.Construction.time_frame_construction ok
[41.67%] ··· =============== ==========
dtype
--------------- ----------
string object 8.86±0ms
StringDType() 486±0μs
=============== ==========
[50.00%] ··· strings.Construction.time_series_construction ok
[50.00%] ··· =============== ==========
dtype
--------------- ----------
string object 6.79±0ms
StringDType() 541±0μs
=============== ==========
```
## Upstreaming to Numpy?
I think it might ultimately make sense to make `StringDType` the default dtype for string data, although the timeline for shipping that in Numpy 2.0 might be too tight.
Even without that, I think the dtype belongs in Numpy itself, and having it available will be broadly useful to the community.
## What are the blockers?
* ### NEP 42 DTypes are not production-ready yet.
* Still a number of places in Numpy where using the new dtypes leads to a segfault or unhandled exception.
* May need additional DType API surface for more functionality.
* Buffer protocol support for new dtypes needed for Cython support.
* ### Missing values
* This is an opportunity to figure out how to handle missing data more generically in Numpy.
* Should Numpy grow e.g. `np.NA` to represent a missing value in an array, which pandas could then use?
* ### String ufuncs
* Not strictly needed, since the functions in `np.char` work, but those are much slower than ufuncs would be.
* Probably a lot of work to cover the full Python string API; in addition, Numpy may need to depend on e.g. ICU or another Unicode library.
* Where should the ufuncs go?
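For reference, the `np.char` functions mentioned above already cover much of the Python string API today (shown here with Numpy's built-in fixed-width string dtype, since `StringDType` is still experimental), just via slow per-element loops rather than true ufuncs:

```python
import numpy as np

# np.char applies Python string methods element-wise over a string array.
a = np.array(["numpy", "strings"])   # fixed-width '<U7' dtype
print(np.char.upper(a))              # ['NUMPY' 'STRINGS']
print(np.char.str_len(a))            # [5 7]
```

String ufuncs for `StringDType` would offer the same operations without the per-element Python-level overhead.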