# String dtype roadblocks
These issues are ordered from least tractable to most tractable.
* ### No API to control iteration/indexing
* I think we'll need some way to override NumPy's iteration machinery. I'm a little fuzzy on how this currently works in NumPy and whether it's a real problem.
* Not an issue for ufuncs or casting, because we fully control the inner loop and will have access to the needed offset information there.
* Could be an issue for e.g. data selection. There's no API to control which elements are selected when someone does `arr[::3]`. For fixed-width types this isn't a problem, but for variable-width types NumPy can't just jump a fixed number of bytes ahead to account for strides, as the sketch below illustrates.
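* As a sketch of the difference: the fixed-width selection below is pure pointer arithmetic, while the variable-width version has to consult per-element metadata. The `offsets` table is an assumption about our planned layout, not existing NumPy API, and `ptrdiff_t` stands in for `npy_intp` to keep the example self-contained:
```C
#include <stddef.h>
#include <string.h>

/* Fixed-width arr[start::step]: the stride is a constant byte count. */
static void
select_fixed(char *dst, const char *data, ptrdiff_t start,
             ptrdiff_t step, ptrdiff_t n, ptrdiff_t itemsize)
{
    for (ptrdiff_t i = 0; i < n; i++) {
        memcpy(dst + i * itemsize,
               data + (start + i * step) * itemsize, itemsize);
    }
}

/* Variable-width: each element's location and size are data-dependent,
 * so the loop needs an offsets table (n_elements + 1 entries) instead
 * of a fixed byte stride. */
static void
select_variable(char *dst, const char *data, const ptrdiff_t *offsets,
                ptrdiff_t start, ptrdiff_t step, ptrdiff_t n)
{
    ptrdiff_t written = 0;
    for (ptrdiff_t i = 0; i < n; i++) {
        ptrdiff_t j = start + i * step;
        ptrdiff_t size = offsets[j + 1] - offsets[j];
        memcpy(dst + written, data + offsets[j], size);
        written += size;
    }
}
```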
* ### The signatures for the dtype's getitem and setitem methods need to be updated.
* Currently setitem gets a reference to the array's dtype instance, a reference to the object being assigned, and a pointer into the array data:
```C
static int asciidtype_setitem(
    ASCIIDTypeObject *descr,  /* the array's dtype instance */
    PyObject *obj,            /* the value being assigned */
    char *dataptr)            /* pointer to the element's storage */
```
* The pointer will in general point into the middle of the array buffer, and for variable-width dtypes there's no way to know the size of the array element `dataptr` points to.
* We also don't have control over how `dataptr` is selected. Does the machinery in NumPy assume fixed-width dtypes?
* Assuming NumPy knows how to determine `dataptr`, additionally passing setitem and getitem the `index` of the item would be sufficient for our purposes.
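* For example, a hypothetical extension of the signature above; this is not existing NumPy API, just an illustration of what would suffice:
```C
static int asciidtype_setitem(
    ASCIIDTypeObject *descr,  /* the array's dtype instance */
    PyObject *obj,            /* the value being assigned */
    char *dataptr,            /* pointer to the element's storage */
    npy_intp index)           /* new: the element's position, letting the
                                 dtype look up the element's size in its
                                 own metadata */
```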
* ### Mutating array elements might require allocating new storage for the array data.
* If someone does a `__setitem__` call from Python using a string bigger than the item already stored there, we'll need to reallocate the entire array buffer.
* How do we communicate that to NumPy?
* A single memory buffer is a poor fit for a mutable variable-length string container, but the assumption that an array's data lives in a single heap buffer is baked pretty deeply into NumPy; the sketch below illustrates the problem.
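* A minimal sketch of grow-on-write under the planned layout, a single `data` buffer plus an `offsets` table; `strbuf` and everything in it are illustrative names, not NumPy API. The `realloc` is exactly the step NumPy currently has no way to be told about:
```C
#include <stdlib.h>
#include <string.h>
#include <stddef.h>

typedef struct {
    char *data;          /* one heap buffer holding all string data */
    ptrdiff_t *offsets;  /* n + 1 entries; element i occupies
                            offsets[i]..offsets[i+1] */
    ptrdiff_t n;         /* number of elements */
} strbuf;

/* Replace element idx with the len bytes at s. */
static int
strbuf_setitem(strbuf *b, ptrdiff_t idx, const char *s, ptrdiff_t len)
{
    ptrdiff_t old_len = b->offsets[idx + 1] - b->offsets[idx];
    ptrdiff_t delta = len - old_len;
    ptrdiff_t total = b->offsets[b->n];

    if (delta > 0) {
        /* The new value is bigger, so the whole buffer must grow:
         * this invalidates every pointer held into the old data. */
        char *new_data = realloc(b->data, total + delta);
        if (new_data == NULL) {
            return -1;
        }
        b->data = new_data;
    }
    /* Shift everything after the element and patch later offsets. */
    memmove(b->data + b->offsets[idx + 1] + delta,
            b->data + b->offsets[idx + 1],
            total - b->offsets[idx + 1]);
    memcpy(b->data + b->offsets[idx], s, len);
    for (ptrdiff_t i = idx + 1; i <= b->n; i++) {
        b->offsets[i] += delta;
    }
    return 0;
}
```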
* ### No mechanism for per-array data storage.
* The current plan is to store an `offsets` array we can use to look up where each element starts in the array data buffer. There currently isn't a facility for per-array data storage.
* It might be possible to store the `offsets` on the dtype instance itself, but we'd need to be very careful to make sure a new dtype instance is created every time a new view is created. I don't know if this will turn out to be a leaky abstraction.
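* A sketch of that layout, following the experimental dtype API's pattern of embedding `PyArray_Descr` as the first member; the `offsets` and `n` fields are our additions, not existing API:
```C
#include <Python.h>
#include <numpy/arrayobject.h>

typedef struct {
    PyArray_Descr base;  /* standard descriptor header */
    npy_intp *offsets;   /* where each element starts in the data buffer */
    npy_intp n;          /* number of elements this instance describes */
} StringDTypeObject;
```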
* ### Identify a Unicode string library we can depend on
* We need a compatibly licensed Unicode string library.
* Possible options:
* GNU Libiconv (LGPL)
* Converts between UTF-8 and the Unicode flavors NumPy and Python use.
* Only useful for converting encodings; doesn't provide e.g. `len()`, uppercasing, or other functionality we may need.
* libicu (ICU license, BSD-like, need to ship license)
* Very full-featured, but also heavyweight; it can't be vendored, so it would add an external C library dependency.
* Python's unicode API
* Requires the GIL in ufunc and casting loops.
* UTF8-CPP (Boost license, BSD-like)
* Small and lightweight, with support for encoding/decoding between UTF-16/32 and UTF-8.
* C++, not C
* utf8proc (MIT)
* Small and lightweight; we can just vendor it. A usage sketch follows this list.
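* For a sense of how much utf8proc gives us, a codepoint-counting `len()` is just a loop over `utf8proc_iterate` (a sketch assuming UTF-8 storage; `utf8_len` is our name):
```C
#include "utf8proc.h"

/* Count the codepoints in a UTF-8 buffer; returns a negative
 * utf8proc error code on invalid input. */
static utf8proc_ssize_t
utf8_len(const utf8proc_uint8_t *buf, utf8proc_ssize_t nbytes)
{
    utf8proc_ssize_t count = 0, pos = 0;
    utf8proc_int32_t codepoint;
    while (pos < nbytes) {
        utf8proc_ssize_t n =
            utf8proc_iterate(buf + pos, nbytes - pos, &codepoint);
        if (n < 0) {
            return n;
        }
        pos += n;
        count++;
    }
    return count;
}
```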
* ### Hooking into np.char / string ufuncs
* `np.char` defers to the scalar to implement functionality.
* I think we can get `np.char` working just by making the scalar we define for the string dtype a subclass of `str`, as sketched below.
* This will still require some code changes in NumPy to support.
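* A minimal sketch of such a scalar in C; `StringScalar_Type` is a hypothetical name. Setting `tp_base` to `PyUnicode_Type` before `PyType_Ready` makes every instance a real `str` subclass:
```C
#include <Python.h>

/* Hypothetical scalar type for the string dtype; instances are str
 * subclasses, so Python-level string methods work unchanged. */
static PyTypeObject StringScalar_Type = {
    PyVarObject_HEAD_INIT(NULL, 0)
    .tp_name = "stringdtype.StringScalar",
    .tp_flags = Py_TPFLAGS_DEFAULT | Py_TPFLAGS_BASETYPE,
};

static int
init_string_scalar(void)
{
    /* tp_base must be set before PyType_Ready; the size and
     * behavior are inherited from str. */
    StringScalar_Type.tp_base = &PyUnicode_Type;
    return PyType_Ready(&StringScalar_Type);
}
```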