The NEP was merged in draft form, see below.
https://numpy.org/neps/nep-0055-string_dtype.html
On Mon, Aug 21, 2023 at 2:36 PM Nathan wrote:
> Hello all,
>
> I just opened a pull request to add NEP 55, see
> https://github.com/numpy/numpy/pull/24483.
>
> Per NEP 0, I've copied everything up to the "detailed description" section
> below.
>
> I'm looking forward to your feedback on this.
>
> -Nathan Goldbaum
>
> =
> NEP 55 — Add a UTF-8 Variable-Width String DType to NumPy
> =
>
> :Author: Nathan Goldbaum
> :Status: Draft
> :Type: Standards Track
> :Created: 2023-06-29
>
>
> Abstract
>
>
> We propose adding a new string data type to NumPy where each item in the
> array
> is an arbitrary length UTF-8 encoded string. This will enable performance,
> memory usage, and usability improvements for NumPy users, including:
>
> * Memory savings for workflows that currently use fixed-width strings and
> store
> primarily ASCII data or a mix of short and long strings in a single NumPy
> array.
>
> * Downstream libraries and users will be able to move away from object
> arrays
> currently used as a substitute for variable-length string arrays, unlocking
> performance improvements by avoiding passes over the data outside of NumPy.
>
> * A more intuitive user-facing API for working with arrays of Python
> strings,
> without a need to think about the in-memory array representation.
>
> Motivation and Scope
>
>
> First, we will describe how the current state of support for string or
> string-like data in NumPy arose. Next, we will summarize the last major
> previous
> discussion about this topic. Finally, we will describe the scope of the
> proposed
> changes to NumPy as well as changes that are explicitly out of scope of
> this
> proposal.
>
> History of String Support in Numpy
> **
>
> Support in NumPy for textual data evolved organically in response to early
> user
> needs and then changes in the Python ecosystem.
>
> Support for strings was added to numpy to support users of the NumArray
> ``chararray`` type. Remnants of this are still visible in the NumPy API:
> string-related functionality lives in ``np.char``, to support the obsolete
> ``np.char.chararray`` class, deprecated since NumPy 1.4 in favor of string
> DTypes.
>
> NumPy's ``bytes_`` DType was originally used to represent the Python 2 ``
> str``
> type before Python 3 support was added to NumPy. The bytes DType makes the
> most
> sense when it is used to represent Python 2 strings or other
> null-terminated
> byte sequences. However, ignoring data after the first null character
> means the
> ``bytes_`` DType is only suitable for bytestreams that do not contain
> nulls, so
> it is a poor match for generic bytestreams.
>
> The ``unicode`` DType was added to support the Python 2 ``unicode`` type.
> It
> stores data in 32-bit UCS-4 codepoints (e.g. a UTF-32 encoding), which
> makes for
> a straightforward implementation, but is inefficient for storing text that
> can
> be represented well using a one-byte ASCII or Latin-1 encoding. This was
> not a
> problem in Python 2, where ASCII or mostly-ASCII text could use the Python
> 2
> ``str`` DType (the current ``bytes_`` DType).
>
> With the arrival of Python 3 support in NumPy, the string DTypes were
> largely
> left alone due to backward compatibility concerns, although the unicode
> DType
> became the default DType for ``str`` data and the old ``string`` DType was
> renamed the ``bytes_`` DType. This change left NumPy with the sub-optimal
> situation of shipping a data type originally intended for null-terminated
> bytestrings as the data type for *all* python ``bytes`` data, and a
> default
> string type with an in-memory representation that consumes four times as
> much
> memory as needed for ASCII or mostly-ASCII data.
>
> Problems with Fixed-Width Strings
> *
>
> Both existing string DTypes represent fixed-width sequences, allowing
> storage of
> the string data in the array buffer. This avoids adding out-of-band
> storage to
> NumPy, however, it makes for an awkward user interface. In particular, the
> maximum string size must be inferred by NumPy or estimated by the user
> before
> loading the data into a NumPy array or selecting an output DType for string
> operations. In the worst case, this requires an expensive pass over the
> full
> dataset to calculate the maximum length of an array element. It also wastes
> memory when array elements have varying lengths. Pathological cases where
> an
> array stores many short strings and a few very long strings are
> particularly bad
> for wasting memory.
>
> Downstream usage of string data in NumPy arrays has proven out the need
> for a
> variable-width string data type. In practice, most downstream users employ
> ``object`` arrays for this purpose. In particu