[Numpy-discussion] welcome Andrew Nelson to the NumPy maintainers team

2023-08-21 Thread Ralf Gommers
Hi all,

On behalf of the steering council, I am very happy to announce that Andrew
is joining the Maintainers team. Andrew has been contributing to our CI
setup in particular for the past year, and has contributed for example the
Cirrus CI setup and the musllinux builds:
https://github.com/numpy/numpy/pulls/andyfaff.

Welcome Andrew, I'm looking forward to working with you more!

Cheers,
Ralf
___
NumPy-Discussion mailing list -- numpy-discussion@python.org
To unsubscribe send an email to numpy-discussion-le...@python.org
https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
Member address: arch...@mail-archive.com


[Numpy-discussion] NEP 55 - Add a UTF-8 Variable-Width String DType to NumPy

2023-08-21 Thread Nathan
Hello all,

I just opened a pull request to add NEP 55, see
https://github.com/numpy/numpy/pull/24483.

Per NEP 0, I've copied everything up to the "detailed description" section
below.

I'm looking forward to your feedback on this.

-Nathan Goldbaum

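=========================================================
NEP 55 — Add a UTF-8 Variable-Width String DType to NumPy
=========================================================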

:Author: Nathan Goldbaum 
:Status: Draft
:Type: Standards Track
:Created: 2023-06-29


Abstract
--------

We propose adding a new string data type to NumPy where each item in the array
is an arbitrary length UTF-8 encoded string. This will enable performance,
memory usage, and usability improvements for NumPy users, including:

* Memory savings for workflows that currently use fixed-width strings and store
  primarily ASCII data or a mix of short and long strings in a single NumPy
  array.

* Downstream libraries and users will be able to move away from object arrays
  currently used as a substitute for variable-length string arrays, unlocking
  performance improvements by avoiding passes over the data outside of NumPy.

* A more intuitive user-facing API for working with arrays of Python strings,
  without a need to think about the in-memory array representation.

Motivation and Scope
--------------------

First, we will describe how the current state of support for string or
string-like data in NumPy arose. Next, we will summarize the last major previous
discussion about this topic. Finally, we will describe the scope of the proposed
changes to NumPy as well as changes that are explicitly out of scope of this
proposal.

History of String Support in NumPy
**********************************

Support in NumPy for textual data evolved organically in response to early user
needs and then changes in the Python ecosystem.

Support for strings was added to NumPy to support users of the NumArray
``chararray`` type. Remnants of this are still visible in the NumPy API:
string-related functionality lives in ``np.char``, to support the obsolete
``np.char.chararray`` class, deprecated since NumPy 1.4 in favor of string
DTypes.

NumPy's ``bytes_`` DType was originally used to represent the Python 2 ``str``
type before Python 3 support was added to NumPy. The bytes DType makes the most
sense when it is used to represent Python 2 strings or other null-terminated
byte sequences. However, ignoring data after the first null character means the
``bytes_`` DType is only suitable for bytestreams that do not contain nulls, so
it is a poor match for generic bytestreams.
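The null-handling issue can be seen in a short sketch. This is only an illustration, not part of the NEP text; the array contents and variable names are made up, and it demonstrates just the trailing-null case, where NumPy cannot tell padding from data.

```python
import numpy as np

# Fixed-width bytes array: every element occupies exactly 4 bytes.
arr = np.array([b"ab\x00\x00", b"abcd"], dtype="S4")

# Short values are padded with nulls in the buffer, so on read NumPy
# cannot distinguish padding from real trailing null bytes and strips both.
first = arr[0]

print(arr.dtype.itemsize)  # bytes reserved per element
print(first, len(first))   # the two trailing nulls do not round-trip
```

A four-byte value goes in and a two-byte value comes out, which is why the DType is a poor fit for binary data that may legitimately end in null bytes.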

The ``unicode`` DType was added to support the Python 2 ``unicode`` type. It
stores data in 32-bit UCS-4 codepoints (i.e. a UTF-32 encoding), which makes for
a straightforward implementation, but is inefficient for storing text that can
be represented well using a one-byte ASCII or Latin-1 encoding. This was not a
problem in Python 2, where ASCII or mostly-ASCII text could use the Python 2
``str`` DType (the current ``bytes_`` DType).
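As a rough illustration of the storage cost (not part of the NEP text; the sample words and variable names are invented for this sketch), the same ASCII data can be stored with both existing DTypes and the per-element sizes compared:

```python
import numpy as np

text = ["spam", "eggs", "ham"]

# Default for Python str data: fixed-width unicode, 4 bytes per character.
as_unicode = np.array(text)

# Same data encoded to ASCII and stored with the bytes_ DType: 1 byte per
# character.
as_bytes = np.array([s.encode("ascii") for s in text])

print(as_unicode.dtype.kind, as_unicode.itemsize, as_unicode.nbytes)
print(as_bytes.dtype.kind, as_bytes.itemsize, as_bytes.nbytes)
```

For text that fits in ASCII, the unicode array uses four times the memory of the bytes array.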

With the arrival of Python 3 support in NumPy, the string DTypes were largely
left alone due to backward compatibility concerns, although the unicode DType
became the default DType for ``str`` data and the old ``string`` DType was
renamed the ``bytes_`` DType. This change left NumPy with the sub-optimal
situation of shipping a data type originally intended for null-terminated
bytestrings as the data type for *all* Python ``bytes`` data, and a default
string type with an in-memory representation that consumes four times as much
memory as needed for ASCII or mostly-ASCII data.

Problems with Fixed-Width Strings
*********************************

Both existing string DTypes represent fixed-width sequences, allowing storage of
the string data in the array buffer. This avoids adding out-of-band storage to
NumPy; however, it makes for an awkward user interface. In particular, the
maximum string size must be inferred by NumPy or estimated by the user before
loading the data into a NumPy array or selecting an output DType for string
operations. In the worst case, this requires an expensive pass over the full
dataset to calculate the maximum length of an array element. It also wastes
memory when array elements have varying lengths. Pathological cases where an
array stores many short strings and a few very long strings are particularly bad
for wasting memory.
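A small sketch of the pathological case, using made-up data (the names ``words``, ``arr``, and ``small`` are illustrative only), shows both the memory overhead and the silent truncation that follows from a fixed element width:

```python
import numpy as np

# 999 one-character strings plus a single 1000-character string.
words = ["a"] * 999 + ["x" * 1000]
arr = np.array(words)

# NumPy scans the input and sizes every element for the longest string.
payload_chars = sum(len(w) for w in words)
print(arr.dtype.itemsize)  # bytes reserved for each of the 1000 elements
print(arr.nbytes)          # total buffer size in bytes
print(payload_chars)       # characters actually stored

# The width is fixed once the array exists, so longer values are
# truncated on assignment without any error.
small = np.array(["a", "bb"])
small[0] = "longer"
print(small[0])
```

Here the buffer reserves room for a million characters to hold fewer than two thousand, and assigning ``"longer"`` into a two-character-wide array keeps only the first two characters.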

Downstream usage of string data in NumPy arrays has proven out the need for a
variable-width string data type. In practice, most downstream users employ
``object`` arrays for this purpose. In particular, ``pandas`` has explicitly
deprecated support for NumPy fixed-width strings, coerces NumPy fixed-width
string arrays to ``object`` arrays, and in the future may switch to only
supporting string data via ``PyArrow``, which has native support for UTF-8
encoded variable-width string arrays [1]_. This is unfortunate, since ``object``
arrays have no type guarantees, necessitating
[Numpy-discussion] Re: welcome Andrew Nelson to the NumPy maintainers team

2023-08-21 Thread Andrew Nelson
On Mon, 21 Aug 2023 at 18:39, Ralf Gommers wrote:

> Hi all,
>
> On behalf of the steering council, I am very happy to announce that Andrew
> is joining the Maintainers team. Andrew has been contributing to our CI
> setup in particular for the past year, and has contributed for example the
> Cirrus CI setup and the musllinux builds:
> https://github.com/numpy/numpy/pulls/andyfaff.
>
> Welcome Andrew, I'm looking forward to working with you more!
>

Thanks Ralf, it's good to join the team.


[Numpy-discussion] Re: welcome Andrew Nelson to the NumPy maintainers team

2023-08-21 Thread Stefan van der Walt
On Mon, Aug 21, 2023, at 01:34, Ralf Gommers wrote:
> On behalf of the steering council, I am very happy to announce that Andrew is 
> joining the Maintainers team.

Andrew, a warm welcome to the team! Thank you for all your work so far, and for 
continuing your involvement with the project. 

Best regards, 
Stéfan 