[Numpy-discussion] Re: NEP 55 - Add a UTF-8 Variable-Width String DType to NumPy

2023-08-29 Thread Nathan
The NEP was merged in draft form, see below.

https://numpy.org/neps/nep-0055-string_dtype.html

On Mon, Aug 21, 2023 at 2:36 PM Nathan wrote:

> Hello all,
>
> I just opened a pull request to add NEP 55, see
> https://github.com/numpy/numpy/pull/24483.
>
> Per NEP 0, I've copied everything up to the "detailed description" section
> below.
>
> I'm looking forward to your feedback on this.
>
> -Nathan Goldbaum
>
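> =========================================================
> NEP 55 — Add a UTF-8 Variable-Width String DType to NumPy
> =========================================================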
>
> :Author: Nathan Goldbaum 
> :Status: Draft
> :Type: Standards Track
> :Created: 2023-06-29
>
>
> Abstract
> --------
>
> We propose adding a new string data type to NumPy where each item in the
> array is an arbitrary length UTF-8 encoded string. This will enable
> performance, memory usage, and usability improvements for NumPy users,
> including:
>
> * Memory savings for workflows that currently use fixed-width strings and
>   store primarily ASCII data or a mix of short and long strings in a single
>   NumPy array.
>
> * Downstream libraries and users will be able to move away from object
>   arrays currently used as a substitute for variable-length string arrays,
>   unlocking performance improvements by avoiding passes over the data
>   outside of NumPy.
>
> * A more intuitive user-facing API for working with arrays of Python
>   strings, without a need to think about the in-memory array representation.
>
> Motivation and Scope
> --------------------
>
> First, we will describe how the current state of support for string or
> string-like data in NumPy arose. Next, we will summarize the most recent
> major discussion about this topic. Finally, we will describe the scope of
> the proposed changes to NumPy as well as changes that are explicitly out of
> scope of this proposal.
>
> History of String Support in NumPy
> **********************************
>
> Support in NumPy for textual data evolved organically in response to early
> user needs and then changes in the Python ecosystem.
>
> Support for strings was added to NumPy to support users of the NumArray
> ``chararray`` type. Remnants of this are still visible in the NumPy API:
> string-related functionality lives in ``np.char``, to support the obsolete
> ``np.char.chararray`` class, deprecated since NumPy 1.4 in favor of string
> DTypes.
>
> NumPy's ``bytes_`` DType was originally used to represent the Python 2
> ``str`` type before Python 3 support was added to NumPy. The bytes DType
> makes the most sense when it is used to represent Python 2 strings or other
> null-terminated byte sequences. However, ignoring data after the first null
> character means the ``bytes_`` DType is only suitable for bytestreams that
> do not contain nulls, so it is a poor match for generic bytestreams.
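
A minimal sketch of the null-handling problem described above, assuming a recent NumPy release. It demonstrates only the part of the behavior I am certain of: trailing null bytes are treated as padding and dropped when an element is read back, so a bytestring that legitimately ends in nulls does not round-trip. The variable names are illustrative only.

```python
import numpy as np

payload = b"abc\x00\x00"           # 5 bytes; the last two are nulls
arr = np.array([payload])          # stored with a fixed-width bytes_ DType

stored_width = arr.dtype.itemsize  # bytes reserved per element in the buffer
round_tripped = arr[0]             # the element as handed back to Python

print(arr.dtype, stored_width)
print(round_tripped == payload)    # the trailing nulls did not survive
```

The buffer still reserves all five bytes, but the value returned to Python no longer matches the input.
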
>
> The ``unicode`` DType was added to support the Python 2 ``unicode`` type.
> It stores data in 32-bit UCS-4 codepoints (i.e. a UTF-32 encoding), which
> makes for a straightforward implementation, but is inefficient for storing
> text that can be represented well using a one-byte ASCII or Latin-1
> encoding. This was not a problem in Python 2, where ASCII or mostly-ASCII
> text could use the Python 2 ``str`` DType (the current ``bytes_`` DType).
>
> With the arrival of Python 3 support in NumPy, the string DTypes were
> largely left alone due to backward compatibility concerns, although the
> unicode DType became the default DType for ``str`` data and the old
> ``string`` DType was renamed the ``bytes_`` DType. This change left NumPy
> with the sub-optimal situation of shipping a data type originally intended
> for null-terminated bytestrings as the data type for *all* Python ``bytes``
> data, and a default string type with an in-memory representation that
> consumes four times as much memory as needed for ASCII or mostly-ASCII data.
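
To make the "four times as much memory" figure concrete, here is a small sketch comparing the per-element width of the default ``str`` DType with the ``bytes_`` DType for the same ASCII words. The word list is made up for illustration.

```python
import numpy as np

words = ["numpy", "array", "dtype"]   # pure ASCII, 5 characters each

as_unicode = np.array(words)          # default str DType, UCS-4 storage
as_bytes = np.array([w.encode("ascii") for w in words])  # bytes_ DType

unicode_itemsize = as_unicode.dtype.itemsize  # 4 bytes per character
bytes_itemsize = as_bytes.dtype.itemsize      # 1 byte per character

print(as_unicode.dtype, unicode_itemsize)
print(as_bytes.dtype, bytes_itemsize)
```
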
>
> Problems with Fixed-Width Strings
> *********************************
>
> Both existing string DTypes represent fixed-width sequences, allowing
> storage of the string data in the array buffer. This avoids adding
> out-of-band storage to NumPy; however, it makes for an awkward user
> interface. In particular, the maximum string size must be inferred by NumPy
> or estimated by the user before loading the data into a NumPy array or
> selecting an output DType for string operations. In the worst case, this
> requires an expensive pass over the full dataset to calculate the maximum
> length of an array element. It also wastes memory when array elements have
> varying lengths. Pathological cases where an array stores many short
> strings and a few very long strings are particularly bad for wasting memory.
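
A rough sketch of the pathological case described above, using made-up data: 999 one-character strings plus a single 1000-character string. It compares the size of the array buffer with the number of bytes the text itself would need in UTF-8.

```python
import numpy as np

strings = ["a"] * 999 + ["x" * 1000]  # many short strings, one long one
arr = np.array(strings)               # every element is padded to 1000 chars

buffer_bytes = arr.nbytes             # size of the fixed-width array buffer
payload_bytes = sum(len(s.encode("utf-8")) for s in strings)  # actual text

print(arr.dtype)
print(buffer_bytes, payload_bytes)
```

Here NumPy has to scan the input to find the longest element, and the single long string sets the width of every other element.
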
>
> Downstream usage of string data in NumPy arrays has proven out the need for
> a variable-width string data type. In practice, most downstream users employ
> ``object`` arrays for this purpose. In particu

[Numpy-discussion] Re: Suggestions for changes to the polynomial module

2023-08-29 Thread jsdodge
Hi Pieter,

Thanks for pointing this PR out. That certainly fixes the immediate problem 
with the inconsistent print statements that I highlighted in my original 
message.

It doesn't address the more fundamental problem, though, which is that the 
default behavior is to represent the polynomial in this rescaled form, which 
unnecessarily privileges numerical accuracy over ease of use and consistency 
with standard usage. I realize that it has been this way for a while, but 
multiple GitHub issues indicate that it causes confusion, which suggests that 
the issue should be addressed more meaningfully. (As an aside, the whole module 
uses a nonstandard definition of weights that also causes confusion.) I would 
expect the confusion to be compounded if Polynomial.fit (with its cousins) 
adopts the option to return the covariance matrix (which I recommend), since 
this will also depend on the scaling.

I think it's great to provide the *option* to scale the domain, especially for 
things like Chebyshev polynomials, where the domain typically needs rescaling, 
anyway. But a user who wants to fit data with y = a[0] + a[1] * x + a[2] * x**2 
should, IMO, get back the best-fit coefficients for the equation as originally 
formulated by default, not in the form that is most convenient for the 
numerical analyst.
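
For readers who have not run into this, here is a minimal sketch of the behavior under discussion, using synthetic noise-free data with a = [1, 2, 3]. It shows the default rescaled coefficients next to two ways of getting coefficients for the original equation with the existing API: calling ``convert()`` on the result, or passing ``domain=[]`` so the class default domain is used and no rescaling happens. The data and variable names are my own.

```python
import numpy as np
from numpy.polynomial import Polynomial

x = np.linspace(0.0, 10.0, 50)
y = 1.0 + 2.0 * x + 3.0 * x**2       # a = [1, 2, 3] in the original variable

p = Polynomial.fit(x, y, deg=2)      # default: x is mapped onto [-1, 1]
scaled_coef = p.coef                 # coefficients in the rescaled variable

unscaled_coef = p.convert().coef     # map back to the original variable

# domain=[] uses the class domain, so the mapping is the identity
identity_coef = Polynomial.fit(x, y, deg=2, domain=[]).coef

print(scaled_coef)
print(unscaled_coef)
print(identity_coef)
```

The first result is what a new user sees by default; the other two match the equation as written.
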

Cheers,
Steve