[Numpy-discussion] Re: NEP 55 - Add a UTF-8 Variable-Width String DType to NumPy
Hi,

I know that I'm a little late to be asking about this, but I don't see a comment elsewhere on it (in the NEP, the implementation PR #25347, or this email thread).

As I understand it, the new StringDType implementation distinguishes 3 types of individual strings, any of which can be present in an array:

   1. short strings, included inline in the array (at most 15 bytes on a 64-bit system)
   2. arena-allocated strings, which are managed by the npy_string_allocator
   3. heap-allocated strings, which are pointers anywhere in RAM.

Does case 3 include strings that are passed to the array as views, without copying? If so, then the ownership of strings would either need to be tracked on a per-string basis (distinct from the array_owned boolean, which characterizes the whole array), or they need to all be considered stolen references (NumPy will free all of them when the array goes out of scope), or they all need to be considered borrowed references (NumPy will not free any of them when the array goes out of scope).

If the array does not accept new strings as views, but always copies any externally provided string, then why distinguish between cases 2 and 3? How would an array end up with some strings being arena-allocated and other strings being heap-allocated?

Thanks!
-- Jim

On Wed, Sep 20, 2023 at 10:25 AM Nathan wrote:
> On Wed, Sep 20, 2023 at 4:40 AM Kevin Sheppard wrote:
>> On Wed, Sep 20, 2023 at 11:23 AM Ralf Gommers wrote:
>>> On Wed, Sep 20, 2023 at 8:26 AM Warren Weckesser <warren.weckes...@gmail.com> wrote:
>>>> On Fri, Sep 15, 2023 at 3:18 PM Warren Weckesser <warren.weckes...@gmail.com> wrote:
>>>>> On Mon, Sep 11, 2023 at 12:25 PM Nathan wrote:
>>>>>> On Sun, Sep 3, 2023 at 10:54 AM Warren Weckesser <warren.weckes...@gmail.com> wrote:
>>>>>>> On Tue, Aug 29, 2023 at 10:09 AM Nathan wrote:
>>>>>>>> The NEP was merged in draft form, see below.
>>>>>>>>
>>>>>>>> https://numpy.org/neps/nep-0055-string_dtype.html
>>>>>>>>
>>>>>>>> On Mon, Aug 21, 2023 at 2:36 PM Nathan wrote:
>>>>>>>>> Hello all,
>>>>>>>>>
>>>>>>>>> I just opened a pull request to add NEP 55, see https://github.com/numpy/numpy/pull/24483.
>>>>>>>>>
>>>>>>>>> Per NEP 0, I've copied everything up to the "detailed description" section below.
>>>>>>>>>
>>>>>>>>> I'm looking forward to your feedback on this.
>>>>>>>>>
>>>>>>>>> -Nathan Goldbaum
>>>>>>>
>>>>>>> This will be a nice addition to NumPy, and matches a suggestion by @rkern (and probably others) made in the 2017 mailing list thread; see the last bullet of
>>>>>>>
>>>>>>> https://mail.python.org/pipermail/numpy-discussion/2017-April/076681.html
>>>>>>>
>>>>>>> So +1 for the enhancement!
>>>>>>>
>>>>>>> Now for some nitty-gritty review...
>>>>>>
>>>>>> Thanks for the nitty-gritty review! I was on vacation last week and haven't had a chance to look over this in detail yet, but at first glance this seems like a really nice improvement.
>>>>>>
>>>>>> I'm going to try to integrate your proposed design into the dtype prototype this week. If that works, I'd like to include some of the text from the README in your repo in the NEP and add you as an author, would that be alright?
>>>>>
>>>>> Sure, that would be fine.
>>>>>
>>>>> I have a few more comments and questions about the NEP that I'll finish up and send this weekend.
>>>>
>>>> One more comment on the NEP...
>>>>
>>>> My first impression of the missing data API design is that it is more complicated than necessary. An alternative that is simpler--and is consistent with the pattern established for floats and datetimes--is to define a "not a string" value, say `np.nastring` or something similar, just like we have `nan` for floats and `nat` for datetimes. Its behavior could be what you called "nan-like".
>>>
>>> Float `np.nan` and datetime missing value sentinel are not all that similar, and the latter was always a bit questionable (at least partially it's a left-over of trying to introduce generic missing value support I believe). `nan` is a float and part of C/C++ standards with well-defined numerical behavior. In contrast, there is no `np.nat`; you can retrieve a sentinel value with `np.datetime64("NaT")` only. I'm not sure if it's possible to generate a NaT value with a regular operation on a datetime array a la `np.array([1.5]) / 0.0`.
>>>
>>>> The handling of `np.nastring` would be an intrinsic part of the dtype, so there would be no need for the `na_object` parameter of `StringDType`. All `StringDType`s would handle `np.nastring` in the same consistent manner. The use-case for the string sentinel does not see
[Numpy-discussion] Re: NEP 55 - Add a UTF-8 Variable-Width String DType to NumPy
On Mon, Feb 12, 2024 at 1:47 PM Jim Pivarski wrote:

> Hi,
>
> I know that I'm a little late to be asking about this, but I don't see a
> comment elsewhere on it (in the NEP, the implementation PR #25347, or this
> email thread).
>
> As I understand it, the new StringDType implementation distinguishes 3
> types of individual strings, any of which can be present in an array:
>
>    1. short strings, included inline in the array (at most 15 bytes on a
>    64-bit system)
>    2. arena-allocated strings, which are managed by the
>    npy_string_allocator
>    3. heap-allocated strings, which are pointers anywhere in RAM.
>
> Does case 3 include strings that are passed to the array as views, without
> copying? If so, then the ownership of strings would either need to be
> tracked on a per-string basis (distinct from the array_owned boolean,
> which characterizes the whole array), or they need to all be considered
> stolen references (NumPy will free all of them when the array goes out of
> scope), or they all need to be considered borrowed references (NumPy will
> not free any of them when the array goes out of scope).

StringDType arrays don’t intern Python strings directly; there’s always a copy. Array views are allowed, but I don’t think that’s what you’re talking about. The mutex guarding access to the string data prevents arrays from being garbage collected while a C thread holds a pointer to the string data, at least assuming correct usage of the C API that doesn’t try to use a string after releasing the allocator.

> If the array does not accept new strings as views, but always copies any
> externally provided string, then why distinguish between cases 2 and 3? How
> would an array end up with some strings being arena-allocated and other
> strings being heap-allocated?

You can only get a heap string entry in an array if you enlarge an entry in the array.
The goal with allowing heap strings like this was to have an escape hatch that allows enlarging a single array entry without adding complexity or needing to re-allocate the entire arena buffer. For example, suppose you create an array with a short string entry and then edit that entry to be longer than 15 bytes: rather than appending to the arena or re-allocating it, we convert the entry to a heap string.

> Thanks!
> -- Jim
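The three storage classes discussed in this thread (short inline strings, arena strings, and heap strings created only when an entry is enlarged) can be illustrated with a toy model. This is a sketch in pure Python and not NumPy's implementation: `ToyStringArray`, `SHORT_MAX`, and the "short"/"arena"/"heap" tags are hypothetical names, and reusing an arena slot when the new value still fits is an assumption that the thread does not state.

```python
SHORT_MAX = 15  # inline limit in bytes on a 64-bit system, as quoted in the thread


class ToyStringArray:
    """Toy model of per-entry storage classes; not NumPy's actual layout."""

    def __init__(self, strings):
        self.arena = bytearray()  # one shared buffer for long initial strings
        self.entries = []         # list of (kind, payload) pairs
        for s in strings:
            data = s.encode("utf-8")  # every input string is copied
            if len(data) <= SHORT_MAX:
                self.entries.append(("short", data))
            else:
                offset = len(self.arena)
                self.arena += data
                self.entries.append(("arena", (offset, len(data))))

    def kind(self, i):
        return self.entries[i][0]

    def get(self, i):
        kind, payload = self.entries[i]
        if kind == "arena":
            offset, size = payload
            return bytes(self.arena[offset:offset + size]).decode("utf-8")
        return payload.decode("utf-8")

    def set(self, i, s):
        data = s.encode("utf-8")
        kind, payload = self.entries[i]
        if len(data) <= SHORT_MAX:
            self.entries[i] = ("short", data)
        elif kind == "arena" and len(data) <= payload[1]:
            # Assumption: a value that fits in the existing arena slot reuses it.
            offset = payload[0]
            self.arena[offset:offset + len(data)] = data
            self.entries[i] = ("arena", (offset, len(data)))
        else:
            # Enlarged entry: the arena is neither appended to nor re-allocated.
            self.entries[i] = ("heap", data)


arr = ToyStringArray(["hi", "a string longer than fifteen bytes"])
print([arr.kind(i) for i in range(2)])
arr.set(0, "now much longer than fifteen bytes")
print(arr.kind(0))
```

In this model an array only acquires a heap entry through `set`, which mirrors the explanation that heap strings are an escape hatch for enlarging a single entry; ownership stays simple because no entry ever refers to caller-owned memory.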
[Numpy-discussion] Re: NEP 55 - Add a UTF-8 Variable-Width String DType to NumPy
I see: thank you for the explanations!

On Mon, Feb 12, 2024 at 3:04 PM Nathan wrote:

> StringDType arrays don’t intern Python strings directly; there’s always a
> copy.

I had been thinking of accepting a memoryview without copying, but if there's always a copy in any case, that answers my question about ownership.

-- Jim