[Numpy-discussion] Re: NEP 55 - Add a UTF-8 Variable-Width String DType to NumPy

2024-02-12 Thread Jim Pivarski
Hi,

I know that I'm a little late to be asking about this, but I don't see a
comment elsewhere on it (in the NEP, the implementation PR #25347, or this
email thread).

As I understand it, the new StringDType implementation distinguishes 3
types of individual strings, any of which can be present in an array:

   1. short strings, included inline in the array (at most 15 bytes on a
   64-bit system)
   2. arena-allocated strings, which are managed by the npy_string_allocator
   3. heap-allocated strings, which are pointers anywhere in RAM.

Does case 3 include strings that are passed to the array as views, without
copying? If so, then the ownership of strings would either need to be
tracked on a per-string basis (distinct from the array_owned boolean, which
characterizes the whole array), or they need to all be considered stolen
references (NumPy will free all of them when the array goes out of scope),
or they all need to be considered borrowed references (NumPy will not free
any of them when the array goes out of scope).

If the array does not accept new strings as views, but always copies any
externally provided string, then why distinguish between cases 2 and 3? How
would an array end up with some strings being arena-allocated and other
strings being heap-allocated?

Thanks!
-- Jim




On Wed, Sep 20, 2023 at 10:25 AM Nathan  wrote:

>
>
> On Wed, Sep 20, 2023 at 4:40 AM Kevin Sheppard 
> wrote:
>
>>
>>
>> On Wed, Sep 20, 2023 at 11:23 AM Ralf Gommers 
>> wrote:
>>
>>>
>>>
>>> On Wed, Sep 20, 2023 at 8:26 AM Warren Weckesser <
>>> warren.weckes...@gmail.com> wrote:
>>>


>>>> On Fri, Sep 15, 2023 at 3:18 PM Warren Weckesser <
>>>> warren.weckes...@gmail.com> wrote:
>>>> >
>>>> > On Mon, Sep 11, 2023 at 12:25 PM Nathan wrote:
>>>> >>
>>>> >> On Sun, Sep 3, 2023 at 10:54 AM Warren Weckesser <
>>>> >> warren.weckes...@gmail.com> wrote:
>>>> >>>
>>>> >>> On Tue, Aug 29, 2023 at 10:09 AM Nathan wrote:
>>>> >>> >
>>>> >>> > The NEP was merged in draft form, see below.
>>>> >>> >
>>>> >>> > https://numpy.org/neps/nep-0055-string_dtype.html
>>>> >>> >
>>>> >>> > On Mon, Aug 21, 2023 at 2:36 PM Nathan wrote:
>>>> >>> >>
>>>> >>> >> Hello all,
>>>> >>> >>
>>>> >>> >> I just opened a pull request to add NEP 55, see
>>>> >>> >> https://github.com/numpy/numpy/pull/24483.
>>>> >>> >>
>>>> >>> >> Per NEP 0, I've copied everything up to the "detailed
>>>> >>> >> description" section below.
>>>> >>> >>
>>>> >>> >> I'm looking forward to your feedback on this.
>>>> >>> >>
>>>> >>> >> -Nathan Goldbaum
>>>> >>> >>
>>>> >>>
>>>> >>> This will be a nice addition to NumPy, and matches a suggestion by
>>>> >>> @rkern (and probably others) made in the 2017 mailing list thread;
>>>> >>> see the last bullet of
>>>> >>>
>>>> >>> https://mail.python.org/pipermail/numpy-discussion/2017-April/076681.html
>>>> >>>
>>>> >>> So +1 for the enhancement!
>>>> >>>
>>>> >>> Now for some nitty-gritty review...
>>>> >>
>>>> >> Thanks for the nitty-gritty review! I was on vacation last week and
>>>> >> haven't had a chance to look over this in detail yet, but at first glance
>>>> >> this seems like a really nice improvement.
>>>> >>
>>>> >> I'm going to try to integrate your proposed design into the dtype
>>>> >> prototype this week. If that works, I'd like to include some of the text
>>>> >> from the README in your repo in the NEP and add you as an author, would
>>>> >> that be alright?
>>>> >
>>>> > Sure, that would be fine.
>>>> >
>>>> > I have a few more comments and questions about the NEP that I'll
>>>> > finish up and send this weekend.
>>>> >
>>>>
>>>> One more comment on the NEP...
>>>>
>>>> My first impression of the missing data API design is that
>>>> it is more complicated than necessary. An alternative that
>>>> is simpler--and is consistent with the pattern established for
>>>> floats and datetimes--is to define a "not a string" value, say
>>>> `np.nastring` or something similar, just like we have `nan` for
>>>> floats and `nat` for datetimes. Its behavior could be what
>>>> you called "nan-like".
>>>>
>>>
>>> Float `np.nan` and datetime missing value sentinel are not all that
>>> similar, and the latter was always a bit questionable (at least partially
>>> it's a left-over of trying to introduce generic missing value support I
>>> believe). `nan` is a float and part of C/C++ standards with well-defined
>>> numerical behavior. In contrast, there is no `np.nat`; you can retrieve a
>>> sentinel value with `np.datetime64("NaT")` only. I'm not sure if it's
>>> possible to generate a NaT value with a regular operation on a datetime
>>> array a la `np.array([1.5]) / 0.0`.
>>>
>>>> The handling of `np.nastring` would be an intrinsic part of the
>>>> dtype, so there would be no need for the `na_object` parameter
>>>> of `StringDType`. All `StringDType`s would handle `np.nastring`
>>>> in the same consistent manner.
>>>>
>>>> The use-case for the string sentinel does not see

[Numpy-discussion] Re: NEP 55 - Add a UTF-8 Variable-Width String DType to NumPy

2024-02-12 Thread Nathan
On Mon, Feb 12, 2024 at 1:47 PM Jim Pivarski  wrote:

> Hi,
>
> I know that I'm a little late to be asking about this, but I don't see a
> comment elsewhere on it (in the NEP, the implementation PR #25347, or this
> email thread).
>
> As I understand it, the new StringDType implementation distinguishes 3
> types of individual strings, any of which can be present in an array:
>
>    1. short strings, included inline in the array (at most 15 bytes on a
>    64-bit system)
>    2. arena-allocated strings, which are managed by the
>    npy_string_allocator
>    3. heap-allocated strings, which are pointers anywhere in RAM.
>
> Does case 3 include strings that are passed to the array as views, without
> copying? If so, then the ownership of strings would either need to be
> tracked on a per-string basis (distinct from the array_owned boolean,
> which characterizes the whole array), or they need to all be considered
> stolen references (NumPy will free all of them when the array goes out of
> scope), or they all need to be considered borrowed references (NumPy will
> not free any of them when the array goes out of scope).
>

StringDType arrays don’t intern Python strings directly; there’s always a
copy. Array views are allowed, but I don’t think that’s what you’re talking
about. The mutex guarding access to the string data prevents arrays from
being garbage collected while a C thread holds a pointer to the string
data, at least assuming correct usage of the C API that doesn’t try to use
a string after releasing the allocator.


> If the array does not accept new strings as views, but always copies any
> externally provided string, then why distinguish between cases 2 and 3? How
> would an array end up with some strings being arena-allocated and other
> strings being heap-allocated?
>

You can only get a heap string entry in an array if you enlarge an entry in
the array. The goal with allowing heap strings like this was to have an
escape hatch that allows enlarging a single array entry without adding
complexity or needing to re-allocate the entire arena buffer.

For example, suppose you create an array with a short string entry and then
edit that entry to be longer than 15 bytes. Rather than appending to the
arena or re-allocating it, we convert the entry to a heap string.
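
To make the storage transitions concrete, here is a rough toy model in plain
Python. It is only a sketch: `ToyStringArray`, `SHORT_MAX`, and the
`(kind, payload)` entries are hypothetical names made up for illustration,
not NumPy's actual data structures or C API. It also assumes that a
replacement string which still fits in its existing arena slot reuses that
slot, which is not stated in this thread.

```python
from dataclasses import dataclass, field

SHORT_MAX = 15  # inline limit in bytes on a 64-bit system, as stated in this thread


@dataclass
class ToyStringArray:
    """Toy model of the three storage kinds; NOT NumPy's real layout or API."""

    arena: bytearray = field(default_factory=bytearray)
    entries: list = field(default_factory=list)  # each entry is (kind, payload)

    @classmethod
    def from_strings(cls, strings):
        arr = cls()
        for s in strings:
            data = s.encode("utf-8")  # encoding always copies the input
            if len(data) <= SHORT_MAX:
                arr.entries.append(("short", data))
            else:
                # Long initial strings are packed into the shared arena.
                arr.entries.append(("arena", (len(arr.arena), len(data))))
                arr.arena += data
        return arr

    def kind(self, i):
        return self.entries[i][0]

    def get(self, i):
        kind, payload = self.entries[i]
        if kind == "arena":
            offset, size = payload
            return self.arena[offset:offset + size].decode("utf-8")
        return payload.decode("utf-8")

    def set(self, i, s):
        data = s.encode("utf-8")
        kind, payload = self.entries[i]
        if len(data) <= SHORT_MAX:
            self.entries[i] = ("short", data)
        elif kind == "arena" and len(data) <= payload[1]:
            # Assumption: a replacement that fits reuses the existing arena slot.
            offset = payload[0]
            self.arena[offset:offset + len(data)] = data
            self.entries[i] = ("arena", (offset, len(data)))
        else:
            # Enlarged entry: leave the arena untouched, use a separate allocation.
            self.entries[i] = ("heap", data)


arr = ToyStringArray.from_strings(["hi", "x" * 40])  # kinds: short, arena
arr.set(0, "y" * 20)  # entry 0 grows past 15 bytes and becomes a heap entry
```

In this model the only way to reach the "heap" kind is through `set`, which
mirrors the point above: freshly created arrays contain only short and arena
entries, and the arena buffer never has to grow after creation.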


> Thanks!
> -- Jim

[Numpy-discussion] Re: NEP 55 - Add a UTF-8 Variable-Width String DType to NumPy

2024-02-12 Thread Jim Pivarski
I see: thank you for the explanations!

On Mon, Feb 12, 2024 at 3:04 PM Nathan  wrote:

> StringDType arrays don’t intern Python strings directly; there’s always a
> copy.
>

I had been thinking of accepting a memoryview without copying, but if
there's always a copy in any case, that answers my question about ownership.
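
For anyone following along, the distinction I had in mind can be shown in
plain Python without NumPy. This is only an illustration of view versus copy
semantics for a buffer, under the assumption that the same reasoning carries
over to string storage; it says nothing about how StringDType is implemented.

```python
buf = bytearray(b"hello")
view = memoryview(buf)   # borrows buf's memory; nothing is copied
copied = bytes(view)     # an independent copy with its own lifetime

buf[0] = ord("j")        # mutate the original buffer in place

print(bytes(view))       # b'jello' (the view sees the change)
print(copied)            # b'hello' (the copy does not)
```

With a copy, the receiver owns its data outright, so there is no borrowed or
stolen reference to keep track of.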

-- Jim