[Numpy-discussion] Re: NEP 55 - Add a UTF-8 Variable-Width String DType to NumPy
On Wed, Sep 20, 2023 at 8:26 AM Warren Weckesser wrote:

> On Fri, Sep 15, 2023 at 3:18 PM Warren Weckesser <warren.weckes...@gmail.com> wrote:
>> On Mon, Sep 11, 2023 at 12:25 PM Nathan wrote:
>>> On Sun, Sep 3, 2023 at 10:54 AM Warren Weckesser <warren.weckes...@gmail.com> wrote:
>>>> On Tue, Aug 29, 2023 at 10:09 AM Nathan wrote:
>>>>> The NEP was merged in draft form, see below.
>>>>>
>>>>> https://numpy.org/neps/nep-0055-string_dtype.html
>>>>>
>>>>> On Mon, Aug 21, 2023 at 2:36 PM Nathan wrote:
>>>>>> Hello all,
>>>>>>
>>>>>> I just opened a pull request to add NEP 55, see
>>>>>> https://github.com/numpy/numpy/pull/24483.
>>>>>>
>>>>>> Per NEP 0, I've copied everything up to the "detailed description"
>>>>>> section below.
>>>>>>
>>>>>> I'm looking forward to your feedback on this.
>>>>>>
>>>>>> -Nathan Goldbaum
>>>>
>>>> This will be a nice addition to NumPy, and matches a suggestion by
>>>> @rkern (and probably others) made in the 2017 mailing list thread;
>>>> see the last bullet of
>>>>
>>>> https://mail.python.org/pipermail/numpy-discussion/2017-April/076681.html
>>>>
>>>> So +1 for the enhancement!
>>>>
>>>> Now for some nitty-gritty review...
>>>
>>> Thanks for the nitty-gritty review! I was on vacation last week and
>>> haven't had a chance to look over this in detail yet, but at first glance
>>> this seems like a really nice improvement.
>>>
>>> I'm going to try to integrate your proposed design into the dtype
>>> prototype this week. If that works, I'd like to include some of the text
>>> from the README in your repo in the NEP and add you as an author, would
>>> that be alright?
>>
>> Sure, that would be fine.
>>
>> I have a few more comments and questions about the NEP that I'll finish
>> up and send this weekend.
>
> One more comment on the NEP...
>
> My first impression of the missing data API design is that
> it is more complicated than necessary. An alternative that
> is simpler--and is consistent with the pattern established for
> floats and datetimes--is to define a "not a string" value, say
> `np.nastring` or something similar, just like we have `nan` for
> floats and `nat` for datetimes. Its behavior could be what
> you called "nan-like".

Float `np.nan` and datetime missing value sentinel are not all that
similar, and the latter was always a bit questionable (at least partially
it's a left-over of trying to introduce generic missing value support I
believe). `nan` is a float and part of C/C++ standards with well-defined
numerical behavior. In contrast, there is no `np.nat`; you can retrieve a
sentinel value with `np.datetime64("NaT")` only. I'm not sure if it's
possible to generate a NaT value with a regular operation on a datetime
array a la `np.array([1.5]) / 0.0`.

> The handling of `np.nastring` would be an intrinsic part of the
> dtype, so there would be no need for the `na_object` parameter
> of `StringDType`. All `StringDType`s would handle `np.nastring`
> in the same consistent manner.
>
> The use-case for the string sentinel does not seem very
> compelling (but maybe I just don't understand the use-cases).
> If there is a real need here that is not covered by
> `np.nastring`, perhaps just a flag to control the repr of
> `np.nastring` for each StringDType instance would be enough?

My understanding is that the NEP provides the necessary but limited
support to allow Pandas to adopt the new dtype. The scope section of the
NEP says: "Fully agreeing on the semantics of a missing data sentinels or
adding a missing data sentinel to NumPy itself.". And then further down:
"By only supporting user-provided missing data sentinels, we avoid
resolving exactly how NumPy itself should support missing data and the
correct semantics of the missing data object, leaving that up to users to
decide"

That general approach I agree with, it's a large can of worms and not the
main purpose of this NEP. Nathan may have more thoughts about what, if
anything, from your suggestions could be adopted, but the general "let's
introduce a missing value thing" is a path we should not go down here imho.

> If there is an objection to a potential proliferation of
> "not a thing" special values, one for each type that can
> handle them, then perhaps a generic "not a value" (say
> `np.navalue`) could be created that, when assigned to an
> element of an array, results in the appropriate "not a thing"
> value actually being assigned. In a sense, I guess this NEP is
> proposing that, but it is reusing the floating point object
> `np.nan` as the generic "not a thing" value

It is explicitly not using `np.nan` but instead allowing the user to
provide their preferred sentinel. You're probably referring to the example
with `na_object=np.nan`, but that example would work with another sentinel
value too.

Cheers,
Ralf

> , and my preference
> is that, *if* we go with such
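
The contrast between float `nan` and the datetime NaT sentinel discussed in this message can be sketched as follows. This is a minimal illustration that assumes only an installed NumPy; the variable names are made up for the example. Note that dividing a nonzero float by zero gives `inf`, while `0.0 / 0.0` is the ordinary operation that produces `nan`. The NaT value here is built explicitly with `np.datetime64("NaT")`, as the message describes, and detected with `np.isnat`.

```python
import numpy as np

# Float special values arise from ordinary IEEE 754 arithmetic.
with np.errstate(divide="ignore", invalid="ignore"):
    pos_inf = np.array([1.5]) / 0.0   # nonzero / zero -> inf
    nan_val = np.array([0.0]) / 0.0   # zero / zero -> nan

# A NaT sentinel is constructed explicitly, as described in the thread.
nat = np.datetime64("NaT")

# The string "NaT" is also parsed as a missing datetime inside an array.
dates = np.array(["2023-09-20", "NaT"], dtype="datetime64[D]")
missing_dates = np.isnat(dates)

print(pos_inf, nan_val, nat, missing_dates)
```

`np.isnan` and `np.isnat` are separate functions, which is one concrete sense in which the two sentinels do not share a single mechanism.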
[Numpy-discussion] Re: NEP 55 - Add a UTF-8 Variable-Width String DType to NumPy
On Wed, Sep 20, 2023 at 12:26 AM Warren Weckesser <warren.weckes...@gmail.com> wrote:

> On Fri, Sep 15, 2023 at 3:18 PM Warren Weckesser <warren.weckes...@gmail.com> wrote:
>> On Mon, Sep 11, 2023 at 12:25 PM Nathan wrote:
>>> On Sun, Sep 3, 2023 at 10:54 AM Warren Weckesser <warren.weckes...@gmail.com> wrote:
>>>> On Tue, Aug 29, 2023 at 10:09 AM Nathan wrote:
>>>>> The NEP was merged in draft form, see below.
>>>>>
>>>>> https://numpy.org/neps/nep-0055-string_dtype.html
>>>>>
>>>>> On Mon, Aug 21, 2023 at 2:36 PM Nathan wrote:
>>>>>> Hello all,
>>>>>>
>>>>>> I just opened a pull request to add NEP 55, see
>>>>>> https://github.com/numpy/numpy/pull/24483.
>>>>>>
>>>>>> Per NEP 0, I've copied everything up to the "detailed description"
>>>>>> section below.
>>>>>>
>>>>>> I'm looking forward to your feedback on this.
>>>>>>
>>>>>> -Nathan Goldbaum
>>>>
>>>> This will be a nice addition to NumPy, and matches a suggestion by
>>>> @rkern (and probably others) made in the 2017 mailing list thread;
>>>> see the last bullet of
>>>>
>>>> https://mail.python.org/pipermail/numpy-discussion/2017-April/076681.html
>>>>
>>>> So +1 for the enhancement!
>>>>
>>>> Now for some nitty-gritty review...
>>>
>>> Thanks for the nitty-gritty review! I was on vacation last week and
>>> haven't had a chance to look over this in detail yet, but at first glance
>>> this seems like a really nice improvement.
>>>
>>> I'm going to try to integrate your proposed design into the dtype
>>> prototype this week. If that works, I'd like to include some of the text
>>> from the README in your repo in the NEP and add you as an author, would
>>> that be alright?
>>
>> Sure, that would be fine.
>>
>> I have a few more comments and questions about the NEP that I'll finish
>> up and send this weekend.
>
> One more comment on the NEP...
>
> My first impression of the missing data API design is that
> it is more complicated than necessary. An alternative that
> is simpler--and is consistent with the pattern established for
> floats and datetimes--is to define a "not a string" value, say
> `np.nastring` or something similar, just like we have `nan` for
> floats and `nat` for datetimes. Its behavior could be what
> you called "nan-like".
>
> The handling of `np.nastring` would be an intrinsic part of the
> dtype, so there would be no need for the `na_object` parameter
> of `StringDType`. All `StringDType`s would handle `np.nastring`
> in the same consistent manner.
>
> The use-case for the string sentinel does not seem very
> compelling (but maybe I just don't understand the use-cases).
> If there is a real need here that is not covered by
> `np.nastring`, perhaps just a flag to control the repr of
> `np.nastring` for each StringDType instance would be enough?
>
> If there is an objection to a potential proliferation of
> "not a thing" special values, one for each type that can
> handle them, then perhaps a generic "not a value" (say
> `np.navalue`) could be created that, when assigned to an
> element of an array, results in the appropriate "not a thing"
> value actually being assigned. In a sense, I guess this NEP is
> proposing that, but it is reusing the floating point object
> `np.nan` as the generic "not a thing" value, and my preference
> is that, *if* we go with such a generic object, it is not
> the floating point value `nan` but a new thing with a name
> that reflects its purpose. (I guess Pandas users might be
> accustomed to `nan` being a generic sentinel for missing data,
> so its use doesn't feel as incohesive as it might to others.
> Passing a string array to `np.isnan()` just feels *wrong* to
> me.)
>
> Anyway, that's my 2¢.
>
> Warren

In addition to Ralf's points, I don't think it's possible for NumPy to
support all downstream usages of object string arrays without something
like what's in the NEP. Some downstream libraries want their NA sentinel to
not be comparable with strings (like `None`). Some people want the result
of comparisons with the NA sentinel to return the NA sentinel (libraries
that use np.nan, pandas.NA also works like this). Others want the sentinel
to behave like a string and have a well-defined ordering (pandas does this
internally to support sorting strings with missing data in a low-level C
routine).

I don't see how it's possible to simultaneously support all of this in a
single sentinel object, unless that object can be created with some
parameters, and then we're no simpler than what I'm proposing *and* we have
to decide on sensible default behavior.

>> Warren
>>
>>>> There is a design change that I think should be made in the
>>>> implementation of missing values.
>>>>
>>>> In the current design described in the NEP, and expanded on in the
>>>> comment
>>>>
>>>> https://github.com/numpy/numpy/pull/24483#discussion_r1311815944,
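
The three sentinel behaviours listed in the reply above conflict with one another, and a small pure-Python sketch may make that concrete. Everything here is an assumption for illustration: `PropagatingNA`, `MISSING`, and `sort_missing_last` are hypothetical names, not NumPy or pandas APIs, and the third case uses a sort key as a stand-in for a sentinel that itself behaves like a string.

```python
# 1. A sentinel like `None` cannot be ordered against strings.
def none_is_orderable():
    try:
        None < "abc"
    except TypeError:
        return False
    return True


# 2. A hypothetical sentinel whose comparisons return the sentinel itself,
#    loosely modelled on the propagating behaviour described in the thread.
class PropagatingNA:
    def __eq__(self, other):
        return self

    def __lt__(self, other):
        return self

    def __repr__(self):
        return "<NA>"


NA = PropagatingNA()

# 3. A sentinel with a well-defined position in sort order. A sort key
#    places missing entries after every real string (an illustrative
#    substitute for a string-like sentinel).
MISSING = object()


def sort_missing_last(values):
    return sorted(values, key=lambda v: (v is MISSING, "" if v is MISSING else v))


print(none_is_orderable())
print(NA < "abc")
print(sort_missing_last(["pear", MISSING, "apple"])[:2])
```

An object cannot both raise on `<` (case 1) and return a value from `<` (cases 2 and 3), which is roughly why a single fixed sentinel would have to pick one behaviour.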
[Numpy-discussion] Re: NEP 55 - Add a UTF-8 Variable-Width String DType to NumPy
On Wed, Sep 20, 2023 at 4:40 AM Kevin Sheppard wrote:

> On Wed, Sep 20, 2023 at 11:23 AM Ralf Gommers wrote:
>> On Wed, Sep 20, 2023 at 8:26 AM Warren Weckesser <warren.weckes...@gmail.com> wrote:
>>> On Fri, Sep 15, 2023 at 3:18 PM Warren Weckesser <warren.weckes...@gmail.com> wrote:
>>>> On Mon, Sep 11, 2023 at 12:25 PM Nathan wrote:
>>>>> On Sun, Sep 3, 2023 at 10:54 AM Warren Weckesser <warren.weckes...@gmail.com> wrote:
>>>>>> On Tue, Aug 29, 2023 at 10:09 AM Nathan wrote:
>>>>>>> The NEP was merged in draft form, see below.
>>>>>>>
>>>>>>> https://numpy.org/neps/nep-0055-string_dtype.html
>>>>>>>
>>>>>>> On Mon, Aug 21, 2023 at 2:36 PM Nathan wrote:
>>>>>>>> Hello all,
>>>>>>>>
>>>>>>>> I just opened a pull request to add NEP 55, see
>>>>>>>> https://github.com/numpy/numpy/pull/24483.
>>>>>>>>
>>>>>>>> Per NEP 0, I've copied everything up to the "detailed
>>>>>>>> description" section below.
>>>>>>>>
>>>>>>>> I'm looking forward to your feedback on this.
>>>>>>>>
>>>>>>>> -Nathan Goldbaum
>>>>>>
>>>>>> This will be a nice addition to NumPy, and matches a suggestion by
>>>>>> @rkern (and probably others) made in the 2017 mailing list thread;
>>>>>> see the last bullet of
>>>>>>
>>>>>> https://mail.python.org/pipermail/numpy-discussion/2017-April/076681.html
>>>>>>
>>>>>> So +1 for the enhancement!
>>>>>>
>>>>>> Now for some nitty-gritty review...
>>>>>
>>>>> Thanks for the nitty-gritty review! I was on vacation last week and
>>>>> haven't had a chance to look over this in detail yet, but at first glance
>>>>> this seems like a really nice improvement.
>>>>>
>>>>> I'm going to try to integrate your proposed design into the dtype
>>>>> prototype this week. If that works, I'd like to include some of the text
>>>>> from the README in your repo in the NEP and add you as an author, would
>>>>> that be alright?
>>>>
>>>> Sure, that would be fine.
>>>>
>>>> I have a few more comments and questions about the NEP that I'll
>>>> finish up and send this weekend.
>>>
>>> One more comment on the NEP...
>>>
>>> My first impression of the missing data API design is that
>>> it is more complicated than necessary. An alternative that
>>> is simpler--and is consistent with the pattern established for
>>> floats and datetimes--is to define a "not a string" value, say
>>> `np.nastring` or something similar, just like we have `nan` for
>>> floats and `nat` for datetimes. Its behavior could be what
>>> you called "nan-like".
>>
>> Float `np.nan` and datetime missing value sentinel are not all that
>> similar, and the latter was always a bit questionable (at least partially
>> it's a left-over of trying to introduce generic missing value support I
>> believe). `nan` is a float and part of C/C++ standards with well-defined
>> numerical behavior. In contrast, there is no `np.nat`; you can retrieve a
>> sentinel value with `np.datetime64("NaT")` only. I'm not sure if it's
>> possible to generate a NaT value with a regular operation on a datetime
>> array a la `np.array([1.5]) / 0.0`.
>>
>>> The handling of `np.nastring` would be an intrinsic part of the
>>> dtype, so there would be no need for the `na_object` parameter
>>> of `StringDType`. All `StringDType`s would handle `np.nastring`
>>> in the same consistent manner.
>>>
>>> The use-case for the string sentinel does not seem very
>>> compelling (but maybe I just don't understand the use-cases).
>>> If there is a real need here that is not covered by
>>> `np.nastring`, perhaps just a flag to control the repr of
>>> `np.nastring` for each StringDType instance would be enough?
>>
>> My understanding is that the NEP provides the necessary but limited
>> support to allow Pandas to adopt the new dtype. The scope section of the
>> NEP says: "Fully agreeing on the semantics of a missing data sentinels or
>> adding a missing data sentinel to NumPy itself.". And then further down:
>> "By only supporting user-provided missing data sentinels, we avoid
>> resolving exactly how NumPy itself should support missing data and the
>> correct semantics of the missing data object, leaving that up to users to
>> decide"
>>
>> That general approach I agree with, it's a large can of worms and not the
>> main purpose of this NEP. Nathan may have more thoughts about what, if
>> anything, from your suggestions could be adopted, but the general "let's
>> introduce a missing value thing" is a path we should not go down here imho.
>>
>>> If there is an objection to a potential proliferation of
>>> "not a thing" special values, one for each type that can
>>> handle them, then perhaps a generic "not a value" (say
>>> `np.navalue`) could be created that, when assigned to an
>>> element of an array, results in the appropriate "not a thing"
>>> value actually being assigned. In a sense,