[Numpy-discussion] Re: mean_std function returning both mean and std

2023-07-06 Thread Ronald van Elburg
Second attempt after the triage review of last week: ENH: add mean keyword to 
std and var #24126 (https://github.com/numpy/numpy/pull/24126)
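
A minimal sketch of the idea behind the thread title, computing the mean once and reusing it for the standard deviation. It does not reproduce the PR or use its keyword; the helper name `mean_std` and its parameters are hypothetical, and `axis` is assumed to be `None` or a single integer.

```python
import numpy as np


def mean_std(a, axis=None, ddof=0):
    """Return (mean, std) while computing the mean only once.

    Hypothetical helper for illustration; not part of NumPy.
    """
    a = np.asarray(a, dtype=float)
    # keepdims=True lets the mean broadcast against the input below.
    mean = a.mean(axis=axis, keepdims=True)
    n = a.size if axis is None else a.shape[axis]
    # Sum of squared deviations from the already computed mean.
    var = (np.abs(a - mean) ** 2).sum(axis=axis) / (n - ddof)
    return np.squeeze(mean, axis=axis), np.sqrt(var)


data = np.arange(12.0).reshape(3, 4)
m, s = mean_std(data, axis=0)
```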
___
NumPy-Discussion mailing list -- numpy-discussion@python.org
To unsubscribe send an email to numpy-discussion-le...@python.org
https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
Member address: arch...@mail-archive.com


[Numpy-discussion] Re: mean_std function returning both mean and std

2023-07-06 Thread Neal Becker
On a somewhat related note, I usually find I need to compute stats
incrementally.  To do this, a stat object is created so batches of samples
can be fed to it sequentially.

I used to use an implementation based on boost::accumulator for this.  More
recently I'm using my own c++ code based on xtensor, exposed to python with
xtensor-python and pybind11.

The basic technique to find 2nd order stats is to keep 2 running sums,
sum(x) and sum(x**2).

It would be useful to have functionality for incremental stats like this in
numpy, as well as other incremental operations (e.g., histogram).  I
frequently find I need to process large amounts of data in small batches at
a time, generated by iterative monte-carlo simulations, for example.
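
A minimal sketch of the two-running-sums technique described above, in plain NumPy rather than the C++/xtensor code mentioned. The class name `RunningStats` and its methods are hypothetical. One caveat: accumulating sum(x**2) can lose precision when the mean is large relative to the spread, so a Welford-style update may be preferable for such data.

```python
import numpy as np


class RunningStats:
    """Accumulate count, sum(x) and sum(x**2) over batches (hypothetical helper)."""

    def __init__(self):
        self.n = 0
        self.total = 0.0
        self.total_sq = 0.0

    def update(self, batch):
        # Flatten each batch; only scalar statistics are tracked here.
        batch = np.asarray(batch, dtype=float).ravel()
        self.n += batch.size
        self.total += batch.sum()
        self.total_sq += np.square(batch).sum()

    @property
    def mean(self):
        return self.total / self.n

    def var(self, ddof=0):
        # sum((x - mean)**2) == sum(x**2) - sum(x)**2 / n
        return (self.total_sq - self.total ** 2 / self.n) / (self.n - ddof)

    def std(self, ddof=0):
        # Clamp tiny negative values caused by rounding.
        return np.sqrt(max(self.var(ddof), 0.0))


rng = np.random.default_rng(0)
samples = rng.normal(loc=1.0, scale=2.0, size=1000)
stats = RunningStats()
for chunk in np.array_split(samples, 10):
    stats.update(chunk)
```

After the loop, `stats.mean` and `stats.std()` should agree with `samples.mean()` and `samples.std()` up to floating-point rounding.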


[Numpy-discussion] New user dtypes and the buffer protocol

2023-07-06 Thread Nathan
Hi all,

As you may know, I'm currently working on a variable-width string dtype
using the new experimental user dtype API. As part of this work I'm running
into papercuts that future dtype authors will likely hit and I've been
trying to fix them as I go.

One issue I'd like to raise with the list is that the Python buffer
protocol and the `__array_interface__` protocol support a limited set of
data types.

This leads to three concrete issues I'm working around:

* The `npy` file format uses the type strings defined by the
`__array_interface__` protocol, so any type that doesn't have a type string
defined in that protocol cannot currently be saved [1].

* Cython uses the buffer protocol in its support for NumPy arrays and
in the typed memoryview interface, so any array with a dtype that
doesn't support the buffer protocol cannot be accessed using idiomatic
Cython code [2]. The same issue means Cython can't easily support float16
or datetime dtypes [3].

* Currently new dtypes don't have a way to export a string version of
themselves that NumPy can subsequently load (implicitly importing the
dtype). This makes it more awkward to update downstream libraries that
currently treat dtypes as strings.
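
A small sketch of the first two points for built-in dtypes, assuming only NumPy itself. It shows the `__array_interface__` type string ending up in an `npy` header, and a datetime64 array being refused by the buffer protocol. The variable names are illustrative, and the exact exception type for the refused export is an assumption based on the NumPy versions the sketch was written against.

```python
import io

import numpy as np

arr = np.arange(3, dtype=np.float64)
# Type string from the __array_interface__ protocol, e.g. '<f8'.
typestr = arr.__array_interface__["typestr"]

# Write a version 1.0 npy file into memory and read its header back.
buf = io.BytesIO()
np.lib.format.write_array(buf, arr, version=(1, 0))
buf.seek(0)
version = np.lib.format.read_magic(buf)
shape, fortran_order, loaded_dtype = np.lib.format.read_array_header_1_0(buf)
header_mentions_typestr = typestr.encode("latin1") in buf.getvalue()

# datetime64 has no buffer-protocol format code, so the export is refused.
dates = np.array(["2023-07-06"], dtype="datetime64[D]")
try:
    memoryview(dates)
    datetime_exported = True
except (ValueError, BufferError, TypeError):
    datetime_exported = False
```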

One way to fix this is to define an ad-hoc extension to the buffer
protocol. Officially, the buffer protocol only supports the format codes
used in the struct module [4]. Unofficially, memoryview doesn't raise a
NotImplementedError if you pass it an invalid format code, only raising an
error when it tries to access the data. This means we can stuff an
arbitrary string into the format code. See the proposal from Sebastian on
the Python Discuss forum [5] and his proof-of-concept [6]. The hardest
issue with this approach is that it's a social problem, requiring
cross-project coordination with at least Cython, and possibly a PEP to
standardize whatever extension to the buffer protocol we come up with.
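
The memoryview behaviour described above can be seen today with a structured dtype, whose exported format string is not a single struct-module code. This is only an illustration of "construction succeeds, element access fails"; it is not Sebastian's proof-of-concept, and the names are illustrative.

```python
import numpy as np

# A structured dtype exports a 'T{...}' style format string.
records = np.zeros(3, dtype=[("x", "i4"), ("y", "i4")])

# Creating the memoryview works even though memoryview cannot unpack items.
view = memoryview(records)
fmt = view.format

# The error only appears once an element is actually accessed.
try:
    view[0]
    access_error = None
except NotImplementedError as exc:
    access_error = exc
```

Metadata such as `view.format`, `view.shape` and `view.nbytes` stays available, which is what makes carrying an arbitrary string in the format field workable in principle.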

Another option would be to exchange data using the arrow data format [7],
which already supports many of the kinds of memory layouts custom dtype
authors might want to use and supports defining custom data types [8]. The
big issue here is that NumPy probably can't depend on the arrow C++ library
(I think?) so we would need to write a bunch of code to support arrow data
layouts and data types, but then we would also need to do the same thing on
the Cython side.

Implementing either of these approaches fixes the issues I enumerated above
at the cost of some added complexity. We don't necessarily have to make an
immediate decision for my work to be viable, since I can work around most of
these issues, but I think now is the time to raise this and see if anyone
has strong opinions about what NumPy should ultimately do.

I've raised this on the Cython mailing list to get their take as well [9].

[1] https://github.com/numpy/numpy/issues/24110
[2] https://github.com/numpy/numpy/issues/18442
[3] https://github.com/numpy/numpy/issues/4983
[4] https://docs.python.org/3/library/struct.html#format-strings
[5] https://discuss.python.org/t/buffer-protocol-and-arbitrary-data-types/26256
[6] https://github.com/numpy/numpy/issues/23500#issuecomment-1525103546
[7] https://arrow.apache.org/docs/format/Columnar.html
[8] https://arrow.apache.org/docs/format/Columnar.html#extension-types
[9] https://mail.python.org/pipermail/cython-devel/2023-July/005434.html


[Numpy-discussion] Re: New user dtypes and the buffer protocol

2023-07-06 Thread Evgeni Burovski
I wonder if the dlpack protocol can be helpful for these kinds of dtypes?





[Numpy-discussion] Re: New user dtypes and the buffer protocol

2023-07-06 Thread Matti Picus

On 6/7/23 20:44, Evgeni Burovski wrote:


> I wonder if the dlpack protocol can be helpful for these kinds of dtypes?



No. DLPack has an enum for a fixed number of known dtypes [0], and 
adding new ones is non-trivial.


[0] https://github.com/dmlc/dlpack/blob/ca4d00ad3e2e0f410eeab3264d21b8a39397f362/include/dlpack/dlpack.h#L158
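
A rough way to see this limitation from Python, assuming a NumPy version that provides `ndarray.__dlpack__`. The helper name `dlpack_exportable` is hypothetical, and the exception type raised for unsupported dtypes is assumed to be `BufferError` or `TypeError` depending on the NumPy version, so the sketch catches both.

```python
import numpy as np


def dlpack_exportable(arr):
    """Return True if NumPy can export `arr` through DLPack (hypothetical helper)."""
    try:
        # Returns a capsule on success; it is simply discarded here.
        arr.__dlpack__()
    except (BufferError, TypeError):
        return False
    return True


floats_ok = dlpack_exportable(np.zeros(3, dtype=np.float64))
dates_ok = dlpack_exportable(np.array(["2023-07-06"], dtype="datetime64[D]"))
strings_ok = dlpack_exportable(np.array(["a", "bc"]))
```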


Matti
