[Numpy-discussion] ENH: Introducing a pipe Method for Numpy arrays

2024-02-15 Thread d . lenard80
Hello NumPy community,

I'm proposing the introduction of a `pipe` method for NumPy arrays to enhance 
their usability and expressiveness.
Similar to other data processing libraries like pandas, a `pipe` method would 
allow users to chain operations together in a more readable and intuitive 
manner.
Consider the following examples where method chaining with `pipe` can improve 
code readability compared to traditional NumPy code:

# 
# Class PipeableArray just for illustration

import numpy as np

class PipeableArray:
    def __init__(self, array: np.ndarray):
        self.array = array

    def pipe(self, func, *args, **kwargs):
        """Apply function and return the result wrapped in PipeableArray."""
        try:
            result = func(self.array, *args, **kwargs)
        except Exception:
            print('Oops, something went wrong...')
            raise  # re-raise so a failed step does not silently return None
        return PipeableArray(result)

    def __repr__(self):
        return repr(self.array)

# 
# Original code using traditional NumPy chaining
arr = np.array([1, 2, 3, 4, 5])
arr = np.square(arr)
arr = np.log(arr)
arr = np.cumsum(arr)

# Original code using traditional NumPy nested functions
arr = np.arange(1., 5.)
result = np.cumsum(np.log(np.square(arr)))

# 
# Proposed NumPy method chaining using a new pipe method

arr = PipeableArray(np.arange(1., 5.))
result = (arr
  .pipe(np.square)
  .pipe(np.log)
  .pipe(np.cumsum)
)
# 

Benefits:
- Readability: Method chaining with pipe offers a more readable and intuitive 
way to express complex data transformations, making the intended data 
processing pipeline easier to understand.
- Customization: The pipe method allows users to chain custom functions or 
already implemented NumPy operations seamlessly.
- Modularity: Users can define reusable functions and chain them together using 
pipe, leading to cleaner and more maintainable code.
- Consistency: Introducing a pipe method in NumPy aligns with similar 
functionality available in other libraries like pandas, polars, etc.
- Optimization: While NumPy may not currently optimize chained expressions, the 
introduction of pipe lays the groundwork for potential future optimizations 
with lazy evaluation.

I believe this enhancement could benefit the NumPy community by providing a 
more flexible and expressive way to work with arrays.
I'd love to see such a feature in NumPy and would like to hear your thoughts on this 
proposal.

Best regards,
Oyibo
___
NumPy-Discussion mailing list -- numpy-discussion@python.org
To unsubscribe send an email to numpy-discussion-le...@python.org
https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
Member address: arch...@mail-archive.com


[Numpy-discussion] Re: ENH: Introducing a pipe Method for Numpy arrays

2024-02-15 Thread Robert Kern
On Thu, Feb 15, 2024 at 10:21 AM  wrote:

> Hello Numpy community,
>
> I'm proposing the introduction of a `pipe` method for NumPy arrays to
> enhance their usability and expressiveness.
>

Adding a prominent method like this to `np.ndarray` is something that we
will probably not take up ourselves unless it is adopted by the Array API
standard. You might get some interest there, since the Array API
deliberately cuts down the set of methods that we already have (e.g.
`.mean()`, `.sum()`, etc.) in favor of functions. A general way to add some
kind of fluency cheaply in an Array API-agnostic fashion might be helpful to
people whose numpy-only code uses our current set of methods in this way,
by making that code a bit easier to adapt. But you'll have to make the
proposal to them, I think, to get started.
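
For illustration only, one cheap way to get this kind of fluency without adding any method is a free function that threads a value through a sequence of callables. This is a sketch; the name `pipe` is an assumption and not an existing NumPy or Array API function. Because it only calls functions, it should work with any array type that those functions accept.

```python
from functools import reduce

import numpy as np


def pipe(x, *funcs):
    """Feed x through funcs from left to right (hypothetical helper)."""
    return reduce(lambda acc, f: f(acc), funcs, x)


result = pipe(np.arange(1., 5.), np.square, np.log, np.cumsum)
nested = np.cumsum(np.log(np.square(np.arange(1., 5.))))
print(np.allclose(result, nested))
```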

-- 
Robert Kern


[Numpy-discussion] Re: ENH: Introducing a pipe Method for Numpy arrays

2024-02-15 Thread Marten van Kerkwijk
Hi Oyibo,

> I'm proposing the introduction of a `pipe` method for NumPy arrays to enhance 
> their usability and expressiveness.

I think it is an interesting idea, but agree with Robert that it is
unlikely to fly on its own.  Part of the logic of even frowning on
methods like .mean() and .sum() is that ndarray is really a data
container, and should have methods related to that, as much as possible
independent of the meaning of those data (which is given by the dtype).

A bit more generally, your example is nice, but a pipe can have just one
input, while of course many operations require two or more.
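
To illustrate the single-input limitation: with a pandas-style signature, the piped array always lands in the first argument slot. Further operands can be passed as extra arguments, but when the array has to go anywhere else, a small wrapper is needed. A minimal sketch, using a stand-alone `pipe` function that is assumed here purely for illustration:

```python
import numpy as np


def pipe(x, func, *args, **kwargs):
    """Hypothetical helper: call func with x as its first argument."""
    return func(x, *args, **kwargs)


a = np.array([1., 2., 3.])
b = np.array([10., 20., 30.])

# Two inputs, piped array first: the second operand is an extra argument.
total = pipe(a, np.add, b)                     # computes a + b

# Piped array must be the second operand: wrap the call in a lambda.
diff = pipe(a, lambda x: np.subtract(b, x))    # computes b - a

print(total, diff)
```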

> - Optimization: While NumPy may not currently optimize chained
> expressions, the introduction of pipe lays the groundwork for
> potential future optimizations with lazy evaluation.

Optimization might indeed be made possible, though I would think that
for that one may be better off with something like dask.

That said, I've been playing with the ability to chain ufuncs to
optimize their execution, by applying the ufuncs in series on small
pieces of larger arrays, thus avoiding large temporaries (a bit like
numexpr but with the idea of defining a fast function rather than giving
an expression as a string); see https://github.com/mhvk/chain_ufunc
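
The chunking idea can be sketched in pure Python as follows. This is not the chain_ufunc implementation or its API, only an illustration under simplifying assumptions: a 1-D floating-point input and unary ufuncs. The function name `chained` and the `chunk` parameter are made up for this example.

```python
import numpy as np


def chained(x, ufuncs, chunk=4096):
    """Apply unary ufuncs in series, one small piece of x at a time."""
    out = np.empty_like(x)
    for start in range(0, x.size, chunk):
        sl = slice(start, start + chunk)
        piece = out[sl]                  # view into the output buffer
        ufuncs[0](x[sl], out=piece)      # first ufunc reads from the input
        for uf in ufuncs[1:]:
            uf(piece, out=piece)         # later ufuncs work in place
    return out


x = np.linspace(1.0, 2.0, 10_000)
y = chained(x, [np.square, np.log], chunk=1024)
print(np.allclose(y, np.log(np.square(x))))
```

Only the output array and one chunk-sized view are live at any time, so no full-size temporary is allocated between the steps.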

All the best,

Marten




[Numpy-discussion] Re: ENH: Introducing a pipe Method for Numpy arrays

2024-02-15 Thread Dom Grigonis
What were your conclusions after experimenting with chained ufuncs?

If the speed is comparable to numexpr, wouldn’t it be `nicer` to have a 
non-string input format?

It would feel a bit less like a black-box.

Regards,
DG




[Numpy-discussion] Re: ENH: Introducing a pipe Method for Numpy arrays

2024-02-15 Thread Marten van Kerkwijk
> What were your conclusions after experimenting with chained ufuncs?
> 
> If the speed is comparable to numexpr, wouldn’t it be `nicer` to have
> non-string input format?
> 
> It would feel a bit less like a black-box.

I haven't gotten any further yet; it is just some toying around I've
been doing.  But I'd indeed prefer not to go via strings -- possibly
numexpr could use a mechanism similar to what I did to construct the
function that is being evaluated.

Aside: your suggestion of the pipe led to some further discussion at
https://github.com/numpy/numpy/issues/25826#issuecomment-1947342581
-- as a more general way of passing arrays to functions.

-- Marten


[Numpy-discussion] Re: ENH: Introducing a pipe Method for Numpy arrays

2024-02-15 Thread Dom Grigonis
Just to clarify, I am not the one who suggested pipes. :)

I have read the issue. My 2 cents:

From my experience, calling methods is generally faster than calling functions. I 
figure it is due to having less overhead figuring out the input. Maybe it is 
not significant for large data, but it does make a difference even when working 
with medium-sized arrays, say 5000 floats.

%timeit a.sum()
3.17 µs
%timeit np.sum(a)
5.18 µs

(In my experience, `sum` for medium size arrays often becomes a bottleneck in 
greedy optimisation algorithms where distances are calculated over and over for 
partial space.)

In short, all I want to say is that it would be great if such speed 
considerations were addressed if/when developing piping or anything similar.

E.g. the pipe implementation could allow optimisations to be added.

numexpr could then provide a plugin.

At the top, the user writes:
np.pipe_use_plugin(numexpr.plug_pipe)  # or something similar

Then numexpr would kick in whenever appropriate when using pipes.
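
A rough sketch of how such a hook might look. Everything here is hypothetical: neither `pipe_use_plugin` nor `run_pipeline` exists in NumPy, and the toy plugin merely stands in for whatever numexpr might provide. A plugin receives the whole sequence of functions and may return a fused replacement, or `None` to decline.

```python
import numpy as np

_plugins = []  # hypothetical global registry of optimisation plugins


def pipe_use_plugin(plugin):
    """Register a plugin (hypothetical API, for illustration only)."""
    _plugins.append(plugin)


def run_pipeline(x, funcs):
    """Run funcs on x in order, letting plugins replace the whole chain."""
    funcs = tuple(funcs)
    for plugin in _plugins:
        fused = plugin(funcs)
        if fused is not None:
            return fused(x)
    for f in funcs:          # no plugin matched: plain sequential evaluation
        x = f(x)
    return x


def square_then_sum_plugin(funcs):
    """Toy plugin: fuse (np.square, np.sum) into a single dot product."""
    if funcs == (np.square, np.sum):
        return lambda x: np.dot(x, x)
    return None


pipe_use_plugin(square_then_sum_plugin)
print(run_pipeline(np.arange(4.), [np.square, np.sum]))
```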

Regards,
DG




[Numpy-discussion] Re: ENH: Introducing a pipe Method for Numpy arrays

2024-02-15 Thread Michael Siebert
Hi all,

In PyTorch they (kind of) recently introduced torch.compile:

https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html

In TensorFlow 1.x, eager execution needs to be activated manually (it is the 
default from 2.0 on); otherwise it creates a graph object which then acts 
like this kind of pipe.

I don't know whether that's useful info for an implementation in NumPy. I'm just 
referring to what I think may be similar to pipes in other NumPy-like 
frameworks.

Best, Michael



[Numpy-discussion] Re: ENH: Introducing a pipe Method for Numpy arrays

2024-02-15 Thread Marten van Kerkwijk
> From my experience, calling methods is generally faster than
> functions. I figure it is due to having less overhead figuring out the
> input. Maybe it is not significant for large data, but it does make a
> difference even when working for medium sized arrays - say float size
> 5000.
> 
> %timeit a.sum()
> 3.17 µs
> %timeit np.sum(a)
> 5.18 µs

It is more that np.sum checks if there is a .sum() method and if so
calls that.  And then `ndarray.sum()` calls `np.add.reduce(array)`.

In [2]: a = np.arange(5000.)

In [3]: %timeit np.sum(a)
3.89 µs ± 411 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

In [4]: %timeit a.sum()
2.43 µs ± 42 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

In [5]: %timeit np.add.reduce(a)
2.33 µs ± 31 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

Though I must admit I'm a bit surprised the excess is *that* large for
using np.sum...  There may be a little micro-optimization to be found...
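
For anyone who wants to repeat the comparison outside IPython, here is a small sketch using the standard `timeit` module. The repeat counts are arbitrary choices, and the absolute numbers will differ by machine and NumPy version.

```python
import timeit

import numpy as np

a = np.arange(5000.)

# The three call paths compared above; all compute the same sum.
candidates = {
    "np.sum(a)": lambda: np.sum(a),
    "a.sum()": lambda: a.sum(),
    "np.add.reduce(a)": lambda: np.add.reduce(a),
}

number = 2000  # calls per timing run
timings = {}
for label, fn in candidates.items():
    # Best of 5 runs, converted to microseconds per call.
    best = min(timeit.repeat(fn, number=number, repeat=5))
    timings[label] = best / number * 1e6
    print(f"{label:18s} {timings[label]:.2f} µs per call")
```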

-- Marten


[Numpy-discussion] Re: ENH: Introducing a pipe Method for Numpy arrays

2024-02-15 Thread Dom Grigonis
Thanks for this, every little helps.

One more thing to mention on this topic.

Above a certain size, the dot product becomes faster than sum (due to 
parallelisation, I guess?).

E.g.
def dotsum(arr):
    a = arr.reshape(1000, 100)
    return a.dot(np.ones(100)).sum()

a = np.ones(100000)  # reshape(1000, 100) needs 1000 * 100 elements

In [45]: %timeit np.add.reduce(a, axis=None)
42.8 µs ± 2.44 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [43]: %timeit dotsum(a)
26.1 µs ± 718 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

But theoretically, sum should be faster than the dot product by a fair bit.

Isn’t parallelisation implemented for it?

Regards,
DG





[Numpy-discussion] Re: ENH: Introducing a pipe Method for Numpy arrays

2024-02-15 Thread Marten van Kerkwijk
> One more thing to mention on this topic.
>
> From a certain size dot product becomes faster than sum (due to 
> parallelisation I guess?).
>
> E.g.
> def dotsum(arr):
>     a = arr.reshape(1000, 100)
>     return a.dot(np.ones(100)).sum()
>
> a = np.ones(100000)
>
> In [45]: %timeit np.add.reduce(a, axis=None)
> 42.8 µs ± 2.44 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
>
> In [43]: %timeit dotsum(a)
> 26.1 µs ± 718 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
>
> But theoretically, sum, should be faster than dot product by a fair bit.
>
> Isn’t parallelisation implemented for it?

I cannot reproduce that:

In [3]: %timeit np.add.reduce(a, axis=None)
19.7 µs ± 184 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

In [4]: %timeit dotsum(a)
47.2 µs ± 360 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

But almost certainly it is indeed due to optimizations, since .dot uses
BLAS which is highly optimized (at least on some platforms, clearly
better on yours than on mine!).

I thought .sum() was optimized too, but perhaps less so?

It may be good to raise a quick issue about this!

Thanks, Marten


[Numpy-discussion] Re: ENH: Introducing a pipe Method for Numpy arrays

2024-02-15 Thread Homeier, Derek


> On 16 Feb 2024, at 2:48 am, Marten van Kerkwijk wrote:
> 
>> In [45]: %timeit np.add.reduce(a, axis=None)
>> 42.8 µs ± 2.44 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
>> 
>> In [43]: %timeit dotsum(a)
>> 26.1 µs ± 718 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
>> 
>> But theoretically, sum, should be faster than dot product by a fair bit.
>> 
>> Isn’t parallelisation implemented for it?
> 
> I cannot reproduce that:
> 
> In [3]: %timeit np.add.reduce(a, axis=None)
> 19.7 µs ± 184 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
> 
> In [4]: %timeit dotsum(a)
> 47.2 µs ± 360 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
> 
> But almost certainly it is indeed due to optimizations, since .dot uses
> BLAS which is highly optimized (at least on some platforms, clearly
> better on yours than on mine!).
> 
> I thought .sum() was optimized too, but perhaps less so?


I can confirm that at least it does not seem to use multithreading. With the 
conda-installed numpy+BLAS I almost exactly reproduce your numbers, whereas 
linked against my own OpenBLAS build I get:

In [3]: %timeit np.add.reduce(a, axis=None)
19 µs ± 111 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

# OMP_NUM_THREADS=1
In [4]: %timeit dotsum(a)
20.5 µs ± 164 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

# OMP_NUM_THREADS=8
In [4]: %timeit dotsum(a)
9.84 µs ± 1.1 µs per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

add.reduce shows no difference between the two settings and always remains 
at <= 100% CPU usage.
dotsum scales even better with larger matrices, e.g. ~4x for 1000x1000.
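
A self-contained script to repeat this comparison might look like the sketch below. It assumes the array has 100000 elements, as the reshape to (1000, 100) requires. The BLAS thread count is not set in the script; the assumption is that it is controlled from outside, e.g. by setting OMP_NUM_THREADS in the environment before starting Python as in the runs above. Timings will vary with the BLAS build.

```python
import timeit

import numpy as np


def dotsum(arr):
    # Sum via a BLAS matrix-vector product followed by a short reduction.
    a = arr.reshape(1000, 100)
    return a.dot(np.ones(100)).sum()


a = np.ones(100_000)  # 1000 * 100 elements, as the reshape requires

# Both approaches compute the same total.
total_reduce = np.add.reduce(a, axis=None)
total_dot = dotsum(a)

for label, fn in [("add.reduce", lambda: np.add.reduce(a, axis=None)),
                  ("dotsum", lambda: dotsum(a))]:
    # Best of 5 runs of 1000 calls each, reported per call.
    per_call = min(timeit.repeat(fn, number=1000, repeat=5)) / 1000
    print(f"{label:10s} {per_call * 1e6:.1f} µs per call")
```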

Cheers,
Derek