[Numpy-discussion] Re: array.T.max() vs array.max(axis=0) performance difference

2025-03-22 Thread Fang Zhang
Sebastian,

I think the core issue you pointed out is indeed correct, but your detailed
explanation is backwards, since `maximum(arr[:, 0], arr[:, 1])` implements
`arr.max(axis=1)` rather than `arr.max(axis=0)`. So the OP's transpose method
is essentially approach 1, which for this array shape has less overhead than
approach 2 (and since there are only 2 columns, it doesn't matter that we are
paying this overhead in Python instead of in C).
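
To make the axis bookkeeping concrete (a tiny sketch, stock NumPy only):

import numpy as np
arr = np.arange(6).reshape(3, 2)     # [[0, 1], [2, 3], [4, 5]]
np.maximum(arr[:, 0], arr[:, 1])     # [1, 3, 5] == arr.max(axis=1)
np.maximum.reduce(arr, axis=0)       # [4, 5]    == arr.max(axis=0)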

From my tests, when n = 1_000_000 the advantage of the transpose method
persists up to m = 6. But for smaller n the normal method becomes better
again, so it's not that clear cut (although it would be nice to have a way
to use the transpose method while paying the overhead in C instead of Python!)

Test code:

import numpy as np
from IPython import get_ipython


def calculate_bbox_normal(vertices: np.ndarray):
    return vertices.min(axis=0), vertices.max(axis=0)


def calculate_bbox_transpose(vertices: np.ndarray):
    return (np.array([col.min() for col in vertices.T]),
            np.array([col.max() for col in vertices.T]))


n = 1_000_000
for m in range(4, 9):
    print(f"shape: {n, m}")
    vertices = np.random.random((n, m))
    vmin, vmax = calculate_bbox_normal(vertices)
    vmin_t, vmax_t = calculate_bbox_transpose(vertices)
    assert (vmin == vmin_t).all()
    assert (vmax == vmax_t).all()
    print("Normal: ", end="")
    get_ipython().run_line_magic("timeit", "calculate_bbox_normal(vertices)")
    print("Transpose: ", end="")
    get_ipython().run_line_magic("timeit", "calculate_bbox_transpose(vertices)")
    print()

- Fang

On Sat, Mar 22, 2025 at 7:10 PM Sebastian Berg wrote:

> [...]

[Numpy-discussion] next NumPy community meeting - Wednesday, March 26th, 2025 at 6 pm UTC

2025-03-22 Thread Inessa Pawson via NumPy-Discussion
The next NumPy community meeting will be held this Wednesday, March 26th at
18:00 UTC.
Join us via Zoom:
https://numfocus-org.zoom.us/j/83278611437?pwd=ekhoLzlHRjdWc0NOY2FQM0NPemdkZz09
Everyone is welcome and encouraged to attend.
To add topics you'd like to discuss to the meeting agenda, follow this
link: https://hackmd.io/76o-IxCjQX2mOXO_wwkcpg?both
For the notes from the previous meetings, visit:
https://github.com/numpy/archive/tree/main/community_meetings

-- 
Cheers,
Inessa

Inessa Pawson
GitHub: inessapawson


[Numpy-discussion] Re: array.T.max() vs array.max(axis=0) performance difference

2025-03-22 Thread Sebastian Berg
On Fri, 2025-03-21 at 23:22 +0100, Tiziano Zito via NumPy-Discussion
wrote:
> Hi George,
> 
> what you see is due to the memory layout of numpy arrays. If you
> switch your array to F-order you'll see that the two functions have
> the same timings, i.e. both are fast (on my machine 25 times faster
> for the 1_000_000 points case).


Memory order is indeed very important, but only indirectly so!

The thing is that NumPy tries to re-order most operations to match the
memory access order (the logic is not particularly advanced).

That is, NumPy can calculate this type of operation in two ways (the
inner loop is always 1-D, so in C the first, outer loop here is
separate):
1. for i in range(arr.shape[0]): res[i] = arr[i].max()
2. for i in range(arr.shape[1]): res[:] = maximum(res, arr[i])
   (For i == 0, you would just copy `arr[0]` or initialize `res` first.)
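
(A rough, runnable Python rendering of the two schemes for a 2-D array,
assuming `import numpy as np`; the comments note which axis each one
ends up reducing:)

def scheme_1(arr):
    # one C-level reduction call per outer index
    res = np.empty(arr.shape[0], dtype=arr.dtype)
    for i in range(arr.shape[0]):
        res[i] = arr[i].max()
    return res                              # == arr.max(axis=1)

def scheme_2(arr):
    # one elementwise `maximum` call per row, accumulating into res
    res = arr[0].copy()
    for i in range(1, arr.shape[0]):
        np.maximum(res, arr[i], out=res)
    return res                              # == arr.max(axis=0)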

NumPy chooses the first form here, precisely because it is (normally)
better for the memory layout.
But in approach 1, `arr[i].max()` is actually a function call inside
NumPy (deep in C, not on an actual NumPy array object, of course).

So for `arr.shape[1] == 2`, that just gets in the way because of
overheads!
Writing it as approach 2 is much better (memory order doesn't even
matter); calculating it as `maximum(arr[0], arr[1])` would be best.

But even for other small values of `arr.shape[1]` it would probably be
better to use the second version; memory order will start to matter
more and more (I could imagine that even for arr.shape[1] == 3 or 4 it
dominates quickly for huge arrays, but I didn't try).

So what you see should be the overhead of using approach 1.  Taking the
maximum of 2 elements is simply fast compared to even a lightweight
outer for loop plus a call to the function that takes the maximum.
(And the core function additionally does a bit of extra work to
optimize for many elements [1].)
You really don't need large inner loops to hide most of those
overheads, but 2 elements just isn't enough.
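
(To see that overhead directly, one can time two ways of computing the
very same reduction on a C-ordered array in IPython -- a sketch, no
numbers claimed:)

import numpy as np
arr = np.random.random((1_000_000, 2))
%timeit arr.max(axis=1)                   # length-2 inner loop, n outer steps
%timeit np.maximum(arr[:, 0], arr[:, 1])  # same values, one long inner loop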

Anyway, of course it could be cool to add a heuristic to pick approach
2 here when we obviously have `arr.shape[1] == 2`, or maybe
`arr.shape[1] < small_value`.
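
(To make that concrete, a hypothetical Python-level version of such a
heuristic -- the helper name and threshold are made up, and the real
cutoff would need benchmarking:)

import numpy as np

def max_axis0_heuristic(arr, small_value=8):
    if arr.ndim == 2 and arr.shape[1] < small_value:
        # the per-column reduction ("transpose") trick from this thread
        return np.array([arr[:, j].max() for j in range(arr.shape[1])])
    return arr.max(axis=0)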
There is also probably some low-hanging fruit for just reducing the
overheads here (for this specific case I know the outer iteration can
be optimized, although I am not sure it would improve things; also,
some inner-loop optimizations could have been moved out of the inner
loop for a few years now -- but only casts actually do so).

In the end, `maximum(arr[:, 0], arr[:, 1])` will always be the fastest
way for this specific thing, because it explicitly picks the clearly
best approach.
It would be good to close that gap at least a bit, though.

- Sebastian


[1] Locally, I can actually see a small speed-up if I pick less
optimized loops at run time via `NPY_DISABLE_CPU_FEATURES=AVX2`.

> 
> Try:
> 
> vertices = np.array(np.random.random((n, 2)), order='F')
> 
> When your array doesn't fit in L1-cache anymore, either order 'C' or
> order 'F' becomes (much) more efficient depending on which dimension
> you are internally looping through.
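> 
> (For an existing array, `np.asfortranarray` is a one-line way to get
> the same layout flip -- it copies only when the layout actually
> changes:)
> 
> vertices_f = np.asfortranarray(vertices)
> assert vertices_f.flags['F_CONTIGUOUS']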
> 
> You can read more about it here:
> https://numpy.org/doc/stable/dev/internals.html#internal-organization-of-numpy-arrays
> and
> https://numpy.org/doc/stable/reference/arrays.ndarray.html#internal-memory-layout-of-an-ndarray
> 
> Hope that helps,
> Tiziano
> 
> 
> On Fri 21 Mar, 12:10 +0200, George Tsiamasiotis via NumPy-Discussion
> wrote:
> > Hello NumPy community!
> > 
> > I was writing a function that calculates the bounding box of a
> > polygon (the smallest rectangle that fully contains the polygon,
> > and whose sides are parallel to the x and y axes). The input is an
> > (N, 2) array containing the vertices of the polygon, and the output
> > is a 4-tuple containing the coordinates of the 2 corners of the
> > bounding box.
> > 
> > I found two ways to do that using the np.min() and np.max()
> > methods, and I was surprised to see a significant speed difference,
> > even though they seemingly do the same thing.
> > 
> > While for small N the speed is essentially the same, the difference
> > becomes noticeable for larger N. From my testing, the difference
> > seems to plateau, with one way being around 4-5 times faster
> > than the other.
> > 
> > Is there an explanation for this?
> > 
> > Here is a small benchmark I wrote (must be executed with IPython):
> > 
> > import numpy as np
> > from IPython import get_ipython
> > 
> > vertices = np.random.random((1000, 2))
> > 
> > def calculate_bbox_normal(vertices: np.ndarray) -> tuple[np.float64, ...]:
> >     xmin, ymin = vertices.min(axis=0)
> >     xmax, ymax = vertices.max(axis=0)
> >     return xmin, ymin, xmax, ymax
> > 
> > def calculate_bbox_transpose(vertices: np.ndarray) -> tuple[np.float64, ...]:
> >     xmin = vertices.T[0].min()
> >     xmax = vertices.T[0].max()
> >     ymin = vertices.T[1].min()
> >     ymax = vertices.T[1].max()
> >     return xmin, ymin, xmax, ymax
> > 
> > bbox_normal = calculate_bbox_normal(vertices)
> > bbox_transpose = calculate_bbox_transpose(vertices)

[Numpy-discussion] Re: array.T.max() vs array.max(axis=0) performance difference

2025-03-22 Thread George Tsiamasiotis via NumPy-Discussion
Very interesting! Thanks for the quick response!