[Numpy-discussion] Re: array.T.max() vs array.max(axis=0) performance difference
Sebastian, I think the core issue you pointed out is indeed correct, but your
detailed explanation is backwards: `maximum(arr[:, 0], arr[:, 1])` implements
`arr.max(axis=1)`, not `arr.max(axis=0)`. So the OP's transpose method is
essentially approach 1, which for this array shape has less overhead than
approach 2 (and since m = 2 is small enough, it doesn't matter that we pay
this overhead in Python instead of in C).

From my tests, when n = 1_000_000 the advantage of the transpose method
persists up to m = 6. But for smaller n the normal method becomes better
again, so it's not that clear cut (although it would be nice to have a way to
use the transpose method but pay the overhead in C instead of Python!).

Test code:

import numpy as np
from IPython import get_ipython

def calculate_bbox_normal(vertices: np.ndarray):
    return vertices.min(axis=0), vertices.max(axis=0)

def calculate_bbox_transpose(vertices: np.ndarray):
    return (np.array([col.min() for col in vertices.T]),
            np.array([col.max() for col in vertices.T]))

n = 1_000_000
for m in range(4, 9):
    print(f"shape: {n, m}")
    vertices = np.random.random((n, m))
    vmin, vmax = calculate_bbox_normal(vertices)
    vmin_t, vmax_t = calculate_bbox_transpose(vertices)
    assert (vmin == vmin_t).all()
    assert (vmax == vmax_t).all()
    print("Normal: ", end="")
    get_ipython().run_line_magic("timeit", "calculate_bbox_normal(vertices)")
    print("Transpose: ", end="")
    get_ipython().run_line_magic("timeit", "calculate_bbox_transpose(vertices)")
    print()

- Fang

On Sat, Mar 22, 2025 at 7:10 PM Sebastian Berg wrote:
> On Fri, 2025-03-21 at 23:22 +0100, Tiziano Zito via NumPy-Discussion
> wrote:
> > Hi George,
> >
> > what you see is due to the memory layout of numpy arrays. If you
> > switch your array to F-order you'll see that the two functions have
> > the same timings, i.e. both are fast (on my machine 25 times faster
> > for the 1_000_000 points case).
>
> Memory order is indeed very important, but only indirectly so here!
> The thing is that NumPy tries to re-order most operations for memory
> access order (nothing particularly advanced).
>
> I.e. NumPy can calculate this type of operation in two ways (the inner
> loop is always 1-D and in C; the first, outer loop here is separate):
> 1. for i in arr.shape[0]: res[i] = arr[i].max()
> 2. for i in arr.shape[1]: res[:] = maximum(res, arr[i])
> (For i == 0, you would just copy `arr[0]` or initialize `res` first.)
>
> NumPy chooses the first form here, precisely because it is (normally)
> better for the memory layout.
> But for 1., `arr[i].max()` is actually a function call inside NumPy
> (deep in C, not on an actual NumPy array object, of course).
>
> So for `arr.shape[1] == 2`, that just gets in the way because of
> overheads! Writing it as 2. is much better (memory order doesn't even
> matter); calculating it as `maximum(arr[0], arr[1])` would be the best.
>
> But even for other small values of `arr.shape[1]` it would probably be
> better to use the second version; memory order will start to matter
> more and more (I could imagine it dominates quickly for huge arrays
> even at arr.shape[1] == 3 or 4, but I didn't try).
>
> So what you see should be the overhead of using approach 1. Taking the
> maximum of 2 elements is simply fast compared to even a lightweight
> outer for loop plus a call to the function that takes the maximum.
> (And additionally, the core function does a bit of extra work to
> optimize for many elements [1].)
> You really don't need large inner loops to hide most of those
> overheads, but 2 elements just isn't enough.
>
> Anyway, it could of course be cool to add a heuristic to pick approach
> 2 here when we obviously have `arr.shape[1] == 2`, or maybe
> `arr.shape[1] < small_value`.
> There are also probably some low-hanging fruits for just reducing the
> overheads here (for this specific case I know the outer iteration can
> be optimized, although I am not sure it would improve things; also,
> some inner-loop optimizations could have been moved out of the inner
> loop for a few years now -- but only casts actually do so).
>
> In the end, `maximum(arr[:, 0], arr[:, 1])` will always be the fastest
> way for this specific thing, because it explicitly picks the clearly
> best approach.
> It would be good to close that gap at least a bit, though.
>
> - Sebastian
>
> [1] Locally I can actually see a small speed-up if I pick less
> optimized loops at run-time via `NPY_DISABLE_CPU_FEATURES=AVX2`.
>
> > Try:
> >
> > vertices = np.array(np.random.random((n, 2)), order='F')
> >
> > When your array doesn't fit in L1 cache anymore, either order 'C' or
> > order 'F' becomes (much) more efficient depending on which dimension
> > you are internally looping through.
> >
> > You can read more about it here:
> > https://numpy.org/doc/stable/dev/internals.html#internal-organization-of-numpy-arrays
> > and
> > https://numpy.org/doc/stable/reference/arrays.ndarray.html#internal-memory-layout-of-an-ndarray
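To make the trade-off concrete, here is a minimal sketch of the two access
patterns for `arr.max(axis=0)` on a C-order array. It is illustrative only:
plain Python stands in for NumPy's C-level outer loop, and the variable names
are mine; the point is the call-count trade-off, not the absolute speed.

    import numpy as np

    arr = np.random.random((10_000, 2))  # keep n modest; the loop below is illustrative

    # Pattern A (what the transpose method does): one reduction call per
    # column -- arr.shape[1] calls, each reducing a long (strided) column.
    per_column = np.array([arr[:, j].max() for j in range(arr.shape[1])])

    # Pattern B: a running elementwise maximum over rows -- arr.shape[0]
    # calls, each touching only 2 elements, so per-call overhead dominates.
    running = arr[0].copy()
    for row in arr[1:]:
        running = np.maximum(running, row)

    assert np.array_equal(per_column, arr.max(axis=0))
    assert np.array_equal(running, arr.max(axis=0))

Pattern A pays per-call overhead only arr.shape[1] times, which is why the
transpose method wins while m stays small.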
[Numpy-discussion] next NumPy community meeting - Wednesday, March 26th, 2025 at 6 pm UTC
The next NumPy community meeting will be held this Wednesday, March 26th at
18:00 UTC.

Join us via Zoom:
https://numfocus-org.zoom.us/j/83278611437?pwd=ekhoLzlHRjdWc0NOY2FQM0NPemdkZz09

Everyone is welcome and encouraged to attend. To add the topics you'd like to
discuss to the meeting agenda, follow the link:
https://hackmd.io/76o-IxCjQX2mOXO_wwkcpg?both

For the notes from previous meetings, visit:
https://github.com/numpy/archive/tree/main/community_meetings

--
Cheers,
Inessa

Inessa Pawson
GitHub: inessapawson
[Numpy-discussion] Re: array.T.max() vs array.max(axis=0) performance difference
On Fri, 2025-03-21 at 23:22 +0100, Tiziano Zito via NumPy-Discussion wrote:
> Hi George,
>
> what you see is due to the memory layout of numpy arrays. If you
> switch your array to F-order you'll see that the two functions have
> the same timings, i.e. both are fast (on my machine 25 times faster
> for the 1_000_000 points case).

Memory order is indeed very important, but only indirectly so here!
The thing is that NumPy tries to re-order most operations for memory
access order (nothing particularly advanced).

I.e. NumPy can calculate this type of operation in two ways (the inner
loop is always 1-D and in C; the first, outer loop here is separate):

1. for i in arr.shape[0]: res[i] = arr[i].max()
2. for i in arr.shape[1]: res[:] = maximum(res, arr[i])

(For i == 0, you would just copy `arr[0]` or initialize `res` first.)

NumPy chooses the first form here, precisely because it is (normally)
better for the memory layout.
But for 1., `arr[i].max()` is actually a function call inside NumPy
(deep in C, not on an actual NumPy array object, of course).

So for `arr.shape[1] == 2`, that just gets in the way because of
overheads! Writing it as 2. is much better (memory order doesn't even
matter); calculating it as `maximum(arr[0], arr[1])` would be the best.

But even for other small values of `arr.shape[1]` it would probably be
better to use the second version; memory order will start to matter
more and more (I could imagine it dominates quickly for huge arrays
even at arr.shape[1] == 3 or 4, but I didn't try).

So what you see should be the overhead of using approach 1. Taking the
maximum of 2 elements is simply fast compared to even a lightweight
outer for loop plus a call to the function that takes the maximum.
(And additionally, the core function does a bit of extra work to
optimize for many elements [1].)
You really don't need large inner loops to hide most of those
overheads, but 2 elements just isn't enough.

Anyway, it could of course be cool to add a heuristic to pick approach
2 here when we obviously have `arr.shape[1] == 2`, or maybe
`arr.shape[1] < small_value`.
There are also probably some low-hanging fruits for just reducing the
overheads here (for this specific case I know the outer iteration can
be optimized, although I am not sure it would improve things; also,
some inner-loop optimizations could have been moved out of the inner
loop for a few years now -- but only casts actually do so).

In the end, `maximum(arr[:, 0], arr[:, 1])` will always be the fastest
way for this specific thing, because it explicitly picks the clearly
best approach.
It would be good to close that gap at least a bit, though.

- Sebastian

[1] Locally I can actually see a small speed-up if I pick less
optimized loops at run-time via `NPY_DISABLE_CPU_FEATURES=AVX2`.

> Try:
>
> vertices = np.array(np.random.random((n, 2)), order='F')
>
> When your array doesn't fit in L1 cache anymore, either order 'C' or
> order 'F' becomes (much) more efficient depending on which dimension
> you are internally looping through.
>
> You can read more about it here:
> https://numpy.org/doc/stable/dev/internals.html#internal-organization-of-numpy-arrays
> and
> https://numpy.org/doc/stable/reference/arrays.ndarray.html#internal-memory-layout-of-an-ndarray
>
> Hope that helps,
> Tiziano
>
>
> On Fri 21 Mar, 12:10 +0200, George Tsiamasiotis via NumPy-Discussion
> wrote:
> > Hello NumPy community!
> >
> > I was writing a function that calculates the bounding box of a
> > polygon (the smallest rectangle that fully contains the polygon,
> > and whose sides are parallel to the x and y axes). The input is an
> > (N, 2) array containing the vertices of the polygon, and the output
> > is a 4-tuple containing the vertices of the 2 corners of the
> > bounding box.
> >
> > I found two ways to do that using the np.min() and np.max()
> > methods, and I was surprised to see a significant speed difference,
> > even though they seemingly do the same thing.
> >
> > While for small N the speed is essentially the same, the difference
> > becomes noticeable for larger N. From my testing, the difference
> > seems to plateau, with the one way being around 4-5 times faster
> > than the other.
> >
> > Is there an explanation for this?
> >
> > Here is a small benchmark I wrote (must be executed with IPython):
> >
> > import numpy as np
> > from IPython import get_ipython
> >
> > vertices = np.random.random((1000, 2))
> >
> > def calculate_bbox_normal(vertices: np.ndarray) -> tuple[np.float64]:
> >     xmin, ymin = vertices.min(axis=0)
> >     xmax, ymax = vertices.max(axis=0)
> >     return xmin, ymin, xmax, ymax
> >
> > def calculate_bbox_transpose(vertices: np.ndarray) -> tuple[np.float64]:
> >     xmin = vertices.T[0].min()
> >     xmax = vertices.T[0].max()
> >     ymin = vertices.T[1].min()
> >     ymax = vertices.T[1].max()
> >     return xmin, ymin, xmax, ymax
> >
> > bbox_normal = calculate_bbox_normal(vertices)
> > bbox_transpose = calculate_bbox_transpose(vertices)
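For completeness, here is Tiziano's order='F' suggestion from above as a
small runnable sketch, assuming you can afford the one-time copy (the F-order
copy itself costs a pass over the data, so it pays off mainly when the array
is reduced more than once):

    import numpy as np

    n = 1_000_000
    vertices = np.random.random((n, 2))       # C-order: rows are contiguous
    vertices_f = np.asfortranarray(vertices)  # F-order copy: columns are contiguous

    # With F-order, each column reduction streams through contiguous memory,
    # so the plain axis=0 min/max is fast without any transpose tricks.
    vmin, vmax = vertices_f.min(axis=0), vertices_f.max(axis=0)

    assert np.array_equal(vmin, vertices.min(axis=0))
    assert np.array_equal(vmax, vertices.max(axis=0))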
[Numpy-discussion] Re: array.T.max() vs array.max(axis=0) performance difference
Very interesting! Thanks for the quick response!