[Numpy-discussion] Fastpathing lexsort for integers

2024-05-07 Thread ml
Hi all (my first message in this list),

I have written some (Python) code to make lexsort on (lists of arrays of) 
integers more efficient, by "merging" all levels in a single array. I'm asking 
for some feedback between doing a PR.

Compared to the current code, there is an improvement as long as:
- the arrays (levels) are long at least 10^3 - the improvement becomes 
significant (~3x) above 10^4
- the sum of bits required for the representation of each level is no larger 
than the platform uint's bits; otherwise, my code falls back to Python int, and 
becomes way less performant.

My rough guess it that the second condition holds in the vast majority of use 
cases (including a planned use in scipy - see 
https://github.com/scipy/scipy/issues/20151 for context).

The code is here: https://github.com/toobaz/numpy/commit/aa5c6948

I know that the current form is not well written - at least _int_lexsort seems 
out of place, and imports placed inside the function - but I didn't know in 
which file it would belong (and anyway, this version makes a quick glance 
easy). In any case, feedback is welcome as to what should go where.

For the records, the approach is similar to that employed in pandas 
MultiIndexes: https://github.com/pandas-dev/pandas/pull/19074

API-wise, there are various possibilities to do this:

- the current status is that a new parameter "int_path" is added, and this 
determines which path to take; so nothing changes unless the user wants to use 
the new path

- alternatively, I could check if all arrays are int (O(1)) and their 
maxima/minima (O(n)) and use the optimized code path only if it relies on uint, 
not Python int. The downside is that I would (slightly?) slow down those cases 
which do fallback to the old code.

- moreover, in the current version "int_path" can take as value either a bool 
or a sorting algorithm, as understood by np.argsort. But an alternative is that 
we add a new argument "kind" to lexsort (consistently with argsort), which at 
the moment would raise NotImplementedError in the "old" path. By the way, my C 
is too rusty to really understand what goes on here:
https://github.com/numpy/numpy/blob/main/numpy/_core/src/multiarray/item_selection.c#L1796
... but maybe it wouldn't be so difficult to support switching "kind" also in 
the "general" lexsort (old code path).

- note that lexsort is guaranteed to be stable, while the new code path is 
stable only if "kind" is set to e.g. 'stable' (the default for argsort being 
'quicksort')

The current code misses tests (I'll be happy to add some once someone confirms 
that there is interest for this contribution), but some (performance) tests I 
ran are here:
https://github.com/toobaz/int_lexsort/blob/main/Show%20benchmarks.ipynb

Thanks for your attention,

Pietro Battiston
___
NumPy-Discussion mailing list -- numpy-discussion@python.org
To unsubscribe send an email to numpy-discussion-le...@python.org
https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
Member address: arch...@mail-archive.com


[Numpy-discussion] Re: Fastpathing lexsort for integers

2024-05-11 Thread ml
Any feedback, even just on where to locate the Python code I wrote?

Otherwise I will try to just open a PR and see how it goes.

Thanks,

Pietro
___
NumPy-Discussion mailing list -- numpy-discussion@python.org
To unsubscribe send an email to numpy-discussion-le...@python.org
https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
Member address: arch...@mail-archive.com


[Numpy-discussion] Re: Fastpathing lexsort for integers

2024-05-14 Thread ml
Dear Nathan,

thanks for your feedback! I agree a PR is a better way to discuss code, I had 
written to the ML and waited for a reply just because the PR template on github 
recommends "IF IT'S A NEW FEATURE OR API CHANGE, TEST THE WATERS" with a 
pointer to the ML.

... but anyway, here is the PR: https://github.com/numpy/numpy/pull/26434 And 
indeed, this version has no API changes!

I followed some of your suggestions, in particular the heuristics to decide 
which code path to take. However I didn't port to C all my code. Let me know if 
my contribution is interesting anyway.

Cheers,

Pietro
___
NumPy-Discussion mailing list -- numpy-discussion@python.org
To unsubscribe send an email to numpy-discussion-le...@python.org
https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
Member address: arch...@mail-archive.com