Hi all (my first message in this list),
I have written some (Python) code to make lexsort on (lists of arrays of)
integers more efficient, by "merging" all levels in a single array. I'm asking
for some feedback between doing a PR.
Compared to the current code, there is an improvement as long as:
- the arrays (levels) are long at least 10^3 - the improvement becomes
significant (~3x) above 10^4
- the sum of bits required for the representation of each level is no larger
than the platform uint's bits; otherwise, my code falls back to Python int, and
becomes way less performant.
My rough guess it that the second condition holds in the vast majority of use
cases (including a planned use in scipy - see
https://github.com/scipy/scipy/issues/20151 for context).
The code is here: https://github.com/toobaz/numpy/commit/aa5c6948
I know that the current form is not well written - at least _int_lexsort seems
out of place, and imports placed inside the function - but I didn't know in
which file it would belong (and anyway, this version makes a quick glance
easy). In any case, feedback is welcome as to what should go where.
For the records, the approach is similar to that employed in pandas
MultiIndexes: https://github.com/pandas-dev/pandas/pull/19074
API-wise, there are various possibilities to do this:
- the current status is that a new parameter "int_path" is added, and this
determines which path to take; so nothing changes unless the user wants to use
the new path
- alternatively, I could check if all arrays are int (O(1)) and their
maxima/minima (O(n)) and use the optimized code path only if it relies on uint,
not Python int. The downside is that I would (slightly?) slow down those cases
which do fallback to the old code.
- moreover, in the current version "int_path" can take as value either a bool
or a sorting algorithm, as understood by np.argsort. But an alternative is that
we add a new argument "kind" to lexsort (consistently with argsort), which at
the moment would raise NotImplementedError in the "old" path. By the way, my C
is too rusty to really understand what goes on here:
https://github.com/numpy/numpy/blob/main/numpy/_core/src/multiarray/item_selection.c#L1796
... but maybe it wouldn't be so difficult to support switching "kind" also in
the "general" lexsort (old code path).
- note that lexsort is guaranteed to be stable, while the new code path is
stable only if "kind" is set to e.g. 'stable' (the default for argsort being
'quicksort')
The current code misses tests (I'll be happy to add some once someone confirms
that there is interest for this contribution), but some (performance) tests I
ran are here:
https://github.com/toobaz/int_lexsort/blob/main/Show%20benchmarks.ipynb
Thanks for your attention,
Pietro Battiston
___
NumPy-Discussion mailing list -- numpy-discussion@python.org
To unsubscribe send an email to numpy-discussion-le...@python.org
https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
Member address: arch...@mail-archive.com