[Numpy-discussion] Speeding up `unique` and adding "kind" parameter
Dear all, There is a PR that adds a lookup table approach to `unique`, shown below. You can get up to ~16x speedup for large integer arrays, at the cost of potentially greater memory usage. https://github.com/numpy/numpy/pull/21843 This is controlled by a new `kind` parameter, which is described below. The current approach uses "sort" while the proposed change implements the "table" method. The default option None will automatically select "table" when possible and memory usage is not too large. ``` kind : {None, 'sort', 'table'}, optional The algorithm to use. This will not affect the final result, but will affect the speed and memory use. The default, None, will select automatically based on memory considerations. * If 'sort', will use a mergesort-based approach. * If 'table', will use a lookup table approach similar to a counting sort. This is only available for boolean and integer arrays. This will have a memory usage of the size of `ar` plus the max-min value of `ar`. The options `return_index`, `return_inverse`, `axis`, and `equal_nan` are unavailable with this option. * If None, will automatically choose 'table' if possible, and the required memory allocation is less than or equal to 6 times the size of `ar`. Will otherwise will use 'sort'. This is done to not use a large amount of memory by default, even though 'table' may be faster in most cases. ``` The method and API are very similar to that merged last week for `isin`: https://github.com/numpy/numpy/pull/12065/. One difference is that `return_counts` required a slightly modified approach–using `bincount` seems to work well for this. I am eager to hear your comments on this new PR. Thanks! Miles ___ NumPy-Discussion mailing list -- numpy-discussion@python.org To unsubscribe send an email to numpy-discussion-le...@python.org https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ Member address: arch...@mail-archive.com
[Numpy-discussion] Re: Speeding up `unique` and adding "kind" parameter
On Tue, Jun 28, 2022 at 6:33 PM Miles Cranmer wrote: > Dear all, > > There is a PR that adds a lookup table approach to `unique`, shown below. > You can get up to ~16x speedup for large integer arrays, at the cost of > potentially greater memory usage. > I've seen multiple requests for not sorting in `unique`, and the associated performance benefits can indeed be very large. So +1 in principle to add this option. > https://github.com/numpy/numpy/pull/21843 > > This is controlled by a new `kind` parameter, which is described below. > The current approach uses "sort" while the proposed change implements the > "table" method. The default option None will automatically select "table" > when possible and memory usage is not too large. > You cannot switch the default behavior, that will break backwards compatibility. > ``` > kind : {None, 'sort', 'table'}, optional > Regarding the name, `'table'` is an implementation detail. The end user should not have to care what the data structure is that is used. I suggest to use something like "unsorted" and just explain it as the ordering of results being undefined, which can give significant performance benefits. Cheers, Ralf The algorithm to use. This will not affect the final result, > but will affect the speed and memory use. The default, None, > will select automatically based on memory considerations. > > * If 'sort', will use a mergesort-based approach. > * If 'table', will use a lookup table approach similar > to a counting sort. This is only available for boolean and > integer arrays. This will have a memory usage of the > size of `ar` plus the max-min value of `ar`. The options > `return_index`, `return_inverse`, `axis`, and `equal_nan` > are unavailable with this option. > * If None, will automatically choose 'table' if possible, > and the required memory allocation is less than or equal to > 6 times the size of `ar`. Will otherwise will use 'sort'. > This is done to not use a large amount of memory by default, > even though 'table' may be faster in most cases. > ``` > The method and API are very similar to that merged last week for `isin`: > https://github.com/numpy/numpy/pull/12065/. One difference is that > `return_counts` required a slightly modified approach–using `bincount` > seems to work well for this. > > I am eager to hear your comments on this new PR. > > Thanks! > Miles > ___ > NumPy-Discussion mailing list -- numpy-discussion@python.org > To unsubscribe send an email to numpy-discussion-le...@python.org > https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ > Member address: ralf.gomm...@googlemail.com > ___ NumPy-Discussion mailing list -- numpy-discussion@python.org To unsubscribe send an email to numpy-discussion-le...@python.org https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ Member address: arch...@mail-archive.com
[Numpy-discussion] Re: Speeding up `unique` and adding "kind" parameter
On Tue, 28 Jun 2022, 6:50 pm Ralf Gommers, wrote: > > >> ``` >> kind : {None, 'sort', 'table'}, optional >> > > Regarding the name, `'table'` is an implementation detail. The end user > should not have to care what the data structure is that is used. I suggest > to use something like "unsorted" and just explain it as the ordering of > results being undefined, which can give significant performance benefits. > But that suggests that, if I wanted them sorted, I should use the "sorted" kind, but it is probably faster to do a table unique and sort the results. There are two concerns from the point of view of the user. One is the sorting of the results, and the other is memory usage. I suggest adding two boolean flags, "low_memory" (following sklearn, and I think, scipy too), and "sorted". Depending on the algorithm, "sorted=True" will perform a sort, or do nothing. /David > Cheers, > Ralf > > The algorithm to use. This will not affect the final result, >> but will affect the speed and memory use. The default, None, >> will select automatically based on memory considerations. >> >> * If 'sort', will use a mergesort-based approach. >> * If 'table', will use a lookup table approach similar >> to a counting sort. This is only available for boolean and >> integer arrays. This will have a memory usage of the >> size of `ar` plus the max-min value of `ar`. The options >> `return_index`, `return_inverse`, `axis`, and `equal_nan` >> are unavailable with this option. >> * If None, will automatically choose 'table' if possible, >> and the required memory allocation is less than or equal to >> 6 times the size of `ar`. Will otherwise will use 'sort'. >> This is done to not use a large amount of memory by default, >> even though 'table' may be faster in most cases. >> ``` >> The method and API are very similar to that merged last week for `isin`: >> https://github.com/numpy/numpy/pull/12065/. One difference is that >> `return_counts` required a slightly modified approach–using `bincount` >> seems to work well for this. >> >> I am eager to hear your comments on this new PR. >> >> Thanks! >> Miles >> ___ >> NumPy-Discussion mailing list -- numpy-discussion@python.org >> To unsubscribe send an email to numpy-discussion-le...@python.org >> https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ >> Member address: ralf.gomm...@googlemail.com >> > ___ > NumPy-Discussion mailing list -- numpy-discussion@python.org > To unsubscribe send an email to numpy-discussion-le...@python.org > https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ > Member address: davidmen...@gmail.com > ___ NumPy-Discussion mailing list -- numpy-discussion@python.org To unsubscribe send an email to numpy-discussion-le...@python.org https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ Member address: arch...@mail-archive.com
[Numpy-discussion] Re: Speeding up `unique` and adding "kind" parameter
Thanks for the comments Ralf! > You cannot switch the default behavior, that will break backwards > compatibility. The default `kind=None` have no effect on input/output behavior of the function. The only changes a user will see are in terms of speed and memory usage. `unique` will select this new algorithm `"table"` only if it is available (integral array, no axis specified, return_index and return_inverse set to False) and the required memory allocation is not too big (which is arbitrarily defined as six times the allocation of the input array - similar to what the sorting method use). Using `kind="table"` won't affect the input/output either, but it is only available for certain arrays (somewhat similar to the usage of `assume_unique` for `np.isin`). Does this sound fine to you? > Regarding the name, `'table'` is an implementation detail. The end user > should not have to care what the data structure is that is used. I suggest to > use something like "unsorted" and just explain it as the ordering of results > being undefined, which can give significant performance benefits. The names `'table'` and `'sort'` were selected for consistency as these names were recently put in place for `np.isin` and `np.in1d`, and the analogous methods used for `unique` are conceptually similar. I have no particular attachment to either name though. Thanks again! Miles ___ NumPy-Discussion mailing list -- numpy-discussion@python.org To unsubscribe send an email to numpy-discussion-le...@python.org https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ Member address: arch...@mail-archive.com
[Numpy-discussion] Re: Speeding up `unique` and adding "kind" parameter
Ah, I did not clarify this: `kind="table"` will *also* return a sorted array. It simply does not use a sorting algorithm to get to it. This is because the table is generated using `np.arange` (i.e., already sorted) which is then masked. ___ NumPy-Discussion mailing list -- numpy-discussion@python.org To unsubscribe send an email to numpy-discussion-le...@python.org https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ Member address: arch...@mail-archive.com
[Numpy-discussion] Re: Speeding up `unique` and adding "kind" parameter
On Tue, Jun 28, 2022 at 7:21 PM Miles Cranmer wrote: > Thanks for the comments Ralf! > > > You cannot switch the default behavior, that will break backwards > compatibility. > > The default `kind=None` have no effect on input/output behavior of the > function. The only changes a user will see are in terms of speed and memory > usage. `unique` will select this new algorithm `"table"` only if it is > available (integral array, no axis specified, return_index and > return_inverse set to False) and the required memory allocation is not too > big (which is arbitrarily defined as six times the allocation of the input > array - similar to what the sorting method use). Using `kind="table"` won't > affect the input/output either, but it is only available for certain arrays > (somewhat similar to the usage of `assume_unique` for `np.isin`). Does this > sound fine to you? > Ah, I didn't get from the initial description that the results would still be sorted. Then I'd say that I'm not sure that: 1. I'm not sure that the performance vs. complexity trade-off is worth it. But I'll leave that to others; Sebastian seemed happy to merge the similar change in `in1d` 2. If you're interested in working more on this, implementing an unsorted option would be more interesting imho - significantly more performance benefits, and not just for integers or at the cost of more memory use. Cheers, Ralf > > Regarding the name, `'table'` is an implementation detail. The end user > should not have to care what the data structure is that is used. I suggest > to use something like "unsorted" and just explain it as the ordering of > results being undefined, which can give significant performance benefits. > > The names `'table'` and `'sort'` were selected for consistency as these > names were recently put in place for `np.isin` and `np.in1d`, and the > analogous methods used for `unique` are conceptually similar. I have no > particular attachment to either name though. > > Thanks again! > Miles > ___ > NumPy-Discussion mailing list -- numpy-discussion@python.org > To unsubscribe send an email to numpy-discussion-le...@python.org > https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ > Member address: ralf.gomm...@googlemail.com > ___ NumPy-Discussion mailing list -- numpy-discussion@python.org To unsubscribe send an email to numpy-discussion-le...@python.org https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ Member address: arch...@mail-archive.com
[Numpy-discussion] Re: Speeding up `unique` and adding "kind" parameter
Regarding 2., did you have a particular approach in mind? This new lookup table method is already O(n) scaling (similar to a counting sort), so I cannot fathom a method that, as you suggest, would get significantly better performance for integer arrays. The sorting here is "free" in some sense since you aren't spending additional cycles, the table is initialized pre-sorted. I could see in the future that one could create a similar approach for arbitrary datatypes, and you would use a hash table rather than a fixed-size array. In this case you would similarly get O(n) performance, and would produce an **unsorted** output. But in any case a hash table would probably be much less efficient for integers than an array as implemented here - the latter which also gives you a sorted output as a side effect. Personally I have used np.unique for integer data far more than any other type, and I would _guess_ it is similar in the broader community (though I have no data or insight to support that)–so I think having a big speedup like this could be quite nice for users. Thanks, Miles ___ NumPy-Discussion mailing list -- numpy-discussion@python.org To unsubscribe send an email to numpy-discussion-le...@python.org https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ Member address: arch...@mail-archive.com