[Numpy-discussion] Speeding up `unique` and adding "kind" parameter

2022-06-28 Thread Miles Cranmer
Dear all,

There is a PR that adds a lookup table approach to `unique`, shown below. You 
can get up to ~16x speedup for large integer arrays, at the cost of potentially 
greater memory usage.

https://github.com/numpy/numpy/pull/21843

This is controlled by a new `kind` parameter, which is described below. The 
current approach uses "sort" while the proposed change implements the "table" 
method. The default option None will automatically select "table" when possible 
and memory usage is not too large.
```
kind : {None, 'sort', 'table'}, optional
The algorithm to use. This will not affect the final result,
but will affect the speed and memory use. The default, None,
will select automatically based on memory considerations.

* If 'sort', will use a mergesort-based approach.
* If 'table', will use a lookup table approach similar
  to a counting sort. This is only available for boolean and
  integer arrays. This will have a memory usage of the
  size of `ar` plus the max-min value of `ar`. The options
  `return_index`, `return_inverse`, `axis`, and `equal_nan`
  are unavailable with this option.
* If None, will automatically choose 'table' if possible,
  and the required memory allocation is less than or equal to
  6 times the size of `ar`. Will otherwise will use 'sort'.
  This is done to not use a large amount of memory by default,
  even though 'table' may be faster in most cases.
```
The method and API are very similar to that merged last week for `isin`: 
https://github.com/numpy/numpy/pull/12065/. One difference is that 
`return_counts` required a slightly modified approach–using `bincount` seems to 
work well for this.

I am eager to hear your comments on this new PR.

Thanks!
Miles
___
NumPy-Discussion mailing list -- numpy-discussion@python.org
To unsubscribe send an email to numpy-discussion-le...@python.org
https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
Member address: arch...@mail-archive.com


[Numpy-discussion] Re: Speeding up `unique` and adding "kind" parameter

2022-06-28 Thread Ralf Gommers
On Tue, Jun 28, 2022 at 6:33 PM Miles Cranmer 
wrote:

> Dear all,
>
> There is a PR that adds a lookup table approach to `unique`, shown below.
> You can get up to ~16x speedup for large integer arrays, at the cost of
> potentially greater memory usage.
>

I've seen multiple requests for not sorting in `unique`, and the associated
performance benefits can indeed be very large. So +1 in principle to add
this option.


> https://github.com/numpy/numpy/pull/21843
>
> This is controlled by a new `kind` parameter, which is described below.
> The current approach uses "sort" while the proposed change implements the
> "table" method. The default option None will automatically select "table"
> when possible and memory usage is not too large.
>

You cannot switch the default behavior, that will break backwards
compatibility.


> ```
> kind : {None, 'sort', 'table'}, optional
>

Regarding the name, `'table'` is an implementation detail. The end user
should not have to care what the data structure is that is used. I suggest
to use something like "unsorted" and just explain it as the ordering of
results being undefined, which can give significant performance benefits.

Cheers,
Ralf

The algorithm to use. This will not affect the final result,
> but will affect the speed and memory use. The default, None,
> will select automatically based on memory considerations.
>
> * If 'sort', will use a mergesort-based approach.
> * If 'table', will use a lookup table approach similar
>   to a counting sort. This is only available for boolean and
>   integer arrays. This will have a memory usage of the
>   size of `ar` plus the max-min value of `ar`. The options
>   `return_index`, `return_inverse`, `axis`, and `equal_nan`
>   are unavailable with this option.
> * If None, will automatically choose 'table' if possible,
>   and the required memory allocation is less than or equal to
>   6 times the size of `ar`. Will otherwise will use 'sort'.
>   This is done to not use a large amount of memory by default,
>   even though 'table' may be faster in most cases.
> ```
> The method and API are very similar to that merged last week for `isin`:
> https://github.com/numpy/numpy/pull/12065/. One difference is that
> `return_counts` required a slightly modified approach–using `bincount`
> seems to work well for this.
>
> I am eager to hear your comments on this new PR.
>
> Thanks!
> Miles
> ___
> NumPy-Discussion mailing list -- numpy-discussion@python.org
> To unsubscribe send an email to numpy-discussion-le...@python.org
> https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
> Member address: ralf.gomm...@googlemail.com
>
___
NumPy-Discussion mailing list -- numpy-discussion@python.org
To unsubscribe send an email to numpy-discussion-le...@python.org
https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
Member address: arch...@mail-archive.com


[Numpy-discussion] Re: Speeding up `unique` and adding "kind" parameter

2022-06-28 Thread David Menéndez Hurtado
On Tue, 28 Jun 2022, 6:50 pm Ralf Gommers,  wrote:

>
>
>> ```
>> kind : {None, 'sort', 'table'}, optional
>>
>
> Regarding the name, `'table'` is an implementation detail. The end user
> should not have to care what the data structure is that is used. I suggest
> to use something like "unsorted" and just explain it as the ordering of
> results being undefined, which can give significant performance benefits.
>

But that suggests that, if I wanted them  sorted, I should use the "sorted"
kind, but it is probably faster to do a table unique and sort the results.

There are two concerns from the point of view of the user. One is the
sorting of the results, and the other is memory usage. I suggest adding two
boolean flags, "low_memory" (following sklearn, and I think, scipy too),
and "sorted". Depending on the algorithm, "sorted=True" will perform a
sort, or do nothing.

/David


> Cheers,
> Ralf
>
> The algorithm to use. This will not affect the final result,
>> but will affect the speed and memory use. The default, None,
>> will select automatically based on memory considerations.
>>
>> * If 'sort', will use a mergesort-based approach.
>> * If 'table', will use a lookup table approach similar
>>   to a counting sort. This is only available for boolean and
>>   integer arrays. This will have a memory usage of the
>>   size of `ar` plus the max-min value of `ar`. The options
>>   `return_index`, `return_inverse`, `axis`, and `equal_nan`
>>   are unavailable with this option.
>> * If None, will automatically choose 'table' if possible,
>>   and the required memory allocation is less than or equal to
>>   6 times the size of `ar`. Will otherwise will use 'sort'.
>>   This is done to not use a large amount of memory by default,
>>   even though 'table' may be faster in most cases.
>> ```
>> The method and API are very similar to that merged last week for `isin`:
>> https://github.com/numpy/numpy/pull/12065/. One difference is that
>> `return_counts` required a slightly modified approach–using `bincount`
>> seems to work well for this.
>>
>> I am eager to hear your comments on this new PR.
>>
>> Thanks!
>> Miles
>> ___
>> NumPy-Discussion mailing list -- numpy-discussion@python.org
>> To unsubscribe send an email to numpy-discussion-le...@python.org
>> https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
>> Member address: ralf.gomm...@googlemail.com
>>
> ___
> NumPy-Discussion mailing list -- numpy-discussion@python.org
> To unsubscribe send an email to numpy-discussion-le...@python.org
> https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
> Member address: davidmen...@gmail.com
>
___
NumPy-Discussion mailing list -- numpy-discussion@python.org
To unsubscribe send an email to numpy-discussion-le...@python.org
https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
Member address: arch...@mail-archive.com


[Numpy-discussion] Re: Speeding up `unique` and adding "kind" parameter

2022-06-28 Thread Miles Cranmer
Thanks for the comments Ralf!

> You cannot switch the default behavior, that will break backwards 
> compatibility.

The default `kind=None` have no effect on input/output behavior of the 
function. The only changes a user will see are in terms of speed and memory 
usage. `unique` will select this new algorithm `"table"` only if it is 
available (integral array, no axis specified, return_index and return_inverse 
set to False) and the required memory allocation is not too big (which is 
arbitrarily defined as six times the allocation of the input array - similar to 
what the sorting method use). Using `kind="table"` won't affect the 
input/output either, but it is only available for certain arrays (somewhat 
similar to the usage of `assume_unique` for `np.isin`). Does this sound fine to 
you?

> Regarding the name, `'table'` is an implementation detail. The end user 
> should not have to care what the data structure is that is used. I suggest to 
> use something like "unsorted" and just explain it as the ordering of results 
> being undefined, which can give significant performance benefits.

The names `'table'` and `'sort'` were selected for consistency as these names 
were recently put in place for `np.isin` and `np.in1d`, and the analogous 
methods used for `unique` are conceptually similar. I have no particular 
attachment to either name though.

Thanks again!
Miles
___
NumPy-Discussion mailing list -- numpy-discussion@python.org
To unsubscribe send an email to numpy-discussion-le...@python.org
https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
Member address: arch...@mail-archive.com


[Numpy-discussion] Re: Speeding up `unique` and adding "kind" parameter

2022-06-28 Thread Miles Cranmer
Ah, I did not clarify this: `kind="table"` will *also* return a sorted array. 
It simply does not use a sorting algorithm to get to it. This is because the 
table is generated using `np.arange` (i.e., already sorted) which is then 
masked.
___
NumPy-Discussion mailing list -- numpy-discussion@python.org
To unsubscribe send an email to numpy-discussion-le...@python.org
https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
Member address: arch...@mail-archive.com


[Numpy-discussion] Re: Speeding up `unique` and adding "kind" parameter

2022-06-28 Thread Ralf Gommers
On Tue, Jun 28, 2022 at 7:21 PM Miles Cranmer 
wrote:

> Thanks for the comments Ralf!
>
> > You cannot switch the default behavior, that will break backwards
> compatibility.
>
> The default `kind=None` have no effect on input/output behavior of the
> function. The only changes a user will see are in terms of speed and memory
> usage. `unique` will select this new algorithm `"table"` only if it is
> available (integral array, no axis specified, return_index and
> return_inverse set to False) and the required memory allocation is not too
> big (which is arbitrarily defined as six times the allocation of the input
> array - similar to what the sorting method use). Using `kind="table"` won't
> affect the input/output either, but it is only available for certain arrays
> (somewhat similar to the usage of `assume_unique` for `np.isin`). Does this
> sound fine to you?
>

Ah, I didn't get from the initial description that the results would still
be sorted. Then I'd say that I'm not sure that:
1. I'm not sure that the performance vs. complexity trade-off is worth it.
But I'll leave that to others; Sebastian seemed happy to merge the similar
change in `in1d`
2. If you're interested in working more on this, implementing an unsorted
option would be more interesting imho - significantly more performance
benefits, and not just for integers or at the cost of more memory use.

Cheers,
Ralf



> > Regarding the name, `'table'` is an implementation detail. The end user
> should not have to care what the data structure is that is used. I suggest
> to use something like "unsorted" and just explain it as the ordering of
> results being undefined, which can give significant performance benefits.
>
> The names `'table'` and `'sort'` were selected for consistency as these
> names were recently put in place for `np.isin` and `np.in1d`, and the
> analogous methods used for `unique` are conceptually similar. I have no
> particular attachment to either name though.
>
> Thanks again!
> Miles
> ___
> NumPy-Discussion mailing list -- numpy-discussion@python.org
> To unsubscribe send an email to numpy-discussion-le...@python.org
> https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
> Member address: ralf.gomm...@googlemail.com
>
___
NumPy-Discussion mailing list -- numpy-discussion@python.org
To unsubscribe send an email to numpy-discussion-le...@python.org
https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
Member address: arch...@mail-archive.com


[Numpy-discussion] Re: Speeding up `unique` and adding "kind" parameter

2022-06-28 Thread Miles Cranmer
Regarding 2., did you have a particular approach in mind? This new lookup table 
method is already O(n) scaling (similar to a counting sort), so I cannot fathom 
a method that, as you suggest, would get significantly better performance for 
integer arrays. The sorting here is "free" in some sense since you aren't 
spending additional cycles, the table is initialized pre-sorted.

I could see in the future that one could create a similar approach for 
arbitrary datatypes, and you would use a hash table rather than a fixed-size 
array. In this case you would similarly get O(n) performance, and would produce 
an **unsorted** output. But in any case a hash table would probably be much 
less efficient for integers than an array as implemented here - the latter 
which also gives you a sorted output as a side effect. Personally I have used 
np.unique for integer data far more than any other type, and I would _guess_ it 
is similar in the broader community (though I have no data or insight to 
support that)–so I think having a big speedup like this could be quite nice for 
users.

Thanks,
Miles
___
NumPy-Discussion mailing list -- numpy-discussion@python.org
To unsubscribe send an email to numpy-discussion-le...@python.org
https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
Member address: arch...@mail-archive.com