Hi all,
just a note that I merged the PR with the following semantics:
A new `kind` keyword-only argument:
* `kind=None` uses a memory bound based heuristic to decide
which method to use
* `kind="table"` uses the new approach (integer arrays only)
* `kind="sort"` forces the old behavior
The new documentation is available at:
https://numpy.org/devdocs/reference/generated/numpy.in1d.html
It seems this addition should be useful in many cases, but if you have
any concern about the choice of API please comment!
Cheers,
Sebastian
On Thu, 2022-06-16 at 06:08 -0700, Sebastian Berg wrote:
> Hi all,
>
> there is a PR to add a faster path to `np.isin`, that uses a look-up-
> table for all the elements that are included in the haystack
> (`test_elements`):
>
> https://github.com/numpy/numpy/pull/12065/files
>
> Such a table means that the memory overhead can be very significant,
> but the speedup as well, so there was the idea of adding an option to
> pick which version is used.
>
> The current documentation for this new `method` keyword argument
> would
> be. So the main questions are:
>
> * Is there any concern about adding such a new kwarg?
> * Is `method` the best name? Sorts uses `kind` which may also be
> good
>
> There is also the smaller question of what heuristic 'auto' would
> use,
> but that can be tweaked at any time.
>
> ```
> method : {'auto', 'sort', 'dictionary'}, optional
> The algorithm to use. This will not affect the final result,
> but will affect the speed. Default is 'auto'.
>
> - If 'sort', will use a mergesort-based approach. This will
> have
> a memory usage of roughly 6 times the sum of the sizes of
> `ar1` and `ar2`, not accounting for size of dtypes.
> - If 'dictionary', will use a key-dictionary approach
> similar
> to a counting sort. This is only available for boolean and
> integer arrays. This will have a memory usage of the
> size of `ar1` plus the max-min value of `ar2`. This tends
> to be the faster method if the following formula is true:
> `log10(len(ar2)) > (log10(max(ar2)-min(ar2)) - 2.27) /
> 0.927`,
> but may use greater memory.
> - If 'auto', will automatically choose the method which is
> expected to perform the fastest, using the above
> formula. For larger sizes or smaller range,
> 'dictionary' is chosen. For larger range or smaller
> sizes, 'sort' is chosen.`
> ```
>
> Cheers,
>
> Sebastian
> ___
> NumPy-Discussion mailing list -- numpy-discussion@python.org
> To unsubscribe send an email to numpy-discussion-le...@python.org
> https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
> Member address: sebast...@sipsolutions.net
signature.asc
Description: This is a digitally signed message part
___
NumPy-Discussion mailing list -- numpy-discussion@python.org
To unsubscribe send an email to numpy-discussion-le...@python.org
https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
Member address: arch...@mail-archive.com