Hello mailing list,
I would like to open a PR that improves the speed of intersect1d in particular
for large arrays of potentially unequal sizes. As parts of the change would
involve a new keyword I think I should write to this mailing list first
(correct me if I am wrong).
Let's start with the current implementation of intersect1d. The steps are
(roughly): I) concatenate both arrays, II) sort that large array, and III) keep
only the duplicates. This is a simple but not necessarily fast implementation. In
particular, it has complexity O((len1 + len2) * log(len1 + len2)).
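For reference, here is roughly what the current strategy boils down to (a
simplified sketch that ignores the assume_unique and return_indices handling;
the function name is just for illustration):

import numpy as np

def intersect1d_current_sketch(ar1, ar2):
    # Simplified sketch of the current strategy.
    aux = np.concatenate((np.unique(ar1), np.unique(ar2)))  # I) concatenate
    aux.sort()                                               # II) sort the combined array
    return aux[:-1][aux[1:] == aux[:-1]]                     # III) keep values that occur twice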
I have two suggestions, one of which can greatly improve the runtime (1), while
the other brings some minor improvements (2). Let's start with the major one.
1) Adding assume_sorted as keyword argument.
If you can already assume that the arrays are sorted, you can improve the
runtime to O(min(len1, len2) * log(max(len1, len2))), which is orders of
magnitude faster, especially when the intersected arrays are very unequal in
size.
The implementation basically is: I) figure out which is the smaller array,
II) make that array unique, and III) use np.searchsorted to look up the
elements of the smaller array in the larger one (a short sketch follows below).
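A rough sketch of that idea, assuming both inputs are 1-d and already sorted
(again, the function name is only for illustration):

import numpy as np

def intersect1d_assume_sorted_sketch(ar1, ar2):
    # Both inputs are assumed to be 1-d and sorted.
    ar1, ar2 = np.asarray(ar1), np.asarray(ar2)
    if ar1.size == 0 or ar2.size == 0:
        return ar1[:0]
    # I) make ar1 the smaller of the two arrays
    if ar2.size < ar1.size:
        ar1, ar2 = ar2, ar1
    # II) deduplicate the smaller, already sorted array in linear time
    ar1 = ar1[np.concatenate(([True], ar1[1:] != ar1[:-1]))]
    # III) binary-search every element of the small array in the large one:
    #      O(min(len1, len2) * log(max(len1, len2)))
    idx = np.searchsorted(ar2, ar1)
    idx[idx == ar2.size] = 0  # out-of-range insertion points cannot match anyway
    return ar1[ar2[idx] == ar1]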
I think that having assume_sorted as a keyword argument is quite useful,
especially since the result of intersect1d is itself always sorted, so chaining
multiple intersections together can be made much faster.
2) Improve the performance in the case that we cannot assume sorted arrays
If we cannot assume that both arrays are already sorted, we can still gain an
advantage, mainly by sorting the arrays separately and then using
np.searchsorted as above (see the sketch below). This brings the complexity
down to O(len1*log(len1) + len2*log(len2) + min(len1, len2)*log(max(len1, len2))).
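In code this would be little more than sorting/deduplicating each input on its
own and reusing the assume_sorted sketch from above:

import numpy as np

def intersect1d_sort_separately_sketch(ar1, ar2):
    # Sort and deduplicate each array independently:
    # O(len1*log(len1) + len2*log(len2)).
    ar1 = np.unique(ar1)
    ar2 = np.unique(ar2)
    # Then intersect the two sorted arrays via searchsorted:
    # O(min(len1, len2) * log(max(len1, len2))).
    return intersect1d_assume_sorted_sketch(ar1, ar2)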
I already mentioned this in issue 16964:
https://github.com/numpy/numpy/issues/16964
Looking forward to your feedback.
Have a great day and stay safe
Felix Stamm