[Numpy-discussion] Re: API: make numpy.lib._arraysetops.intersect1d work on multiple arrays #25688

2024-02-06 Thread Stephan Kuschel via NumPy-Discussion

Dear Dom,

thanks for bringing up the possible constriction. I agree that this 
would be a serious argument against the change.


However, as you said, the overlapping/non-overlapping indices would 
become ambiguous with more than two arrays, and calling the function 
with only two arrays at a time would still be possible. So the only 
future generalization we would lose is one whose solutions are 
ambiguous anyway. I therefore fail to see what exactly the other use 
case would be.


The point of this change is not the luxury of allowing multiple arrays 
when calculating the intersection. It is all about getting the indices 
into the original arrays, using `return_indices=True`.
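For reference, the indices Stephan is after are what the two-array form of `np.intersect1d` already returns today; a minimal sketch (the array values are made up for illustration):

```python
import numpy as np

a = np.array([1, 3, 5, 7])
b = np.array([3, 4, 5, 6])

# return_indices=True yields, besides the common values, the positions
# of those values in each of the original arrays.
common, ia, ib = np.intersect1d(a, b, return_indices=True)
# common -> [3, 5]; a[ia] and b[ib] both equal common
```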


All the Best
Stephan

On 02.02.24 at 17:36, Dom Grigonis wrote:
Also, I don’t know if this could be of value, but my use case for this 
is to find overlaps, then split arrays into overlapping and 
non-overlapping segments.


Thus, it might be useful for `return_indices=True` to return indices of 
all instances, not only the first.


Also, in my case I need both overlapping and non-overlapping indices, 
but this would become ambiguous with more than 2 arrays.


If it were left as a 2-array input, it could be extended to return 
both overlapping and non-overlapping parts. I think that could be 
another potential path to consider.
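With two arrays, the overlapping/non-overlapping split Dom describes stays well defined; a rough sketch using `np.isin` (the example arrays are invented):

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5])
b = np.array([4, 5, 6, 7])

mask = np.isin(a, b)                     # True where a's element occurs in b
overlap_idx = np.flatnonzero(mask)       # indices of a overlapping b
non_overlap_idx = np.flatnonzero(~mask)  # indices of a not in b
# overlap_idx -> [3, 4], non_overlap_idx -> [0, 1, 2]
```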


E.g., what would the speed comparison be:


intr = intersect1d(arr1, arr2, assume_unique=False)
intr = intersect1d(intr, np.unique(arr3), assume_unique=True)

# VS new

intr = intersect1d(arr1, arr2, arr3, assume_unique=False)

Then, does the gain from such a generalisation justify the constriction 
it introduces?
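For context, without the proposed change the multi-array case can only be expressed as a pairwise fold; a small sketch of that (arrays invented):

```python
from functools import reduce
import numpy as np

arrays = [np.array([1, 2, 3, 4]),
          np.array([2, 3, 4, 5]),
          np.array([3, 4, 5, 6])]

# Pairwise fold: intersect the first two arrays, then fold in the rest.
common = reduce(np.intersect1d, arrays)
# common -> [3, 4]
```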


Regards,
DG

On 2 Feb 2024, at 17:31, Marten van Kerkwijk wrote:



For my own work, I required the intersect1d function to work on multiple
arrays while returning the indices (using `return_indices=True`).
Consequently I changed the function in numpy and now I am seeking
feedback from the community.

This is the corresponding PR: https://github.com/numpy/numpy/pull/25688
To me this looks like a very sensible generalization.  In terms of numpy
API, the only real change is that, effectively, the assume_unique and
return_indices arguments become keyword-only, i.e., in the unlikely case
that someone passed those as positional, a trivial backward-compatible
change will fix it.
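The keyword-only effect Marten describes follows mechanically from making the first parameter variadic; a hypothetical sketch of the signature change (not the actual PR code, names invented):

```python
# Current two-array style: options can be passed positionally.
def intersect1d(ar1, ar2, assume_unique=False, return_indices=False):
    ...

# Variadic style: *arrays swallows every positional argument, so
# assume_unique and return_indices become keyword-only.
def intersect1d_multi(*arrays, assume_unique=False, return_indices=False):
    ...
```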

-- Marten
___
NumPy-Discussion mailing list -- numpy-discussion@python.org 

To unsubscribe send an email to numpy-discussion-le...@python.org 

https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ 


Member address: dom.grigo...@gmail.com





[Numpy-discussion] Re: API: make numpy.lib._arraysetops.intersect1d work on multiple arrays #25688

2024-02-06 Thread Stephan Kuschel via NumPy-Discussion

Dear Dom,

just checked, and on my computer the new version is a factor of ~2 
faster than the reduce approach when the arrays are shuffled. For 
sorted arrays, the new version is a factor of 3.4 faster:



import numpy as np
from functools import reduce

idss = [np.random.permutation(np.arange(a*100, int(1e5)+a*100, 1))
        for a in range(20)]

%timeit intersect1d(*idss)            # 166 +- 47 ms
%timeit reduce(np.intersect1d, idss)  # 301 +- 3.7 ms

and

idss = [np.arange(a*100, int(1e5)+a*100, 1) for a in range(20)]

%timeit intersect1d(*idss)            # 77 +- 6 ms
%timeit reduce(np.intersect1d, idss)  # 212 +- 3.8 ms
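For readers outside IPython, the `%timeit` lines above can be approximated with the stdlib `timeit` module; only the `reduce` variant is shown here, since the variadic `intersect1d(*idss)` form requires the PR branch. Timings will vary by machine.

```python
import timeit
from functools import reduce
import numpy as np

idss = [np.random.permutation(np.arange(a * 100, int(1e5) + a * 100))
        for a in range(20)]

# Average time per call over a few repetitions.
t = timeit.timeit(lambda: reduce(np.intersect1d, idss), number=3) / 3
print(f"reduce(np.intersect1d, idss): {t:.3f} s per call")
```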

Stephan


On 02.02.24 at 17:10, Dom Grigonis wrote:

Just curious, how much faster is it compared to the currently 
recommended `reduce` approach?

DG




[Numpy-discussion] Re: API: make numpy.lib._arraysetops.intersect1d work on multiple arrays #25688

2024-02-06 Thread Dom Grigonis
Hi Stephan,

Thanks for replying to my concerns.

I have looked a bit more into it, and `intersect1d` is not the right concept 
for my problem anyway. And it is great that it works faster than the reduce 
approach.

Also, I did more speed tests and, after all, the `in1d` approach is not 
generally faster: it is faster in some cases and slower in others, and it 
uses more memory in all cases.
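For context, the `in1d`-style intersection Dom is comparing against can be sketched as a membership mask (`np.isin` is the modern spelling of `np.in1d`; example arrays invented):

```python
import numpy as np

a = np.array([5, 1, 3, 3, 7])
b = np.array([3, 7, 9])

# Membership-mask intersection: unique-ify a, then keep the values
# found in b. The boolean mask is the extra memory cost relative to
# the sort-and-compare strategy used by np.intersect1d.
ua = np.unique(a)
common = ua[np.isin(ua, b)]
# common -> [3, 7]
```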

Regards,
DG
