Bump. I'd be interested to know if this is a desirable feature for numpy? (specifically the 1D "find" functionality rather than the "any"/"all" also discussed) If so, I'd be more than happy to submit a PR, but I don't want to put in the effort if the principle isn't desirable in the core of numpy.
Cheers, On 8 March 2013 17:38, Phil Elson <[email protected]> wrote: > Interesting. I hadn't thought of those. I've implemented (very roughly > without a sound logic check) and benchmarked: > > def my_any(a, predicate, chunk_size=2048): > try: > next(find(a, predicate, chunk_size)) > return True > except StopIteration: > return False > > def my_all(a, predicate, chunk_size=2048): > return not my_any(a, lambda a: ~predicate(a), chunk_size) > > > With the following setup: > > import numpy as np > import numpy.random > > np.random.seed(1) > a = np.random.randn(1e8) > > > For a low frequency *any*: > > In [12]: %timeit (np.abs(a) > 6).any() > 1 loops, best of 3: 1.29 s per loop > > In [13]: %timeit my_any(a, lambda a: np.abs(a) > 6) > > 1 loops, best of 3: 792 ms per loop > > In [14]: %timeit my_any(a, lambda a: np.abs(a) > 6, chunk_size=10000) > 1 loops, best of 3: 654 ms per loop > > For a False *any*: > > In [16]: %timeit (np.abs(a) > 7).any() > 1 loops, best of 3: 1.22 s per loop > > In [17]: %timeit my_any(a, lambda a: np.abs(a) > 7) > 1 loops, best of 3: 2.4 s per loop > > For a high probability *any*: > > In [28]: %timeit (np.abs(a) > 1).any() > 1 loops, best of 3: 972 ms per loop > > In [27]: %timeit my_any(a, lambda a: np.abs(a) > 1) > 10000 loops, best of 3: 67 us per loop > > --------------- > > For a low probability *all*: > > In [18]: %timeit (np.abs(a) < 6).all() > 1 loops, best of 3: 1.16 s per loop > > In [19]: %timeit my_all(a, lambda a: np.abs(a) < 6) > 1 loops, best of 3: 880 ms per loop > > In [20]: %timeit my_all(a, lambda a: np.abs(a) < 6, chunk_size=10000) > 1 loops, best of 3: 706 ms per loop > > For a True *all*: > > In [22]: %timeit (np.abs(a) < 7).all() > 1 loops, best of 3: 1.47 s per loop > > In [23]: %timeit my_all(a, lambda a: np.abs(a) < 7) > 1 loops, best of 3: 2.65 s per loop > > For a high probability *all*: > > In [25]: %timeit (np.abs(a) < 1).all() > 1 loops, best of 3: 978 ms per loop > > In [26]: %timeit my_all(a, lambda a: np.abs(a) < 1) > 10000 loops, best of 3: 73.6 us per loop > > > > > > > > On 6 March 2013 21:16, Benjamin Root <[email protected]> wrote: > >> >> >> On Tue, Mar 5, 2013 at 9:15 AM, Phil Elson <[email protected]> wrote: >> >>> The ticket https://github.com/numpy/numpy/issues/2269 discusses the >>> possibility of implementing a "find first" style function which can >>> optimise the process of finding the first value(s) which match a predicate >>> in a given 1D array. For example: >>> >>> >>> >>> a = np.sin(np.linspace(0, np.pi, 200)) >>> >>> print find_first(a, lambda a: a > 0.9) >>> ((71, ), 0.900479032457) >>> >>> >>> This has been discussed in several locations: >>> >>> https://github.com/numpy/numpy/issues/2269 >>> https://github.com/numpy/numpy/issues/2333 >>> >>> http://stackoverflow.com/questions/7632963/numpy-array-how-to-find-index-of-first-occurrence-of-item >>> >>> >>> *Rationale* >>> >>> For small arrays there is no real reason to avoid doing: >>> >>> >>> a = np.sin(np.linspace(0, np.pi, 200)) >>> >>> ind = (a > 0.9).nonzero()[0][0] >>> >>> print (ind, ), a[ind] >>> (71,) 0.900479032457 >>> >>> >>> But for larger arrays, this can lead to massive amounts of work even if >>> the result is one of the first to be computed. Example: >>> >>> >>> a = np.arange(1e8) >>> >>> print (a == 5).nonzero()[0][0] >>> 5 >>> >>> >>> So a function which terminates when the first matching value is found is >>> desirable. >>> >>> As mentioned in #2269, it is possible to define a consistent ordering >>> which allows this functionality for >1D arrays, but IMHO it overcomplicates >>> the problem and was not a case that I personally needed, so I've limited >>> the scope to 1D arrays only. >>> >>> >>> *Implementation* >>> >>> My initial assumption was that to get any kind of performance I would >>> need to write the *find* function in C, however after prototyping with >>> some array chunking it became apparent that a trivial python function would >>> be quick enough for my needs. >>> >>> The approach I've implemented in the code found in #2269 simply breaks >>> the array into sub-arrays of maximum length *chunk_size* (2048 by >>> default, though there is no real science to this number), applies the given >>> predicating function, and yields the results from *nonzero()*. The >>> given function should be a python function which operates on the whole of >>> the sub-array element-wise (i.e. the function should be vectorized). >>> Returning a generator also has the benefit of allowing users to get the >>> first *n* matching values/indices. >>> >>> >>> *Results* >>> >>> >>> I timed the implementation of *find* found in my comment at >>> https://github.com/numpy/numpy/issues/2269#issuecomment-14436725 with >>> an obvious test: >>> >>> >>> In [1]: from np_utils import find >>> >>> In [2]: import numpy as np >>> >>> In [3]: import numpy.random >>> >>> In [4]: np.random.seed(1) >>> >>> In [5]: a = np.random.randn(1e8) >>> >>> In [6]: a.min(), a.max() >>> Out[6]: (-6.1194900990552776, 5.9632246301166321) >>> >>> In [7]: next(find(a, lambda a: np.abs(a) > 6)) >>> Out[7]: ((33105441,), -6.1194900990552776) >>> >>> In [8]: (np.abs(a) > 6).nonzero() >>> Out[8]: (array([33105441]),) >>> >>> In [9]: %timeit (np.abs(a) > 6).nonzero() >>> 1 loops, best of 3: 1.51 s per loop >>> >>> In [10]: %timeit next(find(a, lambda a: np.abs(a) > 6)) >>> 1 loops, best of 3: 912 ms per loop >>> >>> In [11]: %timeit next(find(a, lambda a: np.abs(a) > 6, >>> chunk_size=100000)) >>> 1 loops, best of 3: 470 ms per loop >>> >>> In [12]: %timeit next(find(a, lambda a: np.abs(a) > 6, >>> chunk_size=1000000)) >>> 1 loops, best of 3: 483 ms per loop >>> >>> >>> This shows that picking a sensible *chunk_size* can yield massive >>> speed-ups (nonzero is x3 slower in one case). A similar example with a much >>> smaller 1D array shows similar promise: >>> >>> In [41]: a = np.random.randn(1e4) >>> >>> In [42]: %timeit next(find(a, lambda a: np.abs(a) > 3)) >>> 10000 loops, best of 3: 35.8 us per loop >>> >>> In [43]: %timeit (np.abs(a) > 3).nonzero() >>> 10000 loops, best of 3: 148 us per loop >>> >>> >>> As I commented on the issue tracker, if you think this function is worth >>> taking forward, I'd be happy to open up a pull request. >>> >>> Feedback greatfully received. >>> >>> Cheers, >>> >>> Phil >>> >>> >>> >> In the interest of generalizing code and such, could such approaches be >> used for functions like np.any() and np.all() for short-circuiting if True >> or False (respectively) are found? I wonder what other sort of functions >> in NumPy might benefit from this? >> >> Ben Root >> >> >> _______________________________________________ >> NumPy-Discussion mailing list >> [email protected] >> http://mail.scipy.org/mailman/listinfo/numpy-discussion >> >> >
_______________________________________________ NumPy-Discussion mailing list [email protected] http://mail.scipy.org/mailman/listinfo/numpy-discussion
