Naturally, you'd want to avoid redoing the indexing where you can, which is another good reason to factor the indexing mechanisms out into separate classes. A factor-of-two performance difference does not get me too excited; again, I suspect it would go the other way around for an out-of-cache dataset being grouped. But this by itself is of course another argument for factoring the indexing out behind a uniform interface, so that we can play around with those implementation details later and specialize the indexing for different scenarios. It also really helps with code maintainability; most of the arraysetops are almost trivial to implement once the indexing machinery has been abstracted away.
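To make that concrete, here is a minimal sketch of what such a uniform indexing interface could look like. The class and attribute names are purely hypothetical (this is not an existing API or a finished proposal); a hash-based subclass could expose the same attributes:

import numpy as np

class SortingIndex:
    """Hypothetical sort-based index; assumes a non-empty 1d key array."""
    def __init__(self, keys):
        keys = np.asarray(keys)
        self.sorter = np.argsort(keys, kind='mergesort')   # the one expensive step, done once
        self.sorted = keys[self.sorter]
        # True at the start of each run of equal keys
        self.flag = np.concatenate(([True], self.sorted[1:] != self.sorted[:-1]))
        self.start = np.flatnonzero(self.flag)

    @property
    def unique(self):
        return self.sorted[self.flag]

    @property
    def counts(self):
        return np.diff(np.append(self.start, len(self.sorted)))

# arraysetops then reduce to one-liners on top of the index, e.g.:
def unique_with_counts(keys):
    index = SortingIndex(keys)
    return index.unique, index.counts

Swapping in a hash-based index would then only be a matter of providing another class with the same attributes.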
On Thu, Sep 4, 2014 at 8:36 PM, Jeff Reback <jeffreb...@gmail.com> wrote:

> FYI pandas DOES use a very performant hash table impl for unique (and value_counts). Sorted state IS maintained by the underlying Index implementation.
> https://github.com/pydata/pandas/blob/master/pandas/hashtable.pyx
>
> In [8]: a = np.random.randint(10, size=(10000,))
>
> In [9]: %timeit np.unique(a)
> 1000 loops, best of 3: 284 µs per loop
>
> In [10]: %timeit Series(a).unique()
> 10000 loops, best of 3: 161 µs per loop
>
> In [11]: s = Series(a)
>
> # without the creation overhead
> In [12]: %timeit s.unique()
> 10000 loops, best of 3: 75.3 µs per loop
>
> On Thu, Sep 4, 2014 at 2:29 PM, Eelco Hoogendoorn <hoogendoorn.ee...@gmail.com> wrote:
>
>> On Thu, Sep 4, 2014 at 8:14 PM, Eelco Hoogendoorn <hoogendoorn.ee...@gmail.com> wrote:
>>
>>> I should clarify: I am speaking about my implementation; I haven't looked at the numpy implementation for a while, so I'm not sure what it is up to. Note that by 'almost free' we are still talking about three passes over the whole array plus temp allocations, but I am assuming a use case where the various sorts involved are the dominant cost, which I imagine they are for out-of-cache sorts. Perhaps that isn't too realistic an assumption about the average use case, I don't know. Though I suppose it is a reasonable guideline to assume that either the dataset is big, or performance isn't that big a concern in the first place.
>>>
>>> On Thu, Sep 4, 2014 at 7:55 PM, Jaime Fernández del Río <jaime.f...@gmail.com> wrote:
>>>
>>>> On Thu, Sep 4, 2014 at 10:39 AM, Eelco Hoogendoorn <hoogendoorn.ee...@gmail.com> wrote:
>>>>
>>>>> On Thu, Sep 4, 2014 at 10:31 AM, Eelco Hoogendoorn <hoogendoorn.ee...@gmail.com> wrote:
>>>>>
>>>>>> On Wed, Sep 3, 2014 at 6:46 PM, Jaime Fernández del Río <jaime.f...@gmail.com> wrote:
>>>>>>
>>>>>>> On Wed, Sep 3, 2014 at 9:33 AM, Jaime Fernández del Río <jaime.f...@gmail.com> wrote:
>>>>>>>
>>>>>>>> On Wed, Sep 3, 2014 at 6:41 AM, Eelco Hoogendoorn <hoogendoorn.ee...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Not sure about the hashing. Indeed one can also build an index of a set by means of a hash table, but it is questionable whether this leads to improved performance over performing an argsort. Hashing may have better asymptotic time complexity in theory, but many datasets used in practice are very easy to sort (O(N)-ish), and the time constant of hashing is higher. More importantly, using a hash table guarantees poor cache behavior for many operations using this index. By contrast, sorting may (but need not) make one random-access pass to build the index, and may (but need not) perform one random access to reorder values for grouping. But insofar as the keys are better behaved than purely random, this coherence will be exploited.
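For concreteness, a minimal sketch of the sort-based grouping described in the quoted paragraph above: one argsort to build the index, one reordering pass over the values, and the group boundaries fall out of a comparison of adjacent keys. This is purely illustrative, not the implementation under discussion:

import numpy as np

def group_split(keys, values):
    # Build the index with a single (stable) argsort
    sorter = np.argsort(keys, kind='mergesort')
    sorted_keys = np.asarray(keys)[sorter]
    sorted_values = np.asarray(values)[sorter]   # the one random-access reorder pass
    # Starts of runs of equal keys mark the group boundaries
    starts = np.flatnonzero(np.concatenate(([True], sorted_keys[1:] != sorted_keys[:-1])))
    return sorted_keys[starts], np.split(sorted_values, starts[1:])

keys = np.array([2, 0, 1, 0, 2, 2])
vals = np.array([10., 20., 30., 40., 50., 60.])
unique_keys, groups = group_split(keys, vals)
# unique_keys -> [0 1 2]; groups -> [array([20., 40.]), array([30.]), array([10., 50., 60.])]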
>>>>>>>> If you want to give it a try, this branch of my numpy fork has hash table based implementations of unique (with no extra indices) and in1d:
>>>>>>>>
>>>>>>>> https://github.com/jaimefrio/numpy/tree/hash-unique
>>>>>>>>
>>>>>>>> A use case where the hash table is clearly better:
>>>>>>>>
>>>>>>>> In [1]: import numpy as np
>>>>>>>> In [2]: from numpy.lib._compiled_base import _unique, _in1d
>>>>>>>>
>>>>>>>> In [3]: a = np.random.randint(10, size=(10000,))
>>>>>>>> In [4]: %timeit np.unique(a)
>>>>>>>> 1000 loops, best of 3: 258 us per loop
>>>>>>>> In [5]: %timeit _unique(a)
>>>>>>>> 10000 loops, best of 3: 143 us per loop
>>>>>>>> In [6]: %timeit np.sort(_unique(a))
>>>>>>>> 10000 loops, best of 3: 149 us per loop
>>>>>>>>
>>>>>>>> It typically performs between 1.5x and 4x faster than sorting. I haven't profiled it properly to know, but there may be quite a bit of performance left to dig out: type-specific comparison functions, optimizing the starting hash table size based on the size of the array to avoid reinsertions...
>>>>>>>>
>>>>>>>> If getting the elements sorted is a necessity, and the array contains very few or no repeated items, then the hash table approach may even perform worse:
>>>>>>>>
>>>>>>>> In [8]: a = np.random.randint(10000, size=(5000,))
>>>>>>>> In [9]: %timeit np.unique(a)
>>>>>>>> 1000 loops, best of 3: 277 us per loop
>>>>>>>> In [10]: %timeit np.sort(_unique(a))
>>>>>>>> 1000 loops, best of 3: 320 us per loop
>>>>>>>>
>>>>>>>> But the hash table still wins in extracting the unique items only:
>>>>>>>>
>>>>>>>> In [11]: %timeit _unique(a)
>>>>>>>> 10000 loops, best of 3: 187 us per loop
>>>>>>>>
>>>>>>>> Where the hash table shines is in more elaborate situations. If you keep the first index where each item was found, and the number of repeats, in the hash table, you can get return_index and return_counts almost for free, which means you come out an extra ~3x faster than with sorting. return_inverse requires a little more trickery, so I won't attempt to quantify the improvement. But I wouldn't be surprised if, after fine-tuning it, there is close to an order of magnitude overall improvement.
>>>>>>>>
>>>>>>>> The speed-up for in1d is also nice:
>>>>>>>>
>>>>>>>> In [16]: a = np.random.randint(1000, size=(1000,))
>>>>>>>> In [17]: b = np.random.randint(1000, size=(500,))
>>>>>>>> In [18]: %timeit np.in1d(a, b)
>>>>>>>> 1000 loops, best of 3: 178 us per loop
>>>>>>>> In [19]: %timeit _in1d(a, b)
>>>>>>>> 10000 loops, best of 3: 30.1 us per loop
>>>>>>>>
>>>>>>>> Of course, there is no point in
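As an aside, the "return_index and return_counts almost for free" point is easy to see in a pure-Python sketch of the hash-table approach. This toy version based on a dict is only meant to show the idea, not the C implementation in the branch linked above:

import numpy as np

def unique_hash(a):
    first_index = {}   # value -> index of its first occurrence
    counts = {}        # value -> number of occurrences
    for i, x in enumerate(a.tolist()):
        if x in counts:
            counts[x] += 1
        else:
            first_index[x] = i
            counts[x] = 1
    # One pass over the data yields all three results; the output is unsorted,
    # so sort afterwards only if sorted order is actually required.
    uniq = np.fromiter(counts, dtype=a.dtype, count=len(counts))
    index = np.array([first_index[x] for x in counts], dtype=np.intp)
    cnt = np.array([counts[x] for x in counts], dtype=np.intp)
    return uniq, index, cnt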
>>>>>>> Ooops!!! Hit the send button too quick. Not to extend myself too long: if we are going to rethink all of this, we should approach it with an open mind. Still, and this post is not helping with that either, I am afraid that we are discussing implementation details but are missing a broader vision of what we want to accomplish and why. That vision of what numpy's grouping functionality, if any, should be, and how it complements or conflicts with what pandas is providing, should precede anything else. I know I haven't, but has anyone looked at how pandas implements grouping? Their documentation on the subject is well worth a read:
>>>>>>>
>>>>>>> http://pandas.pydata.org/pandas-docs/stable/groupby.html
>>>>>>>
>>>>>>> Does numpy need to replicate this? What/why/how can we add to that?
>>>>>>>
>>>>>>> Jaime
>>>>>>
>>>>>> I would certainly not be opposed to having a hashing-based indexing mechanism; I think it would make sense design-wise to have a HashIndex class with the same interface as the rest, and use that subclass in those arraysetops where it makes sense. The 'how to' of indexing and its applications are largely orthogonal, I think (with some tiny performance compromises which are worth the abstraction, imo). For datasets which are not purely random, have many unique items, and which do not fit into cache, I would expect sorting to come out on top, but indeed it depends on the dataset.
>>>>>>
>>>>>> Yeah, the question of how pandas does grouping, and whether we can do better, is a relevant one.
>>>>>>
>>>>>> From what I understand, pandas relies on cython extensions to get vectorized grouping functionality. This is no longer necessary since the introduction of ufuncs in numpy. I don't know how the implementations compare in terms of performance, but I doubt the difference is huge.
>>>>>>
>>>>>> I personally use grouping a lot in my code, and I don't like having to use pandas for it. Most importantly, I don't want to go around creating a dataframe for a single one-line hit-and-run association between keys and values. The permanent association of different types of data and their metadata which pandas offers is, I think, the key difference from numpy, which is all about manipulating just plain ndarrays. Arguably, grouping itself is a pretty elementary manipulation of ndarrays, and adding calls to DataFrame or Series in between a statement that could simply be group_by(keys).mean(values) feels wrong to me. As does including pandas as a dependency just to use this small piece of functionality. Grouping is a more general functionality than any particular method of organizing your data.
>>>>>>
>>>>>> In terms of features, adding transformations and filtering might be nice too; I hadn't thought about it, but that is because, unlike the currently implemented features, the need has never arisen for me. I'm only a small sample size, and I don't see any fundamental objection to adding such functionality though. It certainly raises the question as to where to draw the line with pandas; but my rule of thumb is that if you can think of it as an elementary operation on ndarrays, then it probably belongs in numpy.
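As an illustration of that last point (group_by(keys).mean(values) is a hypothetical API, not something that exists today), a grouped mean can already be written with nothing but existing numpy primitives. This sketch is just one obvious way to do it:

import numpy as np

def group_mean(keys, values):
    # Map each key to a dense group label, then accumulate sums and counts per label
    unique_keys, inverse = np.unique(keys, return_inverse=True)
    sums = np.bincount(inverse, weights=values)
    counts = np.bincount(inverse)
    return unique_keys, sums / counts

keys = np.array(['a', 'b', 'a', 'b', 'a'])
vals = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
print(group_mean(keys, vals))   # (array(['a', 'b'], ...), array([3., 3.]))

Reductions other than a sum or mean can be obtained along the same lines from np.ufunc.reduceat on the sorted values, which is presumably what a general group_by object would wrap.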
>>>>> Oh, I forgot to add: with an indexing mechanism based on sorting, unique values and counts also come 'for free', not counting the O(N) cost of actually creating those arrays. The only time an operation relying on an index incurs another nontrivial amount of overhead is when its 'rank' or 'inverse' property is used, which invokes another argsort. But for the vast majority of grouping or set operations, these properties are never used.
>>>>
>>>> That extra argsort is now gone from master:
>>>>
>>>> https://github.com/numpy/numpy/pull/5012
>>>>
>>>> Even with this improvement, returning any index typically makes `np.unique` run at least 2x slower:
>>>>
>>>> In [1]: import numpy as np
>>>> In [2]: a = np.random.randint(100, size=(1000,))
>>>> In [3]: %timeit np.unique(a)
>>>> 10000 loops, best of 3: 37.3 us per loop
>>>> In [4]: %timeit np.unique(a, return_inverse=True)
>>>> 10000 loops, best of 3: 62.1 us per loop
>>>> In [5]: %timeit np.unique(a, return_index=True)
>>>> 10000 loops, best of 3: 72.8 us per loop
>>>> In [6]: %timeit np.unique(a, return_counts=True)
>>>> 10000 loops, best of 3: 56.4 us per loop
>>>> In [7]: %timeit np.unique(a, return_index=True, return_inverse=True, return_counts=True)
>>>> 10000 loops, best of 3: 112 us per loop
>>
>> Yeah, I looked at the numpy implementation, and it seems these speed differences are simply the result of the extra O(N) costs involved, so my implementation would have the same characteristics. If another array copy or two has a meaningful impact on performance, then you are done as far as optimization within numpy is concerned, I'd say. You could fuse those loops at the C level, but as you said, I think it's better to think about that kind of optimization once we have a more complete picture of the functionality we want.
>>
>> Good call on removing the extra argsort; that hadn't occurred to me.
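For illustration, the second argsort can be avoided by scattering group labels back through the sorting permutation instead of argsorting it again. The sketch below only shows the idea; it is presumably along the lines of what the linked PR does, but it is not that code:

import numpy as np

def unique_inverse(a):
    perm = np.argsort(a, kind='mergesort')
    aux = a[perm]
    # True at the start of each run of equal values
    flag = np.concatenate(([True], aux[1:] != aux[:-1]))
    inverse = np.empty(a.shape, dtype=np.intp)
    inverse[perm] = np.cumsum(flag) - 1   # scatter group labels back, no second argsort
    return aux[flag], inverse

a = np.array([3, 1, 3, 2, 1])
uniq, inv = unique_inverse(a)
# uniq -> [1 2 3]; inv -> [2 0 2 1 0]; uniq[inv] reconstructs a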