FYI pandas DOES use a very performant hash table implementation for unique (and
value_counts). Sorted state IS maintained by the underlying Index implementation.

https://github.com/pydata/pandas/blob/master/pandas/hashtable.pyx
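A quick illustration of that difference, added here for reference (it only assumes
numpy and pandas are importable): np.unique sorts its output, while the hash-based
Series.unique returns values in the order they first appear, so no sort is involved.

import numpy as np
import pandas as pd

a = np.array([3, 1, 3, 2, 1])
print(np.unique(a))           # [1 2 3]  -- sorted output
print(pd.Series(a).unique())  # [3 1 2]  -- order of first appearance, no sort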
In [8]: a = np.random.randint(10, size=(10000,))

In [9]: %timeit np.unique(a)
1000 loops, best of 3: 284 µs per loop

In [10]: %timeit Series(a).unique()
10000 loops, best of 3: 161 µs per loop

In [11]: s = Series(a)  # without the creation overhead

In [12]: %timeit s.unique()
10000 loops, best of 3: 75.3 µs per loop


On Thu, Sep 4, 2014 at 2:29 PM, Eelco Hoogendoorn <hoogendoorn.ee...@gmail.com> wrote:

> On Thu, Sep 4, 2014 at 8:14 PM, Eelco Hoogendoorn <hoogendoorn.ee...@gmail.com> wrote:
>
>> I should clarify: I am speaking about my implementation; I haven't looked
>> at the numpy implementation for a while, so I'm not sure what it is up to.
>> Note that by 'almost free' we are still talking about three passes over
>> the whole array plus temp allocations, but I am assuming a use case where
>> the various sorts involved are the dominant cost, which I imagine they are
>> for out-of-cache sorts. Perhaps this isn't too realistic an assumption
>> about the average use case, though; I don't know. Though I suppose it's a
>> reasonable guideline to assume that either the dataset is big, or
>> performance isn't that big a concern in the first place.
>>
>> On Thu, Sep 4, 2014 at 7:55 PM, Jaime Fernández del Río <jaime.f...@gmail.com> wrote:
>>
>>> On Thu, Sep 4, 2014 at 10:39 AM, Eelco Hoogendoorn <hoogendoorn.ee...@gmail.com> wrote:
>>>
>>>> On Thu, Sep 4, 2014 at 10:31 AM, Eelco Hoogendoorn <hoogendoorn.ee...@gmail.com> wrote:
>>>>
>>>>> On Wed, Sep 3, 2014 at 6:46 PM, Jaime Fernández del Río <jaime.f...@gmail.com> wrote:
>>>>>
>>>>>> On Wed, Sep 3, 2014 at 9:33 AM, Jaime Fernández del Río <jaime.f...@gmail.com> wrote:
>>>>>>
>>>>>>> On Wed, Sep 3, 2014 at 6:41 AM, Eelco Hoogendoorn <hoogendoorn.ee...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Not sure about the hashing. Indeed one can also build an index of a
>>>>>>>> set by means of a hash table, but it's questionable whether this
>>>>>>>> leads to improved performance over performing an argsort. Hashing
>>>>>>>> may have better asymptotic time complexity in theory, but many
>>>>>>>> datasets used in practice are very easy to sort (O(N)-ish), and the
>>>>>>>> time constant of hashing is higher. But more importantly, using a
>>>>>>>> hash table guarantees poor cache behavior for many operations using
>>>>>>>> this index. By contrast, sorting may (but need not) make one
>>>>>>>> random-access pass to build the index, and may (but need not)
>>>>>>>> perform one random access to reorder values for grouping. But
>>>>>>>> insofar as the keys are better behaved than pure random, this
>>>>>>>> coherence will be exploited.
>>>>>>>
>>>>>>> If you want to give it a try, this branch of my numpy fork has hash
>>>>>>> table based implementations of unique (with no extra indices) and in1d:
>>>>>>>
>>>>>>> https://github.com/jaimefrio/numpy/tree/hash-unique
>>>>>>>
>>>>>>> A use case where the hash table is clearly better:
>>>>>>>
>>>>>>> In [1]: import numpy as np
>>>>>>> In [2]: from numpy.lib._compiled_base import _unique, _in1d
>>>>>>>
>>>>>>> In [3]: a = np.random.randint(10, size=(10000,))
>>>>>>> In [4]: %timeit np.unique(a)
>>>>>>> 1000 loops, best of 3: 258 us per loop
>>>>>>> In [5]: %timeit _unique(a)
>>>>>>> 10000 loops, best of 3: 143 us per loop
>>>>>>> In [6]: %timeit np.sort(_unique(a))
>>>>>>> 10000 loops, best of 3: 149 us per loop
>>>>>>>
>>>>>>> It typically performs between 1.5x and 4x faster than sorting.
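As a rough, purely illustrative sketch of the idea behind a hash-based unique (this
is not the code on the branch above, and hash_unique is a made-up name): a single
pass over the data can record each value's first index and repeat count as it goes,
which is also why the extra outputs discussed further down come almost for free.

import numpy as np

def hash_unique(a):
    # assumes a 1-D array of hashable scalars; a Python dict stands in for
    # the dedicated hash table a compiled implementation would use
    table = {}                       # value -> [first index, count]
    for i, x in enumerate(a.tolist()):
        entry = table.get(x)
        if entry is None:
            table[x] = [i, 1]        # first occurrence of this value
        else:
            entry[1] += 1            # repeated value: just bump the count
    values = np.fromiter(table, dtype=a.dtype, count=len(table))
    first_index, counts = np.array(list(table.values())).T
    return values, first_index, counts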
>>>>>>> I haven't profiled it properly to know, but there may be quite a bit
>>>>>>> of performance left to dig out: type-specific comparison functions,
>>>>>>> optimizing the starting hash table size based on the size of the
>>>>>>> array to avoid reinsertions...
>>>>>>>
>>>>>>> If getting the elements sorted is a necessity, and the array contains
>>>>>>> very few or no repeated items, then the hash table approach may even
>>>>>>> perform worse:
>>>>>>>
>>>>>>> In [8]: a = np.random.randint(10000, size=(5000,))
>>>>>>> In [9]: %timeit np.unique(a)
>>>>>>> 1000 loops, best of 3: 277 us per loop
>>>>>>> In [10]: %timeit np.sort(_unique(a))
>>>>>>> 1000 loops, best of 3: 320 us per loop
>>>>>>>
>>>>>>> But the hash table still wins in extracting the unique items only:
>>>>>>>
>>>>>>> In [11]: %timeit _unique(a)
>>>>>>> 10000 loops, best of 3: 187 us per loop
>>>>>>>
>>>>>>> Where the hash table shines is in more elaborate situations. If you
>>>>>>> keep the first index where each item was found, and the number of
>>>>>>> repeats, in the hash table, you can get return_index and
>>>>>>> return_counts almost for free, which means you come out roughly a
>>>>>>> further 3x faster than with sorting. return_inverse requires a little
>>>>>>> more trickery, so I won't attempt to quantify the improvement. But I
>>>>>>> wouldn't be surprised if, after fine tuning it, there is close to an
>>>>>>> order of magnitude overall improvement.
>>>>>>>
>>>>>>> The speed-up for in1d is also nice:
>>>>>>>
>>>>>>> In [16]: a = np.random.randint(1000, size=(1000,))
>>>>>>> In [17]: b = np.random.randint(1000, size=(500,))
>>>>>>> In [18]: %timeit np.in1d(a, b)
>>>>>>> 1000 loops, best of 3: 178 us per loop
>>>>>>> In [19]: %timeit _in1d(a, b)
>>>>>>> 10000 loops, best of 3: 30.1 us per loop
>>>>>>>
>>>>>>> Of course, there is no point in
>>>>>>
>>>>>> Ooops!!! Hit the send button too quick. Not to extend myself too long:
>>>>>> if we are going to rethink all of this, we should approach it with an
>>>>>> open mind. Still, and this post is not helping with that either, I am
>>>>>> afraid that we are discussing implementation details but are missing a
>>>>>> broader vision of what we want to accomplish and why. That vision of
>>>>>> what numpy's grouping functionality, if any, should be, and how it
>>>>>> complements or conflicts with what pandas is providing, should precede
>>>>>> anything else. I know I haven't, but has anyone looked at how pandas
>>>>>> implements grouping? Their documentation on the subject is well worth
>>>>>> a read:
>>>>>>
>>>>>> http://pandas.pydata.org/pandas-docs/stable/groupby.html
>>>>>>
>>>>>> Does numpy need to replicate this? What/why/how can we add to that?
>>>>>>
>>>>>> Jaime
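A minimal sketch of why the in1d numbers quoted above improve so much (illustrative
only; hash_in1d is a made-up name and a plain Python set stands in for a real hash
table): build a table from the second array once, then answer membership queries in
one pass over the first array, with no sorting of either input.

import numpy as np

def hash_in1d(a, b):
    lookup = set(b.tolist())                       # one pass over b to build the table
    return np.fromiter((x in lookup for x in a.tolist()),
                       dtype=bool, count=len(a))   # one pass over a to probe it

a = np.random.randint(1000, size=1000)
b = np.random.randint(1000, size=500)
assert np.array_equal(hash_in1d(a, b), np.in1d(a, b))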
>>>>> I would certainly not be opposed to having a hashing-based indexing
>>>>> mechanism; I think it would make sense design-wise to have a HashIndex
>>>>> class with the same interface as the rest, and use that subclass in
>>>>> those arraysetops where it makes sense. The 'how to' of indexing and
>>>>> its applications are largely orthogonal, I think (with some tiny
>>>>> performance compromises which are worth the abstraction, imo). For
>>>>> datasets which are not purely random, have many unique items, and
>>>>> which do not fit into cache, I would expect sorting to come out on
>>>>> top, but indeed it depends on the dataset.
>>>>>
>>>>> Yeah, the question of how pandas does grouping, and whether we can do
>>>>> better, is a relevant one.
>>>>>
>>>>> From what I understand, pandas relies on cython extensions to get
>>>>> vectorized grouping functionality. This is no longer necessary since
>>>>> the introduction of ufuncs in numpy. I don't know how the
>>>>> implementations compare in terms of performance, but I doubt the
>>>>> difference is huge.
>>>>>
>>>>> I personally use grouping a lot in my code, and I don't like having to
>>>>> use pandas for it. Most importantly, I don't want to go around creating
>>>>> a dataframe for a single one-line hit-and-run association between keys
>>>>> and values. The permanent association of different types of data and
>>>>> their metadata which pandas offers is, I think, the key difference from
>>>>> numpy, which is all about manipulating just plain ndarrays. Arguably,
>>>>> grouping itself is a pretty elementary manipulation of ndarrays, and
>>>>> adding calls to DataFrame or Series in between a statement that could
>>>>> simply be group_by(keys).mean(values) feels wrong to me. As does
>>>>> including pandas as a dependency just to use this small piece of
>>>>> functionality. Grouping is a more general functionality than any
>>>>> particular method of organizing your data.
>>>>>
>>>>> In terms of features, adding transformations and filtering might be
>>>>> nice too; I hadn't thought about it, but that is because, unlike the
>>>>> currently implemented features, the need has never arisen for me. I'm
>>>>> only a small sample size, and I don't see any fundamental objection to
>>>>> adding such functionality though. It certainly raises the question as
>>>>> to where to draw the line with pandas; but my rule of thumb is that if
>>>>> you can think of it as an elementary operation on ndarrays, then it
>>>>> probably belongs in numpy.
>>>>
>>>> Oh, I forgot to add: with an indexing mechanism based on sorting, unique
>>>> values and counts also come 'for free', not counting the O(N) cost of
>>>> actually creating those arrays. The only time an operation relying on an
>>>> index incurs another nontrivial amount of overhead is when its 'rank' or
>>>> 'inverse' property is used, which invokes another argsort. But for the
>>>> vast majority of grouping or set operations, these properties are never
>>>> used.
>>>
>>> That extra argsort is now gone from master:
>>>
>>> https://github.com/numpy/numpy/pull/5012
>>>
>>> Even with this improvement, returning any index typically makes
>>> `np.unique` run at least 2x slower:
>>>
>>> In [1]: import numpy as np
>>> In [2]: a = np.random.randint(100, size=(1000,))
>>> In [3]: %timeit np.unique(a)
>>> 10000 loops, best of 3: 37.3 us per loop
>>> In [4]: %timeit np.unique(a, return_inverse=True)
>>> 10000 loops, best of 3: 62.1 us per loop
>>> In [5]: %timeit np.unique(a, return_index=True)
>>> 10000 loops, best of 3: 72.8 us per loop
>>> In [6]: %timeit np.unique(a, return_counts=True)
>>> 10000 loops, best of 3: 56.4 us per loop
>>> In [7]: %timeit np.unique(a, return_index=True, return_inverse=True, return_counts=True)
>>> 10000 loops, best of 3: 112 us per loop
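For comparison with the hash-based sketch earlier, here is a rough sketch of the
sort-based approach (illustrative only; unique_via_sort is a made-up name and this
is not the actual np.unique code): after a single stable argsort, the unique values,
first indices, inverse, and counts all follow from cheap O(N) passes.

import numpy as np

def unique_via_sort(a):
    a = np.asarray(a).ravel()
    perm = np.argsort(a, kind='mergesort')     # stable sort keeps first occurrences first
    s = a[perm]
    flag = np.empty(len(s), dtype=bool)        # True where a run of equal values starts
    flag[0] = True
    flag[1:] = s[1:] != s[:-1]
    uniques = s[flag]
    first_index = perm[flag]                   # ~ return_index
    inverse = np.empty(len(a), dtype=np.intp)  # ~ return_inverse
    inverse[perm] = np.cumsum(flag) - 1
    counts = np.diff(np.append(np.nonzero(flag)[0], len(s)))  # ~ return_counts
    return uniques, first_index, inverse, counts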
> Yeah, I looked at the numpy implementation, and it seems these speed
> differences are simply the result of the extra O(N) costs involved, so my
> implementation would have the same characteristics. If another array copy
> or two has a meaningful impact on performance, then you are done as far as
> optimization within numpy is concerned, I'd say. You could fuse those
> loops at the C level, but as you said, I think it's better to think about
> these kinds of optimizations once we have a more complete picture of the
> functionality we want.
>
> Good call on removing the extra argsort; that hadn't occurred to me.
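Finally, a small sketch of the kind of one-line grouping discussed in the thread,
built only from plain ndarray operations (group_mean is a made-up name, not an
existing numpy or pandas function; it uses np.add.reduceat over a sorted copy, which
is just one possible design):

import numpy as np

def group_mean(keys, values):
    # Mean of `values` per unique key, from a single argsort plus reduceat.
    keys = np.asarray(keys)
    values = np.asarray(values, dtype=float)
    order = np.argsort(keys, kind='mergesort')
    sk, sv = keys[order], values[order]
    # starts[i] marks where the i-th group begins in the sorted arrays
    starts = np.concatenate(([0], np.nonzero(sk[1:] != sk[:-1])[0] + 1))
    sums = np.add.reduceat(sv, starts)
    counts = np.diff(np.append(starts, len(sk)))
    return sk[starts], sums / counts

keys = np.array(['a', 'b', 'a', 'c', 'b'])
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
print(group_mean(keys, values))   # (array(['a', 'b', 'c'], ...), array([2. , 3.5, 4. ]))

The same starts array can drive any ufunc reduction (np.minimum, np.maximum, and so
on), which is the sense in which grouping reduces to elementary ndarray operations.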