I should clarify: I am speaking about my implementation; I haven't looked at the numpy implementation for a while, so I'm not sure what it is up to. Note that by 'almost free' we are still talking about three passes over the whole array plus temporary allocations, but I am assuming a use case where the various sorts involved are the dominant cost, which I imagine they are for out-of-cache sorts. Perhaps this isn't too realistic an assumption about the average use case, I don't know; though I suppose it's a reasonable guideline to assume that either the dataset is big, or performance isn't that big a concern in the first place.
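To make that concrete, here is a minimal sketch (illustrative only, not the actual code in numpy or in my branch) of what a sort-based index buys you once the argsort itself has been paid for: the unique values, their counts and their first indices all fall out of a few linear passes plus temporaries.

import numpy as np

def sort_based_index(keys):
    # the argsort is the dominant cost; everything below is a few O(N) passes
    perm = np.argsort(keys, kind='mergesort')           # stable sort
    sorted_keys = keys[perm]                             # pass 1: reorder
    flag = np.concatenate(([True], sorted_keys[1:] != sorted_keys[:-1]))  # pass 2: group boundaries
    starts = np.flatnonzero(flag)
    unique = sorted_keys[starts]                          # pass 3: gather per-group outputs
    counts = np.diff(np.append(starts, keys.size))
    first_index = perm[starts]                            # first occurrence, thanks to stability
    return unique, counts, first_index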
On Thu, Sep 4, 2014 at 7:55 PM, Jaime Fernández del Río <jaime.f...@gmail.com> wrote:

> On Thu, Sep 4, 2014 at 10:39 AM, Eelco Hoogendoorn <hoogendoorn.ee...@gmail.com> wrote:
>
>> On Thu, Sep 4, 2014 at 10:31 AM, Eelco Hoogendoorn <hoogendoorn.ee...@gmail.com> wrote:
>>
>>> On Wed, Sep 3, 2014 at 6:46 PM, Jaime Fernández del Río <jaime.f...@gmail.com> wrote:
>>>
>>>> On Wed, Sep 3, 2014 at 9:33 AM, Jaime Fernández del Río <jaime.f...@gmail.com> wrote:
>>>>
>>>>> On Wed, Sep 3, 2014 at 6:41 AM, Eelco Hoogendoorn <hoogendoorn.ee...@gmail.com> wrote:
>>>>>
>>>>>> Not sure about the hashing. Indeed one can also build an index of a set by means of a hash table, but it's questionable whether this leads to improved performance over performing an argsort. Hashing may have better asymptotic time complexity in theory, but many datasets used in practice are very easy to sort (O(N)-ish), and the time-constant of hashing is higher. More importantly, using a hash table guarantees poor cache behavior for many operations using this index. By contrast, sorting may (but need not) make one random-access pass to build the index, and may (but need not) perform one random access to reorder values for grouping. But insofar as the keys are better behaved than pure random, this coherence will be exploited.
>>>>>
>>>>> If you want to give it a try, this branch of my numpy fork has hash-table-based implementations of unique (with no extra indices) and in1d:
>>>>>
>>>>> https://github.com/jaimefrio/numpy/tree/hash-unique
>>>>>
>>>>> A use case where the hash table is clearly better:
>>>>>
>>>>> In [1]: import numpy as np
>>>>> In [2]: from numpy.lib._compiled_base import _unique, _in1d
>>>>>
>>>>> In [3]: a = np.random.randint(10, size=(10000,))
>>>>> In [4]: %timeit np.unique(a)
>>>>> 1000 loops, best of 3: 258 us per loop
>>>>> In [5]: %timeit _unique(a)
>>>>> 10000 loops, best of 3: 143 us per loop
>>>>> In [6]: %timeit np.sort(_unique(a))
>>>>> 10000 loops, best of 3: 149 us per loop
>>>>>
>>>>> It typically performs between 1.5x and 4x faster than sorting. I haven't profiled it properly to know, but there may be quite a bit of performance left to dig out: type-specific comparison functions, optimizing the starting hash table size based on the size of the array to avoid reinsertions...
>>>>>
>>>>> If getting the elements sorted is a necessity, and the array contains very few or no repeated items, then the hash table approach may even perform worse:
>>>>>
>>>>> In [8]: a = np.random.randint(10000, size=(5000,))
>>>>> In [9]: %timeit np.unique(a)
>>>>> 1000 loops, best of 3: 277 us per loop
>>>>> In [10]: %timeit np.sort(_unique(a))
>>>>> 1000 loops, best of 3: 320 us per loop
>>>>>
>>>>> But the hash table still wins in extracting the unique items only:
>>>>>
>>>>> In [11]: %timeit _unique(a)
>>>>> 10000 loops, best of 3: 187 us per loop
>>>>>
>>>>> Where the hash table shines is in more elaborate situations. If you keep the first index where a value was found, and the number of repeats, in the hash table, you can get return_index and return_counts almost for free, which means you come out an extra 3x faster than with sorting. return_inverse requires a little more trickery, so I won't attempt to quantify the improvement.
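(As a rough plain-Python illustration of the point above, and not the C code in the linked branch: a single pass over the array with a dict can record, for every distinct value, the first index where it was seen and how often it occurs, so return_index and return_counts come along with the unique values; the helper name below is made up.)

import numpy as np

def hash_unique(a):
    # toy dict-based analogue: one pass records first index and count per value
    first_index, counts = {}, {}
    for i, v in enumerate(a.tolist()):
        if v in counts:
            counts[v] += 1
        else:
            first_index[v] = i
            counts[v] = 1
    # output is in first-seen order, i.e. unsorted, like a hash table
    values = np.array(list(counts.keys()), dtype=a.dtype)
    index = np.array([first_index[v] for v in counts], dtype=np.intp)
    count = np.array(list(counts.values()), dtype=np.intp)
    return values, index, count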
>>>>> But I wouldn't be surprised if, after fine-tuning it, there is close to an order of magnitude overall improvement.
>>>>>
>>>>> The speed-up for in1d is also nice:
>>>>>
>>>>> In [16]: a = np.random.randint(1000, size=(1000,))
>>>>> In [17]: b = np.random.randint(1000, size=(500,))
>>>>> In [18]: %timeit np.in1d(a, b)
>>>>> 1000 loops, best of 3: 178 us per loop
>>>>> In [19]: %timeit _in1d(a, b)
>>>>> 10000 loops, best of 3: 30.1 us per loop
>>>>>
>>>>> Of course, there is no point in
>>>>
>>>> Ooops!!! Hit the send button too quick. Not to extend myself too long: if we are going to rethink all of this, we should approach it with an open mind. Still, and this post is not helping with that either, I am afraid that we are discussing implementation details but are missing a broader vision of what we want to accomplish and why. That vision of what numpy's grouping functionality, if any, should be, and how it complements or conflicts with what pandas is providing, should precede anything else. I know I haven't, but has anyone looked at how pandas implements grouping? Their documentation on the subject is well worth a read:
>>>>
>>>> http://pandas.pydata.org/pandas-docs/stable/groupby.html
>>>>
>>>> Does numpy need to replicate this? What/why/how can we add to that?
>>>>
>>>> Jaime
>>>>
>>>> --
>>>> (\__/)
>>>> ( O.o)
>>>> ( > <) Este es Conejo. Copia a Conejo en tu firma y ayúdale en sus planes de dominación mundial.
>>>>
>>> I would certainly not be opposed to having a hashing-based indexing mechanism; I think it would make sense design-wise to have a HashIndex class with the same interface as the rest, and use that subclass in those arraysetops where it makes sense. The 'how to' of indexing and its applications are largely orthogonal, I think (with some tiny performance compromises which are worth the abstraction, imo). For datasets which are not purely random, have many unique items, and which do not fit into cache, I would expect sorting to come out on top, but indeed it depends on the dataset.
>>>
>>> Yeah, the question of how pandas does grouping, and whether we can do better, is a relevant one.
>>>
>>> From what I understand, pandas relies on cython extensions to get vectorized grouping functionality. This is no longer necessary since the introduction of ufuncs in numpy. I don't know how the implementations compare in terms of performance, but I doubt the difference is huge.
>>>
>>> I personally use grouping a lot in my code, and I don't like having to use pandas for it. Most importantly, I don't want to go around creating a dataframe for a single one-line hit-and-run association between keys and values. The permanent association of different types of data and their metadata which pandas offers is, I think, the key difference from numpy, which is all about manipulating just plain ndarrays. Arguably, grouping itself is a pretty elementary manipulation of ndarrays, and adding calls to DataFrame or Series in between a statement that could just be simply group_by(keys).mean(values) feels wrong to me. As does including pandas as a dependency just to use this small piece of functionality.
>>> Grouping is a more general functionality than any particular method of organizing your data.
>>>
>>> In terms of features, adding transformations and filtering might be nice too; I hadn't thought about it, but that is because, unlike the currently implemented features, the need has never arisen for me. I'm only a small sample size, and I don't see any fundamental objection to adding such functionality though. It certainly raises the question as to where to draw the line with pandas; but my rule of thumb is that if you can think of it as an elementary operation on ndarrays, then it probably belongs in numpy.
>>>
>> Oh, I forgot to add: with an indexing mechanism based on sorting, unique values and counts also come 'for free', not counting the O(N) cost of actually creating those arrays. The only time an operation relying on an index incurs another nontrivial amount of overhead is in case its 'rank' or 'inverse' property is used, which invokes another argsort. But for the vast majority of grouping or set operations, these properties are never used.
>>
> That extra argsort is now gone from master:
>
> https://github.com/numpy/numpy/pull/5012
>
> Even with this improvement, returning any index typically makes `np.unique` run at least 2x slower:
>
> In [1]: import numpy as np
> In [2]: a = np.random.randint(100, size=(1000,))
> In [3]: %timeit np.unique(a)
> 10000 loops, best of 3: 37.3 us per loop
> In [4]: %timeit np.unique(a, return_inverse=True)
> 10000 loops, best of 3: 62.1 us per loop
> In [5]: %timeit np.unique(a, return_index=True)
> 10000 loops, best of 3: 72.8 us per loop
> In [6]: %timeit np.unique(a, return_counts=True)
> 10000 loops, best of 3: 56.4 us per loop
> In [7]: %timeit np.unique(a, return_index=True, return_inverse=True, return_counts=True)
> 10000 loops, best of 3: 112 us per loop
>
> --
> (\__/)
> ( O.o)
> ( > <) Este es Conejo. Copia a Conejo en tu firma y ayúdale en sus planes de dominación mundial.
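For reference, the trick that removes the second argsort for return_inverse is roughly the following; this is a sketch in numpy terms rather than the exact code in the PR: the cumulative group ids are scattered back through the sorting permutation instead of being argsorted again.

import numpy as np

def unique_with_inverse(a):
    perm = np.argsort(a, kind='mergesort')
    aux = a[perm]
    flag = np.concatenate(([True], aux[1:] != aux[:-1]))   # True at the start of each group
    inverse = np.empty(a.size, dtype=np.intp)
    inverse[perm] = np.cumsum(flag) - 1                     # group id of each original element, no second argsort
    return aux[flag], inverse                                # aux[flag][inverse] reconstructs a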
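And to make the 'elementary operation on ndarrays' argument concrete: a grouped mean needs nothing more than one argsort plus a reduceat, which is roughly what a group_by(keys).mean(values) one-liner would do under the hood with a sort-based index. The helper below is only a sketch of that idea, not a proposed API.

import numpy as np

def grouped_mean(keys, values):
    # sketch of a grouped mean on plain ndarrays
    perm = np.argsort(keys, kind='mergesort')
    sorted_keys, sorted_vals = keys[perm], values[perm]
    starts = np.flatnonzero(np.concatenate(([True], sorted_keys[1:] != sorted_keys[:-1])))
    sums = np.add.reduceat(sorted_vals, starts)      # one vectorized reduction per group
    counts = np.diff(np.append(starts, keys.size))
    return sorted_keys[starts], sums / counts         # unique keys and their group means

# e.g. grouped_mean(np.array([1, 2, 1, 2, 2]), np.array([1.0, 2.0, 3.0, 4.0, 6.0]))
# -> (array([1, 2]), array([2., 4.]))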
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion