On Thu, Sep 4, 2014 at 10:39 AM, Eelco Hoogendoorn <hoogendoorn.ee...@gmail.com> wrote:
> On Thu, Sep 4, 2014 at 10:31 AM, Eelco Hoogendoorn <hoogendoorn.ee...@gmail.com> wrote:
>
>> On Wed, Sep 3, 2014 at 6:46 PM, Jaime Fernández del Río <jaime.f...@gmail.com> wrote:
>>
>>> On Wed, Sep 3, 2014 at 9:33 AM, Jaime Fernández del Río <jaime.f...@gmail.com> wrote:
>>>
>>>> On Wed, Sep 3, 2014 at 6:41 AM, Eelco Hoogendoorn <hoogendoorn.ee...@gmail.com> wrote:
>>>>
>>>>> Not sure about the hashing. Indeed one can also build an index of a
>>>>> set by means of a hash table, but it's questionable whether this leads
>>>>> to improved performance over performing an argsort. Hashing may have
>>>>> better asymptotic time complexity in theory, but many datasets used in
>>>>> practice are very easy to sort (O(N)-ish), and the time-constant of
>>>>> hashing is higher. But more importantly, using a hash table guarantees
>>>>> poor cache behavior for many operations using this index. By contrast,
>>>>> sorting may (but need not) make one random-access pass to build the
>>>>> index, and may (but need not) perform one random-access pass to reorder
>>>>> values for grouping. But insofar as the keys are better behaved than
>>>>> pure random, this coherence will be exploited.
>>>>
>>>> If you want to give it a try, this branch of my numpy fork has hash
>>>> table based implementations of unique (with no extra indices) and in1d:
>>>>
>>>> https://github.com/jaimefrio/numpy/tree/hash-unique
>>>>
>>>> A use case where the hash table is clearly better:
>>>>
>>>> In [1]: import numpy as np
>>>> In [2]: from numpy.lib._compiled_base import _unique, _in1d
>>>>
>>>> In [3]: a = np.random.randint(10, size=(10000,))
>>>> In [4]: %timeit np.unique(a)
>>>> 1000 loops, best of 3: 258 us per loop
>>>> In [5]: %timeit _unique(a)
>>>> 10000 loops, best of 3: 143 us per loop
>>>> In [6]: %timeit np.sort(_unique(a))
>>>> 10000 loops, best of 3: 149 us per loop
>>>>
>>>> It typically performs between 1.5x and 4x faster than sorting. I
>>>> haven't profiled it properly to know, but there may be quite a bit of
>>>> performance left to dig out: adding type-specific comparison functions,
>>>> optimizing the starting hash table size based on the size of the array
>>>> to avoid reinsertions...
>>>>
>>>> If getting the elements sorted is a necessity, and the array contains
>>>> very few or no repeated items, then the hash table approach may even
>>>> perform worse:
>>>>
>>>> In [8]: a = np.random.randint(10000, size=(5000,))
>>>> In [9]: %timeit np.unique(a)
>>>> 1000 loops, best of 3: 277 us per loop
>>>> In [10]: %timeit np.sort(_unique(a))
>>>> 1000 loops, best of 3: 320 us per loop
>>>>
>>>> But the hash table still wins in extracting the unique items only:
>>>>
>>>> In [11]: %timeit _unique(a)
>>>> 10000 loops, best of 3: 187 us per loop
>>>>
>>>> Where the hash table shines is in more elaborate situations. If you
>>>> keep the first index where each item was found, and the number of
>>>> repeats, in the hash table, you can get return_index and return_counts
>>>> almost for free, which means you come out an extra 3x faster than with
>>>> sorting. return_inverse requires a little more trickery, so I won't
>>>> attempt to quantify the improvement. But I wouldn't be surprised if,
>>>> after fine tuning it, there is close to an order of magnitude overall
>>>> improvement.
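To make that bookkeeping concrete, here is a rough pure-Python sketch of the
kind of single pass being described. It is purely illustrative: the
hash-unique branch itself uses a C hash table and its internals may differ,
and the function name below is made up for this example.

    import numpy as np

    def unique_with_index_and_counts(a):
        # One scan over the data: per value, remember the index of its
        # first occurrence and how many times it has been seen. This is
        # how return_index and return_counts come 'almost for free'.
        first_index = {}
        counts = {}
        order = []                      # values in order of first appearance
        for i, x in enumerate(a.tolist()):
            if x in counts:
                counts[x] += 1
            else:
                first_index[x] = i
                counts[x] = 1
                order.append(x)
        values = np.array(order)
        index = np.array([first_index[v] for v in order])
        cnt = np.array([counts[v] for v in order])
        return values, index, cnt

    a = np.random.randint(10, size=(10000,))
    # values come out in first-appearance order; np.sort them if needed
    values, index, cnt = unique_with_index_and_counts(a)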
>>>> The speed-up for in1d is also nice:
>>>>
>>>> In [16]: a = np.random.randint(1000, size=(1000,))
>>>> In [17]: b = np.random.randint(1000, size=(500,))
>>>> In [18]: %timeit np.in1d(a, b)
>>>> 1000 loops, best of 3: 178 us per loop
>>>> In [19]: %timeit _in1d(a, b)
>>>> 10000 loops, best of 3: 30.1 us per loop
>>>>
>>>> Of course, there is no point in
>>>
>>> Ooops!!! Hit the send button too quick. Not to extend myself too long:
>>> if we are going to rethink all of this, we should approach it with an
>>> open mind. Still, and this post is not helping with that either, I am
>>> afraid that we are discussing implementation details, but are missing a
>>> broader vision of what we want to accomplish and why. That vision of
>>> what numpy's grouping functionality, if any, should be, and how it
>>> complements or conflicts with what pandas is providing, should precede
>>> anything else. I know I haven't, but has anyone looked at how pandas
>>> implements grouping? Their documentation on the subject is well worth a
>>> read:
>>>
>>> http://pandas.pydata.org/pandas-docs/stable/groupby.html
>>>
>>> Does numpy need to replicate this? What/why/how can we add to that?
>>>
>>> Jaime
>>
>> I would certainly not be opposed to having a hashing based indexing
>> mechanism; I think it would make sense design-wise to have a HashIndex
>> class with the same interface as the rest, and use that subclass in those
>> arraysetops where it makes sense. The 'how to' of indexing and its
>> applications are largely orthogonal, I think (with some tiny performance
>> compromises which are worth the abstraction, imo). For datasets which are
>> not purely random, have many unique items, and which do not fit into
>> cache, I would expect sorting to come out on top, but indeed it depends
>> on the dataset.
>>
>> Yeah, the question of how pandas does grouping, and whether we can do
>> better, is a relevant one.
>>
>> From what I understand, pandas relies on cython extensions to get
>> vectorized grouping functionality. This is no longer necessary since the
>> introduction of ufuncs in numpy. I don't know how the implementations
>> compare in terms of performance, but I doubt the difference is huge.
>>
>> I personally use grouping a lot in my code, and I don't like having to
>> use pandas for it. Most importantly, I don't want to go around creating a
>> dataframe for a single one-line hit-and-run association between keys and
>> values. The permanent association of different types of data and their
>> metadata which pandas offers is, I think, the key difference from numpy,
>> which is all about manipulating just plain ndarrays. Arguably, grouping
>> itself is a pretty elementary manipulation of ndarrays, and adding calls
>> to DataFrame or Series in between a statement that could simply be
>> group_by(keys).mean(values) feels wrong to me. As does including pandas
>> as a dependency just to use this small piece of functionality. Grouping
>> is a more general functionality than any particular method of organizing
>> your data.
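As a point of reference, the group_by(keys).mean(values) spelling above is
the proposed interface rather than anything that exists in numpy today. A
minimal sort-based sketch of that operation, using only current numpy (the
function name is invented for the example), could look something like this:

    import numpy as np

    def group_mean(keys, values):
        # Build the index with one (stable) argsort of the keys, then
        # reduce each contiguous run of equal keys with np.add.reduceat.
        order = np.argsort(keys, kind='mergesort')
        sorted_keys = keys[order]
        sorted_vals = values[order]
        # start offset of each group of equal keys
        starts = np.concatenate(
            ([0], np.flatnonzero(sorted_keys[1:] != sorted_keys[:-1]) + 1))
        unique_keys = sorted_keys[starts]
        sums = np.add.reduceat(sorted_vals, starts)
        counts = np.diff(np.concatenate((starts, [len(keys)])))
        return unique_keys, sums / counts

    keys = np.array([2, 0, 1, 0, 2, 2])
    values = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    print(group_mean(keys, values))  # -> (array([0, 1, 2]), array([3., 3., 4.]))

Other reductions drop in by swapping the ufunc (np.minimum.reduceat,
np.maximum.reduceat, ...), which is the sense in which grouping is an
elementary operation on ndarrays.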
>> In terms of features, adding transformations and filtering might be nice
>> too; I hadn't thought about it, but that is because, unlike the currently
>> implemented features, the need has never arisen for me. I'm only a small
>> sample size, and I don't see any fundamental objection to adding such
>> functionality though. It certainly raises the question as to where to
>> draw the line with pandas; but my rule of thumb is that if you can think
>> of it as an elementary operation on ndarrays, then it probably belongs in
>> numpy.
>
> Oh I forgot to add: with an indexing mechanism based on sorting, unique
> values and counts also come 'for free', not counting the O(N) cost of
> actually creating those arrays. The only time an operation relying on an
> index incurs another nontrivial amount of overhead is in case its 'rank'
> or 'inverse' property is used, which invokes another argsort. But for the
> vast majority of grouping or set operations, these properties are never
> used.

That extra argsort is now gone from master:
https://github.com/numpy/numpy/pull/5012

Even with this improvement, returning any index typically makes `np.unique`
run at least 2x slower:

In [1]: import numpy as np

In [2]: a = np.random.randint(100, size=(1000,))

In [3]: %timeit np.unique(a)
10000 loops, best of 3: 37.3 us per loop

In [4]: %timeit np.unique(a, return_inverse=True)
10000 loops, best of 3: 62.1 us per loop

In [5]: %timeit np.unique(a, return_index=True)
10000 loops, best of 3: 72.8 us per loop

In [6]: %timeit np.unique(a, return_counts=True)
10000 loops, best of 3: 56.4 us per loop

In [7]: %timeit np.unique(a, return_index=True, return_inverse=True, return_counts=True)
10000 loops, best of 3: 112 us per loop
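For comparison with the numbers above, the sort-based route really does get
the counts nearly for free once the sort itself has been paid for. A minimal
sketch (the function name is invented for the example):

    import numpy as np

    def unique_counts_via_sort(a):
        # After sorting, group boundaries are simply the positions where
        # consecutive values differ; counts are the gaps between boundaries.
        s = np.sort(a)
        starts = np.concatenate(([0], np.flatnonzero(s[1:] != s[:-1]) + 1))
        counts = np.diff(np.concatenate((starts, [s.size])))
        return s[starts], counts

    a = np.random.randint(100, size=(1000,))
    values, counts = unique_counts_via_sort(a)
    assert counts.sum() == a.size

Using np.argsort instead of np.sort additionally gives indices back into the
original array (order[starts]), which is roughly what return_index adds.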
-- 
(\__/)
( O.o)
( > <) This is Conejo. Copy Conejo into your signature and help him with his
plans for world domination.

_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion