Yes, I was aware of that. But the point would be to provide true vectorization on those operations.
The way I see it, numpy may not have to have a GroupBy implementation, but it should at least enable implementing one that is fast and efficient over any axis. On Wed, Aug 27, 2014 at 12:38 PM, Eelco Hoogendoorn < hoogendoorn.ee...@gmail.com> wrote: > i.e, if the grouped axis is small but the other axes are not, you could > write this, which avoids the python loop over the long axis that > np.vectorize would otherwise perform. > > import numpy as np > from grouping import group_by > keys = np.random.randint(0,4,10) > values = np.random.rand(10,2000) > for k,g in zip(*group_by(keys)(values)): > print k, g.mean(0) > > > > > On Wed, Aug 27, 2014 at 9:29 PM, Eelco Hoogendoorn < > hoogendoorn.ee...@gmail.com> wrote: > >> f.i., this works as expected as well (100 keys of 1d int arrays and 100 >> values of 1d float arrays): >> >> group_by(randint(0,4,(100,2))).mean(rand(100,2)) >> >> >> On Wed, Aug 27, 2014 at 9:27 PM, Eelco Hoogendoorn < >> hoogendoorn.ee...@gmail.com> wrote: >> >>> If I understand you correctly, the current implementation supports these >>> operations. All reductions over groups (except for median) are performed >>> through the corresponding ufunc (see GroupBy.reduce). This works on >>> multidimensional arrays as well, although this broadcasting over the >>> non-grouping axes is accomplished using np.vectorize. Actual vectorization >>> only happens over the axis being grouped over, but this is usually a long >>> axis. If it isn't, it is more efficient to perform a reduction by means of >>> splitting the array by its groups first, and then map the iterable of >>> groups over some reduction operation (as noted in the docstring of >>> GroupBy.reduce). >>> >>> >>> On Wed, Aug 27, 2014 at 8:29 PM, Jaime Fernández del Río < >>> jaime.f...@gmail.com> wrote: >>> >>>> Hi Eelco, >>>> >>>> I took a deeper look into your code a couple of weeks back. I don't >>>> think I have fully grasped what it allows completely, but I agree that some >>>> form of what you have there is highly desirable. Along the same lines, for >>>> sometime I have been thinking that the right place for a `groupby` in numpy >>>> is as a method of ufuncs, so that `np.add.groupby(arr, groups)` would do a >>>> multidimensional version of `np.bincount(groups, weights=arr)`. You would >>>> then need a more powerful version of `np.unique` to produce the `groups`, >>>> but that is something that Joe Kington's old PR was very close to >>>> achieving, that should probably be resurrected as well. But yes, there >>>> seems to be material for a NEP here, and some guidance from one of the >>>> numpy devs would be helpful in getting this somewhere. >>>> >>>> Jaime >>>> >>>> >>>> On Wed, Aug 27, 2014 at 10:35 AM, Eelco Hoogendoorn < >>>> hoogendoorn.ee...@gmail.com> wrote: >>>> >>>>> It wouldn't hurt to have this function, but my intuition is that its >>>>> use will be minimal. If you are already working with sorted arrays, you >>>>> already have a flop cost on that order of magnitude, and the optimized >>>>> merge saves you a factor two at the very most. Using numpy means you are >>>>> sacrificing factors of two and beyond relative to pure C left right and >>>>> center anyway, so if this kind of thing matters to you, you probably wont >>>>> be working in numpy in the first place. >>>>> >>>>> That said, I share your interest in overhauling arraysetops. I see >>>>> many opportunities for expanding its functionality. There is a question >>>>> that amounts to 'how do I do group-by in numpy' on stackoverflow almost >>>>> every week. That would have my top priority, but also things like >>>>> extending >>>>> np.unique to things like graph edges, or other more complex input, is very >>>>> often useful to me. >>>>> >>>>> Ive written up a draft <http://pastebin.com/c5WLWPbp>a while ago >>>>> which accomplishes all of the above and more. It reimplements functions >>>>> like np.unique around a common Index object. This index object >>>>> encapsulates >>>>> the precomputation (sorting) required for efficient set-ops on different >>>>> datatypes, and provides a common interface to obtain the kind of >>>>> information you are talking about (which is used extensively internally in >>>>> the implementation of group_by, for instance). >>>>> >>>>> ie, this functionality allows you to write neat things like >>>>> group_by(randint(0,9,(100,2))).median(rand(100)) >>>>> >>>>> But I have the feeling much more could be done in this direction, and >>>>> I feel this draft could really use a bit of back and forth. If we are >>>>> going >>>>> to completely rewrite arraysetops, we might as well do it right. >>>>> >>>>> >>>>> On Wed, Aug 27, 2014 at 7:02 PM, Jaime Fernández del Río < >>>>> jaime.f...@gmail.com> wrote: >>>>> >>>>>> A request was open in github to add a `merge` function to numpy that >>>>>> would merge two sorted 1d arrays into a single sorted 1d array. I have >>>>>> been >>>>>> playing around with that idea for a while, and have a branch in my numpy >>>>>> fork that adds a `mergesorted` function to `numpy.lib`: >>>>>> >>>>>> >>>>>> https://github.com/jaimefrio/numpy/commit/ce5d480afecc989a36e5d2bf4ea1d1ba58a83b0a >>>>>> >>>>>> I drew inspiration from C++ STL algorithms, and merged into a single >>>>>> function what merge, set_union, set_intersection, set_difference and >>>>>> set_symmetric_difference do there. >>>>>> >>>>>> My first thought when implementing this was to not make it a public >>>>>> function, but use it under the hood to speed-up some of the functions of >>>>>> `arraysetops.py`, which are now merging two already sorted functions by >>>>>> doing `np.sort(np.concatenate((a, b)))`. I would need to revisit my >>>>>> testing, but the speed-ups weren't that great. >>>>>> >>>>>> One other thing I saw value in for some of the `arraysetops.py` >>>>>> functions, but couldn't fully figure out, was in providing extra output >>>>>> aside from the merged arrays, either in the form of indices, or of >>>>>> boolean >>>>>> masks, indicating which items of the original arrays made it into the >>>>>> merged one, and/or where did they end up in it. >>>>>> >>>>>> Since there is at least one other person out there that likes it, is >>>>>> there any more interest in such a function? If yes, any comments on what >>>>>> the proper interface for extra output should be? Although perhaps the >>>>>> best >>>>>> is to leave that out for starters and see what use people make of it, if >>>>>> any. >>>>>> >>>>>> Jaime >>>>>> >>>>>> -- >>>>>> (\__/) >>>>>> ( O.o) >>>>>> ( > <) Este es Conejo. Copia a Conejo en tu firma y ayúdale en sus >>>>>> planes de dominación mundial. >>>>>> >>>>>> _______________________________________________ >>>>>> NumPy-Discussion mailing list >>>>>> NumPy-Discussion@scipy.org >>>>>> http://mail.scipy.org/mailman/listinfo/numpy-discussion >>>>>> >>>>>> >>>>> >>>>> _______________________________________________ >>>>> NumPy-Discussion mailing list >>>>> NumPy-Discussion@scipy.org >>>>> http://mail.scipy.org/mailman/listinfo/numpy-discussion >>>>> >>>>> >>>> >>>> >>>> -- >>>> (\__/) >>>> ( O.o) >>>> ( > <) Este es Conejo. Copia a Conejo en tu firma y ayúdale en sus >>>> planes de dominación mundial. >>>> >>>> _______________________________________________ >>>> NumPy-Discussion mailing list >>>> NumPy-Discussion@scipy.org >>>> http://mail.scipy.org/mailman/listinfo/numpy-discussion >>>> >>>> >>> >> > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion@scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > -- (\__/) ( O.o) ( > <) Este es Conejo. Copia a Conejo en tu firma y ayúdale en sus planes de dominación mundial.
_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion