Hi Eelco, I took a deeper look into your code a couple of weeks back. I don't think I have fully grasped what it allows completely, but I agree that some form of what you have there is highly desirable. Along the same lines, for sometime I have been thinking that the right place for a `groupby` in numpy is as a method of ufuncs, so that `np.add.groupby(arr, groups)` would do a multidimensional version of `np.bincount(groups, weights=arr)`. You would then need a more powerful version of `np.unique` to produce the `groups`, but that is something that Joe Kington's old PR was very close to achieving, that should probably be resurrected as well. But yes, there seems to be material for a NEP here, and some guidance from one of the numpy devs would be helpful in getting this somewhere.
Jaime On Wed, Aug 27, 2014 at 10:35 AM, Eelco Hoogendoorn < hoogendoorn.ee...@gmail.com> wrote: > It wouldn't hurt to have this function, but my intuition is that its use > will be minimal. If you are already working with sorted arrays, you already > have a flop cost on that order of magnitude, and the optimized merge saves > you a factor two at the very most. Using numpy means you are sacrificing > factors of two and beyond relative to pure C left right and center anyway, > so if this kind of thing matters to you, you probably wont be working in > numpy in the first place. > > That said, I share your interest in overhauling arraysetops. I see many > opportunities for expanding its functionality. There is a question that > amounts to 'how do I do group-by in numpy' on stackoverflow almost every > week. That would have my top priority, but also things like extending > np.unique to things like graph edges, or other more complex input, is very > often useful to me. > > Ive written up a draft <http://pastebin.com/c5WLWPbp>a while ago which > accomplishes all of the above and more. It reimplements functions like > np.unique around a common Index object. This index object encapsulates the > precomputation (sorting) required for efficient set-ops on different > datatypes, and provides a common interface to obtain the kind of > information you are talking about (which is used extensively internally in > the implementation of group_by, for instance). > > ie, this functionality allows you to write neat things like > group_by(randint(0,9,(100,2))).median(rand(100)) > > But I have the feeling much more could be done in this direction, and I > feel this draft could really use a bit of back and forth. If we are going > to completely rewrite arraysetops, we might as well do it right. > > > On Wed, Aug 27, 2014 at 7:02 PM, Jaime Fernández del Río < > jaime.f...@gmail.com> wrote: > >> A request was open in github to add a `merge` function to numpy that >> would merge two sorted 1d arrays into a single sorted 1d array. I have been >> playing around with that idea for a while, and have a branch in my numpy >> fork that adds a `mergesorted` function to `numpy.lib`: >> >> >> https://github.com/jaimefrio/numpy/commit/ce5d480afecc989a36e5d2bf4ea1d1ba58a83b0a >> >> I drew inspiration from C++ STL algorithms, and merged into a single >> function what merge, set_union, set_intersection, set_difference and >> set_symmetric_difference do there. >> >> My first thought when implementing this was to not make it a public >> function, but use it under the hood to speed-up some of the functions of >> `arraysetops.py`, which are now merging two already sorted functions by >> doing `np.sort(np.concatenate((a, b)))`. I would need to revisit my >> testing, but the speed-ups weren't that great. >> >> One other thing I saw value in for some of the `arraysetops.py` >> functions, but couldn't fully figure out, was in providing extra output >> aside from the merged arrays, either in the form of indices, or of boolean >> masks, indicating which items of the original arrays made it into the >> merged one, and/or where did they end up in it. >> >> Since there is at least one other person out there that likes it, is >> there any more interest in such a function? If yes, any comments on what >> the proper interface for extra output should be? Although perhaps the best >> is to leave that out for starters and see what use people make of it, if >> any. >> >> Jaime >> >> -- >> (\__/) >> ( O.o) >> ( > <) Este es Conejo. Copia a Conejo en tu firma y ayúdale en sus planes >> de dominación mundial. >> >> _______________________________________________ >> NumPy-Discussion mailing list >> NumPy-Discussion@scipy.org >> http://mail.scipy.org/mailman/listinfo/numpy-discussion >> >> > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion@scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > -- (\__/) ( O.o) ( > <) Este es Conejo. Copia a Conejo en tu firma y ayúdale en sus planes de dominación mundial.
_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion