Thanks. Prompted by that stackoverflow question, and by similar problems I had to deal with myself, I started working on a much more general extension to numpy's functionality in this space. As you noted, things get a little pandas-y, but I think a lot of pandas' functionality could or should be part of the numpy core, a robust set of grouping operations in particular.
See the pastebin here: http://pastebin.com/c5WLWPbp

I've posted about it on this list before, but without apparent interest, and I haven't gotten around to getting it up to professional standards yet either. But there is a lot more that could be done in this direction.

Note that the count functionality in the stackoverflow answer is relatively indirect and inefficient, using the inverse_index and such. A much more efficient method is obtained by the code used here.

On Tue, Aug 12, 2014 at 5:57 PM, Warren Weckesser <[email protected]> wrote:

> On Tue, Aug 12, 2014 at 11:35 AM, Warren Weckesser <[email protected]> wrote:
>
>> I created a pull request (https://github.com/numpy/numpy/pull/4958) that
>> defines the function `count_unique`. `count_unique` generates a
>> contingency table from a collection of sequences. For example,
>>
>> In [7]: x = [1, 1, 1, 1, 2, 2, 2, 2, 2]
>>
>> In [8]: y = [3, 4, 3, 3, 3, 4, 5, 5, 5]
>>
>> In [9]: (xvals, yvals), counts = count_unique(x, y)
>>
>> In [10]: xvals
>> Out[10]: array([1, 2])
>>
>> In [11]: yvals
>> Out[11]: array([3, 4, 5])
>>
>> In [12]: counts
>> Out[12]:
>> array([[3, 1, 0],
>>        [1, 1, 3]])
>>
>>
>> It can be interpreted as a multi-argument generalization of `np.unique(x,
>> return_counts=True)`.
>>
>> It overlaps with Pandas' `crosstab`, but I think this is a pretty
>> fundamental counting operation that fits in numpy.
>>
>> Matlab's `crosstab` (http://www.mathworks.com/help/stats/crosstab.html)
>> and R's `table` perform the same calculation (with a few more bells and
>> whistles).
>>
>>
>> For comparison, here's Pandas' `crosstab` (same `x` and `y` as above):
>>
>> In [28]: import pandas as pd
>>
>> In [29]: xs = pd.Series(x)
>>
>> In [30]: ys = pd.Series(y)
>>
>> In [31]: pd.crosstab(xs, ys)
>> Out[31]:
>> col_0  3  4  5
>> row_0
>> 1      3  1  0
>> 2      1  1  3
>>
>>
>> And here is R's `table`:
>>
>> > x <- c(1,1,1,1,2,2,2,2,2)
>> > y <- c(3,4,3,3,3,4,5,5,5)
>> > table(x, y)
>>    y
>> x   3 4 5
>>   1 3 1 0
>>   2 1 1 3
>>
>>
>> Is there any interest in adding this (or some variation of it) to numpy?
>>
>>
>> Warren
>
> While searching StackOverflow in the numpy tag for "count unique", I just
> discovered that I basically reinvented Eelco Hoogendoorn's code in his
> answer to
> http://stackoverflow.com/questions/10741346/numpy-frequency-counts-for-unique-values-in-an-array.
> Nice one, Eelco!
>
> Warren
>
> _______________________________________________
> NumPy-Discussion mailing list
> [email protected]
> http://mail.scipy.org/mailman/listinfo/numpy-discussion
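
For reference, the contingency table in the quoted example can be reproduced with functions numpy already has. This is only a minimal sketch. The name `contingency_table` is hypothetical, and this is neither the `count_unique` implementation from the pull request nor the pastebin code. It takes the inverse-index route (the relatively indirect one), restricted to two 1-D sequences.

```python
import numpy as np


def contingency_table(x, y):
    """Count co-occurrences of values in two equal-length 1-D sequences.

    Hypothetical helper for illustration only; not the PR's count_unique.
    """
    x = np.asarray(x)
    y = np.asarray(y)
    if x.ndim != 1 or x.shape != y.shape:
        raise ValueError("x and y must be 1-D sequences of equal length")

    # Map every element to the position of its value among the sorted uniques.
    xvals, xidx = np.unique(x, return_inverse=True)
    yvals, yidx = np.unique(y, return_inverse=True)

    # Encode each (row, column) pair as one flat index, then count the indices.
    shape = (len(xvals), len(yvals))
    flat = np.ravel_multi_index((xidx, yidx), shape)
    counts = np.bincount(flat, minlength=shape[0] * shape[1])
    return (xvals, yvals), counts.reshape(shape)


x = [1, 1, 1, 1, 2, 2, 2, 2, 2]
y = [3, 4, 3, 3, 3, 4, 5, 5, 5]
(xvals, yvals), counts = contingency_table(x, y)
print(xvals, yvals)
print(counts)
```

With the `x` and `y` from the quoted message, `counts` comes out as the same 2x3 table shown for `count_unique`, `pd.crosstab`, and R's `table`.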
