On Mon, Apr 16, 2012 at 8:08 PM, Tony Yu <[email protected]> wrote:
>
>
> On Mon, Apr 16, 2012 at 6:01 PM, Skipper Seabold <[email protected]>
> wrote:
>>
>> On Mon, Apr 16, 2012 at 5:51 PM, Tony Yu <[email protected]> wrote:
>> >
>> >
>> > On Mon, Apr 16, 2012 at 5:27 PM, Skipper Seabold <[email protected]>
>> > wrote:
>> >>
>> >> Hi,
>> >>
>> >> I have a pull request here [1] to add a cut function similar to R's
>> >> [2]. It seems there are often requests for similar functionality. It's
>> >> something I'm making use of for my own work and would like to use in
>> >> statsmodels and in generating instances of pandas' Factor class, but
>> >> is this generally something people would find useful to warrant its
>> >> inclusion in numpy? It will be even more useful I think with an enum
>> >> dtype in numpy.
>> >>
>> >> If you aren't familiar with cut, here's a potential use case. Going
>> >> from a continuous to a categorical variable.
>> >>
>> >> Given a continuous variable
>> >>
>> >> [~/]
>> >> [8]: age = np.random.randint(15,70, size=100)
>> >>
>> >> [~/]
>> >> [9]: age
>> >> [9]:
>> >> array([58, 32, 20, 25, 34, 69, 52, 27, 20, 23, 51, 61, 39, 54, 39, 44,
>> >> 27,
>> >> 17, 29, 18, 66, 25, 44, 21, 54, 32, 50, 60, 25, 41, 68, 25, 42,
>> >> 69,
>> >> 50, 69, 24, 69, 69, 48, 30, 20, 18, 15, 50, 48, 44, 27, 57, 52,
>> >> 40,
>> >> 27, 58, 45, 44, 32, 54, 19, 36, 32, 55, 17, 55, 15, 19, 29, 22,
>> >> 25,
>> >> 36, 44, 29, 53, 37, 31, 51, 39, 21, 66, 25, 26, 20, 17, 41, 50,
>> >> 27,
>> >> 23, 62, 69, 65, 34, 38, 61, 39, 34, 38, 35, 18, 36, 29, 26])
>> >>
>> >> Give me a variable where people are in age groups (lower bound is not
>> >> inclusive)
>> >>
>> >> [~/]
>> >> [10]: groups = [14, 25, 35, 45, 55, 70]
>> >>
>> >> [~/]
>> >> [11]: age_cat = np.cut(age, groups)
>> >>
>> >> [~/]
>> >> [12]: age_cat
>> >> [12]:
>> >> array([5, 2, 1, 1, 2, 5, 4, 2, 1, 1, 4, 5, 3, 4, 3, 3, 2, 1, 2, 1, 5,
>> >> 1,
>> >> 3,
>> >> 1, 4, 2, 4, 5, 1, 3, 5, 1, 3, 5, 4, 5, 1, 5, 5, 4, 2, 1, 1, 1, 4,
>> >> 4,
>> >> 3, 2, 5, 4, 3, 2, 5, 3, 3, 2, 4, 1, 3, 2, 4, 1, 4, 1, 1, 2, 1, 1,
>> >> 3,
>> >> 3, 2, 4, 3, 2, 4, 3, 1, 5, 1, 2, 1, 1, 3, 4, 2, 1, 5, 5, 5, 2, 3,
>> >> 5,
>> >> 3, 2, 3, 2, 1, 3, 2, 2])
>> >>
>> >> Skipper
>> >>
>> >> [1] https://github.com/numpy/numpy/pull/248
>> >> [2] http://stat.ethz.ch/R-manual/R-devel/library/base/html/cut.html
>> >
>> >
>> > Is this the same as `np.searchsorted` (with reversed arguments)?
>> >
>> > In [292]: np.searchsorted(groups, age)
>> > Out[292]:
>> > array([5, 2, 1, 1, 2, 5, 4, 2, 1, 1, 4, 5, 3, 4, 3, 3, 2, 1, 2, 1, 5, 1,
>> > 3,
>> > 1, 4, 2, 4, 5, 1, 3, 5, 1, 3, 5, 4, 5, 1, 5, 5, 4, 2, 1, 1, 1, 4,
>> > 4,
>> > 3, 2, 5, 4, 3, 2, 5, 3, 3, 2, 4, 1, 3, 2, 4, 1, 4, 1, 1, 2, 1, 1,
>> > 3,
>> > 3, 2, 4, 3, 2, 4, 3, 1, 5, 1, 2, 1, 1, 3, 4, 2, 1, 5, 5, 5, 2, 3,
>> > 5,
>> > 3, 2, 3, 2, 1, 3, 2, 2])
>> >
>>
>> That's news to me, and I don't know how I missed it.
>
>
> Actually, the only reason I remember searchsorted is because I also
> implemented a variant of it before finding that it existed.
>
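
The equivalence suggested in the quoted exchange can be checked directly. What follows is only a minimal sketch, assuming a NumPy version in which `np.digitize` accepts the `right` keyword. The seeded random data stand in for the `age` array above and are not the original values.

```python
import numpy as np

# Stand-in data (an assumption): same range as in the thread, different values.
rng = np.random.default_rng(0)
age = rng.integers(15, 70, size=100)
groups = [14, 25, 35, 45, 55, 70]

# Default side='left': index i such that groups[i-1] < age <= groups[i].
by_search = np.searchsorted(groups, age)

# digitize takes the data first; right=True closes each interval on the right.
by_digitize = np.digitize(age, groups, right=True)

print(np.array_equal(by_search, by_digitize))
```

Both calls leave the lower bound of each group out of that group, which matches the "lower bound is not inclusive" behavior described for `np.searchsorted(groups, age)` in the thread.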
It's certainly not an obvious name for the behavior I wanted, at least with
my background. I.e., I want something that works on the data, not the
bins/groups. And it's not referenced in histogram or digitize, though now
that I wade back through some threads I see people pointing to it. From a
quick look, it also appears to be faster than my implementation with
digitize.

>>
>> It looks like
>> there is overlap, but cut will also do binning for equal width
>> categorization
>>
>> [~/]
>> [21]: np.cut(age, 6)
>> [21]:
>> array([5, 2, 1, 2, 3, 6, 5, 2, 1, 1, 4, 6, 3, 5, 3, 4, 2, 1, 2, 1, 6, 2,
>> 4,
>> 1, 5, 2, 4, 5, 2, 3, 6, 2, 3, 6, 4, 6, 1, 6, 6, 4, 2, 1, 1, 1, 4, 4,
>> 4, 2, 5, 5, 3, 2, 5, 4, 4, 2, 5, 1, 3, 2, 5, 1, 5, 1, 1, 2, 1, 2, 3,
>> 4, 2, 5, 3, 2, 4, 3, 1, 6, 2, 2, 1, 1, 3, 4, 2, 1, 6, 6, 6, 3, 3, 6,
>> 3, 3, 3, 3, 1, 3, 2, 2])
>>
>> and explicitly handles the case with constant x
>>
>> [~/]
>> [26]: x = np.ones(100)*6
>>
>> [~/]
>> [27]: np.cut(x, 5)
>> [27]:
>> array([3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
>> 3,
>> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
>> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
>> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
>> 3, 3, 3, 3, 3, 3, 3, 3])
>>
>> I guess I could patch searchsorted. Thoughts?
>>
>> Skipper
>
>
> Hmm, ... I'm not sure if these other call signatures map as well to the name
> "searchsorted"; i.e. "cut" makes more sense in these cases.
>
> On the other hand, it seems these cases could be handled by `np.digitize`
> (although they aren't currently). Hmm,... why doesn't the above call to
> `cut` match (what I assume to be) the equivalent call to `np.digitize`:
>
> In [302]: np.digitize(age, np.linspace(age.min(), age.max(), 6))
> Out[302]:
> array([4, 2, 1, 1, 2, 6, 4, 2, 1, 1, 4, 5, 3, 4, 3, 3, 2, 1, 2, 1, 5, 1, 3,
> 1, 4, 2, 4, 5, 1, 3, 5, 1, 3, 6, 4, 6, 1, 6, 6, 4, 2, 1, 1, 1, 4, 4,
> 3, 2, 4, 4, 3, 2, 4, 3, 3, 2, 4, 1, 2, 2, 4, 1, 4, 1, 1, 2, 1, 1, 2,
> 3, 2, 4, 3, 2, 4, 3, 1, 5, 1, 2, 1, 1, 3, 4, 2, 1, 5, 6, 5, 2, 3, 5,
> 3, 2, 3, 2, 1, 2, 2, 2])
>
> It's unfortunate that `digitize` and `histogram` have one call signature,
> but `searchsorted` has the reverse; in that sense, I like `cut` better.
>

I actually extended digitize to work the way I wanted, with the sole
intention of implementing cut.

https://github.com/numpy/numpy/pull/245

I agree about the call signature. As I mentioned, in my workflow I have the
data first and then think about the groups, rather than thinking about
performing an action on the groups themselves. In this way, I still think
having cut is beneficial.

Skipper
_______________________________________________
NumPy-Discussion mailing list
[email protected]
http://mail.scipy.org/mailman/listinfo/numpy-discussion
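
On the question of why the `np.digitize` call does not match `np.cut(age, 6)`: one plausible reading is that `np.linspace(age.min(), age.max(), 6)` yields six edges and so only five intervals, and that `np.digitize` closes intervals on the left by default, so only the maximum lands in bin 6. The sketch below illustrates that reading. `equal_width_codes` is a hypothetical helper, not the `np.cut` from the pull request, and the sample is a handful of ages taken from the array in the thread.

```python
import numpy as np

def equal_width_codes(x, nbins):
    """Hypothetical helper: 1-based codes for nbins equal-width, right-closed bins."""
    x = np.asarray(x)
    # nbins intervals need nbins + 1 edges.
    edges = np.linspace(x.min(), x.max(), nbins + 1)
    codes = np.digitize(x, edges, right=True)
    # The minimum sits on the first edge and gets code 0; fold it into bin 1.
    return np.clip(codes, 1, nbins)

# A few ages from the thread, including the minimum (15) and maximum (69).
sample = np.array([58, 32, 20, 25, 34, 69, 52, 27, 15, 24, 60, 51])

# The call from the thread: six edges, left-closed intervals.
naive = np.digitize(sample, np.linspace(sample.min(), sample.max(), 6))

codes = equal_width_codes(sample, 6)
print(naive)
print(codes)
```

For these sample values the helper's codes agree with the corresponding entries of the `np.cut(age, 6)` output quoted above. The helper does not special-case constant input, which the thread says `cut` handles explicitly.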
