Hi,

I have a pull request here [1] to add a cut function similar to R's
[2]. Requests for this kind of functionality seem to come up often.
It's something I'm making use of for my own work and would like to use
in statsmodels and in generating instances of pandas' Factor class, but
is it generally useful enough to warrant inclusion in numpy? I think it
would be even more useful with an enum dtype in numpy.

If you aren't familiar with cut, here's a potential use case: going
from a continuous to a categorical variable.

Given a continuous variable

[~/]
[8]: age = np.random.randint(15,70, size=100)

[~/]
[9]: age
[9]:
array([58, 32, 20, 25, 34, 69, 52, 27, 20, 23, 51, 61, 39, 54, 39, 44, 27,
       17, 29, 18, 66, 25, 44, 21, 54, 32, 50, 60, 25, 41, 68, 25, 42, 69,
       50, 69, 24, 69, 69, 48, 30, 20, 18, 15, 50, 48, 44, 27, 57, 52, 40,
       27, 58, 45, 44, 32, 54, 19, 36, 32, 55, 17, 55, 15, 19, 29, 22, 25,
       36, 44, 29, 53, 37, 31, 51, 39, 21, 66, 25, 26, 20, 17, 41, 50, 27,
       23, 62, 69, 65, 34, 38, 61, 39, 34, 38, 35, 18, 36, 29, 26])

Give me a variable that puts each person in an age group (the lower
bound of each group is not inclusive)

[~/]
[10]: groups = [14, 25, 35, 45, 55, 70]

[~/]
[11]: age_cat = np.cut(age, groups)

[~/]
[12]: age_cat
[12]:
array([5, 2, 1, 1, 2, 5, 4, 2, 1, 1, 4, 5, 3, 4, 3, 3, 2, 1, 2, 1, 5, 1, 3,
       1, 4, 2, 4, 5, 1, 3, 5, 1, 3, 5, 4, 5, 1, 5, 5, 4, 2, 1, 1, 1, 4, 4,
       3, 2, 5, 4, 3, 2, 5, 3, 3, 2, 4, 1, 3, 2, 4, 1, 4, 1, 1, 2, 1, 1, 3,
       3, 2, 4, 3, 2, 4, 3, 1, 5, 1, 2, 1, 1, 3, 4, 2, 1, 5, 5, 5, 2, 3, 5,
       3, 2, 3, 2, 1, 3, 2, 2])
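
For comparison, here is a minimal sketch of how the same integer codes
could be produced with functions already in numpy. This is not the
implementation from the pull request; it assumes np.digitize with
right=True, which treats each bin as open on the left and closed on
the right, and the helper name cut_codes is made up for illustration.

```python
import numpy as np


def cut_codes(values, edges):
    """Return 1-based bin codes for values, using intervals (lo, hi].

    Hypothetical helper for illustration only; it is not np.cut.
    With right=True, np.digitize returns i such that
    edges[i-1] < x <= edges[i], so the lower bound is not inclusive.
    """
    values = np.asarray(values)
    edges = np.asarray(edges)
    return np.digitize(values, edges, right=True)


# First six ages from the session above
sample = [58, 32, 20, 25, 34, 69]
groups = [14, 25, 35, 45, 55, 70]
print(cut_codes(sample, groups))  # -> [5 2 1 1 2 5]
```

Values at or below the first edge come back as 0 and values above the
last edge come back as len(edges), so out-of-range data would need to
be handled separately in this sketch.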

Skipper

[1] https://github.com/numpy/numpy/pull/248
[2] http://stat.ethz.ch/R-manual/R-devel/library/base/html/cut.html
_______________________________________________
NumPy-Discussion mailing list
[email protected]
http://mail.scipy.org/mailman/listinfo/numpy-discussion
