Peter Dalgaard wrote:
> melissa cline wrote:
>> Hello,
>>
>> I'm trying to bin a quantity into 2-3 bins for calculating entropy and
>> mutual information. One of the approaches I'm exploring is the cut()
>> function, which is what the mutualInfo function in binDist uses. When it's
>> called in the format cut(data, breaks=n), it somehow splits the data into n
>> distinct bins. Can anyone tell me how cut() decides where to cut?
>>
>>
> This is one case where reading the actual R code is easier than
> explaining what it does. From cut.default
>
>     if (length(breaks) == 1) {
>         if (is.na(breaks) | breaks < 2)
>             stop("invalid number of intervals")
>         nb <- as.integer(breaks + 1)
>         dx <- diff(rx <- range(x, na.rm = TRUE))
>         if (dx == 0)
>             dx <- rx[1]
>         breaks <- seq.int(rx[1] - dx/1000, rx[2] + dx/1000, length.out = nb)
>     }
>
> so basically it takes the range, extends it a bit and splits it into
> <breaks> equally long segments.
>
> (For the sometimes more attractive option of splitting into groups of
> roughly equal size, there is cut2 in the Hmisc package, or use quantile())
>
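To make the quoted logic concrete, here is a minimal sketch that recomputes the breaks by hand and compares the resulting bins with cut(x, breaks = n). The data vector x and the bin count n are made-up illustration values, and the exact code inside cut.default may differ between R versions, so the comparison is on bin membership rather than on the exact break positions.

```r
# Sketch of the break computation quoted above from cut.default.
# `x` and `n` are hypothetical illustration values, not from the original post.
x <- c(0, 1, 2, 5, 10)
n <- 3

rx <- range(x, na.rm = TRUE)
dx <- diff(rx)
if (dx == 0) dx <- rx[1]   # constant data: same fallback as the quoted snippet

# Extend the range by 0.1% on each side, then split into n equal-width bins.
manual_breaks <- seq(rx[1] - dx/1000, rx[2] + dx/1000, length.out = n + 1)

by_count  <- cut(x, breaks = n)              # let cut() choose the breaks
by_breaks <- cut(x, breaks = manual_breaks)  # supply the breaks explicitly

# Bins have equal width, not equal counts.
print(table(by_count))
```

Both calls should put each observation in the same bin. The bins are equally wide, so skewed data can leave most observations in a single bin, which matters when the bins feed an entropy estimate.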
It can be a bit dangerous to use quantile() to provide breaks for cut(),
because quantiles can be non-unique, which cut() doesn't like:

> x1 <- c(1,1,1,1,1,1,1,1,1,2)
> cut(x1, breaks=quantile(x1, (0:2)/2))
Error in cut.default(x1, breaks = quantile(x1, (0:2)/2)) :
  'breaks' are not unique
>

However, cut2() in Hmisc handles this situation gracefully:

> library(Hmisc)

Attaching package: 'Hmisc'

The following object(s) are masked from package:base :

  format.pval,
  round.POSIXt,
  trunc.POSIXt,
  units

> cut2(x1, g=2)
 [1] 1 1 1 1 1 1 1 1 1 2
Levels: 1 2
>

(Additionally, a potentially dangerous peculiarity of quantile() for this
kind of purpose is that its return values can be out of order (i.e.,
diff(quantile(...))<0, at rounding error level), but this doesn't actually
upset cut() in R because cut() sorts the breaks prior to using them.)

-- Tony Plate
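For completeness, a minimal sketch of one way to keep using quantile() with cut() when the quantiles tie: drop the duplicate breaks with unique() first. This is only a suggestion that assumes getting fewer bins than requested is acceptable. It is not what cut2() does, and the variable names are illustrative.

```r
# Same data as in the example above: nine 1s and a single 2.
x1 <- c(1, 1, 1, 1, 1, 1, 1, 1, 1, 2)

# The 0%, 50% and 100% quantiles are 1, 1, 2, so the breaks are not unique.
q <- quantile(x1, (0:2)/2)

# Assumption: collapsing tied quantiles is acceptable for the analysis.
# Note that if all values were identical, only one break would remain and
# cut() would treat it as a number of intervals, so that case needs its own check.
b <- unique(q)

# include.lowest = TRUE keeps the minimum value inside the first interval.
binned <- cut(x1, breaks = b, include.lowest = TRUE)

print(table(binned))
```

With this data the two requested groups collapse into a single bin, which shows why cut2() or hand-chosen breaks may be the better choice for heavily tied data.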