On 19-May-08 10:00:10, Peter Dalgaard wrote:
> (Ted Harding) wrote:
>> Hi Folks,
>> I'd like to know how hist() decides how many cells to use
>> when it ignores my "suggestion" to use say 'hist(...,breaks=50)'.
>>
>> More specifically, I have the results of 10000 simulations,
>> each returning an 8-vector, therefore 8 variables each with
>> 10000 values. Some of these 8 have somewhat skew distributions.
>> Say one of these 8 variables is X.
>>
>> I ask for H <- hist(X,breaks=50), and get a histogram which
>> usually has a different number of cells than what I intended.
>>
>> For instance, for one of these simulations, the 8 different
>> values of length(H$breaks) are:
>>
>>   70, 44, 38, 68, 50, 40, 46, 45
>>
>> ?hist tells me
>>
>> A)
>>   breaks: one of:
>>     * a vector giving the breakpoints between histogram
>>       cells,
>>     * a single number giving the number of cells for the
>>       histogram,
>>     * a character string naming an algorithm to compute the
>>       number of cells (see Details),
>>     * a function to compute the number of cells.
>>
>>   In the last three cases the number is a suggestion only.
>>
>> B)
>>   The default for 'breaks' is '"Sturges"': see 'nclass.Sturges'.
>>
>> If I look at the code for nclass.Sturges() I see
>>
>>   function (x) ceiling(log2(length(x)) + 1)
>>
>> and, for length(X) = 10000, this gives 15. This is not related
>> to any of the numbers of breaks I actually got, in any way obvious
>> to me.
>>
>> So:
>> Question 1: hist() has apparently ignored my "suggestion" of
>> "breaks=50". Why? What is the criterion for ignoring?
>>
>> Question 2: Presumably, if it ignores the "suggestion", it
>> does something else, of its choice. I would then, perhaps,
>> expect it to fall back to its default, which is (allegedly)
>> Sturges. But the result from nclass.Sturges looks different
>> from what it actually did. So what did it actually do, and
>> how did it decide on this?
>>
> No, it is not ignoring you.
>
> Try
>
> hist(rnorm(10000))
> length(hist(rnorm(10000),breaks=50)$breaks)
>
> and repeat a dozen times or so. Chances are that you'll mostly see
> lengths around 40, but definitely more than the 17 or so that you'll
> see without the breaks=50. Next, try
>
> diff(hist(rnorm(10000),breaks=50)$breaks)
>
> and notice that this is usually 0.2, although if you repeat enough
> times, you might get a couple of cases with 0.1 and a length of
> 75(-ish).
>
> Get it? Otherwise look at help(pretty) since this is what is doing the
> work.
>
> -p

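For what it's worth, the role of pretty() can be checked directly. The sketch below assumes that hist(), when given a single number for 'breaks', passes it on as pretty(range(x), n = breaks, min.n = 1); that is what the hist.default source in recent versions of R appears to do, but it is an assumption here, not something stated in the thread. The seed is arbitrary and only makes the run repeatable.

```r
set.seed(1)                      # arbitrary seed, for repeatability only
x <- rnorm(10000)

# Breakpoints hist() actually uses when 50 cells are "suggested"
h <- hist(x, breaks = 50, plot = FALSE)

# Assumed equivalent: "nice" breakpoints covering the data range
p <- pretty(range(x), n = 50, min.n = 1)

length(h$breaks)                     # number of breakpoints chosen
isTRUE(all.equal(h$breaks, p))       # do the two sets of breakpoints agree?
unique(round(diff(h$breaks), 10))    # the common cell width
```

If the two sets of breakpoints agree, the "suggestion" of 50 was passed through, and the count that comes back is whatever pretty() considers a round cell width for that particular range.
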
Thanks for the pointer to 'pretty', whose role is not mentioned in
"?hist". I shall study this! (I still don't "get it"!)

In your example above I generally get 38-40 breaks (with 50
requested), but once (in about 30 repetitions) I got 72, as you
point out.

I then tried it with 1.1*rnorm(10000), and got 42-51;
then with 1.2*rnorm(10000), and got 46-51;
then with 1.3*rnorm(10000), and got 47-61.

It seems there is a slightly unstable relationship between the
urge to honour the requested "n=50", and the desire to achieve
"nice" numerical values (on the scale of 10) for the values of
the breakpoints.

Thanks.
Ted.

--------------------------------------------------------------------
E-Mail: (Ted Harding) <[EMAIL PROTECTED]>
Fax-to-email: +44 (0)870 094 0861
Date: 19-May-08                                       Time: 12:00:28
------------------------------ XFMail ------------------------------

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
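
The "slightly unstable relationship" described above can be looked at without any random data. The following is a rough sketch, not a description of the internals: the interval widths are made-up values chosen for illustration, and the names 'widths' and 'res' are hypothetical. It asks pretty() for about 50 cells on symmetric intervals of increasing width and records the step it settles on and the number of breakpoints that result.

```r
# Hypothetical interval widths, chosen only to straddle the points
# where pretty() seems to change its preferred step size.
widths <- c(6.8, 7.2, 8, 10, 12, 13.5, 14.5)

res <- t(sapply(widths, function(w) {
  b <- pretty(c(-w / 2, w / 2), n = 50)   # ask for about 50 cells
  c(width    = w,
    step     = round(diff(b)[1], 10),     # cell width actually chosen
    n_breaks = length(b))                 # resulting number of breakpoints
}))

print(res)
```

On the versions of R I am aware of, the step should move between round values such as 0.1, 0.2 and 0.5 as the width grows, with the number of breakpoints dropping sharply each time the step changes. That would account for the counts drifting upward with the scale factor and for the occasional jump to 70-odd when a sample's range happens to be a little narrower than usual.
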