Re: [R] How hist() decides breaks?

Ted Harding Mon, 19 May 2008 04:02:27 -0700

On 19-May-08 10:00:10, Peter Dalgaard wrote:
> (Ted Harding) wrote:
>> Hi Folks,
>> I'd like to know how hist() decides how many cells to use
>> when it ignores my "suggestion" to use say 'hist(...,breaks=50)'.
>>
>> More specifically, I have the results of 10000 simulations,
>> each returning an 8-vector, therefore 8 variables each with
>> 10000 values. Some of these 8 have somewhat skew distributions.
>> Say one of these 8 variables is X.
>>
>> I ask for H <- hist(X,breaks=50), and get a histogram which
>> usually has a different number of cells than what I intended.
>>
>> For instance, for one of these simulations, the 8 different
>> values of length(H$breaks) are:
>>
>>   70, 44, 38, 68, 50, 40, 46, 45
>>
>> ?hist tells me
>>
>> A)
>>   breaks: one of:
>>     *  a vector giving the breakpoints between histogram
>>        cells,
>>     *  a single number giving the number of cells for the
>>        histogram,
>>     *  a character string naming an algorithm to compute the
>>        number of cells (see Details),
>>     *  a function to compute the number of cells.
>>
>>     In the last three cases the number is a suggestion only. 
>>
>> B)
>>   The default for 'breaks' is '"Sturges"': see 'nclass.Sturges'.
>>
>> If I look at the code for nclass.Sturges() I see
>>
>>   function (x) ceiling(log2(length(x)) + 1)
>>
>> and, for length(X) = 10000, this gives 15. This is not related
>> to any of the numbers of breaks I actually got, in any way obvious
>> to me.
>>
>> So:
>> Question 1: hist() has apparently ignored my "suggestion" of
>>   "breaks=50". Why? What is the criterion for ignoring?
>>
>> Question 2: Presumably, if it ignores the "suggestion", it
>>   does something else, of its choice. I would then, perhaps,
>>   expect it to fall back to its default, which is (allegedly)
>>   Sturges. But the result from nclass.Sturges looks different
>>   from what it actually did. So what did it actually do, and
>>   how did it decide on this?
>>   
> No, it is not ignoring you.
> 
> Try
> 
> hist(rnorm(10000))
> length(hist(rnorm(10000),breaks=50)$breaks)
> 
> and repeat a dozen times or so. Chances are that you'll mostly see
> lengths around 40, but definitely more than the 17 or so that you'll
> see without the breaks=50. Next, try
> 
> diff(hist(rnorm(10000),breaks=50)$breaks)
> 
> and notice that this is usually 0.2, although if you repeat enough
> times, you might get a couple of cases with 0.1 and a length of
> 75(-ish).
> 
> Get it? Otherwise look at help(pretty) since this is what is doing the
> work.
> 
>     -p


Thanks for the pointer to 'pretty', whose role is not mentioned
in "?hist". I shall study this! (I still don't "get it"!)

In your example above I generally get 38-40 breaks (with 50
requested), but once (in about 30 repetitions) I got 72, as
you point out.

I then tried it with 1.1*rnorm(10000), and got 42-51;
then with 1.2*rnorm(10000), and got 46-51;
then with 1.3*rnorm(10000), and got 47-61.
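
The scaling experiment above can be repeated in a few lines. This is
only a sketch: break_counts() is a hypothetical helper introduced for
illustration, the seed and the 30 repetitions are arbitrary choices,
and the ranges it reports will vary with the random draws.

```r
# Hypothetical helper: smallest and largest number of breakpoints seen
# over 'reps' fresh samples of scale * rnorm(n), requesting 'breaks' cells.
break_counts <- function(scale, reps = 30, n = 10000, breaks = 50) {
  counts <- replicate(
    reps,
    length(hist(scale * rnorm(n), breaks = breaks, plot = FALSE)$breaks)
  )
  range(counts)
}

set.seed(2008)                 # arbitrary seed, only for repeatability
scales <- c(1, 1.1, 1.2, 1.3)
res <- sapply(scales, break_counts)
dimnames(res) <- list(c("min", "max"), paste0("scale=", scales))
res                            # one column per scale factor
```

Using plot = FALSE keeps hist() from drawing 120 histograms while the
counts are collected.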

It seems there is a slightly unstable relationship between
the urge to honour the requested "n=50", and the desire to
achieve "nice" numerical values (on the scale of 10) for
the values of the breakpoints.
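
To see where the breakpoints come from, the result of hist() can be
compared directly with pretty(). The sketch below assumes that, for a
single numeric 'breaks', hist.default() calls
pretty(range(x), n = breaks, min.n = 1); that reading of the source
should be checked against the installed version of R. The seed is
arbitrary.

```r
set.seed(1)                    # arbitrary seed, only for repeatability
x <- rnorm(10000)

h <- hist(x, breaks = 50, plot = FALSE)

# Assumption: this mirrors the call made inside hist.default()
# when 'breaks' is a single number.
p <- pretty(range(x), n = 50, min.n = 1)

same_breaks <- isTRUE(all.equal(h$breaks, p))
same_breaks                    # do the two sets of breakpoints agree?

length(h$breaks)               # number of breakpoints actually used
widths <- unique(round(diff(h$breaks), 10))
widths                         # the single "nice" cell width chosen
```

If the comparison holds, the requested 50 only sets a target for
pretty(), which then picks a round cell width (1, 2 or 5 times a power
of 10) and covers range(x) with it. That is why the number of cells
jumps when the range of the data crosses a threshold.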

Thanks.
Ted.

--------------------------------------------------------------------
E-Mail: (Ted Harding) <[EMAIL PROTECTED]>
Fax-to-email: +44 (0)870 094 0861
Date: 19-May-08                                       Time: 12:00:28
------------------------------ XFMail ------------------------------

