Re: [R] Approximating discrete distribution by continuous distribution

peter dalgaard Tue, 22 Jan 2013 06:22:20 -0800

On Jan 22, 2013, at 13:45 , Prof Brian Ripley wrote:

> On 22/01/2013 11:49, Michael Haenlein wrote:
>> Dear all,
>> 
>> I have a discrete distribution showing how age is distributed across a
>> population using a certain set of bands:
>> 
>> Age <- matrix(c(74045062, 71978405, 122718362, 40489415), ncol=1,
>> dimnames=list(c("<18", "18-34", "35-64", "65+"),c()))
>> Age_dist <- Age/sum(Age)
>> 
>> For example I know that 23.94% of all people are between 0-18 years, 23.28%
>> between 18-34 years and so forth.
>> 
>> I would like to find a continuous approximation of this discrete
>> distribution in order to estimate the probability that a person is for
>> example 16 years old.
>> 
>> Is there some automatic way in R through which this can be done? I tried a
>> Kernel density estimation of the histogram but this does not seem to
>> provide what I'm looking for.
> 
> This is not really an R question, but a statistics one.  It is almost 
> guesswork: if for example these were drivers in the UK, the answer is 0.  So 
> you need to supply some information about the shape of the distribution of 
> <18 year olds.
> 
> You have estimates of the cumulative distribution function at c(0, 18, 35, 
> 65, Inf) (or some better upper limit).  You want to interpolate it.  You 
> could use linear interpolation (approx[fun]) or a monotone spline 
> interpolation (spline[fun]) or any other interpolation method which meets 
> your needs.  But whatever you use, you will be supplying a lot of information 
> not actually in your data.



Agreed. The linear interpolation method is sometimes described as the "sum 
polygon", and sort of assumes that there is a uniform distribution of ages in 
each range. I.e., the number of 16 year olds would be 1/18 of the 0-17 y.o. 
However, I'd feel somewhat uneasy about doing this with such wide age-bands.
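
For what it is worth, the interpolation idea can be sketched roughly as follows. This is only an illustration: the upper age limit of 100 is an arbitrary assumption standing in for Inf, the names F_lin, F_mono, p16_lin and p16_mono are made up for the example, and "16 years old" is taken to mean an age in [16, 17).

```r
# Band counts from the original post
Age <- c("<18" = 74045062, "18-34" = 71978405,
         "35-64" = 122718362, "65+" = 40489415)

# Assumption: 100 is an arbitrary upper age limit used in place of Inf
breaks <- c(0, 18, 35, 65, 100)

# Cumulative distribution function evaluated at the band limits
cdf <- c(0, cumsum(Age) / sum(Age))

# Linear interpolation of the CDF (the "sum polygon")
F_lin <- approxfun(breaks, cdf)

# Monotone spline interpolation; method = "hyman" keeps the curve non-decreasing
F_mono <- splinefun(breaks, cdf, method = "hyman")

# Estimated probability of an age in [16, 17) under each interpolation
p16_lin  <- F_lin(17)  - F_lin(16)
p16_mono <- F_mono(17) - F_mono(16)
```

Under the linear version, p16_lin is just the "<18" share divided by 18. The two answers differ only because of the shape each method imposes within the band, not because of anything in the data.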

There is also the option of fitting a standard distribution like the Weibull to 
the data and using that. The mle() function should do this if you write out the 
log-likelihood using something like 

dmultinom(Age, prob=diff(pweibull(c(0,18,35,65,Inf), shape, scale)), log=TRUE)

With a quarter of a billion observations, the fit might be less than perfect, 
but on the other hand, extracting more than two parameters from four data 
points sounds a bit ominous.
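
A rough sketch of such a fit, taking the band limits to be c(0, 18, 35, 65, Inf) as in Brian's reply. The log-scale parametrization (lshape, lscale), the starting values, and the guard that returns Inf for invalid probabilities are assumptions of mine, added to keep the optimizer out of trouble; they are not the only way to set this up.

```r
library(stats4)

# Band counts from the original post and the assumed band limits
Age    <- c(74045062, 71978405, 122718362, 40489415)
breaks <- c(0, 18, 35, 65, Inf)

# Negative multinomial log-likelihood for a Weibull grouped into the age bands.
# Parameters are on the log scale (an assumption) so shape and scale stay positive.
nll <- function(lshape, lscale) {
  p <- diff(pweibull(breaks, shape = exp(lshape), scale = exp(lscale)))
  # Guard: let the optimizer back off from parameter values giving unusable probabilities
  if (any(!is.finite(p)) || any(p <= 0)) return(Inf)
  -dmultinom(Age, prob = p, log = TRUE)
}

# Starting values are a guess: shape 1 (exponential), scale 40 years
fit <- mle(nll, start = list(lshape = 0, lscale = log(40)))

# Back-transform to the usual Weibull parameters
est <- exp(coef(fit))
shape_hat <- unname(est["lshape"])
scale_hat <- unname(est["lscale"])

# Fitted probability of an age in [16, 17)
p16_weibull <- diff(pweibull(c(16, 17), shape_hat, scale_hat))
```

Comparing diff(pweibull(breaks, shape_hat, scale_hat)) with Age/sum(Age) shows how well two parameters can reproduce the four band proportions.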

-- 
Peter Dalgaard, Professor
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Email: pd....@cbs.dk  Priv: pda...@gmail.com

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.