On Jan 22, 2013, at 13:45 , Prof Brian Ripley wrote:

> On 22/01/2013 11:49, Michael Haenlein wrote:
>> Dear all,
>>
>> I have a discrete distribution showing how age is distributed across a
>> population using a certain set of bands:
>>
>> Age <- matrix(c(74045062, 71978405, 122718362, 40489415), ncol=1,
>>               dimnames=list(c("<18", "18-34", "35-64", "65+"),c()))
>> Age_dist <- Age/sum(Age)
>>
>> For example I know that 23.94% of all people are between 0-18 years, 23.28%
>> between 18-34 years and so forth.
>>
>> I would like to find a continuous approximation of this discrete
>> distribution in order to estimate the probability that a person is for
>> example 16 years old.
>>
>> Is there some automatic way in R through which this can be done? I tried a
>> Kernel density estimation of the histogram but this does not seem to
>> provide what I'm looking for.
>
> This is not really an R question, but a statistics one. It is almost
> guesswork: if for example these were drivers in the UK, the answer is 0. So
> you need to supply some information about the shape of the distribution of
> <18 year olds.
>
> You have estimates of the cumulative distribution function at c(0, 18, 35,
> 65, Inf) (or some better upper limit). You want to interpolate it. You
> could use linear interpolation (approx[fun]) or a monotone spline
> interpolation (spline[fun]) or any other interpolation method which meets
> your needs. But whatever you use, you will be supplying a lot of information
> not actually in your data.
Agreed. The linear interpolation method is sometimes described as the "sum polygon", and it effectively assumes that ages are uniformly distributed within each band. I.e., the number of 16-year-olds would be 1/18 of the 0-17 y.o. However, I'd feel somewhat uneasy about doing this with such wide age bands.

There is also the option of fitting a standard distribution like the Weibull to the data and using that. The mle() function should do this if you write out the log-likelihood (negated, since mle() minimizes the negative log-likelihood) using something like

    dmultinom(Age, prob=diff(pweibull(c(0,18,35,65,Inf), shape, scale)), log=TRUE)

With a quarter of a billion observations, the fit might be less than perfect, but on the other hand, extracting more than two parameters from four data points sounds a bit ominous.

-- 
Peter Dalgaard, Professor
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Email: pd....@cbs.dk  Priv: pda...@gmail.com

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
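
For what it's worth, here is a minimal sketch of the two approaches discussed in this thread: linear interpolation of the CDF (the "sum polygon") and a Weibull fit via mle(). Several things in it are assumptions of the sketch rather than anything in the data: the upper age limit of 100 standing in for Inf, the log-scale parametrisation (names lshape and lscale are made up here), the starting values, the guard against degenerate probabilities, and the choice of Nelder-Mead as optimiser. Age is stored as a plain named vector rather than a one-column matrix.

```r
library(stats4)  # provides mle(); part of the standard R distribution

# Counts per age band, copied from the original post
Age <- c("<18" = 74045062, "18-34" = 71978405,
         "35-64" = 122718362, "65+" = 40489415)
breaks <- c(0, 18, 35, 65, Inf)

## 1. Linear interpolation of the CDF ("sum polygon")
# Assumption: nobody is older than 100, so the last band becomes [65, 100).
upper <- 100
cdf_knots <- c(0, cumsum(Age) / sum(Age))
cdf_lin <- approxfun(c(breaks[1:4], upper), cdf_knots)
# P(16 <= age < 17); equals 1/18 of the "<18" share by construction
p16_lin <- cdf_lin(17) - cdf_lin(16)

## 2. Weibull fit by maximum likelihood on the banded counts
# Parameters are optimised on the log scale so shape and scale stay positive.
nll <- function(lshape = 0, lscale = log(40)) {
  p <- diff(pweibull(breaks, shape = exp(lshape), scale = exp(lscale)))
  # Guard (an assumption of this sketch): penalise degenerate band
  # probabilities instead of letting dmultinom() stop with an error.
  if (any(!is.finite(p)) || any(p <= 0)) return(1e300)
  -dmultinom(Age, prob = p, log = TRUE)
}
fit <- mle(nll, start = list(lshape = 0, lscale = log(40)),
           method = "Nelder-Mead")
est <- exp(coef(fit))  # back-transform to shape and scale
p16_wei <- diff(pweibull(c(16, 17), shape = est[["lshape"]],
                         scale = est[["lscale"]]))

# Compare observed band shares with the fitted Weibull band probabilities
fitted_p <- diff(pweibull(breaks, shape = est[["lshape"]],
                          scale = est[["lscale"]]))
print(rbind(observed = Age / sum(Age), weibull = fitted_p))
```

The interpolated value p16_lin comes out at roughly 1.3%, i.e. 23.94%/18, which is exactly the uniform-within-band assumption at work. Comparing the last printed table row by row gives a quick impression of how well (or badly) a two-parameter Weibull can reproduce the four bands; neither number should be read as more than a guess about what happens inside a band.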