On Thu, May 03, 2012 at 03:08:00PM +0200, Kehl Dániel wrote:
> Dear List-members,
>
> I have a problem where I have to estimate a mean, or a sum of a
> population but for some reason it contains a huge amount of zeros.
> I cannot give real data but I constructed a toy example as follows
>
> N1 <- 100000
> N2 <- 3000
> x1 <- rep(0,N1)
> x2 <- rnorm(N2,300,100)
> x <- c(x1,x2)
>
> n <- 1000
>
> x_sample <- sample(x,n,replace=FALSE)
>
> I want to estimate the sum of x based on x_sample (not knowing N1 and N2
> but their sum (N) only).
> The sample mean has a huge standard deviation I am looking for a better
> estimator.

Hi.

I do not know the exact answer, but let me offer the following
observation. If the question is redefined as estimating the mean of the
nonzero numbers, then an estimate is

  mean(x_sample[x_sample != 0])

Its standard deviation in your situation may be estimated by simulation as

  # repeat the sampling and record the mean of the nonzero values
  res <- rep(NA, times=1000)
  for (i in seq_along(res)) {
      x_sample <- sample(x,n,replace=FALSE)
      res[i] <- mean(x_sample[x_sample != 0])
  }
  sd(res)

  [1] 18.72677 # this varies with the seed a bit

The observation is that this cannot be improved much, since the
estimate is based on a very small sample. The average size of the
sample of nonzero values is N2/(N1+N2)*n = 29.1. So, the standard
deviation should be something close to 100/sqrt(29.1) = 18.5376.

Petr Savicky.

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
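
As a rough check of the reasoning in the reply, the simulation can be
extended to the original quantity of interest, the sum of x. The sketch
below is only an illustration under assumptions: the seed, the number of
replications (reps), and the use of the expansion estimator
N * mean(x_sample) for the total are choices made here, not something
stated in the thread.

```r
set.seed(1)  # arbitrary seed (assumption), only for reproducibility

# Toy population from the original post
N1 <- 100000
N2 <- 3000
x <- c(rep(0, N1), rnorm(N2, 300, 100))
N <- length(x)   # N = N1 + N2, the only size assumed known
n <- 1000

# Expected number of nonzero values in a sample of size n
m_expected <- N2 / N * n

# Approximate sd of the nonzero mean, following the reply: sigma / sqrt(m)
sd_approx <- 100 / sqrt(m_expected)

# Simulate both the nonzero mean and the estimated total
reps <- 1000     # number of replications (assumption)
nonzero_mean <- numeric(reps)
total_hat <- numeric(reps)
for (i in seq_len(reps)) {
  s <- sample(x, n, replace = FALSE)
  nonzero_mean[i] <- mean(s[s != 0])   # estimator discussed in the reply
  total_hat[i] <- N * mean(s)          # expansion estimator of sum(x)
}

sd_nonzero <- sd(nonzero_mean)         # compare with sd_approx
rel_sd_total <- sd(total_hat) / sum(x) # sd of the total, relative to the truth

print(c(m_expected = m_expected, sd_approx = sd_approx,
        sd_nonzero = sd_nonzero, rel_sd_total = rel_sd_total))
```

Under these assumptions, sd_nonzero should land near sd_approx (about
18.5), while rel_sd_total shows how much uncertainty remains in the total
once the number of nonzero values in the sample is itself random.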