On Tue, Jul 19, 2011 at 12:30 AM, Joshua Wiley <jwiley.ps...@gmail.com> wrote:
>>>> [snip] I guess that I must have a data frame to plot a histogram.
>>>
>>> Not at all!
>>>
>>> ## a *vector* of 100 million observations
>>> x <- rnorm(10^8)
>>> ## a histogram for it (see attached for the result from my system)
>>> hist(x)
>>>
>>> No data frame required. I would not try this straight in anything but
>>> traditional graphics for a 100 million observation vector, but if you
>>> wanted it made in ggplot2 or something, you could prebin the data and
>>> THEN plot bars corresponding to the bins.
>>
>> Thanks, Joshua, for your answer.
>>
>> True: A vector is enough to supply data for hist(). But my point is:
>> Can a histogram be drawn without having all the data in the computer's
>> memory? You partially answer this question by suggesting to prebin
>> the data. Can this prebinning process be done transparently, but chunk
>> by chunk of data underneath?
>
> Sure, as long as you can figure out some basic details about the full
> dataset. Just define your breaks, and then for chunks of the data at
> a time, count how many fall into any particular bin. Once you are
> done, add up all the counts for each bin, and voila.
>
> ## Get these values from the full data (using SQL)
> x <- rnorm(1000)
> n <- length(x)
> minx <- min(x)
> maxx <- max(x)
>
> ## Sturges style breaks
> breaks <- pretty(c(minx, maxx), n = ceiling(log2(n) + 1))
> nB <- length(breaks)
>
> ## Nudge the breaks slightly so boundary values land in the same bins
> fuzz <- rep(1e-07 * median(diff(breaks)), nB)
> fuzz[1] <- fuzz[1] * -1
> fuzzybreaks <- breaks + fuzz
>
> chunks <- 10
>
> ## One row of bin counts per chunk
> counts <- matrix(NA, nrow = chunks, ncol = nB - 1,
>   dimnames = list(paste("Sec", 1:chunks, sep = ''),
>   as.character(fuzzybreaks[-1])))
>
> for(i in 1:chunks) {
>   index <- seq(1, n/chunks) + (n/chunks * (i - 1))
>   ## plot = FALSE: only the counts are needed, not a plot per chunk
>   counts[i, ] <- hist(x[index], breaks = fuzzybreaks, plot = FALSE)$counts
> }
>
> ## The heights of your bars
> colSums(counts)
> ## results using hist() on x all at once
> hist(x)$counts
>
> You would not even need to know the number of chunks you were going to
> split your data into beforehand; I just did it for convenience and to
> instantiate a full-sized matrix to hold the results. If you are
> selecting subsets of your data using SQL rather than R, it becomes
> even simpler. Once you have your fuzzybreaks, you just keep calling
> hist on your new data using the predefined breaks and saving the
> results. Still, I do not exceed about 4.5 GB of memory used just to
> plot a histogram on a 100 million observation vector, and it is
> difficult to imagine the shape of the distribution changing
> appreciably using a random sample of 100 million observations. It
> also takes less than 10 seconds to calculate and draw the histogram on
> my computer. The point being, I suspect you will spend more time
> getting everything set up and working than seems worth it, because you
> can already create a histogram on such large vectors easily and
> quickly, and the distribution is unlikely to vary anyway. Whatever
> floats your boat, though.
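For reference, the remark about not needing the number of chunks in advance can be sketched as below. This is only a minimal illustration under assumptions, not Joshua's code: the data are simulated with rnorm(), the seven unequal chunks made with cut() and split() stand in for whatever the database would return, and the names chunk_id and full are made up for the example. It uses plain breaks rather than the fuzzed ones, so that the chunked counts can be checked directly against a single hist() call with the same breaks.

```r
## Simulated stand-in for the full data (assumption: in practice n, min and
## max would come from the database, and the chunks from successive queries)
set.seed(1)
x <- rnorm(1000)
n <- length(x)

## Sturges style breaks, computed once from the summary values
breaks <- pretty(range(x), n = ceiling(log2(n) + 1))

## Running totals, one per bin; no matrix and no chunk count needed up front
counts <- numeric(length(breaks) - 1)

## Hypothetical chunking: 7 pieces of unequal size
chunk_id <- cut(seq_len(n), 7, labels = FALSE)
for (chunk in split(x, chunk_id)) {
  ## plot = FALSE returns the bin counts without drawing anything
  counts <- counts + hist(chunk, breaks = breaks, plot = FALSE)$counts
}

## Same breaks applied to everything at once, for comparison
full <- hist(x, breaks = breaks, plot = FALSE)
stopifnot(all(counts == full$counts))

## Draw the bars from the accumulated counts
barplot(counts, names.arg = breaks[-1], space = 0)
```

Only the vector of running totals has to stay in memory, so each chunk can be discarded as soon as it has been counted.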

Thanks again, Joshua. Your approach is quite interesting.

Paul

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.