I often need to work with large vectors whose distribution I want to summarize by Q-Q plots. Since the vectors are large, I use a subset of quantiles, e.g.
    quantile(x, probs = ppoints(1000))

Unfortunately, this seemed to be taking too long for large 'x' (much
longer than 'sort' itself). I initially thought quantile was doing
something sophisticated (which I don't really need with a large data
set), so I decided to write something simple myself. I did, but then I
noticed that 'quantile' was doing essentially the same thing, except
that it was calling 'sort' with a non-NULL 'partial' argument. ?sort
says:

     If 'partial' is not 'NULL', it is taken to contain indices of
     elements of 'x' which are to be placed in their correct positions
     by partial sorting.  After the sort, the values specified in
     'partial' are in their correct position in the sorted array.  Any
     values smaller than these values are guaranteed to have a smaller
     index in the sorted array and any values which are greater are
     guaranteed to have a bigger index in the sorted array.  This is
     included for efficiency, and many of the options are not
     available for partial sorting.

However, rather than being efficient, this seems to slow things down
considerably (I haven't checked memory efficiency). Consider the
following code, which imitates what 'quantile' does by default:

    probs <- ppoints(1000)
    for (i in seq(5000, 50000, 5000)) {
        x <- rnorm(i)
        n <- length(x)
        ## positions of the requested quantiles, as in the default
        ## (type 7) computation in 'quantile'
        index <- 1 + (n - 1) * probs
        lo <- floor(index)
        hi <- ceiling(index)
        keep <- as.integer(unique(c(lo, hi)))
        cat(system.time(y1 <- sort(x, partial = keep))[1])  # partial sort
        cat("\t")
        cat(system.time(y2 <- sort(x))[1])                  # full sort
        cat("\t")
        cat(round(max(abs(y1 - y2)), digits = 3))
        cat("\t")
        cat(max(abs(y1[keep] - y2[keep])))
        cat("\n")
    }

The first two columns of the output are timings for 'sort' with and
without 'partial'; the last two are a rough check that partial sorting
is doing what it claims. (The third column is generally non-zero,
since a partial sort leaves the remaining elements in arbitrary order;
the fourth column, the difference at the 'keep' positions, must be
exactly zero.) With R 2.2.0 (patched):

    0.78    0       0.031   0
    1.64    0.01    0.565   0
    2.59    0.01    0.646   0
    3.44    0.01    0.487   0
    4.4     0.01    0.569   0
    5.26    0.01    0.642   0
    6.29    0.01    1.071   0
    7.18    0.02    0.566   0
    8.18    0.02    1.094   0
    9.01    0.03    0.89    0

    > version
             _
    platform i686-pc-linux-gnu
    arch     i686
    os       linux-gnu
    system   i686, linux-gnu
    status   Patched
    major    2
    minor    2.0
    year     2005
    month    10
    day      16
    svn rev  35911
    language R

The same holds on (a faster machine running) r-devel:

    0.39    0       0.176   0
    0.85    0.01    0.62    0
    1.32    0       0.193   0
    1.8     0       0.949   0
    2.29    0       0.73    0
    2.77    0       1.185   0
    3.25    0.01    0.813   0
    3.75    0.01    1.171   0
    4.21    0.01    0.827   0
    4.74    0.01    0.372   0

    > version
             _
    platform x86_64-unknown-linux-gnu
    arch     x86_64
    os       linux-gnu
    system   x86_64, linux-gnu
    status   Under development (unstable)
    major    2
    minor    3.0
    year     2005
    month    11
    day      25
    svn rev  36468
    language R

Speed does improve when the number of 'partial' indices is small, but
even with only 10 indices, plain 'sort' is faster. Am I missing
something? I haven't checked whether NA's make a difference, or
whether there are memory-usage issues, but even so, could 'quantile'
at least have an option to disable partial sorting?

Deepayan
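PS: For concreteness, this is roughly the kind of simple replacement I
had in mind: a minimal sketch of the default (type 7) quantile
computation using a plain full sort. The name 'quantile7' is just for
illustration, and it skips the NA handling and argument checking that
the real 'quantile' does:

    quantile7 <- function(x, probs) {
        x <- sort(x)                  # plain full sort, no 'partial'
        n <- length(x)
        index <- 1 + (n - 1) * probs  # same positions as in the loop above
        lo <- floor(index)
        hi <- ceiling(index)
        ## interpolate linearly between neighbouring order statistics
        x[lo] + (index - lo) * (x[hi] - x[lo])
    }

    ## e.g., for the Q-Q use case above:
    ## plot(qnorm(ppoints(1000)), quantile7(x, ppoints(1000)))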
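PPS: To make the guarantee quoted from ?sort concrete, here is a toy
example (the seed and the choice of position 5 are arbitrary). Only
the positions named in 'partial' are guaranteed to hold their correct
values, which is why only the comparison at the 'keep' positions in
the benchmark is expected to be exactly zero:

    set.seed(1)
    x <- rnorm(10)
    y <- sort(x, partial = 5)
    y[5] == sort(x)[5]      # TRUE: position 5 holds its correct value
    all(y[1:4] <= y[5])     # TRUE: earlier elements are no larger
    all(y[6:10] >= y[5])    # TRUE: later elements are no smaller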