On Thu, Apr 14, 2011 at 8:19 PM, Ravi Varadhan <rvarad...@jhmi.edu> wrote: > Bill's code is insanely fast! > > f2 <- function(x, y) length(y) - findInterval(-x, rev(-sort(y))) > > n1 <- 1e07 > n2 <- 10^c(1,2,3,4,5,6,7) > tt <- rep(NA, 7) > x <- rnorm(n1) > for (i in 1:length(n2)){ > y <- runif(n2[i]) > tt[i] <- system.time(a1 <- f2(x, y))[3] > } > >> tt > [1] 0.70 0.86 1.03 1.28 1.54 4.99 12.07 >> > > I would be surprised if this can be beaten even in C/C++.
Have you looked at the definition of the findInterval function, especially the part where it calls the C function? I still think that it can be done faster if the x vector is sorted in addition to the y vector, although admittedly the speed gain will not be as much. And, of course, Bill's solution depends on knowing R or S-PLUS well enough to remember that there is a findInterval function. > From: r-help-boun...@r-project.org [r-help-boun...@r-project.org] On Behalf > Of Douglas Bates [ba...@stat.wisc.edu] > Sent: Thursday, April 14, 2011 6:22 PM > To: William Dunlap > Cc: r-help@r-project.org; Sunduz Keles; rcpp-devel; Kevin Ummel > Subject: Re: [R] Find number of elements less than some number: > Elegant/fastsolution needed > > My colleague Sunduz Keles once mentioned a similar problem to me. She > had a large sample from a reference distribution and a test sample > (both real-valued in her case) and she wanted, for each element of the > test sample, the proportion of the reference sample that was less than > the element. It's a type of empirical p-value calculation. > > I forget the exact sizes of the samples (do you remember, Sunduz?) but > they were in the tens of thousands or larger. Solutions in R tended > to involve comparing every element in the test sample to every element > in the reference sample but, of course, that is unnecessary. If you > sort both samples then you can start the comparisons for a particular > element of the test sample at the element of the reference sample > where the last comparison failed. > > I was able to write a very short C++ function using the Rcpp package > that provided about a 1000-fold increase in speed relative to the best > I could do in R. I don't have the script on this computer so I will > post it tomorrow when I am back on the computer at the office. > > Apologies for cross-posting to the Rcpp-devel list but I am doing so > because this might make a good example of the usefulness of Rcpp and > inline. > > On Thu, Apr 14, 2011 at 4:45 PM, William Dunlap <wdun...@tibco.com> wrote: >>> -----Original Message----- >>> From: r-help-boun...@r-project.org >>> [mailto:r-help-boun...@r-project.org] On Behalf Of Kevin Ummel >>> Sent: Thursday, April 14, 2011 12:35 PM >>> To: r-help@r-project.org >>> Subject: [R] Find number of elements less than some number: >>> Elegant/fastsolution needed >>> >>> Take vector x and a subset y: >>> >>> x=1:10 >>> >>> y=c(4,5,7,9) >>> >>> For each value in 'x', I want to know how many elements in >>> 'y' are less than 'x'. >>> >>> An example would be: >>> >>> sapply(x,FUN=function(i) {length(which(y<i))}) >>> [1] 0 0 0 0 1 2 2 3 3 4 >>> >>> But this solution is far too slow when x and y have lengths >>> in the millions. >>> >>> I'm certain an elegant (and computationally efficient) >>> solution exists, but I'm in the weeds at this point. >> >> If x is sorted findInterval(x, y) would do it for <= but you >> want strict <. Try >> f2 <- function(x, y) length(y) - findInterval(-x, rev(-sort(y))) >> where your version is >> f0 <- function(x, y) sapply(x,FUN=function(i) {length(which(y<i))}) >> E.g., >> > x <- sort(sample(1e6, size=1e5)) >> > y <- sample(1e6, size=1e4, replace=TRUE) >> > system.time(r0 <- f0(x,y)) >> user system elapsed >> 7.900 0.310 8.211 >> > system.time(r2 <- f2(x,y)) >> user system elapsed >> 0.000 0.000 0.004 >> > identical(r0, r2) >> [1] TRUE >> >> Bill Dunlap >> Spotfire, TIBCO Software >> wdunlap tibco.com >> >>> >>> Any help is much appreciated. >>> >>> Kevin >>> >>> University of Manchester >>> >>> >>> >>> >>> >>> >>> >>> >>> Take two vectors x and y, where y is a subset of x: >>> >>> x=1:10 >>> >>> y=c(2,5,6,9) >>> >>> If y is removed from x, the original x values now have a new >>> placement (index) in the resulting vector (new): >>> >>> new=x[-y] >>> >>> index=1:length(new) >>> >>> The challenge is: How can I *quickly* and *efficiently* >>> deduce the new 'index' value directly from the original 'x' >>> value -- using only 'y' as an input? >>> >>> In practice, I have very large matrices containing the 'x' >>> values, and I need to convert them to the corresponding >>> 'index' if the 'y' values are removed. >>> >>> >>> >>> >>> [[alternative HTML version deleted]] >>> >>> ______________________________________________ >>> R-help@r-project.org mailing list >>> https://stat.ethz.ch/mailman/listinfo/r-help >>> PLEASE do read the posting guide >>> http://www.R-project.org/posting-guide.html >>> and provide commented, minimal, self-contained, reproducible code. >>> >> >> ______________________________________________ >> R-help@r-project.org mailing list >> https://stat.ethz.ch/mailman/listinfo/r-help >> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html >> and provide commented, minimal, self-contained, reproducible code. >> > > ______________________________________________ > R-help@r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. > ______________________________________________ > R-help@r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. > ______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.