Hi, I've got two suggestions for how to speed up median() by about 50%. For all iterative methods that call median() in a loop, this has a major impact. The second suggestion applies to other methods too.
This is what the function looks like today:

> median
function (x, na.rm = FALSE)
{
    if (is.factor(x) || mode(x) != "numeric")
        stop("need numeric data")
    if (na.rm)
        x <- x[!is.na(x)]
    else if (any(is.na(x)))
        return(NA)
    n <- length(x)
    if (n == 0)
        return(NA)
    half <- (n + 1)/2
    if (n%%2 == 1) {
        sort(x, partial = half)[half]
    }
    else {
        sum(sort(x, partial = c(half, half + 1))[c(half, half + 1)])/2
    }
}
<environment: namespace:stats>

Suggestion 1: Replace the sort() calls with .Internal(psort(x, partial)). This avoids unnecessary overhead, in particular an expensive second check for NAs via any(is.na(x)). Simple benchmarking with

  x <- rnorm(10e6)
  system.time(median(x))/system.time(median2(x))

where median2() is the function with the above replacements, gives about a 20-25% speed-up.

Suggestion 2: Create a has.na(x) function to replace any(is.na(x)) that returns TRUE as soon as an NA value is detected. In the best case it returns TRUE after the first index; in the worst case it returns FALSE after the last index N. The cost of is.na(x) is always O(N), whereas any() is O(1) in the best case and O(N) in the worst case (if any() is implemented as I hope). A has.na() function would be very useful elsewhere too.

A poor man's alternative to (2) is to allow a third value for 'na.rm', say NA, which indicates that we know there are no NAs in 'x'.
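To make the intended semantics of the proposed has.na() concrete, here is a rough pure-R sketch (the name has.na and its behavior are my proposal, not an existing function; a real implementation would of course live in C, since an R-level loop over a large vector is itself slow):

  has.na <- function(x) {
      # Scan left to right and bail out at the first NA;
      # best case O(1), worst case O(N) with no NAs found.
      for (kk in seq_along(x)) {
          if (is.na(x[kk]))
              return(TRUE)
      }
      FALSE
  }

  has.na(c(1, NA, 3))   # TRUE, after inspecting only two elements
  has.na(1:10)          # FALSE, after a full scan

The point is the early exit: any(is.na(x)) must first materialize the full logical vector is.na(x), which always costs O(N) time and memory, even if x[1] is already NA.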
The original median() is approx. 50% slower (naive benchmarking) than a version with the above two improvements when passing a large 'x' with no NAs:

median2 <- function (x, na.rm = FALSE)
{
    if (is.factor(x) || mode(x) != "numeric")
        stop("need numeric data")
    if (is.na(na.rm)) {
        # na.rm == NA: caller asserts there are no NAs; skip the check.
    }
    else if (na.rm)
        x <- x[!is.na(x)]
    else if (any(is.na(x)))
        return(NA)
    n <- length(x)
    if (n == 0)
        return(NA)
    half <- (n + 1)/2
    if (n%%2 == 1) {
        .Internal(psort(x, half))[half]
    }
    else {
        sum(.Internal(psort(x, c(half, half + 1)))[c(half, half + 1)])/2
    }
}

x <- rnorm(10e5)
K <- 10
t0 <- system.time({
    for (kk in 1:K)
        y <- median(x)
})
print(t0)
# [1] 1.82 0.14 1.98   NA   NA
t1 <- system.time({
    for (kk in 1:K)
        y <- median2(x, na.rm=NA)
})
print(t1)
# [1] 1.25 0.06 1.34   NA   NA
print(t0/t1)
# [1] 1.456000 2.333333 1.477612       NA       NA

BTW, without having checked the source code, it looks like is.na() is unnecessarily slow; is.na(sum(x)) is much faster than any(is.na(x)) on a vector without NAs. On the other hand, is.na(sum(x)) becomes awfully slow if 'x' contains NAs.

/Henrik

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel