On 4/10/2006 7:22 PM, Thomas Lumley wrote: > On Mon, 10 Apr 2006, Henrik Bengtsson wrote: > >> Hi, >> >> I've got two suggestions how to speed up median() about 50%. For all >> iterative methods calling median() in the loops this has a major >> impact. The second suggestion will apply to other methods too. > > I'm surprised this has a major impact -- in your example it takes much > longer to generate the ten million numbers than to find the median. > >> Suggestion 1: >> Replace the sort() calls with the .Internal(psort(x, partial)). This >> will avoid unnecessary overhead, especially an expensive second check >> for NAs using any(is.na(x)). Simple benchmarking with >> >> x <- rnorm(10e6) >> system.time(median(x))/system.time(median2(x)) >> >> where median2() is the function with the above replacements, gives >> about 20-25% speed up. > > There's something that seems a bit undesirable about having median() call > the .Internal function for sort(). > >> Suggestion 2: >> Create a has.na(x) function to replace any(is.na(x)) that returns TRUE >> as soon as a NA value is detected. In the best case it returns after >> the first index with TRUE, in the worst case it returns after the last >> index N with FALSE. The cost for is.na(x) is always O(N), and any() >> in the best case O(1) and in the worst case O(N) (if any() is >> implemented as I hope). An has.na() function would be very useful >> elsewhere too. > > This sounds useful (though it has missed the deadline for 2.3.0). > > It won't help if the typical case is no missing values, as you suggest, > but it will be faster when there are missing values.
I think it would help even in that case if the vector is large, because it avoids allocating and disposing of the logical vector of the same length as x. >> BTW, without having checked the source code, it looks like is.na() is >> unnecessarily slow; is.na(sum(x)) is much faster than any(is.na(x)) on >> a vector without NAs. On the other hand, is.na(sum(x)) becomes >> awfully slow if 'x' contains NAs. >> > > I don't think it is unnecessarily slow. It has to dispatch methods and > it has to make sure that matrix structure is preserved. After that the > code is just > > case REALSXP: > for (i = 0; i < n; i++) > LOGICAL(ans)[i] = ISNAN(REAL(x)[i]); > break; > > and it's hard to see how that can be improved. It does suggest that a > faster anyNA() function would have to not be generic. If it's necessary to make it not generic to achieve the speedup, I don't think it's worth doing. If anyNA is written not to be generic I'd guess a very common error will be to apply it to a dataframe and get a misleading "FALSE" answer. If we do that, I predict that the total amount of r-help time wasted on it will exceed the CPU time saved by orders of magnitude. Duncan Murdoch ______________________________________________ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel