On 4/10/2006 8:08 PM, Thomas Lumley wrote: > On Mon, 10 Apr 2006, Duncan Murdoch wrote: > >> On 4/10/2006 7:22 PM, Thomas Lumley wrote: >>> On Mon, 10 Apr 2006, Henrik Bengtsson wrote: >>> >>>> Suggestion 2: >>>> Create a has.na(x) function to replace any(is.na(x)) that returns TRUE >>>> as soon as a NA value is detected. In the best case it returns after >>>> the first index with TRUE, in the worst case it returns after the last >>>> index N with FALSE. The cost for is.na(x) is always O(N), and any() >>>> in the best case O(1) and in the worst case O(N) (if any() is >>>> implemented as I hope). An has.na() function would be very useful >>>> elsewhere too. >>> This sounds useful (though it has missed the deadline for 2.3.0). >>> >>> It won't help if the typical case is no missing values, as you suggest, but >>> it will be faster when there are missing values. >> I think it would help even in that case if the vector is large, because it >> avoids allocating and disposing of the logical vector of the same length as >> x. > > That makes sense. I have just tried, and for vectors of length ten > million it does make a measurable difference. > > >>>> BTW, without having checked the source code, it looks like is.na() is >>>> unnecessarily slow; is.na(sum(x)) is much faster than any(is.na(x)) on >>>> a vector without NAs. On the other hand, is.na(sum(x)) becomes >>>> awfully slow if 'x' contains NAs. >>>> >>> I don't think it is unnecessarily slow. It has to dispatch methods and it >>> has to make sure that matrix structure is preserved. After that the code >>> is just >>> >>> case REALSXP: >>> for (i = 0; i < n; i++) >>> LOGICAL(ans)[i] = ISNAN(REAL(x)[i]); >>> break; >>> >>> and it's hard to see how that can be improved. It does suggest that a >>> faster anyNA() function would have to not be generic. >> If it's necessary to make it not generic to achieve the speedup, I don't >> think it's worth doing. If anyNA is written not to be generic I'd guess a >> very common error will be to apply it to a dataframe and get a misleading >> "FALSE" answer. If we do that, I predict that the total amount of r-help >> time wasted on it will exceed the CPU time saved by orders of magnitude. >> > > I wasn't proposing that it should be stupid, just not generic. It could > support data frames (sum(), does, for example). If it didn't support data > frames it should certainly give an error rather than the wrong answer, but > if we are seriously trying to avoid delays around 0.1 seconds then going > through the generic function mechanism may be a problem.
If it's not dataframes, it will be something else. I think it's highly desirable that any(is.na(x)) == anyNA(x) within base packages, and we should make it straightforward to maintain this identity in contributed packages. By the way, I think Bill's suggestion of calling it anyMissing makes a lot of sense. Duncan ______________________________________________ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel