Hi Duncan,

Thanks a ton -- I appreciate you taking the time to investigate this, and especially for checking the IEEE standard to clarify.

Cheers,
Kevin

On Mon, Feb 10, 2014 at 11:54 AM, Rainer M Krug <rai...@krugs.de> wrote:
>
>
> On 02/10/14, 19:07 , Duncan Murdoch wrote:
>> On 10/02/2014 10:21 AM, Tim Hesterberg wrote:
>>> This isn't quite what you were asking, but might inform your choice.
>>>
>>> R doesn't try to maintain the distinction between NA and NaN when
>>> doing calculations, e.g.:
>>> > NA + NaN
>>> [1] NA
>>> > NaN + NA
>>> [1] NaN
>>> So for the aggregate package, I didn't attempt to treat them differently.
>>
>> This looks like a bug to me. In 32 bit 3.0.2 and R-patched I see
>>
>> > NA + NaN
>> [1] NA
>> > NaN + NA
>> [1] NA
>
> But under 3.0.2 patched 64 bit on Maverick:
>
> > version
>                _
> platform       x86_64-apple-darwin10.8.0
> arch           x86_64
> os             darwin10.8.0
> system         x86_64, darwin10.8.0
> status         Patched
> major          3
> minor          0.2
> year           2014
> month          01
> day            07
> svn rev        64692
> language       R
> version.string R version 3.0.2 Patched (2014-01-07 r64692)
> nickname       Frisbee Sailing
> > NA+NaN
> [1] NA
> > NaN+NA
> [1] NaN
>
>>
>> This seems more reasonable to me. NA should propagate. (I can see an
>> argument for NaN for the answer here, as I can't think of any possible
>> non-missing value that would give anything else when added to NaN, but
>> the answer should not depend on the order of operands.)
>>
>> However, I get the same as you in 64 bit 3.0.2. All calculations I've
>> shown are on 64 bit Windows 7.
>>
>> Duncan Murdoch
>>
>>
>>>
>>> The aggregate package is available at
>>> http://www.timhesterberg.net/r-packages
>>>
>>> Here is the inst/doc/missingValues.txt file from that package:
>>>
>>> --------------------------------------------------
>>> Copyright 2012 Google Inc. All Rights Reserved.
>>> Author: Tim Hesterberg <roc...@google.com>
>>> Distributed under GPL 2 or later.
>>>
>>>
>>> Handling of missing values and not-a-numbers.
>>>
>>>
>>> Here I'll note how this package handles missing values.
>>> I do it the way R handles them, rather than the more strict way that
>>> S+ does.
>>>
>>> First, for terminology,
>>>   NaN = "not-a-number", e.g. the result of 0/0
>>>   NA  = "missing value" or "true missing value", e.g. survey
>>>         non-response
>>>   xx  = I'll use this for the union of those, or "missing value of
>>>         any kind".
>>>
>>> For background, at the hardware level there is an IEEE standard that
>>> specifies that certain bit patterns are NaN, and specifies that
>>> operations involving an NaN result in another NaN.
>>>
>>> That standard doesn't say anything about missing values, which are
>>> important in statistics.
>>>
>>> So what R and S+ do is to pick one of the bit patterns and declare
>>> that to be a NA. In other words, the NA bit pattern is a subset of
>>> the NaN bit patterns.
>>>
>>> At the user level, the reverse seems to hold.
>>> You can assign either NA or NaN to an object.
>>> But:
>>>   is.na(x) returns TRUE for both
>>>   is.nan(x) returns TRUE for NaN and FALSE for NA
>>> Based on that, you'd think that NaN is a subset of NA.
>>> To tell whether something is a true missing value do:
>>>   (is.na(x) & !is.nan(x))
>>>
>>> The S+ convention is that any operation involving NA results in an NA;
>>> otherwise any operation involving NaN results in NaN.
>>>
>>> The R convention is that any operation involving xx results in an xx;
>>> a missing value of any kind results in another missing value of any
>>> kind. R considers NA and NaN equivalent for testing purposes:
>>>   all.equal(NA_real_, NaN)
>>> gives TRUE.
>>>
>>> Some R functions follow the S+ convention, e.g. the Math2 functions
>>> in src/main/arithmetic.c use this macro:
>>> #define if_NA_Math2_set(y,a,b) \
>>>     if (ISNA (a) || ISNA (b)) y = NA_REAL; \
>>>     else if (ISNAN(a) || ISNAN(b)) y = R_NaN;
>>>
>>> Other R functions, like the basic arithmetic operations +-/*^,
>>> do not (search for PLUSOP in src/main/arithmetic.c).
>>> They just let the hardware do the calculations.
>>> As a result, you can get odd results like
>>> > is.nan(NA_real_ + NaN)
>>> [1] FALSE
>>> > is.nan(NaN + NA_real_)
>>> [1] TRUE
>>>
>>> The R help files help(is.na) and help(is.nan) suggest that
>>> computations involving NA and NaN are indeterminate.
>>>
>>> It is faster to use the R convention; most operations are just
>>> handled by the hardware, without extra work.
>>>
>>> In cases like sum(x, na.rm=TRUE), the help file specifies that both NA
>>> and NaN are removed.
>>>
>>>
>>>
>>>
>>>
>>> >There is one NA but multiple NaNs.
>>> >
>>> >And please re-read 'man memcmp': your cast is wrong.
>>> >
>>> >On 10/02/2014 06:52, Kevin Ushey wrote:
>>> >> Hi R-devel,
>>> >>
>>> >> I have a question about the differentiation between NA and NaN values
>>> >> as implemented in R. In arithmetic.c, we have
>>> >>
>>> >> int R_IsNA(double x)
>>> >> {
>>> >>     if (isnan(x)) {
>>> >>         ieee_double y;
>>> >>         y.value = x;
>>> >>         return (y.word[lw] == 1954);
>>> >>     }
>>> >>     return 0;
>>> >> }
>>> >>
>>> >> ieee_double is just used for type punning so we can check the final
>>> >> bits and see if they're equal to 1954; if they are, x is NA, if
>>> >> they're not, x is NaN (as defined for R_IsNaN).
>>> >>
>>> >> My question is -- I can see a substantial increase in speed (on my
>>> >> computer, in certain cases) if I replace this check with
>>> >>
>>> >> int R_IsNA(double x)
>>> >> {
>>> >>     return memcmp(
>>> >>         (char*)(&x),
>>> >>         (char*)(&NA_REAL),
>>> >>         sizeof(double)
>>> >>     ) == 0;
>>> >> }
>>> >>
>>> >> IIUC, there is only one bit pattern used to encode R NA values, so
>>> >> this should be safe. But I would like to be sure:
>>> >>
>>> >> Is there any guarantee that the different functions in R would return
>>> >> NA as identical to the bit pattern defined for NA_REAL, for a given
>>> >> architecture? Similarly for NaN value(s) and R_NaN?
>>> >>
>>> >> My guess is that it is possible some functions used internally by R
>>> >> might encode NaN values differently; ie, setting the lower word to a
>>> >> value different than 1954 (hence being NaN, but potentially not
>>> >> identical to R_NaN), or perhaps this is architecture-dependent.
>>> >> However, NA should be one specific bit pattern (?). And, I wonder if
>>> >> there is any guarantee that the different functions used in R would
>>> >> return an NaN value as identical to R_NaN (which appears to be the
>>> >> 'IEEE NaN')?
>>> >>
>>> >> (interested parties can see + run a simple benchmark from the gist at
>>> >> https://gist.github.com/kevinushey/8911432)
>>> >>
>>> >> Thanks,
>>> >> Kevin
>>> >>
>>> >> ______________________________________________
>>> >> R-devel@r-project.org mailing list
>>> >> https://stat.ethz.ch/mailman/listinfo/r-devel
>>> >>
>>> >
>>> >
>>> >--
>>> >Brian D. Ripley,                  rip...@stats.ox.ac.uk
>>> >Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
>>> >University of Oxford,             Tel:  +44 1865 272861 (self)
>>> >1 South Parks Road,                     +44 1865 272866 (PA)
>>> >Oxford OX1 3TG, UK                Fax:  +44 1865 272595
>>>
>>> ______________________________________________
>>> R-devel@r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>
>> ______________________________________________
>> R-devel@r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel
>
> --
> Rainer M. Krug, PhD (Conservation Ecology, SUN), MSc (Conservation
> Biology, UCT), Dipl. Phys. (Germany)
>
> Centre of Excellence for Invasion Biology
> Stellenbosch University
> South Africa
>
> Tel :       +33 - (0)9 53 10 27 44
> Cell:       +33 - (0)6 85 62 59 98
> Fax :       +33 - (0)9 58 10 27 44
>
> Fax (D):    +49 - (0)3 21 21 25 22 44
>
> email:      rai...@krugs.de
>
> Skype:      RMkrug
>

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel