+1

Avi Gross via R-devel <r-devel@r-project.org> wrote:
> Arguably, R was not developed to satisfy some needs in the way intended.
>
> When I have had to work with datasets from some of the social sciences I have
> had to adapt to subtleties in how they did things with software like SPSS in
> which an NA was done using an out of bounds marker like 999 or "." or even a
> blank cell. The problem is that R has a concept where data such as integers
> or floating point numbers is not stored as text normally but in their own
> formats and a vector by definition can only contain ONE data type. So the
> various forms of NA as well as NaN and Inf had to be grafted on to be
> considered VALID to share the same storage area as if they sort of were an
> integer or floating point number or text or whatever.
>
> It does strike me as possible to simply have a column that is something like
> a factor that can contain as many NA excuses as you wish such as "NOT
> ANSWERED" to "CANNOT READ THE SQUIGGLE" to "NOT SURE" to "WILL BE FILLED IN
> LATER" to "I DON'T SPEAK ENGLISH AND CANNOT ANSWER STUPID QUESTIONS". This
> additional column would presumably only have content when the other column
> has an NA. Your queries and other changes would work on something like a
> data.frame where both such columns coexisted.
>
> Note reading in data with multiple NA reasons may take extra work. If your
> error codes are text, it will all become text. If the errors are 999 and 998
> and 997, it may all be treated as numeric and you may not want to convert all
> such codes to an NA immediately. Rather, you would use the first
> vector/column to make the second vector and THEN replace everything that
> should be an NA with an actual NA and reparse the entire vector to become
> properly numeric unless you like working with text and will convert to
> numbers as needed on the fly.
>
> Now this form of annotation may not be pleasing but I suggest that an
> implementation that does allow annotation may use up space too.
> Of course, if your NA values are rare and space is only used then, you might
> save space. But if you could make a factor column and have it use the
> smallest int it can get as a basis, it may be a way to save on space.
>
> People who have done work with R, especially those using the tidyverse, are
> quite used to using one column to explain another. So if you are asked to,
> say, tabulate what percent of missing values are due to reasons A/B/C then
> the added column works fine for that calculation too.
>
>
> -----Original Message-----
> From: R-devel <r-devel-boun...@r-project.org> On Behalf Of Adrian Dușa
> Sent: Sunday, May 23, 2021 2:04 PM
> To: Tomas Kalibera <tomas.kalib...@gmail.com>
> Cc: r-devel <r-devel@r-project.org>
> Subject: Re: [Rd] 1954 from NA
>
> Dear Tomas,
>
> I understand that perfectly, but that is fine.
> The payload is not going to be used in any computations anyways, it is
> strictly an information carrier that differentiates between different types
> of (tagged) NA values.
>
> Having only one NA value in R is extremely limiting for the social sciences,
> where multiple missing values may exist, because respondents:
> - did not know what to respond, or
> - did not want to respond, or perhaps
> - the question did not apply in a given situation etc.
>
> All of these need to be captured, stored, and most importantly treated as if
> they would be regular missing values. Whether the payload might be lost in
> computations makes no difference: they were supposed to be "missing values"
> anyways.
>
> The original question is how the payload is currently stored: as an unsigned
> int of 32 bits, or as an unsigned short of 16 bits. If the R internals would
> not be affected (and I see no reason why they would be), it would allow an
> entire universe for the social sciences that is not currently available and
> which all other major statistical packages do offer.
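To make the 16-bit versus 32-bit question concrete, here is a minimal sketch of a tag carried in a NaN payload next to the 1954 marker. It assumes IEEE 754 binary64 doubles. The bit layout and the names make_tagged_na, tag_of and has_na_marker are hypothetical, chosen only for illustration; this is not R's or haven's layout. The value is only stored and read back; no arithmetic is performed on it.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical layout, for illustration only (not R's or haven's):
 *   bits 52..62  exponent, all ones (so the value is a NaN)
 *   bits 16..31  16-bit tag naming the reason for missingness
 *   bits  0..15  the marker 1954
 */
#define NAN_EXPONENT 0x7ff0000000000000ULL
#define NA_MARKER    1954u

/* Copy the bytes of a double into a 64-bit integer. */
static uint64_t bits_of(double d)
{
    uint64_t bits;
    assert(sizeof bits == sizeof d);
    memcpy(&bits, &d, sizeof bits);
    return bits;
}

/* Copy a 64-bit pattern into a double. */
static double double_from_bits(uint64_t bits)
{
    double d;
    memcpy(&d, &bits, sizeof d);
    return d;
}

/* Build a NaN that carries the marker in its low 16 bits and a tag above it. */
static double make_tagged_na(uint16_t tag)
{
    return double_from_bits(NAN_EXPONENT | ((uint64_t)tag << 16) | NA_MARKER);
}

/* Read the tag back from bits 16..31. */
static uint16_t tag_of(double d)
{
    return (uint16_t)(bits_of(d) >> 16);
}

/* True if the exponent is all ones and the low 16 bits hold the marker. */
static int has_na_marker(double d)
{
    uint64_t bits = bits_of(d);
    return (bits & NAN_EXPONENT) == NAN_EXPONENT
        && (bits & 0xffffu) == NA_MARKER;
}
```

With this layout a check that looks only at the low 16 bits still finds 1954, while a check that compares the whole low 32-bit word against 1954 would not match as soon as the tag is non-zero.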
>
> Thank you very much, your attention is greatly appreciated,
> Adrian
>
> On Sun, May 23, 2021 at 7:59 PM Tomas Kalibera <tomas.kalib...@gmail.com>
> wrote:
>
> > TLDR: tagging R NAs is not possible.
> >
> > External software should not depend on how R currently implements NA,
> > this may change at any time. Tagging of NA is not supported in R (if
> > it were, it would have been documented). It would not be possible to
> > implement such tagging reliably with the current implementation of NA in R.
> >
> > NaN payload propagation is not standardized. Compilers are free to and
> > do optimize code not preserving/achieving any specific propagation.
> > CPUs/FPUs differ in how they propagate in binary operations, some zero
> > the payload on any operation. Virtualized environments, binary
> > translations, etc, may not preserve it in any way, either. ?NA has
> > disclaimers about this, an NA may become NaN (payload lost) even in
> > unary operations and also in binary operations not involving other NaN/NAs.
> >
> > Writing any new software that would depend on anything specific
> > happening to the NaN payloads would not be a good idea. One can only
> > reliably use the NaN payload bits for storage, that is if one avoids
> > any computation at all, avoids passing the values to any external code
> > unaware of such tagging (including R), etc. If such software wants any
> > NaN to be understood as NA by R, it would have to use the documented R
> > API for this (so essentially translating) - but given the problems
> > mentioned above, there is really no point in doing that, because such
> > NAs may become NaNs at any time.
> >
> > Best
> > Tomas
> >
> > On 5/23/21 9:56 AM, Adrian Dușa wrote:
> > > Dear R devs,
> > >
> > > I am probably missing something obvious, but still trying to
> > > understand why the 1954 from the definition of an NA has to fill
> > > 32 bits when it normally doesn't need more than 16.
> > >
> > > Wouldn't the code below achieve exactly the same thing?
> > >
> > > typedef union
> > > {
> > >     double value;
> > >     unsigned short word[4];
> > > } ieee_double;
> > >
> > >
> > > #ifdef WORDS_BIGENDIAN
> > > static CONST int hw = 0;
> > > static CONST int lw = 3;
> > > #else  /* !WORDS_BIGENDIAN */
> > > static CONST int hw = 3;
> > > static CONST int lw = 0;
> > > #endif /* WORDS_BIGENDIAN */
> > >
> > >
> > > static double R_ValueOfNA(void)
> > > {
> > >     volatile ieee_double x;
> > >     x.word[hw] = 0x7ff0;
> > >     x.word[lw] = 1954;
> > >     return x.value;
> > > }
> > >
> > > This question has to do with the tagged NA values from package
> > > haven, on which I want to improve. Every available bit counts,
> > > especially if multi-byte characters are going to be involved.
> > >
> > > Best wishes,
> >
> > ______________________________________________
> > R-devel@r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-devel
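On the snippet in the original question: a minimal sketch, assuming IEEE 754 binary64 doubles, that builds the value once from two 32-bit words (high word 0x7ff00000, low word 1954) and once from four 16-bit words, then lets the bit patterns be compared. One point to flag: the posted snippet assigns only word[hw] and word[lw], so the two middle 16-bit words are never set; the sketch zeroes them explicitly. The names na_from_u32, na_from_u16 and bits_of are hypothetical, and the code assembles the pattern with shifts instead of a union so that it does not depend on WORDS_BIGENDIAN.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Copy the bytes of a double into a 64-bit integer. */
static uint64_t bits_of(double d)
{
    uint64_t bits;
    assert(sizeof bits == sizeof d);
    memcpy(&bits, &d, sizeof bits);
    return bits;
}

/* Copy a 64-bit pattern into a double. */
static double double_from_bits(uint64_t bits)
{
    double d;
    memcpy(&d, &bits, sizeof d);
    return d;
}

/* Two 32-bit words: high word 0x7ff00000, low word 1954. */
static double na_from_u32(void)
{
    uint32_t hi = 0x7ff00000u;
    uint32_t lo = 1954u;
    return double_from_bits(((uint64_t)hi << 32) | lo);
}

/* Four 16-bit words, most significant first.
 * The two middle words are zeroed explicitly (an addition to the
 * posted snippet, which leaves them unset). */
static double na_from_u16(void)
{
    const uint16_t w[4] = { 0x7ff0u, 0u, 0u, 1954u };
    uint64_t bits = 0;
    for (int i = 0; i < 4; i++)
        bits = (bits << 16) | w[i];
    return double_from_bits(bits);
}
```

With the middle words zeroed, both constructions give the same 64-bit pattern, so the narrower words by themselves change nothing about the stored value. If the middle words are left unset, as in the posted snippet, that equality is not guaranteed.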