+1

Avi Gross via R-devel <r-devel@r-project.org> wrote:

> Arguably, R was not developed to satisfy some needs in the way intended.
> 
> When I have had to work with datasets from some of the social sciences, I
> have had to adapt to subtleties in how they did things with software like
> SPSS, in which an NA was encoded using an out-of-bounds marker like 999 or
> "." or even a blank cell. The problem is that R normally stores data such as
> integers or floating point numbers not as text but in their own formats, and
> a vector by definition can only contain ONE data type. So the various forms
> of NA, as well as NaN and Inf, had to be grafted on to be considered VALID to
> share the same storage area, as if they sort of were an integer or floating
> point number or text or whatever.
> 
> It does strike me as possible to simply have a column that is something like 
> a factor that can contain as many NA excuses as you wish such as "NOT 
> ANSWERED" to "CANNOT READ THE SQUIGGLE" to "NOT SURE" to "WILL BE FILLED IN 
> LATER" to "I DON'T SPEAK ENGLISH AND CANNOT ANSWER STUPID QUESTIONS". This 
> additional column would presumably only have content when the other column 
> has an NA. Your queries and other changes would work on something like a 
> data.frame where both such columns coexisted.
> 
> Note that reading in data with multiple NA reasons may take extra work. If
> your error codes are text, it will all become text. If the errors are 999 and 998 
> and 997, it may all be treated as numeric and you may not want to convert all 
> such codes to an NA immediately. Rather, you would use the first 
> vector/column to make the second vector and THEN replace everything that 
> should be an NA with an actual NA and reparse the entire vector to become 
> properly numeric unless you like working with text and will convert to 
> numbers as needed on the fly.
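The two-step procedure described above (build the reason column first, and only then blank out the codes) can be sketched in C. This is only an illustration under stated assumptions: the reason names, the 999/998/997 mapping, and the functions `reason_for` and `split_missing` are made up here, and a plain C NaN stands in for R's NA.

```c
#include <assert.h>
#include <math.h>    /* NAN */
#include <stddef.h>  /* size_t */

/* Hypothetical reason codes; the names are illustrative only. */
enum na_reason {
    REASON_NONE = 0,       /* value is present */
    REASON_NOT_ANSWERED,   /* raw code 999 (assumed mapping) */
    REASON_ILLEGIBLE,      /* raw code 998 (assumed mapping) */
    REASON_NOT_SURE        /* raw code 997 (assumed mapping) */
};

/* Map a raw sentinel to a reason; REASON_NONE means "ordinary value". */
static enum na_reason reason_for(double raw)
{
    if (raw == 999.0) return REASON_NOT_ANSWERED;
    if (raw == 998.0) return REASON_ILLEGIBLE;
    if (raw == 997.0) return REASON_NOT_SURE;
    return REASON_NONE;
}

/* For each element, record the reason first, then write either the
 * original value or a missing marker into the value column.
 * One byte per reason keeps the companion column small. */
void split_missing(const double *raw, size_t n,
                   double *value, unsigned char *reason)
{
    assert(raw != NULL && value != NULL && reason != NULL);
    for (size_t i = 0; i < n; i++) {
        enum na_reason r = reason_for(raw[i]);
        reason[i] = (unsigned char)r;
        value[i] = (r == REASON_NONE) ? raw[i] : NAN;  /* NAN stands in for NA */
    }
}
```

Calling `split_missing(raw, n, value, reason)` on a column that mixes real measurements with the three codes leaves `value` purely numeric and `reason` nonzero only where `value` is missing, which is the pairing of columns suggested above.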
> 
> Now this form of annotation may not be pleasing but I suggest that an 
> implementation that does allow annotation may use up space too. Of course, if 
> your NA values are rare and space is only used then, you might save space. 
> But if you could make a factor column and have it use the smallest int it can 
> get as a basis, it may be a way to save on space.
> 
> People who have done work with R, especially those using the tidyverse, are 
> quite used to using one column to explain another. So if you are asked to,
> say, tabulate what percent of missing values are due to reasons A/B/C, then
> the added column works fine for that calculation too.
> 
> 
> -----Original Message-----
> From: R-devel <r-devel-boun...@r-project.org> On Behalf Of Adrian Dușa
> Sent: Sunday, May 23, 2021 2:04 PM
> To: Tomas Kalibera <tomas.kalib...@gmail.com>
> Cc: r-devel <r-devel@r-project.org>
> Subject: Re: [Rd] 1954 from NA
> 
> Dear Tomas,
> 
> I understand that perfectly, but that is fine.
> The payload is not going to be used in any computations anyways, it is 
> strictly an information carrier that differentiates between different types 
> of (tagged) NA values.
> 
> Having only one NA value in R is extremely limiting for the social sciences, 
> where multiple missing values may exist, because respondents:
> - did not know what to respond, or
> - did not want to respond, or perhaps
> - the question did not apply in a given situation etc.
> 
> All of these need to be captured, stored, and most importantly treated as if 
> they would be regular missing values. Whether the payload might be lost in 
> computations makes no difference: they were supposed to be "missing values" 
> anyways.
> 
> The original question is how the payload is currently stored: as an unsigned 
> int of 32 bits, or as an unsigned short of 16 bits. If the R internals would 
> not be affected (and I see no reason why they would be), it would allow an 
> entire universe for the social sciences that is not currently available and 
> which all other major statistical packages do offer.
> 
> Thank you very much, your attention is greatly appreciated, Adrian
> 
> On Sun, May 23, 2021 at 7:59 PM Tomas Kalibera <tomas.kalib...@gmail.com>
> wrote:
> 
> > TLDR: tagging R NAs is not possible.
> >
> > External software should not depend on how R currently implements NA; 
> > this may change at any time. Tagging of NA is not supported in R (if 
> > it were, it would have been documented). It would not be possible to 
> > implement such tagging reliably with the current implementation of NA in R.
> >
> > NaN payload propagation is not standardized. Compilers are free to, and 
> > do, optimize code without preserving any specific propagation.
> > CPUs/FPUs differ in how they propagate in binary operations, some zero 
> > the payload on any operation. Virtualized environments, binary 
> > translations, etc, may not preserve it in any way, either. ?NA has 
> > disclaimers about this, an NA may become NaN (payload lost) even in 
> > unary operations and also in binary operations not involving other NaN/NAs.
> >
> > Writing any new software that depends on anything specific happening 
> > to the NaN payloads would not be a good idea. One can only reliably 
> > use the NaN payload bits for storage, that is, if one avoids any 
> > computation at all, avoids passing the values to any external code 
> > unaware of such tagging (including R), etc. If such software wants any 
> > NaN to be understood as NA by R, it would have to use the documented R 
> > API for this (so essentially translating), but given the problems 
> > mentioned above, there is really no point in doing that, because such 
> > NAs can become NaNs at any time.
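To make the "storage only" point concrete, here is a minimal C sketch that writes a tag into the low 32 bits of a quiet NaN and reads it back purely through byte copies, with no arithmetic on the value. The function names `nan_with_tag` and `tag_of` and the choice of the low 32 bits are assumptions made for this illustration. It assumes IEEE 754 binary64 doubles and produces an ordinary NaN, not R's NA.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Quiet-NaN pattern for IEEE 754 binary64: exponent all ones,
 * top mantissa bit set. */
#define QNAN_BITS    UINT64_C(0x7FF8000000000000)
/* Low 32 bits used for the tag; an arbitrary choice for this sketch. */
#define PAYLOAD_MASK UINT64_C(0x00000000FFFFFFFF)

/* Build a NaN whose low 32 bits carry `tag` (hypothetical helper). */
double nan_with_tag(uint32_t tag)
{
    assert(sizeof(double) == sizeof(uint64_t));
    uint64_t bits = QNAN_BITS | (uint64_t)tag;
    double d;
    memcpy(&d, &bits, sizeof d);   /* byte copy, no floating-point operation */
    return d;
}

/* Read the tag back by copying the bytes out again (hypothetical helper). */
uint32_t tag_of(double d)
{
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);
    return (uint32_t)(bits & PAYLOAD_MASK);
}
```

The round trip `tag_of(nan_with_tag(t))` gives back `t` only because nothing computes with the value in between. As explained above, something like `tag_of(x + 1.0)` carries no such guarantee, since the hardware or the compiler may drop or change the payload.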
> >
> > Best
> > Tomas
> >
> > On 5/23/21 9:56 AM, Adrian Dușa wrote:
> > > Dear R devs,
> > >
> > > I am probably missing something obvious, but still trying to
> > > understand why the 1954 from the definition of an NA has to fill 32
> > > bits when it normally doesn't need more than 16.
> > >
> > > Wouldn't the code below achieve exactly the same thing?
> > >
> > > typedef union
> > > {
> > >      double value;
> > >      unsigned short word[4];
> > > } ieee_double;
> > >
> > >
> > > #ifdef WORDS_BIGENDIAN
> > > static CONST int hw = 0;
> > > static CONST int lw = 3;
> > > #else  /* !WORDS_BIGENDIAN */
> > > static CONST int hw = 3;
> > > static CONST int lw = 0;
> > > #endif /* WORDS_BIGENDIAN */
> > >
> > >
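> > > static double R_ValueOfNA(void)
> > > {
> > >      volatile ieee_double x;
> > >      /* zero the two middle words; otherwise they are indeterminate */
> > >      x.word[1] = 0;
> > >      x.word[2] = 0;
> > >      x.word[hw] = 0x7ff0;
> > >      x.word[lw] = 1954;
> > >      return x.value;
> > > }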
> > >
> > > This question has to do with the tagged NA values from package 
> > > haven, on which I want to improve. Every available bit counts, 
> > > especially if multi-byte characters are going to be involved.
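One way to check the "exactly the same thing" question is to compare raw bit patterns. The sketch below builds the NA pattern once with two 32-bit words (high word 0x7ff00000, low word 1954, which is how I understand R's own definition) and once with four 16-bit words, then copies the bytes out without ever loading them as a double. The helper names are invented for this illustration, endianness is detected at run time as a stand-in for `WORDS_BIGENDIAN`, and the two middle 16-bit words are assumed to be zeroed explicitly.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef union { double value; uint32_t word[2]; } ieee_double32;
typedef union { double value; uint16_t word[4]; } ieee_double16;

/* Run-time stand-in for WORDS_BIGENDIAN (assumes no mixed-endian layout). */
static int is_big_endian(void)
{
    const uint16_t probe = 1;
    unsigned char first;
    memcpy(&first, &probe, 1);
    return first == 0;
}

/* NA pattern from two 32-bit words. */
uint64_t na_bits_two_words(void)
{
    assert(sizeof(double) == sizeof(uint64_t));
    const int hw = is_big_endian() ? 0 : 1;
    const int lw = 1 - hw;
    ieee_double32 x;
    x.word[hw] = UINT32_C(0x7ff00000);
    x.word[lw] = 1954;
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);   /* read the bytes, not the double */
    return bits;
}

/* Same pattern from four 16-bit words. */
uint64_t na_bits_four_words(void)
{
    const int hw = is_big_endian() ? 0 : 3;
    const int lw = 3 - hw;
    ieee_double16 x;
    memset(&x, 0, sizeof x);          /* zero the two middle words */
    x.word[hw] = 0x7ff0;
    x.word[lw] = 1954;
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);
    return bits;
}
```

Under those assumptions both functions return 0x7FF00000000007A2, so the two layouts describe the same value. What the 16-bit layout changes is only how many bits are formally set aside for 1954, not the bits of NA itself.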
> > >
> > > Best wishes,
> >
> > ______________________________________________
> > R-devel@r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-devel
> >