On Mon, May 24, 2021 at 4:40 PM Bertram, Alexander via R-devel <r-devel@r-project.org> wrote:
> Dear Adrian,
>
> SPSS and other packages handle this problem in a very similar way to what I
> described: they store additional metadata for each variable. You can see
> this in the way that SPSS organizes its file format: each "variable" has
> additional metadata that indicates how specific values of the variable,
> encoded as an integer or a floating point, should be handled in analysis.
> Before you actually run a crosstab in SPSS, the metadata is (presumably)
> applied to the raw data to arrive at an in-memory buffer on which the
> actual model is fitted, etc.

As far as I am aware, SAS and Stata use "very high" and "very low" values to signal a missing value. Basically, it is the same solution, only signalled through a reserved bit pattern rather than attribute metadata: something similar to the IEEE-754 representation of a NaN, 0x7ff0000000000000, but using some other "high" word, e.g. 0x7fe0000000000000.

If I understand this correctly, compilers are liable to mess with the payload of the 0x7ff0... patterns, which endangers even the most basic R structure, the real NA. Perhaps using a different high word such as 0x7fe would be stable, since compilers would not confuse it with a NaN, and then any payload would be "safe" for any specific purpose. I am not sure how SPSS manages its internals, but if it does work that way, it does so in a standard procedural way.

Now, since R's NA payload is at risk: if your solution is "good" for the specific case of social-science missing data, would you recommend that the R developers adopt it for the regular NA as well? We are looking for a general-purpose solution that creates as little additional work as possible for end users. Your solution is already implemented in the "labelled" package, whose user_na_to_na() function is called before doing any statistical analysis, but that still requires users to pay attention to details which the software should take care of automatically.
Best,
Adrian

> The 20-line solution in R looks like this:
>
> df <- data.frame(q1 = c(1, 10, 50, 999),
>                  q2 = c("Yes", "No", "Don't know", "Interviewer napping"),
>                  stringsAsFactors = FALSE)
> attr(df$q1, 'missing') <- 999
> attr(df$q2, 'missing') <- c("Don't know", "Interviewer napping")
>
> excludeMissing <- function(df) {
>   for(q in names(df)) {
>     v <- df[[q]]
>     mv <- attr(v, 'missing')
>     if(!is.null(mv)) {
>       df[[q]] <- ifelse(v %in% mv, NA, v)
>     }
>   }
>   df
> }
>
> table(excludeMissing(df))
>
> If you want to preserve the missing attribute when subsetting the vectors,
> then you will have to take the example further by adding a class and
> `[.withMissing` methods. This might bring the whole project to a few
> hundred lines, but the rules that apply here are well defined and well
> understood, giving you a proper basis on which to build. And perhaps the
> vctrs package might make this even simpler; take a look.
>
> Best,
> Alex

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel