Adrian,

 

Agreed. Doubling hundreds of columns of data just to describe the missing values is indeed a pain. There are straightforward ways to do it, especially if you use tidyverse packages rather than base R. Just a warning: this message is a tad long, so anyone not interested may want to skip it.

 

But a word of caution about trying to change a feature nobody wanted changed until you came along. R has all kinds of dependencies on the existing ways of looking at an NA value, such as asking is.na(SOMETHING), the many functions like mean() that handle it via mean(SOMETHING, na.rm=TRUE), the way ggplot graphs skip items that are NA, and so on. Any solution you come up with that enlarges the kinds of NA may break some of that, and then you will have no right to complain.

 

What does your data look like? For example, if all the data in a column is small integers, say under a thousand, you can pick some number like 10,000 and store the NA categories as 10,000 + 1, then 10,000 + 2, and so on. THEN, as noted above, you have to be careful to remove all such values in other contexts, either by making a temporary copy in which every number above 10,000 is changed to NA, or by taking subsets of the data that exclude them.
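For instance, a minimal sketch of that encoding (the 10,000 offset and the two reason codes here are arbitrary choices for illustration):

x <- c(5L, 1L, 10001L, 2L, 10002L, 7L)    # 10001 = "bad", 10002 = "worse"

# strip the sentinels before doing real arithmetic
x_clean <- ifelse(x >= 10000L, NA, x)
mean(x_clean, na.rm = TRUE)

# or recover the reason codes when you need them
reason_code <- ifelse(x >= 10000L, x - 10000L, NA)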

 

Floating point data can be handled the same way, or by using a negative number or other tricks.

 

Character DATA can obviously use reserved words that will not occur in the rest of the DATA, such as NA*NA:1 and NA*NA:2 or whatever makes sense to you. Ideally these are values you can remove all at once when needed, perhaps with a regular expression. If you use a factor to store such a field, as is often a good idea, there are ways to force the indexes of your NA-like levels to be whatever you want, such as the first 10 or even the last 10, perhaps letting you play games when they need to be hidden, removed, or refactored into a plain NA. It adds complexity and may break in unexpected ways.
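A rough sketch of the factor version, re-using the reserved level names from above:

resp <- c("yes", "no", "NA*NA:1", "yes", "NA*NA:2")
f <- factor(resp, levels = c("NA*NA:1", "NA*NA:2", "yes", "no"))

# the pseudo-NA levels are pinned at positions 1 and 2, so they are easy
# to detect or to collapse into a plain NA in one step
as.integer(f) <= 2
f[as.integer(f) <= 2] <- NA
droplevels(f)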

 

And, I shudder to say this, more modern usage allows you to use list variables as columns in place of normal vectors. So you can replace a single column (or as many as you want) with a tuple-like column whose first part is your data, including a plain NA when needed, and whose second part is something else, such as a more specific reason for any item where the first part is NA. Heck, you can add a series of Boolean entries to each list, where the second through last entries each encode TRUE if that item has a particular excuse, so an entry can even have multiple excuses (or none). I repeat, I shudder, simply because many commonly used R functions do not expect list columns, and you may need to call them indirectly with something that first extracts only the part needed.
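A minimal sketch of such a list column in a plain data.frame (the column and reason names are invented for illustration):

df <- data.frame(id = 1:4)
df$score <- list(
  list(value = 5,        reason = NA),
  list(value = NA_real_, reason = "RanOutOfTime"),
  list(value = 7,        reason = NA),
  list(value = NA_real_, reason = "DidNotUnderstandQuestion")
)

# extract just the values before handing them to ordinary functions
vals <- vapply(df$score, function(s) s$value, numeric(1))
mean(vals, na.rm = TRUE)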

 

R does have some inconsistencies in how it handles things such as name tags associated with parts of a vector: some functions preserve attributes used this way and others do not. But if you want to emulate the tricks normally used in making factors and matrices, or in giving column names, you can do something like the following, which might work. My EXAMPLE below makes a vector of a dozen sequential numbers as the VALUE, hides an attribute with month names to match, and then changes every third entry to NA:

 

temp <- 1:12
attr(temp, "Month") <- month.abb   # hide a parallel vector of labels
temp[3] <- temp[6] <- temp[9] <- temp[12] <- NA

 

The current value of temp may look like this:

 

> temp

[1]  1  2 NA  4  5 NA  7  8 NA 10 11 NA

attr(,"Month")

[1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov" "Dec"

 

So it has months attached as PLACEHOLDERS and four NA values. To see an NA value's REASON, use the fact that the two share the same index:

 

> attr(temp, "Month")[is.na(temp)]

[1] "Mar" "Jun" "Sep" "Dec"

 

The above asked to see what text is associated with each NA. You can use many techniques like this to find out why a particular item is NA. If you want to know why the sixth item is NA (R indexes from 1, so that is index 6):

 

> attr(temp, "Month")[6]

[1] "Jun"

 

And it can work both ways. If I now changed the above to store NA in the Month attribute (renamed by you to Reason or something), except for entries saying "RanOutOfTime", "DidNotUnderstandQuestion" and so on, you could search the attribute first and get back the index numbers of the questions that matched, and other such things.
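For example, a sketch under that renaming, with made-up reason strings:

temp <- 1:12
attr(temp, "Reason") <- c(NA, NA, "RanOutOfTime", NA, NA,
                          "DidNotUnderstandQuestion", NA, NA, NA, NA, NA, NA)
temp[!is.na(attr(temp, "Reason"))] <- NA       # blank out the excused entries

which(attr(temp, "Reason") == "RanOutOfTime")  # returns 3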

 

There may well be a well-reasoned package that does just what I described, and perhaps some that use less space. The very rough implementation above just hides a second vector, loosely tied to the first, in a way that may be invisible to most other functionality. But it can easily run into problems, as so many operations make new vectors and discard your change. Consider just doubling the odd vector I created:

 

> temp2 <- c(temp, temp)

> temp2

[1]  1  2 NA  4  5 NA  7  8 NA 10 11 NA  1  2 NA  4  5 NA  7  8 NA 10 11 NA

 

The annotation is gone!

 

Now if you do something a tad more normal and re-use the names() feature, maybe you can preserve it in more cases:

 

temp <- 1:12
names(temp) <- month.abb   # names survive more operations than custom attributes
temp[3] <- temp[6] <- temp[9] <- temp[12] <- NA

 


> temp
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec 
  1   2  NA   4   5  NA   7   8  NA  10  11  NA 

 

NAMES used this way can sometimes be preserved. For example, some functions have arguments for it:

 

> temp2 <- c(temp, temp, use.names=TRUE)
> temp2
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec 
  1   2  NA   4   5  NA   7   8  NA  10  11  NA   1   2  NA   4   5  NA   7   8  NA  10  11  NA 

 

So, it may well be that you can play such games with your input, but doing that for hundreds of columns is a tad of work, though it can be automated easily enough if all the columns are similar, such as repeats of data in a time series (a sketch of automating it follows the worked example below). As noted, R functions that read in DATA expect all items in a column to be of the same underlying type or NA. If your data has text giving a REASON, and you know exactly which reasons are allowed, with any remaining values left as is, you might do something like the following.

 

Say column_orig looks like this: 5, 1, bad, 2, worse, 1, 2, 5, bad, 6, worse, 
missing, 2

 

Your stuff may be read in as CHARACTER and look like:

 

> column_orig
 [1] "5"       "1"       "bad"     "2"       "worse"   "1"       "2"       "5"       "bad"    
[10] "6"       "worse"   "missing" "2"

 

So, you can process the above with something like ifelse() to make a temporary version. Do this VERY carefully, as ifelse() does not preserve name attributes!

 

# keep the reason text where there is one, NA elsewhere
names.temp <- ifelse(column_orig %in% c("bad", "worse", "missing"), column_orig, NA)

# blank out the reasons in the data itself, then convert it to numeric
column_orig <- ifelse(column_orig %in% c("bad", "worse", "missing"), NA, column_orig)
column_orig <- as.numeric(column_orig)

# attach the reasons as names
names(column_orig) <- names.temp

 

> column_orig
   <NA>    <NA>     bad    <NA>   worse    <NA>    <NA>    <NA>     bad    <NA>   worse missing    <NA> 
      5       1      NA       2      NA       1       2       5      NA       6      NA      NA       2 

 

(The above may not be formatted correctly in this email, but it shows the names on the first line and the data on the second. Wherever the data is NA, the reason is in the name.)
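And as promised, a rough sketch of automating that same split across a whole data.frame, assuming every column was read in as character and uses the same reason strings (mydata is a placeholder name):

reasons <- c("bad", "worse", "missing")

split_column <- function(col) {
  nms <- ifelse(col %in% reasons, col, NA)
  out <- as.numeric(ifelse(col %in% reasons, NA, col))
  names(out) <- nms
  out
}

mydata[] <- lapply(mydata, split_column)
# note: many later operations will silently drop those names again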

 

Again, I am just playing with your specified need and pointing out ways R may partially support it, though probably far from ideally, since you are trying to do something R was never designed for. I suspect the philosophy behind using a tibble instead of a data.frame may preserve your meta-info better.

 

But if all you want is to know the reason for a missing observation while using 
little space, there may be other ways to consider such as making a sparse 
matrix from the original data if missing values are rare enough. Sure, it might 
have 600 columns and umpteen rows, but you can store a small integer or even a 
byte in each entry and perhaps skip any row that has nothing missing. If you 
later need the info and the data has not been scrambled such as by removing 
rows or columns or sorting, you can easily find it. Or, if you simply add one 
more column with some form of unique sequence number or ID and maintain it, you 
can always index back to find what you want, WITHOUT all the warnings mentioned 
above.
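A sketch of that last idea, keeping the reasons in their own small table keyed by the ID (all names invented):

dat <- data.frame(id = 1:5, x = c(5, NA, 2, NA, 6))

# one row per missing cell: which row, which column, and why
na_reasons <- data.frame(id     = c(2, 4),
                         column = c("x", "x"),
                         reason = c("bad", "worse"))

# look up why x is missing in the row with id 4
na_reasons$reason[na_reasons$id == 4 & na_reasons$column == "x"]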

 

If memory is a huge concern, consider massaging your original data down to what you need, saving THAT to a file on disk, and removing the extra objects so the space can be garbage collected. When and IF you ever need that info at some later date, the form you chose can be read back in. But be careful: such meta-info is lost unless you use a method that conserves it. Do not save it as a CSV file, for example, but as something R writes and can read back in the same way.
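In base R, saveRDS() and readRDS() round-trip an object with its attributes intact, which a CSV will not; a minimal sketch using the vector from earlier:

saveRDS(column_orig, "column_orig.rds")   # names and other attributes survive
rm(column_orig)
gc()                                      # reclaim the memory for now

# ... much later ...
column_orig <- readRDS("column_orig.rds")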

 

Or, you can try your own twists on changing how NA works and take lots of risks, as you would not be doing something published and guaranteed.

 

I think I can now politely bow out of this topic and wish you luck with 
whatever you choose. It may even be using something other than R!

 

From: Adrian Dușa <dusa.adr...@unibuc.ro> 
Sent: Monday, May 24, 2021 5:26 AM
To: Avi Gross <avigr...@verizon.net>
Cc: r-devel <r-devel@r-project.org>
Subject: Re: [Rd] 1954 from NA

 

Hmm...

If it were only one column then your solution would be neat. But with 5-600 variables, each of which can contain multiple missing values, doubling the number of variables just to describe NA values seems to me excessive.

Not to mention we should be able to quickly convert / import / export from one 
software package to another. This would imply maintaining some sort of metadata 
reference of which explanatory additional factor describes which original 
variable.

 

All of this strikes me as a lot of hassle compared to storing some information within a tagged NA value... I just need a few more bits to play with.

 

Best wishes,

Adrian

 

On Sun, May 23, 2021 at 10:21 PM Avi Gross via R-devel <r-devel@r-project.org> wrote:

Arguably, R was not developed to satisfy some of these needs in the way you intend.

When I have had to work with datasets from some of the social sciences, I have had to adapt to subtleties in how they did things with software like SPSS, in which an NA was represented by an out-of-bounds marker like 999 or "." or even a blank cell. The problem is that R has a concept where data such as integers or floating point numbers are normally stored not as text but in their own formats, and a vector by definition can only contain ONE data type. So the various forms of NA, as well as NaN and Inf, had to be grafted on and considered VALID so they could share the same storage area, as if they sort of were an integer or floating point number or text or whatever.

It does strike me as possible to simply have a column that is something like a factor that can contain as many NA excuses as you wish, such as "NOT ANSWERED", "CANNOT READ THE SQUIGGLE", "NOT SURE", "WILL BE FILLED IN LATER", or "I DON'T SPEAK ENGLISH AND CANNOT ANSWER STUPID QUESTIONS". This additional column would presumably only have content when the other column has an NA. Your queries and other changes would work on something like a data.frame where both such columns coexist.

Note that reading in data with multiple NA reasons may take extra work. If your error codes are text, the whole column will be read as text. If the errors are 999 and 998 and 997, it may all be treated as numeric, and you may not want to convert all such codes to NA immediately. Rather, you would use the first vector/column to make the second vector, THEN replace everything that should be an NA with an actual NA, and reparse the entire vector so it becomes properly numeric, unless you like working with text and will convert to numbers as needed on the fly.

Now this form of annotation may not be pleasing, but I suggest that an implementation which does allow annotation uses up space too. Of course, if your NA values are rare and space is used only for them, you might save space. And if you can make the factor column use the smallest integer type available as its basis, that may save space as well.

People who have done work with R, especially those using the tidyverse, are quite used to using one column to explain another. So if you are asked to, say, tabulate what percent of missing values are due to reasons A/B/C, the added column works fine for that calculation too.

 

-- 

Adrian Dusa
University of Bucharest
Romanian Social Data Archive
Soseaua Panduri nr. 90-92
050663 Bucharest sector 5
Romania

https://adriandusa.eu

