That helps me better understand what you want to do, Adrian. Getting 
anyone to switch is always a challenge, but changing R enough to tempt them may 
be a bigger one. This is an old story. I was the first adopter of C++ in 
my area, and at first my code had to be built within an all-C project, making 
me reinvent some wheels so the same “make” system knew how to build the two 
compatibly and link them. Of course, everyone eventually had to join me in a 
later release, but I had moved forward by then.

 

I have changed (or more accurately added) lots of languages in my life and 
continue to do so. The biggest challenge is not just to adapt and use a new one 
the way you used the ones already mastered, but to understand WHY someone 
designed the language that way and what kinds of idioms are common and useful, 
even if that means a new way of thinking. But, of course, any “older” language 
has evolved and often drifted in multiple directions. Many now borrow heavily 
from others even when the philosophies differ, and often the results are not 
pretty. Making major changes in R might have serious impacts on existing 
programs, including making them fail simply because they run out of memory.

 

If you look at R, there is plenty you can do in base R, sometimes by standing 
on your head. Yet you see package after package coming along that offers not 
just new things but sometimes a reworking and even remodeling of old things. R 
has a base graphics system I now rarely use and another called lattice I have 
no reason to use again, because I can do so much quite easily in ggplot. 
Similarly, the evolving tidyverse group of packages approaches things from an 
interesting direction, to the point where many people mainly use it and not base 
R. So if someone were to teach a class on how to gather your data, analyze it, 
and draw pretty pictures, the students might walk away thinking they had 
learned R when they had actually learned these packages.

 

Your scenario seems related to a common one: how to have values that 
signal something beyond some normal range in an out-of-band manner. Years ago we 
had functions in languages like C that would return -1 on failure when only 
non-negative results were otherwise possible. That can work fine but fails when 
any possible value in the range can legitimately be returned. Some languages 
deal with this kind of thing using error-handling constructs like exceptions. 
Sometimes you bundle multiple items into a structure and return that, with one 
element of the structure holding some kind of return status and another holding 
the payload. A variation on this theme, as in languages like Go, is to have 
functions return multiple values, one of which is nil on success and an 
error structure on failure.
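A minimal sketch of that status-plus-payload idea in R itself (the function name and message are my own invention): bundle the result and an error slot into one returned list, much like Go's (value, err) convention.

```r
# Return a payload plus a status element, Go-style, in a single list.
safe_sqrt <- function(x) {
  if (is.na(x) || x < 0) {
    list(value = NA_real_, error = "input must be a non-negative number")
  } else {
    list(value = sqrt(x), error = NULL)
  }
}

res <- safe_sqrt(-4)
if (!is.null(res$error)) {
  message("failed: ", res$error)   # the caller checks the status element
}
```

The cost, as noted above, is that every caller has to know about and honor this convention; nothing in R enforces it.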

 

The situation here that seems to concern you is that you would 
like each item in a structure to have attributes that are recognized and 
propagated as it is processed. Older languages tended not to even have such a 
concept, so basic types simply existed, and two instances of the number 5 might 
even share the same underlying storage, and likewise two strings with the same 
contents, and so on. You could of course play the game of making a struct, as 
mentioned above, but then you needed your own code to do all the handling, since 
nothing else knew it contained multiple items or which ones had which purpose.

 

R did add generalized attributes, and some are fairly well integrated, or at 
least partially. “Names” were discussed earlier as not being easy to keep 
around. Factors use their own tagging method, which seems to work fairly well 
but probably not everywhere. What you want may be more general and not built on 
similar foundations.
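A quick illustration of how unevenly attributes propagate in base R: names survive subsetting, but an arbitrary user attribute (here a made-up "reason" tag) is silently dropped.

```r
# Names are propagated by subsetting; custom attributes are not.
x <- c(a = 1, b = 2, c = 3)
attr(x, "reason") <- "wave-1 survey"

names(x[1:2])            # "a" "b"  -- names survive
attr(x[1:2], "reason")   # NULL     -- the custom attribute is gone
```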

 

I look at languages like Python that are arguably more object-oriented now than 
R is and in some ways can be extended better, albeit not in others. If I wanted 
to create an object to hold the number 5, I could add methods that allow it to 
participate in various ways with other objects, usually using the main payload 
but sometimes using the hidden one. I might pair it with the string “five”, but 
also with dozens of other strings for the word representing 5 in other 
languages. So it could act like a number in numerical situations and like text 
when someone is using it to write a novel in any of many languages.

 

You seem to want the original text visible that gives a reason 
something is missing (or something like that), but to have the software TREAT 
it as missing in calculations. In effect, you want is.na() to be a bit 
more like is.numeric() or is.character() and care more about the TYPE of what 
is being stored. An item may contain a 999 and yet be seen not as a number but 
as an NA. The problem I see is that you may also want the item to be a string 
like “DELETED” and yet include it in a vector that R insists can only hold 
integers. R does have a built-in data structure, the list, that indeed 
allows that. You can easily store data as a list of lists rather than a list of 
vectors, among many other structures. Some of those structures might handle your 
needs BUT may only work properly if you build your own packages, as with the 
tidyverse, and break as soon as any other functions encounter them!
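A sketch of that list-based approach: a "column" stored as a list, so each element keeps its own type, with numbers staying numeric and reasons like "DELETED" staying as strings. The sentinel strings here are illustrative only.

```r
# A list column mixing real numbers and string "reasons for missing".
col <- list(3.1, "DELETED", 4.2, "REFUSED", 2.7)

# Your own code must do the bookkeeping: nothing else knows the convention.
is_reason <- vapply(col, is.character, logical(1))
mean(unlist(col[!is_reason]))   # averages only the genuine numbers
```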

 

But then you would arguably no longer be in R but in your own universe based on 
R.

 

I have written much code that does things a bit sideways. For example, I might 
have a treelike structure in which you do some form of search until you 
encounter a leaf node and return that value to be used in a calculation. To 
perform a calculation using multiple trees, such as taking an average, you 
always use find_value(tree) and never hand over the tree itself. As I think I 
pointed out earlier, you can do things like that in many places and hand over a 
variation of your data. In the ggplot example, you might have:

 

ggplot(data = mydata, aes(x = abs(col1), y = convert_string_to_numeric(col2))) + …

 

ggplot would not use the original data in plotting but the view it is asked to 
use. The function I made up above would know which values are some form of NA 
and convert all others, like “12.3”, to numeric form. BUT it would not work as 
simply or smoothly as when your data is already in the format everyone else 
uses.
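One possible sketch of the made-up convert_string_to_numeric() above; the sentinel strings are an assumption for illustration, not a real convention.

```r
# Map sentinel strings to NA and everything else to its numeric value.
convert_string_to_numeric <- function(x) {
  sentinels <- c("DELETED", "REFUSED", "LATE")
  out <- suppressWarnings(as.numeric(x))  # non-numbers become NA anyway
  out[x %in% sentinels] <- NA_real_       # tag sentinels explicitly as NA
  out
}

convert_string_to_numeric(c("12.3", "DELETED", "7"))  # 12.3 NA 7.0
```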

 

So how does R know what something is? Presumably there is some overhead 
associated with a vector, or some table, that records the type. A list 
presumably depends on each internal item having such a type. So maybe what you 
want is for each item in a vector to have a type, where one type is some form 
of NA. But as noted, R often does not give a damn about an NA and happily uses 
it to create more nonsense. One or more copies of things like NA (or NaN or 
Inf) in a bunch of numbers can pollute the mean of them all. Generally R is 
not designed to give a darn. When people complain, they are told to pass 
na.rm=TRUE to mean(), or to remove the offending values some other way before 
asking for a mean, or perhaps to reset them to something like zero.
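The pollution and the standard remedy in two lines:

```r
# One NA poisons the whole result unless you explicitly ask R to drop it.
v <- c(1, 2, NA, 4)
mean(v)                # NA
mean(v, na.rm = TRUE)  # 2.333...
```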

 

So if you want to leave your variables in place with assorted meanings but a 
tag saying they are to be treated as NA, much in R might have to change. Your 
suggested approach, though, is not yet entirely clear to me, but it might mean 
doing something analogous to using extra bits and hoping nobody notices.

 

So, the solution is both blindingly obvious and even more blindingly stupid: 
use complex numbers! All normal content would be stored as numbers like 5.3+0i, 
and any variant of NA as something like 0+3i, where the 3 means an NA of type 3.

 

OK, humor aside: since the social sciences do not tend to even know what 
complex numbers are, this should provide another dimension to hide lots of 
meaningless info. Heck, you could convert a message like “LATE” into some 
numeric form. Assuming an English-centered world (which I do not!), you could 
store it with L replaced by 12, A by 01, and so on, so the imaginary component 
might look like 0+12012005i and could easily be decoded back into LATE when 
needed. Again, not a serious proposal. The storage would probably be twice the 
size of a numeric, albeit you can extract the real part when needed for normal 
calculations and the imaginary part when you want to know about the NA type or 
whatever. 
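For what it is worth, the tongue-in-cheek scheme does run: the real part carries the value, the imaginary part an NA tag (0 meaning "not missing"), and base R complex accessors pull them apart.

```r
# Joke encoding: value in Re(), missing-ness tag in Im().
x <- c(5.3 + 0i, 0 + 3i, 2.1 + 0i)

values <- Re(x)
na_tag <- Im(x)
values[na_tag != 0] <- NA   # anything tagged becomes an ordinary NA

mean(values, na.rm = TRUE)  # 3.7
na_tag[is.na(values)]       # 3 -- the "reason" is still recoverable
```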

 

What R is really missing is quaternions and octonions, the only two other 
variations on complex numbers that are possible. They are sort of complex 
numbers on steroids, with either three or seven distinct square roots of minus 
one, so they allow storage along additional axes in other dimensions.

 

Yes, I am sure someone wrote a package for that! LOL!

 

Ah, here is one: https://cran.r-project.org/web/packages/onion/onion.pdf

 

I will end by saying that in my experience, enticing people to try something 
new is just a start. After they start, you often get lots of complaints and 
requests for help, and even requests to help them move back! Unless you make 
some popular package everyone runs to, NOBODY else will be able to help them on 
some things. The reality is that some of the more common tasks these people do 
are sometimes already optimized for them and often do not require them to know 
more. I have had to use these systems, and for some common tasks they are easy. 
Dialog boxes pop up, you check off various options, and off you go. No need 
to learn lots of programming details like the names of the various functions 
that do a Tukey test, what arguments they need, and what errors might have to 
be handled. I know SPSS often produces LOTS of output, including many things 
you do not want, and then lets you remove parts you don’t need or don’t even 
understand. Sure, R can have similar functionality, but often you are expected 
to stitch various parts together as well as ADD your own bits. I love that and 
value being able to be creative. In my experience, most normal people just want 
to get the job done, be fairly certain others accept the results, and then do 
other activities they are better suited for, or at least think they are.

 

There are intermediate approaches I have used, where I let them do various 
kinds of processing in SPSS and save the result in some format I can read into 
R for additional processing. The latter may not be the kind of work that 
requires keeping track of multiple NA equivalents. Of course, if you want to 
save the results and move them back, that is a challenge. Hybrid approaches may 
tempt them to try something, then later do more and more, and eventually move 
over.

 

From: Adrian Dușa <dusa.adr...@unibuc.ro> 
Sent: Tuesday, May 25, 2021 2:17 AM
To: Avi Gross <avigr...@verizon.net>
Cc: r-devel <r-devel@r-project.org>
Subject: Re: [Rd] [External] Re: 1954 from NA

 

Dear Avi,

 

Thank you so much for the extended messages, I read them carefully.

While partially offering a solution (I've already been there), it creates 
additional work for the user, and some of that is unnecessary.

 

What I am trying to achieve is best described in this draft vignette:

 

devtools::install_github("dusadrian/mixed")

vignette("mixed")

 

Once a value is declared to be missing, the user should not do anything else 
about it. Despite being present, the value should automatically be treated as 
missing by the software. That is the way it's done in all major statistical 
packages like SAS, Stata and even SPSS.

 

My end goal is to make R attractive for my faculty peers (and beyond), almost 
all of whom are massively using SPSS and sometimes Stata. But in order to 
convince them to (finally) make the switch, I need to provide similar 
functionality, not additional work.

 

Re. your first part of the message, I am definitely not trying to change the R 
internals. The NA will still be NA, exactly as currently defined.

My initial proposal was based on the observation that the 1954 payload was 
stored as an unsigned int (thus occupying 32 bits) when it is obvious it 
doesn't need more than 16. That was the only proposed modification, and 
everything else stays the same.

 

I now learned, thanks to all contributors in this list, that building something 
around that payload is risky because we do not know exactly what the compilers 
will do. One possible solution that I can think of, while (still) maintaining 
the current functionality around the NA, is to use a different high word for 
the NA that would not trigger compilation issues. But I have absolutely no idea 
what that implies for the other inner workings of R.

 

I very much trust the R core will eventually find a robust solution, they've 
solved much more complicated problems than this. I just hope the current thread 
will push the idea of tagged NAs on the table, for when they will discuss this.

 

Once that will be solved, and despite the current advice discouraging this 
route, I believe tagging NAs is a valuable idea that should not be discarded.

After all, the NA is nothing but a tagged NaN.

 

All the best,

Adrian

 

 

On Tue, May 25, 2021 at 7:05 AM Avi Gross via R-devel <r-devel@r-project.org 
<mailto:r-devel@r-project.org> > wrote:

I was thinking about how one does things in a language that is properly 
object-oriented versus R that makes various half-assed attempts at being such.

Clearly in some such languages you can make an object that is a wrapper that 
allows you to save an item that is the main payload as well as anything else 
you want. You might need a way to convince everything else to allow you to make 
things like lists and vectors and other collections of the objects and perhaps 
automatically unbox them for many purposes. As an example in a language like 
Python, you might provide methods so that adding A and B actually gets the 
value out of A and/or B and adds them properly.  But there may be too many edge 
cases to handle and some software may not pay attention to what you want 
including some libraries written in other languages.

I mention Python for the odd reason that it is now possible to combine Python 
and R in the same program and sort of switch back and forth between data 
representations. This may provide some openings for preserving and accessing 
metadata when needed.

Realistically, if R was being designed from scratch TODAY, many things might be 
done differently. But I recall it being developed at Bell Labs for purposes 
where it was sort of revolutionary at the time (back when it was S) and 
designed to do things in a vectorized way and probably primarily for the kinds 
of scientific and mathematical operations where a single NA (of several types 
depending on the data) was enough when augmented by a few things like a Nan and 
Inf and -Inf. I doubt they seriously saw a need for an unlimited number of NA 
that were all the same AND also all different that they felt had to be 
built-in. As noted, had they had a reason to make it fully object-oriented too 
and made the base types such as integer into full-fledged objects with room for 
additional metadata, then things may be different. I note I have seen languages 
which have both a data type called integer as lower case and Integer as upper 
case. One of them is regularly boxed and unboxed automagically when used in a 
context that needs the other. As far as efficiency goes, this invisibly adds 
many steps. So do languages that sometimes take a variable that is a pointer 
and invisibly reference it to provide the underlying field rather than make you 
do extra typing and so on.

So is there any reason only an NA should have such meta-data? Why not have 
reasons associated with Inf stating it was an Inf because you asked for one or 
the result of a calculation such as dividing by Zero (albeit maybe that might 
be a NaN) and so on. Maybe I could annotate integers with whether they are 
prime or even  versus odd  or a factor of 144 or anything else I can imagine. 
But at some point, the overhead from allowing all this can become substantial. 
I was amused at how python allows a function to be annotated including by 
itself since it is an object. So it can store such metadata perhaps in an 
attached dictionary so a complex costly calculation can have the results cached 
and when you ask for the same thing in the same session, it checks if it has 
done it and just returns the result in linear time. But after a while, how many 
cached results can there be?

-----Original Message-----
From: R-devel <r-devel-boun...@r-project.org 
<mailto:r-devel-boun...@r-project.org> > On Behalf Of luke-tier...@uiowa.edu 
<mailto:luke-tier...@uiowa.edu> 
Sent: Monday, May 24, 2021 9:15 AM
To: Adrian Dușa <dusa.adr...@unibuc.ro <mailto:dusa.adr...@unibuc.ro> >
Cc: Greg Minshall <minsh...@umich.edu <mailto:minsh...@umich.edu> >; r-devel 
<r-devel@r-project.org <mailto:r-devel@r-project.org> >
Subject: Re: [Rd] [External] Re: 1954 from NA

On Mon, 24 May 2021, Adrian Dușa wrote:

> On Mon, May 24, 2021 at 2:11 PM Greg Minshall <minsh...@umich.edu 
> <mailto:minsh...@umich.edu> > wrote:
>
>> [...]
>> if you have 500 columns of possibly-NA'd variables, you could have 
>> one column of 500 "bits", where each bit has one of N values, N being 
>> the number of explanations the corresponding column has for why the 
>> NA exists.
>>

PLEASE DO NOT DO THIS!

It will not work reliably, as has been explained to you ad nauseam in this 
thread.

If you distribute code that does this it will only lead to bug reports on R 
that will waste R-core time.

As Alex explained, you can use attributes for this. If you need operations to 
preserve attributes across subsetting you can define subsetting methods that do 
that.

If you are dead set on doing something in C you can try to develop an ALTREP 
class that provides augmented missing value information.

Best,

luke



>
> The mere thought of implementing something like that gives me shivers. 
> Not to mention such a solution should also be robust when subsetting, 
> splitting, column and row binding, etc. and everything can be lost if 
> the user deletes that particular column without realising its importance.
>
> Social science datasets are much more alive and complex than one might 
> first think: there are multi-wave studies with tens of countries, and 
> aggregating such data is already a complex process to add even more 
> complexity on top of that.
>
> As undocumented as they may be, or even subject to change, I think the 
> R internals are much more reliable that this.
>
> Best wishes,
> Adrian
>
>

--
Luke Tierney
Ralph E. Wareham Professor of Mathematical Sciences
University of Iowa                  Phone:             319-335-3386
Department of Statistics and        Fax:               319-335-3017
    Actuarial Science
241 Schaeffer Hall                  email:   luke-tier...@uiowa.edu 
<mailto:luke-tier...@uiowa.edu> 
Iowa City, IA 52242                 WWW:  http://www.stat.uiowa.edu
______________________________________________
R-devel@r-project.org <mailto:R-devel@r-project.org>  mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel







-- 

Adrian Dusa
University of Bucharest
Romanian Social Data Archive
Soseaua Panduri nr. 90-92
050663 Bucharest sector 5
Romania

https://adriandusa.eu



