I was thinking about how one does things in a language that is properly 
object-oriented versus R, which makes various half-assed attempts at being such.

Clearly, in some such languages you can make a wrapper object that stores the 
main payload along with anything else you want. You might then need a way to 
convince everything else to let you build lists, vectors and other collections 
of these objects, and perhaps to unbox them automatically for many purposes. In 
a language like Python, for example, you might provide methods so that adding A 
and B actually gets the values out of A and/or B and adds them properly. But 
there may be too many edge cases to handle, and some software may not pay 
attention to what you want, including some libraries written in other languages.
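
To make the wrapper idea concrete, here is a minimal Python sketch. The class 
name Tagged, its fields value and meta, and the helper unbox are all made up 
for illustration; this is one way it might be done, not an established recipe.

```python
from dataclasses import dataclass, field
from typing import Any


def unbox(x: Any) -> Any:
    """Return the payload of a Tagged object, or x itself otherwise."""
    return x.value if isinstance(x, Tagged) else x


@dataclass
class Tagged:
    """Hypothetical wrapper: a main payload plus arbitrary metadata."""
    value: Any
    meta: dict = field(default_factory=dict)

    def __add__(self, other: Any) -> Any:
        # Unbox both operands; the result is a plain value, so the
        # metadata is deliberately dropped at this point.
        return unbox(self) + unbox(other)

    def __radd__(self, other: Any) -> Any:
        # Handles plain + Tagged, and lets sum() start from 0.
        return unbox(other) + unbox(self)


a = Tagged(3, {"source": "survey wave 1"})
b = Tagged(4, {"source": "imputed"})

total = a + b          # both operands are unboxed
mixed = 10 + a         # int does not know Tagged, so __radd__ is used
summed = sum([a, b])   # a list of wrappers works through __radd__
```

The edge cases show up immediately: a * 2 raises a TypeError because only 
addition was taught to unbox, and every other operator, comparison and 
conversion would need the same treatment.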

I mention Python for the odd reason that it is now possible to combine Python 
and R in the same program and sort of switch back and forth between data 
representations. This may provide some openings for preserving and accessing 
metadata when needed.

Realistically, if R were being designed from scratch TODAY, many things might be 
done differently. But I recall its predecessor S being developed at Bell Labs 
for purposes where it was sort of revolutionary at the time, designed to do 
things in a vectorized way and probably primarily for the kinds of scientific 
and mathematical operations where a single NA (of several types depending on 
the data) was enough when augmented by a few things like NaN, Inf and -Inf. I 
doubt they seriously saw a need for an unlimited number of NA values that were 
all the same AND also all different that they felt had to be built in. As 
noted, had they had a reason to make it fully object-oriented too, and made 
base types such as integer into full-fledged objects with room for additional 
metadata, then things might be different. I note I have seen languages that 
have both a data type called integer in lower case and Integer in upper case. 
One of them is regularly boxed and unboxed automagically when used in a 
context that needs the other. As far as efficiency goes, this invisibly adds 
many steps. So do languages that sometimes take a variable that is a pointer 
and invisibly dereference it to provide the underlying field rather than make 
you do extra typing and so on.

So is there any reason only an NA should have such metadata? Why not have 
reasons associated with Inf, stating whether it was an Inf because you asked 
for one or because it was the result of a calculation such as dividing by zero 
(albeit that might instead be a NaN), and so on? Maybe I could annotate 
integers with whether they are prime, or even versus odd, or a factor of 144, 
or anything else I can imagine. But at some point, the overhead from allowing 
all this can become substantial. I was amused at how Python allows a function 
to be annotated, including by itself, since it is an object. So it can store 
such metadata, perhaps in an attached dictionary, so that a complex costly 
calculation can have its results cached; when you ask for the same thing in 
the same session, it checks whether it has already done it and just returns 
the result in roughly constant time. But after a while, how many cached 
results can there be?
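
As a rough illustration of a function annotating itself, here is a small 
Python sketch. The function slow_square and its attributes cache and hits are 
invented for the example, and the "costly" calculation is just a stand-in.

```python
def slow_square(n: int) -> int:
    """Stand-in for a costly calculation (hypothetical example)."""
    cache = slow_square.cache      # dictionary attached to the function itself
    if n in cache:
        slow_square.hits += 1      # count how often the cache was used
        return cache[n]
    result = n * n                 # imagine something expensive here
    cache[n] = result
    return result


# Functions are objects, so attributes can be set on them after definition.
slow_square.cache = {}
slow_square.hits = 0

first = slow_square(12)    # computed and stored
second = slow_square(12)   # served from the attached dictionary
```

Nothing here limits how large the dictionary grows. The standard library's 
functools.lru_cache takes a maxsize argument for exactly that reason and 
evicts the least recently used entries once the limit is reached.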

-----Original Message-----
From: R-devel <r-devel-boun...@r-project.org> On Behalf Of 
luke-tier...@uiowa.edu
Sent: Monday, May 24, 2021 9:15 AM
To: Adrian Dușa <dusa.adr...@unibuc.ro>
Cc: Greg Minshall <minsh...@umich.edu>; r-devel <r-devel@r-project.org>
Subject: Re: [Rd] [External] Re: 1954 from NA

On Mon, 24 May 2021, Adrian Dușa wrote:

> On Mon, May 24, 2021 at 2:11 PM Greg Minshall <minsh...@umich.edu> wrote:
>
>> [...]
>> if you have 500 columns of possibly-NA'd variables, you could have 
>> one column of 500 "bits", where each bit has one of N values, N being 
>> the number of explanations the corresponding column has for why the 
>> NA exists.
>>

PLEASE DO NOT DO THIS!

It will not work reliably, as has been explained to you ad nauseam in this 
thread.

If you distribute code that does this it will only lead to bug reports on R 
that will waste R-core time.

As Alex explained, you can use attributes for this. If you need operations to 
preserve attributes across subsetting you can define subsetting methods that do 
that.

If you are dead set on doing something in C you can try to develop an ALTREP 
class that provides augmented missing value information.

Best,

luke



>
> The mere thought of implementing something like that gives me shivers. 
> Not to mention such a solution should also be robust when subsetting, 
> splitting, column and row binding, etc. and everything can be lost if 
> the user deletes that particular column without realising its importance.
>
> Social science datasets are much more alive and complex than one might 
> first think: there are multi-wave studies with tens of countries, and 
> aggregating such data is already a complex process without adding even more 
> complexity on top of that.
>
> As undocumented as they may be, or even subject to change, I think the 
> R internals are much more reliable than this.
>
> Best wishes,
> Adrian
>
>

--
Luke Tierney
Ralph E. Wareham Professor of Mathematical Sciences
University of Iowa                  Phone:             319-335-3386
Department of Statistics and        Fax:               319-335-3017
    Actuarial Science
241 Schaeffer Hall                  email:   luke-tier...@uiowa.edu
Iowa City, IA 52242                 WWW:  http://www.stat.uiowa.edu
______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
