Adrian,

 

This is an aside. I note that many machine-learning algorithms actually do 
something along the lines being discussed. They may take an item like a 
paragraph of text or an email message and add thousands of columns, each one 
a Boolean specifying whether a particular word is or is not in that item. 
They may then run an analysis that tries to heuristically match known spam 
items so as to predict whether new items might be spam. Some may even have 
columns for words taken two or more at a time, such as “must” followed by “have”
or “Your”, “last”, “chance”, resulting in even more columns. The software 
that does the analysis can work on remarkably large collections of this kind, 
in some cases even taking multiple approaches to the same problem and choosing 
among them in some way.
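
For instance, a minimal sketch of the idea in base R (the messages and the 
two-word phrase are made up for illustration):

    ## One logical column per word: is the word present in each message?
    msgs  <- c("your last chance", "you must have this", "meeting at noon")
    words <- unique(unlist(strsplit(msgs, " ")))
    dtm   <- as.data.frame(sapply(words, function(w)
      grepl(paste0("\\b", w, "\\b"), msgs)))

    ## Words taken two at a time work the same way:
    dtm$must_have <- grepl("\\bmust have\\b", msgs)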

 

In your case, yes, adding lots of columns seems like added work. But in data 
science, often the easiest way to do complex things is to loop over selected 
existing columns and create several sets of additional columns that simplify 
later calculations: you just use those values rather than some multi-line 
compound condition. As an example, I have run statistical analyses where I 
kept a Boolean column recording whether the analysis failed (as in: I caught 
it using try(), or else it would have killed my process), another recording 
whether I was told it did not converge properly, and yet another recording 
whether it failed some post-tests. That simplified queries that excluded rows 
where any one of the above was TRUE. I also stored columns for metrics like 
RMSEA and chi-squared values, sometimes dozens of them. And for each of the 
above, I actually had a set of columns for the various models, such as linear 
versus quadratic and more. Worse, as the analysis continued, more derived 
columns were added as various measures of the above results were compared to 
each other, so that the different models could be compared, as in how often 
each was better. Careful choices of naming conventions and nice features of 
the tidyverse made it fairly simple to operate on many columns in the same 
way, such as all columns whose names start with a string or end with …
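
A sketch of what that looked like, with made-up data and hypothetical column 
names (dplyr assumed):

    library(dplyr)

    results <- tibble(
      failed           = c(FALSE, TRUE, FALSE),  # try() caught an error
      converged        = c(TRUE,  NA,   FALSE),  # reported proper convergence
      passed_posttests = c(TRUE,  NA,   TRUE),
      rmsea.linear     = c(0.031, NA,   0.082),
      rmsea.quadratic  = c(0.027, NA,   0.074)
    )

    ## One simple filter replaces a multi-line compound condition:
    clean <- results %>%
      filter(!failed, converged %in% TRUE, passed_posttests %in% TRUE)

    ## Naming conventions let you grab whole families of columns at once
    ## and derive model comparisons, as in how often each was better:
    clean %>%
      select(starts_with("rmsea.")) %>%
      mutate(linear_better = rmsea.linear < rmsea.quadratic)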

 

And, yes, for efficiency I often made a narrower version of the above with 
just the fields I needed, while being careful not to remove what I might need 
later.

 

So it can be done, and fairly trivially if you know what you are doing. If 
the names of all your original columns that behave this way look like *.orig 
and the others look different, you can ask for a function to be applied to 
just those columns that produces another set with the same prefixes but named 
*.converted, and yet another named *.annotation, and so on. You may want to 
remove the originals to save space, but you get the idea. The fact that there 
are six hundred of them means little with such a design, as the above can be 
done in probably a dozen lines of code applied to all of them at once.
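
A hedged sketch of that design, assuming dplyr (>= 1.0); the sentinel codes 
and the convert()/annotate() helpers are made up for illustration:

    library(dplyr)

    df <- tibble(
      age.orig    = c(25, -91, 40),     # -91: pretend "refused" code
      income.orig = c(1000, 2000, -92)  # -92: pretend "don't know" code
    )

    convert  <- function(x) ifelse(x < 0, NA, x)  # sentinels become NA
    annotate <- function(x) case_when(x == -91 ~ "refused",
                                      x == -92 ~ "dont_know")

    ## Apply both functions to every *.orig column at once, producing
    ## matching *.converted and *.annotation columns:
    df <- df %>%
      mutate(across(ends_with(".orig"), convert,
                    .names = "{sub('.orig', '.converted', .col, fixed = TRUE)}"),
             across(ends_with(".orig"), annotate,
                    .names = "{sub('.orig', '.annotation', .col, fixed = TRUE)}"))

    ## Optionally drop the originals: df <- df %>% select(-ends_with(".orig"))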

 

For me, the above is far less complex than what you want to do and can have 
benefits. For example, if you make a graph of points from my larger 
tibble/data.frame using ggplot(), you can do things like specify what color 
to use for a point using a variable that contains the reason the data was 
missing (albeit that assumes the missing part is not what is being graphed), 
or add text giving the reason just above each such point. Your method of 
faking multiple things YOU claim are an NA may not make that doable in the 
above example.
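
A sketch of that graph with toy data; the reason column records why some 
other field was missing (the plotted values themselves are present):

    library(ggplot2)

    pts <- data.frame(
      x      = c(1, 2, 3, 4),
      y      = c(10, 12, 9, 15),
      reason = c(NA, "refused", NA, "dont_know")
    )

    ggplot(pts, aes(x, y, colour = reason)) +
      geom_point(size = 3) +                                    # color by reason
      geom_text(aes(label = reason), vjust = -1, na.rm = TRUE)  # text above point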

 

From: Adrian Dușa <dusa.adr...@unibuc.ro>
Sent: Monday, May 24, 2021 8:18 AM
To: Greg Minshall <minsh...@umich.edu>
Cc: Avi Gross <avigr...@verizon.net>; r-devel <r-devel@r-project.org>
Subject: Re: [Rd] 1954 from NA

 

On Mon, May 24, 2021 at 2:11 PM Greg Minshall <minsh...@umich.edu> wrote:

[...]
if you have 500 columns of possibly-NA'd variables, you could have one
column of 500 "bits", where each bit has one of N values, N being the
number of explanations the corresponding column has for why the NA
exists.

 

The mere thought of implementing something like that gives me shivers. Not to 
mention that such a solution should also be robust to subsetting, splitting, 
column and row binding, etc., and everything can be lost if the user deletes 
that particular column without realising its importance.

 

Social science datasets are much more alive and complex than one might first 
think: there are multi-wave studies covering tens of countries, and 
aggregating such data is already a complex process without adding even more 
complexity on top of it.

 

As undocumented as they may be, or even subject to change, I think the R 
internals are much more reliable than this.

 

Best wishes,

Adrian

 

-- 

Adrian Dusa
University of Bucharest
Romanian Social Data Archive
Soseaua Panduri nr. 90-92
050663 Bucharest sector 5
Romania

https://adriandusa.eu



______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
