Re: [R] Summary of variables with NA, empty

Lopez, Dan Wed, 24 Oct 2012 08:36:18 -0700

The examples I gave--Null, Empty string, white space, etc.--were just examples 
based on SPSS Modeler's Data Audit node.

I just want something that both identifies the columns having missing values-- 
regardless of how they are technically stored (NA, a field with the space bar 
hit a couple of times, etc.)--and tabulates them by type of missing 
value. This is a basic data exploration step that I thought might come 
standard in R and that I just don't know of yet.

Hmisc::describe is good and may have to suffice. "Missing" for the example 
below using Hmisc::describe was 0 although there was a "". And I know that's 
because of the technical difference: the "" is stored as a factor level, not as NA.
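
A minimal sketch of that technical difference, using a small made-up factor rather than the real COMMUTE_BIN column (the vector `x` and the variable names are illustrative assumptions). The blank is an ordinary level, so is.na() does not see it until the level is recoded:

```r
# Hypothetical stand-in for mydata$COMMUTE_BIN: one blank entry, no NA
x <- factor(c("<15", "15 - 24", "", "<15", "55+"))

n_na    <- sum(is.na(x))                # 0: the blank is a level, not NA
n_blank <- sum(as.character(x) == "")   # 1: count the "" entries directly

# Recode the "" level as NA so that NA-aware tools count it as missing
x2 <- x
levels(x2)[levels(x2) == ""] <- NA
n_na_after <- sum(is.na(x2))            # 1
```

After the recode, functions that report NA counts should treat the former blank as missing.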

#EXAMPLE data - just one column in this case
> dput(sample(mydata$COMMUTE_BIN,100))
structure(c(2L, 5L, 3L, 2L, 6L, 3L, 2L, 3L, 4L, 2L, 2L, 4L, 3L, 
4L, 3L, 3L, 3L, 6L, 2L, 2L, 2L, 4L, 6L, 4L, 2L, 3L, 2L, 2L, 6L, 
3L, 2L, 6L, 3L, 2L, 3L, 4L, 4L, 4L, 5L, 7L, 3L, 5L, 2L, 3L, 2L, 
2L, 6L, 7L, 7L, 4L, 3L, 3L, 2L, 2L, 2L, 5L, 2L, 2L, 2L, 2L, 2L, 
2L, 5L, 2L, 3L, 3L, 6L, 4L, 6L, 2L, 7L, 4L, 6L, 2L, 3L, 2L, 2L, 
2L, 3L, 2L, 3L, 4L, 3L, 5L, 3L, 4L, 2L, 3L, 3L, 3L, 3L, 3L, 2L, 
4L, 5L, 3L, 2L, 2L, 3L, 1L), .Label = c("", "<15", "15 - 24", 
"25 - 34", "35 - 44", "45 - 54", "55+"), class = "factor")

As David mentioned, maybe I will have to create my own function. Maybe something 
similar to what I have here for identifying factor columns, their level labels 
and number of levels.
#EXAMPLE of the kind of function I will probably need to create for identifying 
and listing column names and counts of NA, "" and other "missing" values in a 
dataframe or table. In this case, however, I am listing factor columns and 
excluding columns w/ >32 levels
set.seed(1)
dat1 <- data.frame(col1 = factor(sample(1:25, 10, replace = TRUE)),
                   col2 = sample(letters[1:10], 10, replace = TRUE),
                   col3 = factor(rep(1:5, each = 2)),
                   stringsAsFactors = TRUE)  # col2 must be a factor (the default before R 4.0)
# Print the number of levels and the level labels of every factor column
# that has at most 32 levels
PrintLvls2 <- function(x) {
  keep <- sapply(x, function(col) is.factor(col) && nlevels(col) <= 32)
  print(data.frame(Lvls  = sapply(x[keep], nlevels),
                   Names = sapply(x[keep], function(y) paste0(levels(y), collapse = ", "))),
        right = FALSE)
}
> PrintLvls2(dat1)
     Lvls Names                          
col1 9    2, 6, 7, 10, 15, 16, 17, 23, 24
col2 7    b, c, d, e, g, h, j            
col3 5    1, 2, 3, 4, 5 
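
A rough sketch of what the missing-value version might look like, in base R only. The name MissingSummary, the column names and the example data are all made up for illustration, and "white space" is assumed to mean a non-empty string consisting only of space characters such as spaces and tabs:

```r
# Sketch: count NA, empty-string and whitespace-only values per column.
# MissingSummary is a hypothetical name, not an existing R function.
MissingSummary <- function(df) {
  counts <- t(sapply(df, function(col) {
    s <- as.character(col)                  # factors become their labels
    c(NA_count   = sum(is.na(s)),
      Empty      = sum(!is.na(s) & s == ""),
      WhiteSpace = sum(!is.na(s) & grepl("^[[:space:]]+$", s)))
  }))
  out <- data.frame(ColIndex = seq_along(df), counts, row.names = names(df))
  out$Valid <- nrow(df) - rowSums(counts)
  out[rowSums(counts) > 0, , drop = FALSE]  # keep only columns with something missing
}

# Made-up example data: column c has nothing missing and is dropped
ex <- data.frame(a = c("x", "", "  ", NA),
                 b = c(1, NA, 3, 4),
                 c = c("p", "q", "r", "s"),
                 stringsAsFactors = TRUE)
res <- MissingSummary(ex)
print(res)
```

Working on as.character(col) means factor, character and numeric columns go through the same code path; other definitions of "missing" could be added as further entries in the inner vector.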

Thanks.
Dan

-----Original Message-----
From: Bert Gunter [mailto:[email protected]] 
Sent: Tuesday, October 23, 2012 3:15 PM
To: David Winsemius
Cc: Lopez, Dan; R help ([email protected])
Subject: Re: [R] Summary of variables with NA, empty

To highlight:

"Basically all Null values" is a meaningless phrase in R. ?NULL ?NA ?NaN have 
**very specific meanings** in R and have nothing to do with the various sorts 
of whitespace characters that David mentions (spaces, tabs...). If you wish to 
use R, you **must** understand the distinctions (the Intro to R tutorial 
discusses some of this -- have you read it?).

There is functionality to test for these sorts of things (is.na, is.null, etc). 
You need to put in the effort to learn about this if you mean to use R in any 
serious way, as these will occur in either data I/O (NA's) or data manipulation 
(e.g. 0/0)
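
A short illustration of those distinctions, as a sketch using only base R (the vector `x` is made up for the example):

```r
x <- c(1, NA, 0/0)     # 0/0 evaluates to NaN
is.na(x)               # FALSE  TRUE  TRUE: is.na() is also TRUE for NaN
is.nan(x)              # FALSE FALSE  TRUE: NA itself is not NaN
is.null(NULL)          # TRUE: NULL is the empty object, not a missing value
length(NULL)           # 0
is.na(c("", " "))      # FALSE FALSE: blank strings are ordinary values
```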

-- Bert

On Tue, Oct 23, 2012 at 2:44 PM, David Winsemius <[email protected]> wrote:
>
> On Oct 23, 2012, at 11:17 AM, Lopez, Dan wrote:
>
>> Hi,
>>
>> Is there a function I can use on my dataframe to give me a concise summary 
>> of variables that are NA,blank,etc? Basically all Null values, Empty 
>> strings, white space, blank values. Ideally it would look something like the 
>> below:
>>
>> # it should only includes the fields with NAs, blanks, etc. Added bonus 
>> would be to include column Index.
>> #Valid Records = records that are not NA, blank,etc #ColIndex - what 
>> place is column in the original dataframe...1,2,3, ...xth
>>
>>                Valid Records  Null (NA?)  Empty String  White Space  Blank Value  ColIndex
>
> Would a "Valid Record" be defined by grep("[^ ]", column)? ... i.e. has 
> a non-space character in it. What is a "ColIndex"?
> How is an "Empty String" different than "White Space" or a "Blank Value"?
>
>
>
>> Var1                       52        8                                       
>>                                  2
>> Var2                       40           20                                   
>>         10                           10                                      
>>      3
>> Var3                       58                                                
>>            2                                                                 
>>              20
>> ..
>>
>
> I generally use describe from package:Hmisc. There are other versions of 
> describe in other packages. It's not going to classify items composed 
> entirely of a varying number of spaces and other non-character items like 
> tabs as a single group. And it's unclear what you will use as an operational 
> definition to separate blanks and white-space. You will probably need to code 
> that yourself. You might want to look at the code for Hmisc::describe as a 
> starting point.
>
>
>> I know there is summary() but I am not sure if that always displays NAs and 
>> blanks, especially with factor variables that have several levels (lumps them 
>> in 'Other' when I run the entire dataframe).
>
>
>> In these instances I can run the individual field separately and see all 
>> levels but that would be inefficient to do for a dataframe with over 50 
>> variables.
>
> How were you going to "run the individual field"? If you show us code, there 
> might be more rapid progress. It would probably be very easy to turn that 
> into a function that could then be "run" with `lapply`.
>>
>>
> --
>
> David Winsemius, MD
> Alameda, CA, USA
>
> ______________________________________________
> [email protected] mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide 
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

-- 

Bert Gunter
Genentech Nonclinical Biostatistics

Internal Contact Info:
Phone: 467-7374
Website:
http://pharmadevelopment.roche.com/index/pdb/pdb-functional-groups/pdb-biostatistics/pdb-ncb-home.htm
