[Rd] Apparent bug in summary.data.frame() with columns of Date class and NA's present

Marc Schwartz Mon, 08 Feb 2016 14:05:26 -0800

Hi all,

Based upon an exchange with Göran Broström on R-Help today:


  https://stat.ethz.ch/pipermail/r-help/2016-February/435992.html

there appears to be a bug in summary.data.frame() in the case where a data 
frame contains Date class columns that contain NA's and other columns, if 
present, do not.

Example, modified from R-Help:

x <- c(18000000, 18810924, 19091227, 19027233, 19310526, 19691228, NA)
x.Date <- as.Date(as.character(x), format = "%Y%m%d")

DF.Dates <- data.frame(Col1 = x.Date)

> summary(x.Date)
        Min.      1st Qu.       Median         Mean      3rd Qu. 
"1881-09-24" "1902-12-04" "1920-09-10" "1923-04-12" "1941-01-17" 
        Max.         NA's 
"1969-12-28"          "3" 


# NA's missing from output
> summary(DF.Dates)
      Col1           
 Min.   :1881-09-24  
 1st Qu.:1902-12-04  
 Median :1920-09-10  
 Mean   :1923-04-12  
 3rd Qu.:1941-01-17  
 Max.   :1969-12-28  


DF.Dates$x1 <- 1:7

> DF.Dates
        Col1 x1
1       <NA>  1
2 1881-09-24  2
3 1909-12-27  3
4       <NA>  4
5 1931-05-26  5
6 1969-12-28  6
7       <NA>  7

# NA's still missing
> summary(DF.Dates)
      Col1                  x1     
 Min.   :1881-09-24   Min.   :1.0  
 1st Qu.:1902-12-04   1st Qu.:2.5  
 Median :1920-09-10   Median :4.0  
 Mean   :1923-04-12   Mean   :4.0  
 3rd Qu.:1941-01-17   3rd Qu.:5.5  
 Max.   :1969-12-28   Max.   :7.0  


DF.Dates$x2 <- c(1:6, NA)

# NA's show if another column has any
> summary(DF.Dates)
      Col1                  x1            x2      
 Min.   :1881-09-24   Min.   :1.0   Min.   :1.00  
 1st Qu.:1902-12-04   1st Qu.:2.5   1st Qu.:2.25  
 Median :1920-09-10   Median :4.0   Median :3.50  
 Mean   :1923-04-12   Mean   :4.0   Mean   :3.50  
 3rd Qu.:1941-01-17   3rd Qu.:5.5   3rd Qu.:4.75  
 Max.   :1969-12-28   Max.   :7.0   Max.   :6.00  
 NA's   :3                          NA's   :1     


The behavior appears to occur because summary.Date() assigns an "NAs" attribute 
internally that contains the count of NA's in the source Date vector:

    x <- summary.default(unclass(object), digits = digits, ...)
    if (m <- match("NA's", names(x), 0)) {
        NAs <- as.integer(x[m])
        x <- x[-m]
        attr(x, "NAs") <- NAs
    }

rather than the count being retained as an actual element in the result vector, 
as in summary.default():

       nas <- is.na(object)
       object <- object[!nas]
       qq <- stats::quantile(object)
       qq <- signif(c(qq[1L:3L], mean(object), qq[4L:5L]), digits)
       names(qq) <- c("Min.", "1st Qu.", "Median", "Mean", "3rd Qu.", 
           "Max.")
       if (any(nas)) 
           c(qq, `NA's` = sum(nas))
       else qq
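
To make the difference concrete, here is a minimal sketch that mimics the two storage conventions with plain named vectors. The values are made up and the object names (stats, as_element, as_attribute) are hypothetical; these are stand-ins, not the actual results of summary.default() or summary.Date().

```r
# Made-up six-number summary, standing in for a real summary() result.
stats <- c(Min. = 1, `1st Qu.` = 2, Median = 3, Mean = 3,
           `3rd Qu.` = 4, Max. = 5)

# summary.default() convention: the NA count is a seventh element.
as_element <- c(stats, `NA's` = 3)

# summary.Date() convention: the NA count is an attribute on a
# six-element vector.
as_attribute <- stats
attr(as_attribute, "NAs") <- 3L

length(as_element)         # 7
length(as_attribute)       # 6
attr(as_attribute, "NAs")  # 3
```

Anything that only looks at the length of the vector will therefore see 6 rather than 7 for the attribute-based form.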


This results in an apparent (but not real) error in the value of the variable 
'nr' within summary.data.frame(), which is used to set the length of the result 
created within that function:

   nr <- if (nv) 
       max(unlist(lapply(z, NROW)))
   else 0

'nr' is used later in the function to set the length of the initial result 
vector 'sms', which in turn is assigned back to the result list 'z'.

In the case of my example above, where the NA's are not printed, 'nr' is 6 
rather than 7. 6 is correct, since that is the actual length of the result 
vector from summary.Date(), even though the printed output of that result would 
appear to contain 7 elements, including the NA count, because of the behavior 
of print.summaryDefault().

This results in an apparent truncation of the result, with a loss of the "NAs" 
attribute from summary.Date(), when the result is returned by 
summary.data.frame().
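
The effect on 'nr' can be sketched with mock per-column summaries. This is a simplified illustration using made-up values; the names stats and date_like are hypothetical, and the paste0() line is only a stand-in for the formatting done inside summary.data.frame(), not a copy of its code.

```r
# Made-up six-number summaries standing in for the real summary() results.
stats <- c(Min. = 1, `1st Qu.` = 2.5, Median = 4, Mean = 4,
           `3rd Qu.` = 5.5, Max. = 7)

# Date-like column: six elements, NA count carried only as an attribute.
date_like <- stats
attr(date_like, "NAs") <- 3L

z <- list(Col1 = date_like, x1 = stats)

# Same expression as quoted above: attributes do not add to NROW().
nr <- max(unlist(lapply(z, NROW)))
nr                                  # 6

# Simplified stand-in for the formatting step: paste0() returns a plain
# character vector, so the "NAs" attribute is not carried along.
sms <- paste0(format(names(date_like)), ":",
              format(as.vector(date_like)), "  ")
length(sms) <- nr
length(sms)                         # 6
attr(sms, "NAs")                    # NULL

# A numeric column whose NA count is a seventh element raises nr to 7.
z$x2 <- c(stats, `NA's` = 1)
max(unlist(lapply(z, NROW)))        # 7
```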

If the source vector is numeric, as per my example above, then 'nr' is set to 7 
when NA's are present and the result is correctly printed along with the other 
columns.

The history of the difference in how the NA counts are stored in the different 
summary() methods is not clear to me, so I am unsure how best to approach a 
resolution.

At least three options seem possible, and I have not yet fully thought through 
the implications of each:

1. Modify the code that creates and uses 'nr' in summary.data.frame(), to 
account for the NAs attribute from summary.Date().
2. Restore the NAs attribute later in the code, if present in the vector that 
results from summary.Date().
3. Modify the code in summary.Date() so that it mimics the approach in 
summary.default() relative to storing the NA count.
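
As a rough illustration of what option 1 might look like (with the NA count re-attached as a cell, in the spirit of option 2), here is a sketch on mock data. The helpers rows_needed() and format_cells() are hypothetical names invented for this example; nothing here is taken from the actual summary.data.frame() source, and a real fix would need to handle matrix-valued summaries and the POSIXct case as well.

```r
# Made-up six-number summaries standing in for the real summary() results.
stats <- c(Min. = 1, `1st Qu.` = 2.5, Median = 4, Mean = 4,
           `3rd Qu.` = 5.5, Max. = 7)
date_like <- stats
attr(date_like, "NAs") <- 3L        # summary.Date()-style NA count

z <- list(Col1 = date_like, x1 = stats)

# Hypothetical helper: count one extra row when an "NAs" attribute exists.
rows_needed <- function(s) NROW(s) + as.integer(!is.null(attr(s, "NAs")))
nr <- max(unlist(lapply(z, rows_needed)))

# Hypothetical helper: format one column, appending the NA count as a cell.
format_cells <- function(s, nr) {
    labels <- names(s)
    values <- format(as.vector(s))
    nas <- attr(s, "NAs")
    if (!is.null(nas)) {
        labels <- c(labels, "NA's")
        values <- c(values, as.character(nas))
    }
    cells <- paste0(format(labels), ":", values, "  ")
    length(cells) <- nr             # pad shorter columns with NA
    cells
}

out <- lapply(z, format_cells, nr = nr)
out$Col1[7]                         # the cell holding the NA count
```

With this approach 'nr' becomes 7 for the mock Date-like column, so the NA row has somewhere to go even when no other column contains NA's.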

It is important to note that summary.POSIXct() has code similar to 
summary.Date() relative to the handling of NA's.

In addition, print.summaryDefault() contains checks for both Date and POSIXct 
classes and outputs accordingly. So the inter-dependencies of the handling of 
NA's across the methods are notable.

Thus, since the choice of resolution is likely to have other implications that 
I am not considering here, and I am likely to be missing some nuances, I defer 
to others for comments/corrections.

Thanks and regards,

Marc Schwartz

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
