Thanks Duncan, Comments and a proposed bug fix in-line below:
> -----Original Message----- > From: Duncan Murdoch [mailto:murd...@stats.uwo.ca] > Sent: Wednesday, April 29, 2009 5:10 PM > To: Steven McKinney > Cc: R-help@r-project.org > Subject: Re: [R] Corrupt data frame construction - bug? > > On 29/04/2009 6:41 PM, Steven McKinney wrote: > > Hi useRs, > > > > A recent coding infelicity along these lines yielded a corrupt data > > frame. > > > > foo <- matrix(1:12, nrow = 3) > > bar <- data.frame(foo) > > bar$NewCol <- foo[foo[, 1] == 4, 4] > > bar > > lapply(bar, length) > > > > > > > > > >> foo <- matrix(1:12, nrow = 3) > >> bar <- data.frame(foo) > >> bar$NewCol <- foo[foo[, 1] == 4, 4] > >> bar > > X1 X2 X3 X4 NewCol > > 1 1 4 7 10 <NA> > > 2 2 5 8 11 <NA> > > 3 3 6 9 12 <NA> > > Warning message: > > In format.data.frame(x, digits = digits, na.encode = FALSE) : > > corrupt data frame: columns will be truncated or padded with NAs > >> lapply(bar, length) > > $X1 > > [1] 3 > > > > $X2 > > [1] 3 > > > > $X3 > > [1] 3 > > > > $X4 > > [1] 3 > > > > $NewCol > > [1] 0 > > > > > > Is this a bug in the data.frame machinery? > > If an attempt is made to add a new column to a data frame, and the > new > > object does not have length = number of rows of data frame, or cannot > > be made to have such length via recycling, shouldn't an error be > > thrown? > > > > Instead in this example I end up with a "corrupt data frame" having > > one zero-length column. > > > > > > Should this be reported as a bug, or did I misinterpret the > > documentation? > > I don't think "$" uses any data.frame machinery. You are working at a > lower level. > > If you had added the new column using > > bar <- data.frame(bar, NewCol=foo[foo[, 1] == 4, 4]) > > you would have seen the error: > > Error in data.frame(bar, NewCol = foo[foo[, 1] == 4, 4]) : > arguments imply differing number of rows: 3, 0 > > But since you treated it as a list, it let you go ahead and create > something that was labelled as a data.frame but wasn't. This is one of > the reasons some people prefer S4 methods: it's easier to protect > against people who mislabel things. > I did some more digging on '$' - there is a data.frame method for it: > getAnywhere("$<-.data.frame" ) A single object matching '$<-.data.frame' was found It was found in the following places package:base registered S3 method for $<- from namespace base namespace:base with value function (x, i, value) { cl <- oldClass(x) class(x) <- NULL nrows <- .row_names_info(x, 2L) if (!is.null(value)) { N <- NROW(value) if (N > nrows) stop(gettextf("replacement has %d rows, data has %d", N, nrows), domain = NA) if (N < nrows && N > 0L) if (nrows%%N == 0L && length(dim(value)) <= 1L) value <- rep(value, length.out = nrows) else stop(gettextf("replacement has %d rows, data has %d", N, nrows), domain = NA) if (is.atomic(value)) names(value) <- NULL } x[[i]] <- value class(x) <- cl return(x) }<environment: namespace:base> > I placed a browser() command before return(x) and did some poking around. It seems to me there's a bug in this function. It should be able to detect the problem I threw at it, and throw an error as you point out is thrown by the other data.frame assign method. I modified the rows if (N < nrows && N > 0L) if (nrows%%N == 0L && length(dim(value)) <= 1L) to read if (N < nrows) if (N > 0L && nrows%%N == 0L && length(dim(value)) <= 1L) as in "$<-.data.frame" <- function (x, i, value) { cl <- oldClass(x) class(x) <- NULL nrows <- .row_names_info(x, 2L) if (!is.null(value)) { N <- NROW(value) if (N > nrows) stop(gettextf("replacement has %d rows, data has %d", N, nrows), domain = NA) if (N < nrows) if (N > 0L && nrows%%N == 0L && length(dim(value)) <= 1L) value <- rep(value, length.out = nrows) else stop(gettextf("replacement has %d rows, data has %d", N, nrows), domain = NA) if (is.atomic(value)) names(value) <- NULL } x[[i]] <- value class(x) <- cl return(x) } Now it detects the problem I created, in the fashion you demonstrated above for the replacement using data.frame(). > foo <- matrix(1:12, nrow = 3) > bar <- data.frame(foo) > bar$NewCol <- foo[foo[, 1] == 4, 4] Error in `$<-.data.frame`(`*tmp*`, "NewCol", value = integer(0)) : replacement has 0 rows, data has 3 It doesn't appear to stumble on weird data frames (these from the ?data.frame help page) > L3 <- LETTERS[1:3] > (d <- data.frame(cbind(x=1, y=1:10), fac=sample(L3, 10, replace=TRUE))) > (d0 <- d[, FALSE]) # NULL data frame with 10 rows > (d.0 <- d[FALSE, ]) # <0 rows> data frame (3 cols) > (d00 <- d0[FALSE,]) # NULL data frame with 0 rows > d0$NewCol <- foo[foo[, 1] == 4, 4] Error in `$<-.data.frame`(`*tmp*`, "NewCol", value = integer(0)) : replacement has 0 rows, data has 10 ### Catches this problem above alright. > d.0$NewCol <- foo[foo[, 1] == 4, 4] > d.0 [1] x y fac NewCol <0 rows> (or 0-length row.names) ### Lets the above one through alright. > d00$NewCol <- foo[foo[, 1] == 4, 4] > > d00 [1] NewCol <0 rows> (or 0-length row.names) ### Lets the above one through alright. Would the above modification work to fix this problem? > Duncan Murdoch > > > > > > > > > > >> sessionInfo() > > R version 2.9.0 (2009-04-17) > > powerpc-apple-darwin8.11.1 > > > > locale: > > en_CA.UTF-8/en_CA.UTF-8/C/C/en_CA.UTF-8/en_CA.UTF-8 > > > > attached base packages: > > [1] stats graphics grDevices utils datasets methods base > > > > other attached packages: > > [1] nlme_3.1-90 > > > > loaded via a namespace (and not attached): > > [1] grid_2.9.0 lattice_0.17-22 tools_2.9.0 > > > > > > Also occurs on Windows box with R 2.8.1 > > > > > > > > Steven McKinney > > > > Statistician > > Molecular Oncology and Breast Cancer Program British Columbia Cancer > > Research Centre > > > > email: smckinney +at+ bccrc +dot+ ca > > > > tel: 604-675-8000 x7561 > > > > BCCRC > > Molecular Oncology > > 675 West 10th Ave, Floor 4 > > Vancouver B.C. > > V5Z 1L3 > > Canada > > > > ______________________________________________ > > R-help@r-project.org mailing list > > https://stat.ethz.ch/mailman/listinfo/r-help > > PLEASE do read the posting guide > > http://www.R-project.org/posting-guide.html > > and provide commented, minimal, self-contained, reproducible code. ______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.