>>>>> "TH" == Tim Hesterberg <[EMAIL PROTECTED]>
>>>>>     on Thu, 3 Jul 2008 17:04:24 -0700 writes:

    TH> I made a couple of a changes from the previous version:
    TH> - don't use functions anyMissing or notSorted (which aren't in base R)
    TH> - don't check for dup.row.names attribute (need to modify other functions
    TH>   before that is useful)
    TH> I have not tested this with a wide variety of inputs; I'm assuming that
    TH> you have some regression tests.

yes, we do (and they are part of the R source).

    TH> Here are the file differences.  Let me know if you'd like a different
    TH> format.

    TH> $ diff -c dataframe.R dataframe2.R
    TH> *** dataframe.R Thu Jul 3 15:48:12 2008
    TH> --- dataframe2.R Thu Jul 3 16:36:46 2008

    ...................

context diff is fine (I typically use '-u' but that's not important).

From your patch, I've currently ended in this "patch" :

--- dataframe.R.~19~ 2008-07-03 02:13:21.000163000 +0200
+++ dataframe.R 2008-07-05 13:02:33.000029000 +0200
@@ -579,14 +579,18 @@
         ## row names might have NAs.
         if(is.null(rows)) rows <- attr(xx, "row.names")
         rows <- rows[i]
-        if((ina <- any(is.na(rows))) | (dup <- any(duplicated(rows)))) {
-            ## both will coerce integer 'rows' to character:
-            if (!dup && is.character(rows)) dup <- "NA" %in% rows
-            if(ina)
-                rows[is.na(rows)] <- "NA"
-            if(dup)
-                rows <- make.unique(as.character(rows))
-        }
+
+        ## Do not want to check for duplicates if don't need to
+        noDuplicateRowNames <-
+            (is.logical(i) || (li <- length(i)) < 2 ||
+             (is.numeric(i) && (min(0, i, na.rm=TRUE) < 0 ||
+                                (!any(is.na(i)) && all(i[-li] < i[-1L])))))
+        ## TODO: is.unsorted(., strict=FALSE/TRUE)
+        if(any(is.na(rows)))
+            rows[is.na(rows)] <- "NA" # coerces to integer
+        if(!noDuplicateRowNames && any(duplicated(rows)))
+            rows <- make.unique(as.character(rows)) # coerces to integer
+
         ## new in 1.8.0 -- might have duplicate columns
         if(any(duplicated(nm <- names(x)))) names(x) <- make.unique(nm)
         if(is.null(rows)) rows <- attr(xx, "row.names")[i]

    TH> Here's some code for testing, and timings

    .................

I've rationalized (wrote functions) and slightly extended your
tests; they are now public at
   ftp://ftp.stat.math.ethz.ch/U/maechler/R/data.frame-TH-ex.R

Unfortunately, they show that "the speedup" is negative in some
cases, e.g. for the 'i <- 1:n' case for n <- 1000 or 10000.
I've replicated every system.time() 12 times, to get a sense of
the precision, and that's still the conclusion.

In other words, your proposed 'noDuplicateRowNames' computations
are sometimes more expensive than the duplicated(.) call they
replace.

To me, that means that the whole exercise was probably in vain:
We are not making the code more complicated if it's not a
uniform improvement.

Too bad.....

Martin Maechler, ETH Zurich

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
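
For readers who want to try the comparison themselves, here is a minimal sketch, not the test file referenced above. It wraps the 'noDuplicateRowNames' condition from the patch in a standalone function and times it against any(duplicated(.)) for the 'i <- 1:n' case. The function name noDupIndex, the repetition count of 200, and the use of plain integer row names are assumptions made for illustration only; no particular timing outcome is implied, and results will depend on the R version and machine.

```r
## Hypothetical helper (the name is illustrative, not from the patch):
## TRUE when the row subscript 'i' provably cannot select a row twice.
## The logic mirrors the 'noDuplicateRowNames' expression in the patch.
noDupIndex <- function(i) {
    is.logical(i) || (li <- length(i)) < 2L ||
        (is.numeric(i) &&
         (min(0, i, na.rm = TRUE) < 0 ||            # negative subscripts
          (!any(is.na(i)) && all(i[-li] < i[-1L])))) # strictly increasing
}

## The check is conservative: FALSE only means "duplicates are possible",
## so duplicated() would still have to be called in that case.
noDupIndex(1:10)        # strictly increasing
noDupIndex(c(3, 1, 2))  # unsorted, so the shortcut does not apply

## Timing sketch for the 'i <- 1:n' case (assumed setup, see lead-in).
n <- 10000L
i <- seq_len(n)
rows <- i               # stands in for automatic integer row names

tCheck <- system.time(for (k in 1:200) noDupIndex(i))[["elapsed"]]
tDup   <- system.time(for (k in 1:200) any(duplicated(rows)))[["elapsed"]]

cat("index check:   ", tCheck, "sec\n")
cat("duplicated(.): ", tDup, "sec\n")
```

Repeating each system.time() call several times, as done for the tests above, gives a better sense of the precision than a single run.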