On Aug 16, 2013, at 1:54 PM, McGehee, Robert wrote:

> R-Devel,
> I store and retrieve a large amount of financial data (millions of rows) in a
> PostgreSQL database keyed by date (and represented in R by class Date).
> Unfortunately, I frequently find that a great deal of processing time is
> spent converting dates from character representations to Date class
> representations in R, presumably because strptime is not fast for large
> vectors (>10,000 elements). I'd like to suggest a patch that speeds up the
> date conversion considerably for almost every large date vector (up to 400x
> in some real-life cases).
>
This is more of a comment: if you want speed and have a standard date format,
you can use fastPOSIXct from the fasttime package. The real bottleneck is the
system calls that do the conversion, and fasttime avoids them by doing fast
string parsing instead:

> system.time(dt1 <- as.Date.character(dtch))
   user  system elapsed
 31.513   0.046  31.559
> system.time(dt1 <- as.Date(fasttime::fastPOSIXct(dtch)))
   user  system elapsed
  0.055   0.018   0.074

Cutting back to unique dates may work for some applications (not for any of
ours, because we are always dealing with timestamps - but that's why we use
POSIXct and not Date), but I'd argue that you may as well do it right away in
your specialized application instead.

Cheers,
Simon

> I suspect almost everyone with large vectors of class Date will find that
> most of their values are duplicated (repeatedly). (There are, after all, only
> 36,524 days in a century.) Given this, as.Date.character can be sped up
> substantially for large vectors by only calling strptime on the unique dates
> and then filling in the calculated values for the entire vector. Since the
> time savings can be several minutes in real-life cases, I think this
> enhancement should certainly be considered. Also, in a worst-case scenario of
> a long vector with only one duplicated value, the suggested change does not
> slow down the calculation.
>
> Here's a proof of concept:
>
> as.Date.character2 <- function(x, ...) {
>     if (anyDuplicated(x)) {
>         ux <- unique(x)
>         idx <- match(x, ux)
>         y <- as.Date.character(ux, ...)
>         return(y[idx])
>     }
>     as.Date.character(x, ...)
> }
>
> ## Example 1: Construct a character vector of 1 million elements containing
> ## 1,000 unique dates. By considering only unique values, speed is >250x faster.
>
>> dtch <- format(sample(Sys.Date()-1:1000, 1e6, replace=TRUE))
>> system.time(dt1 <- as.Date.character(dtch))
>    user  system elapsed
>  12.630  23.628  36.262
>> system.time(dt2 <- as.Date.character2(dtch))
>    user  system elapsed
>   0.117   0.019   0.136
>> identical(dt1, dt2)
> [1] TRUE
>
> ## Example 2: In a "worst case" scenario of a character vector of length
> ## 1,000,002 with 1,000,001 unique dates, the new function is not any slower
> ## (within error).
>
>> dtch <- format(c(Sys.Date(), Sys.Date()+-5e5:5e5))
>> system.time(dt1 <- as.Date.character(dtch))
>    user  system elapsed
>  20.264  25.584  45.855
>> system.time(dt2 <- as.Date.character2(dtch))
>    user  system elapsed
>  20.525  24.809  45.335
>> identical(dt1, dt2)
> [1] TRUE
>
> Alternatively, this logic should be built into strptime itself.
>
> Robert

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
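A minimal sketch of how the two ideas in this thread could be combined:
deduplicate the input with unique()/match() as in the proof of concept above,
and hand only the unique strings to fasttime when they look like ISO
"YYYY-MM-DD" dates, falling back to as.Date.character otherwise. The wrapper
name, the format check, and the requireNamespace() guard are illustrative
assumptions rather than code from either poster; note that fasttime parses in
GMT only.

as.Date.character3 <- function(x, ...) {
    ux <- unique(x)                                  # parse each distinct string only once
    iso <- all(grepl("^\\d{4}-\\d{2}-\\d{2}$", ux))  # fasttime handles ISO-style input only
    if (iso && requireNamespace("fasttime", quietly = TRUE)) {
        y <- as.Date(fasttime::fastPOSIXct(ux))      # fast C-level string parser (GMT)
    } else {
        y <- as.Date.character(ux, ...)              # strptime-based fallback
    }
    y[match(x, ux)]                                  # expand back to the original positions
}

On inputs like Example 1 (many repeats, ISO format) both optimizations apply
and the savings compound; on the nearly all-unique input of Example 2 the
wrapper still has to parse essentially the full vector, just with the faster
parser when the format allows it.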