On Aug 16, 2013, at 1:54 PM, McGehee, Robert wrote:

> R-Devel,
> I store and retrieve a large amount of financial data (millions of rows) in a 
> PostgreSQL database keyed by date (and represented in R by class Date). 
> Unfortunately, I frequently find that a great deal of processing time is 
> spent converting dates from character representations to Date class 
> representations in R, presumably because strptime is not fast for large 
> vectors (>10,000 elements). I'd like to suggest a patch that speeds up the 
> date conversion considerably for almost every large date vector (up to 400x 
> in some real-life cases).
> 

This is more of a comment: if you want speed and have a standard date format, 
you can use fastPOSIXct from fasttime. The real bottleneck is the system calls 
that do the conversion; fasttime avoids them by doing fast string parsing 
instead:

> system.time(dt1 <- as.Date.character(dtch))
   user  system elapsed 
 31.513   0.046  31.559 
> system.time(dt1 <- as.Date(fasttime::fastPOSIXct(dtch)))
   user  system elapsed 
  0.055   0.018   0.074 
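
(For the curious: the sketch below is not fasttime's actual implementation - 
just an illustration, with made-up names, of why pure string parsing wins for 
a fixed "YYYY-MM-DD" format. It extracts the fields with substr() and computes 
days since the epoch arithmetically, so it never touches the OS date/time 
machinery.)

```r
## Hypothetical sketch, assuming input is strictly "YYYY-MM-DD".
days_from_civil <- function(y, m, d) {
  ## Days since 1970-01-01 from a civil date (Hinnant's civil-days
  ## algorithm), written in vectorized R arithmetic only.
  y <- y - (m <= 2)
  era <- y %/% 400
  yoe <- y - era * 400
  doy <- (153 * (m + ifelse(m > 2, -3, 9)) + 2) %/% 5 + d - 1
  doe <- yoe * 365 + yoe %/% 4 - yoe %/% 100 + doy
  era * 146097 + doe - 719468
}

fastDate <- function(x) {
  y <- as.integer(substr(x, 1, 4))
  m <- as.integer(substr(x, 6, 7))
  d <- as.integer(substr(x, 9, 10))
  structure(days_from_civil(y, m, d), class = "Date")
}

## fastDate("2013-08-16") == as.Date("2013-08-16")   # TRUE (day 15933)
```

No per-element library call is made, so the whole conversion stays in R's 
vectorized arithmetic.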

Cutting back to unique dates may work for some applications (not for any of 
ours, because we are always dealing with timestamps - but that's why we use 
POSIXct and not Date), but I'd argue that you may as well do it right away in 
your specialized application instead.
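
In case it helps, the dedup idea is generic: a small helper (with_unique is my 
name for it, not an existing function) covers any expensive element-wise 
conversion, not just dates:

```r
## Sketch: apply an expensive vectorized conversion `f` only to the
## unique values of `x`, then expand the result back with match().
with_unique <- function(x, f, ...) {
  ux <- unique(x)
  f(ux, ...)[match(x, ux)]
}

## e.g. with_unique(dtch, as.Date)
##      with_unique(ts_strings, as.POSIXct, tz = "UTC")
```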

Cheers,
Simon



> I suspect almost everyone with large vectors of class Date will find that 
> most of their values are duplicated (repeatedly). (There are, after all, only 
> 36,524 days in a century.) Given this, as.Date.character can be sped up 
> substantially for large vectors by calling strptime only on the unique dates 
> and then filling in the calculated values for the entire vector. Since the 
> time savings can be several minutes in real-life cases, I think this 
> enhancement should certainly be considered. Also, in the worst-case scenario 
> of a long vector with only one duplicated value, the suggested change does 
> not slow down the calculation.
> 
> Here's a proof of concept:
> as.Date.character2 <- function(x, ...) {
>    if (anyDuplicated(x)) {
>        ux <- unique(x)                   # distinct dates only
>        idx <- match(x, ux)               # position of each element in ux
>        y <- as.Date.character(ux, ...)   # convert each distinct date once
>        return(y[idx])                    # expand back to full length
>    }
>    as.Date.character(x, ...)
> }
> 
> ## Example 1: construct a 1-million-element character vector of 1000 unique 
> ## dates. By considering only unique values, speed is >250x faster.
> 
>> dtch <- format(sample(Sys.Date()-1:1000, 1e6, replace=TRUE))
>> system.time(dt1 <- as.Date.character(dtch))
>   user  system elapsed 
> 12.630  23.628  36.262
>> system.time(dt2 <- as.Date.character2(dtch))
>   user  system elapsed 
>  0.117   0.019   0.136 
>> identical(dt1, dt2)
> [1] TRUE
> 
> 
> ## Example 2: in a "worst case" scenario of a 1,000,002-element character 
> ## vector with 1,000,001 unique dates, the new function is not any slower 
> ## (within measurement error).
>> dtch <- format(c(Sys.Date(), Sys.Date()+-5e5:5e5))
>> system.time(dt1 <- as.Date.character(dtch))
>   user  system elapsed 
> 20.264  25.584  45.855
>> system.time(dt2 <- as.Date.character2(dtch))
>   user  system elapsed 
> 20.525  24.809  45.335 
>> identical(dt1, dt2)
> [1] TRUE
> 
> Alternatively, this logic should be built into strptime itself.
> 
> Robert
> 
> ______________________________________________
> R-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
> 
> 
