Re: [R] Selecting ranges of dates from a dataframe

David Winsemius Fri, 11 Mar 2011 06:00:54 -0800


On Mar 11, 2011, at 8:41 AM, Benjamin Stier wrote:

Hi Francisco,
Thanks for your solution. It runs pretty fast compared to my for loop. Here
is a comparison of system.time():

system.time(splitVals <- by(serv, dates, aggregateDf ))
  user  system elapsed
 1.129   0.218   1.348

system.time(... my long for loop...)
  user  system elapsed
276.987   1.544 278.698
I also tried David's solution with "aggregate", but I can't get it to work because I have to add as.numeric() into the sum(), since the data is very big.

This comment doesn't make any sense. Unless you have character vectors that, because of malformed values, need coercion (which was NOT part of the example posed), `sum` should not need any pre-processing or post-processing with `as.numeric`.


> serv <- read.delim("cut.inp")
> serv$datum <- strptime(serv$datum,  "%Y-%m-%d %H:%M:%S")
> dates.serv <- unique(strptime(serv$datum, format="%Y-%m-%d"))

> aggregate(serv[, c("read", "write")], list(format(serv$datum, "%Y-%m-%d")), sum)

     Group.1    read    write
1 2011-01-29 1021439 11726356
2 2011-01-30 1089534  4634910

Perhaps what you really needed was to read the file with colClasses to define the date-time and numeric fields properly. Try this:

serv <- read.delim("cut.inp", colClasses=c("POSIXct", "integer", "integer", "numeric", "numeric"))
aggregate(serv[, c("read", "write")], list(format(serv$datum, "%Y-%m-%d")), sum)
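
As a rough sketch of what colClasses does at read time (the three-column layout, the object names `txt` and `demo`, and the values here are made up for illustration; they are not taken from cut.inp):

```r
# Hypothetical tab-delimited input standing in for the real file
txt <- paste("datum\tread\twrite",
             "2011-01-29 10:00:00\t10\t100",
             "2011-01-29 18:30:00\t20\t200",
             "2011-01-30 09:15:00\t5\t50",
             sep = "\n")

# Declare the column types up front instead of converting afterwards
demo <- read.delim(text = txt,
                   colClasses = c("POSIXct", "numeric", "numeric"))

# First class of each column, to confirm the types took effect
col_classes <- sapply(demo, function(col) class(col)[1])

# With proper types, sum needs no as.numeric wrapper
daily <- aggregate(demo[, c("read", "write")],
                   list(format(demo$datum, "%Y-%m-%d")), sum)
```

Reading the counters as "numeric" rather than "integer" also means the sums are computed in double precision.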

I will now try to understand how the by()-function works and what it does.
Thanks again for helping me!

If you read the help(tapply) page you are told that both `by` and `aggregate` are just convenience functions using tapply "under the hood".
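
For comparison, a minimal sketch of the same per-day sums done with tapply directly. The small data frame below is a made-up stand-in for the poster's data, and the names `day`, `daily_read`, `daily_write` and `daily` are only for illustration:

```r
# Hypothetical stand-in for the poster's data; values are made up
serv <- data.frame(
  datum = as.POSIXct(c("2011-01-29 10:00:00", "2011-01-29 18:30:00",
                       "2011-01-30 09:15:00"), tz = "UTC"),
  read  = c(10, 20, 5),
  write = c(100, 200, 50)
)

# Grouping key: the calendar day as a character string
day <- format(serv$datum, "%Y-%m-%d")

# One tapply call per column; each result is named by day
daily_read  <- tapply(serv$read,  day, sum)
daily_write <- tapply(serv$write, day, sum)

# Assemble a data.frame similar in shape to what aggregate() returns
daily <- data.frame(day   = names(daily_read),
                    read  = as.vector(daily_read),
                    write = as.vector(daily_write))
```

tapply handles one vector at a time, which is why `aggregate` and `by` are more convenient once several columns or a whole data.frame are involved.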

Regards,

Benjamin


On Thu, Mar 10, 2011 at 04:26:57PM +0000, Francisco Gochez wrote:
Benjamin,

A more elegant "R-style" solution would be to use one of R's "apply"/
aggregation routines, of which there are many. For example, the "by" function
can split a data.frame by some factor/categorical variable(s), and then apply a
function to each "slice". The result can then be pieced back together. See
below for an example in which this factor is simply a parallel vector of pure
dates:

# extract pure date component of time and date
dates <- format(serv$datum, "%Y-%m-%d")

# write auxiliary function to aggregate a "slice" of the data.frame
# x will be a "slice" of data from a single day
aggregateDf <- function(x)
{
    # return a one-row data.frame
    data.frame(datum = format(x$datum[1], "%Y-%m-%d"), write = sum(x$write),
               read = sum(x$read))
}

# now process each "slice" of the serv data.frame using "by"
splitVals <- by(serv, dates, aggregateDf )

# bind back into a single data.frame
values <- do.call(rbind, splitVals)
The difference in execution speed is pretty negligible on my machine, so it's a
more concise solution but I don't know if it is much faster.

HTH,

Francisco
______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


David Winsemius, MD
West Hartford, CT

