On Mar 11, 2011, at 8:41 AM, Benjamin Stier wrote:

Hi Francisco,

Thanks for your solution. It runs pretty fast compared to my for loop. Here
is a comparison of system.time():

system.time(splitVals <- by(serv, dates, aggregateDf ))
  user  system elapsed
 1.129   0.218   1.348

system.time(... my long for loop...)
  user  system elapsed
276.987   1.544 278.698


I also tried David's solution with "aggregate", but I can't get it to work because I have to wrap the values in as.numeric() inside sum(), since the data is very big.

This comment doesn't make any sense. Unless you have character vectors that need coercion because of malformed values (which was NOT part of the example posed), `sum` should not need any pre-processing or post-processing with `as.numeric`.

> serv <- read.delim("cut.inp")
> serv$datum <- strptime(serv$datum,  "%Y-%m-%d %H:%M:%S")
> dates.serv <- unique(strptime(serv$datum, format="%Y-%m-%d"))
> aggregate(serv[, c("read", "write")], list(format(serv$datum, "%Y-%m-%d")), sum)
     Group.1    read    write
1 2011-01-29 1021439 11726356
2 2011-01-30 1089534  4634910

Perhaps what you really needed was to read the file with colClasses to define the date-time and numeric fields properly. Try this:

serv <- read.delim("cut.inp", colClasses=c("POSIXct", "integer", "integer", "numeric", "numeric"))
aggregate(serv[, c("read", "write")], list(format(serv$datum, "%Y-%m-%d")), sum)
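
For anyone without the original file, here is a minimal sketch of the same colClasses idea. It assumes a three-column, tab-separated layout and uses made-up rows passed through the `text` argument; the real "cut.inp" has more columns and its layout is not shown in this thread, and `txt` and `demo` are names invented for illustration.

```r
# Made-up, tab-separated rows standing in for "cut.inp" (layout is an assumption).
txt <- "datum\tread\twrite\n2011-01-29 10:00:00\t10\t100\n2011-01-29 11:00:00\t20\t200\n2011-01-30 09:30:00\t5\t50\n"

# Declare the column types up front instead of converting afterwards.
demo <- read.delim(text = txt,
                   colClasses = c("POSIXct", "numeric", "numeric"))

str(demo)   # check that datum is POSIXct and read/write are numeric

# Sum read and write per calendar day.
daily <- aggregate(demo[, c("read", "write")],
                   list(day = format(demo$datum, "%Y-%m-%d")), sum)
daily
```

With the types declared at read time, no strptime() call or as.numeric() wrapper should be needed before aggregating.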





I will now try to understand how the by()-function works and what it does.
Thanks again for helping me!

If you read the help(tapply) page you are told that both `by` and `aggregate` are just convenience functions using tapply "under the hood".
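
A small sketch of that relationship, on made-up data with the same column names as in this thread (the values and the `day` variable are invented for illustration): tapply sums one column per group, and aggregate gives the same totals for several columns at once.

```r
# Toy data (made up); only the column names follow the thread.
serv <- data.frame(
  datum = as.POSIXct(c("2011-01-29 10:00:00", "2011-01-29 11:00:00",
                       "2011-01-30 09:30:00"), tz = "UTC"),
  read  = c(10, 20, 5),
  write = c(100, 200, 50)
)
day <- format(serv$datum, "%Y-%m-%d")

# tapply: one sum per day for a single column, returned as a named array.
read_by_day <- tapply(serv$read, day, sum)

# aggregate: the same grouping applied to several columns, returned as a data.frame.
agg <- aggregate(serv[, c("read", "write")], list(day = day), sum)

read_by_day
agg
```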


Regards,

Benjamin


On Thu, Mar 10, 2011 at 04:26:57PM +0000, Francisco Gochez wrote:
Benjamin,

A more elegant "R-style" solution would be to use one of R's "apply"/aggregation
routines, of which there are many. For example, the "by" function can split a
data.frame by some factor/categorical variable(s), and then apply a function to
each "slice". The result can then be pieced back together. See below for an
example in which this factor is simply a parallel vector of pure dates:

# extract pure date component of time and date
dates <- format(serv$datum, "%Y-%m-%d")

# write auxiliary function to aggregate a "slice" of the data.frame
# x will be a "slice" of data from a single day
aggregateDf <- function(x)
{
    # return a one-row data.frame
    data.frame(datum = format(x$datum[1], "%Y-%m-%d"),
               write = sum(x$write),
               read = sum(x$read))
}

# now process each "slice" of the serv data.frame using "by"
splitVals <- by(serv, dates, aggregateDf )

# bind back into a single data.frame
values <- do.call(rbind, splitVals)
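
To try the steps above without the original file, here is a self-contained sketch on made-up data. The toy values are an assumption for illustration only; the column names and the function follow the code above.

```r
# Toy data (made up) with the same column names as the real serv data.frame.
serv <- data.frame(
  datum = as.POSIXct(c("2011-01-29 10:00:00", "2011-01-29 11:00:00",
                       "2011-01-30 09:30:00"), tz = "UTC"),
  read  = c(10, 20, 5),
  write = c(100, 200, 50)
)
dates <- format(serv$datum, "%Y-%m-%d")

# Collapse one day's rows into a one-row data.frame of totals.
aggregateDf <- function(x) {
  data.frame(datum = format(x$datum[1], "%Y-%m-%d"),
             write = sum(x$write),
             read  = sum(x$read))
}

splitVals <- by(serv, dates, aggregateDf)   # one element per distinct day
values <- do.call(rbind, splitVals)         # stack the one-row results
values
```

The result should have one row per day, with the day also appearing as the row name.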


The difference in execution speed is pretty negligible on my machine, so it's a
more concise solution but I don't know if it is much faster.

HTH,

Francisco

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

David Winsemius, MD
West Hartford, CT
