On Mar 11, 2011, at 8:41 AM, Benjamin Stier wrote:
Hi Francisco,
Thanks for your solution. It runs pretty fast compared to my for loop.
Here is a comparison of system.time():
system.time(splitVals <- by(serv, dates, aggregateDf ))
   user  system elapsed
  1.129   0.218   1.348
system.time(... my long for loop...)
   user  system elapsed
276.987   1.544 278.698
I also tried David's solution with "aggregate", but I can't get it to
work because I have to add as.numeric() into the sum(), since the data
is very big.
This comment doesn't make any sense. Unless you have character vectors
that because of malformed values need coercion (which was NOT part of
the example posed) then `sum` should not need any pre-processing or
post-processing with `as.numeric`.
> serv <- read.delim("cut.inp")
> serv$datum <- strptime(serv$datum, "%Y-%m-%d %H:%M:%S")
> dates.serv <- unique(strptime(serv$datum, format="%Y-%m-%d"))
> aggregate(serv[, c("read", "write")], list(format(serv$datum, "%Y-%m-%d")), sum)
     Group.1    read    write
1 2011-01-29 1021439 11726356
2 2011-01-30 1089534  4634910
Perhaps what you really needed was to read the file with colClasses to
define the date-time and numeric fields properly. Try this:
serv <- read.delim("cut.inp", colClasses=c("POSIXct", "integer",
                   "integer", "numeric", "numeric"))
aggregate(serv[, c("read", "write")], list(format(serv$datum, "%Y-%m-%d")), sum)
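To see what colClasses does without needing the original file, here is a minimal sketch that reads a small made-up table from a string. The three-column layout, the values, and the name serv_demo are assumptions for illustration only; the real "cut.inp" has five columns.

```r
# Hypothetical three-column, tab-separated stand-in for "cut.inp"
# (column layout and values are invented for this sketch)
txt <- "datum\tread\twrite
2011-01-29 10:00:00\t10\t100
2011-01-29 18:30:00\t20\t200
2011-01-30 09:15:00\t5\t50"

# colClasses fixes the type of each column at read time
serv_demo <- read.delim(text = txt,
                        colClasses = c("POSIXct", "integer", "integer"))

# Inspect the first class of each column
col_classes <- sapply(serv_demo, function(col) class(col)[1])

# Per-day sums on the typed columns, with no coercion inside sum()
daily <- aggregate(serv_demo[, c("read", "write")],
                   list(format(serv_demo$datum, "%Y-%m-%d")), sum)
print(col_classes)
print(daily)
```

Running sapply(serv, class) on the real data in the same way should show quickly whether any column came in as character or factor.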
I will now try to understand how the by()-function works and what it does.
Thanks again for helping me!
If you read the help(tapply) page you are told that both `by` and
`aggregate` are just convenience functions using tapply "under the
hood".
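As a rough illustration of that relationship, the sketch below computes the same per-day sums with tapply directly and with aggregate. The toy data frame and its values are made up for the example and only mimic the datum/read/write columns discussed in this thread.

```r
# Toy stand-in for the "serv" data.frame (values are invented)
serv_toy <- data.frame(
  datum = as.POSIXct(c("2011-01-29 10:00:00", "2011-01-29 18:30:00",
                       "2011-01-30 09:15:00"), tz = "UTC"),
  read  = c(10L, 20L, 5L),
  write = c(100L, 200L, 50L)
)

# Grouping variable: the calendar day of each timestamp
day <- format(serv_toy$datum, "%Y-%m-%d")

# tapply handles one vector at a time: split by day, apply sum to each group
read_by_day  <- tapply(serv_toy$read,  day, sum)
write_by_day <- tapply(serv_toy$write, day, sum)

# aggregate does the same for several columns and returns a data.frame
agg <- aggregate(serv_toy[, c("read", "write")], list(day), sum)
print(read_by_day)
print(agg)
```

The tapply results are named vectors keyed by day, while aggregate returns the grouping values in a Group.1 column.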
Regards,
Benjamin
On Thu, Mar 10, 2011 at 04:26:57PM +0000, Francisco Gochez wrote:
Benjamin,
A more elegant "R-style" solution would be to use one of R's
"apply"/aggregation routines, of which there are many. For example, the
"by" function can split a data.frame by some factor/categorical
variable(s), and then apply a function to each "slice". The result can
then be pieced back together. See below for an example in which this
factor is simply a parallel vector of pure dates:
# extract pure date component of time and date
dates <- format(serv$datum, "%Y-%m-%d")
# write auxiliary function to aggregate a "slice" of the data.frame
# x will be a "slice" of data from a single day
aggregateDf <- function(x)
{
  # return a one-row data.frame
  data.frame(datum = format(x$datum[1], "%Y-%m-%d"),
             write = sum(x$write),
             read = sum(x$read))
}
# now process each "slice" of the serv data.frame using "by"
splitVals <- by(serv, dates, aggregateDf )
# bind back into a single data.frame
values <- do.call(rbind, splitVals)
The difference in execution speed is pretty negligible on my machine,
so it's a more concise solution but I don't know if it is much faster.
HTH,
Francisco
______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
David Winsemius, MD
West Hartford, CT