On Jul 4, 2009, at 6:21 AM, Vincent Vinh-Hung wrote:

Dear List:

I have a question related to a previous discussion
How to sum one column in a data frame keyed on other columns
https://stat.ethz.ch/pipermail/r-help/2006-December/122141.html
(George Nachman, Bill Venables)

The original query was to calculate the sum of visits for each unique
tuple of (url, time) from the data frame dat:

dat
         url time somethingirrelevant visits
1 www.foo.com 1:00                 xxx    100
2 www.foo.com 1:00                 yyy     50
3 www.foo.com 2:00                 xyz     25
4 www.bar.com 1:00                 xxx    200
5 www.bar.com 1:00                 zzz    200
6 www.foo.com 2:00                 xxx    500

The response gave:

tdat
         url time total_visits
4 www.bar.com 1:00          400
1 www.foo.com 1:00          150
3 www.foo.com 2:00          525

My question is how can I build a similar data frame but having also rows for
combinations of (url, time) that were not in dat?
In this example, the (url, time) without record of visit would be
(www.bar.com, 2:00):

ndat
         url time total_visits
www.bar.com 1:00          400
www.bar.com 2:00            0
www.foo.com 1:00          150
www.foo.com 2:00          525

I have tried to build a data frame with
adat <- data.frame (url = rep(unique(dat$url), each=length(unique(dat$time))),
        time=unique(dat$time), alt_visits=0)
ndat <- merge (adat, tdat, by=c("url", "time"), all = TRUE)
then replace the NA and remove the alt_visits column.
But this appears clumsy and quite slow with 10000 rows and 4 columns,
I would be most grateful for other suggestions,

?expand.grid

Perhaps:

> fulldat <- expand.grid( unique(dat$url), unique(dat$time) )
> names(fulldat) <- c("url", "time")
> merge(fulldat, tdat)
          url time total_visits
1 www.bar.com 1:00          400
2 www.foo.com 1:00          150
3 www.foo.com 2:00          525
> merge(fulldat, tdat, all=TRUE) # notice that default is to drop missing levels
          url time total_visits
1 www.bar.com 1:00          400
2 www.bar.com 2:00           NA
3 www.foo.com 1:00          150
4 www.foo.com 2:00          525

If they are not ordered, then you might want to sort the elements before they get sent to expand.grid for purposes of readability.

David Winsemius, MD
Heritage Laboratories
West Hartford, CT

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Reply via email to