Hi,

On Mon, Nov 26, 2012 at 4:57 PM, Sam Steingold <s...@gnu.org> wrote:
[snip]
>> Could you please copy paste the output of `(head(infl, 20))` as
>> well as an approximation of what the result is that you want.

Don't know how "dput" got clipped in your reply from the quoted text I
wrote, but I actually asked for `dput(head(infl, 20))`

The dput makes a world of difference because I can easily copy/paste
the output into R and get a working table.

> this prints all the levels for all the factor columns and takes
> megabytes.

Try using droplevels, e.g.:

R> dput(droplevels(head(infl, 20)))

> --8<---------------cut here---------------start------------->8---
>> f <- data.frame(id=rep(1:3,4),country=rep(6:8,4),delay=1:12)
>> f
>    id country delay
> 1   1       6     1
> 2   2       7     2
> 3   3       8     3
> 4   1       6     4
> 5   2       7     5
> 6   3       8     6
> 7   1       6     7
> 8   2       7     8
> 9   3       8     9
> 10  1       6    10
> 11  2       7    11
> 12  3       8    12
>> f <- as.data.table(f)
>> setkey(f,id)
>> delays <-
>> f[,list(min=min(delay),max=max(delay),count=.N,country=unique(country)),by="id"]
>> delays
>    id min max count country
> 1:  1   1  10     4       6
> 2:  2   2  11     4       7
> 3:  3   3  12     4       8
> --8<---------------cut here---------------end--------------->8---
>
> this is still too slow, apparently because of unique.
> how do I speed it up?

I think I'm missing something.

Your call to `min(delay)` and `max(delay)` will return the minimum and
maximum delays within the particular "id" you are grouping by. I guess
there must be several values for "country" within each "id" group -- do
you really want the same min and max values to be replicated as many
times as there are unique "country"s?

Do you perhaps want to iterate over a combo of id and country?

Anyway: if you don't use `unique` inside your calculation, I guess it
goes significantly faster, like so:

R> result <- f[, list(min=min(delay), max=max(delay), count=.N, country=country[1L]), by="share.id"]

If that's bearable, and you really want the way you suggest (or, at
least, what I'm interpreting), I wonder if this two-step would be
faster?

R> setkeyv(f, c('share.id', 'country'))
R> r1 <- f[, list(min=min(delay), max=max(delay), count=.N), by='share.id']
R> result <- unique(f)[r1] ## I think

-steve

--
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
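
P.S. To make the comparison concrete, here is a minimal sketch of both suggestions on the toy table from the quoted message. It assumes the grouping column is `id` (as in the toy data) rather than `share.id`, and a recent data.table in which `unique()` looks at all columns unless `by` is given. The names `one_pass`, `pairs` and `two_step` are only for illustration, and `merge()` stands in for the keyed join `unique(f)[r1]`, which I have not re-checked against current versions.

```r
library(data.table)

# Toy data from the quoted message; the grouping column is `id` here
# (assumption: the real table calls it `share.id`).
f <- data.table(id = rep(1:3, 4), country = rep(6:8, 4), delay = 1:12)
setkey(f, id)

# Single pass: take the first country in each group instead of unique().
# This matches the unique() version only when each id has one country.
one_pass <- f[, list(min = min(delay), max = max(delay), count = .N,
                     country = country[1L]), by = "id"]

# Two-step: aggregate per id, then attach the distinct (id, country) pairs.
# `by` is spelled out so the result does not depend on unique()'s default.
r1 <- f[, list(min = min(delay), max = max(delay), count = .N), by = "id"]
pairs <- unique(f[, list(id, country)], by = c("id", "country"))
two_step <- merge(pairs, r1, by = "id")

print(one_pass)
print(two_step)
```

On the toy data both results have one row per id. They would differ on real data only where an id maps to several countries, in which case the two-step version returns one row per (id, country) pair with the per-id min, max and count repeated.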