On Sep 21, 2010, at 16:27 , Joshua Wiley wrote:

> On Tue, Sep 21, 2010 at 3:09 AM, Matthew Dowle <mdo...@mdowle.plus.com> wrote:
>>
>> All the solutions in this thread so far use the lapply(split(...)) paradigm
>> either directly or indirectly. That paradigm doesn't scale. That's the
>> likely source of quite a few 'out of memory' errors and performance issues in R.
>
> This is a good point. It is not nearly as straightforward as the
> syntax for data.table (which seems to order and select in one
> step...very nice!), but this should be less memory intensive:
>
> tmp <- data.frame(index = gl(2,20), foo = rnorm(40))
> tmp <- tmp[order(tmp$index, tmp$foo) , ]
>
> # find location of first instance of each level and add 0:4 to it
> x <- sapply(match(levels(tmp$index), tmp$index), `+`, 0:4)
>
> tmp[x, ]
>

That will get you in trouble if any group has size less than 5, though.

Something involving duplicated() could work; you "just" need to generate the sawtooth sequence 0,1,2,3,4,0,1,2,3,4,5,6,0,1,2,... and select values less than or equal to 4. I _think_ this should work (it does on the airquality data frame, anyway):

# assumes tmp is already sorted by index, as above
ix <- tmp$index
s <- seq_along(ix)
j <- diff(s[!duplicated(ix)])     # distance between group starts
s2 <- rep.int(0, length(s))
s2[!duplicated(ix)] <- c(1,j)
d <- s - cumsum(s2)               # 0,1,2,... within each group
tmp[d < 5,]

Or, another version of the same idea, giving "teeth" starting at 1 instead:

d <- s - c(0,cumsum(table(ix)))[factor(ix)]
tmp[d <= 5, ]

(There are times when I contemplate writing a DATAstep() function; this is one of those things that are straightforward in the SAS sequential processing paradigm. Of course there are things that are much more complicated in SAS, too.)

>>
>> data.table doesn't do that internally, and its syntax is pretty easy.
>>
>>> tmp <- data.table(index = gl(2,20), foo = rnorm(40))
>>
>>> tmp[, .SD[head(order(-foo),5)], by=index]
>>       index index.1       foo
>>  [1,]     1       1 1.9677303
>>  [2,]     1       1 1.2731872
>>  [3,]     1       1 1.1100931
>>  [4,]     1       1 0.8194719
>>  [5,]     1       1 0.6674880
>>  [6,]     2       2 1.2236383
>>  [7,]     2       2 0.9606766
>>  [8,]     2       2 0.8654497
>>  [9,]     2       2 0.5404112
>> [10,]     2       2 0.3373457
>>>
>>
>> As you can see it currently repeats the group column which is a
>> shame (on the to do list to fix).
>>
>> Matthew
>>
>> http://datatable.r-forge.r-project.org/
>>
>> --
>> View this message in context:
>> http://r.789695.n4.nabble.com/Sorting-and-subsetting-tp2547360p2548319.html
>> Sent from the R help mailing list archive at Nabble.com.
>>
>> ______________________________________________
>> R-help@r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>
> --
> Joshua Wiley
> Ph.D. Student, Health Psychology
> University of California, Los Angeles
> http://www.joshuawiley.com/

--
Peter Dalgaard
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Email: pd....@cbs.dk  Priv: pda...@gmail.com
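
For anyone who wants to try the sawtooth idea on a case with a short group, here is a minimal base-R sketch. The toy data frame (three groups of sizes 7, 3 and 6) and the names first, d0, d1, res0, res1 and x are made up for illustration and are not from the thread. It runs both versions of the sawtooth, checks them against ave(), and shows what the match() + 0:4 approach does when a group has fewer than 5 rows.

```r
# Hypothetical toy data: group "b" has only 3 rows, fewer than the 5 requested
set.seed(1)
tmp <- data.frame(index = factor(rep(c("a", "b", "c"), times = c(7, 3, 6))),
                  foo = rnorm(16))
tmp <- tmp[order(tmp$index, tmp$foo), ]   # both versions need sorted groups

ix <- tmp$index
s <- seq_along(ix)

# Version 1: "teeth" starting at 0
first <- !duplicated(ix)                  # TRUE at the first row of each group
s2 <- rep.int(0, length(s))
s2[first] <- c(1, diff(s[first]))
d0 <- s - cumsum(s2)
res0 <- tmp[d0 < 5, ]

# Version 2: "teeth" starting at 1
d1 <- s - c(0, cumsum(table(ix)))[factor(ix)]
res1 <- tmp[d1 <= 5, ]

# Cross-check only: ave() gives the same 1-based counter, but it goes
# through split() internally, so it is not the scalable route
stopifnot(all(ave(s, ix, FUN = seq_along) == d1))

# Both versions keep at most 5 rows per group: 5, 3, 5 for a, b, c
table(res0$index)

# The match() + 0:4 approach overruns the short group "b" and picks up
# rows belonging to "c" (some of them twice)
x <- sapply(match(levels(tmp$index), tmp$index), `+`, 0:4)
table(tmp$index[as.vector(x)])
```

The sawtooth versions use only vectorised operations on the whole index column, so nothing is split into per-group pieces; the price is that the data must be sorted by group first.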