See data.table:::duplist, which does that (or at least something very similar) in C, and for multiple columns too.
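
For readers without data.table installed, the following is a rough base R sketch of the kind of first-row-per-group detection being referred to, extended to two grouping columns. It is not duplist itself and makes no claim about its interface. The column names g1 and g2, the toy data, and the cutoff of 3 rows per group are all made up for illustration, and the data is assumed to be sorted by the grouping columns first.

```r
# Sketch only: a base R stand-in, not data.table's internal duplist.
set.seed(1)
tmp <- data.frame(g1  = rep(c("a", "b"), each = 6),
                  g2  = rep(c(1, 2, 1, 2), times = c(2, 4, 3, 3)),
                  foo = rnorm(12))

# Sort by both grouping columns, then by descending foo within each group
tmp <- tmp[order(tmp$g1, tmp$g2, -tmp$foo), ]

# TRUE at the first row of every (g1, g2) group
first <- !duplicated(tmp[c("g1", "g2")])

# Sawtooth counter 0, 1, 2, ... that restarts at each group start:
# subtract the row number of the current group's first row
s <- seq_len(nrow(tmp))
d <- s - s[first][cumsum(first)]

# Keep at most the top 3 rows per group; smaller groups are kept whole
top3 <- tmp[d < 3, ]
```

Because the counter is built from the group starts rather than from fixed offsets, a group with fewer rows than the cutoff (here the (a, 1) group with 2 rows) simply contributes all of its rows.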

Matthew

http://datatable.r-forge.r-project.org/

"peter dalgaard" <pda...@gmail.com> wrote in message
news:660991c3-b52b-4d58-b819-eadc95ecc...@gmail.com...
>
> On Sep 21, 2010, at 16:27 , Joshua Wiley wrote:
>
>> On Tue, Sep 21, 2010 at 3:09 AM, Matthew Dowle <mdo...@mdowle.plus.com> wrote:
>>>
>>> All the solutions in this thread so far use the lapply(split(...)) paradigm
>>> either directly or indirectly. That paradigm doesn't scale. That's the likely
>>> source of quite a few 'out of memory' errors and performance issues in R.
>>
>> This is a good point. It is not nearly as straightforward as the
>> syntax for data.table (which seems to order and select in one
>> step...very nice!), but this should be less memory intensive:
>>
>> tmp <- data.frame(index = gl(2,20), foo = rnorm(40))
>> tmp <- tmp[order(tmp$index, tmp$foo) , ]
>>
>> # find location of first instance of each level and add 0:4 to it
>> x <- sapply(match(levels(tmp$index), tmp$index), `+`, 0:4)
>>
>> tmp[x, ]
>>
>
> That will get you in trouble if any group has size less than 5, though.
>
> Something involving duplicated() could work; you "just" need to generate
> the sawtooth sequence: 0,1,2,3,4,0,1,2,3,4,5,6,0,1,2,... and select values
> less than or equal to 4. I _think_ this should work (it does on the
> airquality dataframe, anyway):
>
> ix <- tmp$index
>
> s <- seq_along(ix)
> j <- diff(s[!duplicated(ix)])
> s2 <- rep.int(0, length(s))
> s2[!duplicated(ix)] <- c(1,j)
> d <- s - cumsum(s2)
>
> tmp[d < 5,]
>
> Or, another version of the same idea, giving "teeth" starting at 1 instead:
>
> d <- s - c(0,cumsum(table(ix)))[factor(ix)]
> tmp[d <= 5, ]
>
>
> (There are times when I contemplate writing a DATAstep() function, this is
> one of those things that are straightforward in the SAS sequential
> processing paradigm. Of course there are things that are much more
> complicated in SAS, too.)
>
>
>>>
>>> data.table doesn't do that internally, and its syntax is pretty easy.
>>>
>>>> tmp <- data.table(index = gl(2,20), foo = rnorm(40))
>>>
>>>> tmp[, .SD[head(order(-foo),5)], by=index]
>>>       index index.1       foo
>>>  [1,]     1       1 1.9677303
>>>  [2,]     1       1 1.2731872
>>>  [3,]     1       1 1.1100931
>>>  [4,]     1       1 0.8194719
>>>  [5,]     1       1 0.6674880
>>>  [6,]     2       2 1.2236383
>>>  [7,]     2       2 0.9606766
>>>  [8,]     2       2 0.8654497
>>>  [9,]     2       2 0.5404112
>>> [10,]     2       2 0.3373457
>>>>
>>>
>>> As you can see it currently repeats the group column which is a
>>> shame (on the to do list to fix).
>>>
>>> Matthew
>>>
>>> http://datatable.r-forge.r-project.org/
>>>
>>>
>>> --
>>> View this message in context:
>>> http://r.789695.n4.nabble.com/Sorting-and-subsetting-tp2547360p2548319.html
>>> Sent from the R help mailing list archive at Nabble.com.
>>>
>>> ______________________________________________
>>> R-help@r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>>
>>
>>
>> --
>> Joshua Wiley
>> Ph.D. Student, Health Psychology
>> University of California, Los Angeles
>> http://www.joshuawiley.com/
>
> --
> Peter Dalgaard
> Center for Statistics, Copenhagen Business School
> Solbjerg Plads 3, 2000 Frederiksberg, Denmark
> Phone: (+45)38153501
> Email: pd....@cbs.dk Priv: pda...@gmail.com