On Fri, Jan 27, 2012 at 1:26 PM, Sam Albers <tonightstheni...@gmail.com> wrote:
> Hello,
>
> I am looking for a way to subset a data frame by choosing the top ten
> maximum values from that dataframe. As well this occurs within some
> factor levels.
>
> ## I've used plyr here but I'm not married to this approach
> require(plyr)
>
> ## I've created a data.frame with two groups and then a id variable (y)
> df <- data.frame(x=rnorm(400, mean=20), y=1:400, z=c("A","B"))
>
> ## So using ddply I can find the highest value of x
> df.max1 <- ddply(df, c("z"), subset, x==sort(x, TRUE)[1])
>
> ## Or the 2nd highest value
> df.max2 <- ddply(df, c("z"), subset, x==sort(x, TRUE)[2])
>
> ## And so on.... but when I try to make a series of numbers like so
> ## to get the top ten values, I don't get a warning message but
> ## two values that don't really make sense to me
> df.max <- ddply(df, c("z"), subset, x==sort(x, TRUE)[1:10])

Well, sort() returns a vector, and == compares it with x element by
element (recycling the ten sorted values along x), so you probably want
%in% instead:

  df.max <- ddply(df, c("z"), subset, x %in% sort(x, TRUE)[1:10])

but it would be better to do (e.g.)

  df.max <- ddply(df, c("z"), subset, rank(-x) <= 10)

Note the minus sign: rank() ranks in ascending order, so rank(x) <= 10
would select the ten smallest values rather than the ten largest. Using
rank() will also make it possible to deal with ties in a principled way,
through its ties.method argument.

Hadley

-- 
Assistant Professor / Dobelman Family Junior Chair
Department of Statistics / Rice University
http://had.co.nz/
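
For anyone who wants to see the recycling for themselves, here is a minimal sketch in base R (no plyr needed). The seed, the helper names a, top10 and keep, and the choice of ties.method = "min" are assumptions made for illustration and are not part of the original thread; ave() is used here only as a stand-in for the ddply() call.

```r
set.seed(1)  # assumed seed, only to make the sketch reproducible
df <- data.frame(x = rnorm(400, mean = 20), y = 1:400, z = c("A", "B"))

# Look at a single group to see what == does with a length-10 vector.
a <- df[df$z == "A", ]
top10 <- sort(a$x, decreasing = TRUE)[1:10]

# == recycles top10 along a$x: row i is compared only with
# top10[(i - 1) %% 10 + 1]. 200 rows is a multiple of 10, so R gives
# no warning, and only rows that happen to line up are kept.
n_equal <- sum(a$x == top10)

# %in% tests each row against all ten values.
n_in <- sum(a$x %in% top10)

# Per-group top ten using rank() on the negated values.
# ties.method = "min" keeps every row tied at the cut-off (an assumption).
keep <- ave(df$x, df$z, FUN = function(v) rank(-v, ties.method = "min")) <= 10
df.top <- df[keep, ]
table(df.top$z)
```

With continuous data such as rnorm() there are no ties, so each group should contribute exactly ten rows. With tied data, swapping ties.method (for example "first") changes how many rows survive at the boundary.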