This gives the first mode in each group: # test data xx <- structure(list(v1 = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 1), v2 = structure(c(1L, 2L, 2L, 2L, 4L, 4L, 3L, 3L, 5L, 1L), .Label = c("A", "B", "D", "W", "Z"), class = "factor")), .Names = c("v1", "v2"), row.names = c(NA, 10L), class = "data.frame")
# 1. first mode only in each group Mode2 <- function(x) { tab <- table(x); names(tab)[which.max(tab)] } aggregate(xx["v2"], xx["v1"], Mode2) # 2. All modes in each group: Mode3 <- function(x) { tab <- table(x); names(tab)[tab == max(tab)] } do.call(rbind, by(xx, xx$v1, function(x) data.frame(v1 = x$v1[1], v2 = Mode3(x$v2)))) # 3. Here is an SQL solution giving all modes in each group: library(sqldf) sqldf("select v1, v2 from xx a group by v1, v2 having count(*) >= (select max(k) from (select v1, v2, count(*) as k from xx b where b.v1 = a.v1 group by v1, v2) s)") The inner select gets the counts for all groups having the same v1 as the current v1 and the select surrounding it takes the largest of those. You could also check if replacing the >= with = makes any difference to the time. You could also try the above query with PostgreSQL. Just issue the library statement below and then repeat the sqldf statement just given. sqldf checks whether RpgSQL is loaded and if it is then it automatically uses PostgreSQL instead of sqlite. Note that the first time you execute sqldf with PostgreSQL in a session it will load java so disregard the timing of the first run. # 4. PostgreSQL library(RpgSQL) # repeat sqldf statement above On Mon, Mar 15, 2010 at 1:43 PM, Gabriel Yospin <yosp...@gmail.com> wrote: > Greetings Everyone - > > I have a data frame "x" that looks like this: > > v1 v2 > 1 A > 1 B > 1 B > 2 B > 2 W > 2 W > 3 D > 3 D > 3 Z > > What I would like to do is create a new data frame, "y", that has one row > for each unique value of v1, and returns the corresponding mode of v2. If I > were to run it on the above data frame, it should therefore return: > > v1 v2 > 1 B > 2 W > 3 D > > I've been using the following code: > > x <- data.frame(v1 = c(1,1,1,2,2,2,3,3,3), v2 = > c("A","B","B","B","W","W","D","D","Z")) > y <- aggregate.data.frame(x, by = list(x$var1), FUN = "Mode") > > which relies on the Mode function from package prettyR. The above code > works for me. > > My problem comes when I use my real database. Running this produces many > warnings, because there are multiple modes of v2 for many values of v1. My > database is also rather large (~700,000 rows), and I'm wondering if there is > a faster way to get R to process these data. > > Thank you for your help and consideration, > > Gabriel Yospin > Center for Ecology and Evolutionary Biology > University of Oregon > > [[alternative HTML version deleted]] > > ______________________________________________ > R-help@r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. > ______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.