Note that the relative speeds of these methods, which all use essentially the same run-length-encoding algorithm, depend on the nature of the dataset. I made a million-row data.frame with 10,000 unique users, 26 unique countries, and 6 unique languages, giving about 3/4 million unique rows. On that data the times for methods 1, 2, and 3 were 0.7, 6.2, and 10.5 seconds, respectively. With a million-row data.frame with 100, 10, and 4 unique users, countries, and languages, giving 4000 unique rows, the times were 0.3, 1.4, and 0.7 seconds.
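For anyone who wants to try a comparison like this on their own data, here is a rough sketch of one way to build such a test data.frame and time a run-length count on it. It is not the exact code behind the numbers above: the names make.test.data and count.runs are made up for illustration, and sampling each column uniformly and then sorting is only an assumption about how such a dataset could be constructed. The size is kept at 1e5 rows so it runs quickly; raise n to 1e6 to approach the scale discussed.

```r
# Hypothetical helper (not from the thread): build a sorted test data.frame.
# Sampling each column uniformly at random is an assumption for illustration.
make.test.data <- function(n, n.users, n.countries, n.languages) {
  x <- data.frame(
    user     = sample(n.users, n, replace = TRUE),
    country  = sample(n.countries, n, replace = TRUE),
    language = sample(n.languages, n, replace = TRUE))
  # Sort so identical rows become adjacent, which is what "uniq -c" relies on.
  x <- x[do.call(order, unname(as.list(x))), , drop = FALSE]
  rownames(x) <- NULL
  x
}

# Hypothetical helper: count runs of adjacent identical rows by comparing
# each column with itself shifted by one row.
count.runs <- function(x) {
  n <- nrow(x)
  stopifnot(n > 0, ncol(x) > 0)
  first <- rep(FALSE, n)
  first[1] <- TRUE                      # the first row always starts a run
  for (column in x) {
    first[-1] <- first[-1] | (column[-1] != column[-n])
  }
  i <- which(first)                     # row index where each run starts
  data.frame(x[i, , drop = FALSE], count = diff(c(i, n + 1L)))
}

set.seed(1)
d <- make.test.data(1e5, 100, 10, 4)    # few distinct rows, long runs
print(system.time(r <- count.runs(d)))
```

Changing the number of distinct users, countries, and languages changes how long the runs are, which is the property of the data that the timings above are sensitive to.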
Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com

> -----Original Message-----
> From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Sam Steingold
> Sent: Wednesday, October 17, 2012 12:58 PM
> To: r-help@r-project.org
> Subject: Re: [R] uniq -c
>
> > * Sam Steingold <f...@tah.bet> [2012-10-16 11:03:27 -0400]:
> >
> > I need an analogue of "uniq -c" for a data frame.
>
> Summary of options:
>
> 1. William:
>
> isFirstInRun <- function(x) UseMethod("isFirstInRun")
> isFirstInRun.default <- function(x) c(TRUE, x[-1] != x[-length(x)])
> isFirstInRun.data.frame <- function(x) {
>     stopifnot(ncol(x)>0)
>     retval <- isFirstInRun(x[[1]])
>     for(column in x) {
>         retval <- retval | isFirstInRun(column)
>     }
>     retval
> }
> row.count.1 <- function (x) {
>     i <- which(isFirstInRun(x))
>     data.frame(x[i,], count=diff(c(i, 1L+nrow(x))))
> }
>
> 147 seconds
>
> 2. http://orgmode.org/worg/org-contrib/babel/examples/Rpackage.html#sec-6-1
>
> row.count.2 <- function (x) {
>     equal.to.previous <- rowSums( x[2:nrow(x),] != x[1:(nrow(x)-1),] )==0
>     tf.runs <- rle(equal.to.previous)
>     counts <- c(1, unlist(mapply(function(x,y) if (y) x+1 else (rep(1,x)),
>                                  tf.runs$length, tf.runs$value)))
>     counts <- counts[ c( diff( counts ) <= 0, TRUE ) ]
>     unique.rows <- which( c(TRUE, !equal.to.previous ) )
>     cbind(x[ unique.rows, ,drop=FALSE ], counts)
> }
>
> 136 seconds
>
> 3. Micael: paste/strsplit
>
> row.count.3 <- function (x) {
>     pa <- do.call(paste,x)
>     rl <- rle(pa)
>     sp <- strsplit(as.character(rl$values)," ")
>     data.frame(user = sapply(sp,"[",1),
>                country = sapply(sp,"[",2),
>                language = sapply(sp,"[",3),
>                count = rl$length)
> }
>
> Here I know the columns and rely on the absence of spaces in the values.
>
> 27 seconds.
>
> Thanks to all who answered.
>
> --
> Sam Steingold (http://sds.podval.org/) on Ubuntu 12.04 (precise) X 11.0.11103000
> http://www.childpsy.net/ http://www.PetitionOnline.com/tap12009/
> http://thereligionofpeace.com http://ffii.org http://camera.org
> A slave dreams not of Freedom, but of owning his own slaves.
>
> ______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.