On Tue, Mar 13, 2012 at 08:56:33PM -0700, mdvaan wrote:
> Hi,
>
> I have data on individuals (B) who participated in events (A). If ALL
> participants in an event are a subset of the participants in another event I
> would like to remove the smaller event and if the participants in one event
> are exactly similar to the participants in another event I would like to
> remove one of the events (I don't care which one). The following example
> does that however it is extremely slow (and the true dataset is very large).
> What would be a more efficient way to solve the problem? I really appreciate
> your help. Thanks!
>
> DF <- data.frame(read.table(textConnection(" A B
> 12095 69832
> 12095 51750
> 12095 6734 ...

Hi.

If a lot of events are eliminated, then the following may be faster,
since eliminated events are removed before the further comparisons
take place.

  # one sorted, duplicate-free vector of participants per event
  data <- unique(DF$A)
  gr <- split(DF$B, f=factor(DF$A, levels=data))
  gr <- lapply(gr, FUN=sort)
  gr <- lapply(gr, FUN=unique)

  # accept[i] is TRUE while event i is still kept
  accept <- rep(FALSE, times=length(gr))
  accept[1] <- TRUE
  for (i in seq.int(from=2, length.out=length(accept)-1)) {
      cand <- gr[[i]]
      OK <- TRUE
      # reject the candidate if it is contained in an accepted event
      for (j in which(accept)) {
          prev <- gr[[j]]
          both <- unique(sort(c(cand, prev)))
          if (identical(prev, both)) {
              OK <- FALSE
              break
          }
      }
      if (OK) {
          # drop accepted events that are contained in the candidate
          for (j in which(accept)) {
              prev <- gr[[j]]
              both <- unique(sort(c(cand, prev)))
              if (identical(cand, both)) {
                  accept[j] <- FALSE
              }
          }
          accept[i] <- TRUE
      }
  }
  DF2 <- DF[DF$A %in% data[accept], ]

Can you afford to compute table(DF$A, DF$B) for the real data? Its
size will be proportional to length(unique(DF$A))*length(unique(DF$B)).

Petr Savicky.
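
As an illustration of the table(DF$A, DF$B) question above, here is a minimal sketch of how subset relations could be read off an incidence matrix. It is not part of the original reply. The function name maximal_events and the toy data are hypothetical. It assumes that both the events-by-individuals table and an events-by-events matrix fit in memory, which may not hold for the real data.

```r
# Hypothetical helper: keep only events whose participant set is not
# contained in another event's set; among identical sets keep one.
maximal_events <- function(DF) {
    tab <- table(DF$A, DF$B)
    # 0/1 incidence matrix: rows are events, columns are individuals
    m <- 1 * (unclass(tab) > 0)
    # shared[i, j] = number of individuals present in both events i and j
    shared <- tcrossprod(m)
    size <- diag(shared)            # number of participants per event
    # is_subset[i, j] is TRUE when every participant of i is also in j
    is_subset <- shared == size
    # event j is strictly larger than event i
    bigger <- outer(size, size, "<")
    # same size (hence an identical set when is_subset holds) and j comes first
    tie_earlier <- outer(size, size, "==") & (col(shared) < row(shared))
    drop <- rowSums(is_subset & (bigger | tie_earlier)) > 0
    DF[as.character(DF$A) %in% rownames(m)[!drop], ]
}

# Toy data (made up): event 2 is a subset of event 1, event 3 equals event 1
toy <- data.frame(A = c(1, 1, 1, 2, 2, 3, 3, 3, 4),
                  B = c(10, 20, 30, 10, 20, 10, 20, 30, 40))
toy2 <- maximal_events(toy)
```

On the toy data this should leave only events 1 and 4. The loop-based code above avoids building the full table, so it may remain the better choice when the number of distinct individuals is large.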