On Tue, Mar 13, 2012 at 08:56:33PM -0700, mdvaan wrote:
> Hi,
>
> I have data on individuals (B) who participated in events (A). If ALL
> participants in an event are a subset of the participants in another event I
> would like to remove the smaller event and if the participants in one event
> are exactly similar to the participants in another event I would like to
> remove one of the events (I don't care which one). The following example
> does that however it is extremely slow (and the true dataset is very large).
> What would be a more efficient way to solve the problem? I really appreciate
> your help. Thanks!
>
> DF <- data.frame(read.table(textConnection(" A B
> 12095 69832
> 12095 51750
...

Hi.

Try the following.

  # one sorted, de-duplicated vector of participants per event,
  # in the order the events first appear in DF
  data <- unique(DF$A)
  gr <- split(DF$B, f=factor(DF$A, levels=data))
  gr <- lapply(gr, FUN=sort)
  gr <- lapply(gr, FUN=unique)

  elim <- rep(FALSE, times=length(gr))
  for (i in seq_along(gr)) {
      gr.i <- gr[[i]]
      for (j in seq_along(gr)) {
          gr.j <- gr[[j]]
          if (j < i && identical(gr.i, gr.j)) {
              # exact duplicate of an earlier event: keep the earlier one
              elim[i] <- TRUE
          } else if (i != j) {
              # gr.i is a proper subset of gr.j if their union equals gr.j
              # but not gr.i
              both <- unique(sort(c(gr.i, gr.j)))
              if (identical(gr.j, both) && !identical(gr.i, both)) {
                  elim[i] <- TRUE
              }
          }
      }
  }

  DF1 <- DF[DF$A %in% data[!elim], ]

How often is an event eliminated in the real data?

Petr Savicky.
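
The elimination rule above is easy to check by hand on a small, made-up data frame. The following is only a sketch: the data values and the helper name drop_nested_events are hypothetical and not part of the original post. It applies the same rule (drop proper subsets, keep the first of several identical events) but tests containment with %in% and skips pairs where the second event is smaller, since a smaller set cannot contain a larger one. It is still quadratic in the number of events.

```r
# Toy data (made up for illustration):
#   event 1 = {10, 20, 30}
#   event 2 = {10, 20}       proper subset of event 1
#   event 3 = {10, 20, 30}   duplicate of event 1
#   event 4 = {40, 50}       unrelated
DF <- data.frame(
  A = c(1, 1, 1, 2, 2, 3, 3, 3, 4, 4),
  B = c(10, 20, 30, 10, 20, 30, 20, 10, 40, 50)
)

# Hypothetical helper: returns the rows of df whose event survives.
drop_nested_events <- function(df) {
  events <- unique(df$A)
  gr <- split(df$B, f = factor(df$A, levels = events))
  gr <- lapply(gr, unique)
  n <- lengths(gr)                      # number of participants per event
  elim <- rep(FALSE, length(gr))
  for (i in seq_along(gr)) {
    for (j in seq_along(gr)) {
      # a smaller event cannot contain event i
      if (i == j || n[j] < n[i]) next
      if (all(gr[[i]] %in% gr[[j]])) {
        # eliminate i if it is a proper subset of j,
        # or an exact duplicate of an earlier event j
        if (n[j] > n[i] || j < i) {
          elim[i] <- TRUE
          break
        }
      }
    }
  }
  df[df$A %in% events[!elim], ]
}

DF1 <- drop_nested_events(DF)
print(unique(DF1$A))
```

Here events 2 and 3 should be dropped, leaving events 1 and 4. If eliminations are rare in the real data, most of the time goes into comparisons that find nothing, so skipping impossible pairs by size is where a sketch like this would save work.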