Correction to my code: I created a "doc" variable because I was thinking of doing something faster, but I never made that change. For the counting to work, grep needs to run on the original source "dat".
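To illustrate the difference, here is a minimal sketch. The records and the names recs, n_doc, n_recs, n_loose and n_strict are made up for illustration and are not the data from this thread. grep() returns the indices of the elements that match, so on one pasted-together string it can return at most one index, while on the per-record strings its length is the number of matching records.

```r
# Hypothetical toy records for illustration (not the data from this thread)
recs <- c("77 65 23 77 55", "65 23 77 65", "55 78 56")

# Everything pasted into one string, as the unused "doc" variable would have been
doc <- paste(recs, collapse = " ## ")

# grep() returns the indices of the elements that match the pattern
n_doc  <- length(grep("65 23 77", doc))   # one element in, so at most one index out
n_recs <- length(grep("65 23 77", recs))  # one index per matching record

# Caveat: the pattern is matched as a substring, so "165 23 771" also matches;
# \\b (word boundary) is one possible guard against that
n_loose  <- length(grep("65 23 77", "165 23 771"))
n_strict <- length(grep("\\b65 23 77\\b", "165 23 771"))
```

Note that this kind of matching only finds the three codes when they are adjacent and in the given order. Whether that is what is wanted depends on the data.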
Fixed:

combs = structure(list(V1 = c(65L, 77L, 55L, 23L, 34L),
                       V2 = c(23L, 34L, 34L, 77L, 65L),
                       V3 = c(77L, 65L, 23L, 34L, 55L)),
                  .Names = c("V1", "V2", "V3"),
                  class = "data.frame", row.names = c(NA, -5L))

dat = list(
  c(77,65,34,23,55, 65,23,77, 44),
  c(65,23,77,65,55,34, 77, 34,65, 10),
  c(77,34,65),
  c(55,78,56),
  c(98,23,77,65,34, 65, 23, 77, 34))

words = unlist(apply(combs, 1, function(d) paste(as.character(d), collapse = " ")))
dat = lapply(dat, function(d) paste(as.character(d), collapse = " "))
#doc = paste(dat, collapse = " ## ") # just some arbitrary separator character that isn't in your words
counts = sapply(words, function(w) length(grep(w, dat)))
names(counts) = words
counts

cbind(combs, data.frame(N = counts))

On Wed, Jul 27, 2016 at 11:27 AM, sri vathsan <srivib...@gmail.com> wrote:

> Hi,
>
> It is not just 79 triplets. As I said, there are 79 codes. I am making
> triplets out of that 79 codes and matching the triplets in the list.
>
> Please find the dput of the data below.
>
> > dput(head(newd,10))
> structure(list(uniq_id = c("1", "2", "3", "4", "5", "6", "7",
> "8", "9", "10"), hi = c("11, 22, 84, 85, 108, 111",
> "18, 84, 85, 87, 122, 134",
> "2, 18, 22", "18, 108, 122, 134, 176", "19, 85, 87, 100, 107",
> "79, 85, 111", "11, 88, 108", "19, 88, 96", "19, 85, 96",
> "19, 100, 103")), .Names = c("uniq_id", "hi"), row.names = c(NA,
> -10L), class = c("tbl_df", "tbl", "data.frame"))
> >
>
> I am trying to count the frequency of the triplets in the above data using
> the below code.
>
> # split column into a list
> myList <- strsplit(newd$hi, split=",")
> # get all pairwise combinations
> myCombos <- t(combn(unique(unlist(myList)), 3))
> # count the instances where the pair is present
> myCounts <- sapply(1:nrow(myCombos), FUN=function(i) {
>   sum(sapply(myList, function(j) {
>     sum(!is.na(match(c(myCombos[i,]), j)))})==3)})
> #final matrix
> final <- cbind(matrix(as.integer(myCombos), nrow(myCombos)), myCounts)
>
> I hope I made my point clear.
> Please let me know if I miss anything.
>
> Regards,
> Sri
>
> On Wed, Jul 27, 2016 at 11:19 PM, Sarah Goslee <sarah.gos...@gmail.com>
> wrote:
>
> > You said you had 79 triplets and 8000 records.
> >
> > When I compared 100 triplets to 10000 records it took 86 seconds.
> >
> > So obviously there is something you're not telling us about the format
> > of your data.
> >
> > If you use dput() to provide actual examples, you will get better
> > results than if we on Rhelp have to guess. Because we tend to guess in
> > ways that make the most sense after extensive R experience, and that's
> > probably not what you have.
> >
> > Sarah
> >
> > On Wed, Jul 27, 2016 at 1:29 PM, sri vathsan <srivib...@gmail.com>
> > wrote:
> > > Hi,
> > >
> > > Thanks for the solution. But I am afraid that after running this code
> > > still it takes more time. It has been an hour and still it is executing. I
> > > understand the delay because each triplet has to compare almost 9000
> > > elements.
> > >
> > > Regards,
> > > Sri
> > >
> > > On Wed, Jul 27, 2016 at 9:02 PM, Sarah Goslee <sarah.gos...@gmail.com>
> > > wrote:
> > >>
> > >> Hi,
> > >>
> > >> It's really a good idea to use dput() or some other reproducible way
> > >> to provide data. I had to guess as to what your data looked like.
> > >>
> > >> It appears that order doesn't matter?
> > >>
> > >> Given that, here's one approach:
> > >>
> > >> combs <- structure(list(V1 = c(65L, 77L, 55L, 23L, 34L), V2 = c(23L, 34L,
> > >> 34L, 77L, 65L), V3 = c(77L, 65L, 23L, 34L, 55L)), .Names = c("V1",
> > >> "V2", "V3"), class = "data.frame", row.names = c(NA, -5L))
> > >>
> > >> dat <- list(
> > >> c(77,65,34,23,55),
> > >> c(65,23,77,65,55,34),
> > >> c(77,34,65),
> > >> c(55,78,56),
> > >> c(98,23,77,65,34))
> > >>
> > >> sapply(seq_len(nrow(combs)), function(i)sum(sapply(dat,
> > >> function(j)all(combs[i,] %in% j))))
> > >>
> > >> On a dataset of comparable size to yours, it takes me under a minute and a
> > >> half.
> > >>
> > >> > combs <- combs[rep(1:nrow(combs), length=100), ]
> > >> > dat <- dat[rep(1:length(dat), length=10000)]
> > >> >
> > >> > dim(combs)
> > >> [1] 100 3
> > >> > length(dat)
> > >> [1] 10000
> > >> >
> > >> > system.time(test <- sapply(seq_len(nrow(combs)),
> > >> > function(i)sum(sapply(dat, function(j)all(combs[i,] %in% j)))))
> > >> user system elapsed
> > >> 86.380 0.006 86.391
> > >>
> > >> On Wed, Jul 27, 2016 at 10:47 AM, sri vathsan <srivib...@gmail.com> wrote:
> > >> > Hi,
> > >> >
> > >> > Apologies for the less information.
> > >> >
> > >> > Basically, myCombos is a matrix with 3 variables which is a triplet that is
> > >> > a combination of 79 codes. There are around 3lakh combination as such and
> > >> > it looks like below.
> > >> >
> > >> > V1 V2 V3
> > >> > 65 23 77
> > >> > 77 34 65
> > >> > 55 34 23
> > >> > 23 77 34
> > >> > 34 65 55
> > >> >
> > >> > Each triplet will compare in a list (mylist) having 8177 elements which
> > >> > will look like below.
> > >> >
> > >> > 77,65,34,23,55
> > >> > 65,23,77,65,55,34
> > >> > 77,34,65
> > >> > 55,78,56
> > >> > 98,23,77,65,34
> > >> >
> > >> > Now I want to count the no of occurrence of the triplet in the above list.
> > >> > I.e., the triplet 65 23 77 is seen 3 times in the list.
> > >> > So my output looks like below
> > >> >
> > >> > V1 V2 V3 Freq
> > >> > 65 23 77 3
> > >> > 77 34 65 4
> > >> > 55 34 23 2
> > >> >
> > >> > I hope, I made it clear this time.
> > >> >
> > >> > On Wed, Jul 27, 2016 at 7:00 PM, Bert Gunter <bgunter.4...@gmail.com>
> > >> > wrote:
> > >> >
> > >> >> Not entirely sure I understand, but match() is already vectorized, so you
> > >> >> should be able to lose the sapply(). This would speed things up a lot.
> > >> >> Please re-read ?match *carefully* .
> > >> >>
> > >> >> Bert
> > >> >>
> > >> >> On Jul 27, 2016 6:15 AM, "sri vathsan" <srivib...@gmail.com> wrote:
> > >> >>
> > >> >> Hi,
> > >> >>
> > >> >> I created list of 3 combination numbers (mycombos, around 3 lakh
> > >> >> combinations) and counting the occurrence of those combination in another
> > >> >> list. This comparison list (mylist) is having around 8000 records. I am
> > >> >> using the following code.
> > >> >>
> > >> >> myCounts <- sapply(1:nrow(myCombos), FUN=function(i) {
> > >> >>   sum(sapply(myList, function(j) {
> > >> >>     sum(!is.na(match(c(myCombos[i,]), j)))})==3)})
> > >> >>
> > >> >> The above code takes very long time to execute and is there any other
> > >> >> efficient method which will reduce the time.
> > >> >> --
> > >> >>
> > >> >> Regards,
> > >> >> Srivathsan.K
> > >> >>
>
> --
> Regards,
> Srivathsan.K
> Phone : 9600165206
>
> [[alternative HTML version deleted]]
>
> ______________________________________________
> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.