Hi Brant,
I'm a bit confused about which data frame is the reference to match against, but the following, while it still uses loops, should run much faster than your version because it only compares dates within matching ids.

df1<-read.table(text="id date test1.result
a 2009-08-28 1
a 2009-09-16 1
b 2008-08-06 0
c 2012-02-02 1
c 2010-08-03 1
c 2012-08-02 0",header=TRUE)

df2<-read.table(text="id date test2.result
a 2011-02-03 1
b 2011-09-27 0
b 2011-09-01 1
c 2009-07-16 0
c 2009-04-15 0
c 2010-08-10 1",header=TRUE)

bi.match<-function(x1,x2,maxdaydiff=30) {
 # convert the character strings to dates (may not be necessary)
 x1$dates<-as.Date(x1$date,"%Y-%m-%d")
 x2$dates<-as.Date(x2$date,"%Y-%m-%d")
 # initialize the l and m variables
 x1$l<-x1$m<-0
 # get all the id codes
 ids<-unique(x2$id)
 # step through the id codes
 for(id1 in ids) {
  x1ind<-which(x1$id == id1)
  x2ind<-which(x2$id == id1)
  # seq_along() does nothing for ids that occur in x2 but not in x1
  for(id2 in seq_along(x1ind)) {
   # get the indices of the x2 dates that are within maxdaydiff days of this x1 date
   diffok<-which(abs(x1$dates[x1ind[id2]]-x2$dates[x2ind])<=maxdaydiff)
   # set the date diff match indicator to 1 if any x2 date is in the window
   x1$l[x1ind[id2]]<-length(diffok) > 0
   # set the positive test indicator to 1 if any matched test2 is positive
   x1$m[x1ind[id2]]<-any(x2$test2.result[x2ind[diffok]] > 0)
  }
 }
 return(x1)
}

bi.match(df1,df2)

Jim

On Sat, Apr 18, 2015 at 2:14 PM, Brant Inman <brant.in...@me.com> wrote:
> I have two large data frames with the following structure:
>
> > df1
>   id       date test1.result
> 1  a 2009-08-28            1
> 2  a 2009-09-16            1
> 3  b 2008-08-06            0
> 4  c 2012-02-02            1
> 5  c 2010-08-03            1
> 6  c 2012-08-02            0
>
> > df2
>   id       date test2.result
> 1  a 2011-02-03            1
> 2  b 2011-09-27            0
> 3  b 2011-09-01            1
> 4  c 2009-07-16            0
> 5  c 2009-04-15            0
> 6  c 2010-08-10            1
>
> I need to match items in df2 to those in df1 with specific matching criteria.
> I have written a looped matching algorithm that works, but it is very slow
> with my large datasets. I am requesting help on making a version of this code
> that is faster and "vectorized" so to speak.
>
> My algorithm is currently something like this code. It works but is damn slow.
>
> findTestPairs <- function(test1, id1, date1, test2, id2, date2, predays=-30, lagdays=30){
>   # Function to find, within subjects, two tests that occur within a timeframe
>   #
>   # test1 = the reference test result for which matching second tests are sought
>   # test2 = the second test result
>   # date1 = the date of test1
>   # date2 = the date of test2
>   # id1 = unique identifier for subject undergoing test 1
>   # id2 = unique identifier for subject undergoing test 2
>   # predays = maximum number of days prior to test1 date that test2 date might occur
>   # lagdays = maximum number of days after test1 date that test2 date might occur
>
>   result <- data.frame(matrix(ncol=5, nrow=length(test1)))
>   colnames(result) <- c('id','test1','date','test2count','test2lag.result')
>   result$id <- id1
>   result$test1 <- test1
>   result$date <- date1
>
>   for(i in 1:length(test1)){
>     l <- 0   # Counter of test2 results that matches test1 within lag interval
>     m <- NA  # Indicator of positive test2 within lag interval
>
>     for(j in 1:length(test2)){
>       if(id1[i] == id2[j]){      # STEP1: Match IDs
>         interval <- date2[j] - date1[i]
>         intmatch <- ifelse(interval >= predays && interval <= lagdays, 1, 0)
>
>         if(intmatch == 1){       # STEP2: Does test2 fall within lag interval?
>           l <- l+1               # If test2 within lag interval, count it
>
>           if(test2[j] == 1) {    # STEP3: Is test 2 positive?
>             m <- 1               # If test2 is positive, set indicator to 1
>           } else {
>             m <- 0
>           }
>         }
>       }
>     }
>     result$test2count[i] <- l
>     result$test2lag.result[i] <- m
>   }
>   return(result)
> }
>
> I would appreciate help on building a faster matching algorithm. I am pretty
> certain that R functions can be used to do this but I do not have a good
> grasp of how to make it work.
>
> Brant Inman
>
> [[alternative HTML version deleted]]
>
> ______________________________________________
> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
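
For a loop-free variant of the same idea, here is a minimal sketch in base R. It is not from the thread: the function name match_within_window and the output columns test2count and test2pos are made up for illustration. It assumes the asymmetric predays/lagdays window from the original question, and it reports whether any test2 in the window is positive (the original loop keeps the result of the last matching test2 instead). merge() builds every within-id pair of rows, so memory use grows with the number of rows per id.

```r
# Sample data from the thread
df1 <- read.table(text = "id date test1.result
a 2009-08-28 1
a 2009-09-16 1
b 2008-08-06 0
c 2012-02-02 1
c 2010-08-03 1
c 2012-08-02 0", header = TRUE, stringsAsFactors = FALSE)

df2 <- read.table(text = "id date test2.result
a 2011-02-03 1
b 2011-09-27 0
b 2011-09-01 1
c 2009-07-16 0
c 2009-04-15 0
c 2010-08-10 1", header = TRUE, stringsAsFactors = FALSE)

# match_within_window() is a hypothetical helper name, not from the thread
match_within_window <- function(x1, x2, predays = -30, lagdays = 30) {
  x1$row <- seq_len(nrow(x1))        # remember the original row of each test1
  x1$date <- as.Date(x1$date)
  x2$date <- as.Date(x2$date)
  # all within-id pairs of test1 and test2 rows; the shared "date" column
  # becomes date.1 (from x1) and date.2 (from x2)
  pairs <- merge(x1, x2, by = "id", suffixes = c(".1", ".2"))
  lag <- as.numeric(pairs$date.2 - pairs$date.1)   # days from test1 to test2
  ok <- lag >= predays & lag <= lagdays
  # number of test2 results inside the window, per test1 row
  count <- tabulate(pairs$row[ok], nbins = nrow(x1))
  # number of positive test2 results inside the window, per test1 row
  pos <- tabulate(pairs$row[ok & pairs$test2.result == 1], nbins = nrow(x1))
  x1$test2count <- count
  # NA when no test2 falls in the window, otherwise 1 if any is positive
  x1$test2pos <- ifelse(count > 0, as.integer(pos > 0), NA)
  x1$row <- NULL
  x1
}

res <- match_within_window(df1, df2)
print(res)
```

On the sample data only the test1 of 2010-08-03 for id c has a test2 within 30 days (2010-08-10), so that row is the only one with a nonzero count.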