Re: [R] Programming R to avoid loops

Jim Lemon Sat, 18 Apr 2015 00:28:03 -0700

Hi Brant,
I'm a bit confused about which data frame is the one to match to, but
the following, while still including loops, should run much faster
than your version, as it only compares dates within matching ids.


df1<-read.table(text="id date test1.result
  a 2009-08-28      1
  a 2009-09-16      1
  b 2008-08-06      0
  c 2012-02-02      1
  c 2010-08-03      1
  c 2012-08-02      0",header=TRUE)
df2<-read.table(text="id date test2.result
  a 2011-02-03      1
  b 2011-09-27      0
  b 2011-09-01      1
  c 2009-07-16      0
  c 2009-04-15      0
  c 2010-08-10      1",header=TRUE)

bi.match<-function(x1,x2,maxdaydiff=30) {
 # convert the character strings to dates (may not be necessary)
 x1$dates<-as.Date(x1$date,"%Y-%m-%d")
 x2$dates<-as.Date(x2$date,"%Y-%m-%d")
 # initialize the l and m variables
 x1$l<-x1$m<-0
 # get all the id codes
 ids<-unique(x2$id)
 # step through the id codes
 for(id1 in ids) {
  x1ind<-which(x1$id == id1)
  x2ind<-which(x2$id == id1)
  for(id2 in seq_along(x1ind)) {
   # get the indices of the x2 dates that are within maxdaydiff days of this x1 date
   diffok<-which(abs(x1$dates[x1ind[id2]]-x2$dates[x2ind])<=maxdaydiff)
   # set the date diff match indicator to 1
   x1$l[x1ind[id2]]<-length(diffok) > 0
   # set the positive test indicator to 1
   x1$m[x1ind[id2]]<-any(x2$test2.result[x2ind[diffok]] > 0)
  }
 }
 return(x1)
}

bi.match(df1,df2)
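
A loop-free variant is also possible. What follows is only a sketch, assuming the same id, date and test2.result columns as in the example data above. The function name bi.match.vec and the helper column row.id are made up for illustration. It uses base R merge() to form every within-id pairing of rows and then keeps the pairs whose dates are close enough.

```r
# Sketch only: bi.match.vec and row.id are hypothetical names
bi.match.vec <- function(x1, x2, maxdaydiff = 30) {
  # remember each x1 row so results can be mapped back after the merge
  x1$row.id <- seq_len(nrow(x1))
  x1$dates <- as.Date(x1$date, "%Y-%m-%d")
  x2$dates2 <- as.Date(x2$date, "%Y-%m-%d")
  # all pairings of x1 rows with x2 rows that share an id
  pairs <- merge(x1[, c("row.id", "id", "dates")],
                 x2[, c("id", "dates2", "test2.result")], by = "id")
  # keep the pairs whose dates are at most maxdaydiff days apart
  ok <- abs(as.numeric(pairs$dates - pairs$dates2)) <= maxdaydiff
  pairs <- pairs[ok, , drop = FALSE]
  # l: at least one x2 date in range; m: at least one positive test2 in range
  x1$l <- as.numeric(x1$row.id %in% pairs$row.id)
  x1$m <- as.numeric(x1$row.id %in% pairs$row.id[pairs$test2.result > 0])
  x1$row.id <- NULL
  x1
}
```

Calling bi.match.vec(df1,df2) should flag the same rows as bi.match(df1,df2). Note that merge() builds every within-id pair at once, so memory use can grow quickly when an id has many rows in both data frames.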

Jim


On Sat, Apr 18, 2015 at 2:14 PM, Brant Inman <brant.in...@me.com> wrote:
> I have two large data frames with the following structure:
>
>> df1
>   id       date test1.result
> 1  a 2009-08-28      1
> 2  a 2009-09-16      1
> 3  b 2008-08-06      0
> 4  c 2012-02-02      1
> 5  c 2010-08-03      1
> 6  c 2012-08-02      0
>
>> df2
>   id       date test2.result
> 1  a 2011-02-03      1
> 2  b 2011-09-27      0
> 3  b 2011-09-01      1
> 4  c 2009-07-16      0
> 5  c 2009-04-15      0
> 6  c 2010-08-10      1
>
> I need to match items in df2 to those in df1 with specific matching criteria. 
> I have written a looped matching algorithm that works, but it is very slow 
> with my large datasets. I am requesting help on making a version of this code 
> that is faster and "vectorized", so to speak.
>
> My algorithm is currently something like this code. It works but is damn slow.
>
> findTestPairs <- function(test1, id1, date1, test2, id2, date2, predays=-30,
>                           lagdays=30){
>   # Function to find, within subjects, two tests that occur within a timeframe
>   #
>   # test1 = the reference test result for which matching second tests are sought
>   # test2 = the second test result
>   # date1 = the date of test1
>   # date2 = the date of test2
>   # id1   = unique identifier for subject undergoing test 1
>   # id2   = unique identifier for subject undergoing test 2
>   # predays  = maximum number of days prior to test1 date that test2 date might occur
>   # lagdays  = maximum number of days after test1 date that test2 date might occur
>
>   result <- data.frame(matrix(ncol=5, nrow=length(test1)))
>     colnames(result) <- c('id','test1','date','test2count','test2lag.result')
>     result$id    <- id1
>     result$test1 <- test1
>     result$date  <- date1
>
>   for(i in 1:length(test1)){
>     l <- 0    # Counter of test2 results that match test1 within lag interval
>     m <- NA   # Indicator of positive test2 within lag interval
>
>     for(j in 1:length(test2)){
>       if(id1[i] == id2[j]){               # STEP1: Match IDs
>         interval <- date2[j] - date1[i]
>         intmatch <- ifelse(interval >= predays && interval <= lagdays, 1, 0)
>
>         if(intmatch == 1){                # STEP2: Does test2 fall within lag interval?
>           l <- l+1                        # If test2 within lag interval, count it
>
>           if(test2[j] == 1) {             # STEP3: Is test 2 positive?
>             m <- 1                        # If test2 is positive, set indicator to 1
>           } else {
>             m <- 0
>           }
>         }
>       }
>     }
>     result$test2count[i] <- l
>     result$test2lag.result[i] <- m
>   }
>   return(result)
> }
>
> I would appreciate help on building a faster matching algorithm. I am pretty 
> certain that R functions can be used to do this but I do not have a good 
> grasp of how to make it work.
>
> Brant Inman
>
> ______________________________________________
> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

