On Sat, 18 Apr 2015, Brant Inman wrote:

I have two large data frames with the following structure:

df1
  id       date test1.result
1  a 2009-08-28            1
2  a 2009-09-16            1
3  b 2008-08-06            0
4  c 2012-02-02            1
5  c 2010-08-03            1
6  c 2012-08-02            0

df2
  id       date test2.result
1  a 2011-02-03            1
2  b 2011-09-27            0
3  b 2011-09-01            1
4  c 2009-07-16            0
5  c 2009-04-15            0
6  c 2010-08-10            1


I need to match items in df2 to those in df1 with specific matching criteria. I have written a looped matching algorithm that works, but it is very slow with my large datasets. I am requesting help on making a version of this code that is faster and "vectorized," so to speak.

As I see from your posted code, you match ids exactly, match dates within a range, and count the number of positive test results in the second data.frame.

For this, the countOverlaps() function of the GenomicRanges package will do the trick with suitably defined GRanges objects. Something like:

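library(GenomicRanges)

# Convert dates to integer day counts so they can serve as range coordinates
date1 <- as.integer(as.Date(df1$date, "%Y-%m-%d"))
date2 <- as.integer(as.Date(df2$date, "%Y-%m-%d"))

# Matching window around each df2 date, in days
lagdays <- 30L
predays <- -30L

# One single-day range per df1 row; the id plays the role of the sequence name
gr1 <- GRanges(seqnames=df1$id, IRanges(start=date1, width=1), strand="*")

# One window per df2 row, keeping only the positive test2 results
gr2 <- GRanges(seqnames=df2$id,
               IRanges(start=date2+predays, end=date2+lagdays),
               strand="*")[df2$test2.result == 1]

# For each df1 row, count the positive df2 windows with the same id
# that contain its date
df1$test2.count <- countOverlaps(gr1, gr2)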


For the example data.frames (as rendered by Jim Lemon's code), this yields

df1
  id       date test1.result test2.count
1  a 2009-08-28            1           0
2  a 2009-09-16            1           0
3  b 2008-08-06            0           0
4  c 2012-02-02            1           0
5  c 2010-08-03            1           1
6  c 2012-08-02            0           0
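
If you want to check these counts without installing anything, a plain base-R loop can serve as a cross-check. This is only a sketch: it rebuilds the two example data frames from the question, assumes the same symmetric 30-day window as above, and uses the names window_days, pos and base_count purely for illustration. It compares every df1 row against every positive df2 row, so it is not meant as a fast replacement for countOverlaps() on large data.

```r
# Example data copied from the question.
df1 <- data.frame(
  id = c("a", "a", "b", "c", "c", "c"),
  date = c("2009-08-28", "2009-09-16", "2008-08-06",
           "2012-02-02", "2010-08-03", "2012-08-02"),
  test1.result = c(1, 1, 0, 1, 1, 0),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  id = c("a", "b", "b", "c", "c", "c"),
  date = c("2011-02-03", "2011-09-27", "2011-09-01",
           "2009-07-16", "2009-04-15", "2010-08-10"),
  test2.result = c(1, 0, 1, 0, 0, 1),
  stringsAsFactors = FALSE
)

# Assumed window: 30 days either side, matching predays/lagdays above.
window_days <- 30L

# Dates as integer day counts; keep only the positive test2 rows.
d1  <- as.integer(as.Date(df1$date))
pos <- df2[df2$test2.result == 1, ]
d2  <- as.integer(as.Date(pos$date))

# For each df1 row, count positive df2 rows with the same id
# whose date lies within window_days of the df1 date.
base_count <- vapply(seq_len(nrow(df1)), function(i) {
  sum(pos$id == df1$id[i] & abs(d2 - d1[i]) <= window_days)
}, integer(1))

print(cbind(df1, test2.count = base_count))
```

On the example data this should agree with the test2.count column shown above.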

The GenomicRanges package is at

http://www.bioconductor.org/packages/release/bioc/html/GenomicRanges.html

where you will find installation instructions and links to vignettes.

HTH,

Chuck