On Sat, 18 Apr 2015, Brant Inman wrote:
I have two large data frames with the following structure:
df1
  id       date test1.result
1  a 2009-08-28            1
2  a 2009-09-16            1
3  b 2008-08-06            0
4  c 2012-02-02            1
5  c 2010-08-03            1
6  c 2012-08-02            0
df2
  id       date test2.result
1  a 2011-02-03            1
2  b 2011-09-27            0
3  b 2011-09-01            1
4  c 2009-07-16            0
5  c 2009-04-15            0
6  c 2010-08-10            1
I need to match items in df2 to those in df1 with specific matching
criteria. I have written a looped matching algorithm that works, but it
is very slow with my large datasets. I am requesting help on making a
version of this code that is faster and "vectorized", so to speak.
From your posted code, I see that you match ids exactly, match dates
within a range, and count the number of positive test results in the
second data.frame.
For this, the countOverlaps() function of the GenomicRanges package will
do the trick with suitably defined GRanges objects. Something like:
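library(GenomicRanges)

# Represent each date as an integer day count so it can act as a range coordinate
date1 <- as.integer(as.Date(df1$date, "%Y-%m-%d"))
date2 <- as.integer(as.Date(df2$date, "%Y-%m-%d"))

lagdays <- 30L    # window extends 30 days after each df2 date
predays <- -30L   # and 30 days before it

# One width-1 range per df1 row; the id plays the role of the sequence name
gr1 <- GRanges(seqnames = df1$id, IRanges(start = date1, width = 1), strand = "*")
# One date window per df2 row, keeping only the positive test2 results
gr2 <- GRanges(seqnames = df2$id,
               IRanges(start = date2 + predays, end = date2 + lagdays),
               strand = "*")[df2$test2.result == 1, ]

# For each df1 row, count the df2 windows (same id) that contain its date
df1$test2.count <- countOverlaps(gr1, gr2)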
For the example data.frames (as rendered by Jim Lemon's code), this yields
df1
  id       date test1.result test2.count
1  a 2009-08-28            1           0
2  a 2009-09-16            1           0
3  b 2008-08-06            0           0
4  c 2012-02-02            1           0
5  c 2010-08-03            1           1
6  c 2012-08-02            0           0
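
If you want to convince yourself of what is being counted, here is a small base R cross-check on the example data. It is only a sketch: it assumes the date columns are character strings in YYYY-MM-DD form, the names window, pos and check are mine, and it loops over the rows of df1, so it is meant for checking small cases rather than as a replacement for countOverlaps() on large data.

```r
# Example data as shown in the thread
df1 <- data.frame(
  id = c("a", "a", "b", "c", "c", "c"),
  date = c("2009-08-28", "2009-09-16", "2008-08-06",
           "2012-02-02", "2010-08-03", "2012-08-02"),
  test1.result = c(1, 1, 0, 1, 1, 0),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  id = c("a", "b", "b", "c", "c", "c"),
  date = c("2011-02-03", "2011-09-27", "2011-09-01",
           "2009-07-16", "2009-04-15", "2010-08-10"),
  test2.result = c(1, 0, 1, 0, 0, 1),
  stringsAsFactors = FALSE
)

window <- 30L  # days either side, mirroring lagdays/predays (hypothetical name)

d1  <- as.Date(df1$date)
pos <- df2[df2$test2.result == 1, ]  # only positive test2 results are counted
d2  <- as.Date(pos$date)

# For each df1 row, count the positive df2 rows with the same id whose
# date lies within +/- window days (both ends inclusive)
check <- vapply(seq_len(nrow(df1)), function(i) {
  sum(pos$id == df1$id[i] & abs(as.integer(d2 - d1[i])) <= window)
}, integer(1))

print(check)
```

On the example data this gives a single match, for row 5 of df1 (id c, 2010-08-03), which falls 7 days before the positive test2 result dated 2010-08-10.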
The GenomicRanges package is at
http://www.bioconductor.org/packages/release/bioc/html/GenomicRanges.html
where you will find installation instructions and links to vignettes.
HTH,
Chuck
______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help