Re: [R] long format - find age when another variable is first 'high'

William Dunlap Tue, 26 May 2009 09:44:54 -0700

> From: [email protected] 
> [mailto:[email protected]] On Behalf Of Marc Schwartz
> Sent: Monday, May 25, 2009 6:52 AM
> To: David Freedman
> Cc: [email protected]
> Subject: Re: [R] long format - find age when another variable 
> is first 'high'
> 
> 
> On May 25, 2009, at 7:45 AM, David Freedman wrote:
> 
> >
> > Dear R,
> >
> > I've got a data frame with children examined multiple times and at
> > various ages.  I'm trying to find the first age at which another
> > variable (LDL-Cholesterol) is >= 130 mg/dL; for some children, this
> > may never happen.  I can do this with transformBy and ddply, but with
> > 10,000 different children, these functions take some time on my PCs -
> > is there a faster way to do this in R?  My code on a small dataset
> > follows.
> >
> > Thanks very much, David Freedman
> >
> > d<-data.frame(id=c(rep(1,3),rep(2,2),3),
> >               age=c(5,10,15,4,7,12),ldlc=c(132,120,125,105,142,160))
> > d$high.ldlc<-ifelse(d$ldlc>=130,1,0)
> > d
> > library(plyr)
> > d2<-ddply(d,~id,transform,plyr.minage=min(age[high.ldlc==1]));
> > library(doBy)
> > d2<-transformBy(~id,da=d2,doby.minage=min(age[high.ldlc==1]));
> > d2
> 
> The first thing that I would do is to get rid of records that are not
> relevant to your question:
> 
>  > d
> id age ldlc high.ldlc
> 1  1   5  132         1
> 2  1  10  120         0
> 3  1  15  125         0
> 4  2   4  105         0
> 5  2   7  142         1
> 6  3  12  160         1
> 
> 
> # Get records with high ldl
> d.new <- subset(d, ldlc >= 130)
> 
> 
>  > d.new
> id age ldlc high.ldlc
> 1  1   5  132         1
> 5  2   7  142         1
> 6  3  12  160         1
> 
> 
> That will help to reduce the total size of the dataset, perhaps
> substantially. It will also remove entire subjects that are not
> relevant (e.g., those who never have LDL >= 130).
> 
> Then get the minimum age for each of the remaining subjects:
> 
>  > aggregate(d.new$age, list(id = d.new$id), min)
> id  x
> 1  1  5
> 2  2  7
> 3  3 12
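
The filter-then-aggregate steps above drop children who never reach the threshold. A minimal sketch of attaching the result back to the full data, so that those children get NA, might look like the following. The child with id 4 and the column name 'minage' are hypothetical additions for illustration; they are not in the original post.

```r
# Example data from the original post, plus a HYPOTHETICAL child (id 4)
# who never reaches LDL >= 130, added only to show the NA case.
d <- data.frame(id   = c(1, 1, 1, 2, 2, 3, 4),
                age  = c(5, 10, 15, 4, 7, 12, 6),
                ldlc = c(132, 120, 125, 105, 142, 160, 110))

# Keep only the high-LDL records, then take the minimum age per child.
d.new <- subset(d, ldlc >= 130)
first.high <- aggregate(d.new$age, list(id = d.new$id), min)
names(first.high)[2] <- "minage"   # aggregate() names this column 'x'

# Left-join onto the full data; children with no high record get NA.
d2 <- merge(d, first.high, by = "id", all.x = TRUE)
d2
```

Using merge() with all.x=TRUE keeps every row of the original data, which is close to what the transform-style calls in the question produce.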


If the dataset has a lot of rows you can save more time
by replacing the call to aggregate(age,id,min) with code that sorts
the filtered data by 'id', breaking ties with 'age', and then
picks out the rows where the value of 'id' changes (that is, the
first row for each 'id'):
    f <- function(d) {
         dSorted <- d[ order(d$id,d$age),]
         n <- length(d$id) # or nrow(d)
         dSorted[   c(TRUE, dSorted$id[-1] != dSorted$id[-n]), ]
    }
    # or f(d[d$ldlc>=130,]) to avoid leaving around the temp variable
    f(d.new)
If you know your dataset is already sorted in this way, you
need only the last line of that function (applied to d itself
rather than to dSorted).
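
As a check on the sort-and-pick idea, here is a self-contained sketch that runs it on the example data from the original post and compares the result with aggregate(). The name firstPerId is just an illustrative label for the function f above, and the duplicated() variant is an alternative I am assuming is acceptable for already-sorted data, not something taken from the earlier messages.

```r
# Example data from the original post.
d <- data.frame(id   = c(rep(1, 3), rep(2, 2), 3),
                age  = c(5, 10, 15, 4, 7, 12),
                ldlc = c(132, 120, 125, 105, 142, 160))

# Sort by id, then age, and keep the first row of each run of equal ids.
firstPerId <- function(d) {
    dSorted <- d[order(d$id, d$age), ]
    n <- nrow(dSorted)
    dSorted[c(TRUE, dSorted$id[-1] != dSorted$id[-n]), ]
}

high   <- d[d$ldlc >= 130, ]
bySort <- firstPerId(high)
byAgg  <- aggregate(high$age, list(id = high$id), min)

# Both approaches should agree on the first 'high' age per child.
stopifnot(all(bySort$id == byAgg$id), all(bySort$age == byAgg$x))

# Variant for data already sorted by id and age: keep the first row
# per id by dropping rows whose id has been seen before.
alreadySorted <- high[order(high$id, high$age), ]
viaDuplicated <- alreadySorted[!duplicated(alreadySorted$id), ]
```

Wrapping each approach in system.time() on the full 10,000-child dataset would show whether the difference matters in practice.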

Bill Dunlap
TIBCO Software Inc - Spotfire Division
wdunlap tibco.com 

> 
> Try that to see what sort of time reduction you observe.
> 
> HTH,
> 
> Marc Schwartz
> 
> ______________________________________________
> [email protected] mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide 
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
> 

