> From: r-help-boun...@r-project.org
> [mailto:r-help-boun...@r-project.org] On Behalf Of Marc Schwartz
> Sent: Monday, May 25, 2009 6:52 AM
> To: David Freedman
> Cc: r-help@r-project.org
> Subject: Re: [R] long format - find age when another variable
> is first 'high'
>
> On May 25, 2009, at 7:45 AM, David Freedman wrote:
>
> > Dear R,
> >
> > I've got a data frame with children examined multiple times and at
> > various ages. I'm trying to find the first age at which another
> > variable (LDL-Cholesterol) is >= 130 mg/dL; for some children, this
> > may never happen.
> > I can do this with transformBy and ddply, but with 10,000 different
> > children, these functions take some time on my PCs - is there a
> > faster way to do this in R? My code on a small dataset follows.
> >
> > Thanks very much, David Freedman
> >
> > d<-data.frame(id=c(rep(1,3),rep(2,2),3),age=c(5,10,15,4,7,12),ldlc=c(132,120,125,105,142,160))
> > d$high.ldlc<-ifelse(d$ldlc>=130,1,0)
> > d
> > library(plyr)
> > d2<-ddply(d,~id,transform,plyr.minage=min(age[high.ldlc==1]));
> > library(doBy)
> > d2<-transformBy(~id,da=d2,doby.minage=min(age[high.ldlc==1]));
> > d2
>
> The first thing that I would do is to get rid of records that are not
> relevant to your question:
>
> > d
>   id age ldlc high.ldlc
> 1  1   5  132         1
> 2  1  10  120         0
> 3  1  15  125         0
> 4  2   4  105         0
> 5  2   7  142         1
> 6  3  12  160         1
>
>
> # Get records with high ldl
> d.new <- subset(d, ldlc >= 130)
>
>
> > d.new
>   id age ldlc high.ldlc
> 1  1   5  132         1
> 5  2   7  142         1
> 6  3  12  160         1
>
>
> That will help to reduce the total size of the dataset, perhaps
> substantially. It will also remove entire subjects that are not
> relevant (eg. never have LDL >= 130).
>
> Then get the minimum age for each of the remaining subjects:
>
> > aggregate(d.new$age, list(id = d.new$id), min)
>   id  x
> 1  1  5
> 2  2  7
> 3  3 12
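
As the quoted message notes, filtering first drops children who never reach the threshold. If those children should still appear in the result, one possibility is to merge the per-child minimum back onto the full list of ids. The following is only a sketch under assumptions: child 4, the column name first.high.age and the object names d.high, first.high and result are made up for illustration and are not part of the original posts.

```r
# Sample data from the original post.
d <- data.frame(id   = c(rep(1, 3), rep(2, 2), 3),
                age  = c(5, 10, 15, 4, 7, 12),
                ldlc = c(132, 120, 125, 105, 142, 160))

# Hypothetical extra child (id 4) who never reaches 130 mg/dL.
d <- rbind(d, data.frame(id = 4, age = c(6, 9), ldlc = c(100, 110)))

# Filter first, then take the minimum age per remaining child.
d.high <- subset(d, ldlc >= 130)
first.high <- aggregate(d.high$age, list(id = d.high$id), min)
names(first.high)[2] <- "first.high.age"

# Left join onto every id; children with no high reading get NA.
result <- merge(data.frame(id = unique(d$id)), first.high, all.x = TRUE)
result
```

The merge runs on one row per child, so it should add little to the cost of the filter-and-aggregate steps.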
If the dataset has many rows, you can save more time by replacing the
call to aggregate() with code that sorts the filtered data by 'id',
breaking ties by 'age', and then picks out the rows where the value
of 'id' changes, i.e. the first row for each 'id':

   f <- function(d) {
      dSorted <- d[order(d$id, d$age), ]
      n <- nrow(dSorted)
      # keep row 1 and every row whose id differs from the previous row's
      dSorted[c(TRUE, dSorted$id[-1] != dSorted$id[-n]), ]
   }
   f(d.new) # or f(d[d$ldlc>=130,]) to avoid leaving around the temp variable

If you know your dataset is already sorted in this way, you need only
the indexing step in the last line of that function, applied to the
data frame itself.

Bill Dunlap
TIBCO Software Inc - Spotfire Division
wdunlap tibco.com

>
> Try that to see what sort of time reduction you observe.
>
> HTH,
>
> Marc Schwartz
>
> ______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
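
The sort-then-pick-first-row idea from the reply above can be tried on the sample data from the original post. This is a minimal sketch, not a benchmark: the function name first.high, its threshold argument and the empty-input guard are additions made for illustration, and which() is used so that rows with a missing ldlc value are dropped rather than turned into NA rows.

```r
# Sample data from the original post.
d <- data.frame(id   = c(rep(1, 3), rep(2, 2), 3),
                age  = c(5, 10, 15, 4, 7, 12),
                ldlc = c(132, 120, 125, 105, 142, 160))

# first.high is a hypothetical name; it wraps the filter and the
# sort-and-pick step in one function.
first.high <- function(d, threshold = 130) {
  high <- d[which(d$ldlc >= threshold), , drop = FALSE]
  if (nrow(high) == 0) return(high)   # guard: no high readings at all
  high <- high[order(high$id, high$age), , drop = FALSE]
  n <- nrow(high)
  # A row is the first for its child when its id differs from the
  # id on the previous row; the very first row always qualifies.
  is.first <- c(TRUE, high$id[-1] != high$id[-n])
  high[is.first, , drop = FALSE]
}

by.sort <- first.high(d)

# Cross-check against the aggregate() approach on the same filtered rows.
by.aggregate <- aggregate(age ~ id, data = d[d$ldlc >= 130, ], FUN = min)
stopifnot(all(by.sort$id == by.aggregate$id),
          all(by.sort$age == by.aggregate$age))
by.sort
```

Unlike aggregate(), this returns the whole first qualifying row for each child, so the ldlc value at that visit is kept as well. Whether it is actually faster on 10,000 children would need to be timed, for example with system.time().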