Try using ave(), as in dat1WithMean <- within(dat1, Mean <- ave(OBS_VALUE, PT_ID, DAYS_DIFF, FUN = mean))
Bill Dunlap Spotfire, TIBCO Software wdunlap tibco.com > -----Original Message----- > From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On > Behalf > Of arun > Sent: Thursday, September 06, 2012 12:21 PM > To: Weijia Wang > Cc: R help > Subject: Re: [R] R_closest date > > > > HI, > > For your question to split by two variables, try this: > dat1<-read.table(text=" > ID PT_ID BASE IDX_DT OBS_DATE OBS_VALUE CATEGORY > DAYS_DIFF LDL_BASE rf > 118485 9624295 164.2 2006-11-21 2009-02-17 157.6 2 819 > 2006-11-20 2.5 > 118486 9624295 164.2 2006-11-21 2006-09-30 208.0 2 -52 > 2006-11-20 2.5 > 118487 9624295 164.2 2006-11-21 2008-04-09 123.8 2 505 > 2006-11-20 2.5 > 118488 9624295 164.2 2006-11-21 2008-02-26 17.4 1 462 > 2006-11-20 2.5 > 118489 9624295 164.2 2006-11-21 2008-02-26 139.0 2 462 > 2006-11-20 2.5 > 118490 9624295 164.2 2006-11-21 2007-07-02 107.2 2 223 > 2006-11-25 2.5 > 118491 9624295 164.2 2006-11-21 2007-02-27 86.0 1 98 > 2006-11-25 2.5 > 118492 9624295 164.2 2006-11-21 2008-09-09 131.2 2 658 > 2006-11-25 2.5 > 118485 9624296 164.2 2006-11-21 2009-02-17 157.6 2 819 > 2006-11-26 2.5 > 118486 9624296 164.2 2006-11-21 2006-09-30 208.0 2 -52 > 2006-11-26 2.5 > 118487 9624296 164.2 2006-11-21 2008-04-09 123.8 2 505 > 2006-11-26 2.5 > 118488 9624296 164.2 2006-11-21 2008-02-26 17.4 1 462 > 2006-11-27 2.5 > 118489 9624296 164.2 2006-11-21 2008-02-26 139.0 2 462 > 2006-11-27 2.5 > 118490 9624296 164.2 2006-11-21 2007-07-02 107.2 2 223 > 2006-11-27 2.5 > 118491 9624296 164.2 2006-11-21 2007-02-27 86.0 1 98 > 2006-11-27 2.5 > 118492 9624296 164.2 2006-11-21 2008-09-09 131.2 2 658 > 2006-11-27 2.5 > ",sep="",header=TRUE) > > dat2<-split(dat1,list(dat1$LDL_BASE,dat1$PT_ID)) > dat3<-list() > for(i in seq_along(dat2)){ > dat3[[i]]<-list() > dat3[[i]]<-ddply(dat2[[i]],.(DAYS_DIFF),summarize,Mean=mean(OBS_VALUE)) > } > dat3 > do.call(rbind,dat3) > # DAYS_DIFF Mean > #1 -52 208.0 > #2 462 78.2 > #3 505 123.8 > #4 819 157.6 > #5 98 86.0 > #6 223 107.2 > #7 658 131.2 > #8 -52 208.0 > #9 505 123.8 > #10 819 157.6 > #11 98 86.0 > #12 223 107.2 > #13 462 78.2 > #14 658 131.2 > > > #Not sure whether this will work or not in your huge dataset. May be you can > try lapply() > also. > > A.K. > ________________________________ > From: Weijia Wang <wwang....@gmail.com> > To: arun <smartpink...@yahoo.com> > Sent: Thursday, September 6, 2012 12:21 PM > Subject: Re: [R] R_closest date > > > Hi, Arun > > Do you have idea about good package that split HUGE dataframe by two > variables? > > I was trying to use 'ddply' to calculate a mean of LDL-C values which had > same date for > every patient. > > Therefore, I need to break down my dataframe, first by patient ID, then by > the date of > the LDL-C, and finally calculate the mean, if there are multiple LDL-C on a > same day. > > The example is: > > PT_ID BASE IDX_DT OBS_DATE OBS_VALUE CATEGORY > DAYS_DIFF LDL_BASE rf > 118485 9624295 164.2 2006-11-21 2009-02-17 157.6 2 819 days > 2006-11-20 2.5 > 118486 9624295 164.2 2006-11-21 2006-09-30 208.0 2 -52 days > 2006-11-20 2.5 > 118487 9624295 164.2 2006-11-21 2008-04-09 123.8 2 505 days > 2006-11-20 2.5 > 118488 9624295 164.2 2006-11-21 2008-02-26 17.4 1 462 days > 2006-11-20 2.5 > 118489 9624295 164.2 2006-11-21 2008-02-26 139.0 2 462 days > 2006-11-20 2.5 > 118490 9624295 164.2 2006-11-21 2007-07-02 107.2 2 223 days > 2006-11-20 2.5 > 118491 9624295 164.2 2006-11-21 2007-02-27 86.0 1 98 days > 2006-11-20 2.5 > 118492 9624295 164.2 2006-11-21 2008-09-09 131.2 2 658 days > 2006-11-20 2.5 > REDUCTION GOAL FAILURE > 118485 0.04019488 NOT AT GOAL FAIL > 118486 -0.26674787 PRE NOT AT GOAL NOT FAIL > 118487 0.24604141 AT GOAL NOT FAIL > 118488 0.89403167 AT GOAL NOT FAIL > 118489 0.15347138 NOT AT GOAL FAIL > 118490 0.34713764 AT GOAL NOT FAIL > 118491 0.47624848 AT GOAL NOT FAIL > 118492 0.20097442 NOT AT GOAL NOT FAIL > > So, this patient has two LDL-C readings on '462 days', therefore, I want to > get a mean of > these 17.4 and 139.0. > > 'ddply' did give me a mean when running on a test dataframe, but when I used > it on my > dataframe with 200,000ish observations, the computer run for like 5 hours and > return > error. Do you have idea about other good function, that focuses on split and > apply > function, and rbind? > > Best > Weijia > > > On Mon, Sep 3, 2012 at 2:08 AM, wwang.nyu <wwang....@gmail.com> wrote: > > That is actually a great idea, thanks again! > > > >Weijia Wang > > > > > >On Sep 2, 2012, at 12:12 PM, arun <smartpink...@yahoo.com> wrote: > > > >> Hi, > >> No problem. > >> > >> If you use join() instead of merge(), the original order of columns may > >> not get altered. > >> > >> dat3<-aggregate(DAYS_DIFF~PT_ID,data=dat1,min) > >> library(plyr) > >> join(dat1,dat3,type="inner") > >> #Joining by: PT_ID, DAYS_DIFF > >> # PT_ID IDX_DT OBS_DATE DAYS_DIFF OBS_VALUE CATEGORY > >> #1 4549 2002-08-21 2002-08-20 -1 183 2 > >> #2 4839 2006-11-28 2006-11-28 0 179 2 > >> A.K. > >> > >> > >> > >> > >> > >> > >> ________________________________ > >> From: Weijia Wang <wwang....@gmail.com> > >> To: arun <smartpink...@yahoo.com> > >> Sent: Saturday, September 1, 2012 5:11 PM > >> Subject: Re: [R] R_closest date > >> > >> > >> Thank you Arun, for your help again. > >> > >> Best > >> ______________________________ > >> WANG WEIJIA > >> Graudate Research and Teaching Assistant > >> Department of Environmental Medicine > >> New York University, School of Medicine > >> wwang....@gmail.com > >> > >> > >> > >> > >> On Sep 1, 2012, at 5:04 PM, arun <smartpink...@yahoo.com> wrote: > >> > >> Hi, > >>> Try this: > >>> dat1 <- read.table(text=" > >>> PT_ID IDX_DT OBS_DATE DAYS_DIFF OBS_VALUE CATEGORY > >>> 13 4549 2002-08-21 2002-08-20 -1 183 2 > >>> 14 4549 2002-08-21 2002-11-14 85 91 1 > >>> 15 4549 2002-08-21 2003-02-18 181 89 1 > >>> 16 4549 2002-08-21 2003-05-15 267 109 2 > >>> 17 4549 2002-08-21 2003-12-16 482 96 1 > >>> 128 4839 2006-11-28 2006-11-28 0 179 2 > >>> ", header=TRUE) > >>> dat3<-aggregate(DAYS_DIFF~PT_ID,data=dat1,min) > >>> merge(dat1,dat3) > >>> # PT_ID DAYS_DIFF IDX_DT OBS_DATE OBS_VALUE CATEGORY > >>> #1 4549 -1 2002-08-21 2002-08-20 183 2 > >>> #2 4839 0 2006-11-28 2006-11-28 179 2 > >>> > >>> #or, > >>> dat2<- tapply(dat1$DAYS_DIFF,dat1$PT_ID,min) > >>> dat4<-data.frame(PT_ID=row.names(data.frame(dat2)),DAYS_DIFF=dat2) > >>> row.names(dat4)<-1:nrow(dat4) > >>> merge(dat1,dat4) > >>> # PT_ID DAYS_DIFF IDX_DT OBS_DATE OBS_VALUE CATEGORY > >>> #1 4549 -1 2002-08-21 2002-08-20 183 2 > >>> #2 4839 0 2006-11-28 2006-11-28 179 2 > >>> A.K. > >>> > >>> > >>> > >>> > >>> > >>> ----- Original Message ----- > >>> From: WANG WEIJIA <wwang....@gmail.com> > >>> To: "r-help@R-project.org" <r-help@r-project.org> > >>> Cc: > >>> Sent: Saturday, September 1, 2012 1:10 PM > >>> Subject: [R] R_closest date > >>> > >>> Hi, > >>> > >>> I have encountered an issue about finding a date closest to another date > >>> > >>> So this is how the data frame looks like: > >>> > >>> PT_ID IDX_DT OBS_DATE DAYS_DIFF OBS_VALUE CATEGORY > >>> 13 4549 2002-08-21 2002-08-20 -1 183 2 > >>> 14 4549 2002-08-21 2002-11-14 85 91 1 > >>> 15 4549 2002-08-21 2003-02-18 181 89 1 > >>> 16 4549 2002-08-21 2003-05-15 267 109 2 > >>> 17 4549 2002-08-21 2003-12-16 482 96 1 > >>> 128 4839 2006-11-28 2006-11-28 0 179 2 > >>> > >>> I need to find, the single observation, which has the closest date of > >>> 'OBS_DATE' to > 'IDX_DT'. > >>> > >>> For example, for 'PT_ID' of 4549, I need row 13, of which the OBS_DATE is > >>> just one > day away from IDX_DT. > >>> > >>> I was thinking about using abs(), and I got this: > >>> > >>> baseline<- function(x){ > >>> + > >>> + #remove all uncessary variables > >>> + baseline<- x[,c("PT_ID","DAYS_DIFF")] > >>> + > >>> + #get a list of every unique ID > >>> + uniqueID <- unique(baseline$PT_ID) > >>> + > >>> + #make a vector that will contain the smallest DAYS_DIFF > >>> + first <- rep(-99,length(uniqueID)) > >>> + > >>> + i = 1 > >>> + #loop through each unique ID > >>> + for (PT_ID in uniqueID){ > >>> + > >>> + #for each iteration get the smallest DAYS_DIFF for that ID > >>> + first[i] <- > >>> min(baseline[which(baseline$PT_ID==PT_ID),abs(baseline$DAYS_DIFF)]) > >>> + > >>> + #up the iteration counter > >>> + i = i + 1 > >>> + > >>> + } > >>> + #make a data frame with the lowest DAYS_DIFF and ID > >>> + newdata <- data.frame(uniqueID,first) > >>> + names(newdata) <- c("PT_ID","DAYS_DIFF") > >>> + > >>> + #return the data frame containing the lowest GPI for each ID > >>> + return(newdata) > >>> + } > >>> > >>> ldl.b<-baseline(ldl) #get all baseline ldl patient ID, total 11368 obs, > >>> all unique# > >>>> Error in `[.data.frame`(baseline, which(baseline$PT_ID == PT_ID), > abs(baseline$DAYS_DIFF)) : > >>> undefined columns selected > >>> > >>> Can anyone help me in figuring out how to get the minimum value of the > >>> absolute > value of DAYS_DIFF for unique ID? > >>> > >>> Thanks a lot > >>> [[alternative HTML version deleted]] > >>> > >>> ______________________________________________ > >>> R-help@r-project.org mailing list > >>> https://stat.ethz.ch/mailman/listinfo/r-help > >>> PLEASE do read the posting guide > >>> http://www.R-project.org/posting-guide.html > >>> and provide commented, minimal, self-contained, reproducible code. > >>> > >>> > > > > ______________________________________________ > R-help@r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. ______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.