This is the fastest data.table way I can think of : ans = mydt[,list(mytime=.N),by=list(id,mygroup)] ans[,censor:=0L] ans[J(unique(id)), censor:=1L, mult="last"] id mygroup mytime censor [1,] 1 A 1 1 [2,] 2 B 3 0 [3,] 2 C 3 0 [4,] 2 D 6 1 [5,] 3 A 3 0 [6,] 3 B 3 1 [7,] 4 A 1 1
> I'll post the timings on the real data set shortly. Please do. Matthew "William Dunlap" <wdun...@tibco.com> wrote in message news:e66794e69cfde04d9a70842786030b9304e...@pa-mbx04.na.tibco.com... > I'll assume that all of an individual's data rows > are contiguous and that an individual always passes through > the groups in order (or, least, the individual > never leaves a group and then reenters it), so we > can find everything we need to know by comparing each > row with the previous row. > > You can use rle() to quickly make the time > column: > > rle(paste(d$mygroup, d$id))$lengths > [1] 1 3 3 6 3 3 1 > > For the censor column it is probably easiest to consider > what rle() must do internally and use a modification of that. > E.g., > isFirstInRun <- function(x) c(TRUE, x[-1] != x[-length(x)]) > isLastInRun <- function(x) c(x[-1] != x[-length(x)], TRUE) > outputRows <- isLastInRun(d$mygroup) | isLastInRun(d$id) > output <- d[outputRows, ] > output$mytime <- diff(c(0, which(outputRows))) > output$censor <- as.integer(isLastInRun(e$id)) > which gives you > > output > gender mygroup id mytimes censor > 1 F A 1 1 1 > 4 F B 2 3 0 > 7 F C 2 3 0 > 13 F D 2 6 1 > 16 M A 3 3 0 > 19 M B 3 3 1 > 20 M A 4 1 1 > You showed a rearrangment of the columns > > output[, c("id", "mygroup", "mytime", "censor")] > id mygroup mytime censor > 1 1 A 1 1 > 4 2 B 3 0 > 7 2 C 3 0 > 13 2 D 6 1 > 16 3 A 3 0 > 19 3 B 3 1 > 20 4 A 1 1 > This ought to be quicker than plyr, but data.table > may do similar run-oriented operations. > > Bill Dunlap > Spotfire, TIBCO Software > wdunlap tibco.com > >> -----Original Message----- >> From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] >> On Behalf Of Juliet Hannah >> Sent: Wednesday, August 31, 2011 10:51 AM >> To: r-help@r-project.org >> Subject: [R] formatting a 6 million row data set; creating a censoring >> variable >> >> List, >> >> Consider the following data. >> >> gender mygroup id >> 1 F A 1 >> 2 F B 2 >> 3 F B 2 >> 4 F B 2 >> 5 F C 2 >> 6 F C 2 >> 7 F C 2 >> 8 F D 2 >> 9 F D 2 >> 10 F D 2 >> 11 F D 2 >> 12 F D 2 >> 13 F D 2 >> 14 M A 3 >> 15 M A 3 >> 16 M A 3 >> 17 M B 3 >> 18 M B 3 >> 19 M B 3 >> 20 M A 4 >> >> Here is the reshaping I am seeking (explanation below). >> >> id mygroup mytime censor >> [1,] 1 A 1 1 >> [2,] 2 B 3 0 >> [3,] 2 C 3 0 >> [4,] 2 D 6 1 >> [5,] 3 A 3 0 >> [6,] 3 B 3 1 >> [7,] 4 A 1 1 >> >> I need to create 2 variables. The first one is a time variable. >> Observe that for id=2, the variable mygroup=B was observed 3 times. In >> the solution we see in row 2 that id=2 has a mytime variable of 3. >> >> Next, I need to create a censoring variable. >> >> Notice id=2 goes through has values of B, C, D for mygroup. This means >> the change from B to C and C to D is observed. There is no change >> from D. I need to indicate this with a 'censoring' variable. So B and >> C would have values 0, and D would have a value of 1. As another >> example, id=1 never changes, so I assign it censor= 1. Overall, if a >> change is observed, 0 should be assigned, and if a change is not >> observed 1 should be assigned. >> >> One potential challenge is that the original data set has over 5 >> million rows. I have ideas, but I'm still getting used the the >> data.table and plyr syntax. I also seek a base R solution. I'll post >> the timings on the real data set shortly. >> >> Thanks for your help. >> >> > sessionInfo() >> R version 2.13.1 (2011-07-08) >> Platform: x86_64-unknown-linux-gnu (64-bit) >> >> locale: >> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C >> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 >> [5] LC_MONETARY=C LC_MESSAGES=en_US.UTF-8 >> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C >> [9] LC_ADDRESS=C LC_TELEPHONE=C >> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C >> >> attached base packages: >> [1] stats graphics grDevices utils datasets methods base >> >> # Here is a simplified data set >> >> myData <- structure(list(gender = c("F", "F", "F", "F", "F", "F", "F", >> "F", "F", "F", "F", "F", "F", "M", "M", "M", "M", "M", "M", "M" >> ), mygroup = c("A", "B", "B", "B", "C", "C", "C", "D", "D", "D", >> "D", "D", "D", "A", "A", "A", "B", "B", "B", "A"), id = c("1", >> "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "3", >> "3", "3", "3", "3", "3", "4")), .Names = c("gender", "mygroup", >> "id"), class = "data.frame", row.names = c(NA, -20L)) >> >> >> # here is plyr solution with idata.frame >> >> library(plyr) >> imyData <- idata.frame(myData) >> timeData <- idata.frame(ddply(imyData, .(id,mygroup), summarize, >> mytime = length(mygroup))) >> >> makeCensor <- function(x) { >> myvec <- rep(0,length(x)) >> lastInd <- length(myvec) >> myvec[lastInd] = 1 >> myvec >> } >> >> >> plyrSolution <- ddply(timeData, "id", transform, censor = >> makeCensor(mygroup)) >> >> >> # here is a data table solution >> # use makeCensor function from above >> >> library(data.table) >> mydt <- data.table(myData) >> setkey(mydt,id,mygroup) >> >> timeData <- mydt[,list(mytime=length(gender)),by=list(id,mygroup)] >> makeCensor <- function(x) { >> myvec <- rep(0,length(x)) >> lastInd <- length(myvec) >> myvec[lastInd] = 1 >> myvec >> } >> >> mycensor <- timeData[,list(censor=makeCensor(mygroup)),by=id] >> datatableSolution <- cbind(timeData,mycensor[,list(censor)]) >> >> ______________________________________________ >> R-help@r-project.org mailing list >> https://stat.ethz.ch/mailman/listinfo/r-help >> PLEASE do read the posting guide >> http://www.R-project.org/posting-guide.html >> and provide commented, minimal, self-contained, reproducible code. > ______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.