List,

Consider the following data.

   gender mygroup id
1       F       A  1
2       F       B  2
3       F       B  2
4       F       B  2
5       F       C  2
6       F       C  2
7       F       C  2
8       F       D  2
9       F       D  2
10      F       D  2
11      F       D  2
12      F       D  2
13      F       D  2
14      M       A  3
15      M       A  3
16      M       A  3
17      M       B  3
18      M       B  3
19      M       B  3
20      M       A  4

Here is the reshaping I am seeking (explanation below).

     id mygroup mytime censor
[1,]  1       A      1      1
[2,]  2       B      3      0
[3,]  2       C      3      0
[4,]  2       D      6      1
[5,]  3       A      3      0
[6,]  3       B      3      1
[7,]  4       A      1      1

I need to create 2 variables. The first one is a time variable. Observe that for id=2, mygroup=B was observed 3 times. In the solution we see in row 2 that id=2 has a mytime value of 3.

Next, I need to create a censoring variable. Notice that id=2 takes the values B, C, D for mygroup. This means the changes from B to C and from C to D are observed. There is no change from D. I need to indicate this with a 'censoring' variable. So B and C would have values of 0, and D would have a value of 1. As another example, id=1 never changes, so I assign it censor=1. Overall, if a change is observed, 0 should be assigned, and if a change is not observed, 1 should be assigned.

One potential challenge is that the original data set has over 5 million rows. I have ideas, but I'm still getting used to the data.table and plyr syntax. I also seek a base R solution. I'll post the timings on the real data set shortly.

Thanks for your help.
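For what it's worth, the rule described above can be sketched in base R by finding the rows where a new (id, mygroup) spell starts. This is only a rough sketch on the toy data, assuming the rows are already in time order within each id. The names newRun, starts, and baseSolution are just illustrative, and nothing here has been timed on the full data.

```r
# Toy data from this post; rows are ASSUMED to be in time order within each id.
myData <- data.frame(
  gender  = c(rep("F", 13), rep("M", 7)),
  mygroup = c("A", rep("B", 3), rep("C", 3), rep("D", 6),
              rep("A", 3), rep("B", 3), "A"),
  id      = c("1", rep("2", 12), rep("3", 6), "4"),
  stringsAsFactors = FALSE
)

n   <- nrow(myData)
id  <- myData$id
grp <- myData$mygroup

# A new spell starts at row 1 and wherever id or mygroup differs from the previous row.
newRun <- c(TRUE, id[-1] != id[-n] | grp[-1] != grp[-n])
starts <- which(newRun)

# Spell length = distance from one spell start to the next (or to the end of the data).
mytime <- diff(c(starts, n + 1L))

# A spell is censored (1) when it is the last spell of its id, i.e. no change is observed.
spellId <- id[starts]
censor  <- as.integer(c(spellId[-1] != spellId[-length(spellId)], TRUE))

baseSolution <- data.frame(
  id      = spellId,
  mygroup = grp[starts],
  mytime  = mytime,
  censor  = censor,
  stringsAsFactors = FALSE
)
print(baseSolution)
```

Because this works on adjacent rows only, it does not sort by mygroup, so a group that recurs later for the same id would be counted as a separate spell.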
> sessionInfo()
R version 2.13.1 (2011-07-08)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

# Here is a simplified data set
myData <- structure(list(
    gender = c("F", "F", "F", "F", "F", "F", "F", "F", "F", "F",
               "F", "F", "F", "M", "M", "M", "M", "M", "M", "M"),
    mygroup = c("A", "B", "B", "B", "C", "C", "C", "D", "D", "D",
                "D", "D", "D", "A", "A", "A", "B", "B", "B", "A"),
    id = c("1", "2", "2", "2", "2", "2", "2", "2", "2", "2",
           "2", "2", "2", "3", "3", "3", "3", "3", "3", "4")),
    .Names = c("gender", "mygroup", "id"),
    class = "data.frame", row.names = c(NA, -20L))

# here is plyr solution with idata.frame
library(plyr)
imyData <- idata.frame(myData)
timeData <- idata.frame(ddply(imyData, .(id, mygroup), summarize,
                              mytime = length(mygroup)))

makeCensor <- function(x) {
    myvec <- rep(0, length(x))
    lastInd <- length(myvec)
    myvec[lastInd] <- 1
    myvec
}

plyrSolution <- ddply(timeData, "id", transform,
                      censor = makeCensor(mygroup))

# here is a data table solution
# use makeCensor function from above
library(data.table)
mydt <- data.table(myData)
setkey(mydt, id, mygroup)
timeData <- mydt[, list(mytime = length(gender)), by = list(id, mygroup)]
mycensor <- timeData[, list(censor = makeCensor(mygroup)), by = id]
datatableSolution <- cbind(timeData, mycensor[, list(censor)])

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.