Re: [R] formatting a 6 million row data set; creating a censoring variable

William Dunlap Wed, 31 Aug 2011 12:33:56 -0700

I'll assume that all of an individual's data rows
are contiguous and that an individual always passes through
the groups in order (or, least, the individual
never leaves a group and then reenters it), so we
can find everything we need to know by comparing each
row with the previous row.


You can use rle() to quickly make the time
column: 
  > rle(paste(d$mygroup, d$id))$lengths
  [1] 1 3 3 6 3 3 1

For the censor column it is probably easiest to consider
what rle() must do internally and use a modification of that.
E.g.,
  isFirstInRun <- function(x) c(TRUE, x[-1] != x[-length(x)])
  isLastInRun <- function(x) c(x[-1] != x[-length(x)], TRUE)
  outputRows <- isLastInRun(d$mygroup) | isLastInRun(d$id)
  output <- d[outputRows, ]
  output$mytime <- diff(c(0, which(outputRows)))
  output$censor <- as.integer(isLastInRun(e$id))
which gives you
  > output
     gender mygroup id mytimes censor
  1       F       A  1       1      1
  4       F       B  2       3      0
  7       F       C  2       3      0
  13      F       D  2       6      1
  16      M       A  3       3      0
  19      M       B  3       3      1
  20      M       A  4       1      1
You showed a rearrangment of the columns
  > output[, c("id", "mygroup", "mytime", "censor")]
     id mygroup mytime censor
  1   1       A      1      1
  4   2       B      3      0
  7   2       C      3      0
  13  2       D      6      1
  16  3       A      3      0
  19  3       B      3      1
  20  4       A      1      1
This ought to be quicker than plyr, but data.table
may do similar run-oriented operations.

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com 

> -----Original Message-----
> From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On 
> Behalf Of Juliet Hannah
> Sent: Wednesday, August 31, 2011 10:51 AM
> To: r-help@r-project.org
> Subject: [R] formatting a 6 million row data set; creating a censoring 
> variable
> 
> List,
> 
> Consider the following data.
> 
>    gender mygroup id
> 1       F       A  1
> 2       F       B  2
> 3       F       B  2
> 4       F       B  2
> 5       F       C  2
> 6       F       C  2
> 7       F       C  2
> 8       F       D  2
> 9       F       D  2
> 10      F       D  2
> 11      F       D  2
> 12      F       D  2
> 13      F       D  2
> 14      M       A  3
> 15      M       A  3
> 16      M       A  3
> 17      M       B  3
> 18      M       B  3
> 19      M       B  3
> 20      M       A  4
> 
> Here is the reshaping I am seeking (explanation below).
> 
>      id mygroup mytime censor
> [1,]  1       A      1      1
> [2,]  2       B      3      0
> [3,]  2       C      3      0
> [4,]  2       D      6      1
> [5,]  3       A      3      0
> [6,]  3       B      3      1
> [7,]  4       A      1      1
> 
> I need to create 2 variables. The first one is a time variable.
> Observe that for id=2, the variable mygroup=B was observed 3 times. In
> the solution we see in row 2 that id=2 has a mytime variable of 3.
> 
> Next, I need to create a censoring variable.
> 
> Notice id=2 goes through has values of B, C, D for mygroup. This means
> the change from B to C and C to D is observed.  There is no change
> from D. I need to indicate this with a 'censoring' variable. So B and
> C would have values 0, and D would have a value of 1. As another
> example, id=1 never changes, so I assign it  censor= 1. Overall, if a
> change is observed, 0 should be assigned, and if a change is not
> observed 1 should be assigned.
> 
> One potential challenge is that the original data set has over 5
> million rows. I have ideas, but I'm still getting used the the
> data.table and plyr syntax.  I also seek a base R solution. I'll post
> the timings on the real data set shortly.
> 
> Thanks for your help.
> 
> > sessionInfo()
> R version 2.13.1 (2011-07-08)
> Platform: x86_64-unknown-linux-gnu (64-bit)
> 
> locale:
>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>  [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
> 
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
> 
> # Here is a simplified data set
> 
> myData <- structure(list(gender = c("F", "F", "F", "F", "F", "F", "F",
> "F", "F", "F", "F", "F", "F", "M", "M", "M", "M", "M", "M", "M"
> ), mygroup = c("A", "B", "B", "B", "C", "C", "C", "D", "D", "D",
> "D", "D", "D", "A", "A", "A", "B", "B", "B", "A"), id = c("1",
> "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "3",
> "3", "3", "3", "3", "3", "4")), .Names = c("gender", "mygroup",
> "id"), class = "data.frame", row.names = c(NA, -20L))
> 
> 
> # here is plyr solution with  idata.frame
> 
> library(plyr)
> imyData <-  idata.frame(myData)
> timeData <- idata.frame(ddply(imyData, .(id,mygroup), summarize,
> mytime = length(mygroup)))
> 
> makeCensor <- function(x) {
>    myvec <- rep(0,length(x))
>    lastInd <- length(myvec)
>    myvec[lastInd] = 1
>    myvec
> }
> 
> 
> plyrSolution <- ddply(timeData, "id", transform, censor = makeCensor(mygroup))
> 
> 
> # here is a data table solution
> # use makeCensor function from above
> 
> library(data.table)
> mydt <- data.table(myData)
> setkey(mydt,id,mygroup)
> 
> timeData <- mydt[,list(mytime=length(gender)),by=list(id,mygroup)]
> makeCensor <- function(x) {
>    myvec <- rep(0,length(x))
>    lastInd <- length(myvec)
>    myvec[lastInd] = 1
>    myvec
> }
> 
> mycensor <- timeData[,list(censor=makeCensor(mygroup)),by=id]
> datatableSolution <- cbind(timeData,mycensor[,list(censor)])
> 
> ______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: [R] formatting a 6 million row data set; creating a censoring variable

Reply via email to