Since this involves time series within each id group I thought it would be interesting to see if this could be formulated using zoo series. The approach is to read it in, and convert it to a long form by just stacking the start and stop times in a data frame and converting that to zoo using the concatenation of date and id as the time index. It makes use of the fact that - zoo can use non-standard "times" (here the concatenation of date and id) - read.zoo can split and aggregate and - the resulting object can be directly cumsum'd using cumsum.zoo .
Finally we convert it back to the required form. This was mostly done by looking at the output rather than trying to follow the SAS code so some differences are possible. It does seem to give the same output in this case. Yet another approach might be to use model.matrix to create the NRTI/NNRTI/PI columns: model.matrix(~ drug - 1, long2) * long2$change but below we stick with zoo. Some of the code here could be transplanted into Ista's solution (which uses the reshape and plyr packages) to get a combination approach. Lines <- "1004 NRTI 07/24/95 01/05/99 1004 NRTI 11/20/95 12/10/95 1004 NRTI 01/10/96 01/05/99 1004 PI 05/09/96 11/16/97 1004 NRTI 06/01/96 02/01/97 1004 NRTI 07/01/96 03/01/97 9999 PI 01/02/03 . 9999 NNRTI 04/05/06 07/08/09" library(zoo) # need na.fill from development version of zoo source("http://r-forge.r-project.org/scm/viewvc.php/*checkout*/pkg/zoo/R/na.fill.R?revision=831&root=zoo") # read in data and reshape to long form DF <- read.table(textConnection(Lines), as.is = TRUE, na.strings = ".", col.names = c("id", "drug", "date", "date"), check.names = FALSE) long <- rbind(cbind(DF[-4], change = 1), cbind(DF[-3], change = -1)) # convert to zoo. date.id, the concatenation of date & id, becomes index. # Then take cumulative sums of columns and append regimen. long2 <- with(long, data.frame(date.id = paste(as.Date(date, "%m/%d/%y"), id), drug, change)) z <- read.zoo(long2, index = 1, split = "drug", FUN = identity, aggregate = sum) z <- cumsum(na.fill(z, fill = 0)) z$regimen <- ave(1:nrow(z), sub(".* ", "", time(z)), FUN = seq_along) # convert z to data frame for final processing id <- sub(".* ", "", time(z)) start_date = as.Date(sub(" .*", "", time(z))) d2 <- data.frame(id, start_date, coredata(z)) # from each group of rows in same id form the desired columns. # f processes one such group. f <- function(x) with(x, data.frame( id = head(id, -1), start_date = head(start_date, -1), stop_date = start_date[-1], head(cbind(regimen, NRTI, NNRTI, PI, all = NRTI + NNRTI + PI, HAART = as.numeric((NRTI >= 3 & NNRTI==0 & PI==0) | (NRTI >= 2 & (NNRTI >= 1 | PI >= 1)) | (NRTI == 1 & NNRTI >= 1 & PI >= 1))), -1))) result <- do.call(rbind, by(d2, id, f)) rownames(result) <- NULL result The end result is this: > result id start_date stop_date regimen NRTI NNRTI PI all HAART 1 1004 1995-07-24 1995-11-20 1 1 0 0 1 0 2 1004 1995-11-20 1995-12-10 2 2 0 0 2 0 3 1004 1995-12-10 1996-01-10 3 1 0 0 1 0 4 1004 1996-01-10 1996-05-09 4 2 0 0 2 0 5 1004 1996-05-09 1996-06-01 5 2 0 1 3 1 6 1004 1996-06-01 1996-07-01 6 3 0 1 4 1 7 1004 1996-07-01 1997-02-01 7 4 0 1 5 1 8 1004 1997-02-01 1997-03-01 8 3 0 1 4 1 9 1004 1997-03-01 1997-11-16 9 2 0 1 3 1 10 1004 1997-11-16 1999-01-05 10 2 0 0 2 0 11 9999 2003-01-02 2006-04-05 1 0 0 1 1 0 12 9999 2006-04-05 2009-07-08 2 0 1 1 2 0 13 9999 2009-07-08 <NA> 3 0 0 1 1 0 -- Statistics & Software Consulting GKX Group, GKX Associates Inc. tel: 1-877-GKX-GROUP email: ggrothendieck at gmail.com ______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.