> -----Original Message-----
> From: r-help-boun...@r-project.org
> [mailto:r-help-boun...@r-project.org] On Behalf Of David Atkins
> Sent: Monday, April 26, 2010 12:23 PM
> To: r-help@r-project.org
> Subject: [R] Dropping "trailing zeroes" in longitudinal data
>
>
> Background: Our research group collected data from students via the web
> about their drinking habits (alcohol) over the last 90 days. As you
> might guess, some students seem to have lost interest and completed some
> information but not all. Unfortunately, the survey was programmed to
> "pre-populate" the fields with zeroes (to make it easier for students to
> complete).
>
> Obviously, when we see a stretch of zeroes, we've no idea whether this
> is "true" data or not, but we'd like to at least do some sensitivity
> analyses by dropping "trailing zeroes" (ie, when there are non-zero
> responses for some duration of the data that then "flat line" into all
> zeroes to the end of the time period)
>
> I've included a toy dataset below.
>
> Basically, we have the data in the "long" format, and what I'd like to
> do is subset the data.frame by deleting rows that occur at the end of a
> person's data that are all zeroes. In a nutshell, select rows from a
> person that are continuously zero, up to first non-zero, starting at the
> end of their data (which, below, would be time = 10).
>
> With the toy data, this would be the last 6 rows of ids #10 and #8 (for
> example). I can begin to think about how I might do this via
> grep/regexp but am a bit stumped about how to translate that to this
> type of data.
>
> Any thoughts appreciated.
>
> cheers, Dave
>
> ### toy dataset
> set.seed(123)
> toy.df <- data.frame(id = factor(rep(1:10, each=10)),
>                      time = rep(1:10, 10),
>                      dv = rnbinom(100, mu = 0.5, size = 100))
> toy.df
>
> library(lattice)
>
> xyplot(dv ~ time | id, data = toy.df, type = c("g","l"))

Try using rle() (run-length encoding) along with either ave()
or lapply(). E.g., define the function

   isInTrailingRunOfZeroes <- function (x, group, minRunLength = 1) {
      as.logical(ave(x, group, FUN = function(x) {
         r <- rle(x)             # runs of equal values within one group
         n <- length(r$values)   # index of the last run
         if (n == 0) {
            logical(0)
         } else if (r$values[n] == 0 && r$lengths[n] >= minRunLength) {
            # last run is all zeroes and long enough: flag only that run
            rep(c(FALSE, TRUE), c(sum(r$lengths[-n]), r$lengths[n]))
         } else {
            rep(FALSE, sum(r$lengths))
         }
      }))
   }

and use it to drop the trailing runs of 0's with

   xyplot(data=toy.df[!isInTrailingRunOfZeroes(toy.df$dv, toy.df$id),],
      dv~time|id, type=c("g","l"))

or replace them with NA's with

   toy.df.copy <- toy.df
   toy.df.copy[isInTrailingRunOfZeroes(toy.df.copy$dv, toy.df.copy$id),"dv"] <- NA

The last argument, minRunLength, lets you say that you only want to
consider the data spurious if there are at least that many zeroes.

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com

>
> --
> Dave Atkins, PhD
> Research Associate Professor
> Department of Psychiatry and Behavioral Science
> University of Washington
> datk...@u.washington.edu
>
> Center for the Study of Health and Risk Behaviors (CSHRB)
> 1100 NE 45th Street, Suite 300
> Seattle, WA 98105
> 206-616-3879
> http://depts.washington.edu/cshrb/
> (Mon-Wed)
>
> Center for Healthcare Improvement, for Addictions, Mental Illness,
> Medically Vulnerable Populations (CHAMMP)
> 325 9th Avenue, 2HH-15
> Box 359911
> Seattle, WA 98104
> 206-897-4210
> http://www.chammp.org
> (Thurs)
>
> ______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
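
To see what rle() returns in the suggestion above, and to cross-check the
"trailing run of zeroes" rule on something small, here is a minimal sketch.
The vector x, the data frame d and the helper trailingZeroes() are made-up
names for illustration only; they are not part of the toy data or of the
function above. The helper uses a different technique (a reversed cumsum()
rather than rle()) and has no minRunLength argument, so treat it as a
sanity check, not a replacement.

```r
# Hypothetical single-person series (not taken from toy.df)
x <- c(0, 2, 0, 0, 1, 0, 0, 0)

# rle() collapses the series into runs of equal values
r <- rle(x)
r$values    # 0 2 0 1 0
r$lengths   # 1 1 2 1 3

# Cross-check without rle(): a position belongs to the trailing run of
# zeroes when no non-zero value occurs at or after it.
trailing <- rev(cumsum(rev(x != 0)) == 0)
trailing    # FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE

# The same rule applied per person via ave(); trailingZeroes is a
# hypothetical helper name.
trailingZeroes <- function(x, group) {
  as.logical(ave(x, group,
                 FUN = function(v) rev(cumsum(rev(v != 0)) == 0)))
}

# Hypothetical two-person data in "long" format
d <- data.frame(id = factor(c(1, 1, 1, 1, 2, 2, 2, 2)),
                dv = c(0, 2, 0, 0, 0, 0, 1, 3))
d$flag <- trailingZeroes(d$dv, d$id)
d[!d$flag, ]   # rows kept after dropping each person's trailing zeroes
```

Note that a person whose series is all zeroes gets every row flagged under
this rule, which matches what the rle() version does with the default
minRunLength = 1; whether that is what you want for a sensitivity analysis
is a separate decision.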