I have a population of subjects, each with a variable that was captured at a baseline date. For many subjects (but not all) an intervention has then occurred, and the variable has changed at one or more time points after the baseline date. So my dataset consists of a subject ID (x), which may appear several times or just once, a measure (y), and a date of observation (z). I would like some sort of animated plot with a slider representing time, so I can show how the distribution of the variable has altered (say in a histogram or a box plot) from baseline up to the end of a period. Each subject should be counted only once in this distribution, using the most recent measure recorded up to and including the current date on the slider.
I have created a synthetic data set using the code below, which roughly replicates the problem for just a few data points over a couple of months. My real data set has about 30,000 subjects with multiple measures captured over 10 years. What I need for each date point is a summary chart such as a histogram, which shows me the distribution of my variable (in this case y) with just one observation per subject, that observation being the most up to date at the date shown on the slider. I have tried to use the manipulate package, which I've used successfully for other simple applications, but hit two problems. Firstly, it doesn't like dates as a slider variable. I can work around this by making them numeric, but would like to work with dates if possible. Secondly, I don't know how to restrict observations to the date on the slider. For example, Subject 100 has a baseline of 45.26, but on or after 26th April it becomes 56.96, so where the slider is set on or beyond this date I would want the earlier value for this subject to be excluded from the distribution. I'm not sure that manipulate was really made for this problem, and perhaps I should be looking elsewhere. I guess a solution involves using the aggregate command to get the unique values at various time points and then using something like the TeachingDemos package? I'm not sure I'm on the right path, though, and as an R beginner I can't get to first base with this. I can't figure out how to use aggregate to give me the value of y corresponding to the latest date z. I know this is basic stuff but I really can't see how to do it.
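A minimal sketch of the "value of y at the latest date z, per subject, as of a given date" step, assuming the column names x, y and z used in this post. The helper name `latest_as_of` and the small data frame are made up for illustration, and the sketch uses `order()` with `duplicated(fromLast = TRUE)` rather than `aggregate`, so it is one possible approach, not necessarily the best one:

```r
# Hypothetical helper: returns one row per subject, namely the most recent
# observation dated on or before `as_of`.
latest_as_of <- function(obs, as_of) {
  as_of <- as.Date(as_of)
  # keep only observations made on or before the chosen date
  seen <- obs[obs$z <= as_of, , drop = FALSE]
  # sort by subject, then by date, so the newest row per subject comes last
  seen <- seen[order(seen$x, seen$z), , drop = FALSE]
  # duplicated(..., fromLast = TRUE) flags every row except the last per subject
  seen[!duplicated(seen$x, fromLast = TRUE), , drop = FALSE]
}

# Tiny made-up example (not the real data)
obs <- data.frame(
  x = c(1, 2, 3, 2, 3, 3),
  y = c(50, 40, 60, 45, 65, 70),
  z = as.Date(c("2013-04-01", "2013-04-01", "2013-04-01",
                "2013-04-20", "2013-04-10", "2013-05-15"))
)

snap <- latest_as_of(obs, "2013-04-30")
snap$y   # 50 45 65: subjects 2 and 3 have moved on from their baselines
```

The resulting `snap$y` could then be passed to `hist()` or `boxplot()` for whichever date is selected.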
My synthetic data can be generated like this (clunky I know, but I can't do this any slicker):

#make my baseline observations for 100 subjects on 1st April 2013
set.seed(1)
a<-data.frame(x=seq(1:100),y=rnorm(100,mean=50,sd=10),z=as.Date("2013-04-01"))

#simulate 50 subsequent observations in the next 2 months resulting in some subjects having different future measurements of y
Start <- as.Date("2013-04-02")
End <- as.Date("2013-06-30")
dates <- seq(from = Start, to = End, by = 1)
set.seed(1)
b<-data.frame(x=sample(0:100,50,replace=TRUE),y=rnorm(50,mean=50,sd=10),z=sample(dates,50,replace=FALSE))

#make one table of observations
c<-merge(a,b,all=TRUE)

Any suggestions much appreciated.

Gavin.
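On the date-slider problem, one possible workaround is to let the slider run over day numbers and convert back to a Date inside the plotting function. The sketch below assumes the x, y, z column names from this post. The function name `hist_as_of` and the three-row data frame are hypothetical, and the commented-out `manipulate` call is untested and only indicates where a numeric slider might plug in:

```r
# Made-up observations: subject id (x), measure (y), observation date (z)
obs <- data.frame(
  x = c(1, 2, 2),
  y = c(50, 40, 45),
  z = as.Date(c("2013-04-01", "2013-04-01", "2013-04-20"))
)

# A numeric-only slider can step over day counts since 1970-01-01
slider_min <- as.numeric(min(obs$z))
slider_max <- as.numeric(max(obs$z))

# Hypothetical plotting function: `day` is the numeric slider value
hist_as_of <- function(day, obs) {
  as_of <- as.Date(day, origin = "1970-01-01")    # back to a real Date
  seen <- obs[obs$z <= as_of, , drop = FALSE]     # nothing after the slider date
  seen <- seen[order(seen$x, seen$z), , drop = FALSE]
  snap <- seen[!duplicated(seen$x, fromLast = TRUE), , drop = FALSE]
  hist(snap$y, main = format(as_of), xlab = "y")  # the title shows the date
  invisible(snap)                                 # the rows that were plotted
}

snap <- hist_as_of(slider_max, obs)

# Assumption, untested here (manipulate only runs inside RStudio):
# manipulate::manipulate(hist_as_of(day, obs),
#   day = manipulate::slider(slider_min, slider_max, step = 1))
```

Because the plot title is formatted from the converted Date, the viewer still sees a calendar date even though the slider itself is numeric.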