On Fri, Aug 20, 2010 at 3:47 PM, Hall, Ken (CDC/OSELS/NCPHI) <k...@cdc.gov> wrote: > I am working on a simple pilot project comparing the capability of SQL, > SAS and R to perform a rolling mean per the following instructions. I > have completed the SQL and SAS analysis, so now it's R's turn. > > Calculate mean values of x (x=count) for each date in the dataset where > mean = the average count of days [t-9] through day [t-3] for each > date/illness combination. > > Dataset aggpilot > > date illness x > 1 2006/01/01 DERM 319 > 2 2006/01/02 DERM 388 > 3 2006/01/03 DERM 336 > 4 2006/01/04 DERM 255 > 5 2006/01/05 DERM 177 > 6 2006/01/06 DERM 377 > 7 2006/01/07 DERM 113 > 8 2006/01/08 DERM 253 > 9 2006/01/09 DERM 316 > 10 2006/01/10 DERM 187 > 11 2006/01/11 DERM 292 > 12 2006/01/12 DERM 275 > 13 2006/01/13 DERM 355 > ... 3102 total observations > > Result of command: str(aggpilot) > > 'data.frame': 3102 obs. of 3 variables: > $ date : Factor w/ 1551 levels "2006/01/01","2006/01/02",..: 1 2 3 4 > 5 6 7 8 9 10 ... > $ illness: Factor w/ 2 levels "DERM","FEVER": 1 1 1 1 1 1 1 1 1 1 ... > $ x : int 319 388 336 255 177 377 113 253 316 187 ... > > The results should look something like this: > > DATE ILLNESS x MEAN NOTES [ROLLING MEAN DATE > RANGE] > 1/1/2006 DERM 319 > 1/2/2006 DERM 388 > 1/3/2006 DERM 336 > 1/4/2006 DERM 255 319 [1/1/2006 - > 1/1/2006] > 1/5/2006 DERM 177 353.5 [1/1/2006 - > 1/2/2006] > 1/6/2006 DERM 377 347.6666667 [1/1/2006 - > 1/3/2006] > 1/7/2006 DERM 113 324.5 [1/1/2006 - > 1/4/2006] > 1/8/2006 DERM 253 295 [1/1/2006 - > 1/5/2006] > 1/9/2006 DERM 316 308.6666667 [1/1/2006 - > 1/6/2006] > 1/10/2006 DERM 187 280.7142857 [1/1/2006 - > 1/7/2006] > 1/11/2006 DERM 292 271.2857143 [1/2/2006 - > 1/8/2006] > 1/12/2006 DERM 275 261 [1/3/2006 - > 1/9/2006] > 1/13/2006 DERM 355 239.7142857 [1/4/2006 - > 1/10/2006] > >
Lines <- "date illness x 2006/01/01 DERM 319 2006/01/02 DERM 388 2006/01/03 DERM 336 2006/01/04 DERM 255 2006/01/05 DERM 177 2006/01/06 DERM 377 2006/01/07 DERM 113 2006/01/08 DERM 253 2006/01/09 DERM 316 2006/01/10 DERM 187 2006/01/11 DERM 292 2006/01/12 DERM 275 2006/01/13 DERM 355 2006/01/01 FEVER 3190 2006/01/02 FEVER 3880 2006/01/03 FEVER 3360 2006/01/04 FEVER 2550 2006/01/05 FEVER 1770 2006/01/06 FEVER 3770 2006/01/07 FEVER 1130 2006/01/08 FEVER 2530 2006/01/09 FEVER 3160 2006/01/10 FEVER 1870 2006/01/11 FEVER 2920 2006/01/12 FEVER 2750 2006/01/13 FEVER 3550" ################################################## # 1. Produces same result from long form data # Only uses core R functionality (no packages). # # For each row form ix, the set of row indices # over which to average. Then form the columns. ################################################## # read in data DF <- read.table(textConnection(Lines), header = TRUE) DF$date <- as.Date(DF$date) # calculate one row of the output one_row <- function(i, X) with(X, { ix <- seq(i-9, i-3) ix <- ix[ix > 0] ix <- ix[X$illness[ix] == X$illness[i]] date <- format(date, "%m/%d/%Y") tt <- date[ix] NOTES <- if (length(ix) > 0) paste(head(tt, 1), "-", tail(tt, 1)) else "" data.frame(DATE = date[i], ILLNESS = illness[i], x = x[i], MEAN = mean(x[ix]), NOTES) }) do.call(rbind, lapply(1:NROW(DF), one_row, DF)) ################################################## # 2. In common situations the partial means at ends # would not be calculated since they are not # comparable to the other values -- they have # higher variance. Also this data would typically # be represented as a multivariate time series, # not as long form data. # With these changes and making use of the # rollmean in the zoo package it simplifies. # # read.zoo with split= reads it into a multivariate # time series with one column per illness. # We convert dates to chron class since it uses # the desired output date format by default. # It is converted to zooreg as its a weakly # regular series (actually its completely regular). # Then we merge it with its rolling mean and # form the desired columns. ################################################## library(zoo) library(chron) x <- read.zoo(textConnection(Lines), header = TRUE, split = "illness") time(x) <- as.chron(time(x)) x <- as.zooreg(x) zm <- merge(x, mean = lag(rollmean(x, 7, align = "right"), -3), all = FALSE) fmt <- function(x) format(x, "%m/%d/%Y") data.frame(DATE = time(zm), coredata(zm), notes = paste(time(zm) - 9, "-", time(zm) - 3)) ################################################## # 3. An alternative approach is to use the sqldf R package # See http://sqldf.googlecode.com ################################################## library(sqldf) sqldf('select strftime("%m/%d/%Y", s1.date*24*60*60, \'unixepoch\') "DATE", s1.illness "ILLNESS", x, mean "MEAN", coalesce(strftime("%m/%d/%Y", fromdate*24*60*60, \'unixepoch\') || \' - \' || strftime("%m/%d/%Y", fromdate*24*60*60, \'unixepoch\'), \'\') "NOTES" from DF s1 left join (select t1.date, avg(t2.x) mean, min(t2.date) fromdate, max(t2.date) todate, max(t2.illness) illness from DF t1, DF t2 where julianday(t1.date) between julianday(t2.date) + 3 AND julianday(t2.date) + 9 and t1.illness = t2.illness group by t1.illness, t1.date order by t1.illness, t1.date) s2 on s1.date = s2.date and s1.illness = s2.illness') # You can also explore doing part of it with sqldf and other parts in R. ______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.