Hello, I am interested in producing the expected number of events, in a recurring events setting. I am using the Andersen-Gill model, as fit by the function "coxph" in the package "survival."
I need to produce expected numbers of events for a cohort, cumulatively, at several fixed times. My ultimate goal is: To fit an AG model to a reference sample, then use that fitted model to generate expected numbers of events for a new cohort; then, comparing the expected vs. the observed numbers of events would give us some idea of whether the new cohort differs from the reference one. >From my reading of the documentation and the text by Therneau and Grambsch, it seems that the function "survexp" is what I need. But using it I am not able to obtain expected numbers of events that match reasonably well the observed numbers *even for the same reference population.* So, I think I am misunderstanding something quite badly. Below is an example that illustrates the situation. At the end I include the sessionInfo(). Thank you! Omar. ############################ # Example of unexpected behavior in computing estimated number of events # in using package "survival" for fitting the Andersen-Gill model require(survival) head(bladder2) # this is the data, in interval format # Fit Andersen-Gill model cphfit = coxph(Surv(start,stop,event)~rx+number+size+cluster(id),data=bladder2) # Choose some arbitrary time horizons t.horiz = seq(min(bladder2$start),max(bladder2$stop),length=6) # Compute the cohort expected survival s = survexp(~1,data=bladder2,ratetable=cphfit,times=t.horiz) # This are the expected survival values: s$surv # We are interested in the rate of events e.r = as.vector( 1 - s$surv ) # How does this compare to the actual number of events, cumulative at # each time horizon? observed = numeric(length(t.horiz)) for (i in 1:length(t.horiz)){ observed[i] = sum(bladder2$event[bladder2$stop <= t.horiz[i]]) } print(observed) # We would like to compute expected numbers of events that approximately # match these observed values. # We should multiply the expected survival rates by the number of individuals. # Now, one would think that this is the number of at-risk individuals: s$n.risk # But that is actually the total number of rows in the data. In any case, # these numbers do not match: rbind(expected = s$n.risk*e.r,observed=observed) # What if we multiply by the number of individuals? rbind(expected = length(unique(bladder2$id))*e.r,observed=observed) # This does not work either! The required factor seems to be about 133, but # I don't see an explanation for that. # In this example, multiplying by 133.182 gives a good match between observed # and expected values, but in other examples even the shape of the curves # are different. # Multiplying by a number of individuals at risk at each time point # (number of individuals # for which there is a time interval containing the time horizon) does # not work either. ##################### > sessionInfo() R version 2.15.1 (2012-06-22) Platform: i386-apple-darwin9.8.0/i386 (32-bit) locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 attached base packages: [1] splines stats graphics grDevices utils datasets methods base other attached packages: [1] survival_2.36-14 loaded via a namespace (and not attached): [1] tools_2.15.1 ______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.