Hello everyone,

These are some questions about the 'pec' function in R.  These questions deal 
with prediction error curves and their derivation.  Prediction error curves are 
documented in, for example, "Efron-type measures of prediction error for 
survival analysis" by Gerds and Schumacher.

I have detailed some syntax that I have used at the bottom of this email.  The 
associated data is available upon request.

In the 'pec' function I have

formula=Surv(OVS,dead)~1

The documentation says that "for right censored data, the RH side of the 
formula is used to specify conditional censoring models".  When there are no 
covariates (as in our case), I understand that we are assuming that censoring 
occurs completely at random (for all models).

Say we have a data set containing potential predictor variables X1,X2....Xp.  
Our "survival time" variable (time) is measured in months and our status 
variable (status) has 0=alive and 1=dead.

Firstly, I would like to ask how to derive conditional censoring models. 
From past conversations it seems that, to establish the form of our censoring 
model(s), we would reverse the status coding so that censoring becomes the 
event, i.e. alive=1 and dead=0.  Is this correct?

Now, say we assume that the model for censoring is of the Cox regression 
type.  Do we then build the censoring model using the usual variable selection 
procedures, where the candidate variables are X1,X2,...,Xp and 
we code *alive=1 and dead=0*?
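
To show what I mean by reversing the coding, here is a minimal sketch on 
simulated data.  It assumes the 'survival' package is installed; the data, the 
rates and the choice of X1 and X5 are invented for illustration, and the 
column name 'cens' is my own.

```r
library(survival)

set.seed(1)                       # arbitrary seed, only for reproducibility
n <- 200
d <- data.frame(X1 = rnorm(n), X5 = rbinom(n, 1, 0.5))

# Simulated death and censoring times (the rates are made up)
event_time <- rexp(n, rate = 0.05 * exp(0.5 * d$X1))
cens_time  <- rexp(n, rate = 0.03 * exp(0.7 * d$X5))

d$time   <- pmin(event_time, cens_time)
d$status <- as.integer(event_time <= cens_time)  # 1 = dead, 0 = alive

# Reverse the indicator so that censoring (alive at last contact) is the event
d$cens <- 1 - d$status

# Cox model for the censoring distribution
cens_fit <- coxph(Surv(time, cens) ~ X1 + X5, data = d)
print(summary(cens_fit)$coefficients)
```

Here the same follow-up time is used, and only the indicator is flipped, so 
that deaths are treated as censored observations of the censoring time.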

Say we found that X1 and X5 should be included in our Cox regression model for 
censoring, then we would enter:

formula=Surv(time,status)~X1+X5 and cens.model="cox" in the pec function.

Am I thinking about this correctly? I think it could be more difficult than 
this, so I'd appreciate some guidance.

The 'replan' argument selects the method for estimating prediction error 
curves.  I understand that replan=none would be used when we don't 
cross-validate, i.e. we would create a model using a specific bootstrap sample 
and then evaluate its performance on the same sample.

Suppose we are looking at the "bootstrap crossvalidation" (OutOfBag) method 
and we have 500 patients.  The paper by Gerds and Schumacher says that 
bootstrap samples Q*_1,...,Q*_B, each of size n, are drawn with replacement 
from the original data (we have chosen 100 bootstrap samples, i.e. B=100, in 
our work).

So here I assume that n is equal to the total number of patients i.e. 500 in 
this example?

When we sample with replacement, as here, I am assuming that each of the 
bootstrap samples contains 500 patients, but that some of these patients 
could occur in a given bootstrap sample more than once.
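
To make the with-replacement case concrete, here is a minimal base R sketch 
(not the pec internals, just my reading of the paper's description).  The 
values n=500 and B=100 are the ones from this email; the seed and object names 
are made up for illustration.

```r
set.seed(42)                      # arbitrary seed, only for reproducibility
n <- 500                          # number of patients
B <- 100                          # number of bootstrap samples

# One bootstrap sample: n patient indices drawn WITH replacement
idx <- sample(n, size = n, replace = TRUE)

counts     <- table(idx)          # how often each selected patient appears
n_unique   <- length(counts)      # distinct patients in the sample
n_repeated <- sum(counts > 1)     # patients appearing more than once
oob        <- setdiff(seq_len(n), idx)   # patients never drawn (out of bag)

# Fraction of patients left out of bag, for each of B bootstrap samples
oob_frac <- replicate(B, {
  i <- sample(n, size = n, replace = TRUE)
  1 - length(unique(i)) / n
})
mean(oob_frac)
```

Each sample has exactly n entries, and in expectation roughly a fraction 
(1 - 1/n)^n, i.e. about 0.368, of the patients is left out of any one sample.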

The documentation says "M is the size of the bootstrap samples for sampling 
*without* replacement".

Hence, am I correct in thinking that for sampling *with* replacement, M is 
equal to n, i.e. M=n by default?  However, if we choose M<n, then each of our 
bootstrap samples would be drawn *without* replacement, i.e. the training 
set would consist of n-M patients.  Am I correct?

[Each of the bootstrap samples acts as a 'training data set' to generate a 
model....the model is then validated using the patients which weren't in the 
bootstrap sample. ]
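
For comparison, this is how I picture the M<n case under the usual reading of 
subsampling without replacement: M patients form the training set and the 
remaining n-M are left for validation.  This is only a base R sketch; M=350 is 
a hypothetical value, and whether pec splits the data exactly this way is what 
I would like to confirm.

```r
set.seed(7)                       # arbitrary seed, only for reproducibility
n <- 500                          # number of patients
M <- 350                          # hypothetical subsample size, M < n

# Subsample of size M drawn WITHOUT replacement: no patient appears twice
train <- sample(n, size = M, replace = FALSE)

# Patients not in the subsample are the ones available for validation
test <- setdiff(seq_len(n), train)

sizes <- c(train = length(train), test = length(test))
sizes
```

Under this reading the training set has M patients and the validation set has 
n-M patients.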

In a study I did using a 286 case data set, I noticed that my prediction error 
curves seem to terminate at around 35 months, although the last observed 
survival time was 248 months.  I checked the survival times and the 
corresponding alive/dead status for this data set and noticed that deaths 
become very sparse after month 35, but I'm still puzzled as to why the curves 
end at around 35 months.  Perhaps the small number of cases in the original 
data set leads to small training data sets and hence to a termination of the 
curves at low times?  Has anybody else encountered this?
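
A simple check I can run is to tabulate how many patients are still at risk, 
and how many deaths remain, at a few time points.  The sketch below uses 
simulated data (286 cases, follow-up up to 248 months), not my real data, and 
the time grid is arbitrary.  It does not explain what pec does; it only shows 
how thin the data can get at later times.

```r
set.seed(3)                       # arbitrary seed, only for reproducibility
n <- 286

# Simulated data: exponential death times, uniform censoring up to 248 months
event_time <- rexp(n, rate = 1 / 20)
cens_time  <- runif(n, min = 0, max = 248)
time   <- pmin(event_time, cens_time)
status <- as.integer(event_time <= cens_time)   # 1 = dead, 0 = alive

grid <- c(0, 12, 24, 35, 60, 120, 240)          # months, chosen arbitrarily

# Patients still under observation at each grid time
at_risk <- sapply(grid, function(t) sum(time >= t))

# Deaths observed strictly after each grid time
deaths_after <- sapply(grid, function(t) sum(status == 1 & time > t))

risk_table <- data.frame(month = grid, at_risk, deaths_after)
risk_table
```

If very few patients are at risk beyond some month, bootstrap training sets 
may differ a lot in their largest observed time, which is one possibility I 
would like to rule in or out.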

In our study, we wanted to develop a model on a specific bootstrap sample and 
then test it using observations which aren't in the bootstrap sample (this 
would be repeated B times), so we chose the 0.632+ bootstrap estimator.  There 
are many types of estimators for prediction error curves.  How should we 
decide which is 'best'?
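
For reference, this is my understanding of how the 0.632+ estimate combines 
the apparent error, the bootstrap crossvalidation error and the no-information 
error at each time point, following the Efron and Tibshirani weighting.  The 
function name and the input values are hypothetical, and this is a sketch of 
the formula, not the code used inside pec.

```r
# 0.632+ combination, vectorised over time points
#   app    : apparent error (model evaluated on its own training data)
#   bootcv : bootstrap crossvalidation (out-of-bag) error
#   noinf  : no-information error
err632plus <- function(app, bootcv, noinf) {
  # The out-of-bag error is capped at the no-information error
  bootcv_c <- pmin(bootcv, noinf)

  # Relative overfitting rate, set to 0 where it is not well defined
  rel <- ifelse(bootcv_c > app & noinf > app,
                (bootcv_c - app) / (noinf - app), 0)

  # The weight moves from 0.632 (no overfitting) towards 1 (heavy overfitting)
  w <- 0.632 / (1 - 0.368 * rel)

  (1 - w) * app + w * bootcv_c
}

# Hypothetical error values at a single time point
est <- err632plus(app = 0.15, bootcv = 0.20, noinf = 0.25)
est
```

The result always lies between the apparent error and the (capped) out-of-bag 
error, moving closer to the out-of-bag error as overfitting increases.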

Thanks for any advice on these questions,
Kindest Regards,
Kim

Dr Kim Pearce CStat
Industrial Statistics Research Unit (ISRU)
School of Mathematics and Statistics
Herschel Building
University of Newcastle
Newcastle upon Tyne
United Kingdom
NE1 7RU

Tel.   0044 (0)191 222 6244 (direct)
Fax.   0044 (0)191 222 8020
Email: k.f.pea...@ncl.ac.uk
______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
