[R] cross validation function translated from stata

zhu yao Thu, 21 Jan 2010 05:56:31 -0800

Hi, everyone:

I ask for help about translating a stata program into R.


The program perform cross validation as it stated.

#1. Randomly divide the data set into 10 sets of equal size, ensuring equal
numbers of events in each set
#2. Fit the model leaving out the 1st set
#3. Apply the fitted model in (2) to the 1st set to obtain the predicted
probability of a prostate cancer diagnosis.
#4. Repeat steps (2) to (3) leaving out and then applying the fitted model
to the ith group, i = 2, 3... 10. Every subject now has a predicted
probability of a prostate cancer diagnosis.
#5. Using the predicted probabilities, compute the net benefit at various
threshold probabilities.
#6. Repeat steps (1) to (5) 200 times. The corrected net benefit for each
threshold probability is the mean across the 200 replications.

=================================================================
First is stata code.

forvalues i=1(1)200 {
local event="cancer"
local predictors1 = "total_psa"
local predictors2 = "total_psa free_psa"
local prediction1 = "base"
local prediction2 = "full"

g `prediction1'=.
g `prediction2'=.

quietly g u = uniform()
sort `event' u
g set = mod(_n, 10) + 1

forvalues j=1(1)10{
quietly logit `event' `predictors1' if set~=`j'
quietly predict ptemp if set==`j'
quietly replace `prediction1' = ptemp if set==`j'
drop ptemp


quietly logit `event' `predictors2' if set~=`j'
quietly predict ptemp if set==`j'
quietly replace `prediction2' = ptemp if set==`j'
drop ptemp
}
tempfile dca`i'
quietly dca `event' `prediction1' `prediction2', graphoff saving("`dca`i''")

drop u set `prediction1' `prediction2'
}


use "`dca1'", clear
forvalues i=2(1)200 {
append using "`dca`i''"
}

collapse all none modelp1 modelp2, by(threshold)

save "cross validation dca output.dta", replace

twoway(line none all modelp1 modelp2 threshold, sort)

=================================================================
Here is my draft of R code. cMain is my dataset.

predca<-rep(0,40000)
dim(predca)<-c(200,200)

for (i in 1:200) {
 cvgroup<-rep(1:10,length=110)
cvgroup<-sample(cvgroup)
cvpre<-rep(0,length=110)
cvMain<-cbind(cMain,cvgroup,cvpre)

for (j in 1:10) {
cvdev<-cvMain[cvMain$cvgroup!=j,]
cvval<-cvMain[cvMain$cvgroup==j,]
 cvfit<-lrm(Y~X,data=cvdev,x=T,y=T)
cvprej<-predict(cvfit,cvval,type="fitted")

#put the fitted value in dataset
cvMain[cvgroup==j,]$cvpre<-prej
}

cvdcaop<-dca(cvMain$Y,cvMain$cvpre,prob=("Y"))
cvnb<-100*(cvdcaop[,1]-cvdcaop[,2])
cvtpthres<-cvdcaop[,4]/(100-cvdcaop[,4])
cvnr<-cvnb/cvtpthres
predca[cvn,1:99]<-cvnb
predca[cvn,101:199]<-cvnr
}

=================================================================

My questions are
1. How to ensure equal numbers of events in each set in R?
2. A part of stata code is
                        forvalues j=1(1)10{
quietly logit `event' `predictors1' if set~=`j'
quietly predict ptemp if set==`j'
quietly replace `prediction1' = ptemp if set==`j'
drop ptemp

quietly logit `event' `predictors2' if set~=`j'
quietly predict ptemp if set==`j'
quietly replace `prediction2' = ptemp if set==`j'
drop ptemp
}
  I don't understand what's the difference between prediction1 and
prediction2
3. Is my code right?

Thanks !

Yao Zhu
Department of Urology
Fudan University Shanghai Cancer Center
Shanghai, China

        [[alternative HTML version deleted]]

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

[R] cross validation function translated from stata

Reply via email to