Hi, everyone:
I ask for help about translating a stata program into R.
The program perform cross validation as it stated.
#1. Randomly divide the data set into 10 sets of equal size, ensuring equal
numbers of events in each set
#2. Fit the model leaving out the 1st set
#3. Apply the fitted model in (2) to the 1st set to obtain the predicted
probability of a prostate cancer diagnosis.
#4. Repeat steps (2) to (3) leaving out and then applying the fitted model
to the ith group, i = 2, 3... 10. Every subject now has a predicted
probability of a prostate cancer diagnosis.
#5. Using the predicted probabilities, compute the net benefit at various
threshold probabilities.
#6. Repeat steps (1) to (5) 200 times. The corrected net benefit for each
threshold probability is the mean across the 200 replications.
=================================================================
First is stata code.
forvalues i=1(1)200 {
local event="cancer"
local predictors1 = "total_psa"
local predictors2 = "total_psa free_psa"
local prediction1 = "base"
local prediction2 = "full"
g `prediction1'=.
g `prediction2'=.
quietly g u = uniform()
sort `event' u
g set = mod(_n, 10) + 1
forvalues j=1(1)10{
quietly logit `event' `predictors1' if set~=`j'
quietly predict ptemp if set==`j'
quietly replace `prediction1' = ptemp if set==`j'
drop ptemp
quietly logit `event' `predictors2' if set~=`j'
quietly predict ptemp if set==`j'
quietly replace `prediction2' = ptemp if set==`j'
drop ptemp
}
tempfile dca`i'
quietly dca `event' `prediction1' `prediction2', graphoff saving("`dca`i''")
drop u set `prediction1' `prediction2'
}
use "`dca1'", clear
forvalues i=2(1)200 {
append using "`dca`i''"
}
collapse all none modelp1 modelp2, by(threshold)
save "cross validation dca output.dta", replace
twoway(line none all modelp1 modelp2 threshold, sort)
=================================================================
Here is my draft of R code. cMain is my dataset.
predca<-rep(0,40000)
dim(predca)<-c(200,200)
for (i in 1:200) {
cvgroup<-rep(1:10,length=110)
cvgroup<-sample(cvgroup)
cvpre<-rep(0,length=110)
cvMain<-cbind(cMain,cvgroup,cvpre)
for (j in 1:10) {
cvdev<-cvMain[cvMain$cvgroup!=j,]
cvval<-cvMain[cvMain$cvgroup==j,]
cvfit<-lrm(Y~X,data=cvdev,x=T,y=T)
cvprej<-predict(cvfit,cvval,type="fitted")
#put the fitted value in dataset
cvMain[cvgroup==j,]$cvpre<-prej
}
cvdcaop<-dca(cvMain$Y,cvMain$cvpre,prob=("Y"))
cvnb<-100*(cvdcaop[,1]-cvdcaop[,2])
cvtpthres<-cvdcaop[,4]/(100-cvdcaop[,4])
cvnr<-cvnb/cvtpthres
predca[cvn,1:99]<-cvnb
predca[cvn,101:199]<-cvnr
}
=================================================================
My questions are
1. How to ensure equal numbers of events in each set in R?
2. A part of stata code is
forvalues j=1(1)10{
quietly logit `event' `predictors1' if set~=`j'
quietly predict ptemp if set==`j'
quietly replace `prediction1' = ptemp if set==`j'
drop ptemp
quietly logit `event' `predictors2' if set~=`j'
quietly predict ptemp if set==`j'
quietly replace `prediction2' = ptemp if set==`j'
drop ptemp
}
I don't understand what's the difference between prediction1 and
prediction2
3. Is my code right?
Thanks !
Yao Zhu
Department of Urology
Fudan University Shanghai Cancer Center
Shanghai, China
[[alternative HTML version deleted]]
______________________________________________
[email protected] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.