Hi, everyone: I ask for help about translating a stata program into R.
The program perform cross validation as it stated. #1. Randomly divide the data set into 10 sets of equal size, ensuring equal numbers of events in each set #2. Fit the model leaving out the 1st set #3. Apply the fitted model in (2) to the 1st set to obtain the predicted probability of a prostate cancer diagnosis. #4. Repeat steps (2) to (3) leaving out and then applying the fitted model to the ith group, i = 2, 3... 10. Every subject now has a predicted probability of a prostate cancer diagnosis. #5. Using the predicted probabilities, compute the net benefit at various threshold probabilities. #6. Repeat steps (1) to (5) 200 times. The corrected net benefit for each threshold probability is the mean across the 200 replications. ================================================================= First is stata code. forvalues i=1(1)200 { local event="cancer" local predictors1 = "total_psa" local predictors2 = "total_psa free_psa" local prediction1 = "base" local prediction2 = "full" g `prediction1'=. g `prediction2'=. quietly g u = uniform() sort `event' u g set = mod(_n, 10) + 1 forvalues j=1(1)10{ quietly logit `event' `predictors1' if set~=`j' quietly predict ptemp if set==`j' quietly replace `prediction1' = ptemp if set==`j' drop ptemp quietly logit `event' `predictors2' if set~=`j' quietly predict ptemp if set==`j' quietly replace `prediction2' = ptemp if set==`j' drop ptemp } tempfile dca`i' quietly dca `event' `prediction1' `prediction2', graphoff saving("`dca`i''") drop u set `prediction1' `prediction2' } use "`dca1'", clear forvalues i=2(1)200 { append using "`dca`i''" } collapse all none modelp1 modelp2, by(threshold) save "cross validation dca output.dta", replace twoway(line none all modelp1 modelp2 threshold, sort) ================================================================= Here is my draft of R code. cMain is my dataset. predca<-rep(0,40000) dim(predca)<-c(200,200) for (i in 1:200) { cvgroup<-rep(1:10,length=110) cvgroup<-sample(cvgroup) cvpre<-rep(0,length=110) cvMain<-cbind(cMain,cvgroup,cvpre) for (j in 1:10) { cvdev<-cvMain[cvMain$cvgroup!=j,] cvval<-cvMain[cvMain$cvgroup==j,] cvfit<-lrm(Y~X,data=cvdev,x=T,y=T) cvprej<-predict(cvfit,cvval,type="fitted") #put the fitted value in dataset cvMain[cvgroup==j,]$cvpre<-prej } cvdcaop<-dca(cvMain$Y,cvMain$cvpre,prob=("Y")) cvnb<-100*(cvdcaop[,1]-cvdcaop[,2]) cvtpthres<-cvdcaop[,4]/(100-cvdcaop[,4]) cvnr<-cvnb/cvtpthres predca[cvn,1:99]<-cvnb predca[cvn,101:199]<-cvnr } ================================================================= My questions are 1. How to ensure equal numbers of events in each set in R? 2. A part of stata code is forvalues j=1(1)10{ quietly logit `event' `predictors1' if set~=`j' quietly predict ptemp if set==`j' quietly replace `prediction1' = ptemp if set==`j' drop ptemp quietly logit `event' `predictors2' if set~=`j' quietly predict ptemp if set==`j' quietly replace `prediction2' = ptemp if set==`j' drop ptemp } I don't understand what's the difference between prediction1 and prediction2 3. Is my code right? Thanks ! Yao Zhu Department of Urology Fudan University Shanghai Cancer Center Shanghai, China [[alternative HTML version deleted]] ______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.