I have data set up like the following:

    control1 <- sample(1:75, 3947398, replace=TRUE)
    control2 <- sample(1:75, 28793, replace=TRUE)
    control3 <- sample(1:100, 392733, replace=TRUE)
    control4 <- sample(1:75, 858383, replace=TRUE)
    patient1 <- sample(1:100, 28048, replace=TRUE)
    patient2 <- sample(1:50, 80400, replace=TRUE)
    patient3 <- sample(1:100, 48239, replace=TRUE)

    control <- list(control1, control2, control3, control4)
    patient <- list(patient1, patient2, patient3)

To classify these samples as either control or patient, I want to make frequency distributions of the presence of each of the 100 variables being considered. To do this, I randomly sample "s" values from each sample and generate a frequency vector of length 100. This is how I would do it:

    s <- 5000  # starting value; this is the quantity I want to tune

    control_s <- list()
    patient_s <- list()
    for (i in 1:length(control)) control_s[[i]] <- sample(control[[i]], s)
    for (i in 1:length(patient)) patient_s[[i]] <- sample(patient[[i]], s)

Once I do this, I generate the frequency vector of length 100 as follows:

    # One row of 100 proportions per sample; levels = 1:100 keeps absent values as 0
    controlfreq <- list()
    for (i in 1:length(control_s)) {
      controlfreq[[i]] <- as.data.frame(prop.table(table(factor(
        control_s[[i]], levels = 1:100))))[, 2]
    }

    patientfreq <- list()
    for (i in 1:length(patient_s)) {
      patientfreq[[i]] <- as.data.frame(prop.table(table(factor(
        patient_s[[i]], levels = 1:100))))[, 2]
    }

    controlfreq <- t(as.data.frame(controlfreq))
    controltrainingset <- transform(controlfreq, status = "control")
    patientfreq <- t(as.data.frame(patientfreq))
    patienttrainingset <- transform(patientfreq, status = "patient")
    dataset <- rbind(controltrainingset, patienttrainingset)

This is the final data frame being used in the classification algorithm. My goal in this post is to figure out how to identify the optimal "s" value, that is, the one that achieves the highest ROC. I am using "rf" from the caret package to do the classification.

    library(caret)
    fitControl <- trainControl(method = "LOOCV", classProbs = TRUE,
                               savePredictions = TRUE)
    model <- train(status ~ ., data = dataset, method = "rf",
                   trControl = fitControl)

How can I automate this so that it starts with "s" at 5000, changes it to another value, and, based on the change in ROC, keeps changing "s" to work towards the best possible value?

Thanks!
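
One possible way to automate the search is sketched below. It is only a sketch under several assumptions, not a definitive answer. The helper names (freq_vector, build_dataset, roc_for_s), the candidate grid s_grid and the repeat count n_rep are all made up for illustration, and the small control/patient lists are stand-ins so the example runs quickly. Because each draw of "s" values is random, the ROC at a given "s" is noisy, so this sketch averages a few repeats over a fixed grid instead of stepping "s" adaptively; that is a swapped-in simplification. It also assumes that ROC has to be requested explicitly in caret, via summaryFunction = twoClassSummary and metric = "ROC", since the default classification metric is accuracy.

```r
# Proportion of each value 1..100 in a random subsample of size s.
freq_vector <- function(x, s, levels = 1:100) {
  drawn <- sample(x, s)  # without replacement, as in the post
  as.numeric(prop.table(table(factor(drawn, levels = levels))))
}

# One row per sample, 100 frequency columns plus a factor 'status' column.
build_dataset <- function(control, patient, s) {
  all_samples <- c(control, patient)
  stopifnot(s <= min(lengths(all_samples)))  # cannot draw more than the smallest sample
  rows <- lapply(all_samples, freq_vector, s = s)
  dataset <- as.data.frame(do.call(rbind, rows))
  dataset$status <- factor(rep(c("control", "patient"),
                               times = c(length(control), length(patient))))
  dataset
}

# Leave-one-out ROC for one value of s (best value over caret's mtry grid).
roc_for_s <- function(control, patient, s) {
  dataset <- build_dataset(control, patient, s)
  fitControl <- trainControl(method = "LOOCV", classProbs = TRUE,
                             summaryFunction = twoClassSummary)
  model <- train(status ~ ., data = dataset, method = "rf",
                 metric = "ROC", trControl = fitControl)
  max(model$results$ROC)
}

# Small stand-in data (hypothetical sizes) so the sketch runs quickly.
set.seed(1)
control <- list(sample(1:75, 20000, replace = TRUE),
                sample(1:75, 20000, replace = TRUE),
                sample(1:100, 20000, replace = TRUE),
                sample(1:75, 20000, replace = TRUE))
patient <- list(sample(1:100, 20000, replace = TRUE),
                sample(1:50, 20000, replace = TRUE),
                sample(1:100, 20000, replace = TRUE))

s_grid <- c(500, 1000, 5000, 10000)  # hypothetical candidate values of s
n_rep <- 3                           # repeats per s, to average out sampling noise

# Run the search only if the required packages are installed.
have_pkgs <- all(sapply(c("caret", "randomForest", "pROC"),
                        requireNamespace, quietly = TRUE))
if (have_pkgs) {
  library(caret)
  mean_roc <- sapply(s_grid, function(s) {
    mean(replicate(n_rep, roc_for_s(control, patient, s)))
  })
  print(data.frame(s = s_grid, mean_roc = mean_roc))
  best_s <- s_grid[which.max(mean_roc)]
  print(best_s)
}
```

With the real data from the post, the grid should stay at or below 28048, the size of the smallest sample (patient1). With only seven samples, differences in ROC between neighbouring values of "s" may well be within the noise, so increasing n_rep before trusting best_s seems prudent.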