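Combine your code into a function:

library(glmmADMB)  # provides glmmadmb()

Plant <- function() {
    # 70/30 split of the row indices, sampled without replacement
    train <- sample.int(nrow(A), floor(nrow(A)*.7))
    test <- (1:nrow(A))[-train]
    A.model <- glmmadmb(nat.r ~ isl.sz + nr.mead, random = ~ 1 | site,
        family = "poisson", data = A[train,])
    # Observed vs. predicted richness in the held-out rows
    cor(A[test,]$nat.r, predict(A.model, newdata = A[test,], type = "response"))
}
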
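For anyone without the original data or glmmADMB installed, the same split-and-score pattern can be sketched on simulated data. This is only an illustration under stated assumptions: the data frame below is made up, split_cor() is a hypothetical name, and glm() stands in for glmmadmb(), so the random site effect is dropped. It also shows mean() applied to the replicate() output, which gives the average over repetitions.

```r
# Simulated stand-in for the data frame A (all values are made up)
set.seed(42)
n <- 200
A <- data.frame(
  isl.sz  = runif(n, 1, 10),
  nr.mead = runif(n, 0, 5)
)
A$nat.r <- rpois(n, lambda = exp(0.5 + 0.15 * A$isl.sz + 0.10 * A$nr.mead))

# One 70/30 split: fit on the training rows, correlate on the test rows.
# Assumption: glm() replaces glmmadmb(), so there is no random site effect.
split_cor <- function(dat) {
  train <- sample.int(nrow(dat), floor(nrow(dat) * 0.7))  # without replacement
  fit <- glm(nat.r ~ isl.sz + nr.mead, family = poisson, data = dat[train, ])
  test <- dat[-train, ]
  cor(test$nat.r, predict(fit, newdata = test, type = "response"))
}

Out <- replicate(100, split_cor(A))  # 100 repetitions keeps the demo quick
mean(Out)    # average correlation across the splits
mean(Out^2)  # average squared correlation across the splits
```

Passing the data frame as an argument, rather than reading A from the global environment, makes the function easier to reuse on another data set.
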
Test the function. It should return a single correlation and no errors or warnings.

Plant()

If not, debug and run it again. When it works:

Out <- replicate(1000, Plant())

Out should be a vector with 1000 correlation values.

hist(Out) # for a histogram of the correlation values

David C

From: Angela Boag [mailto:angela.b...@colorado.edu]
Sent: Friday, August 22, 2014 4:01 PM
To: David L Carlson
Subject: Re: [R] Subsetting data for split-sample validation, then repeating 1000x

Hi David,

Thanks for the feedback. I actually sampled without replacement initially but it's been a while since I looked at this code and just changed it because I thought it made more sense logically, but you've reassured me that my original hunch was right.

The real issue I'm having is how to use either the replicate() or for(i in 1:1000){} loop code to get the average r value of 1000 repetitions as my output. I'm not familiar with either tool, so any suggestions on what that code would look like would be very helpful.

Thanks!
Angela

--
Angela E. Boag
Ph.D. Student, Environmental Studies
CAFOR Project Researcher
University of Colorado, Boulder
Mobile: 720-212-6505

On Fri, Aug 22, 2014 at 2:46 PM, David L Carlson <dcarl...@tamu.edu> wrote:

You can use replicate() or a for (i in 1:1000){} loop to do your replications, but you have other issues first.

1. You are sampling with replacement, which makes no sense at all. Your 70% sample will contain some observations multiple times and will use less than 70% of the data most of the time.

2. You compute r using cor() and r.squared using summary.lm(). Why? Once you have computed r, r*r or r^2 is equal to r.squared for the simple linear model you are using.

# To split your data, you need to sample without replacement, e.g.
train <- sample.int(nrow(A), floor(nrow(A)*.7))
test <- (1:nrow(A))[-train]

# Now run your analysis on A[train,] and test it on A[test,]

# Fit model (I'm modeling native plant richness, 'nat.r')
A.model <- glmmadmb(nat.r ~ isl.sz + nr.mead, random = ~ 1 | site,
    family = "poisson", data = A[train,])

# Correlation between predicted 30% and actual 30%
cor <- cor(A[test,]$nat.r, predict(A.model, newdata = A[test,], type = "response"))

-------------------------------------
David L Carlson
Department of Anthropology
Texas A&M University
College Station, TX 77840-4352

-----Original Message-----
From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Angela Boag
Sent: Thursday, August 21, 2014 4:46 PM
To: r-help@r-project.org
Subject: [R] Subsetting data for split-sample validation, then repeating 1000x

Hi all,

I'm doing some within-dataset model validation and would like to split a dataset 70/30: fit a model to 70% of the data (the training data), then validate it by predicting the remaining 30% (the testing data). I would like to do this split-sample validation 1000 times and average the correlation coefficient and r2 between the predicted and observed values in the testing data.

I have the following working for a single iteration, and would like to know how to use either replicate() or a for loop to average the 1000 'r2' and 'cor' outputs.

--
# create 70% training sample
A.samp <- sample(1:nrow(A),floor(0.7*nrow(A)), replace = TRUE)

# Fit model (I'm modeling native plant richness, 'nat.r')
A.model <- glmmadmb(nat.r ~ isl.sz + nr.mead, random = ~ 1 | site,
    family = "poisson", data = A[A.samp,])

# Use the model to predict the remaining 30% of the data
A.pred <- predict(A.model, newdata = A[-A.samp,], type = "response")

# Correlation between predicted 30% and actual 30%
cor <- cor(A[-A.samp,]$nat.r, A.pred, method = "pearson")

# r2 between predicted and observed
lm.A <- lm(A.pred ~ A[-A.samp,]$nat.r)
r2 <- summary(lm.A)$r.squared

# print values
r2
cor
--

Thanks for your time!

Cheers,
Angela

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
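
The effect of replace = TRUE in the original sample() call can be checked directly by counting distinct row indices. The sketch below assumes a hypothetical data set of 100 rows; no real data is involved.

```r
set.seed(1)
n <- 100                 # hypothetical number of rows
size <- floor(0.7 * n)   # intended training-set size

with_repl    <- sample(1:n, size, replace = TRUE)
without_repl <- sample.int(n, size)   # default is replace = FALSE

length(unique(with_repl))     # distinct training rows; duplicates push this below 70
length(unique(without_repl))  # always equal to size

# Negative indexing drops each sampled row once, however often it was drawn,
# so A[-with_repl, ] would keep this many rows rather than 30:
n - length(unique(with_repl))
```

With duplicated draws the training set covers fewer distinct rows than intended and the held-out set grows to match, so the split is no longer 70/30.
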