Hi,
I'm new to R (and statistics) and my boss has thrown me in the deep end with
the following task:
We want to evaluate how sample size affects our ability to build a robust
model, i.e. how robust the model is to sample size, for the purpose of
cross-validation. In our current project we have collected a series of
independent data at 250 locations, from which we have built a predictive
model. We want to know whether we could get away with collecting fewer
samples and still build a decent model, for the obvious operational reasons
of cost, time spent in the field, etc.
Our thinking was that we could apply a bootstrap-type procedure:
We would remove 10 records (samples) from the total n=250 and then replace
those 10 with copies drawn from the remaining 240. With this new data frame
we would fit our model and calculate an r². We would repeat this in a loop
1000 times and then take the mean of the 1000 r² values generated. After that
we would start the process again, removing 20 samples from our data and
replacing them with copies from the remaining 230 records, and so on...
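The scheme described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the data are synthetic stand-ins for the 250 records, the model is a plain lm(y ~ x1 + x2) rather than the real one, and mean_r2_after_dropping is a hypothetical helper name.

```r
set.seed(1)  # only so that the sketch is reproducible

# Synthetic stand-in for the 250 field records (NOT the real data)
n <- 250
dat <- data.frame(x1 = runif(n, 0, 10), x2 = runif(n, 0, 10))
dat$y <- 2 + 0.5 * dat$x1 + 0.3 * dat$x2 + rnorm(n)

# Hypothetical helper: drop k rows, refill with copies of the kept rows,
# refit the model, and return the mean R^2 over nboot repetitions.
mean_r2_after_dropping <- function(dat, k, nboot = 1000) {
  n <- nrow(dat)
  r2 <- numeric(nboot)
  for (i in seq_len(nboot)) {
    keep   <- sample(n, n - k)                 # rows that stay (no replacement)
    refill <- sample(keep, k, replace = TRUE)  # k copies drawn from the kept rows
    boot   <- dat[c(keep, refill), ]           # still n rows in total
    fit    <- lm(y ~ x1 + x2, data = boot)     # assumed placeholder model
    r2[i]  <- summary(fit)$r.squared
  }
  mean(r2)
}

# Repeat for increasing numbers of removed records
drop_sizes <- c(10, 20, 30)
res <- sapply(drop_sizes, function(k) mean_r2_after_dropping(dat, k, nboot = 200))
names(res) <- drop_sizes
res
```

Each resampled data frame in this sketch still has n rows, of which at most n - k are distinct records.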
Below is a simplified version of the real code, which contains most of the
basic elements. My main problem is that I'm not sure what the
'for(i in 1:nboot)' line is doing. Originally I thought it meant that 1 sample
or record was removed from the data and replaced by a copy of one of the
remaining records, so that 'for(i in 10:nboot)', used in the context of the
code below, would remove 10 samples with replacement as described above. I'm
almost positive that this isn't what is happening. If not, how can I make the
code below do what we want?
library(utils)
#data
a <- c(5.5, 2.3, 8.5, 9.1, 8.6, 5.1)
b <- c(5.2, 2.2, 8.6, 9.1, 8.8, 5.7)
c <- c(5.0,14.6, 8.9, 9.0, 9.1, 5.5)
#join
abc <- data.frame(a,b,c)
#set column names
names(abc)[1]<-"y"
names(abc)[2]<-"x1"
names(abc)[3]<-"x2"
abc2 <- abc
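#sample: transpose so that each record becomes a column,
#because sample() applied to a data frame draws columns
abc3 <- as.data.frame(t(as.matrix(abc2)))
#number of records (rows); length(abc2) would return the number of columns
n <- nrow(abc2)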
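npboot.function <- function(nboot)
{
  #one correlation per bootstrap replicate
  boot.cor <- vector(length=nboot)
  for(i in 1:nboot){
    #draw n columns (records) of abc3 with replacement
    rdata <- sample(abc3,n,replace=TRUE)
    #transpose back so that records are rows again
    abc4 <- as.data.frame(t(as.matrix(data.frame(rdata))))
    model <- lm(asin(sqrt(y/100)) ~ I(x1^2) + x2, data=abc4)
    #correlation between observed y and the fitted values
    boot.cor[i] <- cor(abc4$y, fitted(model))
  }
  boot.cor
}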
bt.cor <- npboot.function(nboot=10)
bootmean <- mean(bt.cor)
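As a small illustration of the loop in the code above: in R, for(i in 1:nboot) only sets how many times the body is repeated, while the number of records drawn in each repetition is set by the size argument of sample(). The sketch below uses made-up row indices (rows, sizes and the value of nboot are assumptions chosen for the demonstration).

```r
set.seed(42)      # only so that the sketch is reproducible
rows  <- 1:6      # stand-in indices for six records
nboot <- 3        # number of repetitions of the loop body
sizes <- integer(nboot)

for (i in 1:nboot) {
  # i is just a counter; each pass draws a full-size resample with replacement
  idx <- sample(rows, size = length(rows), replace = TRUE)
  sizes[i] <- length(idx)
  cat("iteration", i, ": rows drawn =", idx, "\n")
}

# Changing the start of the sequence only changes which values i takes:
# 10:1000 runs the body 991 times and never removes 10 records.
n_iter <- length(10:1000)
```

Note that 10:nboot counts downwards when nboot is smaller than 10, so the loop would then index positions that do not exist in a result vector of length nboot.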
Any assistance would be greatly appreciated, also the sooner the better as we
are under pressure to reach a conclusion.
Cheers,
Garth
______________________________________________
[email protected] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.