Hi,
I wonder if someone can help me. I have built a GAM to predict the 
presence of cold-water corals and am now trying to evaluate the model by 
splitting my dataset into training and test datasets.

Ideally I would use the sample() function to select rows of data at random. 
For example, with 936 rows of data in my HH dataset I might say

ss <- sample(nrow(HH), size = nrow(HH) - 312, replace = FALSE)
training <- HH[ss, ]
test <- HH[-ss, ]

in order to create a random training subsample of roughly 65% of my data and 
a test subsample of roughly 35%. (I would use a for() loop to automate 
building the datasets and running the prediction, e.g. 1000 times.)
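
A minimal sketch of that loop, assuming a hypothetical stand-in for HH with a 
0/1 presence column (the simulated data and the column names presence and 
depth are assumptions, not the real dataset). It only records the prevalence 
of each test set, to show how much a purely random split lets it drift:

```r
set.seed(1)  # only so that the sketch is reproducible

# Hypothetical stand-in for HH: 936 rows, 117 of them presences
HH <- data.frame(presence = rep(c(1, 0), times = c(117, 819)),
                 depth    = runif(936, min = 200, max = 1200))

n_rep     <- 1000            # number of random splits
n_test    <- 312             # rows held out for testing in each split
test_prev <- numeric(n_rep)  # prevalence of each test set

for (i in seq_len(n_rep)) {
  ss       <- sample(nrow(HH), size = nrow(HH) - n_test, replace = FALSE)
  training <- HH[ss, ]
  test     <- HH[-ss, ]
  # the model would be fitted to `training` and evaluated on `test` here
  test_prev[i] <- mean(test$presence)
}

range(test_prev)  # spread of test-set prevalence across the splits
```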

The problem is that I have two caveats for the subsampling:


a)      I need to have control over the prevalence (the proportion of 
observed presences within the dataset) in my training and test datasets. 
I realise I could do this by sorting my column of presences and absences, 
then taking a subsample of the required size from the rows containing 
presences, then from the rows containing absences, and combining them.

e.g.

presence_records <- sample(1:117, size = 75, replace = FALSE)
absence_records <- sample(118:936, size = 549, replace = FALSE)
ss <- c(presence_records, absence_records)

but...

b)      My samples lie within video transects and, because of the risk of 
autocorrelation within each transect, ideally they would be randomly selected 
by transect cluster. (A point within a transect cannot be allocated to the 
training dataset when another point from the same transect is already 
allocated to the test dataset.)
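
One way the two caveats might be combined, sketched under assumptions: whole 
transects are drawn at random, and a draw is kept only if the training 
fraction and the prevalence in both subsets fall within a tolerance (simple 
rejection sampling). The simulated data, the column names presence and 
transect, the helper split_by_transect() and the tolerance values are all 
hypothetical, and this is a sketch rather than a definitive method:

```r
set.seed(42)  # only so that the sketch is reproducible

# Hypothetical data: 52 transects of unequal length, presences clustered
# within transects
n_per    <- sample(10:26, 52, replace = TRUE)
transect <- rep(seq_len(52), times = n_per)
p_t      <- rbeta(52, 1, 7)  # per-transect presence probability
HH <- data.frame(transect = transect,
                 presence = rbinom(length(transect), 1, p_t[transect]))

split_by_transect <- function(dat, train_frac = 2/3, tol_size = 0.05,
                              tol_prev = 0.02, max_tries = 10000) {
  ids         <- unique(dat$transect)
  target_prev <- mean(dat$presence)  # overall prevalence to be matched
  for (k in seq_len(max_tries)) {
    # draw whole transects, so no transect is shared between the subsets
    train_ids <- sample(ids, size = round(train_frac * length(ids)))
    in_train  <- dat$transect %in% train_ids
    size_ok <- abs(mean(in_train) - train_frac) <= tol_size
    prev_ok <- abs(mean(dat$presence[in_train]) - target_prev) <= tol_prev &&
      abs(mean(dat$presence[!in_train]) - target_prev) <= tol_prev
    if (size_ok && prev_ok) {
      return(list(training = dat[in_train, ], test = dat[!in_train, ]))
    }
  }
  NULL  # no acceptable split found within max_tries
}

sp <- split_by_transect(HH)
c(train = mean(sp$training$presence), test = mean(sp$test$presence))
```

The accepted splits are random only among the draws that meet the 
tolerances, so tighter tolerances mean fewer distinct splits to choose from.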

Is there a way I can fulfil both of these caveats and still come out with my 
(slightly less) random subsamples?

Many thanks for your time!
All the best,
Bex

