[R] bootstrap subject resampling: resampled subject codes surface as list/vector indices

Aleksander Główka Sat, 19 Aug 2017 08:57:07 -0700

I'm implementing a custom bootstrap resampling procedure in R. This procedure resamples clusters of data points obtained from different subjects in an experiment. Since the bootstrap samples need to have the same size as the original dataset, `target.sample.size`, I select speakers and compute their data point contributions to make sure I have a set of the right size.


    set.seed(1)
    target.sample.size = 1742
    count.lookup = rbind(levels(data$subj), as.numeric(table(data$subj)))

To this end, I create a dynamic list of resampled subjects, `sample.subjects`, that keep on being selected and appended to the list as long as their summed data point contributions do not exceed `target.sample.size`. To conveniently retrieve the number of data points that a given subject contributes, I constructed a reference matrix, `count.lookup`, where the first row contains subject codes and the second row contains their respective data point counts.


    > count.lookup
         [,1]  [,2]  [,3]  [,4]  [,5]
    [1,] "5"   "6"   "13"  "18"  "20"
    [2,] "337" "202" "311" "740" "152"
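
Note that `rbind()` of character levels and numeric counts coerces everything to character, which is why the counts above are quoted. As an alternative, the lookup can be kept as a named integer vector. This is only a minimal sketch: the `data` frame below is a made-up stand-in whose per-subject counts are copied from the matrix above, and `subj.counts` is a hypothetical name.

```r
# Hypothetical data: one row per data point, counts taken from count.lookup
data <- data.frame(subj = factor(rep(c(5, 6, 13, 18, 20),
                                     times = c(337, 202, 311, 740, 152))))

# Named integer vector: names are subject codes, values are data point counts
subj.counts <- lengths(split(seq_len(nrow(data)), as.character(data$subj)))

subj.counts[["13"]]   # count for subject 13: 311
sum(subj.counts)      # total number of data points: 1742
```

Looking counts up by name (a character subject code) sidesteps the `which(...)` position search and the `as.numeric()` conversion.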

This is how the resampling works:

    for (iter in 1:1000){

      #select first subject
      #empty list overwrites sample subjects from previous iteration
      sample.subjects = list()
      sample.subjects[1] = sample(unique(data$subj), 1, replace=TRUE, prob=NULL)

      #determine subject position in data point count lookup
      first.subj.pos = which(count.lookup[1,]==sample.subjects, arr.ind=TRUE)

      #add contribution of first subject to data point count
      sample.size = as.numeric(count.lookup[2,first.subj.pos])

      #select subject clusters until you exceed target sample size
      while(sample.size < target.sample.size){

        #add another subject
        current.subject = sample(unique(data$subj), 1, replace=TRUE, prob=NULL)
        sample.subjects[length(sample.subjects)+1] = current.subject

        #determine subject's position in data point lookup
        curr.subj.pos = which(count.lookup[1,]==current.subject, arr.ind=TRUE)

        #add subject contribution to the data point count
        sample.size = sample.size + as.numeric(count.lookup[2,curr.subj.pos])

      }

      #initialize intermediate data frame; intermediate because it will
      #be shortened to fit target size
      inter.set = data.frame(matrix(, nrow = 0, ncol = ncol(data)))

      #build the bootstrap sample from the selected subjects
      for(j in 1:length(sample.subjects)){
        inter.set = rbind(inter.set, data[data$subj == sample.subjects[j],])
      }

      #procrustean bed of target sample size
      final.set = inter.set[1:target.sample.size,]

      write.csv(final.set, paste("bootstrap_sample_", iter, ".csv", sep=""), row.names=FALSE)

      cat("Bootstrap Iteration", iter, "completed\n")

      #clean up sample.size for next bootstrap iteration
      sample.size = 0

    }
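
For comparison, here is a compact sketch of one bootstrap draw that works on character subject codes throughout. It is not the code from this post: `resample.subjects`, `rows.by.subj` and the stand-in `data` frame are hypothetical names, and treating subject codes as character strings is an assumption made for the illustration.

```r
# Hypothetical data with the same per-subject counts as count.lookup
data <- data.frame(subj = factor(rep(c(5, 6, 13, 18, 20),
                                     times = c(337, 202, 311, 740, 152))))

resample.subjects <- function(data, target.size) {
  # Row indices of each subject, keyed by the subject code as a string
  rows.by.subj <- split(seq_len(nrow(data)), as.character(data$subj))
  subj.counts <- lengths(rows.by.subj)

  picked <- character(0)
  total <- 0
  # Draw subjects with replacement until the target size is reached
  while (total < target.size) {
    s <- sample(names(rows.by.subj), 1)
    picked <- c(picked, s)
    total <- total + subj.counts[[s]]
  }

  # Stack the rows of every picked subject, then trim to the target size
  rows <- unlist(rows.by.subj[picked], use.names = FALSE)
  data[rows[seq_len(target.size)], , drop = FALSE]
}

set.seed(1)
boot <- resample.subjects(data, 1742)
nrow(boot)
```

Collecting row indices first and subsetting once avoids growing a data frame with `rbind()` inside a loop.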

My problem is that when I sample the second subject onward and add it to `sample.subjects` (regardless of whether it is a list or a vector), what actually gets added to `sample.subjects` seems to be the index of that subject in `count.lookup`! When I select the first subject code and create a list consisting of just that subject code as the only element, everything is fine.

    > sample.subjects[1] = sample(unique(tt1$subj), 1, replace=TRUE, prob=NULL)

    > sample.subjects
    [[1]]
    [1] 5

I know this is the actual subject number because when I check the number of data points that this subject contributes in `count.lookup`, it is the number that corresponds to subject 5.


    > sample.size = as.numeric(tt1.lookup[2,first.subj.pos])
    > sample.size

However, when I append further sampled subject codes to the list, for some reason they surface as their index number in `count.lookup`.


    > sample.subjects
    [[1]]
    [1] 5

    [[2]]
    [1] 5

    [[3]]
    [1] 1

    [[4]]
    [1] 2

    [[5]]
    [1] 5

    [[6]]
    [1] 2

    [[7]]
    [1] 2

    [[8]]
    [1] 3

    [[9]]
    [1] 3

The third element, for example, is 1. This coincides with none of the subject codes in `count.lookup`.
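
One possible reading of this output, assuming `data$subj` is a factor (the call to `levels(data$subj)` earlier suggests it is): a factor stores integer codes that index into its levels, and some assignments keep only those codes. The sketch below uses a made-up factor `subj` with the five subject codes from `count.lookup`; the other names are hypothetical as well.

```r
# Hypothetical stand-in for data$subj
subj <- factor(c(5, 6, 13, 18, 20))

picked <- subj[3]               # the subject labelled "13"
code   <- as.integer(picked)    # 3: position of "13" among levels(subj)
label  <- as.character(picked)  # "13": the actual subject code

# Assigning a factor with `[` into a list keeps only the integer code
codes.list <- list()
codes.list[1] <- picked

# Converting to character first keeps the subject code itself
labels.list <- list()
labels.list[1] <- as.character(picked)
```

Under this assumption, a stored value of 5 would stand for the fifth level ("20") rather than for subject 5, and values 1, 2 and 3 would stand for subjects "5", "6" and "13".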

It seems the problem lies in how I append to `sample.subjects`. I tried both vectors and lists as data structures in which to store sampled subject codes. For each data type, I tried two ways of appending: the one I present above, and one that is more idiomatic in R:


sampled.subjects = [current.subject, sampled.subjects] (for lists)

and

sampled.subjects = c(current.subject, sampled.subjects) (for vectors)

Are these appending strategies flawed here, or is there some stupid error I'm making somewhere else that is causing the indices to surface instead of subject codes?


I'd appreciate all your help!

______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
