Thanks for all the help. I tried using the "index" argument in caret to dictate which rows of the sample are used to build each tree in the RF (e.g. use all data from sites A and B for training and hold out all data from site C for testing).
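To make the intent concrete, here is a minimal base R sketch of such a site-based index. The data frame `dat`, its `site` column, and the names `trainRows`, `testRows` and `siteIndex` are made up for illustration; they are not part of the iris example in this post.

```r
# Hypothetical data: three sites with four rows each (illustration only)
dat <- data.frame(
  site = rep(c("A", "B", "C"), each = 4),
  x    = seq_len(12)
)

# Training rows: everything from sites A and B
trainRows <- which(dat$site %in% c("A", "B"))

# Held-out rows: everything from site C
testRows <- setdiff(seq_len(nrow(dat)), trainRows)

# A list of training-row vectors, one element per resample
siteIndex <- list(holdOutC = trainRows)
```

A list shaped like `siteIndex` is the kind of object that would be supplied as `index = siteIndex` in `trainControl()`.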
However, after running it, I cross-checked the "index" passed to the train function against the "inbag" matrix in the resulting randomForest object and found that the two did not match, as shown below:

> data(iris)
> tmpIrisIndex <- createDataPartition(iris$Species, p = 0.632, times = 10)
> head(tmpIrisIndex, 3)
[[1]]
 [1]   1   2   3   7  10  11  12  13  16  18  20  22  24  25  26  27  28  29  31
[20]  34  35  36  37  38  39  40  41  43  46  47  48  50  52  53  55  56  57  58
[39]  61  64  65  66  67  68  69  71  74  75  76  77  79  82  83  84  85  86  88
[58]  90  91  92  94  96  98  99 102 103 104 106 108 109 111 112 113 114 115 116
[77] 117 119 120 121 123 126 128 129 130 131 132 134 136 139 140 141 143 146 147
[96] 150

[[2]]
 [1]   1   3   6   7   8  10  12  13  14  16  18  20  21  22  23  24  26  27  28
[20]  29  30  32  34  35  36  38  42  44  46  47  48  50  51  53  54  55  58  60
[39]  61  62  67  68  69  70  72  73  74  76  77  79  81  82  83  85  86  88  89
[58]  90  92  93  95  97  99 100 103 104 105 107 108 109 111 112 113 114 117 119
[77] 120 121 122 123 124 125 127 130 132 133 134 135 137 139 140 141 142 145 147
[96] 149

[[3]]
 [1]   1   5   7   9  10  11  12  14  18  20  21  22  23  24  26  29  30  31  33
[20]  34  35  36  37  38  39  40  44  45  46  47  48  49  51  52  53  54  56  58
[39]  61  63  65  66  69  70  72  74  75  76  77  78  79  80  82  83  85  86  87
[58]  90  91  92  93  94  98 100 102 103 105 106 107 109 110 113 114 115 116 117
[77] 121 122 123 124 125 128 129 130 131 132 133 134 135 138 139 140 141 142 146
[96] 150

> irisTrControl <- trainControl(method = "oob", index = tmpIrisIndex)
> rf.iris.obj <- train(Species ~ ., data = iris, method = "rf", ntree = 10,
+                      keep.inbag = TRUE, trControl = irisTrControl)
Fitting: mtry=2
Fitting: mtry=3
Fitting: mtry=4
> head(rf.iris.obj$finalModel$inbag, 20)
      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
 [1,]    1    0    1    0    0    0    1    0    1     1
 [2,]    1    1    1    1    1    0    1    0    1     0
 [3,]    1    1    1    0    0    1    1    0    0     0
 [4,]    1    0    1    0    1    1    0    1    0     1
 [5,]    0    1    1    1    1    1    0    1    0     1
 [6,]    1    1    0    1    0    0    1    1    1     0
 [7,]    1    1    0    0    1    1    0    0    0     0
 [8,]    1    1    1    1    1    0    1    1    1     1
 [9,]    1    1    0    1    0    1    0    1    1     0
[10,]    1    1    1    0    1    1    0    0    0     1
[11,]    1    1    1    1    1    1    1    0    1     0
[12,]    1    1    1    1    1    0    1    0    1     1
[13,]    1    0    1    1    1    1    1    1    0     1
[14,]    0    1    1    1    0    1    0    0    0     0
[15,]    1    1    1    1    1    1    1    1    1     0
[16,]    1    1    0    0    0    0    1    0    1     1
[17,]    1    0    1    0    0    0    1    1    0     1
[18,]    1    0    1    1    1    1    1    1    1     1
[19,]    1    0    1    0    1    1    1    0    1     1
[20,]    1    0    1    0    1    1    1    0    1     0

My understanding is that the 1st tree in the RF should be built with tmpIrisIndex[[1]], i.e. rows "1 2 3 7 10 11 12 13 ...". But the inbag matrix in the resulting forest shows the 1st tree using rows "1 2 3 4 6 7 8 9 ..." instead. Why does the index passed to train not match the inbag in the rf object? Or am I looking in the wrong place to check this?

Any help or comments would be appreciated. Thanks a lot.

Regards,
Coll

--
View this message in context: http://r.789695.n4.nabble.com/Random-Forest-Strata-tp2295731p2303958.html
Sent from the R help mailing list archive at Nabble.com.