Dear All,

This follows on from my last post ("randomForest warning: The response has five or fewer unique values. Are you sure you want to do regression?"), in which I described two problems I met while trying to run a gradientForest regression. The warning I got is not an issue, as Andy kindly pointed out, but I still have the second problem, which relates to the structure of my input data, and I would really appreciate your help with it. I think this is simply a data structure issue and nothing specific to gradientForest.
I am a relative beginner in R (and not a mathematician) and have tried, to no avail, to work out how the data must be structured for the analysis to work. I can run the analysis on the data provided with the gradientForest package by following the instructions, but when I try it with my own data it does not work: it does not consider all the response variables (please see the output in my previous post, quoted below).

My understanding is that a gradientForest regression requires a set of response variables and a set of predictor variables, which then need to be combined. The structure of the predictor variables in the example data accompanying the gradientForest package is:

> load("GZ.phys.site.Rdata")
> str(Phys_site)
'data.frame': 197 obs. of  28 variables:
 $ BATHY  : num  -16.7 -26.6 -32.8 -32.5 -29.7 ...
 $ SLOPE  : num  0.505 0.784 0.12 0.332 0.467 ...
 $ ASPECT : num  234 116 192 172 230 ...
 $ BSTRESS: num  0.218 0.248 0.322 0.374 0.425 ...
 $ CRBNT  : num  98.5 98 98.9 98.3 97.9 ...
 $ GRAVEL : num  39.3 39.2 30 42.4 38.8 ...
 $ SAND   : num  59.7 59.8 63.9 54.9 62.7 ...
 $ MUD    : num  3.16e-07 2.76e-02 5.17 9.22e-01 2.38 ...
 $ NO3_AV : num  0.24 0.3 0.24 0.26 0.3 0.24 0.25 0.24 0.23 0.25 ...
 $ NO3_SR : num  0.33 0.39 0.29 0.16 0.2 0.33 0.31 0.35 0.39 0.19 ...
 $ PO4_AV : num  0.15 0.15 0.15 0.15 0.16 0.16 0.16 0.16 0.15 0.15 ...
 $ PO4_SR : num  0.08 0.08 0.07 0.05 0.08 0.07 0.07 0.08 0.08 0.06 ...
 $ O2_AV  : num  4.42 4.44 4.39 4.35 4.33 4.38 4.34 4.37 4.4 4.34 ...
 $ O2_SR  : num  0.4 0.49 0.28 0.26 0.24 0.32 0.27 0.36 0.43 0.24 ...
 $ S_AV   : num  34.9 34.9 35 34.9 34.9 ...
 $ S_SR   : num  1.47 1.29 1.64 1.57 1.58 1.8 1.94 1.81 1.7 1.83 ...
 $ T_AV   : num  28.2 28 28.3 28.6 28.5 ...
 $ T_SR   : num  2.19 2.79 1.8 1.99 2.12 2.03 2.12 2.18 2.15 2.23 ...
 $ Si_AV  : num  2.33 2.67 2.25 1.26 1.21 2.39 2.34 2.46 2.6 1.6 ...
 $ Si_SR  : num  4.3 4.96 3.59 2.59 2.64 3.92 3.56 4.32 4.88 2.97 ...
 $ CHLA_AV: num  0.499 0.499 0.455 0.594 0.594 ...
 $ CHLA_SR: num  0.55 0.55 0.669 1.258 1.258 ...
 $ K490_AV: num  0.0726 0.0726 0.0672 0.075 0.075 ...
 $ K490_SR: num  0.0489 0.0489 0.0594 0.0732 0.0732 ...
 $ SST_AV : num  27 27 26.9 26.9 26.9 ...
 $ SST_SR : num  4.85 4.85 4.81 4.81 4.81 ...
 $ BIR_AV : num  0.1735 0.0688 0.2397 0.4476 0.5169 ...
 $ BIR_SR : num  0.1563 0.0885 0.2249 0.2667 0.256 ...

This seems to correspond to the structure of my predictor variables, so I don't think this is the problem:

> str(enviro)
'data.frame': 14 obs. of  8 variables:
 $ Temperature         : num  24.8 24.4 24.3 23 24.6 24.6 24.8 24.9 24.3 24.5 ...
 $ Turbidity           : num  0.047 0.046 0.052 0.058 0.049 0.047 0.047 0.049 0.049 0.051 ...
 $ Chlorophyll         : num  0.24 0.23 0.29 0.26 0.25 0.23 0.23 0.28 0.3 0.29 ...
 $ Waveheight          : num  2.14 2.13 2.12 2.12 2.12 2.12 2.11 2.12 2.11 2.12 ...
 $ nLw551              : num  0.231 0.228 0.228 0.236 0.226 ...
 $ nLw667              : num  1e-04 8e-04 1e-03 1e-03 1e-03 1e-04 1e-04 1e-03 1e-03 1e-04 ...
 $ Sediment.nlw551.667.: num  0.231 0.229 0.229 0.237 0.227 ...
 $ Depth               : num  4.8 4.1 5 4 6.2 7.7 10.1 4.3 5.1 7.9 ...

BUT my set of response variables seems to be in the wrong structure, and I think this is the problem and where I need help. This is the structure of the example data provided with gradientForest:

> load("GZ.sps.mat.Rdata")
> str(Sp_mat)
 num [1:197, 1:110] 1.04 -2.11 -3.43 -2.36 -1.15 ...
 - attr(*, "dimnames")=List of 2
  ..$ : chr [1:197] "1" "2" "3" "4" ...
  ..$ : chr [1:110] "A1010102" "A1010113" "A1010206" "A1010209" ...

And this is the structure my response variables are currently in: essentially a table created in Excel, with rows for sites (14 of them) and columns for species (100 of them), holding the abundance of each species at each site (read in with header = TRUE):

> # data structure of biological data
> str(biological)
'data.frame': 14 obs. of  100 variables:
 $ a : num  0 0 0 0 0 0 0 0 0 0 ...
 $ b : num  0 0 0 0 257 ...
 $ c : int  0 0 0 0 0 0 441 0 0 0 ...
 $ d : num  179 0 1430 0 0 ...
 $ e : num  100 0 601 0 123 ...
 $ f : num  0 0 3 0 1.5 0 0 0 0 4.5 ...
 $ g : num  0 0 0 0 0 0 0 0 0 0 ...
 $ h : int  0 0 0 0 0 0 0 0 1 0 ...
 $ i : num  0 0 0 0 0 0 0 0 0 3.85 ...
 $ j : num  0 0 0 27.6 3.6 ...
 $ k : num  0 0 0 0 0 0 0 0 0 1.8 ...
 $ l : num  0 0 0 0 0 0 0 0 0 0 ...
 $ m : num  0 0 0 0 0 0 0 0 0 0 ...
 $ n : num  0 0 0 0 0 0 0 1.1 0 0 ...
 $ o : num  0 0 0 0 0 0.2 0 0 0 0 ...
 $ p : num  0 0.15 0 0 0.35 0.9 0 0 0 0 ...
 $ q : num  0 0 0 0 0 0 0 9.4 0 0 ...
 $ r : num  0 41 0 0 1.75 0 0 0 0 0 ...
 $ s : num  0 0 0 0 0 ...
 $ t : num  0 0 22.1 0 0 ...
 $ u : num  0 0 0 0 0 0 0 0 0 0 ...
 $ v : num  0 0 0 0 0.12 0 0 0 0 0 ...
 $ w : num  0 0 0 0 4.95 6.6 0 3.3 0 3.3 ...
 $ x : num  0 0 0 0 7.9 ...
 $ y : int  0 0 0 0 0 1 0 0 0 0 ...
 $ z : num  0 0 0 0 0 0 0 0 0 0.8 ...
 $ aa: num  0 0 0 0 0 0 0 0 0 0 ...
 $ ab: num  0 47 0 136.3 9.4 ...
 $ ac: num  0 0 0 0 0 0 0 0 0 0 ...
 $ ad: num  0 4.2 0 8.4 0 0 0 0.7 0 0 ...
 $ ae: int  0 0 0 2 0 1 0 0 0 0 ...
 $ af: num  0 92.4 720.7 0 554.4 ...
 $ ag: int  0 0 0 0 0 0 0 0 0 0 ...
 $ ah: int  0 0 0 0 0 0 0 0 0 0 ...
 $ ai: num  43.4 3.4 26.4 0 1.7 ...
 $ aj: num  0 0 0 0 0 ...
 $ ak: num  0 0 0.25 0 0 0 0 0 0 0 ...
 $ al: num  0 0 0 0 0 ...
 $ am: num  561.6 0 93.6 0 374.4 ...
 $ an: num  234 0 562 0 187 ...
 $ ao: num  15.92 2.16 0 0 1.08 ...
 $ ap: num  31.84 0 1.08 0 3.24 ...
 $ aq: num  0 0 0 37.8 29.4 0 92.4 0 0 0 ...
 $ ar: int  0 72 0 76 16 49 0 8 0 0 ...
 $ as: num  0 0 0 0 0 0 0 0 0 0 ...
 $ at: num  0 0 0 0 0 0 0 0 0 0 ...
 $ au: num  0 0 0 0 0 0 0 0 0 0 ...
 $ av: num  0 31.8 0 25.4 0 ...
 $ aw: num  0 0 0 0 0 0 0 0 0 0 ...
 $ ax: num  0 2.7 0 0 0 0 0 2.7 2.7 0 ...
 $ ay: int  0 0 0 0 0 1 0 0 0 0 ...
 $ az: num  2.7 0 0 0 0 0 0 0 0 0 ...
 $ ba: num  7.72 0 0 0 0 0 0 0 0 0 ...
 $ bb: num  262 0 0 0 0 ...
 $ bc: num  0 1.6 0 13.6 0 ...
 $ bd: num  0 0 7.96 0 0 0 0 0 0 0 ...
 $ be: num  2493 0 1254 0 988 ...
 $ bf: num  0 46.4 0 72.5 45 ...
 $ bg: num  218 0 265 0 884 ...
 $ bh: num  0 0 0 0 0 0 0 2.8 0 0 ...
 $ bi: num  0 0 0 0 0 ...
 $ bj: num  0 0 0 0 0 0 0 0 0 0 ...
 $ bk: num  0 0 0 0 0 1.4 0 0 0 0 ...
 $ bl: num  0 0 0 0 0 0 3.2 0 0 0 ...
 $ bm: num  0 2.6 0 72.8 0 ...
 $ bn: num  0 0 82.8 0 0 0 0 0 0 0 ...
 $ bo: num  0 0 0 0 0 ...
 $ bp: int  0 0 0 0 0 0 0 288 0 0 ...
 $ bq: num  28.4 530.5 433.4 473.9 615.6 ...
 $ br: num  0 0 0 0 0 0 0 0 0 0 ...
 $ bs: num  0 0 0 0 0 0 0 0 0 14.5 ...
 $ bt: num  56.2 0 1125 0 78.8 ...
 $ bu: num  205.4 7.9 130.3 0 0 ...
 $ bv: num  1353.2 0 119.4 0 79.6 ...
 $ bw: num  0 0 0 2.45 0.7 2.1 0 0 0 0 ...
 $ bx: num  0 0 0 0 0 ...
 $ by: num  0 0 0 0 0 0 26.4 0 0 0 ...
 $ bz: num  208 1806 3727 208 8427 ...
 $ ca: num  49.2 0 32.8 0 57.4 ...
 $ cb: num  0 7.15 0 0 0 0 1.65 0 0 0 ...
 $ cc: num  0 590 0 419 0 ...
 $ cd: num  0 0 0 0 0 0 0 0 1.5 0 ...
 $ ce: num  1390 0 1394 0 552 ...
 $ cf: num  75.6 0 0 0 0 ...
 $ cg: num  3.86 0 0 0 0 0 0 0 0 0 ...
 $ ch: num  81.3 0 0 0 0 ...
 $ ci: num  0 0 0 0 12.2 ...
 $ cj: num  0 1.2 0 0.8 0 0.8 0.8 3.6 0 0 ...
 $ ck: num  0 0 0 0 0 17.4 0 0 0 0 ...
 $ cl: int  0 0 0 0 0 0 0 0 0 435 ...
 $ cm: num  0 0 0 0 0 0 31.2 0 0 0 ...
 $ cn: num  0 0 0 16.8 0 0 0 0 0 0 ...
 $ co: num  11.61 0 2.11 0 10.55 ...
 $ cp: num  15.05 1.4 0.35 0 0 ...
 $ cq: num  0 0 0 0 0 0 0 4.2 0 0 ...
 $ cr: int  0 0 0 0 1 0 0 0 0 0 ...
 $ cs: num  0 0 0 0 0 0 17.1 0 0 0 ...
 $ ct: num  2.7 0 0 0 0 0 0 0 0 0 ...
 $ cu: num  0 0 30.9 0 41.2 ...
  [list output truncated]

I thought the problem might be that some columns are numeric and some are integer, but I tested this using only numeric columns and found that this is not the problem. How do I get my response/species data into the correct structure, as in the example (GZ.sps.mat.Rdata)?

Thank you very much,
Sean

-----Original Message-----
From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Sean Porter
Sent: 25 March 2014 09:34 AM
To: 'Liaw, Andy'; r-help@r-project.org
Subject: Re: [R] randomForest warning: The response has five or fewer unique values. Are you sure you want to do regression?

Dear Andy,

Thank you for your help! Below are the full details of what I am doing in R along with the data structure, so hopefully this will help.
Okay, so the warning is just a warning and nothing to worry about when doing regression. But why is randomForest producing regression trees for only 3 species when I have 100 species in the matrix? Surely this is not correct; what am I doing wrong? Also, what did you mean when you said that by using the code I am not using randomForest directly?

Many thanks,
Sean

> # For Andy
> # get biological data into R
> biological <- read.table (file = "C:/bio1.txt", header = TRUE)
> dim(biological)
[1]  14 100
> # get environmental data into R
> enviro <- read.table (file = "C:/abio1.txt", header = TRUE)
> dim(enviro)
[1] 14  8
> # data structure of biological data
> str(biological)
'data.frame': 14 obs. of  100 variables:
[... remaining str(biological) output snipped: identical to the listing earlier in this thread ...]
> # data structure of environmental data
> str(enviro)
'data.frame': 14 obs. of  8 variables:
 $ Temperature         : num  24.8 24.4 24.3 23 24.6 24.6 24.8 24.9 24.3 24.5 ...
 $ Turbidity           : num  0.047 0.046 0.052 0.058 0.049 0.047 0.047 0.049 0.049 0.051 ...
 $ Chlorophyll         : num  0.24 0.23 0.29 0.26 0.25 0.23 0.23 0.28 0.3 0.29 ...
 $ Waveheight          : num  2.14 2.13 2.12 2.12 2.12 2.12 2.11 2.12 2.11 2.12 ...
 $ nLw551              : num  0.231 0.228 0.228 0.236 0.226 ...
 $ nLw667              : num  1e-04 8e-04 1e-03 1e-03 1e-03 1e-04 1e-04 1e-03 1e-03 1e-04 ...
 $ Sediment.nlw551.667.: num  0.231 0.229 0.229 0.237 0.227 ...
 $ Depth               : num  4.8 4.1 5 4 6.2 7.7 10.1 4.3 5.1 7.9 ...
> # conduct randomForest regression
> gf <- gradientForest(cbind(enviro, biological), predictor.vars = colnames(enviro), response.vars = colnames(biological), ntree = 500, transform = NULL, compact = T, nbin = 201, maxLevel = 5, corr.threshold = 0.5)
There were 50 or more warnings (use warnings() to see the first 50)
> gf
A forest of 500 regression trees for each of 3 species

Call:
gradientForest(data = cbind(enviro, biological), predictor.vars = colnames(enviro),
    response.vars = colnames(biological), ntree = 500, transform = NULL,
    maxLevel = 5, corr.threshold = 0.5, compact = T, nbin = 201)

Important variables:
[1] Sediment.nlw551.667. Depth nLw551 nLw667 Chlorophyll
> # End

-----Original Message-----
From: Liaw, Andy [mailto:andy_l...@merck.com]
Sent: 25 March 2014 02:37 AM
To: Sean Porter; r-help@r-project.org
Subject: RE: [R] randomForest warning: The response has five or fewer unique values. Are you sure you want to do regression?

If you are using the code, that's not really using randomForest directly. I don't understand the data structure you have (since you did not show anything) so can't really tell you much.
In any case, that warning comes from randomForest() when it is run in regression mode but the response has five or fewer distinct values. It may be legitimate regression data, and if so you can safely ignore the warning (that's why it's not an error). It's there to catch the cases where people try to do classification with class labels 1, 2, ..., k and forget to make the response a factor.

Best,
Andy Liaw

-----Original Message-----
From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Sean Porter
Sent: Thursday, March 20, 2014 3:27 AM
To: r-help@r-project.org
Subject: [R] randomForest warning: The response has five or fewer unique values. Are you sure you want to do regression?

Hello everyone,

I'm relatively new to R and new to the randomForest package, and have scoured the archives for help with no luck. I am trying to perform a regression on a set of predictors and response variables to determine the most important predictors. I have 100 response variables collected from 14 sites and 8 predictor variables from the same 14 sites. I ran the code for the randomForest regression given by Pitcher et al. 2011 ( http://gradientforest.r-forge.r-project.org/biodiversity-survey.pdf ). However, after running the code I get the warning:

"In randomForest.default(m, y, ...) : The response has five or fewer unique values. Are you sure you want to do regression?"

It also produces a set of 500 regression trees for each of only 3 species, when the number of species in the response file is 100. I noticed that in the example by Pitcher et al. they get 500 trees for only 90 species even though they input 110 species in the response data. Why am I getting the warning and how do I solve it, and why is randomForest producing trees for only 3 species when I am looking at 100 species (response variables)?
Many thanks
Sean

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
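
On the data-structure question asked in this thread, here is a minimal base-R sketch, offered as an illustration rather than a confirmed fix. It turns a species data frame into a numeric matrix with dimnames (the same kind of object as Sp_mat) and counts the distinct values in each species column, since randomForest() issues the "five or fewer unique values" warning for such responses. The four-species `biological` table and the one-column `enviro` table below are made-up stand-ins, not the poster's files. Note that cbind() of a data frame and a matrix is again a data frame, so the conversion alone may well not change what gradientForest() receives; the thread does not establish why only 3 species were retained.

```r
# Hypothetical stand-in for the poster's species table: 14 sites x 4 species,
# with a mix of numeric and integer columns as in str(biological).
biological <- data.frame(
  a = rep(0, 14),                              # never recorded
  b = c(rep(0, 13), 257),                      # recorded at one site only
  c = as.integer(c(0, 0, 441, rep(0, 11))),    # integer column
  d = seq(10, 140, by = 10)                    # recorded at every site
)

# as.matrix() on an all-numeric data frame returns a numeric matrix;
# rownames.force = TRUE keeps "1", "2", ... as row names, as in Sp_mat.
sp_mat <- as.matrix(biological, rownames.force = TRUE)
str(sp_mat)

# Number of distinct values in each species column.
n_unique <- apply(sp_mat, 2, function(x) length(unique(x)))
print(n_unique)

# Species whose response has five or fewer distinct values.
sparse <- names(n_unique)[n_unique <= 5]
print(sparse)

# Hypothetical one-column predictor table for the same 14 sites.
enviro <- data.frame(Depth = seq(4, 10.5, by = 0.5))

# Binding a data frame of predictors to the matrix gives a data frame again.
combined <- cbind(enviro, sp_mat)
class(combined)
```

Running the same `n_unique` count on the real 14 x 100 table would show how many species have enough distinct abundance values across the 14 sites to be worth modelling at all.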