Dear experts of boosting!

I am planning to build vegetation models via boosting with either gbm or mboost. My problem is that my response variable is the proportion of natural vegetation at a location that belongs to a given vegetation type:
    ResponseA = (area of vegetation type A) / (area of all natural vegetation types)

That means the response is continuous between 0 and 1, with many exact 0s and 1s as well. As I understand from reading these forums, this is pretty close to a beta distribution, except that the boundary values 0 and 1 are also included. Because of that I cannot even fit an ordinary beta regression, let alone a boosted variant of one.

Nevertheless, I can think of my response as a binomial one with values between 0 and 1, taking 1 square meter of natural vegetation (as if it were a pixel) as one observation. This way I can fit binomial glms to my data, specifying the number of square meters of natural vegetation as weights (I round them to integers so that they can be used in glm).

I hope I am allowed to post a side question here: I always get a warning with these glms. Here is a simple one-variable example:

    tmp <- glm(ossz_ujstand2$k2_stand ~ BIO_1 + I((BIO_1)^2), family = binomial,
               na.action = na.omit, weights = ossz_ujstand2$weights)

where BIO_1 is a variable describing climate, and weights is a vector holding the area of natural vegetation for each observation, rounded to integers. The warning is:

    "non-integer #successes in a binomial glm!"

I read somewhere on this site that this can be normal, but I would be reassured if someone could confirm that it is harmless in my case as well.

My problem with boosting is that I don't know how to handle the distribution of my response variable. I am not quite sure how to treat the loss function either. It seems to me that it somehow corresponds to the link function, since it has to be defined through family(), like link functions in glm, and the available choices for family also correspond. At the same time, some papers about boosting suggest to me that the loss function plays more the role of the curve-estimation technique, and that data with any distribution can be boosted with any type of loss function.
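To make the glm set-up above concrete, here is a minimal sketch on simulated data. All names and values in it (bio1, area, succ, prop, the coefficients used to generate p) are made up for illustration and are not my real data. It fits the same binomial model in two ways: with the proportion as response and the area as weights, and with a two-column (successes, failures) response.

```r
# Simulated stand-in for the real data; every name and value here is hypothetical.
set.seed(1)
n    <- 200
bio1 <- runif(n, 5, 12)                        # climate predictor (stand-in for BIO_1)
area <- round(runif(n, 10, 500))               # m^2 of natural vegetation, as integers
p    <- plogis(-20 + 5 * bio1 - 0.3 * bio1^2)  # assumed "true" proportion of type A
succ <- rbinom(n, size = area, prob = p)       # m^2 covered by vegetation type A
prop <- succ / area                            # observed proportion in [0, 1]

# Form 1: proportion as response, number of "pixels" as prior weights.
fit1 <- glm(prop ~ bio1 + I(bio1^2), family = binomial, weights = area)

# Form 2: the same model written with a (successes, failures) matrix response.
fit2 <- glm(cbind(succ, area - succ) ~ bio1 + I(bio1^2), family = binomial)

# The two parameterisations should give the same coefficient estimates.
print(all.equal(coef(fit1), coef(fit2)))
```

As far as I can tell, the "non-integer #successes" warning is raised when response * weights is not (nearly) a whole number for some observations, which happens with measured proportions and rounded areas. In this sketch prop * area is an integer by construction, so the warning should not appear; with real data, something like round(prop * area) as the success count in the second form would be one way to avoid it.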
As a start, I tried to do the same with boosting as I did with glms. Here are two examples.

With mboost:

    index <- !is.na(ossz_ujstand2$k2_stand)   # I need this to remove NAs
    proba.bb2 <- blackboost(k2_stand ~ BIO_1 + BIO_12, data = ossz_ujstand2[index, ],
                            weights = ossz_ujstand2$weights[index], family = Binomial())

    Error in fam...@check_y(y) :
      response is not a factor but 'family = Binomial()'

With gbm, using the modified code of Elith et al. 2008, Journal of Animal Ecology:

    index <- !is.na(ossz_ujstand2$k2_stand)
    k2.tc5.lr01 <- gbm.step(data = ossz_ujstand2[index, ], gbm.x = 50:147, gbm.y = 27,
                            family = "bernoulli", tree.complexity = 5, learning.rate = 0.1,
                            bag.fraction = 0.75, weights = ossz_ujstand2$weights)

    Error in gbm.fit(x, y, offset = offset, distribution = distribution, w = w, :
      Bernoulli requires the response to be in {0,1}

So obviously the solution with weights does not work here. Is there a straightforward way to model my response with the prefabricated families, or do I have to write a new loss function? I understand that this is possible in mboost, but I would greatly appreciate support on how to do it. I am also uncertain about what type of link I should use for my data.

Thank you very much!

Imelda Somodi
Assistant research fellow
Institute of Ecology and Botany
Hungarian Academy of Science