Hi,

My data are characterized by many zeros (82%) and overdispersion. I have
chosen to model them with hurdle regression (pscl package) with a negative
binomial distribution for the count data. In an effort to validate the
model I would like to calculate the RMSE of the predicted vs. the observed
values. From my reading I understand that this is calculated on the raw
residuals from the fitted model. This is the formula I used:

H1.RMSE <- sqrt(mean(H1$residuals^2))     # H1 is my fitted hurdle model
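
For reference, here is a minimal sketch of the same calculation on a stand-in data set. It assumes the bioChemists data that ships with pscl (not the data described in this post) and a hypothetical model object H1_demo. It only checks that the stored raw residuals are the observed counts minus the fitted means, so both routes should give the same RMSE.

```r
library(pscl)

# Stand-in data: bioChemists ships with pscl; it is NOT the data from this post.
data("bioChemists", package = "pscl")

# Hypothetical model, analogous to H1: negative binomial hurdle.
H1_demo <- hurdle(art ~ ., data = bioChemists, dist = "negbin")

# RMSE from the stored raw residuals (observed minus fitted mean).
rmse_resid <- sqrt(mean(H1_demo$residuals^2))

# The same quantity computed explicitly from the fitted means.
mu_hat <- predict(H1_demo, type = "response")
rmse_pred <- sqrt(mean((bioChemists$art - mu_hat)^2))

print(c(residuals = rmse_resid, predictions = rmse_pred))
```

Because type = "response" returns an expected count for each observation, every element of mu_hat is positive even where the observed count is zero.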

I get 46.7 as the RMSE. This seems high to me based on the model results.
Assuming my formula and my understanding of RMSE are correct (and please
correct me if I am wrong), I question whether RMSE is an appropriate
validation measure for this kind of model. The hurdle model
correctly predicts all of my zeros. The predictions I get from the fitted
model are all values greater than zero. From my readings I understand that
the predictions from the fitted hurdle model are means generated for the
particular covariate environment based on the model coefficients. If this
is truly the case it does not make sense to compare these means to the
observations. This will generate large residuals (only 18% of the
observations contain counts greater than 0, while the predicted counts all
exceed 0). It seems like comparing apples to oranges. Other correlation
measures (Pearson's r, Spearman's rho) would likewise compare the mean
predicted value for a particular covariate setting to the observed values,
which again are heavily dominated by zeros.
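
One possible way around the mean-versus-observation mismatch is to compare the predicted distribution of counts with the observed one, rather than comparing per-observation means. The sketch below is an illustration under assumptions, not a recommended procedure: it again uses the bioChemists stand-in data from pscl and a hypothetical H1_demo fit, and relies on predict() with type = "prob", which returns predicted probabilities for the counts 0, 1, ..., max(y).

```r
library(pscl)

# Stand-in data and hypothetical model; replace with your own data and fit.
data("bioChemists", package = "pscl")
H1_demo <- hurdle(art ~ ., data = bioChemists, dist = "negbin")

y <- bioChemists$art

# One row per observation, one column per count value 0, 1, ..., max(y).
p_hat <- predict(H1_demo, type = "prob")

# Expected frequency of each count: sum of predicted probabilities per column.
expected <- unname(colSums(p_hat))

# Observed frequency of each count, including counts that never occur.
observed <- as.vector(table(factor(y, levels = 0:max(y))))

comparison <- data.frame(count = 0:max(y),
                         observed = observed,
                         expected = round(expected, 1))
print(head(comparison, 8))
```

The first row compares the observed and expected number of zeros; the remaining rows show how well the count component reproduces the positive counts.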

Any tips on how best to validate hurdle models in R?

Thanks

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.