Re: [R] Zero-inflated regression models: predicting no 0s

Achim Zeileis Wed, 01 Jun 2011 00:27:51 -0700

On Tue, 31 May 2011, Jean-Simon Michaud wrote:

Hi all,


First post for me here, but I have been reading on the forum for almost two
years now. Thanks to everyone who contributed btw!

I have a dataset of 4000 observations of count of a mammal and I am trying
to predict abundance from a inflated-zero model as there is quite a bit of
zeros in the response variable.

I have tried multiple options, but I might do something wrong as everytime I look at the fitted values it do not comprise any 0.

Yes, the fitted() method and also the default of the predict() method isto compute fitted _means_. And the mean of a count distribution willalways be non-zero.

Moreover, even when you round the fitted values to integers, this does_not_ lead to the most likely count. Consider the following simpleexamples:


Probability density for a Poisson distribution with mean 0.8

R> dpois(0:5, lambda = 0.8)
[1] 0.449329 0.359463 0.143785 0.038343 0.007669 0.001227

i.e., zero is still the most likely outcome with a probability of 45% eventhough the mean is 0.8. And for negative binomial distributions, this canbe even more extreme. The probability density for a geometric distribution(negative binomial with size = 1) and mean 2:


R> dnbinom(0:5, mu = 2, size = 1)
[1] 0.33333 0.22222 0.14815 0.09877 0.06584 0.04390

i.e., despite the mean of 2, zero is still the most likely outcome.

You can get the predicted probabilities for all observations via

predict(hurdle1, type = "prob") and predict(zip1A, type = "prob"),respectively. Given the results for your negative binomial hurdle model, Isuspect that the zero-inflated Poisson fit will have a nonsatisfactory fitfor the zeros, even though the predicted means of the two models aresimilar.

See also the paper accompanying the count regression functions in "pscl"(not "lpsc"):


  http://www.jstatsoft.org/v27/i08/

Finally, a short comment on your model formula:

  hurdle(TOT ~ LC80 + LC231 + DEM, data = mydata_purge, ...)

will be easier to read and less confusing (after all "data = food" doesnot appear to be used at all).


hth,
Z

Here is what I tried so far:

"
## - hurdle from the package (lpsc) - ##

hurdle1 = hurdle(formula = mydata_purge2$TOT ~ mydata_purge2$LC80 +

mydata_purge2$LC231 + mydata_purge2$DEM, data = food, dist = "negbin",
zero.dist = "binomial")

summary(hurdle1)


Call:
hurdle(formula = mydata_purge2$TOT ~ mydata_purge2$LC80 +
mydata_purge2$LC231 + mydata_purge2$DEM, data = food,
   dist = "negbin", zero.dist = "binomial")

Pearson residuals:
   Min      1Q  Median      3Q     Max
-1.0833 -0.7448 -0.2801  0.4296  6.7242

Count model coefficients (truncated negbin with log link):
                     Estimate Std. Error z value Pr(>|z|)
(Intercept)          1.7841678  0.0923781  19.314  < 2e-16 ***
mydata_purge2$LC80  -2.5929984  0.4184956  -6.196 5.79e-10 ***
mydata_purge2$LC231  0.2154269  0.1171259   1.839 0.065875 .
mydata_purge2$DEM    0.0007708  0.0002064   3.735 0.000188 ***
Log(theta)           0.3742602  0.0390319   9.589  < 2e-16 ***

Zero hurdle model coefficients (binomial with logit link):
                     Estimate Std. Error z value Pr(>|z|)
(Intercept)          0.0602347  0.2302370   0.262 0.793614
mydata_purge2$LC80  -3.0590108  0.8360020  -3.659 0.000253 ***
mydata_purge2$LC231  1.7754441  0.3226731   5.502 3.75e-08 ***
mydata_purge2$DEM    0.0031943  0.0005307   6.020 1.75e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Theta: count = 1.4539
Number of iterations in BFGS optimization: 12
Log-likelihood: -1.251e+04 on 9 Df


## - zeroinfl from the package (lpsc) - ##

zip1A = zeroinfl(mydata_purge2$TOT ~ mydata_purge2$LC80 +

mydata_purge2$LC231 + mydata_purge2$DEM, data = food)

summary(zip1A)


Call:
zeroinfl(formula = mydata_purge2$TOT ~ mydata_purge2$LC80 +
mydata_purge2$LC231 + mydata_purge2$DEM, data = food)

Pearson residuals:
   Min      1Q  Median      3Q     Max
-2.2128 -1.2886 -0.5010  0.7594 11.8458

Count model coefficients (poisson with log link):
                     Estimate Std. Error z value Pr(>|z|)
(Intercept)          1.894e+00  3.547e-02  53.401  < 2e-16 ***
mydata_purge2$LC80  -2.249e+00  1.768e-01 -12.725  < 2e-16 ***
mydata_purge2$LC231  1.799e-01  4.492e-02   4.005 6.21e-05 ***
mydata_purge2$DEM    6.670e-04  7.687e-05   8.678  < 2e-16 ***

Zero-inflation model coefficients (binomial with logit link):
                     Estimate Std. Error z value Pr(>|z|)
(Intercept)         -0.0593751  0.2308068  -0.257 0.796986
mydata_purge2$LC80   2.9428092  0.8523669   3.453 0.000555 ***
mydata_purge2$LC231 -1.7772101  0.3233166  -5.497 3.87e-08 ***
mydata_purge2$DEM   -0.0031901  0.0005319  -5.997 2.01e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Number of iterations in BFGS optimization: 13
Log-likelihood: -1.727e+04 on 8 Df

a1 = predict(zip1A)
b1 = mydata_purge2$TOT
plot(a1,b1)


"

Please find attached the plot of zip1A (which look quite similar to the
hurdle1).

Your help would be much appreciated,

Thanks,

JM.


______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: [R] Zero-inflated regression models: predicting no 0s

Reply via email to