Re: [R] bigglm() results different from glm()

utkarshsinghal Fri, 03 Jul 2009 06:27:49 -0700

Hi Sir,

Thanks for making the package available to us. I am facing a few problems and 
would appreciate some hints:


Problem-1:
The model summary and residual deviance now match (see the mail below), but 
I don't understand why the AIC is still different.

 > AIC(m1)
[1] 532965

 > AIC(m1big_longer)
[1] 101442.9
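
One possible explanation, offered as an assumption rather than something confirmed in this thread: glm() reports AIC as -2 * logLik + 2 * p, whereas the bigglm figure above (101442.9) looks like residual deviance + 2 * p. For a Poisson model the two differ by a constant that depends only on the data (minus twice the saturated log-likelihood). The sketch below uses only base R to show that constant for a glm() fit; it does not inspect how biglm computes AIC.

```r
# Base-R sketch: log-likelihood AIC vs. a deviance-based AIC for a Poisson glm.
# The data mimic the toy example quoted later in this thread; the seed is arbitrary.
set.seed(1)
dat <- data.frame(y = c(rpois(50000, 10), rpois(50000, 15)),
                  ttment = gl(2, 50000))
m1 <- glm(y ~ ttment, data = dat, family = poisson(link = "log"))

p <- length(coef(m1))

# What AIC(m1) reports for a glm object.
aic_loglik <- -2 * as.numeric(logLik(m1)) + 2 * p

# Deviance-based variant (ASSUMPTION: this is what the bigglm number reflects).
aic_deviance <- deviance(m1) + 2 * p

# -2 * log-likelihood of the saturated model (each mean set to its own y).
sat_const <- -2 * sum(dpois(dat$y, lambda = dat$y, log = TRUE))

# The two AIC definitions differ by exactly this data-only constant.
c(aic_loglik = aic_loglik, aic_deviance = aic_deviance, sat_const = sat_const)
```

Because the constant is the same for every model fitted to the same response, either definition ranks models identically; only the absolute numbers differ.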


Problem-2:
The chunksize argument exists in bigglm but not in biglm; consequently, 
update.biglm exists, but update.bigglm does not.
Is my observation correct? If so, why is there this difference?
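
For illustration, the two interfaces mentioned above can be sketched as follows. This is a minimal example on simulated data (the variable names, seed, and chunk sizes are made up). The comment on why the interfaces differ is one plausible reading and is not confirmed in this thread: a linear model can be accumulated in a single pass, while a GLM is fitted iteratively and needs every chunk again on each iteration.

```r
library(biglm)

# Simulated data, split into three chunks (all names here are illustrative).
set.seed(42)
n <- 3000
dat <- data.frame(x = rnorm(n))
dat$y <- 1 + 2 * dat$x + rnorm(n)
chunks <- split(dat, rep(1:3, each = n / 3))

# biglm: fit on the first chunk, then feed the remaining chunks via update().
fit <- biglm(y ~ x, data = chunks[[1]])
for (chunk in chunks[-1]) {
  fit <- update(fit, chunk)
}

# bigglm: pass the whole data source and let chunksize control the chunking,
# since each iteration of the GLM fit has to pass over all of the data.
fit_glm <- bigglm(y ~ x, data = dat, family = gaussian(), chunksize = 1000)

coef(fit)
coef(fit_glm)
```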


Regards
Utkarsh

From: Thomas Lumley <tlumley_at_u.washington.edu>
Date: Tue, 17 Mar 2009 00:50:20 -0700 (PDT)

This is a surprisingly interesting problem that took a while to debug, 
because the computations all seemed correct.

Your model hasn't converged yet. You can get the right answer either by 
running longer:

 > summary(m1big_longer)
Large data regression model: bigglm(y ~ ttment, data = dat, family = poisson(link = "log"),
    chunksize = 100000, maxit = 20)
Sample size = 100000
             Coef  (95%   CI)    SE p
(Intercept) 2.304 2.301 2.307 0.001 0
ttment2     0.405 0.401 0.408 0.002 0

or supplying starting values:

 > summary(m1big_started)
Large data regression model: bigglm(y ~ ttment, data = dat, family = poisson(link = "log"),
    chunksize = 100000, start = c(2, 0))
Sample size = 100000
             Coef  (95%   CI)    SE p
(Intercept) 2.304 2.301 2.307 0.001 0
ttment2     0.405 0.401 0.408 0.002 0
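
For reference, the two fits summarised above can be reproduced roughly as follows. The calls are taken from the summary output; the seed is an arbitrary addition, so the simulated data and the printed digits will not match the thread exactly.

```r
library(biglm)

# Toy data as in the original post (the seed is an assumption, for repeatability).
set.seed(2009)
dat <- data.frame(y = c(rpois(50000, 10), rpois(50000, 15)),
                  ttment = gl(2, 50000))

# Option 1: allow more iterations than the default.
m1big_longer <- bigglm(y ~ ttment, data = dat, family = poisson(link = "log"),
                       chunksize = 100000, maxit = 20)

# Option 2: keep the default iteration limit but start near the solution.
m1big_started <- bigglm(y ~ ttment, data = dat, family = poisson(link = "log"),
                        chunksize = 100000, start = c(2, 0))

# Both should land near log(10) and log(15 / 10) on this data.
coef(m1big_longer)
coef(m1big_started)
```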

The bug is that you weren't told about the lack of convergence. There is 
a flag in the object, but it is only set once the model has converged, 
so it is absent when convergence fails.

 > m1big$converged
NULL
 > m1big_longer$converged
[1] TRUE
 > m1big_started$converged
[1] TRUE

For the next version I will make sure there is a clear 
warning when the model hasn't converged. The default maximum number of 
iterations is fairly small, by design --- if it isn't working, you want 
to find out and specify starting values rather than wait for dozens of 
potentially slow iterations. This strategy obviously breaks down when 
you don't notice that failure. :(
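
Until such a warning exists, a defensive check along these lines may help. This is a sketch only: has_converged() is a hypothetical helper, not part of biglm, and the list objects are stand-ins for fitted models, used to mimic the flag values printed above.

```r
# Hypothetical helper (not part of biglm): a missing or NULL flag counts as
# "not converged", because isTRUE(NULL) is FALSE.
has_converged <- function(fit) isTRUE(fit$converged)

# Stand-ins for fitted objects, mimicking the cases shown above.
flag_missing <- list(coefficients = c(2.651, 4.346))          # no flag at all
flag_true    <- list(coefficients = c(2.304, 0.405), converged = TRUE)

# Stop early instead of silently using an unconverged fit.
check_fit <- function(fit) {
  if (!has_converged(fit)) {
    stop("model did not converge; increase maxit or supply start values")
  }
  invisible(fit)
}

has_converged(flag_missing)
has_converged(flag_true)
```

With a real fit, check_fit(m1big) would be called directly after bigglm() and before summary().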

      -thomas

On Mon, 16 Mar 2009, Francisco J. Zagmutt wrote:

> Dear all,
>
> I am using bigglm() from the biglm package to fit a few GLMs to a large
> dataset (3 million rows, 6 columns). While trying to fit a Poisson GLM I
> noticed that the coefficient estimates were very different from what I
> obtained when estimating the model on a smaller dataset using glm(). I wrote
> a very basic toy example to compare the results of bigglm() against a glm()
> call. Consider the following code:
>
>
>> require(biglm)
>> options(digits=6, scipen=3, contrasts = c("contr.treatment", "contr.poly"))
>> dat=data.frame(y =c(rpois(50000, 10),rpois(50000, 15)), ttment=gl(2,50000))
>> m1 <- glm(y~ttment, data=dat, family=poisson(link="log"))
>> m1big <- bigglm(y~ttment , data=dat, family=poisson(link="log"))
>> summary(m1)
>
> <snipped output for this email>
> Coefficients:
>             Estimate Std. Error z value Pr(>|z|)
> (Intercept)  2.30305    0.00141    1629   <2e-16 ***
> ttment2      0.40429    0.00183     221   <2e-16 ***
>
>     Null deviance: 151889 on 99999 degrees of freedom
> Residual deviance: 101848 on 99998 degrees of freedom
> AIC: 533152
>
>> summary(m1big)
> Large data regression model: bigglm(y ~ ttment, data = dat, family =
> poisson(link = "log"))
> Sample size = 100000
>              Coef  (95%   CI)    SE p
> (Intercept) 2.651 2.650 2.653 0.001 0
> ttment2     4.346 4.344 4.348 0.001 0
>
>> m1big$deviance
> [1] 287158986
>
>
> Notice that the coefficients and deviance are quite different in the model
> estimated using bigglm(). If I change the chunk to seq(1000,10000,1000) the
> estimates remain the same.
/> /
> Can someone help me understand what is causing these differences?
>
> Here is my version info:
>
>> version
>                _
> platform       i386-pc-mingw32
> arch           i386
> os             mingw32
> system         i386, mingw32
> status
> major          2
> minor          8.1
> year           2008
> month          12
> day            22
> svn rev        47281
> language       R
> version.string R version 2.8.1 (2008-12-22)
>
>
> Many thanks in advance for your help,
>
> Francisco
>
> --
> Francisco J. Zagmutt
> Vose Consulting
> 2891 20th Street
> Boulder, CO, 80304
> USA
> www.voseconsulting.com
>
> ______________________________________________
> R-help_at_r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

Thomas Lumley                   Assoc. Professor, Biostatistics
tlumley_at_u.washington.edu     University of Washington, Seattle


