Thanks again for your replies and for making this library available to the community.
Francisco -- Francisco J. Zagmutt Vose Consulting 2891 20th Street Boulder, CO, 80304 USA www.voseconsulting.com Thomas Lumley wrote:
There's a digits= argument to the print method.Large data regression model: bigglm(ff, data = trees, chunksize = 10, sandwich = TRUE)a <- bigglm(ff,data=trees, chunksize=10, sandwich=TRUE) print(summary(a),digits=5)Sample size = 31 Coef (95% CI) SE p (Intercept) -6.63162 -8.08729 -5.17595 0.72783 0 log(Girth) 1.98265 1.87132 2.09398 0.05567 0 log(Height) 1.11712 0.73339 1.50085 0.19186 0 Sandwich (model-robust) standard errorsI suppose I should make it take its default from options(digits)-3 or something.-thomas On Tue, 17 Mar 2009, Francisco J. Zagmutt wrote:Dear Thomas and John,Thanks for your prompt reply and for looking at your code to explain these differences. I see you replied very late at night, so I am sorry if my little problem kept you awake!As you pointed out, the model indeed converges (as reported in model$converged) once I specify a larger number of iterations.A very minor comment: it seems that the reporting of the estimates in summary.biglm() is not taking the parameters from options(digits). For example, using the same data and models as before:require(biglm) options(digits=8)dat=data.frame(y =c(rpois(50000, 10),rpois(50000, 15)), ttment=gl(2,50000))m1 <- glm(y~ttment, data=dat, family=poisson(link="log"))m1big <- bigglm(y~ttment , data=dat, family=poisson(link="log"), maxit=20)summary(m1)<Snipped> Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) 2.3019509 0.0014147 1627.21 < 2.2e-16 *** ttment2 0.4052002 0.0018264 221.86 < 2.2e-16 ***summary(m1big)Coef (95% CI) SE p (Intercept) 2.302 2.299 2.305 0.001 0 ttment2 0.405 0.402 0.409 0.002 0To get more digits I can extract the point estimates using coef(m1big), but after looking at str(m1big) the only way I could figure to extract the p-values was:summary(m1big)$mat[,"p"](Intercept) ttment2 0 0 Is there a way I can get summary.biglm() to report more digits directly? Thanks again, Francisco -- Francisco J. Zagmutt Vose Consulting 2891 20th Street Boulder, CO, 80304 USA www.voseconsulting.com Thomas Lumley wrote:Yes, the slow convergence is easier to get with the log link. Overshooting the correct coefficient vector has more dramatic effects on the fitted values and weights, and in this example the starting value of (0,0) is a long way from the truth. The same sort of thing happens in the Cox model, where there are real data sets that will cause numeric overflow in a careless implementation.It might be worth trying to guess better starting values: saving an iteration or two would be useful with large data sets.-thomas On Tue, 17 Mar 2009, John Fox wrote:Dear Francisco, I was able to duplicate the problem that you reported, and in addition discovered that the problem seems to be peculiar to the poisson family. lm(y~ttment, data=dat) and biglm(y~ttment, data=dat) produce identical results, as do glm(y~ttment, data=dat) and bigglm(y~ttment, data=dat). Another example, with the binomial family:set.seed(12345) dat=data.frame(y =c(rpois(50000, 10),rpois(50000, 15)),ttment=gl(2,50000))dat$y0 <- ifelse(dat$y > 12, 1, 0) m1b <- glm(y0~ttment, data=dat, family=binomial) m1bigb <- bigglm(y0~ttment , data=dat, family=binomial()) coef(m1b)(Intercept) ttment2 -1.33508 2.34263coef(m1bigb)(Intercept) ttment2 -1.33508 2.34263deviance(m1b)[1] 109244deviance(m1bigb)[1] 109244I'm copying this message to Thomas Lumley, who's the author and maintainerof the biglm package. Regards, John-----Original Message-----From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org]OnBehalf Of Francisco J. Zagmutt Sent: March-16-09 10:26 PM To: r-h...@stat.math.ethz.ch Subject: [R] bigglm() results different from glm() Dear all, I am using the bigglm package to fit a few GLM's to a large dataset (3 million rows, 6 columns). While trying to fit a Poisson GLM I noticedthat the coefficient estimates were very different from what I obtainedwhen estimating the model on a smaller dataset using glm(), I wrote a very basic toy example to compare the results of bigglm() against a glm() call. Consider the following code: > require(biglm) > options(digits=6, scipen=3, contrasts = c("contr.treatment", "contr.poly")) > dat=data.frame(y =c(rpois(50000, 10),rpois(50000, 15)), ttment=gl(2,50000)) > m1 <- glm(y~ttment, data=dat, family=poisson(link="log")) > m1big <- bigglm(y~ttment , data=dat, family=poisson(link="log")) > summary(m1) <snipped output for this email> Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) 2.30305 0.00141 1629 <2e-16 *** ttment2 0.40429 0.00183 221 <2e-16 *** Null deviance: 151889 on 99999 degrees of freedom Residual deviance: 101848 on 99998 degrees of freedom AIC: 533152 > summary(m1big) Large data regression model: bigglm(y ~ ttment, data = dat, family = poisson(link = "log")) Sample size = 100000 Coef (95% CI) SE p (Intercept) 2.651 2.650 2.653 0.001 0 ttment2 4.346 4.344 4.348 0.001 0 > m1big$deviance [1] 287158986 Notice that the coefficients and deviance are quite different in the model estimated using bigglm(). If I change the chunk to seq(1000,10000,1000) the estimates remain the same. Can someone help me understand what is causing these differences? Here is my version info: > version _ platform i386-pc-mingw32 arch i386 os mingw32 system i386, mingw32 status major 2 minor 8.1 year 2008 month 12 day 22 svn rev 47281 language R version.string R version 2.8.1 (2008-12-22) Many thanks in advance for your help, Francisco -- Francisco J. Zagmutt Vose Consulting 2891 20th Street Boulder, CO, 80304 USA www.voseconsulting.com ______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guidehttp://www.R-project.org/posting-guide.htmland provide commented, minimal, self-contained, reproducible code.______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-helpPLEASE do read the posting guide http://www.R-project.org/posting-guide.htmland provide commented, minimal, self-contained, reproducible code.Thomas Lumley Assoc. Professor, Biostatistics tlum...@u.washington.edu University of Washington, Seattle ______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-helpPLEASE do read the posting guide http://www.R-project.org/posting-guide.htmland provide commented, minimal, self-contained, reproducible code.Thomas Lumley Assoc. Professor, Biostatistics tlum...@u.washington.edu University of Washington, Seattle ______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-helpPLEASE do read the posting guide http://www.R-project.org/posting-guide.htmland provide commented, minimal, self-contained, reproducible code.
______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.