Re: [R] bigglm() results different from glm()+Another question

utkarshsinghal Mon, 06 Jul 2009 23:16:32 -0700

Trust me, I am using the same total data; even the chunk sizes are 
all equal. I also cross-checked by manually creating the chunks and 
updating as in the example given on the biglm help page.
 > ?biglm



Regards
Utkarsh



Greg Snow wrote:
>
> Are you sure that you are fitting all the models on the same total 
> data?  A first glance looks like you may be including more data in 
> some of the chunk sizes, or be producing an error that update does not 
> know how to deal with.
>
>  
>
> -- 
> Gregory (Greg) L. Snow Ph.D.
> Statistical Data Center
> Intermountain Healthcare
> greg.s...@imail.org
> 801.408.8111
>
>  
>
> *From:* utkarshsinghal [mailto:utkarsh.sing...@global-analytics.com]
> *Sent:* Monday, July 06, 2009 8:58 AM
> *To:* Thomas Lumley; Greg Snow
> *Cc:* r help
> *Subject:* Re: [R] bigglm() results different from glm()+Another question
>
>  
>
>
> The AIC of the biglm models depends strongly on the chunk size 
> selected (example provided below). I half expected this, because I 
> assume the model error increases with the number of chunks.
>
> It would be helpful to have your opinion on comparing different 
> models in such cases:
>
>     * Can I compare two models fitted with different chunk sizes, or
>       should I always use the same chunk size?
>
>     * I am not going to use AIC in my model selection, but I think
>       the other model parameters will vary in the same way. Am I
>       right?
>
>     * What would be the ideal chunk size? Should it be the maximum
>       that R and my system's RAM can handle?
>
> Any comments will be helpful.
>
>
> *Example of AIC variation with chunksize:*
>
> I ran the following code on my data, which has 10000 observations and 
> 3 independent variables:
>
> > chunksize = 500
> > fit = biglm(y~x1+x2+x3, data=xx[1:chunksize,])
> > for(i in seq(chunksize,10000,chunksize)) fit=update(fit, data=xx[(i+1):(i+chunksize),])
> > AIC(fit)
> [1] 30647.79
>
> Here are the AIC for other chunksizes:
> chunksize    AIC
> 500          30647.79
> 1000         29647.79
> 2000         27647.79
> 2500         26647.79
> 5000         21647.79
> 10000        11647.79
>
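A note on the loop quoted above: `seq(chunksize,10000,chunksize)` ends at i = 10000, so the last update asks for rows past the end of the data. The sketch below shows one way to build chunk indices that cover every row exactly once for any chunk size. The data frame `xx` here is simulated and the helper names `chunk_rows` and `fit_chunked` are made up for illustration; the biglm part runs only if the package is installed.

```r
# Simulated stand-in for the poster's data: 10000 rows, 3 predictors.
set.seed(1)
n  <- 10000
xx <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
xx$y <- 1 + xx$x1 - 0.5 * xx$x2 + 0.25 * xx$x3 + rnorm(n)

# Row indices for each chunk; the last chunk is clipped at n.
chunk_rows <- function(n, chunksize) {
  starts <- seq(1, n, by = chunksize)
  lapply(starts, function(s) s:min(s + chunksize - 1, n))
}

# The quoted loop: the final value of i is n, so the final update
# requests rows (n + 1):(n + chunksize).
chunksize <- 500
i_values  <- seq(chunksize, n, chunksize)
max(i_values) + chunksize   # 10500, but xx has only 10000 rows

if (requireNamespace("biglm", quietly = TRUE)) {
  # Fit on the first chunk, then feed the remaining chunks one at a time.
  fit_chunked <- function(chunksize) {
    rows <- chunk_rows(n, chunksize)
    fit  <- biglm::biglm(y ~ x1 + x2 + x3, data = xx[rows[[1]], ])
    for (r in rows[-1]) fit <- update(fit, xx[r, ])
    fit
  }
  print(coef(fit_chunked(500)))
  print(coef(fit_chunked(2500)))
}
```

With indices built this way, each row enters the fit once regardless of the chunk size, which makes it easier to check whether any remaining differences come from the data or from the fitting code.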
>
> Regards
> Utkarsh
>
>
>
>
> utkarshsinghal wrote:
>
> Thank you Mr. Lumley and Mr. Greg. That was helpful.
>
> Regards
> Utkarsh
>
>
>
> Thomas Lumley wrote:
>
> On Fri, 3 Jul 2009, utkarshsinghal wrote:
>
>
>
> Hi Sir,
>
> Thanks for making the package available to us. I am facing a few 
> problems and would appreciate some hints:
>
> Problem-1:
> The model summary and residual deviance matched (in the mail below) 
> but I didn't understand why AIC is still different.
>
>
> > AIC(m1)
> [1] 532965
>
> > AIC(m1big_longer)
> [1] 101442.9
>
>
> That's because AIC.default uses the unnormalized loglikelihood and 
> AIC.biglm uses the deviance.  Only differences in AIC between models 
> are meaningful, not individual values.
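The point about constants can be checked in base R. This is a minimal sketch on simulated Poisson data; `aic_dev` is a hand-written stand-in for a deviance-based AIC, not the code biglm actually uses. For a Poisson glm the two versions differ by a quantity that depends only on the response, so it cancels when two models fitted to the same data are compared.

```r
set.seed(42)
d <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
d$y <- rpois(200, lambda = exp(0.3 + 0.5 * d$x1))

m_small <- glm(y ~ x1,      family = poisson(), data = d)
m_large <- glm(y ~ x1 + x2, family = poisson(), data = d)

# Deviance-based AIC: deviance + 2 * (number of estimated coefficients).
aic_dev <- function(m) deviance(m) + 2 * length(coef(m))

# The individual values disagree ...
c(loglik_based = AIC(m_small), deviance_based = aic_dev(m_small))

# ... but the difference between the two models is the same either way.
AIC(m_large) - AIC(m_small)
aic_dev(m_large) - aic_dev(m_small)
```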
>
>
>
> Problem-2:
> The chunksize argument exists in bigglm but not in biglm; 
> consequently, update.biglm exists but update.bigglm does not.
> Is my observation correct? If yes, why is there this difference?
>
>
> Because update.bigglm is impossible.
>
> Fitting a glm requires iteration, which means that it requires 
> multiple passes through the data. Fitting a linear model requires only 
> a single pass. update.biglm can take a fitted or partially fitted 
> biglm and add more data. To do the same thing for a bigglm you would 
> need to start over again from the beginning of the data set.
>
> To fit a glm, you need to specify a data source that bigglm() can 
> iterate over.  You do this with a function that can be called 
> repeatedly to return the next chunk of data.
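A rough sketch of such a data function, assuming the interface described in the bigglm help page: a function of one argument `reset` that rewinds when `reset = TRUE` and otherwise returns the next chunk, or NULL when the data are exhausted. The data are simulated and `make_chunk_reader` is a made-up helper name; in real use the reader would pull chunks from a file or database rather than from a data frame already in memory. The bigglm call runs only if biglm is installed.

```r
# Simulated Poisson data standing in for a data set too large to load at once.
set.seed(3)
n <- 5000
d <- data.frame(x1 = rnorm(n))
d$y <- rpois(n, lambda = exp(0.2 + 0.4 * d$x1))

# Returns a closure that remembers its position in the data.
make_chunk_reader <- function(df, chunksize) {
  pos <- 0
  function(reset = FALSE) {
    if (reset) {
      pos <<- 0                      # rewind for the next pass
      return(invisible(NULL))
    }
    if (pos >= nrow(df)) return(NULL)  # no more data in this pass
    rows <- (pos + 1):min(pos + chunksize, nrow(df))
    pos <<- pos + length(rows)
    df[rows, , drop = FALSE]
  }
}

reader <- make_chunk_reader(d, chunksize = 1000)

if (requireNamespace("biglm", quietly = TRUE)) {
  # maxit caps the number of passes through the data.
  fit_big <- biglm::bigglm(y ~ x1, data = reader, family = poisson(), maxit = 20)
  fit_glm <- glm(y ~ x1, family = poisson(), data = d)
  print(rbind(bigglm = coef(fit_big), glm = coef(fit_glm)))
}
```

Keeping the position inside a closure means the same reader can be rewound as many times as the iteration needs.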
>
>       -thomas
>
> Thomas Lumley            Assoc. Professor, Biostatistics
> tlum...@u.washington.edu <mailto:tlum...@u.washington.edu>    
> University of Washington, Seattle
>
>
>
>
> I don't know why the AIC is different, but remember that there are 
> multiple definitions for AIC (generally differing in the constant 
> added) and it may just be a difference in the constant, or it could be 
> that you have not fit the whole dataset (based on your other question).
>
> For an lm model biglm only needs to make a single pass through the 
> data.  This was the first function written for the package and the 
> update mechanism was an easy way to write the function (and still 
> works well).
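To see why a single pass is enough for a linear model, here is a small base R sketch. Note that it accumulates the cross-products X'X and X'y chunk by chunk and solves the normal equations at the end; that is a simpler technique swapped in for illustration, not the incremental QR update that biglm itself uses. The data are simulated.

```r
set.seed(7)
n <- 1000
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- 2 + 3 * d$x1 - d$x2 + rnorm(n)

XtX <- matrix(0, 3, 3)   # running sum of X'X
Xty <- numeric(3)        # running sum of X'y

# One pass over the data, 250 rows at a time.
for (start in seq(1, n, by = 250)) {
  chunk <- d[start:min(start + 249, n), ]
  X     <- model.matrix(~ x1 + x2, chunk)
  XtX   <- XtX + crossprod(X)
  Xty   <- Xty + drop(crossprod(X, chunk$y))
}

beta_chunked <- drop(solve(XtX, Xty))
beta_full    <- coef(lm(y ~ x1 + x2, data = d))

rbind(chunked = unname(beta_chunked), full = unname(beta_full))
```

Each chunk only adds to the running sums, so a chunk can be discarded once processed. A glm has no equivalent shortcut because the weights change at every iteration, which is why it has to revisit the data.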
>
> The bigglm function came later and the models other than Gaussian 
> require multiple passes through the data so instead of the update 
> mechanism that biglm uses, bigglm requires the data argument to be a 
> function that returns the next chunk of data and can restart to the 
> beginning of the dataset.
>
> Also note that the bigglm function does only a few passes through 
> the data by default. This is usually good enough, but in some cases 
> you may need to increase the number of passes (the maxit argument).
>
> Hope this helps,
>
>  
>


