The AIC of a biglm model depends strongly on the chunk size used to fit 
it (example provided below). I can partly understand this, because I 
expect the model error to increase with the number of chunks.

It would be helpful to have your opinion on comparing different models 
in such cases:

    * Can I compare two models fitted with different chunk sizes, or
      should I always use the same chunk size?

    * I am not going to use AIC at all in my model selection, but I
      think any other model parameters will also vary in the same
      way. Am I right?

    * What would be the ideal chunk size? Should it be the largest
      size that R and my system's RAM can handle?

Any comments will be helpful.


*Example of AIC variation with chunksize:*

I ran the following code on my data, which has 10000 observations and 3 
independent variables:

 > chunksize = 500
 > fit = biglm(y~x1+x2+x3, data=xx[1:chunksize,])
 > # update with the remaining chunks only, so no chunk indexes past row 10000
 > for(i in seq_len(10000/chunksize - 1) * chunksize)
 +     fit = update(fit, data=xx[(i+1):(i+chunksize),])
 > AIC(fit)
[1] 30647.79

Here are the AIC values for the other chunk sizes:

chunksize    AIC
500          30647.79
1000         29647.79
2000         27647.79
2500         26647.79
5000         21647.79
10000        11647.79


Regards
Utkarsh




utkarshsinghal wrote:
> Thank you Mr. Lumley and Mr. Greg. That was helpful.
>
> Regards
> Utkarsh
>
>
>
> Thomas Lumley wrote:
>> On Fri, 3 Jul 2009, utkarshsinghal wrote:
>>
>>>
>>> Hi Sir,
>>>
>>> Thanks for making the package available to us. I am facing a few 
>>> problems, and it would help if you could give some hints:
>>>
>>> Problem-1:
>>> The model summary and residual deviance matched (in the mail below) 
>>> but I didn't understand why AIC is still different.
>>>
>>>> AIC(m1)
>>> [1] 532965
>>>
>>>> AIC(m1big_longer)
>>> [1] 101442.9
>>
>> That's because AIC.default uses the unnormalized loglikelihood and 
>> AIC.biglm uses the deviance.  Only differences in AIC between models 
>> are meaningful, not individual values.
>>
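
The point above, that only differences in AIC are meaningful, can be 
checked in base R. This is a minimal sketch and not biglm's own code: 
it assumes a deviance-based criterion of the form deviance + 2p (the 
helper aic_dev is hypothetical) and compares it with AIC() on simulated 
Poisson data. The two definitions differ by an offset that is the same 
for every model fitted to the same data, so the differences between 
models agree.

```r
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- rpois(n, lambda = exp(0.5 + 0.3 * x1))

m_small <- glm(y ~ x1,      family = poisson)
m_large <- glm(y ~ x1 + x2, family = poisson)

# Hypothetical deviance-based criterion: deviance + 2 * (number of coefficients)
aic_dev <- function(fit) deviance(fit) + 2 * length(coef(fit))

# Offset between the two definitions, computed separately for each model
offset_small <- AIC(m_small) - aic_dev(m_small)
offset_large <- AIC(m_large) - aic_dev(m_large)

# Difference between the two models under each definition
diff_loglik   <- AIC(m_large) - AIC(m_small)
diff_deviance <- aic_dev(m_large) - aic_dev(m_small)

print(c(offset_small = offset_small, offset_large = offset_large))
print(c(diff_loglik = diff_loglik, diff_deviance = diff_deviance))
```

The individual values are far apart, but the offset is shared, which is 
why a comparison is only sensible between models scored with the same 
definition on the same data.
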
>>>
>>> Problem-2:
>>> The chunksize argument is there in bigglm but not in biglm; 
>>> consequently, update.biglm is there, but not update.bigglm.
>>> Is my observation correct? If yes, why is there this difference?
>>>
>>
>> Because update.bigglm is impossible.
>>
>> Fitting a glm requires iteration, which means that it requires 
>> multiple passes through the data. Fitting a linear model requires 
>> only a single pass. update.biglm can take a fitted or partially 
>> fitted biglm and add more data. To do the same thing for a bigglm you 
>> would need to start over again from the beginning of the data set.
>>
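
The single-pass property of a linear model described above can be 
illustrated in base R. The sketch below is not how biglm is implemented 
(as far as I know, biglm updates a QR decomposition incrementally); 
here the cross-products X'X and X'y are simply accumulated chunk by 
chunk, which is less stable numerically but shows the idea. The helper 
fit_in_chunks and the simulated data are hypothetical.

```r
set.seed(42)
n  <- 1000
xx <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
xx$y <- 1 + 2 * xx$x1 - xx$x2 + 0.5 * xx$x3 + rnorm(n)

# Hypothetical helper: accumulate X'X and X'y one chunk at a time,
# then solve the normal equations once at the end.
fit_in_chunks <- function(formula, data, chunksize) {
  starts <- seq(1, nrow(data), by = chunksize)
  xtx <- 0
  xty <- 0
  for (s in starts) {
    rows  <- s:min(s + chunksize - 1, nrow(data))
    chunk <- data[rows, , drop = FALSE]
    X <- model.matrix(formula, chunk)
    y <- model.response(model.frame(formula, chunk))
    xtx <- xtx + crossprod(X)     # running sum of X'X
    xty <- xty + crossprod(X, y)  # running sum of X'y
  }
  drop(solve(xtx, xty))
}

f <- y ~ x1 + x2 + x3
coef_100  <- fit_in_chunks(f, xx, chunksize = 100)
coef_250  <- fit_in_chunks(f, xx, chunksize = 250)
coef_full <- coef(lm(f, data = xx))

print(rbind(coef_100, coef_250, coef_full))
```

Each row of data is visited once, and in this sketch the coefficient 
estimates agree with lm() up to rounding whatever chunk size is used.
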
>> To fit a glm, you need to specify a data source that bigglm() can 
>> iterate over.  You do this with a function that can be called 
>> repeatedly to return the next chunk of data.
>>
>>       -thomas
>>
>> Thomas Lumley            Assoc. Professor, Biostatistics
>> tlum...@u.washington.edu    University of Washington, Seattle
>>
>>
>>
>>
>
> I don't know why the AIC is different, but remember that there are 
> multiple definitions for AIC (generally differing in the constant 
> added) and it may just be a difference in the constant, or it could be 
> that you have not fit the whole dataset (based on your other question).
>
> For an lm model biglm only needs to make a single pass through the 
> data.  This was the first function written for the package and the 
> update mechanism was an easy way to write the function (and still 
> works well).
>
> The bigglm function came later and the models other than Gaussian 
> require multiple passes through the data so instead of the update 
> mechanism that biglm uses, bigglm requires the data argument to be a 
> function that returns the next chunk of data and can restart to the 
> beginning of the dataset.
>
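
The kind of data function described above can be sketched as a closure 
over a data frame. make_chunk_reader is a hypothetical name, and the 
reset argument and the NULL return at the end of the data follow my 
reading of the bigglm help page, so please check ?bigglm before relying 
on it.

```r
# Hypothetical helper: returns a function(reset) that serves a data frame
# chunk by chunk and can rewind to the beginning of the data.
make_chunk_reader <- function(data, chunksize) {
  pos <- 0
  function(reset = FALSE) {
    if (reset) {
      pos <<- 0                          # rewind for the next pass
      return(invisible(NULL))
    }
    if (pos >= nrow(data)) return(NULL)  # no more data in this pass
    rows <- (pos + 1):min(pos + chunksize, nrow(data))
    pos <<- pos + length(rows)
    data[rows, , drop = FALSE]
  }
}

set.seed(7)
xx <- data.frame(x1 = rnorm(25), y = rbinom(25, 1, 0.5))
reader <- make_chunk_reader(xx, chunksize = 10)

# One full pass: collect the size of each chunk until NULL is returned
sizes <- integer(0)
while (!is.null(chunk <- reader())) sizes <- c(sizes, nrow(chunk))
print(sizes)

# After a reset the first chunk is served again
reader(reset = TRUE)
first_again <- reader()

# Assumed usage with the biglm package (not run here):
# fit <- bigglm(y ~ x1, data = reader, family = binomial())
```

Because the function can rewind, the fitting routine can make as many 
passes over the data as the iteration needs.
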
> Also note that the bigglm function usually makes only a few passes 
> through the data. This is usually good enough, but in some cases you 
> may need to increase the number of passes.
>
> Hope this helps,



______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
