Trust me, I am using the same total data, and the chunk sizes are all equal. I also cross-checked by manually creating the chunks and updating as in the example given on the biglm help page.

> ?biglm

Regards
Utkarsh

Greg Snow wrote:
> Are you sure that you are fitting all the models on the same total
> data? A first glance looks like you may be including more data in
> some of the chunk sizes, or be producing an error that update does not
> know how to deal with.
>
> --
> Gregory (Greg) L. Snow Ph.D.
> Statistical Data Center
> Intermountain Healthcare
> greg.s...@imail.org
> 801.408.8111
>
> *From:* utkarshsinghal [mailto:utkarsh.sing...@global-analytics.com]
> *Sent:* Monday, July 06, 2009 8:58 AM
> *To:* Thomas Lumley; Greg Snow
> *Cc:* r help
> *Subject:* Re: [R] bigglm() results different from glm()+Another question
>
> The AIC of the biglm models is highly dependent on the size of chunks
> selected (example provided below). This I can somehow expect because
> the model error will increase with the number of chunks.
>
> It will be helpful if you can provide your opinion for comparing
> different models in such cases:
>
>   * can I compare two models fitted with different chunksizes, or
>     should I always use the same chunk size.
>   * although I am not going to use AIC at all in my model selection,
>     but I think any other model parameters will also vary in the
>     same way. Am I right?
>   * what would be the ideal chunksize? should it be the maximum
>     possible size R and my system's RAM is able to handle?
>
> Any comments will be helpful.
>
> *Example of AIC variation with chunksize:*
>
> I ran the following code on my data which has 10000 observations and 3
> independent variables
>
> > chunksize = 500
> > fit = biglm(y~x1+x2+x3, data=xx[1:chunksize,])
> > for(i in seq(chunksize,10000,chunksize)) fit=update(fit, data=xx[(i+1):(i+chunksize),])
> > AIC(fit)
> [1] 30647.79
>
> Here are the AIC for other chunksizes:
>
> chunksize   AIC
>       500   30647.79
>      1000   29647.79
>      2000   27647.79
>      2500   26647.79
>      5000   21647.79
>     10000   11647.79
>
> Regards
> Utkarsh
>
> utkarshsinghal wrote:
> Thank you Mr. Lumley and Mr. Greg.
> That was helpful.
>
> Regards
> Utkarsh
>
> Thomas Lumley wrote:
> On Fri, 3 Jul 2009, utkarshsinghal wrote:
>
> > Hi Sir,
> >
> > Thanks for making the package available to us. I am facing a few
> > problems; it would help if you could give some hints:
> >
> > Problem-1:
> > The model summary and residual deviance matched (in the mail below)
> > but I didn't understand why AIC is still different.
> >
> > > AIC(m1)
> > [1] 532965
> > > AIC(m1big_longer)
> > [1] 101442.9
>
> That's because AIC.default uses the unnormalized loglikelihood and
> AIC.biglm uses the deviance. Only differences in AIC between models
> are meaningful, not individual values.
>
> > Problem-2:
> > The chunksize argument is there in bigglm but not in biglm;
> > consequently, update.biglm is there, but not update.bigglm.
> > Is my observation correct? If yes, why is this difference?
>
> Because update.bigglm is impossible.
>
> Fitting a glm requires iteration, which means that it requires
> multiple passes through the data. Fitting a linear model requires only
> a single pass. update.biglm can take a fitted or partially fitted
> biglm and add more data. To do the same thing for a bigglm you would
> need to start over again from the beginning of the data set.
>
> To fit a glm, you need to specify a data source that bigglm() can
> iterate over. You do this with a function that can be called
> repeatedly to return the next chunk of data.
>
> -thomas
>
> Thomas Lumley Assoc. Professor, Biostatistics
> tlum...@u.washington.edu
> University of Washington, Seattle
>
> I don't know why the AIC is different, but remember that there are
> multiple definitions for AIC (generally differing in the constant
> added) and it may just be a difference in the constant, or it could be
> that you have not fit the whole dataset (based on your other question).
>
> For an lm model biglm only needs to make a single pass through the
> data.
> This was the first function written for the package and the
> update mechanism was an easy way to write the function (and still
> works well).
>
> The bigglm function came later and the models other than Gaussian
> require multiple passes through the data so instead of the update
> mechanism that biglm uses, bigglm requires the data argument to be a
> function that returns the next chunk of data and can restart to the
> beginning of the dataset.
>
> Also note that the bigglm function usually only does a few passes
> through the data, usually this is good enough, but in some cases you
> may need to increase the number of passes.
>
> Hope this helps,

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
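
For anyone following the thread, here is a rough sketch of the kind of data-source function described above: one that returns the next chunk on each call, returns NULL when the data are exhausted, and can restart from the beginning. It assumes the `reset` convention described on the bigglm help page. The names `make_chunk_reader` and `rows_delivered` and the simulated data frame `xx` are made up for illustration and are not part of biglm or of the original poster's data. The same reader can also drive the biglm/update loop, which makes it easy to check that every row is used exactly once whatever the chunk size. Note that the loop as posted earlier in the thread asks for rows 10001:10500 on its last iteration (i = 10000), so a check like this may be worth running first.

```r
# Hypothetical helper (not part of biglm): wrap a data frame in a function
# that hands out successive chunks and supports reset = TRUE.
make_chunk_reader <- function(df, chunksize) {
  pos <- 0L
  function(reset = FALSE) {
    if (reset) {                       # restart from the first row
      pos <<- 0L
      return(invisible(NULL))
    }
    if (pos >= nrow(df)) return(NULL)  # no more data
    rows <- (pos + 1L):min(pos + chunksize, nrow(df))
    pos <<- max(rows)
    df[rows, , drop = FALSE]
  }
}

# Simulated stand-in for the poster's data (assumption: 10000 rows, 3 predictors).
set.seed(1)
n <- 10000
xx <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
xx$y <- 1 + 2 * xx$x1 - xx$x2 + 0.5 * xx$x3 + rnorm(n)

# Count how many rows the reader delivers in one full pass.
rows_delivered <- function(chunksize) {
  reader <- make_chunk_reader(xx, chunksize)
  reader(reset = TRUE)
  total <- 0L
  while (!is.null(chunk <- reader())) total <- total + nrow(chunk)
  total
}
sapply(c(500, 1000, 2500, 3000, 10000), rows_delivered)

# Only runs if biglm is installed.
if (requireNamespace("biglm", quietly = TRUE)) {
  # Linear model: a single pass, first chunk to biglm(), the rest to update().
  reader <- make_chunk_reader(xx, 500)
  fit <- biglm::biglm(y ~ x1 + x2 + x3, data = reader())
  while (!is.null(chunk <- reader())) fit <- update(fit, chunk)

  # bigglm(): pass the reader itself so it can make several passes.
  gfit <- biglm::bigglm(y ~ x1 + x2 + x3,
                        data = make_chunk_reader(xx, 500),
                        family = gaussian())

  print(cbind(lm     = coef(lm(y ~ x1 + x2 + x3, data = xx)),
              biglm  = coef(fit),
              bigglm = coef(gfit)))
}
```

Because the reader stops at nrow(df), the last chunk is simply shorter when the chunk size does not divide the number of rows, and no rows past the end are ever requested.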