That was definitely helpful. Regarding "if you change your seq to:
seq(chunksize, 10000-chunksize, chunksize) then you won't get the error
messages":
Even previously I was not getting any error messages or warnings at all;
"update" removes the extra NA rows silently. For example:

    set.seed(1)
    xx = data.frame(x1=runif(10000,0,10), x2=runif(10000,0,10),
                    x3=runif(10000,0,10))
    xx$y = 3 + xx$x1 + 2*xx$x2 + 3*xx$x3 + rnorm(10000)
    fit1 = biglm(y~x1+x2+x3, data=xx[1:7000,])
    fit2 = update(fit1, moredata=xx[7001:10000,])
    fit3 = update(fit2, moredata=xx[10001:14000,])

Here fit2 and fit3 are exactly the same.

Also, if the number of rows is not a multiple of chunksize, then your
sequence will not accommodate the last, smaller chunk of data.

Regards
Utkarsh

Greg Snow wrote:
> OK, it appears that the problem is the df.resid component of the
> biglm object. Everything else is updated by the update function
> except the df.resid piece, so it is based solely on the initial fit
> and the chunksize used there. The df.resid piece is then used in the
> computation of the AIC, hence the differences that you see. There
> could also be a difference in the p-values and confidence intervals,
> but at numbers this large the differences are smaller than the level
> of rounding done.
>
> This appears to be a bug/overlooked piece to me; Thomas is cc'd on
> this, so he should be able to fix it.
>
> A workaround in the meantime is to do something like:
>
>     fit$df.resid <- 10000 - 4
>
> and then compute the AIC.
>
> Also, as an aside, if you change your seq to
> seq(chunksize, 10000-chunksize, chunksize), then you won't get the
> error messages.
>
> Hope this helps,
>
> --
> Gregory (Greg) L. Snow Ph.D.
> Statistical Data Center
> Intermountain Healthcare
> greg.s...@imail.org
> 801.408.8111
>
> *From:* utkarshsinghal [mailto:utkarsh.sing...@global-analytics.com]
> *Sent:* Wednesday, July 08, 2009 2:24 AM
> *To:* Greg Snow
> *Cc:* Thomas Lumley; r help
> *Subject:* Re: [R] bigglm() results different from glm() + Another question
>
> Hi Greg,
>
> Many thanks for your precious time.
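Greg's df.resid diagnosis also explains, quantitatively, why the AIC figures reported below are exactly linear in chunksize. Here is a hedged check: the formula deviance + 2*(n - df.resid) is my inference from the numbers in this thread, not something stated in it, and p = 4 counts the intercept plus x1, x2, x3.

```r
# Assumption (inferred, not from the thread): AIC.biglm returns
# deviance + 2 * (n - df.resid), and the bug leaves df.resid frozen at
# its initial value, chunksize - p.
n <- 10000
p <- 4                  # intercept + x1 + x2 + x3
dev <- 9956.91 - 2 * p  # deviance implied by the chunksize = 10000 row
chunksizes <- c(500, 1000, 2000, 2500, 5000, 10000)
aic <- dev + 2 * (n - (chunksizes - p))
round(aic, 2)
# 28956.91 27956.91 25956.91 24956.91 19956.91  9956.91
```

With the corrected df.resid of n - p, every chunksize would give dev + 2*p = 9956.91, the same value as the chunksize = 10000 fit; that is exactly what Greg's fit$df.resid <- 10000 - 4 workaround achieves.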
> Here is workable code:
>
>     set.seed(1)
>     xx = data.frame(x1=runif(10000,0,10), x2=runif(10000,0,10),
>                     x3=runif(10000,0,10))
>     xx$y = 3 + xx$x1 + 2*xx$x2 + 3*xx$x3 + rnorm(10000)
>
>     chunksize = 500
>     fit = biglm(y~x1+x2+x3, data=xx[1:chunksize,])
>     for(i in seq(chunksize,10000,chunksize))
>       fit = update(fit, moredata=xx[(i+1):(i+chunksize),])
>     AIC(fit)
>     [1] 28956.91
>
> And the AIC for other chunksizes:
>
>     chunksize     AIC
>     500      28956.91
>     1000     27956.91
>     2000     25956.91
>     2500     24956.91
>     5000     19956.91
>     10000     9956.91
>
> Also I noted that the estimated coefficients do not depend on the
> chunksize, and the AIC is exactly a linear function of chunksize. So
> I guess it is some problem in the calculation of AIC, maybe in a
> degree of freedom or a constant added somewhere.
>
> My comments are below.
>
> Regards
> Utkarsh
>
> Greg Snow wrote:
>
> > How many rows does xx have?
> >
> > Let's look at your example for chunksize 10000: you initially fit
> > the first 10000 observations, then the seq results in just the
> > value 10000, which means that you do the update based on values
> > 10001 through 20000. If xx only has 10000 rows, this should give
> > at least one error. If xx has 20000 or more rows, then only
> > chunksize 10000 will ever see the 20000th value; the other
> > chunksizes will use less of the data.
>
> Understood your point, and apologies that you had to spend time
> going through the logic inside the for loop. I definitely thought of
> that, but my actual problem was the variation in AICs (which I was
> sure about), so to sidestep this loop problem (temporarily) I
> deliberately chose the chunksizes such that the number of rows is a
> multiple of chunksize. I knew there was still one extra iteration
> happening, and I checked that it was not causing any problem:
> "moredata" in the last iteration will be all NAs, and "update" does
> nothing in that case.
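The "all NAs" behavior in that last iteration comes from base R indexing, not from biglm itself: selecting rows beyond the end of a data frame silently returns rows filled with NA. A minimal base-R illustration:

```r
# Indexing past the end of a data frame yields all-NA rows, with no
# warning or error:
df <- data.frame(x = 1:10)
extra <- df[11:15, , drop = FALSE]
nrow(extra)          # 5
all(is.na(extra$x))  # TRUE
```

Whether update() then ignores such all-NA rows is biglm's behavior, as Utkarsh reports above.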
> For example, let's say chunksize=5000. Even though "xx" has only
> 10000 rows, "fit2" and "fit3" below are exactly the same:
>
>     fit1 = biglm(y~x1+x2+x3, data=xx[1:5000,])
>     fit2 = update(fit1, moredata=xx[5001:10000,])
>     fit3 = update(fit2, moredata=xx[10001:15000,])
>     AIC(fit1); AIC(fit2); AIC(fit3)
>     [1] 5018.282
>     [1] 19956.91
>     [1] 19956.91
>
> (The AICs match the table above, with no warnings at all.)
>
> I checked all these things before sending my first mail and dropped
> the idea of refining the for loop, as this saves me a few lines of
> code and the loop is easy to understand. Moreover, it is neither
> taking any extra run time nor producing any warnings or errors.
>
> > Also, looking at the help for update.biglm, the 2nd argument is
> > "moredata", not "data", so if the code below is the code that you
> > actually ran, then the new data chunks are going into the "..."
> > argument (and being ignored, as that is there for future
> > expansion and does nothing yet) and the "moredata" argument is
> > left empty, which should also be giving an error. For the code
> > below, the model is only being fit to the initial chunk and never
> > updated, so with different chunk sizes there are different
> > amounts of data per model. You can check this by doing
> > summary(fit) and looking at the sample size in the 2nd line.
>
> My fault in writing the mail. In the actual code I used
> update(fit, xx[(i+1):(i+chunksize),]), i.e., I passed the new chunk
> as the 2nd argument without naming it, which is correct; but while
> writing the mail I added the argument name "data" without checking
> what it should be.
>
> > It is easier for us to help you if you provide code that can be
> > run by copying and pasting (we don't have xx, so we can't just
> > run the code below; you could include a line to randomly generate
> > an xx, or a link to where a copy of xx can be downloaded from).
> > It also helps if you mention any errors or warnings that you
> > receive in the process of running your code.
> >
> > Hope this helps,
> >
> > --
> > Gregory (Greg) L. Snow Ph.D.
> > Statistical Data Center
> > Intermountain Healthcare
> > greg.s...@imail.org
> > 801.408.8111
> >
> > *From:* utkarshsinghal [mailto:utkarsh.sing...@global-analytics.com]
> > *Sent:* Tuesday, July 07, 2009 12:10 AM
> > *To:* Greg Snow
> > *Cc:* Thomas Lumley; r help
> > *Subject:* Re: [R] bigglm() results different from glm() + Another question
> >
> > Trust me, it is the same total data I am using; even the
> > chunksizes are all equal. I also crosschecked by manually creating
> > the chunks and updating, as in the example given on the biglm help
> > page (?biglm).
> >
> > Regards
> > Utkarsh
> >
> > Greg Snow wrote:
> >
> > Are you sure that you are fitting all the models on the same total
> > data? At first glance it looks like you may be including more data
> > for some of the chunk sizes, or producing an error that update
> > does not know how to deal with.
> >
> > --
> > Gregory (Greg) L. Snow Ph.D.
> > Statistical Data Center
> > Intermountain Healthcare
> > greg.s...@imail.org
> > 801.408.8111
> >
> > *From:* utkarshsinghal [mailto:utkarsh.sing...@global-analytics.com]
> > *Sent:* Monday, July 06, 2009 8:58 AM
> > *To:* Thomas Lumley; Greg Snow
> > *Cc:* r help
> > *Subject:* Re: [R] bigglm() results different from glm() + Another question
> >
> > The AIC of the biglm models is highly dependent on the chunksize
> > selected (example provided below). This I can somewhat expect,
> > because the model error will increase with the number of chunks.
> >
> > It will be helpful if you can provide your opinion on comparing
> > different models in such cases:
> >
> > * Can I compare two models fitted with different chunksizes, or
> >   should I always use the same chunksize?
> > * Although I am not going to use AIC at all in my model
> >   selection, I think any other model parameters will also vary in
> >   the same way. Am I right?
> > * What would be the ideal chunksize? Should it be the maximum
> >   possible size that R and my system's RAM can handle?
> >
> > Any comments will be helpful.
> >
> > *Example of AIC variation with chunksize:*
> >
> > I ran the following code on my data, which has 10000 observations
> > and 3 independent variables:
> >
> >     chunksize = 500
> >     fit = biglm(y~x1+x2+x3, data=xx[1:chunksize,])
> >     for(i in seq(chunksize,10000,chunksize))
> >       fit = update(fit, data=xx[(i+1):(i+chunksize),])
> >     AIC(fit)
> >     [1] 30647.79
> >
> > Here are the AICs for other chunksizes:
> >
> >     chunksize     AIC
> >     500      30647.79
> >     1000     29647.79
> >     2000     27647.79
> >     2500     26647.79
> >     5000     21647.79
> >     10000    11647.79
> >
> > Regards
> > Utkarsh
> >
> > utkarshsinghal wrote:
> >
> > Thank you Mr. Lumley and Mr. Greg. That was helpful.
> >
> > Regards
> > Utkarsh
> >
> > Thomas Lumley wrote:
> >
> > On Fri, 3 Jul 2009, utkarshsinghal wrote:
> >
> > > Hi Sir,
> > >
> > > Thanks for making the package available to us. I am facing a
> > > few problems; perhaps you can give some hints.
> > >
> > > Problem 1: The model summary and residual deviance matched (in
> > > the mail below), but I didn't understand why the AIC is still
> > > different.
> > >
> > >     AIC(m1)
> > >     [1] 532965
> > >     AIC(m1big_longer)
> > >     [1] 101442.9
> >
> > That's because AIC.default uses the unnormalized log-likelihood
> > and AIC.biglm uses the deviance. Only differences in AIC between
> > models are meaningful, not individual values.
> >
> > > Problem 2: The chunksize argument is there in bigglm but not in
> > > biglm; consequently, update.biglm is there, but not
> > > update.bigglm. Is my observation correct? If yes, why the
> > > difference?
> >
> > Because update.bigglm is impossible.
> >
> > Fitting a glm requires iteration, which means that it requires
> > multiple passes through the data. Fitting a linear model requires
> > only a single pass.
> > update.biglm can take a fitted or partially fitted biglm and add
> > more data. To do the same thing for a bigglm, you would need to
> > start over again from the beginning of the data set.
> >
> > To fit a glm, you need to specify a data source that bigglm() can
> > iterate over. You do this with a function that can be called
> > repeatedly to return the next chunk of data.
> >
> > -thomas
> >
> > Thomas Lumley, Assoc. Professor, Biostatistics
> > tlum...@u.washington.edu
> > University of Washington, Seattle
> >
> > I don't know why the AIC is different, but remember that there
> > are multiple definitions of AIC (generally differing in the
> > constant added), and it may just be a difference in the constant,
> > or it could be that you have not fit the whole dataset (based on
> > your other question).
> >
> > For an lm model, biglm only needs to make a single pass through
> > the data. This was the first function written for the package,
> > and the update mechanism was an easy way to write it (and it
> > still works well).
> >
> > The bigglm function came later, and models other than Gaussian
> > require multiple passes through the data, so instead of the
> > update mechanism that biglm uses, bigglm requires the data
> > argument to be a function that returns the next chunk of data and
> > can restart from the beginning of the dataset.
> >
> > Also note that the bigglm function usually only does a few passes
> > through the data; usually this is good enough, but in some cases
> > you may need to increase the number of passes.
> >
> > Hope this helps,

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
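Thomas's description of the bigglm() data source can be sketched as a closure. This is a sketch, not code from the thread: the helper name make_chunk_reader is mine, and the reset-argument convention follows the protocol described on the ?bigglm help page.

```r
# A chunk-reading data function for bigglm(): called with reset = TRUE
# to (re)start a pass over the data, and with reset = FALSE to fetch
# the next chunk; returns NULL when the current pass is exhausted.
make_chunk_reader <- function(df, chunksize) {
  pos <- 0
  function(reset = FALSE) {
    if (reset) {
      pos <<- 0        # rewind for the next pass
      return(NULL)
    }
    if (pos >= nrow(df)) return(NULL)
    rows <- (pos + 1):min(pos + chunksize, nrow(df))
    pos <<- rows[length(rows)]
    df[rows, , drop = FALSE]
  }
}

# Hypothetical usage with the xx data frame from the thread:
# library(biglm)
# fit <- bigglm(y ~ x1 + x2 + x3, data = make_chunk_reader(xx, 500),
#               family = gaussian())
```

The closure keeps its position in `pos`, so bigglm() can make the several full passes that non-Gaussian families need by calling the function with reset = TRUE between passes.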