That was definitely helpful. Regarding "if you change your seq to:
seq(chunksize, 10000-chunksize, chunksize) then you won't get the error
messages":
Even previously I was not getting any error messages or warnings at all;
"update" removes the extra NA rows silently. For example:

    set.seed(1)
    xx = data.frame(x1=runif(10000,0,10), x2=runif(10000,0,10),
                    x3=runif(10000,0,10))
    xx$y = 3 + xx$x1 + 2*xx$x2 + 3*xx$x3 + rnorm(10000)
    fit1 = biglm(y~x1+x2+x3, data=xx[1:7000,])
    fit2 = update(fit1, moredata=xx[7001:10000,])
    fit3 = update(fit2, moredata=xx[10001:14000,])

Here fit2 and fit3 are exactly the same.

Also, if the number of rows is not a multiple of chunksize, then your
sequence will not accommodate the last, smaller chunk of data.

Regards
Utkarsh

Greg Snow wrote:
> OK, it appears that the problem is the df.resid component of the
> biglm object. Everything else is updated by the update function
> except the df.resid piece, so it is based solely on the initial fit
> and the chunksize used there. The df.resid piece is then used in the
> computation of the AIC, hence the differences that you see. There
> could also be a difference in the p-values and confidence intervals,
> but at numbers this large the differences are smaller than the level
> of rounding done.
>
> This appears to be a bug/overlooked piece to me; Thomas is cc'd on
> this, so he should be able to fix it.
>
> A workaround in the meantime is to do something like:
>
>     fit$df.resid <- 10000 - 4
>
> and then compute the AIC.
>
> Also, as an aside, if you change your seq to
> seq(chunksize, 10000-chunksize, chunksize), then you won't get the
> error messages.
>
> Hope this helps,
>
> --
> Gregory (Greg) L. Snow Ph.D.
> Statistical Data Center
> Intermountain Healthcare
> greg.s...@imail.org
> 801.408.8111
>
> *From:* utkarshsinghal [mailto:utkarsh.sing...@global-analytics.com]
> *Sent:* Wednesday, July 08, 2009 2:24 AM
> *To:* Greg Snow
> *Cc:* Thomas Lumley; r help
> *Subject:* Re: [R] bigglm() results different from glm() + Another question
>
> Hi Greg,
>
> Many thanks for your precious time.
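Greg's df.resid diagnosis also explains, quantitatively, why the AIC figures reported below are exactly linear in chunksize. Here is a hedged check: the formula deviance + 2*(n - df.resid) is my inference from the numbers in this thread, not something stated in it, and p = 4 counts the intercept plus x1, x2, x3.

```r
# Assumption (inferred, not from the thread): AIC.biglm returns
# deviance + 2 * (n - df.resid), and the bug leaves df.resid frozen at
# its initial value, chunksize - p.
n <- 10000
p <- 4                  # intercept + x1 + x2 + x3
dev <- 9956.91 - 2 * p  # deviance implied by the chunksize = 10000 row
chunksizes <- c(500, 1000, 2000, 2500, 5000, 10000)
aic <- dev + 2 * (n - (chunksizes - p))
round(aic, 2)
# 28956.91 27956.91 25956.91 24956.91 19956.91  9956.91
```

With the corrected df.resid of n - p, every chunksize would give dev + 2*p = 9956.91, the same value as the chunksize = 10000 fit; that is exactly what Greg's fit$df.resid <- 10000 - 4 workaround achieves.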
> Here is workable code:
>
>     set.seed(1)
>     xx = data.frame(x1=runif(10000,0,10), x2=runif(10000,0,10),
>                     x3=runif(10000,0,10))
>     xx$y = 3 + xx$x1 + 2*xx$x2 + 3*xx$x3 + rnorm(10000)
>
>     chunksize = 500
>     fit = biglm(y~x1+x2+x3, data=xx[1:chunksize,])
>     for(i in seq(chunksize,10000,chunksize))
>       fit = update(fit, moredata=xx[(i+1):(i+chunksize),])
>     AIC(fit)
>     [1] 28956.91
>
> And the AIC for other chunksizes:
>
>     chunksize     AIC
>     500      28956.91
>     1000     27956.91
>     2000     25956.91
>     2500     24956.91
>     5000     19956.91
>     10000     9956.91
>
> Also I noted that the estimated coefficients do not depend on the
> chunksize, and the AIC is exactly a linear function of chunksize. So
> I guess it is some problem in the calculation of AIC, maybe in a
> degree of freedom or a constant added somewhere.
>
> My comments are below.
>
> Regards
> Utkarsh
>
> Greg Snow wrote:
>
> > How many rows does xx have?
> >
> > Let's look at your example for chunksize 10000: you initially fit
> > the first 10000 observations, then the seq results in just the
> > value 10000, which means that you do the update based on values
> > 10001 through 20000. If xx only has 10000 rows, this should give
> > at least one error. If xx has 20000 or more rows, then only
> > chunksize 10000 will ever see the 20000th value; the other
> > chunksizes will use less of the data.
>
> Understood your point, and apologies that you had to spend time
> going through the logic inside the for loop. I definitely thought of
> that, but my actual problem was the variation in AICs (which I was
> sure about), so to sidestep this loop problem (temporarily) I
> deliberately chose the chunksizes such that the number of rows is a
> multiple of chunksize. I knew there was still one extra iteration
> happening, and I checked that it was not causing any problem:
> "moredata" in the last iteration will be all NAs, and "update" does
> nothing in that case.
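The "all NAs" behavior in that last iteration comes from base R indexing, not from biglm itself: selecting rows beyond the end of a data frame silently returns rows filled with NA. A minimal base-R illustration:

```r
# Indexing past the end of a data frame yields all-NA rows, with no
# warning or error:
df <- data.frame(x = 1:10)
extra <- df[11:15, , drop = FALSE]
nrow(extra)          # 5
all(is.na(extra$x))  # TRUE
```

Whether update() then ignores such all-NA rows is biglm's behavior, as Utkarsh reports above.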
> For example, let's say chunksize=5000. Even though "xx" has only
> 10000 rows, "fit2" and "fit3" below are exactly the same:
>
>     fit1 = biglm(y~x1+x2+x3, data=xx[1:5000,])
>     fit2 = update(fit1, moredata=xx[5001:10000,])
>     fit3 = update(fit2, moredata=xx[10001:15000,])
>     AIC(fit1); AIC(fit2); AIC(fit3)
>     [1] 5018.282
>     [1] 19956.91
>     [1] 19956.91
>
> (The AICs match the table above, with no warnings at all.)
>
> I checked all these things before sending my first mail and dropped
> the idea of refining the for loop, as this saves me a few lines of
> code and the loop is easy to understand. Moreover, it is neither
> taking any extra run time nor producing any warnings or errors.
>
> > Also, looking at the help for update.biglm, the 2nd argument is
> > "moredata", not "data", so if the code below is the code that you
> > actually ran, then the new data chunks are going into the "..."
> > argument (and being ignored, as that is there for future
> > expansion and does nothing yet) and the "moredata" argument is
> > left empty, which should also be giving an error. For the code
> > below, the model is only being fit to the initial chunk and never
> > updated, so with different chunk sizes there are different
> > amounts of data per model. You can check this by doing
> > summary(fit) and looking at the sample size in the 2nd line.
>
> My fault in writing the mail. In the actual code I used
> update(fit, xx[(i+1):(i+chunksize),]), i.e., I passed the new chunk
> as the 2nd argument without naming it, which is correct; but while
> writing the mail I added the argument name "data" without checking
> what it should be.
>
> > It is easier for us to help you if you provide code that can be
> > run by copying and pasting (we don't have xx, so we can't just
> > run the code below; you could include a line to randomly generate
> > an xx, or a link to where a copy of xx can be downloaded from).
> > It also helps if you mention any errors or warnings that you
> > receive in the process of running your code.
> >
> > Hope this helps,
> >
> > --
> > Gregory (Greg) L. Snow Ph.D.
> > Statistical Data Center
> > Intermountain Healthcare
> > greg.s...@imail.org
> > 801.408.8111
> >
> > *From:* utkarshsinghal [mailto:utkarsh.sing...@global-analytics.com]
> > *Sent:* Tuesday, July 07, 2009 12:10 AM
> > *To:* Greg Snow
> > *Cc:* Thomas Lumley; r help
> > *Subject:* Re: [R] bigglm() results different from glm() + Another question
> >
> > Trust me, it is the same total data I am using; even the
> > chunksizes are all equal. I also crosschecked by manually creating
> > the chunks and updating, as in the example given on the biglm help
> > page (?biglm).
> >
> > Regards
> > Utkarsh
> >
> > Greg Snow wrote:
> >
> > Are you sure that you are fitting all the models on the same total
> > data? At first glance it looks like you may be including more data
> > for some of the chunk sizes, or producing an error that update
> > does not know how to deal with.
> >
> > --
> > Gregory (Greg) L. Snow Ph.D.
> > Statistical Data Center
> > Intermountain Healthcare
> > greg.s...@imail.org
> > 801.408.8111
> >
> > *From:* utkarshsinghal [mailto:utkarsh.sing...@global-analytics.com]
> > *Sent:* Monday, July 06, 2009 8:58 AM
> > *To:* Thomas Lumley; Greg Snow
> > *Cc:* r help
> > *Subject:* Re: [R] bigglm() results different from glm() + Another question
> >
> > The AIC of the biglm models is highly dependent on the chunksize
> > selected (example provided below). This I can somewhat expect,
> > because the model error will increase with the number of chunks.
> >
> > It will be helpful if you can provide your opinion on comparing
> > different models in such cases:
> >
> > * Can I compare two models fitted with different chunksizes, or
> >   should I always use the same chunksize?
> > * Although I am not going to use AIC at all in my model
> >   selection, I think any other model parameters will also vary in
> >   the same way. Am I right?
> > * What would be the ideal chunksize? Should it be the maximum
> >   possible size that R and my system's RAM can handle?
> >
> > Any comments will be helpful.
> >
> > *Example of AIC variation with chunksize:*
> >
> > I ran the following code on my data, which has 10000 observations
> > and 3 independent variables:
> >
> >     chunksize = 500
> >     fit = biglm(y~x1+x2+x3, data=xx[1:chunksize,])
> >     for(i in seq(chunksize,10000,chunksize))
> >       fit = update(fit, data=xx[(i+1):(i+chunksize),])
> >     AIC(fit)
> >     [1] 30647.79
> >
> > Here are the AICs for other chunksizes:
> >
> >     chunksize     AIC
> >     500      30647.79
> >     1000     29647.79
> >     2000     27647.79
> >     2500     26647.79
> >     5000     21647.79
> >     10000    11647.79
> >
> > Regards
> > Utkarsh
> >
> > utkarshsinghal wrote:
> >
> > Thank you Mr. Lumley and Mr. Greg. That was helpful.
> >
> > Regards
> > Utkarsh
> >
> > Thomas Lumley wrote:
> >
> > On Fri, 3 Jul 2009, utkarshsinghal wrote:
> >
> > > Hi Sir,
> > >
> > > Thanks for making the package available to us. I am facing a
> > > few problems; perhaps you can give some hints.
> > >
> > > Problem 1: The model summary and residual deviance matched (in
> > > the mail below), but I didn't understand why the AIC is still
> > > different.
> > >
> > >     AIC(m1)
> > >     [1] 532965
> > >     AIC(m1big_longer)
> > >     [1] 101442.9
> >
> > That's because AIC.default uses the unnormalized log-likelihood
> > and AIC.biglm uses the deviance. Only differences in AIC between
> > models are meaningful, not individual values.
> >
> > > Problem 2: The chunksize argument is there in bigglm but not in
> > > biglm; consequently, update.biglm is there, but not
> > > update.bigglm. Is my observation correct? If yes, why the
> > > difference?
> >
> > Because update.bigglm is impossible.
> >
> > Fitting a glm requires iteration, which means that it requires
> > multiple passes through the data. Fitting a linear model requires
> > only a single pass.
> > update.biglm can take a fitted or partially fitted biglm and add
> > more data. To do the same thing for a bigglm, you would need to
> > start over again from the beginning of the data set.
> >
> > To fit a glm, you need to specify a data source that bigglm() can
> > iterate over. You do this with a function that can be called
> > repeatedly to return the next chunk of data.
> >
> > -thomas
> >
> > Thomas Lumley, Assoc. Professor, Biostatistics
> > tlum...@u.washington.edu
> > University of Washington, Seattle
> >
> > I don't know why the AIC is different, but remember that there
> > are multiple definitions of AIC (generally differing in the
> > constant added), and it may just be a difference in the constant,
> > or it could be that you have not fit the whole dataset (based on
> > your other question).
> >
> > For an lm model, biglm only needs to make a single pass through
> > the data. This was the first function written for the package,
> > and the update mechanism was an easy way to write it (and it
> > still works well).
> >
> > The bigglm function came later, and models other than Gaussian
> > require multiple passes through the data, so instead of the
> > update mechanism that biglm uses, bigglm requires the data
> > argument to be a function that returns the next chunk of data and
> > can restart from the beginning of the dataset.
> >
> > Also note that the bigglm function usually only does a few passes
> > through the data; usually this is good enough, but in some cases
> > you may need to increase the number of passes.
> >
> > Hope this helps,

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
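Thomas's description of the bigglm() data source can be sketched as a closure. This is a sketch, not code from the thread: the helper name make_chunk_reader is mine, and the reset-argument convention follows the protocol described on the ?bigglm help page.

```r
# A chunk-reading data function for bigglm(): called with reset = TRUE
# to (re)start a pass over the data, and with reset = FALSE to fetch
# the next chunk; returns NULL when the current pass is exhausted.
make_chunk_reader <- function(df, chunksize) {
  pos <- 0
  function(reset = FALSE) {
    if (reset) {
      pos <<- 0        # rewind for the next pass
      return(NULL)
    }
    if (pos >= nrow(df)) return(NULL)
    rows <- (pos + 1):min(pos + chunksize, nrow(df))
    pos <<- rows[length(rows)]
    df[rows, , drop = FALSE]
  }
}

# Hypothetical usage with the xx data frame from the thread:
# library(biglm)
# fit <- bigglm(y ~ x1 + x2 + x3, data = make_chunk_reader(xx, 500),
#               family = gaussian())
```

The closure keeps its position in `pos`, so bigglm() can make the several full passes that non-Gaussian families need by calling the function with reset = TRUE between passes.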