Thank you all for your help and advice. This wasn't quite the answer I was looking for, but these concepts make more sense to me now and I think I should be able to resolve the issues I've been having.
Thanks again! On Sun, Sep 30, 2012 at 6:26 PM, David Winsemius <dwinsem...@comcast.net> wrote: > > On Sep 30, 2012, at 4:47 AM, Josh Browning wrote: > >> Hi David, >> >> Yes, I agree that the results are "very similar" but I don't >> understand why they are not exactly equal given that the data sets are >> identical. >> >> And yes, this 1% numerical difference is hugely important to me. > > I believe you will find it is not important at all. > >> I have another data set (much larger than this toy example) that works >> on the aggregated data (returning a coefficient of about 1) but >> returns the warning about perfect separation on the non-aggregated >> data (and a coefficient of about 1e15). So, I'd at least like to be >> able to understand where this numerical difference is coming from and, >> preferably, a way to tweak my glm() runs (possibly adjusting the >> numerical precision somehow???) so that this doesn't happen. > > Here. This shows how you can ratchet up your desired precision: > > dAgg = aggregate( d$RESP, by=list(d$RESP, d$INDEP), FUN=length ) > colnames(dAgg) = c("RESP","INDEP","WT") > > glm( RESP ~ INDEP, family="binomial", data=dAgg, weights=WT, control = > glm.control(epsilon = 1e-10, > maxit = 50) ) > glm( RESP ~ INDEP, family="binomial", data=dAgg, weights=WT, control = > glm.control(epsilon = 1e-11, > maxit = 30) ) > glm( RESP ~ INDEP, family="binomial", data=dAgg, weights=WT, control = > glm.control(epsilon = 1e-12, > maxit = 50) ) > glm( RESP ~ INDEP, family="binomial", data=dAgg, weights=WT, control = > glm.control(epsilon = 1e-13, > maxit = 50) ) > > ... and finally get a message that _should_ demonstrate you why I thought > asking for greater numerical precision in an extreme case such as this was a > "Fool's errand". The "true" value is positive infinity, but numerical > precision is only 8 bytes so we could only get to an estimated odds ratio of > exp(32.975 ) = 2.09344e+14, which corresponds to odds of 200 trillion to 1. > Further increases on the 'maxit' and decreases in 'epsilon' are vigorously > ignored, .... and rightfully so. > > After you do such analyses long enough you learn that coefficients above 5 > are suspicious and those above 10 are usually a sign of an error in data > preparation. > > (I do not think this is an example of complete separation, since the range > of the predictors overlap for the two outcomes. But who know I have made > (many) errors before.) > > -- > David, > > >> >> Josh >> >> On Sat, Sep 29, 2012 at 7:50 PM, David Winsemius <dwinsem...@comcast.net> >> wrote: >>> >>> On Sep 29, 2012, at 7:10 AM, Josh Browning wrote: >>> >>>> Hi useRs, >>>> >>>> I'm experiencing something quite weird with glm() and weights, and >>>> maybe someone can explain what I'm doing wrong. I have a dataset >>>> where each row represents a single case, and I run >>>> glm(...,family="binomial") and get my coefficients. However, some of >>>> my cases have the exact same values for predictor variables, so I >>>> should be able to aggregate up my data frame and run glm(..., >>>> family="binomial",weights=wts) and get the same coefficients (maybe >>>> this is my incorrect assumption, but I can't see why it would be). >>>> Anyways, here's a minimum working example below: >>>> >>>>> d = data.frame( RESP=c(rep(1,5),rep(0,5)), INDEP=c(1,1,1,1,0,0,0,0,0,0) ) >>>>> glm( RESP ~ INDEP, family="binomial", data=d ) >>>> >>>> Call: glm(formula = RESP ~ INDEP, family = "binomial", data = d) >>>> >>>> Coefficients: >>>> (Intercept) INDEP >>>> -1.609 21.176 >>>> >>>> Degrees of Freedom: 9 Total (i.e. Null); 8 Residual >>>> Null Deviance: 13.86 >>>> Residual Deviance: 5.407 AIC: 9.407 >>>>> dAgg = aggregate( d$RESP, by=list(d$RESP, d$INDEP), FUN=length ) >>>>> colnames(dAgg) = c("RESP","INDEP","WT") >>>>> glm( RESP ~ INDEP, family="binomial", data=dAgg, weights=WT ) >>>> >>>> Call: glm(formula = RESP ~ INDEP, family = "binomial", data = dAgg, >>>> weights = WT) >>>> >>>> Coefficients: >>>> (Intercept) INDEP >>>> -1.609 20.975 >>>> >>>> Degrees of Freedom: 2 Total (i.e. Null); 1 Residual >>>> Null Deviance: 13.86 >>>> Residual Deviance: 5.407 AIC: 9.407 >>> >>> Those two results look very similar and it is with a data situation that >>> seems somewhat extreme. The concern is for the 1% numerical difference in >>> the regression coefficient? Am I reading you correctly? >>> >>> -- >>> David Winsemius, MD >>> Alameda, CA, USA >>> > > David Winsemius, MD > Alameda, CA, USA > ______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.