Bert Gunter <gunter.berton <at> gene.com> writes:

> I haven't followed this thread closely, but if perfect separation in a
> binomial glm is the problem, google it. e.g.
>
> http://www.ats.ucla.edu/stat/mult_pkg/faq/general/complete_separation_logit_models.htm
>
> This presumably explains your concerns about coefficient agreement.

Agreed. The rest of my answer is below.

Josh Browning <rockclimber112358 <at> gmail.com> writes:

> Yes, I agree that the results are "very similar" but I don't
> understand why they are not exactly equal given that the data sets are
> identical.
>
> And yes, this 1% numerical difference is hugely important to me. I
> have another data set (much larger than this toy example) that works
> on the aggregated data (returning a coefficient of about 1) but
> returns the warning about perfect separation on the non-aggregated
> data (and a coefficient of about 1e15). So, I'd at least like to be
> able to understand where this numerical difference is coming from and,
> preferably, a way to tweak my glm() runs (possibly adjusting the
> numerical precision somehow???) so that this doesn't happen.
>
> Josh

I played around with this a bit, and I think the problem is so
numerically unstable that you really can't just tweak the settings on
glm() to make it work. (When a problem is numerically unstable, nearly
trivial differences like the order of operations or even the compiler
used can make big differences in the results.)

There's a very nice blog post about the numerics of GLM here:

http://www.win-vector.com/blog/2012/08/how-robust-is-logistic-regression/

One of the conclusions is

  And most practitioners are unfamiliar with this situation [numerical
  instability of GLMs in some cases] because:

  * They rightly do not concern themselves with the implementation
    details, as these are best left to the software implementors.
  * They are very likely to encounter issues arise from separation,
    which will mask other issues.

You appear to have a (near- or complete-) separation problem. I would
strongly recommend the logistf package (when I tried it, I got
near-identical results from the aggregated and disaggregated data).

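For anyone who wants to see the effect without the original data, here
is a minimal sketch in base R. The toy data set is made up for
illustration (it is not Josh's data) and is completely separated by
construction; the variable names are my own. It fits the same binomial
model to an aggregated and a disaggregated layout, then evaluates the
log-likelihood of a simplified no-intercept model at a few slopes to
show that it keeps increasing, i.e. there is no finite maximum for
glm() to converge to.

```r
# Made-up toy data with complete separation (NOT the data from this thread):
# every trial with x < 0 is a failure, every trial with x > 0 is a success.
x <- c(-2, -1, 1, 2)
n <- c(5, 5, 5, 5)      # trials at each x
s <- c(0, 0, 5, 5)      # successes at each x

agg <- data.frame(x = x, s = s, f = n - s)       # one row per x value
dis <- data.frame(x = rep(x, times = n),         # one 0/1 row per trial
                  y = rep(c(0, 0, 1, 1), times = n))

# Same model, same information, two layouts. glm() typically warns here
# that fitted probabilities numerically 0 or 1 occurred.
fit_agg <- glm(cbind(s, f) ~ x, family = binomial, data = agg)
fit_dis <- glm(y ~ x, family = binomial, data = dis)

# Both slopes come out large, and they need not agree: the iterations
# simply stop somewhere along a likelihood with no finite maximum.
slopes <- c(aggregated    = coef(fit_agg)[["x"]],
            disaggregated = coef(fit_dis)[["x"]])
print(slopes)

# Log-likelihood of the simplified model y ~ 0 + x as a function of the
# slope b. Under separation it increases toward 0 as b grows.
loglik <- function(b) {
  sum(dbinom(dis$y, size = 1, prob = plogis(b * dis$x), log = TRUE))
}
ll <- sapply(c(1, 5, 10, 20), loglik)
print(ll)
```

Refitting the disaggregated data with something like
logistf::logistf(y ~ x, data = dis) would be the penalized-likelihood
comparison; it is left out of the sketch because it needs the logistf
package installed.
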
I would also argue that if a 1% difference in the estimate of a
parameter whose confidence interval is essentially undefined (try
confint() on your results, with MASS loaded so that you get
profile-likelihood intervals) is concerning you, then you have some
bigger problems to wrestle with ...

  good luck
    Ben Bolker

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.