On 30-Jun-09 17:41:20, Marc Schwartz wrote:
> On Jun 30, 2009, at 10:44 AM, Ted Harding wrote:
>
>> On 30-Jun-09 14:52:20, Marc Schwartz wrote:
>>> On Jun 30, 2009, at 4:54 AM, Renzi Fabrizio wrote:
>>>
>>>> I would like to know how R computes the probability of an event
>>>> in a logistic model (P(y=1)) from the score s, the linear
>>>> combination of x and beta.
>>>>
>>>> I noticed that there are differences (small, less than e-16)
>>>> between the fitted values automatically computed by the glm
>>>> procedure in R and the values I computed "manually" by applying
>>>> the inverse formula p = e^s/(1+e^s). Moreover, I noticed that the
>>>> minimum of the fitted values in my estimation is 2.220446e-16,
>>>> and there are many observations with this probability (whereas
>>>> the minimum value obtained by "manual" estimation is
>>>> 2.872636e-152).
>>>
>>> It would be helpful to see at least a subset of the output from
>>> your model and your manual computations, so that we can at least
>>> visually compare the results to see where the differences may be.
>>>
>>> The model object returned by glm() will contain both the linear
>>> predictors on the link scale (model$linear.predictors) and the
>>> fitted values (model$fitted.values). The latter can be accessed
>>> using the fitted() extractor function.
>>>
>>> To use an example, let's create a simple LR model using the
>>> infert data set, as referenced in ?glm:
>>>
>>> model1 <- glm(case ~ spontaneous + induced, data = infert,
>>>               family = binomial())
>>>
>>> > model1
>>>
>>> Call:  glm(formula = case ~ spontaneous + induced,
>>>            family = binomial(), data = infert)
>>>
>>> Coefficients:
>>> (Intercept)  spontaneous      induced
>>>     -1.7079       1.1972       0.4181
>>>
>>> Degrees of Freedom: 247 Total (i.e. Null);  245 Residual
>>> Null Deviance:     316.2
>>> Residual Deviance: 279.6        AIC: 285.6
>>>
>>> # get the coefficients
>>> > coef(model1)
>>> (Intercept) spontaneous     induced
>>>  -1.7078601   1.1972050   0.4181294
>>>
>>> # get the linear predictor values
>>> # (log odds scale for a binomial glm)
>>> > head(model1$linear.predictors)
>>>           1           2           3           4           5           6
>>>  1.10467939 -1.28973068 -0.87160128 -0.87160128 -0.09252564  0.32560375
>>>
>>> You can also get the above by using the coefficients and the
>>> model matrix, for comparison:
>>>
>>> # the full set of 248 would be
>>> # coef(model1) %*% t(model.matrix(model1))
>>> > head(as.vector(coef(model1) %*% t(model.matrix(model1))))
>>> [1]  1.10467939 -1.28973068 -0.87160128 -0.87160128 -0.09252564  0.32560375
>>>
>>> # get the fitted response values (predicted probabilities)
>>> > head(fitted(model1))
>>>         1         2         3         4         5         6
>>> 0.7511359 0.2158984 0.2949212 0.2949212 0.4768851 0.5806893
>>>
>>> We can also get the fitted values from the linear predictor
>>> values by using:
>>>
>>> LP <- model1$linear.predictors
>>> > head(exp(LP) / (1 + exp(LP)))
>>>         1         2         3         4         5         6
>>> 0.7511359 0.2158984 0.2949212 0.2949212 0.4768851 0.5806893
>>>
>>> You can also get the above by using the predict.glm() function
>>> with type = "response". The default type, "link", will get you
>>> the linear predictor values as above. predict.glm() would
>>> typically be used to generate predictions from the model on new
>>> data.
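>>> For instance (just a sketch; 'newdat' below is a made-up data
>>> frame with the same covariates as the model formula):
>>>
>>> newdat <- data.frame(spontaneous = c(0, 1, 2),
>>>                      induced     = c(0, 1, 2))
>>> predict(model1, newdata = newdat, type = "response")
>>>
>>> which should return the predicted probabilities for the three
>>> hypothetical subjects.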
>>> On the original data it reproduces the fitted values:
>>>
>>> > head(predict(model1, type = "response"))
>>>         1         2         3         4         5         6
>>> 0.7511359 0.2158984 0.2949212 0.2949212 0.4768851 0.5806893
>>>
>>> In glm.fit(), which is the workhorse function behind glm(), the
>>> fitted values returned in the model object are actually computed
>>> using the inverse link function for the family passed to glm():
>>>
>>> > binomial()$linkinv
>>> function (eta)
>>> .Call("logit_linkinv", eta, PACKAGE = "stats")
>>> <environment: 0x171c8b10>
>>>
>>> Thus:
>>>
>>> > head(binomial()$linkinv(model1$linear.predictors))
>>>         1         2         3         4         5         6
>>> 0.7511359 0.2158984 0.2949212 0.2949212 0.4768851 0.5806893
>>>
>>> Those are the same values as we saw above, so all is consistent
>>> across the various methods.
>>>
>>> Perhaps the above provides some insight for you into how R does
>>> some of these things, and also points out that, as is frequently
>>> the case with R, there is more than one way to get the same
>>> result.
>>>
>>> HTH,
>>> Marc Schwartz
>>
>> An interesting and informative reply, Marc; but I think it misses
>> the point of Fabrizio's query. I think Fabrizio's point is the
>> following:
>>
>> set.seed(54321)
>> X  <- sort(rnorm(100))
>> a0 <- 1.0 ; b0 <- 0.5
>> Y  <- 1*(runif(100) < a0 + b0*X)
>> GLM <- glm(Y ~ X, family = binomial)
>> C <- GLM$coef
>> C
>> # (Intercept)           X
>> #    2.726966    2.798429
>> a1 <- C[1] ; b1 <- C[2]
>> Fit0 <- GLM$fit
>> S    <- a1 + b1*X
>> Fit1 <- exp(S)/(1 + exp(S))
>> max(abs(Fit1 - Fit0))
>> # [1] 1.110223e-16
>>
>> This discrepancy is, of course, of the magnitude of a typical
>> rounding error. But the fact that it occurred at all indicates
>> that when glm() computed the fitted values it did not do so by
>> taking the fitted coefficients GLM$coef, creating the fitted
>> score (linear predictor) S as above, and then applying to this
>> the inverse link exp(S)/(1+exp(S)), since doing that would
>> replicate the above calculation and should yield identical
>> results.
>>
>> In fact, a bit more probing of GLM shows that there is already a
>> discrepancy at the "score" level:
>>
>> S0 <- GLM$linear.predictors
>> max(abs(S0 - S))
>> # [1] 8.881784e-16
>>
>> so S0 has not been calculated by applying GLM$coef to X. Also, if
>> we apply the inverse link to S0, then:
>>
>> max(abs(Fit0 - exp(S0)/(1 + exp(S0))))
>> # [1] 1.110223e-16
>>
>> which is the same discrepancy as between Fit1 and Fit0. And:
>>
>> max(abs(Fit1 - exp(S0)/(1 + exp(S0))))
>> # [1] 1.110223e-16
>>
>> The same again! Hence, if I have understood him aright,
>> Fabrizio's question.
>>
>> Hoping this helps,
>> Ted.
>
> Ted,
>
> What OS and version of R are you using? I am on OS X 10.5.7, using
> 32-bit:
>
>   R version 2.9.1 Patched (2009-06-30 r48877)
>
> The reason I ask is that, using your data and model, I get:
>
> > max(abs(Fit1 - Fit0))
> [1] 0
>
> > max(abs(S0 - S))
> [1] 0
>
> > max(abs(Fit0 - exp(S0)/(1 + exp(S0))))
> [1] 0
>
> Truth be told, I was initially leaning in the direction of a
> rounding error, but first wanted to be sure that there was not some
> underlying methodological error that Fabrizio was encountering.
> Hence the focus of my reply was to present a level of consistency
> in getting the results using multiple methods.
>
> When I saw your post, I started thinking that my example, which did
> not have a numeric (e.g. non-integer) covariate, might have been
> too simple and missed introducing an additional source of rounding
> error, but then I could not replicate your findings either...
>
> Thanks,
> Marc
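>
> P.S. For comparing across platforms, it might be worth us both
> checking something along these lines (just a sketch, base R only):
>
>   sessionInfo()          # platform and R version
>   .Machine$double.eps    # machine epsilon for doubles
>   identical(Fit1, Fit0)  # bit-for-bit equality
>   all.equal(Fit1, Fit0)  # equality within a numerical tolerance
>
> since identical() demands exact equality of the stored doubles,
> while all.equal() uses a default tolerance of about 1.5e-8
> (sqrt(.Machine$double.eps)).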
I'm using Debian Linux:

  Linux deb 2.6.18-6-686 #1 SMP Tue May 5 00:40:20 UTC 2009 i686 GNU/Linux

(32-bit, as far as I know), and R version:

  version.string  R version 2.9.0 (2009-04-17)

After posting, I began to suspect that the compiled code which does
the actual computations may be working to higher precision
internally, but rounding to R's

  $double.eps
  [1] 2.220446e-16

  $double.neg.eps
  [1] 1.110223e-16

before attaching the results to the return list for glm() (compare
the numerical discrepancies which I found above). However, that is
only surmise!

Ted.
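P.S. One way to probe this surmise (again just a sketch, base R
only) is to compare the compiled inverse link against the naive
formula at extreme values of the linear predictor:

  eta <- c(-700, -40, -1, 0, 1, 40, 700)
  cbind(eta,
        naive   = exp(eta)/(1 + exp(eta)),   # overflows for large eta
        linkinv = binomial()$linkinv(eta))   # compiled C version

If the compiled code does round or clamp near the representable
limits, the linkinv column should bottom out at about
.Machine$double.eps = 2.220446e-16 for very negative eta, while the
naive formula still returns far smaller numbers -- which would match
the minimum fitted value of 2.220446e-16 that Fabrizio reported.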