Dear List,
With regard to the question I previously raised, here are the results I
have just obtained. brglm() does help, but there are two situations:

1) Classifiers with extremely high AIC (over 200), no perfect separation,
and converging coefficients. In this case, using brglm() does help! It
stabilizes the AIC, and the classification power is better.

Code and output:  (need to install package: brglm)

matrix <- read.table("http://ihome.ust.hk/~haitian/sample.txt")
names(matrix)<- c("g0","g761","g2809","g3106","g4373","g4583")
fo <- as.formula(g0 ~ g761 * g2809 * g3106 * g4373 * g4583)
library(MASS)
library(brglm)

lr <- brglm(formula= fo, family=binomial(link=logit), data=matrix)
summary(lr)

Coefficients:
                              Estimate Std. Error z value Pr(>|z|)
(Intercept)                     1.2829     0.8281   1.549   0.1214
g761                            4.0619     5.2519   0.773   0.4393
g2809                          -2.2775     4.7237  -0.482   0.6297
g3106                          -2.4431     3.8504  -0.635   0.5258
g4373                           1.2095     2.7312   0.443   0.6579
g4583                           1.0475     6.3020   0.166   0.8680
g761:g2809                    -11.8279    22.0052  -0.538   0.5909
g761:g3106                    -57.7909    35.6418  -1.621   0.1049
...... (omitted)......
g761:g2809:g3106:g4373:g4583 -864.0858  2879.2579  -0.300   0.7641
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 78.708  on 86  degrees of freedom
Residual deviance: 56.600  on 55  degrees of freedom
Penalized deviance: 261.7148
AIC:  120.6


2) Classifiers with perfect separation and a too-small AIC (around 50),
where the coefficients do NOT converge. The in-sample prediction error from
glm() is 0! brglm() is no better than glm() in this case. Code and output
of glm():


matrix2 <- read.table("http://ihome.ust.hk/~haitian/sample2.txt")
names(matrix2)<- c("g0","g28","g1334","g1871","g3639","g4295")
library(MASS)
fo2 <- as.formula(g0 ~ g28 * g1334 * g1871 * g3639 * g4295)
lr2 <- glm(fo2, family=binomial(link=logit), data=matrix2)
summary(lr2)

Deviance Residuals:
       Min          1Q      Median          3Q         Max
-4.527e-05  -2.107e-08  -2.107e-08   2.107e-08   5.802e-05

Coefficients:
                              Estimate Std. Error   z value Pr(>|z|)
(Intercept)                  6.028e+01  1.006e+07  5.99e-06        1
g28                          4.569e+04  8.566e+07     0.001        1
g1334                        1.733e+04  3.568e+07  4.86e-04        1
g1871                        2.917e+02  7.194e+06  4.05e-05        1
g3639                        1.936e+02  1.159e+08  1.67e-06        1
g4295                       -3.642e+02  8.580e+06 -4.24e-05        1
g28:g1334                    2.643e+05  3.732e+08     0.001        1
....(omitted) ....
g28:g1334:g1871:g3639:g4295 -1.084e+06  2.209e+09 -4.91e-04        1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1.2032e+02  on 86  degrees of freedom
Residual deviance: 1.8272e-08  on 55  degrees of freedom
AIC: 64

Number of Fisher Scoring iterations: 25
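
(A quick check I added, in case it is useful -- just a sketch using the lr2
fit above; under perfect separation the fitted probabilities collapse to
numerically 0 or 1 for every observation, which matches the huge standard
errors and the near-zero residual deviance:)

```r
# Fitted probabilities from the glm() fit above; with perfect
# separation they are all numerically 0 or 1.
p.hat <- fitted(lr2)
summary(p.hat)

# Cross-tabulate the rounded fitted values against the observed
# response -- a perfect diagonal indicates complete separation.
table(round(p.hat), matrix2$g0)
```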

Now another question arises: if a perfectly separating plane exists for
(2), then a Support Vector Machine should do a perfect job. But right now
that is not the case: when I tried svm(), the error is over 0.3. Do you
know if this is a tuning problem?
Does svm() automatically consider all the interaction terms? (I actually
tried to input the interaction terms manually, and the result is still the
same.)

(cont'd; needs package: e1071)
library(e1071)
library(rpart)
train.x<- matrix2[,-1]
train.y<- matrix2[,1]
svm.tune <- tune(svm, train.x, train.y, validation.x= train.x, validation.y=
train.y,
 ranges = list(gamma = 2^(-10:5), cost = 2^(-10:4)))
Cost<- svm.tune$best.parameters$cost
Gamma<- svm.tune$best.parameters$gamma

svm.model <- svm(x=train.x, y=train.y, kernel = "polynomial", cost= Cost,
gamma=Gamma, na.action=na.fail, probability =TRUE)
svm.pre <- predict(svm.model, train.x, probability=TRUE)

vote <- ifelse(svm.pre > 0.5, 1, 0)
err.indicator <- ifelse(vote == train.y, 0, 1)
error <- sum(err.indicator)/length(train.y)
error
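
(One thing I noticed while writing this up, which may matter: with a
numeric response, svm() silently fits a regression model rather than a
classifier. A sketch that forces C-classification by coercing y to a
factor -- the choice of the linear kernel is my assumption, on the grounds
that perfect separation suggests the classes are linearly separable:)

```r
# Force classification: svm() does eps-regression when y is numeric,
# but C-classification when y is a factor.
train.yf <- factor(train.y)
svm.cls <- svm(x = train.x, y = train.yf, kernel = "linear", cost = Cost)

# In-sample predictions come back as factor labels, so no 0.5
# thresholding is needed.
cls.pre <- predict(svm.cls, train.x)
mean(cls.pre != train.yf)   # in-sample misclassification rate
```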

I'm really sorry for such a long mail! And for my limited knowledge too!
Would you please advise whether there is a better way of tuning svm(), or
what I should do to obtain reasonable coefficients for case (2)? Thank you
so much!!

Best Regards,
Maggie


-----------------------------------
Haitian Wang
PhD Student in Statistics
ISOM Department, HKUST, Hong Kong





On Fri, Mar 20, 2009 at 4:44 PM, Gavin Simpson <gavin.simp...@ucl.ac.uk>wrote:

> On Fri, 2009-03-20 at 12:39 +1100, Gad Abraham wrote:
> > Maggie Wang wrote:
> > > Hi, Dieter, Gad, and all,
> > >
> > > Thank you very much for your reply!
> > >
> > > So here is my data,  you can copy it into a file names "sample.txt"
> >
> > Hi Maggie,
> >
> > With this data (allowing for more iterations) I get:
> >
> > > lr <- glm(fo, family=binomial(link=logit), data=matrix,
> > control=glm.control(maxit=100))
> > Warning message:
> > In glm.fit(x = X, y = Y, weights = weights, start = start, etastart =
> > etastart,  :
> >    fitted probabilities numerically 0 or 1 occurred
> >
> > which indicates, as Thomas has said, perfect separation, which occurs
> > because you're trying to fit too many variables with not enough data.
>
> It is worth mentioning that, in and of itself, that warning does not
> necessarily indicate a separation issue - something I was unsure about
> recently. You can get that warning (and I did for several data sets in a
> recent problem I enquired on the list about) where the fitted values
> really do become numerically 0 or 1 without separation.
>
> For example, see this response to my original question on the list:
>
> http://article.gmane.org/gmane.comp.lang.r.general/134472/
>
> There Ioannis Kosmidis presents a number of ways to investigate the
> results of a logit model fit for such issues.
>
> G
>
> >
> > Cheers,
> > Gad
> >
> --
> %~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%
>  Dr. Gavin Simpson             [t] +44 (0)20 7679 0522
>  ECRC, UCL Geography,          [f] +44 (0)20 7679 0565
>  Pearson Building,             [e] gavin.simpsonATNOSPAMucl.ac.uk
>  Gower Street, London          [w] http://www.ucl.ac.uk/~ucfagls/
>  UK. WC1E 6BT.                 [w] http://www.freshwaters.org.uk
> %~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%
>
>


______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
