Re: [R] prediction error for test set-cross validation

Frank E Harrell Jr Wed, 11 Mar 2009 06:04:33 -0700

Uwe Ligges wrote:

Mehmet U Ayvaci wrote:
Hi,
I have a database of 2211 rows with 31 entries each, and I manually split my data into 10 folds for cross validation. I build a logistic regression model as:
model <- glm(qual ~ AgGr + FaHx + PrHx + PrSr + PaLp + SvD + IndExam +
             Rad + BrDn + BRDS + PrinFin + SkRtr + NpRtr + SkThck + TrThkc +
             SkLes + AxAdnp + ArcDst + MaDen + CaDt + MaMG +
             MaMrp + MaSh + SCTub + SCFoc + MaSz,
             family=binomial(link=logit));
where the variables are taken from the trainSet of size 1989x31. The test
set is sized 222x31. Now my question is: when I try to predict on the test
set, it gives me the error:
predict.glm(model, testSet, type ="response")
"Error in drop(X[, piv, drop = FALSE] %*% beta[piv]) :
  subscript out of bounds"
It does fine on trainSet, so it is something about the testSet. On the other
hand, I realized that some independent variables, say "MaSz", take 3
different values in the trainSet vs. 4 different ones in the testSet. I am
not sure if this is the cause. If so, what would be the remedy?
Since I can retrieve the coefficients of the logistic regression, I could
manually calculate the response for each entry in the testSet. This could solve my problem, although it would be burdensome. But I don't know an easy way of doing it, as
my logistic regression has 80+ coefficients.
Well, if "MaSz takes 3 different values in the trainset vs. 4 different ones in the testSet", then you won't even be able to calculate it by hand, because you have no coefficient for the 4th level of that factor. Either you need data to estimate that coefficient from, or you cannot predict.
Uwe Ligges
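
A quick way to check whether this is what happened is to compare the factor levels in the two sets before calling predict. The following is only a minimal sketch in base R: the two small data frames are made-up stand-ins for trainSet and testSet, and unseen_levels is a hypothetical helper name, not a function from any package.

```r
# Toy stand-ins for trainSet and testSet (hypothetical data, not the poster's).
trainSet <- data.frame(
  qual = c(0, 1, 0, 1, 1, 0),
  MaSz = factor(c("a", "b", "c", "a", "b", "c"))
)
testSet <- data.frame(
  qual = c(1, 0, 1),
  MaSz = factor(c("a", "d", "b"))
)

# For every factor column of 'train', list the values that occur in 'test'
# but never occur in 'train'. Columns with no such values are dropped.
unseen_levels <- function(train, test) {
  factor_cols <- names(train)[vapply(train, is.factor, logical(1))]
  out <- lapply(factor_cols, function(v) {
    setdiff(unique(as.character(test[[v]])),
            unique(as.character(train[[v]])))
  })
  names(out) <- factor_cols
  out[lengths(out) > 0]
}

new_lv <- unseen_levels(trainSet, testSet)
print(new_lv)
```

Any variable this reports has at least one test value with no estimated coefficient. Possible remedies (all judgment calls, depending on the data) are to re-split the folds so that every level appears in each training set, to merge rare levels before fitting, or to set aside the affected test rows.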

And note that your test sample is far too small to yield reliable results. You need to use resampling (e.g., bootstrap or 50 repeats of 10-fold cross-validation). See the validate function in the Design package. Note that validate does not implement the proportion classified correctly because this is an improper scoring rule with minimum information/lowest precision/lowest power.


Frank Harrell
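
To illustrate the idea of repeated cross-validation with a proper scoring rule, here is a hand-rolled base R sketch. It is not the validate function mentioned above and does not reproduce its output; it uses simulated data (x1, x2, y are invented for the example), a hypothetical helper named repeated_cv_brier, and the Brier score (mean squared difference between predicted probability and observed outcome) as one example of a proper scoring rule.

```r
# Simulated binary-outcome data (purely illustrative).
set.seed(1)
n <- 300
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- rbinom(n, 1, plogis(-0.5 + d$x1 - 0.5 * d$x2))

# Repeat k-fold cross-validation 'repeats' times. In each repeat every
# observation is predicted once by a model that was fitted without it,
# and the Brier score of those out-of-fold predictions is recorded.
repeated_cv_brier <- function(data, formula, k = 10, repeats = 50) {
  response <- all.vars(formula)[1]
  scores <- numeric(repeats)
  for (r in seq_len(repeats)) {
    # Random fold assignment, redrawn for each repeat.
    fold <- sample(rep(seq_len(k), length.out = nrow(data)))
    pred <- numeric(nrow(data))
    for (i in seq_len(k)) {
      held_out <- fold == i
      fit <- glm(formula, family = binomial(link = "logit"),
                 data = data[!held_out, ])
      pred[held_out] <- predict(fit, newdata = data[held_out, ],
                                type = "response")
    }
    scores[r] <- mean((pred - data[[response]])^2)
  }
  scores
}

brier <- repeated_cv_brier(d, y ~ x1 + x2, k = 10, repeats = 50)
mean(brier)  # average out-of-fold Brier score over the 50 repeats
```

Averaging over many repeats reduces the dependence of the estimate on any single random split, which is the concern with a single 222-row test set.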

Could somebody advise?

Thanks,
M



______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help

PLEASE do read the posting guide http://www.R-project.org/posting-guide.html

and provide commented, minimal, self-contained, reproducible code.





--
Frank E Harrell Jr   Professor and Chair           School of Medicine
                     Department of Biostatistics   Vanderbilt University

