Re: [R] How to improve, at all, a simple GLM code

Abigail Clifton Thu, 29 Mar 2012 16:06:18 -0700

Hi Ben,

Thank you for all your help so far! I appreciate it.


Yes, I want to find a good predictive model. It's part of a project, so if 
I have time after finding the model I may look for patterns, but that is 
not a priority. For now I just want the model (above all, I need the 
coefficients).

It's all categorical data; I categorised any continuous variables before I 
started trying to fit the glm.

I was unsure how to get the CSV file to you; however, I have uploaded it, and 
it should be available for download from here:
http://www.filedropper.com/prepareddata

If not, let me know and I can attach it.

Hopefully this explains a bit more of what I am aiming to do.

Thanks again,

AJC


On 29 Mar, 2012, at 10:19 PM, Ben Bolker <bbol...@gmail.com> wrote:

Abigail Clifton <abigailclifton <at> me.com> writes:

> I am trying to fit a logit model to some data in a CSV file in R.

It would be helpful to link back to your previous question:

http://thread.gmane.org/gmane.comp.lang.r.general/259353

> Here is my code:
>
> Prepared_Data = read.csv("Prepared_Data.csv", header=TRUE)
> Prepared_Data
> attach(Prepared_Data)
> lrfit<-glm(C3~A1*B2*D4*E5,family = binomial)
> anova(lrfit, test="Chisq")
> write.csv(anova(lrfit, test="Chisq"), file="CWModelA.csv")
> shell.exec("CWModelA.csv")

This is still not a reproducible example, although
it's a little closer. Did you read the "recommended reading"
in my previous answer???
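
For illustration, a reproducible version of the code above might look
something like the sketch below. The data are simulated (an assumption:
the real Prepared_Data.csv is not reproduced here), only two of the
predictors are used, and the factor levels are made up. Passing data=
instead of calling attach() means the example runs in a clean session.

```r
# Simulated stand-in for Prepared_Data.csv (hypothetical values, not the real data)
set.seed(1)
n <- 200
Prepared_Data <- data.frame(
  A1 = factor(sample(c("a", "b"), n, replace = TRUE)),       # two-level factor
  B2 = factor(sample(c("x", "y", "z"), n, replace = TRUE)),  # three-level factor
  C3 = rbinom(n, size = 1, prob = 0.3)                       # binary response
)

# data= tells glm() where to find the variables, so attach() is not needed
lrfit <- glm(C3 ~ A1 * B2, family = binomial, data = Prepared_Data)

# Sequential analysis-of-deviance table, as in the original code
anova(lrfit, test = "Chisq")
```

Anyone on the list can paste this into R and see the same model structure,
which is the point of a reproducible example.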

> I am unsure as to how many methods there are of choosing a suitable model,

Lots, and it depends very much on why you are doing the analysis in
the first place. Are you (1) trying to find a good predictive model?
(2) Looking for interesting patterns in the data? (3) Trying to test
hypotheses about which predictors have a significant effect on the
outcome? (4) Trying to partition the variance explained by different predictors?

> however, I was hoping to fit the
> full/saturated model and choose the significant terms only as
> my final model.

In general this is a poor choice for goal #1 above, not necessarily
bad for #2, absolutely terrible for #3, irrelevant for #4. I'm
guessing you are interested in the best predictive model, since you
mentioned something in your previous message about working out the
probability of default on loan applications. I would say your best
bet is to use penalized approaches (see the glmnet package, and
library("sos"); findFn("lasso")).
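
A minimal sketch of what a penalized (lasso) logistic fit with glmnet
could look like is given below. It is not a recommendation for the actual
data: the data frame is simulated, the factor levels and effect sizes are
invented, and restricting the model to main effects plus two-way
interactions is an assumption made to keep the design matrix small.
cv.glmnet() picks the penalty by cross-validation; coefficients shrunk
exactly to zero are effectively dropped from the model.

```r
# Simulated categorical predictors (hypothetical; stands in for the real CSV)
set.seed(1)
n <- 500
dat <- data.frame(
  A1 = factor(sample(letters[1:3], n, replace = TRUE)),
  B2 = factor(sample(letters[1:4], n, replace = TRUE)),
  D4 = factor(sample(letters[1:2], n, replace = TRUE))
)
# Binary response with made-up effects of A1 and D4
dat$C3 <- rbinom(n, 1, plogis(-1 + (dat$A1 == "b") - 0.5 * (dat$D4 == "b")))

# glmnet needs a numeric matrix: dummy-code main effects and two-way
# interactions, then drop the intercept column (glmnet adds its own)
x <- model.matrix(C3 ~ (A1 + B2 + D4)^2, data = dat)[, -1]
y <- dat$C3

# Only run the fit if glmnet is installed
if (requireNamespace("glmnet", quietly = TRUE)) {
  # alpha = 1 is the lasso penalty; the penalty strength is chosen by CV
  cvfit <- glmnet::cv.glmnet(x, y, family = "binomial", alpha = 1)
  # Coefficients at the CV-optimal penalty; zero entries were dropped
  print(coef(cvfit, s = "lambda.min"))
}
```

The same pattern should extend to more columns by adding them to the
formula, as long as the resulting design matrix stays a manageable size.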

> My first question therefore: is there a better way to fit a model to
> some data? Is there a function or way of getting R to print the
> optimum model?

> My CSV file, when opened in excel, contains approximately 3500 rows
> x 27 columns. I can only seem to run 'anova()' on the saturated/full
> model including the first four columns/factors. If I take any more
> into consideration (e.g. if I did C3~A1*B2*D4*E5*F6*G7), R stops
> responding/I have to force quit. Why is this? How can I get around
> it as I need to include all 27 columns?

For continuous predictors, the number of parameters of the
saturated model grows as 2^n; 2^27 is >134 million, so you
probably don't want to do that. It's potentially even worse
for categorical predictors (the product of the numbers of levels
of the factors, so e.g. 3^27 > 7*10^12 for 27 three-level predictors).

It's still not sufficiently clear why you're having a problem
because you haven't given enough information: in the example I
gave in my previous answer, I used 7 continuous variables for
128 parameters without too much difficulty, but if you had (say)
5 levels for each of 7 predictors then you would be trying
to estimate 78125 parameters ...
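
The arithmetic above can be checked directly in R. The sketch below also
confirms the product-of-levels rule on a small made-up example with three
three-level factors; count_saturated() is a hypothetical helper written
for this illustration, not a function from any package.

```r
# Parameter counts quoted above
2^27   # 134217728: 27 continuous predictors, all interactions
3^27   # about 7.6e12: 27 three-level factors, all interactions
5^7    # 78125: 7 five-level factors, all interactions

# Hypothetical helper: parameters in a fully crossed model of factors
# is the product of the numbers of levels
count_saturated <- function(levels_per_factor) prod(levels_per_factor)
count_saturated(rep(5, 7))

# Check against the design matrix R actually builds for a small case
grid <- expand.grid(f1 = factor(1:3), f2 = factor(1:3), f3 = factor(1:3))
ncol(model.matrix(~ f1 * f2 * f3, data = grid))  # 27 columns = 3^3
```

With the real data, sapply(Prepared_Data, nlevels) on the predictor
columns (assuming they are stored as factors) would give the vector to
pass to count_saturated().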

Bottom line, it may simply not be reasonable to fit the
saturated model. Hard-core machine learning approaches (and
*maybe* the penalized regression approaches) might be able
to handle a few thousand predictors for n=3500, but a model
with tens of thousands of parameters (or more) feels somewhat crazy.
(Someone else is welcome to tell me how this could be done.)

Ben Bolker

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
