Hi Ben, Thank you for all your help so far! I appreciate it.
I do want to find a good predictive model, yes. It's part of a project, so if I have time after finding the model I may want to look for some patterns, but that's not a priority. I just want the model for now (I need the coefficients above all). It's all categorical data; I categorised any continuous data before I started trying to fit the glm.

I was unsure how to get the csv file to you; however, I have uploaded it and it should be available for download from here: http://www.filedropper.com/prepareddata
If not, let me know and I can attach it.

Hopefully this explains a bit more of what I am aiming to do.

Thanks again,
AJC

On 29 Mar, 2012, at 10:19 PM, Ben Bolker <bbol...@gmail.com> wrote:
Abigail Clifton <abigailclifton <at> me.com> writes:

> I am trying to fit a logit model to some data in a CSV file in R.

It would be helpful to link back to your previous question:
http://thread.gmane.org/gmane.comp.lang.r.general/259353

> Here is my code:
>
> Prepared_Data = read.csv("Prepared_Data.csv", header=TRUE)
> Prepared_Data
> attach(Prepared_Data)
> lrfit<-glm(C3~A1*B2*D4*E5,family = binomial)
> anova(lrfit, test="Chisq")
> write.csv(anova(lrfit, test="Chisq"), file="CWModelA.csv")
> shell.exec("CWModelA.csv")

This is still not a reproducible example, although it's a little closer. Did you read the "recommended reading" in my previous answer???

> I am unsure as to how many methods there are of choosing a suitable model,

Lots, and it depends very much on why you are doing the analysis in the first place. Are you
(1) trying to find a good predictive model?
(2) looking for interesting patterns in the data?
(3) trying to test hypotheses about which predictors have a significant effect on the outcome?
(4) trying to partition the variance explained by different predictors?

> however, I was hoping to fit the
> full/saturated model and choose the significant terms only as
> my final model.

In general this is a poor choice for goal #1 above, not necessarily bad for #2, absolutely terrible for #3, irrelevant for #4.

I'm guessing you are interested in the best predictive model, since you mentioned something in your previous message about working out the probability of default on loan applications. I would say your best bet is to use penalized approaches (see the glmnet package, and library("sos"); findFn("lasso")).

> My first question therefore: is there a better way to fit a model to
> some data? Is there a function or way of getting R to print the
> optimum model?

> My CSV file, when opened in excel, contains approximately 3500 rows
> x 27 columns. I can only seem to run 'anova()' on the saturated/full
> model including the first four columns/factors. If I take any more
> into consideration (e.g. if I did C3~A1*B2*D4*E5*F6*G7), R stops
> responding/I have to force quit. Why is this? How can I get around
> it as I need to include all 27 columns?

For continuous predictors, the number of parameters of the saturated model grows as 2^n; 2^27 is >134 million, so you probably don't want to do that. It's potentially even worse for categorical predictors, where the count is the product of the numbers of levels of the factors (so e.g. 3^27 > 7*10^12 for 27 three-level predictors).

It's still not sufficiently clear why you're having a problem, because you haven't given enough information: in the example I gave in my previous answer, I used 7 continuous variables for 128 parameters without too much difficulty, but if you had (say) 5 levels for each of 7 predictors then you would be trying to estimate 78125 parameters ...

Bottom line, it may simply not be reasonable to fit the saturated model. Hard-core machine learning approaches (and *maybe* the penalized regression approaches) might be able to handle a few thousand predictors for n=3500, but a model with tens of thousands of parameters (or more) feels somewhat crazy. (Someone else is welcome to tell me how this could be done.)

Ben Bolker

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
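
The parameter counts quoted in the reply above can be checked with a few lines of base R. This is only a sketch: the helper name n_saturated and the toy factors f1, f2, f3 are made up for illustration and do not come from the thread. It treats a continuous predictor as contributing a factor of 2 (in or out of each interaction term) and a k-level factor as contributing a factor of k.

```r
# Number of coefficients (including the intercept) in a full-interaction
# model: the product over predictors of (1 + degrees of freedom), i.e.
# 2 for a continuous predictor and k for a k-level factor.
n_saturated <- function(n_levels) prod(n_levels)

n_saturated(rep(2, 7))    # 7 continuous predictors: 128
n_saturated(rep(2, 27))   # 27 continuous predictors: 134217728
n_saturated(rep(5, 7))    # 7 five-level factors: 78125
n_saturated(rep(3, 27))   # 27 three-level factors: about 7.6e12

# Cross-check against a real design matrix on a small toy example
# (hypothetical factors with 2, 3 and 4 levels).
toy <- expand.grid(f1 = factor(1:2), f2 = factor(1:3), f3 = factor(1:4))
ncol(model.matrix(~ f1 * f2 * f3, data = toy))   # 24 = 2 * 3 * 4
```

Applied to the real data, something like prod(sapply(Prepared_Data[c("A1", "B2", "D4", "E5")], function(x) length(unique(x)))) would presumably give the size of the four-factor model before any attempt to fit it, assuming those columns are read in as categorical values.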
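
For the penalized approach suggested in the reply, a minimal lasso sketch with glmnet might look like the following. It assumes the glmnet package is installed and uses simulated stand-in data rather than the poster's file: the column names A1, B2, D4, E5 and C3 are borrowed from the thread, but the level names, sample size and effect sizes are invented for illustration. Restricting the model to main effects plus two-way interactions is also an assumption, not something recommended in the thread.

```r
# Lasso-penalized logistic regression on simulated categorical data.
# Guarded so the script does nothing if glmnet is not installed.
if (requireNamespace("glmnet", quietly = TRUE)) {
  set.seed(1)
  n <- 500
  dat <- data.frame(
    A1 = factor(sample(c("a", "b", "c"), n, replace = TRUE)),
    B2 = factor(sample(c("low", "mid", "high"), n, replace = TRUE)),
    D4 = factor(sample(c("x", "y"), n, replace = TRUE)),
    E5 = factor(sample(c("p", "q", "r", "s"), n, replace = TRUE))
  )
  # Simulated binary response that depends on A1 and D4 only
  eta <- -0.5 + 1.0 * (dat$A1 == "b") - 1.0 * (dat$D4 == "y")
  dat$C3 <- rbinom(n, 1, plogis(eta))

  # Dummy-coded main effects plus all two-way interactions;
  # drop the intercept column because glmnet adds its own
  x <- model.matrix(C3 ~ .^2, data = dat)[, -1]
  y <- dat$C3

  # Cross-validated lasso (alpha = 1) for a binomial response
  cvfit <- glmnet::cv.glmnet(x, y, family = "binomial", alpha = 1)

  # Coefficients at the lambda with the lowest cross-validated deviance;
  # terms shrunk exactly to zero are effectively dropped from the model
  print(coef(cvfit, s = "lambda.min"))
}
```

The nonzero rows of the printed coefficient vector are the terms the lasso keeps, which is one way to get coefficients for a predictive model without fitting the saturated one.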