I have a data frame with 2 million rows and approximately 200 columns/features, and roughly 30-40% of the entries are missing. I am trying to identify the important features for a binary response variable. The predictors may be categorical or continuous.
I started with logistic regression, but with so many missing entries I do not think that is a good approach, since glm discards every record that has any blank field. So I am now looking at tree-based methods (rpart or gbm), which handle missing data better.

Since the full data set is too big for rpart or gbm, I decided to repeatedly draw random samples of 10,000 records, run rpart on each sample, and keep building a pool of important variables. However, even 10,000 records seem to be too much for rpart. What can I do in this situation? Is there a switch I can use to make it faster, or is it simply not possible to apply rpart to my data?

I am using the following rpart call:

varimp <- rpart(fmla, data = tmpData, method = "class")$variable.importance

Thanks
--
Mary Kindall
Yorktown Heights, NY
USA
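P.S. In case a reproducible sketch helps, here is a stripped-down version of what I am doing. The data below is only a small synthetic stand-in for my real frame (the name bigData, the column structure, and the number of repeats are made up for illustration; the real frame has 2 million rows, ~200 mixed-type columns and ~35% missing values):

library(rpart)

## small synthetic stand-in for the real data: mixed predictor types,
## ~35% missing values, binary response
set.seed(1)
n  <- 50000
x1 <- rnorm(n)
x2 <- factor(sample(letters[1:5], n, replace = TRUE))
x3 <- runif(n)
y  <- factor(ifelse(x1 + rnorm(n) > 0, "yes", "no"))
bigData <- data.frame(y, x1, x2, x3)

## punch ~35% holes into the predictors
bigData[-1] <- lapply(bigData[-1], function(col) {
    col[sample(n, round(0.35 * n))] <- NA
    col
})

fmla <- y ~ .

## what I am doing now: fit rpart on random 10,000-record subsamples
## and keep pooling the variable importances
importancePool <- list()
for (i in 1:5) {
    tmpData <- bigData[sample(nrow(bigData), 10000), ]
    fit <- rpart(fmla, data = tmpData, method = "class")
    importancePool[[i]] <- fit$variable.importance
}

## one possible way of combining the pool into a single ranking
allImp <- unlist(importancePool)
sort(tapply(allImp, names(allImp), sum), decreasing = TRUE)

This toy version runs quickly, but on my real data each rpart call is already very slow even on a single 10,000-record sample, which is where I am stuck.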