I have a data frame with 2 million rows and approximately 200 columns
(features). Roughly 30-40% of the entries are blank (NA). I am trying to
find the important features for a binary response variable. The predictors
may be categorical or continuous.
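
For context, this is roughly how I checked the amount of missingness
('fullData' is a placeholder name for my data frame):

mean(is.na(fullData))               # overall fraction of blank entries
summary(colMeans(is.na(fullData)))  # spread of missingness across columns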

I started by applying logistic regression, but with so many missing
entries I feel this is not a good approach, since glm discards every record
that has any blank item. So I am now looking to apply tree-based
algorithms (rpart or gbm), which handle missing data in a better way.
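
To illustrate the problem with glm on my data (names such as 'fullData' and
'fmla' below are placeholders for my own objects):

# How many rows glm would actually use with its default na.action (na.omit):
sum(complete.cases(fullData))
nrow(fullData)
# The logistic fit I started with; any row containing an NA is dropped:
logitFit <- glm(fmla, data = fullData, family = binomial)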

Since my data is too big for rpart or gbm, I decided to randomly draw
10,000 records from the original data, run rpart on that sample, and keep
building a pool of important variables. However, even these 10,000 records
seem to be too many for the rpart algorithm.
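
For what it is worth, this is roughly the loop I am running (the sample
size, the 20 repetitions, and 'fullData' are placeholders):

library(rpart)
set.seed(1)
varPool <- list()
for (i in 1:20) {
  tmpData <- fullData[sample(nrow(fullData), 10000), ]
  fit <- rpart(fmla, data = tmpData, method = "class")
  varPool[[i]] <- fit$variable.importance
}
# Pool the per-sample importances, e.g. by summing over the samples
allImp  <- unlist(varPool)
poolImp <- sort(tapply(allImp, names(allImp), sum), decreasing = TRUE)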

What can I do in this situation? Is there any switch I can use to make it
faster? Or is it simply not feasible to apply rpart to my data?

I am using the following rpart command:

varimp <- rpart(fmla, data = tmpData, method = "class")$variable.importance
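
In case it helps to know what kind of switch I am hoping for, this is the
direction I was considering with rpart.control; I do not know whether these
settings actually make it fast enough, so the values below are just guesses:

ctrl <- rpart.control(xval = 0,          # skip the default 10-fold cross-validation
                      maxcompete = 0,    # do not track competing splits
                      maxsurrogate = 1,  # evaluate fewer surrogate splits
                      maxdepth = 10)     # limit the depth of the tree
varimp <- rpart(fmla, data = tmpData, method = "class",
                control = ctrl)$variable.importance

(I kept maxsurrogate above zero because I still want surrogate splits to
handle the blank entries.)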

Thanks

-- 
-------------
Mary Kindall
Yorktown Heights, NY
USA
