Andy,

You're right, I didn't supply any code, because my call was very simple and it was the call itself that was in question. However, here is the associated code I am using:

    naFixTime <- system.time( {
        if (fltrResponse) {  ## TRUE: there are no NA's in the response... cleared via earlier steps
            message(paste(iAm,": Missing values will now be imputed...\n", sep=""))
            try( dataSet <- rfImpute(dataSet[,!is.element(names(dataSet), response)],
                                     dataSet[,response]) )
        } else {  ## In this case, there is no "response" column in the data set
            message(paste(iAm,": Missing values will now be filled in with median",
                          " values or most frequent levels", sep=""))
            try( dataSet <- na.roughfix(dataSet) )
        }
    } )

As you can see, the "na.roughfix" call is made as simply as possible: I supply the entire dataSet (only parameters, no responses). I am not doing the prediction here (that is done later, and the prediction itself is not taking very long).

Here are some calculation times that I experienced:

    # rows    # cols    time to run na.roughfix
    =======   =======   =======================
    2046      2833      ~ 2 minutes
    2066      5626      ~ 6 minutes
    2134      14037     ~ 30 minutes

These numbers are on a Windows server using the 64-bit version of 'R'.

Regards,
Mike

"Telescopes and bathyscaphes and sonar probes of Scottish lakes,
Tacoma Narrows bridge collapse explained with abstract phase-space maps,
Some x-ray slides, a music score, Minard's Napoleonic war:
The most exciting frontier is charting what's already here."
  -- xkcd

--
Help protect Wikipedia. Donate now:
http://wikimediafoundation.org/wiki/Support_Wikipedia/en


On Thu, Jul 1, 2010 at 8:58 AM, Liaw, Andy <andy_l...@merck.com> wrote:

> You have not shown any code on exactly how you use na.roughfix(), so I
> can only guess.
>
> If you are doing something like:
>
>     randomForest(y ~ ., mybigdata, na.action=na.roughfix, ...)
>
> I would not be surprised that it's taking very long on large datasets.
> Most likely it's caused by the formula interface, not na.roughfix()
> itself.
>
> If that is your case, try doing the imputation beforehand and run
> randomForest() afterward; e.g.,
>
>     myroughfixed <- na.roughfix(mybigdata)
>     randomForest(myroughfixed[list.of.predictor.columns],
>                  myroughfixed[[myresponse]],...)
>
> HTH,
> Andy
>
> -----Original Message-----
> From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org]
> On Behalf Of Mike Williamson
> Sent: Wednesday, June 30, 2010 7:53 PM
> To: r-help
> Subject: [R] anyone know why package "RandomForest" na.roughfix is so
> slow??
>
> Hi all,
>
> I am using the package "random forest" for random forest predictions. I
> like the package. However, I have fairly large data sets, and it can often
> take *hours* just to go through the "na.roughfix" call, which simply goes
> through and cleans up any NA values to either the median (numerical data)
> or the most frequent occurrence (factors).
>
> I am going to start doing some comparisons between na.roughfix() and
> some apply() functions which, it seems, are able to do the same job more
> quickly. But I hesitate to duplicate a function that is already in the
> package, since I presume the na.roughfix should be as quick as possible
> and it should also be well "tailored" to the requirements of random forest.
>
> Has anyone else seen that this is really slow? (I haven't noticed
> rfImpute to be nearly as slow, but I cannot say for sure: my "predict"
> data sets are MUCH larger than my model data sets, so cleaning the
> prediction data set simply takes much longer.)
>
> If so, any ideas how to speed this up?
>
> Thanks!
> Mike

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
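
For reference, the kind of apply()-style comparison mentioned in the original question (median for numeric columns, most frequent level for factors) could look roughly like the sketch below. It uses base R only. The function name roughfix_sketch and the toy data frame are made up for illustration and are not part of the randomForest package, and nothing here is claimed to be faster than na.roughfix(); that would need to be checked with system.time() on the real data.

```r
## Hypothetical helper (NOT part of randomForest): fill NAs column by column,
## using the median for numeric columns and the most frequent level for factors.
roughfix_sketch <- function(df) {
    stopifnot(is.data.frame(df))
    df[] <- lapply(df, function(col) {
        miss <- is.na(col)
        if (!any(miss)) return(col)            # complete column: leave untouched
        if (is.numeric(col)) {
            col[miss] <- median(col, na.rm = TRUE)
        } else if (is.factor(col)) {
            counts <- table(col)               # table() drops NA by default
            if (sum(counts) > 0) {             # skip columns that are entirely NA
                col[miss] <- names(counts)[which.max(counts)]
            }
        }
        col                                    # other column types are returned as-is
    })
    df
}

## Small made-up example (assumed data, for illustration only)
toy <- data.frame(x = c(1, NA, 3, 10),
                  f = factor(c("a", "b", NA, "b")))
fixed <- roughfix_sketch(toy)
print(fixed)
```

To compare against the package function on the same data, both calls can be wrapped in system.time(), e.g. system.time(roughfix_sketch(dataSet)) versus system.time(na.roughfix(dataSet)), where dataSet is the data frame from the code earlier in this thread.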