Re: [R] anyone know why package "RandomForest" na.roughfix is so slow??

Mike Williamson Thu, 01 Jul 2010 09:48:36 -0700

Andy,

    You're right, I didn't supply any code, because my call was very simple
and it was the call itself that was in question.  However, here is the
associated code I am using:



        naFixTime <- system.time( {
            if (fltrResponse) {
                ## TRUE: there are no NA's in the response...
                ## cleared via earlier steps
                message(paste(iAm, ": Missing values will now be imputed...\n",
                              sep=""))
                try( dataSet <- rfImpute(dataSet[, !is.element(names(dataSet),
                                                               response)],
                                         dataSet[, response]) )
            } else {
                ## In this case, there is no "response" column in the data set
                message(paste(iAm, ": Missing values will now be filled in",
                              " with median values or most frequent levels",
                              sep=""))
                try( dataSet <- na.roughfix(dataSet) )
            }
        } )



    As you can see, the "na.roughfix" call is made as simply as possible:  I
supply the entire dataSet (predictors only, no response column).  I am not
doing the prediction here (that is done later, and the prediction itself does
not take very long).
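
    For comparison purposes, the imputation described in this thread (median
for numeric columns, most frequent level for factors) can be sketched in base R
with lapply().  This is only an illustration of that behaviour, not the
package's implementation: roughFixSketch and the toy data frame are
hypothetical names, and na.roughfix() itself may handle edge cases differently.

```r
# Hypothetical helper, NOT part of randomForest: a column-wise sketch of
# "median for numerics, most frequent level for factors".
roughFixSketch <- function(df) {
  stopifnot(is.data.frame(df))
  df[] <- lapply(df, function(col) {
    missing <- is.na(col)
    if (!any(missing)) return(col)          # nothing to fill in this column
    if (is.numeric(col)) {
      col[missing] <- median(col, na.rm = TRUE)
    } else if (is.factor(col)) {
      counts <- table(col)                  # table() ignores NA by default
      col[missing] <- names(counts)[which.max(counts)]
    }
    col                                     # other column types are left as-is
  })
  df
}

# Toy data (made up for illustration)
toy <- data.frame(x = c(1, NA, 3, 10),
                  f = factor(c("a", "b", NA, "b")))
fixed <- roughFixSketch(toy)
print(fixed)
```

    Wrapping both roughFixSketch(dataSet) and na.roughfix(dataSet) in
system.time() on the same data would show whether the column-wise approach
is actually faster in a given setup.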
    Here are some calculation times that I experienced:

# rows       # cols       time to run na.roughfix
=======     =======     ====================
  2046          2833             ~ 2 minutes
  2066          5626             ~ 6 minutes
  2134         14037             ~ 30 minutes

    These numbers are on a Windows server using the 64-bit version of 'R'.
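
    Since the row counts above are nearly constant, the table mostly shows how
run time scales with the number of columns.  A rough way to quantify that is a
log-log fit, sketched below.  The numbers are the approximate timings from the
table, and three points are far too few for more than a ballpark figure.

```r
# Approximate timings copied from the table above (rows are roughly constant)
timings <- data.frame(
  cols    = c(2833, 5626, 14037),
  minutes = c(2, 6, 30)
)

# Fit log(time) = a + b * log(cols); b is the empirical scaling exponent.
fit <- lm(log(minutes) ~ log(cols), data = timings)
exponent <- unname(coef(fit)[2])
print(round(exponent, 2))
```

    An exponent of 1 would mean time proportional to the number of columns;
a value clearly above 1 means the cost per column grows as the data set gets
wider.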

                                          Regards,
                                                   Mike


"Telescopes and bathyscaphes and sonar probes of Scottish lakes,
Tacoma Narrows bridge collapse explained with abstract phase-space maps,
Some x-ray slides, a music score, Minard's Napoleonic war:
The most exciting frontier is charting what's already here."
 -- xkcd

--
Help protect Wikipedia. Donate now:
http://wikimediafoundation.org/wiki/Support_Wikipedia/en


On Thu, Jul 1, 2010 at 8:58 AM, Liaw, Andy <andy_l...@merck.com> wrote:

> You have not shown any code on exactly how you use na.roughfix(), so I
> can only guess.
>
> If you are doing something like:
>
>  randomForest(y ~ ., mybigdata, na.action=na.roughfix, ...)
>
> I would not be surprised that it's taking very long on large datasets.
> Most likely it's caused by the formula interface, not na.roughfix()
> itself.
>
> If that is your case, try doing the imputation beforehand and run
> randomForest() afterward; e.g.,
>
> myroughfixed <- na.roughfix(mybigdata)
> randomForest(myroughfixed[list.of.predictor.columns],
> myroughfixed[[myresponse]],...)
>
> HTH,
> Andy
>
> -----Original Message-----
> From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org]
> On Behalf Of Mike Williamson
> Sent: Wednesday, June 30, 2010 7:53 PM
> To: r-help
> Subject: [R] anyone know why package "RandomForest" na.roughfix is so
> slow??
>
> Hi all,
>
>    I am using the package "random forest" for random forest predictions.
> I like the package.  However, I have fairly large data sets, and it can
> often take *hours* just to go through the "na.roughfix" call, which
> simply goes through and cleans up any NA values to either the median
> (numerical data) or the most frequent occurrence (factors).
>    I am going to start doing some comparisons between na.roughfix() and
> some apply() functions which, it seems, are able to do the same job more
> quickly.  But I hesitate to duplicate a function that is already in the
> package, since I presume the na.roughfix should be as quick as possible
> and it should also be well "tailored" to the requirements of random
> forest.
>
>    Has anyone else seen that this is really slow?  (I haven't noticed
> rfImpute to be nearly as slow, but I cannot say for sure:  my "predict"
> data sets are MUCH larger than my model data sets, so cleaning the
> prediction data set simply takes much longer.)
>    If so, any ideas how to speed this up?
>
>                              Thanks!
>                                   Mike

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
