Re: [R] anyone know why package "RandomForest" na.roughfix is so slow??

Mike Williamson Thu, 01 Jul 2010 17:42:10 -0700

Hadley,

    Thanks!  Yes... as.data.frame() is quite slow.  (And it forces the
column names to become "acceptable" names, which is a hassle to fix all the
time.)  I just hadn't thought of something as clever as what you wrote
below.
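
On the column-name point, a minimal sketch (the toy columns below are
hypothetical, not taken from my real data) of how the default conversion
rewrites non-syntactic names, and how check.names = FALSE in data.frame()
appears to avoid the clean-up step:

```r
# Hypothetical toy columns whose names are not syntactic R names.
cols <- list("sensor 1" = c(1.5, NA, 3),
             "lot-id"   = factor(c("a", "b", NA)))

# Default conversion: names are passed through make.names().
mangled <- as.data.frame(cols)
print(names(mangled))

# check.names = FALSE keeps the names exactly as supplied.
kept <- data.frame(cols, check.names = FALSE)
print(names(kept))
```

With the names kept as supplied, the "save origCols, then restore them" step
should no longer be needed.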


    I'll try out this suggestion.  :)
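
One way to try it out, as a rough sketch rather than a benchmark: the two
functions are restated from the message quoted below (with median() in place
of median.default()), and the toy data frame and its column names are made up
for illustration.

```r
# Per-column fix, restated from the roughfix() quoted below.
roughfix <- function(x) {
  missing <- is.na(x)
  if (!any(missing)) return(x)
  if (is.numeric(x)) {
    x[missing] <- median(x[!missing])           # median of observed values
  } else if (is.factor(x)) {
    freq <- table(x)                            # counts per level, NAs excluded
    x[missing] <- names(freq)[which.max(freq)]  # most frequent level
  } else {
    stop("roughfix only works for numeric or factor")
  }
  x
}

# Rebuild the data frame by hand instead of calling as.data.frame().
na.roughfix2 <- function(object, ...) {
  res <- lapply(object, roughfix)
  structure(res, class = "data.frame", row.names = seq_len(nrow(object)))
}

# Hypothetical toy data with non-syntactic column names.
toy <- data.frame("sensor 1" = c(1, NA, 3, 10),
                  "lot-id"   = factor(c("a", "b", NA, "b")),
                  check.names = FALSE)

fixed <- na.roughfix2(toy)
print(fixed)
print(names(fixed))
```

Because lapply() keeps the list names and structure() does not check them,
the original column names should come through untouched.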

                              Mike

"Telescopes and bathyscaphes and sonar probes of Scottish lakes,
Tacoma Narrows bridge collapse explained with abstract phase-space maps,
Some x-ray slides, a music score, Minard's Napoleonic war:
The most exciting frontier is charting what's already here."
 -- xkcd

--
Help protect Wikipedia. Donate now:
http://wikimediafoundation.org/wiki/Support_Wikipedia/en


On Thu, Jul 1, 2010 at 5:07 PM, Hadley Wickham <had...@rice.edu> wrote:

> Here's another version that's a bit easier to read:
>
> na.roughfix2 <- function (object, ...) {
>  # Fix each column, then rebuild the data frame by hand (no as.data.frame)
>  res <- lapply(object, roughfix)
>  structure(res, class = "data.frame", row.names = seq_len(nrow(object)))
> }
>
> roughfix <- function(x) {
>  missing <- is.na(x)
>  if (!any(missing)) return(x)
>
>  if (is.numeric(x)) {
>    # Numeric: fill with the median of the observed values
>    x[missing] <- median.default(x[!missing])
>  } else if (is.factor(x)) {
>    # Factor: fill with the most frequent level
>    freq <- table(x)
>    x[missing] <- names(freq)[which.max(freq)]
>  } else {
>    stop("na.roughfix only works for numeric or factor")
>  }
>  x
> }
>
> I'm cheating a bit because as.data.frame is so slow.
>
> Hadley
>
> On Thu, Jul 1, 2010 at 6:44 PM, Mike Williamson <this.is....@gmail.com>
> wrote:
> > Jim, Andy,
> >
> >    Thanks for your suggestions!
> >
> >    I found some time today to futz around with it, and I found a "home
> > made" script to fill in NA values to be much quicker.  For those who are
> > interested, instead of using:
> >
> >          dataSet <- na.roughfix(dataSet)
> >
> >
> >
> >    I used:
> >
> >                    origCols <- names(dataSet)
> >                    ## Fix numeric values...
> >                    dataSet <- as.data.frame(lapply(dataSet, FUN=function(x) {
> >                        if(!is.numeric(x)) { x } else {
> >                            ifelse(is.na(x), median(x, na.rm=TRUE), x) } } ),
> >                                             row.names=row.names(dataSet) )
> >                    ## Fix factors...
> >                    ## (which.max(table(x)) is the index of the most frequent level)
> >                    dataSet <- as.data.frame(lapply(dataSet, FUN=function(x) {
> >                        if(!is.factor(x)) { x } else {
> >                            levels(x)[ifelse(!is.na(x), x, which.max(table(x)))] } } ),
> >                                             row.names=row.names(dataSet) )
> >                    names(dataSet) <- origCols
> >
> >
> >
> >    In one case study that I ran, the na.roughfix() algo took 296 seconds
> > whereas the homemade one above took 16 seconds.
> >
> >                                      Regards,
> >                                            Mike
> >
> >
> >
> >
> > On Thu, Jul 1, 2010 at 10:05 AM, Liaw, Andy <andy_l...@merck.com> wrote:
> >
> >>  You need to isolate the problem further, or give more detail about your
> >> data.  This is what I get:
> >>
> >> R> nr <- 2134
> >> R> nc <- 14037
> >> R> x <- matrix(runif(nr*nc), nr, nc)
> >> R> n.na <- round(nr*nc/10)
> >> R> x[sample(nr*nc, n.na)] <- NA
> >> R> system.time(x.fixed <- na.roughfix(x))
> >>    user  system elapsed
> >>    8.44    0.39    8.85
> >> R 2.11.1, randomForest 4.5-35, Windows XP (32-bit), Thinkpad T61 with
> >> 2GB RAM.
> >>
> >> Andy
> >>
> >>  ------------------------------
> >> *From:* Mike Williamson [mailto:this.is....@gmail.com]
> >> *Sent:* Thursday, July 01, 2010 12:48 PM
> >> *To:* Liaw, Andy
> >> *Cc:* r-help
> >> *Subject:* Re: [R] anyone know why package "RandomForest" na.roughfix is
> >> so slow??
> >>
> >> Andy,
> >>
> >>     You're right, I didn't supply any code, because my call was very
> >> simple and it was the call itself in question.  However, here is the
> >> associated code I am using:
> >>
> >>
> >>         naFixTime <- system.time( {
> >>             if (fltrResponse) {
> >>                 ## TRUE: there are no NA's in the response...
> >>                 ## cleared via earlier steps
> >>                 message(paste(iAm,": Missing values will now be imputed...\n",
> >>                               sep=""))
> >>                 try( dataSet <- rfImpute(dataSet[,!is.element(names(dataSet),
> >>                                                               response)],
> >>                                          dataSet[,response]) )
> >>             } else {
> >>                 ## In this case, there is no "response" column in the data set
> >>                 message(paste(iAm,": Missing values will now be filled in with median",
> >>                               " values or most frequent levels", sep=""))
> >>                 try( dataSet <- na.roughfix(dataSet) )
> >>             }
> >>         } )
> >>
> >>
> >>
> >>     As you can see, the "na.roughfix" call is made as simply as possible:
> >> I supply the entire dataSet (only parameters, no responses).  I am not
> >> doing the prediction here (that is done later, and the prediction itself
> >> is not taking very long).
> >>     Here are some calculation times that I experienced:
> >>
> >> # rows       # cols       time to run na.roughfix
> >> =======     =======     ====================
> >>   2046          2833             ~ 2 minutes
> >>   2066          5626             ~ 6 minutes
> >>   2134         14037             ~ 30 minutes
> >>
> >>     These numbers are on a Windows server using the 64-bit version of 'R'.
> >>
> >>                                           Regards,
> >>                                                    Mike
> >>
> >>
> >>
> >> On Thu, Jul 1, 2010 at 8:58 AM, Liaw, Andy <andy_l...@merck.com> wrote:
> >>
> >>> You have not shown any code on exactly how you use na.roughfix(), so I
> >>> can only guess.
> >>>
> >>> If you are doing something like:
> >>>
> >>>  randomForest(y ~ ., mybigdata, na.action=na.roughfix, ...)
> >>>
> >>> I would not be surprised that it's taking very long on large datasets.
> >>> Most likely it's caused by the formula interface, not na.roughfix()
> >>> itself.
> >>>
> >>> If that is your case, try doing the imputation beforehand and run
> >>> randomForest() afterward; e.g.,
> >>>
> >>> myroughfixed <- na.roughfix(mybigdata)
> >>> randomForest(myroughfixed[list.of.predictor.columns],
> >>> myroughfixed[[myresponse]],...)
> >>>
> >>> HTH,
> >>> Andy
> >>>
> >>> -----Original Message-----
> >>> From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org]
> >>> On Behalf Of Mike Williamson
> >>> Sent: Wednesday, June 30, 2010 7:53 PM
> >>> To: r-help
> >>> Subject: [R] anyone know why package "RandomForest" na.roughfix is so
> >>> slow??
> >>>
> >>> Hi all,
> >>>
> >>>    I am using the package "randomForest" for random forest predictions.
> >>> I like the package.  However, I have fairly large data sets, and it can
> >>> often take *hours* just to go through the "na.roughfix" call, which
> >>> simply goes through and replaces any NA values with either the median
> >>> (numerical data) or the most frequent level (factors).
> >>>    I am going to start doing some comparisons between na.roughfix() and
> >>> some apply() functions which, it seems, are able to do the same job more
> >>> quickly.  But I hesitate to duplicate a function that is already in the
> >>> package, since I presume that na.roughfix should be as quick as possible
> >>> and should also be well "tailored" to the requirements of random forest.
> >>>
> >>>    Has anyone else seen that this is really slow?  (I haven't noticed
> >>> rfImpute to be nearly as slow, but I cannot say for sure:  my "predict"
> >>> data sets are MUCH larger than my model data sets, so cleaning the
> >>> prediction data set simply takes much longer.)
> >>>    If so, any ideas how to speed this up?
> >>>
> >>>                              Thanks!
> >>>                                   Mike
> >>>
> >>> ______________________________________________
> >>> R-help@r-project.org mailing list
> >>> https://stat.ethz.ch/mailman/listinfo/r-help
> >>> PLEASE do read the posting guide
> >>> http://www.R-project.org/posting-guide.html
> >>> and provide commented, minimal, self-contained, reproducible code.
>
> --
> Assistant Professor / Dobelman Family Junior Chair
> Department of Statistics / Rice University
> http://had.co.nz/
>

