Hadley,

Thanks! Yes, as.data.frame() is quite slow. (It also forces the column names to become "acceptable" names, which is a hassle to fix every time.) I just hadn't thought of something as clever as what you wrote below.
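For anyone following along, the name mangling is easy to reproduce. Below is a minimal sketch on a made-up two-column list (the column names and values are purely illustrative, not from my data): as.data.frame() passes the names through make.names() by default, whereas assembling the data frame with structure(), as in your version, leaves them alone.

```r
## Toy list with non-syntactic column names (illustrative only)
cols <- list(`my col` = c(1, NA, 3), `2nd` = c("a", "b", "a"))

## as.data.frame() rewrites the names via make.names() by default
slow <- as.data.frame(cols)
names(slow)   # "my.col" "X2nd"

## Building the data frame by hand keeps the names as they are
fast <- structure(cols, class = "data.frame",
                  row.names = seq_len(length(cols[[1]])))
names(fast)   # "my col" "2nd"
```

The hand-built version skips all of the checking that as.data.frame() does, so it is only safe when every column already has the same length.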
I'll try out this suggestion.  :)

Mike

"Telescopes and bathyscaphes and sonar probes of Scottish lakes,
Tacoma Narrows bridge collapse explained with abstract phase-space maps,
Some x-ray slides, a music score, Minard's Napoleonic war:
The most exciting frontier is charting what's already here."
  -- xkcd

--
Help protect Wikipedia. Donate now:
http://wikimediafoundation.org/wiki/Support_Wikipedia/en

On Thu, Jul 1, 2010 at 5:07 PM, Hadley Wickham <had...@rice.edu> wrote:

> Here's another version that's a bit easier to read:
>
> na.roughfix2 <- function (object, ...) {
>   res <- lapply(object, roughfix)
>   structure(res, class = "data.frame", row.names = seq_len(nrow(object)))
> }
>
> roughfix <- function(x) {
>   missing <- is.na(x)
>   if (!any(missing)) return(x)
>
>   if (is.numeric(x)) {
>     x[missing] <- median.default(x[!missing])
>   } else if (is.factor(x)) {
>     freq <- table(x)
>     x[missing] <- names(freq)[which.max(freq)]
>   } else {
>     stop("na.roughfix only works for numeric or factor")
>   }
>   x
> }
>
> I'm cheating a bit because as.data.frame is so slow.
>
> Hadley
>
> On Thu, Jul 1, 2010 at 6:44 PM, Mike Williamson <this.is....@gmail.com>
> wrote:
> > Jim, Andy,
> >
> > Thanks for your suggestions!
> >
> > I found some time today to futz around with it, and I found a "home
> > made" script to fill in NA values to be much quicker. For those who are
> > interested, instead of using:
> >
> > dataSet <- na.roughfix(dataSet)
> >
> > I used:
> >
> > origCols <- names(dataSet)
> > ## Fix numeric values...
> > dataSet <- as.data.frame(lapply(dataSet, FUN=function(x) {
> >     if(!is.numeric(x)) { x } else {
> >         ifelse(is.na(x), median(x, na.rm=TRUE), x) } } ),
> >     row.names=row.names(dataSet) )
> > ## Fix factors...
> > dataSet <- as.data.frame(lapply(dataSet, FUN=function(x) {
> >     if(!is.factor(x)) { x } else {
> >         ## which.max() gives the index of the most frequent level
> >         levels(x)[ifelse(!is.na(x), x, which.max(table(x)))] } } ),
> >     row.names=row.names(dataSet) )
> > names(dataSet) <- origCols
> >
> > In one case study that I ran, the na.roughfix() algo took 296 seconds
> > whereas the homemade one above took 16 seconds.
> >
> > Regards,
> > Mike
> >
> > On Thu, Jul 1, 2010 at 10:05 AM, Liaw, Andy <andy_l...@merck.com> wrote:
> >
> >> You need to isolate the problem further, or give more detail about your
> >> data. This is what I get:
> >>
> >> R> nr <- 2134
> >> R> nc <- 14037
> >> R> x <- matrix(runif(nr*nc), nr, nc)
> >> R> n.na <- round(nr*nc/10)
> >> R> x[sample(nr*nc, n.na)] <- NA
> >> R> system.time(x.fixed <- na.roughfix(x))
> >>    user  system elapsed
> >>    8.44    0.39    8.85
> >>
> >> R 2.11.1, randomForest 4.5-35, Windows XP (32-bit), Thinkpad T61 with 2GB
> >> ram.
> >>
> >> Andy
> >>
> >> ------------------------------
> >> *From:* Mike Williamson [mailto:this.is....@gmail.com]
> >> *Sent:* Thursday, July 01, 2010 12:48 PM
> >> *To:* Liaw, Andy
> >> *Cc:* r-help
> >> *Subject:* Re: [R] anyone know why package "RandomForest" na.roughfix is
> >> so slow??
> >>
> >> Andy,
> >>
> >> You're right, I didn't supply any code, because my call was very simple
> >> and it was the call itself at question. However, here is the associated
> >> code I am using:
> >>
> >> naFixTime <- system.time( {
> >>     if (fltrResponse) {  ## TRUE: there are no NA's in the response...
> >>                          ## cleared via earlier steps
> >>         message(paste(iAm,": Missing values will now be imputed...\n", sep=""))
> >>         try( dataSet <- rfImpute(dataSet[,!is.element(names(dataSet), response)],
> >>                                  dataSet[,response]) )
> >>     } else {  ## In this case, there is no "response" column in the data set
> >>         message(paste(iAm,": Missing values will now be filled in with median",
> >>                       " values or most frequent levels", sep=""))
> >>         try( dataSet <- na.roughfix(dataSet) )
> >>     }
> >> } )
> >>
> >> As you can see, the "na.roughfix" call is made as simply as possible:
> >> I supply the entire dataSet (only parameters, no responses). I am not doing
> >> the prediction here (that is done later, and the prediction itself is not
> >> taking very long).
> >> Here are some calculation times that I experienced:
> >>
> >> # rows    # cols    time to run na.roughfix
> >> =======   =======   ====================
> >> 2046      2833      ~ 2 minutes
> >> 2066      5626      ~ 6 minutes
> >> 2134      14037     ~ 30 minutes
> >>
> >> These numbers are on a Windows server using the 64-bit version of 'R'.
> >>
> >> Regards,
> >> Mike
> >>
> >> On Thu, Jul 1, 2010 at 8:58 AM, Liaw, Andy <andy_l...@merck.com> wrote:
> >>
> >>> You have not shown any code on exactly how you use na.roughfix(), so I
> >>> can only guess.
> >>>
> >>> If you are doing something like:
> >>>
> >>> randomForest(y ~ ., mybigdata, na.action=na.roughfix, ...)
> >>>
> >>> I would not be surprised that it's taking very long on large datasets.
> >>> Most likely it's caused by the formula interface, not na.roughfix()
> >>> itself.
> >>>
> >>> If that is your case, try doing the imputation beforehand and run
> >>> randomForest() afterward; e.g.,
> >>>
> >>> myroughfixed <- na.roughfix(mybigdata)
> >>> randomForest(myroughfixed[list.of.predictor.columns],
> >>>              myroughfixed[[myresponse]],...)
> >>>
> >>> HTH,
> >>> Andy
> >>>
> >>> -----Original Message-----
> >>> From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org]
> >>> On Behalf Of Mike Williamson
> >>> Sent: Wednesday, June 30, 2010 7:53 PM
> >>> To: r-help
> >>> Subject: [R] anyone know why package "RandomForest" na.roughfix is so
> >>> slow??
> >>>
> >>> Hi all,
> >>>
> >>> I am using the package "random forest" for random forest predictions. I
> >>> like the package. However, I have fairly large data sets, and it can often
> >>> take *hours* just to go through the "na.roughfix" call, which simply goes
> >>> through and cleans up any NA values to either the median (numerical data)
> >>> or the most frequent occurrence (factors).
> >>> I am going to start doing some comparisons between na.roughfix() and
> >>> some apply() functions which, it seems, are able to do the same job more
> >>> quickly. But I hesitate to duplicate a function that is already in the
> >>> package, since I presume the na.roughfix should be as quick as possible
> >>> and it should also be well "tailored" to the requirements of random forest.
> >>>
> >>> Has anyone else seen that this is really slow? (I haven't noticed
> >>> rfImpute to be nearly as slow, but I cannot say for sure: my "predict"
> >>> data sets are MUCH larger than my model data sets, so cleaning the
> >>> prediction data set simply takes much longer.)
> >>> If so, any ideas how to speed this up?
> >>>
> >>> Thanks!
> >>> Mike
> >>>
> >>>        [[alternative HTML version deleted]]
> >>>
> >>> ______________________________________________
> >>> R-help@r-project.org mailing list
> >>> https://stat.ethz.ch/mailman/listinfo/r-help
> >>> PLEASE do read the posting guide
> >>> http://www.R-project.org/posting-guide.html
> >>> and provide commented, minimal, self-contained, reproducible code.
> >>> Notice: This e-mail message, together with any attachments, contains
> >>> information of Merck & Co., Inc. (One Merck Drive, Whitehouse Station,
> >>> New Jersey, USA 08889), and/or its affiliates Direct contact information
> >>> for affiliates is available at
> >>> http://www.merck.com/contact/contacts.html) that may be confidential,
> >>> proprietary copyrighted and/or legally privileged. It is intended solely
> >>> for the use of the individual or entity named on this message. If you are
> >>> not the intended recipient, and have received this message in error,
> >>> please notify us immediately by reply e-mail and then delete it from
> >>> your system.
>
> --
> Assistant Professor / Dobelman Family Junior Chair
> Department of Statistics / Rice University
> http://had.co.nz/