Bill, Jim and Martin,

Great! The code is much faster and even looks more R-ish. I'm very new to R and have some difficulty getting rid of my procedural programming habits.
Many thanks to all for the great help.

Olivier.

On Wed, Mar 18, 2009 at 9:38 PM, William Dunlap <wdun...@tibco.com> wrote:
> Olivier,
>
> You can profile R code with Rprof(). E.g.,
>
>   Rprof(tmp <- tempfile()) # start profiling, saving results in a file
>   ... run your code: sapply(x, pn_formatting) ...
>   Rprof(NULL) # stop profiling
>   summaryRprof(tmp) # analyze the file and present results in a pair of data.frames
>
> $by.self
>                      self.time self.pct total.time total.pct
> sub                       2.26     24.3       2.26      24.3
> structure                 0.68      7.3       1.42      15.3
> FUN                       0.66      7.1       8.82      94.8
> withCallingHandlers       0.62      6.7       4.14      44.5
> paste                     0.54      5.8       0.56       6.0
> toupper                   0.44      4.7       0.46       4.9
> as.integer                0.42      4.5       3.34      35.9
> makeRestartList           0.38      4.1       1.36      14.6
> unlist                    0.34      3.7       0.40       4.3
> substr                    0.30      3.2       0.36       3.9
> ...
> $by.total
>                      total.time total.pct self.time self.pct
> sapply                     9.30     100.0      0.00      0.0
> lapply                     8.88      95.5      0.08      0.9
> FUN                        8.82      94.8      0.66      7.1
> digits_only                4.18      44.9      0.02      0.2
> suppressWarnings           4.16      44.7      0.02      0.2
> withCallingHandlers        4.14      44.5      0.62      6.7
> as.integer                 3.34      35.9      0.42      4.5
> withRestarts               2.92      31.4      0.10      1.1
> .signalSimpleWarning       2.92      31.4      0.00      0.0
> trim                       2.34      25.2      0.08      0.9
> sub                        2.26      24.3      2.26      24.3
>
> You didn't mention that you used sapply() on your pn_formatting
> function, which I think you must have since you use a non-vectorized
> if statement in it. If you vectorize the numeric/non-numeric choice,
> as in the following code, you get a huge speedup because you don't
> have to use sapply:
>
> pn_formatting1 <-
> function(pn_in) {
>   pn_out = trim(pn_in)
>   numeric <- digits_only(pn_out)
>   pn_out[!numeric] <- toupper(pn_out[!numeric])
>   pn_out[numeric] <- {
>     # Zero padding
>     tmp <- paste("000000000000000000", pn_out[numeric], sep="")
>     pn_len <- nchar(tmp)
>     substr(tmp, pn_len - 17, pn_len)
>   }
>   pn_out
> }
>
> Jim's code is a bit cleaner (to my taste) but runs at the same
> speed as yours after this simple modification.
> His contains an error, in that it uses integer subscripts and does not
> check that there is at least one numeric entry in the input (in that
> case pn_in[-which.num] is empty instead of returning all of pn_in, and
> sprintf() dies because one of its arguments is 0-long).
>
> Bill Dunlap
> TIBCO Software Inc - Spotfire Division
> wdunlap tibco.com
>
> -----------------------------------------------------------------------------------
> [R] Profiling question: string formatting extremely slow
>
> jim holtman jholtman at gmail.com
> Wed Mar 18 18:09:37 CET 2009
>
> Try this way. Took less than 1 second for 50,000
>
> > system.time({
> + x <- sample(50000) # test data
> + x[sample(50000,10000)] <- 'asdfasdf' # character strings
> + which.num <- grep("^[ 0-9]+$", x) # find numbers
> + # convert to leading 0
> + x[which.num] <- sprintf("%018.0f", as.numeric(x[which.num]))
> + x[-which.num] <- toupper(x[-which.num])
> + })
>    user  system elapsed
>    0.25    0.00    0.25
>
> > head(x,30)
>  [1] "000000000000026550" "000000000000019100" "000000000000045961"
>      "000000000000031473" "000000000000005031" "000000000000012266"
>  [7] "000000000000034418" "000000000000042279" "000000000000041193"
>      "ASDFASDF"           "000000000000005760" "000000000000035659"
> [13] "ASDFASDF"           "000000000000008420" "000000000000042220"
>      "ASDFASDF"           "000000000000039903" "000000000000032234"
> [19] "000000000000024125" "000000000000032970" "000000000000006814"
>      "000000000000000215" "ASDFASDF"           "000000000000045239"
> [25] "ASDFASDF"           "ASDFASDF"           "000000000000043065"
>      "ASDFASDF"           "000000000000007642" "000000000000019196"
>
> On Wed, Mar 18, 2009 at 12:16 PM, Olivier Boudry <olivier.boudry at gmail.com> wrote:
> > Hi all,
> >
> > I'm using R to find duplicates in a set of 6 files containing Part Number
> > information.
> > Before applying the intersect method to identify the duplicates
> > I need to normalize the P/Ns: converting the P/N to uppercase if
> > alphanumerical and applying an 18 char long zero padding if numerical.
> >
> > When I apply the pn_formatting function (see code below) to the "Part Number"
> > column of the data.frame (character vectors up to 18 char long) it consumes
> > a lot of memory, my computer (Windows XP SP3) starts to swap memory, CPU
> > goes to zero and completion takes hours. Part Number columns can
> > have from 7'000 to 80'000 records and I've never had enough patience to wait
> > for completion of more than 17'000 records.
> >
> > Is there a way to find out which of the functions used below is the
> > bottleneck: as.integer, is.na, sub, paste, nchar, toupper? Is there a
> > profiler for R and, if yes, where could I find some documentation on how to
> > use it?
> >
> > The code:
> >
> > # String contains digits only (can be converted to an integer)
> > digits_only <- function(x) { suppressWarnings(!is.na(as.integer(x))) }
> >
> > # Remove blanks at both ends of a string
> > trim <- function (x) {
> >   sub("^\\s+((.*\\S)\\s+)?$", "\\2", x)
> > }
> >
> > # P/N formatting
> > pn_formatting <- function(pn_in) {
> >
> >   pn_out = trim(pn_in)
> >   if (digits_only(pn_out)) {
> >
> >     # Zero padding
> >     pn_out <- paste("000000000000000000", pn_out, sep="")
> >     pn_len <- nchar(pn_out)
> >     pn_out <- substr(pn_out, pn_len - 17, pn_len)
> >
> >   } else {
> >     # Uppercase
> >     pn_out <- toupper(pn_out)
> >   }
> >   pn_out
> > }
> >
> > Thanks,
> >
> > Olivier.
> >
> > [[alternative HTML version deleted]]
> >
> > ______________________________________________
> > R-help at r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
>
> --
> Jim Holtman
> Cincinnati, OH
> +1 513 646 9390
>
> What is the problem that you are trying to solve?
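
For anyone reading this thread later, here is a minimal sketch of how the two suggestions might be combined: Jim's regular-expression test for digits and Bill's logical subscripts. The function name pn_normalize and the sample inputs are made up for illustration and do not appear in any of the posts. The sketch assumes, as stated in the original question, that part numbers are at most 18 characters long. It pads in character form, so it does not depend on as.integer() or as.numeric() being able to represent a long digit string.

```r
# Sketch only: pn_normalize is a hypothetical name, not code from the thread.
pn_normalize <- function(pn_in) {
  # Remove blanks at both ends of each string
  pn_out <- gsub("^\\s+|\\s+$", "", pn_in)
  # Logical mask, same length as the input: TRUE where the entry is digits only
  is_num <- grepl("^[0-9]+$", pn_out)
  # Uppercase the non-numeric entries
  pn_out[!is_num] <- toupper(pn_out[!is_num])
  # Zero-pad the numeric entries to 18 characters (inputs assumed <= 18 long)
  if (any(is_num)) {
    tmp <- paste0("000000000000000000", pn_out[is_num])
    pn_len <- nchar(tmp)
    pn_out[is_num] <- substr(tmp, pn_len - 17, pn_len)
  }
  pn_out
}

# Made-up input that contains no numeric entry at all
x <- c(" ab-12 ", "xyz")
which_num <- grep("^[0-9]+$", x)   # integer(0)
length(x[-which_num])              # 0: a negative empty index selects nothing
pn_normalize(x)                    # the logical mask still reaches both entries
```

The logical mask always has the same length as the input, so negating it stays well defined even when no entry (or every entry) is numeric, which is the case that trips up which()/grep() style integer subscripts.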