Try it this way. It took less than 1 second for 50,000 elements:

> system.time({
+     x <- sample(50000)  # test data
+     x[sample(50000,10000)] <- 'asdfasdf'  # character strings
+     which.num <- grep("^[ 0-9]+$", x)  # find numbers
+     # convert to leading 0
+     x[which.num] <- sprintf("%018.0f", as.numeric(x[which.num]))
+     x[-which.num] <- toupper(x[-which.num])
+ })
   user  system elapsed
   0.25    0.00    0.25
> head(x,30)
 [1] "000000000000026550" "000000000000019100" "000000000000045961" "000000000000031473" "000000000000005031" "000000000000012266"
 [7] "000000000000034418" "000000000000042279" "000000000000041193" "ASDFASDF"           "000000000000005760" "000000000000035659"
[13] "ASDFASDF"           "000000000000008420" "000000000000042220" "ASDFASDF"           "000000000000039903" "000000000000032234"
[19] "000000000000024125" "000000000000032970" "000000000000006814" "000000000000000215" "ASDFASDF"           "000000000000045239"
[25] "ASDFASDF"           "ASDFASDF"           "000000000000043065" "ASDFASDF"           "000000000000007642" "000000000000019196"
>
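The same idea can be wrapped in a function that works on a whole column at once. This is only a sketch under a few assumptions: the name normalize_pn and the width argument are made up for illustration, it relies on trimws() and strrep() (available in recent versions of R), and it pads by pasting zeros onto the string rather than going through as.numeric(), since a double cannot represent every 18-digit number exactly. It also uses a logical index instead of x[-which.num], which selects nothing when which.num is empty.

```r
# Hypothetical helper (not from the original post): normalize a character
# vector of part numbers in one vectorized pass.
normalize_pn <- function(pn, width = 18L) {
  pn <- trimws(pn)                    # remove blanks at both ends
  is_num <- grepl("^[0-9]+$", pn)     # TRUE where the string is digits only
  out <- toupper(pn)                  # uppercase everything; digits are unchanged
  # Number of zeros each numeric P/N still needs (never negative, so
  # strings already longer than 'width' are left as they are).
  pad <- pmax(width - nchar(pn[is_num]), 0L)
  out[is_num] <- paste0(strrep("0", pad), pn[is_num])
  out
}

normalize_pn(c(" 26550", "asdfasdf"))
# [1] "000000000000026550" "ASDFASDF"
```

It would be applied to the column directly, e.g. df$PN <- normalize_pn(df$PN), where df and PN stand for whatever your data.frame and column are called, rather than calling a scalar function once per record.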
On Wed, Mar 18, 2009 at 12:16 PM, Olivier Boudry <olivier.bou...@gmail.com> wrote:
> Hi all,
>
> I'm using R to find duplicates in a set of 6 files containing Part Number
> information. Before applying the intersect method to identify the duplicates
> I need to normalize the P/Ns. Converting the P/N to uppercase if
> alphanumerical and applying an 18 char long zero padding if numerical.
>
> When I apply the pn_formatting function (see code below) to "Part Number"
> column of the data.frame (character vectors up to 18 char long) it consumes
> a lot of memory, my computer (Windows XP SP3) starts to swap memory, CPU
> goes to zero and completion takes hours to complete. Part Number columns can
> have from 7'000 to 80'000 records and I've never got enough patience to wait
> for completion of more than 17'000 records.
>
> Is there a way to find out which of the function used below is the
> bottleneck, as.integer, is.na, sub, paste, nchar, toupper? Is there a
> profiler for R and if yes where could I find some documentation on how to
> use it?
>
> The code:
>
> # String contains digits only (can be converted to an integer)
> digits_only <- function(x) { suppressWarnings(!is.na(as.integer(x))) }
>
> # Remove blanks at both ends of a string
> trim <- function (x) {
>   sub("^\\s+((.*\\S)\\s+)?$", "\\2", x)
> }
>
> # P/N formatting
> pn_formatting <- function(pn_in) {
>
>   pn_out = trim(pn_in)
>   if (digits_only(pn_out)) {
>
>     # Zero padding
>     pn_out <- paste("000000000000000000", pn_out, sep="")
>     pn_len <- nchar(pn_out)
>     pn_out <- substr(pn_out, pn_len - 17, pn_len)
>
>   } else {
>     # Uppercase
>     pn_out <- toupper(pn_out)
>   }
>   pn_out
> }
>
> Thanks,
>
> Olivier.
>
>        [[alternative HTML version deleted]]
>
> ______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>

--
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem that you are trying to solve?
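
On the profiler question: base R ships a sampling profiler, Rprof(), with summaryRprof() to tabulate the results; see ?Rprof and the profiling section of the "Writing R Extensions" manual. Below is a rough sketch of how it could be pointed at a per-record formatter. The function pn_one is a made-up stand-in modelled on the quoted pn_formatting, the data are the same kind of test data as above, and it assumes your R build has profiling enabled (most do).

```r
# Hypothetical per-record formatter, modelled on the quoted pn_formatting;
# it exists only to give the profiler something to measure.
pn_one <- function(p) {
  if (grepl("^[0-9]+$", p)) {
    p <- paste("000000000000000000", p, sep = "")
    n <- nchar(p)
    substr(p, n - 17, n)              # keep the last 18 characters
  } else {
    toupper(p)
  }
}

x <- as.character(sample(50000))      # test data
x[sample(50000, 10000)] <- "asdfasdf" # character strings

prof_file <- tempfile(fileext = ".out")
Rprof(prof_file, interval = 0.01)     # start the sampling profiler
res <- vapply(x, pn_one, character(1), USE.NAMES = FALSE)
Rprof(NULL)                           # stop profiling

# A very short run may record no samples, in which case summaryRprof()
# can fail; guard against that instead of assuming output exists.
prof <- tryCatch(summaryRprof(prof_file), error = function(e) NULL)
if (!is.null(prof)) print(head(prof$by.self))
```

The by.self table lists the functions in which the sampled time was actually spent, which is the breakdown asked about for as.integer, sub, paste and so on.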