Hi all,

I'm using R to find duplicates in a set of 6 files containing Part Number information. Before applying the intersect method to identify the duplicates I need to normalize the P/Ns: converting a P/N to uppercase if it is alphanumeric, and zero-padding it to 18 characters if it is purely numeric.
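For instance (made-up part numbers), "123" should become "000000000000000123" and "ab-12c" should become "AB-12C".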
When I apply the pn_formatting function (see code below) to the "Part Number" column of each file's data.frame (character vectors up to 18 characters long), it consumes a lot of memory: my computer (Windows XP SP3) starts swapping, CPU usage drops to zero, and a run takes hours to complete. The Part Number columns have between 7,000 and 80,000 records, and I have never had enough patience to wait for more than 17,000 records to finish.

Is there a way to find out which of the functions used below is the bottleneck: as.integer, is.na, sub, paste, nchar or toupper? Is there a profiler for R, and if so, where can I find documentation on how to use it?

The code:

# String contains digits only (can be converted to an integer)
digits_only <- function(x) {
  suppressWarnings(!is.na(as.integer(x)))
}

# Remove blanks at both ends of a string
trim <- function(x) {
  sub("^\\s+((.*\\S)\\s+)?$", "\\2", x)
}

# P/N formatting
pn_formatting <- function(pn_in) {
  pn_out <- trim(pn_in)
  if (digits_only(pn_out)) {
    # Zero padding
    pn_out <- paste("000000000000000000", pn_out, sep = "")
    pn_len <- nchar(pn_out)
    pn_out <- substr(pn_out, pn_len - 17, pn_len)
  } else {
    # Uppercase
    pn_out <- toupper(pn_out)
  }
  pn_out
}

Thanks,
Olivier.
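P.S. In case it helps to reproduce the problem, the sketch below shows roughly how I apply the function; the part numbers are made up and the data.frame is built inline (in the real script it comes from reading one of the 6 files, with 7,000 to 80,000 rows), but the element-wise application is the same. It also shows the only alternative I could think of for narrowing down the bottleneck, crude timing of each helper with system.time(), which is why I am hoping there is a proper profiler.

# Minimal sketch of how pn_formatting is applied (made-up part numbers;
# the real data.frame is read from one of the 6 files)
pns <- data.frame("Part Number" = c("  123  ", "ab-12c", "0045678", "XJ900"),
                  check.names = FALSE, stringsAsFactors = FALSE)

# Element-wise application, one record at a time
pns$"Part Number" <- sapply(pns$"Part Number", pn_formatting)

# Crude timing idea: run each helper in a loop over a made-up vector
# roughly the size of one of the real files
x <- rep(c("  123  ", "ab-12c", "0045678"), 10000)
system.time(for (i in seq_along(x)) digits_only(x[i]))
system.time(for (i in seq_along(x)) trim(x[i]))
system.time(for (i in seq_along(x)) toupper(x[i]))
system.time(for (i in seq_along(x)) pn_formatting(x[i]))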