Hi all,

I'm using R to find duplicates in a set of 6 files containing Part Number
information. Before applying the intersect method to identify the duplicates
I need to normalize the P/Ns: convert a P/N to uppercase if it is
alphanumeric, or zero-pad it to 18 characters if it is numeric.
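
For example, with made-up P/Ns, the intended result would be:

pn_formatting(" ab-12c ")   # expected "AB-12C"              (trimmed, uppercased)
pn_formatting("1234")       # expected "000000000000001234"  (zero-padded to 18)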

When I apply the pn_formatting function (see code below) to the "Part Number"
column of a data.frame (character vectors up to 18 characters long), it
consumes a lot of memory: my computer (Windows XP SP3) starts to swap, the
CPU drops to zero, and the run takes hours to finish. Part Number columns can
have from 7'000 to 80'000 records, and I have never had the patience to wait
for more than 17'000 records to complete.
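
For what it's worth, I apply it roughly like this (simplified; "pn_df" is just
a placeholder for my real data.frame):

pn_df[["Part Number"]] <- sapply(pn_df[["Part Number"]], pn_formatting)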

Is there a way to find out which of the functions used below is the
bottleneck: as.numeric, is.na, sub, paste, nchar, toupper? Is there a
profiler for R and, if so, where can I find documentation on how to use it?
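
From the help pages I gather that Rprof() might be what I'm after; is
something along these lines the right way to use it (untested sketch, with
"pn_vec" standing for one of the Part Number columns)?

Rprof("pn_profile.out")                 # start writing profiling samples to a file
res <- sapply(pn_vec, pn_formatting)    # the slow step
Rprof(NULL)                             # stop profiling
summaryRprof("pn_profile.out")          # summarise time spent per function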

The code:

# String contains digits only (can be converted to a number); as.numeric is
# used so that P/Ns longer than 9 digits do not overflow to NA
digits_only <- function(x) { suppressWarnings(!is.na(as.numeric(x))) }

# Remove blanks at both ends of a string
trim <- function(x) {
  sub("\\s+$", "", sub("^\\s+", "", x))
}

# P/N formatting
pn_formatting <- function(pn_in) {

  pn_out <- trim(pn_in)
  if (digits_only(pn_out)) {

    # Zero padding
    pn_out <- paste("000000000000000000", pn_out, sep="")
    pn_len <- nchar(pn_out)
    pn_out <- substr(pn_out, pn_len - 17, pn_len)

  } else {
    # Uppercase
    pn_out <- toupper(pn_out)
  }
  pn_out
}
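
I also wondered whether a vectorized version, processing the whole column at
once instead of calling pn_formatting once per element, would behave better.
A rough, untested sketch of what I mean (reusing trim and digits_only from
above):

pn_formatting_vec <- function(pn_in) {
  pn  <- trim(pn_in)
  num <- digits_only(pn)                                    # TRUE where the P/N is numeric
  pad <- paste("000000000000000000", pn[num], sep = "")     # prepend 18 zeros
  pn[num]  <- substring(pad, nchar(pad) - 17, nchar(pad))   # keep the last 18 characters
  pn[!num] <- toupper(pn[!num])                             # uppercase the rest
  pn
}

Would that be expected to make a difference here?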

Thanks,

Olivier.
