Bill, Jim and Martin,

Great! The code is much faster and even looks more R-ish. I'm very new to R
and have some difficulty getting rid of my procedural programming habits.

Many thanks to all for the great help.

Olivier.

On Wed, Mar 18, 2009 at 9:38 PM, William Dunlap <wdun...@tibco.com> wrote:

> Olivier,
> You can profile R code with Rprof().  E.g.,
>  Rprof(tmp<-tempfile()) # start profiling, saving results in a file
>  ... run your code: sapply(x, pn_formatting) ...
>  Rprof(NULL)   # stop profiling
>  summaryRprof(tmp) # analyze the file and present results in a pair of data.frames
>  $by.self
>                       self.time self.pct total.time total.pct
>  sub                       2.26     24.3       2.26      24.3
>  structure                 0.68      7.3       1.42      15.3
>  FUN                       0.66      7.1       8.82      94.8
>  withCallingHandlers       0.62      6.7       4.14      44.5
>  paste                     0.54      5.8       0.56       6.0
>  toupper                   0.44      4.7       0.46       4.9
>  as.integer                0.42      4.5       3.34      35.9
>  makeRestartList           0.38      4.1       1.36      14.6
>  unlist                    0.34      3.7       0.40       4.3
>  substr                    0.30      3.2       0.36       3.9
>  ...
>  $by.total
>                       total.time total.pct self.time self.pct
>  sapply                     9.30     100.0      0.00      0.0
>  lapply                     8.88      95.5      0.08      0.9
>  FUN                        8.82      94.8      0.66      7.1
>  digits_only                4.18      44.9      0.02      0.2
>  suppressWarnings           4.16      44.7      0.02      0.2
>  withCallingHandlers        4.14      44.5      0.62      6.7
>  as.integer                 3.34      35.9      0.42      4.5
>  withRestarts               2.92      31.4      0.10      1.1
>  .signalSimpleWarning       2.92      31.4      0.00      0.0
>  trim                       2.34      25.2      0.08      0.9
>  sub                        2.26      24.3      2.26     24.3
>
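A self-contained way to try the Rprof() workflow described above is sketched below. The function slow_format and the test data are hypothetical stand-ins, not the code from this thread, and the timings and ranking of functions will differ by machine and R version.

```r
# Hypothetical scalar formatter, called once per element like pn_formatting.
slow_format <- function(x) {
  if (suppressWarnings(!is.na(as.integer(x)))) {
    sprintf("%018d", as.integer(x))   # zero-pad to 18 characters
  } else {
    toupper(x)
  }
}

# Made-up test data: mostly numeric strings plus some text entries.
set.seed(1)
x <- as.character(sample(50000))
x[sample(50000, 10000)] <- "asdfasdf"

prof_file <- tempfile()
Rprof(prof_file, interval = 0.01)     # start sampling the call stack
res <- sapply(x, slow_format, USE.NAMES = FALSE)
Rprof(NULL)                           # stop profiling

prof <- summaryRprof(prof_file)
head(prof$by.self)    # time spent inside each function itself
head(prof$by.total)   # time including everything each function calls
```

Reading by.total from the top down shows the call chain; by.self shows where the time is actually spent.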
> You didn't mention that you used sapply() on your pn_formatting
> function, which I think you must have since you use a non-vectorized
> if statement in it.  If you vectorize the numeric/non-numeric choice,
> as in the following code, you get a huge speedup because you don't
> have to use sapply:
>
> pn_formatting1 <-
> function(pn_in) {
>  pn_out <- trim(pn_in)
>  numeric <- digits_only(pn_out)
>  pn_out[!numeric] <- toupper(pn_out[!numeric])
>  pn_out[numeric] <- {
>    # Zero padding
>    tmp <- paste("000000000000000000", pn_out[numeric], sep="")
>    pn_len <- nchar(tmp)
>    substr(tmp, pn_len - 17, pn_len)
>  }
>  pn_out
> }
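The vectorized pattern above can be tried in isolation with the sketch below. It is not the code from this thread: pn_format_vec is a hypothetical name, trimws() (R >= 3.2.0) stands in for trim(), strrep() needs R >= 3.3.0, and the digits test uses grepl() instead of as.integer(), which is an assumption about what "numeric" should mean (digits only, any length).

```r
pn_format_vec <- function(pn_in) {
  pn_out <- trimws(pn_in)
  is_num <- grepl("^[0-9]+$", pn_out)           # TRUE where only digits remain

  # Non-numeric entries: uppercase, all at once
  pn_out[!is_num] <- toupper(pn_out[!is_num])

  # Numeric entries: prepend 18 zeros, then keep the last 18 characters
  padded <- paste0(strrep("0", 18), pn_out[is_num])
  n <- nchar(padded)
  pn_out[is_num] <- substr(padded, n - 17, n)

  pn_out
}

out <- pn_format_vec(c(" 123 ", "ab-12", "000045"))
out
```

Because every step works on whole vectors, no sapply() loop and no per-element warning handling is needed.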
>
> Jim's code is a bit cleaner (to my taste) but runs at the same
> speed as yours after this simple modification.  His contains an
> error, in that it uses integer subscripts and does not check that
> there is at least one numeric entry in the input (in that case
> pn_in[-which.num] selects nothing instead of all of pn_in, and
> sprintf() dies because one of its arguments is 0-long).
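The subscript pitfall can be reproduced with a small made-up vector, as in the sketch below. The input and variable names are illustrative only; the behaviour shown is that of negative subscripts built from an empty match result, compared with a logical index.

```r
x <- c("abc", "def")                  # made-up input without numeric entries

which_num <- grep("^[0-9]+$", x)      # integer(0): nothing matched
dropped <- x[-which_num]              # -integer(0) is still integer(0)
length(dropped)                       # number of elements selected this way

is_num <- grepl("^[0-9]+$", x)        # logical vector, same length as x
kept <- x[!is_num]                    # selects every non-numeric entry
x[!is_num] <- toupper(x[!is_num])
x
```

A logical index stays correct whether zero, some, or all entries match, which is why the vectorized version does not need a special case.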
>
> Bill Dunlap
> TIBCO Software Inc - Spotfire Division
> wdunlap tibco.com
> -----------------------------------------------------------------------------------
> [R] Profiling question: string formatting extremely slow
>
> jim holtman jholtman at gmail.com
> Wed Mar 18 18:09:37 CET 2009
> Try this way.  Took less than 1 second for 50,000
>
> > system.time({
> +     x <- sample(50000)  # test data
> +     x[sample(50000,10000)] <- 'asdfasdf'  # character strings
> +     which.num <- grep("^[ 0-9]+$", x)  # find numbers
> +     # convert to leading 0
> +     x[which.num] <- sprintf("%018.0f", as.numeric(x[which.num]))
> +     x[-which.num] <- toupper(x[-which.num])
> + })
>   user  system elapsed
>   0.25    0.00    0.25
> >
> >
> >
> > head(x,30)
>  [1] "000000000000026550" "000000000000019100" "000000000000045961" "000000000000031473" "000000000000005031" "000000000000012266"
>  [7] "000000000000034418" "000000000000042279" "000000000000041193" "ASDFASDF"           "000000000000005760" "000000000000035659"
> [13] "ASDFASDF"           "000000000000008420" "000000000000042220" "ASDFASDF"           "000000000000039903" "000000000000032234"
> [19] "000000000000024125" "000000000000032970" "000000000000006814" "000000000000000215" "ASDFASDF"           "000000000000045239"
> [25] "ASDFASDF"           "ASDFASDF"           "000000000000043065" "ASDFASDF"           "000000000000007642" "000000000000019196"
> >
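One caveat with the sprintf()/as.numeric() route, given that part numbers can be up to 18 characters long: a double cannot represent every 18-digit integer exactly, so very long numeric part numbers may come back with altered trailing digits. The sketch below uses a made-up part number and a hypothetical helper pad18 that pads as text only; strrep() needs R >= 3.3.0.

```r
pn <- "123456789012345678"            # made-up 18-digit part number

# Round trip through a double, as sprintf("%018.0f", as.numeric(...)) does
via_double <- sprintf("%018.0f", as.numeric(pn))
identical(via_double, pn)             # compare with the original digits

# String-only padding: never converts the digits to a number
pad18 <- function(s) paste0(strrep("0", pmax(0, 18 - nchar(s))), s)
out <- pad18(c("42", pn))
out
```

For short part numbers both approaches agree; the difference only matters once the digit count exceeds what a double holds exactly (roughly 15 to 16 digits).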
>
>
> On Wed, Mar 18, 2009 at 12:16 PM, Olivier Boudry
> <olivier.boudry at gmail.com> wrote:
> > Hi all,
> >
> > I'm using R to find duplicates in a set of 6 files containing Part Number
> > information. Before applying the intersect method to identify the
> > duplicates, I need to normalize the P/Ns: converting the P/N to uppercase
> > if alphanumeric, and applying an 18-character zero padding if numeric.
> >
> > When I apply the pn_formatting function (see code below) to the "Part
> > Number" column of the data.frame (character vectors up to 18 characters
> > long), it consumes a lot of memory: my computer (Windows XP SP3) starts to
> > swap, CPU usage goes to zero, and the run takes hours to complete. Part
> > Number columns can have from 7'000 to 80'000 records, and I've never had
> > enough patience to wait for the completion of more than 17'000 records.
> >
> > Is there a way to find out which of the functions used below is the
> > bottleneck: as.integer, is.na, sub, paste, nchar, toupper? Is there a
> > profiler for R and, if so, where could I find some documentation on how
> > to use it?
> >
> > The code:
> >
> > # String contains digits only (can be converted to an integer)
> > digits_only <- function(x) { suppressWarnings(!is.na(as.integer(x))) }
> >
> > # Remove blanks at both ends of a string
> > trim <- function (x) {
> >  sub("^\\s+((.*\\S)\\s+)?$", "\\2", x)
> > }
> >
> > # P/N formatting
> > pn_formatting <- function(pn_in) {
> >
> >  pn_out = trim(pn_in)
> >  if (digits_only(pn_out)) {
> >
> >    # Zero padding
> >    pn_out <- paste("000000000000000000", pn_out, sep="")
> >    pn_len <- nchar(pn_out)
> >    pn_out <- substr(pn_out, pn_len - 17, pn_len)
> >
> >  } else {
> >    # Uppercase
> >    pn_out <- toupper(pn_out)
> >  }
> >  pn_out
> > }
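A side note on trim() as posted: its pattern appears to strip whitespace only when it is present at both ends of the string, so entries with only leading or only trailing blanks may pass through unchanged. The sketch below compares it with a commonly used two-sided pattern; trim_both is a hypothetical name and the inputs are made up. On R >= 3.2.0, trimws() is another option.

```r
trim_orig <- function(x) sub("^\\s+((.*\\S)\\s+)?$", "\\2", x)   # as posted
trim_both <- function(x) gsub("^\\s+|\\s+$", "", x)              # assumed alternative

pn <- c("  ab12  ", "  ab12", "ab12  ", "ab12")   # made-up inputs

# Side-by-side comparison; brackets make leftover blanks visible
data.frame(
  input       = paste0("[", pn, "]"),
  original    = paste0("[", trim_orig(pn), "]"),
  alternative = paste0("[", trim_both(pn), "]")
)
```

Since duplicates are found by exact string comparison, any leftover blank would make two otherwise equal part numbers look different.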
> >
> > Thanks,
> >
> > Olivier.
> >
> >        [[alternative HTML version deleted]]
> >
> > ______________________________________________
> > R-help at r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
> >
>
>
>
> --
> Jim Holtman
> Cincinnati, OH
> +1 513 646 9390
>
> What is the problem that you are trying to solve?
>
