[Rd] strcapture performance when perl = TRUE

Tim Taylor Mon, 29 Jan 2024 06:05:50 -0800

I wanted to raise the possibility of improving strcapture performance in
cases where perl = TRUE. I believe we can do this in a non-breaking way
by calling regexpr instead of regexec (conditionally when perl = TRUE).
To illustrate this I've put together a 'proof of concept' function called
strcapture2 that utilises output from regexpr directly (following a very
nice substring approach that I've seen implemented by Toby Hocking
in the nc package - nc::capture_first_vec).


strcapture2 <- function(pattern, x, proto, perl = FALSE, useBytes = FALSE) {
    if (isTRUE(perl)) {
        m <- regexpr(pattern = pattern, text = x, perl = TRUE, useBytes = 
useBytes)
        nomatch <- is.na(m) | m == -1L
        ntokens <- length(proto)
        if (any(!nomatch)) {
            length <- attr(m, "match.length")
            start <- attr(m, "capture.start")
            length <- attr(m, "capture.length")
            end <- start + length - 1L
            end[nomatch, ] <- start[nomatch, ] <- NA
            res <- substring(x, start, end)
            out <- matrix(res, length(m))
            if (ncol(out) != ntokens) {
                stop("The number of captures in 'pattern' != 'length(proto)'")
            }
        } else {
            out <- matrix(NA_character_, length(m), ntokens)
        }
        utils:::conformToProto(out,proto)
    } else {
        strcapture(pattern,x,proto,perl,useBytes)
    }
}

Now comparing with strcapture we can expand the named capture example
from the grep documentation:

notables <- c(
    "  Ben Franklin and Jefferson Davis",
    "\tMillard Fillmore",
    "Bob",
    NA_character_
)

regex <- "(?<first>[[:upper:]][[:lower:]]+) (?<last>[[:upper:]][[:lower:]]+)"
proto = data.frame("", "")

(strcapture(regex, notables, proto, perl = TRUE))
      X..    X...1
1     Ben Franklin
2 Millard Fillmore
3    <NA>     <NA>
4    <NA>     <NA>

(strcapture2(regex, notables, proto, perl = TRUE))
      X..    X...1
1     Ben Franklin
2 Millard Fillmore
3    <NA>     <NA>
4    <NA>     <NA>

Now to compare timings over multiple reps:

lengths <- sort(outer(c(1, 2, 5), 10^(1:4)))
reps <- 20 

time_strcapture <- function(text, length, regex, proto, reps) {
    text <- rep_len(text, length)
    str <- system.time(for (i in seq_len(reps)) strcapture(regex, text, proto, 
perl = TRUE))
    str2 <- system.time(for (i in seq_len(reps)) strcapture2(regex, text, 
proto, perl = TRUE))
    c(strcapture = str[["user.self"]], strcapture2 = str2[["user.self"]])
}
timings <- sapply(
    lengths,
    time_strcapture,
    text = notables, regex = regex, reps = reps, proto = proto
)
cbind(lengths, t(timings))
      lengths strcapture strcapture2
 [1,]      10      0.005       0.003
 [2,]      20      0.005       0.002
 [3,]      50      0.008       0.003
 [4,]     100      0.012       0.002
 [5,]     200      0.021       0.003
 [6,]     500      0.051       0.003
 [7,]    1000      0.097       0.004
 [8,]    2000      0.171       0.005
 [9,]    5000      0.517       0.011
[10,]   10000      1.203       0.018
[11,]   20000      2.563       0.037
[12,]   50000      7.276       0.090

I've attached a plot of these timings in case helpful.

I appreciate that changing strcapture in this way does make it more
complicated but I think the performance improvements make it worth
considering. Note that I've not thoroughly tested the above implementation
as wanted to get feedback from the list before proceeding further.

Hope all this make sense. Cheers

Tim

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

[Rd] strcapture performance when perl = TRUE

Reply via email to