Thanks Tim. I confirm the proposed solution is over 10x faster, see https://github.com/tdhock/atime/issues/29#issuecomment-1943037753 for figure and source code.
On Mon, Jan 29, 2024 at 7:05 AM Tim Taylor <tim.tay...@hiddenelephants.co.uk> wrote: > > I wanted to raise the possibility of improving strcapture performance in > cases where perl = TRUE. I believe we can do this in a non-breaking way > by calling regexpr instead of regexec (conditionally when perl = TRUE). > To illustrate this I've put together a 'proof of concept' function called > strcapture2 that utilises output from regexpr directly (following a very > nice substring approach that I've seen implemented by Toby Hocking > in the nc package - nc::capture_first_vec). > > strcapture2 <- function(pattern, x, proto, perl = FALSE, useBytes = FALSE) { > if (isTRUE(perl)) { > m <- regexpr(pattern = pattern, text = x, perl = TRUE, useBytes = > useBytes) > nomatch <- is.na(m) | m == -1L > ntokens <- length(proto) > if (any(!nomatch)) { > length <- attr(m, "match.length") > start <- attr(m, "capture.start") > length <- attr(m, "capture.length") > end <- start + length - 1L > end[nomatch, ] <- start[nomatch, ] <- NA > res <- substring(x, start, end) > out <- matrix(res, length(m)) > if (ncol(out) != ntokens) { > stop("The number of captures in 'pattern' != 'length(proto)'") > } > } else { > out <- matrix(NA_character_, length(m), ntokens) > } > utils:::conformToProto(out,proto) > } else { > strcapture(pattern,x,proto,perl,useBytes) > } > } > > Now comparing with strcapture we can expand the named capture example > from the grep documentation: > > notables <- c( > " Ben Franklin and Jefferson Davis", > "\tMillard Fillmore", > "Bob", > NA_character_ > ) > > regex <- "(?<first>[[:upper:]][[:lower:]]+) (?<last>[[:upper:]][[:lower:]]+)" > proto = data.frame("", "") > > (strcapture(regex, notables, proto, perl = TRUE)) > X.. X...1 > 1 Ben Franklin > 2 Millard Fillmore > 3 <NA> <NA> > 4 <NA> <NA> > > (strcapture2(regex, notables, proto, perl = TRUE)) > X.. X...1 > 1 Ben Franklin > 2 Millard Fillmore > 3 <NA> <NA> > 4 <NA> <NA> > > Now to compare timings over multiple reps: > > lengths <- sort(outer(c(1, 2, 5), 10^(1:4))) > reps <- 20 > > time_strcapture <- function(text, length, regex, proto, reps) { > text <- rep_len(text, length) > str <- system.time(for (i in seq_len(reps)) strcapture(regex, text, > proto, perl = TRUE)) > str2 <- system.time(for (i in seq_len(reps)) strcapture2(regex, text, > proto, perl = TRUE)) > c(strcapture = str[["user.self"]], strcapture2 = str2[["user.self"]]) > } > timings <- sapply( > lengths, > time_strcapture, > text = notables, regex = regex, reps = reps, proto = proto > ) > cbind(lengths, t(timings)) > lengths strcapture strcapture2 > [1,] 10 0.005 0.003 > [2,] 20 0.005 0.002 > [3,] 50 0.008 0.003 > [4,] 100 0.012 0.002 > [5,] 200 0.021 0.003 > [6,] 500 0.051 0.003 > [7,] 1000 0.097 0.004 > [8,] 2000 0.171 0.005 > [9,] 5000 0.517 0.011 > [10,] 10000 1.203 0.018 > [11,] 20000 2.563 0.037 > [12,] 50000 7.276 0.090 > > I've attached a plot of these timings in case helpful. > > I appreciate that changing strcapture in this way does make it more > complicated but I think the performance improvements make it worth > considering. Note that I've not thoroughly tested the above implementation > as wanted to get feedback from the list before proceeding further. > > Hope all this make sense. Cheers > > Tim > > ______________________________________________ > R-devel@r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-devel ______________________________________________ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel