I wanted to raise the possibility of improving strcapture performance in cases where perl = TRUE. I believe we can do this in a non-breaking way by calling regexpr instead of regexec (conditionally when perl = TRUE). To illustrate this I've put together a 'proof of concept' function called strcapture2 that utilises output from regexpr directly (following a very nice substring approach that I've seen implemented by Toby Hocking in the nc package - nc::capture_first_vec).
strcapture2 <- function(pattern, x, proto, perl = FALSE, useBytes = FALSE) { if (isTRUE(perl)) { m <- regexpr(pattern = pattern, text = x, perl = TRUE, useBytes = useBytes) nomatch <- is.na(m) | m == -1L ntokens <- length(proto) if (any(!nomatch)) { length <- attr(m, "match.length") start <- attr(m, "capture.start") length <- attr(m, "capture.length") end <- start + length - 1L end[nomatch, ] <- start[nomatch, ] <- NA res <- substring(x, start, end) out <- matrix(res, length(m)) if (ncol(out) != ntokens) { stop("The number of captures in 'pattern' != 'length(proto)'") } } else { out <- matrix(NA_character_, length(m), ntokens) } utils:::conformToProto(out,proto) } else { strcapture(pattern,x,proto,perl,useBytes) } } Now comparing with strcapture we can expand the named capture example from the grep documentation: notables <- c( " Ben Franklin and Jefferson Davis", "\tMillard Fillmore", "Bob", NA_character_ ) regex <- "(?<first>[[:upper:]][[:lower:]]+) (?<last>[[:upper:]][[:lower:]]+)" proto = data.frame("", "") (strcapture(regex, notables, proto, perl = TRUE)) X.. X...1 1 Ben Franklin 2 Millard Fillmore 3 <NA> <NA> 4 <NA> <NA> (strcapture2(regex, notables, proto, perl = TRUE)) X.. X...1 1 Ben Franklin 2 Millard Fillmore 3 <NA> <NA> 4 <NA> <NA> Now to compare timings over multiple reps: lengths <- sort(outer(c(1, 2, 5), 10^(1:4))) reps <- 20 time_strcapture <- function(text, length, regex, proto, reps) { text <- rep_len(text, length) str <- system.time(for (i in seq_len(reps)) strcapture(regex, text, proto, perl = TRUE)) str2 <- system.time(for (i in seq_len(reps)) strcapture2(regex, text, proto, perl = TRUE)) c(strcapture = str[["user.self"]], strcapture2 = str2[["user.self"]]) } timings <- sapply( lengths, time_strcapture, text = notables, regex = regex, reps = reps, proto = proto ) cbind(lengths, t(timings)) lengths strcapture strcapture2 [1,] 10 0.005 0.003 [2,] 20 0.005 0.002 [3,] 50 0.008 0.003 [4,] 100 0.012 0.002 [5,] 200 0.021 0.003 [6,] 500 0.051 0.003 [7,] 1000 0.097 0.004 [8,] 2000 0.171 0.005 [9,] 5000 0.517 0.011 [10,] 10000 1.203 0.018 [11,] 20000 2.563 0.037 [12,] 50000 7.276 0.090 I've attached a plot of these timings in case helpful. I appreciate that changing strcapture in this way does make it more complicated but I think the performance improvements make it worth considering. Note that I've not thoroughly tested the above implementation as wanted to get feedback from the list before proceeding further. Hope all this make sense. Cheers Tim
______________________________________________ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel