Hi all, (and especially hi to Tomas Kalibera who accepted my patch sent yesterday)
I believe that I have found another bug, this time in the substring function. The use case that I am concerned with is when there is a single (character scalar) text/subject, and many substrings to extract. For example substring("AAAA", 1:4, 1:4) or more generally, N=1000 substring(paste(rep("A", N), collapse=""), 1:N, 1:N) The problem I observe is that the time complexity is quadratic in N, as shown on this figure https://github.com/tdhock/namedCapture-article/blob/master/figure-substring-bug.png source: https://github.com/tdhock/namedCapture-article/blob/master/figure-substring-bug.R I expected the time complexity to be linear in N. The example above may seem contrived/trivial, but it is indeed relevant to a number of packages (rex, rematch2, namedCapture) which provide functions that use gregexpr and then substring to extract the text in the captured sub-patterns. The figure https://github.com/tdhock/namedCapture-article/blob/master/figure-trackDb-pkgs.png shows the issue: these packages have quadratic time complexity, whereas other packages (and the gregexpr function with perl=TRUE after applying the patch discussed yesterday) have linear time complexity. I believe the problem is the substring function. Source for this figure: https://github.com/tdhock/namedCapture-article/blob/master/figure-trackDb-pkgs.R I suspect that a fix can be accomplished by optimizing the implementation of substring, for the special case when the text/subject is a single element (character scalar). Right now I notice that the substring R code uses rep_len so that the text/subject which is passed to the C code is a character vector with the same length as the number of substrings to extract. Maybe the C code is calling strlen for each of these (identical) text/subject elements? Anyway, it would be useful to have some feedback to make sure this is indeed a bug before I post on bugzilla. (btw thanks Martin for signing me up for an account) Toby [[alternative HTML version deleted]] ______________________________________________ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel