Update: I have observed that stringi::stri_sub is linear time complexity, and it computes the same thing as base::substring. figure https://github.com/tdhock/namedCapture-article/blob/master/figure-substring-bug.png source: https://github.com/tdhock/namedCapture-article/blob/master/figure-substring-bug.R
To me this is a clear indication of a bug in substring, but again it would be nice to have some feedback/confirmation before posting on bugzilla. Also this suggests a fix -- just need to copy whatever stringi::stri_sub is doing. On Wed, Feb 20, 2019 at 11:16 AM Toby Hocking <tdho...@gmail.com> wrote: > Hi all, (and especially hi to Tomas Kalibera who accepted my patch sent > yesterday) > > I believe that I have found another bug, this time in the substring > function. The use case that I am concerned with is when there is a single > (character scalar) text/subject, and many substrings to extract. For example > > substring("AAAA", 1:4, 1:4) > > or more generally, > > N=1000 > substring(paste(rep("A", N), collapse=""), 1:N, 1:N) > > The problem I observe is that the time complexity is quadratic in N, as > shown on this figure > https://github.com/tdhock/namedCapture-article/blob/master/figure-substring-bug.png > source: > https://github.com/tdhock/namedCapture-article/blob/master/figure-substring-bug.R > > I expected the time complexity to be linear in N. > > The example above may seem contrived/trivial, but it is indeed relevant to > a number of packages (rex, rematch2, namedCapture) which provide functions > that use gregexpr and then substring to extract the text in the captured > sub-patterns. The figure > https://github.com/tdhock/namedCapture-article/blob/master/figure-trackDb-pkgs.png > shows the issue: these packages have quadratic time complexity, whereas > other packages (and the gregexpr function with perl=TRUE after applying the > patch discussed yesterday) have linear time complexity. I believe the > problem is the substring function. Source for this figure: > https://github.com/tdhock/namedCapture-article/blob/master/figure-trackDb-pkgs.R > > I suspect that a fix can be accomplished by optimizing the implementation > of substring, for the special case when the text/subject is a single > element (character scalar). Right now I notice that the substring R code > uses rep_len so that the text/subject which is passed to the C code is a > character vector with the same length as the number of substrings to > extract. Maybe the C code is calling strlen for each of these (identical) > text/subject elements? > > Anyway, it would be useful to have some feedback to make sure this is > indeed a bug before I post on bugzilla. (btw thanks Martin for signing me > up for an account) > > Toby > [[alternative HTML version deleted]] ______________________________________________ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel