Hi all, Several people have noticed that gregexpr is very slow for large subject strings when perl=TRUE is specified. - https://stackoverflow.com/questions/31216299/r-faster-gregexpr-for-very-large-strings - http://r.789695.n4.nabble.com/strsplit-perl-TRUE-gregexpr-perl-TRUE-very-slow-for-long-strings-td4727902.html - https://stat.ethz.ch/pipermail/r-help/2008-October/178451.html
I figured out the issue, which is fixed by changing 1 line of code in src/main/grep.c -- there is a strlen function call which is currently inside of the while loop over matches, and the patch moves it before the loop. https://github.com/tdhock/namedCapture-article/blob/master/linear-time-gregexpr-perl.patch I made some figures that show the quadratic time complexity before applying the patch, and the linear time complexity after applying the patch https://github.com/tdhock/namedCapture-article#19-feb-2019 I would have posted a bug report on bugs.r-project.org but I do not have an account. So can an R-devel person please either (1) accept this patch, or (2) give me an account so I can post the patch on the bug tracker? Finally I would like to mention that Bill Dunlap noticed a similar problem (time complexity which is quadratic in subject size) for strsplit with perl=TRUE. My patch does NOT fix that, but I suspect that a similar fix could be accomplished (because I see that strlen is being called in a while loop in do_strsplit as well). Thanks Toby Dylan Hocking [[alternative HTML version deleted]] ______________________________________________ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel