Thanks! I think your suggested additions to the docs are perfect.
Duncan Murdoch

On 2024-08-09 5:01 a.m., Tomas Kalibera wrote:
> On 8/1/24 20:55, Duncan Murdoch wrote:
>> Thanks Tomas. Do note that my original post also mentioned a bug or doc error in the PCRE docs for this regexp:
>>
>> - perl = TRUE does *not* give the documented result on at least one system (which is "123456789", because "{,5}" is documented to not be a quantifier, so it should only match the literal string "{,5}").
>
> This is a change in documented behavior in PCRE. PCRE2 10.43 (share/man/man3/pcre2pattern.3) says:
>
> "If the first number is omitted, the lower limit is taken as zero; in this case the upper limit must be present. X{,4} is interpreted as X{0,4}. In earlier versions such a sequence was not interpreted as a quantifier. Other regular expression engines may behave either way."
>
> And the changelog:
>
> "29. Perl 5.34.0 changed the meaning of (for example) {,3} which did not used to be treated as a quantifier. Now it is interpreted as {0,3} and PCRE2 has changed to match. Note that {,} is still not a quantifier."
>
> Sadly the previous behavior was also documented in pcre2pattern.3:
>
> "For example, {,6} is not a quantifier, but a literal string of four characters"
>
> I've confirmed with R built with PCRE2 10.42, 10.43 and 10.44. In practice, users would most likely see the new behavior on Windows, where Rtools44 has PCRE2 10.43.
>
> The R documentation (?regex) refers to the PCRE2 documentation for "complete details", mentioning how to find out which version of PCRE(2) is in use. I've now added a warning that PCRE behavior may change between versions, with {,m} as an example. I don't think we can do much more - I don't think we should be replicating the PCRE documentation/changelog - but we could add more examples, if any important ones appear.
>
> Also, we don't want to write R programs that depend on concrete versions of PCRE. It is a good thing that ?regex doesn't document "{,m}", because it cannot be used reliably/portably. One should use one of the documented forms instead, i.e. "{0,m}".
>
> Indeed there is the problem of how to use only the documented subset of behavior (in ?regex), because one also needs to avoid accidentally running into undocumented expressions with special meaning, as in this case. Still, authors could perhaps try to defensively avoid risky expressions in literals in patterns, such as those involving "{}" or otherwise similar to documented expressions with a special meaning.
>
> Best
> Tomas
>
>> Duncan
>>
>> On 2024-08-01 6:49 a.m., Tomas Kalibera wrote:
>>> On 7/29/24 09:37, Ivan Krylov via R-devel wrote:
>>>> On Sun, 28 Jul 2024 20:02:21 -0400, Duncan Murdoch <murdoch.dun...@gmail.com> wrote:
>>>>
>>>>> gsub("^([0-9]{,5}).*","\\1","123456789")
>>>>> [1] "123456"
>>>>
>>>> This is in TRE itself: for "^([0-9]{,1})" tre_regexecb returns {.rm_so = 0, .rm_eo = 1}, matching "1", but for "^([0-9]{,2})" and above it returns an off-by-one result, {.rm_so = 0, .rm_eo = 3}.
>>>>
>>>> Compiling with TRE_DEBUG, I see it parsed correctly:
>>>>
>>>> catenation, sub 0, 0 tags
>>>>   assertions: bol
>>>>   iteration {-1, 2}, sub -1, 0 tags, greedy
>>>>     literal (0, 9) (48, 57), pos 0, sub -1, 0 tags
>>>>
>>>> ...but after tre_expand_ast I see
>>>>
>>>> catenation, sub 0, 1 tags
>>>>   assertions: bol
>>>>   catenation, sub -1, 1 tags
>>>>     tag 0
>>>>     union, sub -1, 0 tags
>>>>       literal empty
>>>>       catenation, sub -1, 0 tags
>>>>         literal (0, 9) (48, 57), pos 2, sub -1, 0 tags
>>>>         union, sub -1, 0 tags
>>>>           literal empty
>>>>           catenation, sub -1, 0 tags
>>>>             literal (0, 9) (48, 57), pos 1, sub -1, 0 tags
>>>>             union, sub -1, 0 tags
>>>>               literal empty
>>>>               literal (0, 9) (48, 57), pos 0, sub -1, 0 tags
>>>>
>>>> ...which has one too many copies of "literal (0,9)". I think it's due to the expansion loop on line 942 of src/extra/tre/tre-compile.c being
>>>>
>>>>     for (j = iter->min; j < iter->max; j++)
>>>>
>>>> ...where 'min' is -1 to denote no minimum. This is further confirmed by "{0,3}", "{1,3}", "{2,3}", "{3,3}" all working correctly.
>>>>
>>>> Neither TRE documentation [1] nor POSIX [2] specifies the {,n} syntax: from my reading, it looks like if the upper boundary is specified, the lower boundary must be specified too.
>>>> But if we do want to fix this, it will have to be a special case for iter->min == -1.
>>>
>>> Thanks. It seems that TRE is now maintained again upstream, so it would be best to discuss this with TRE maintainers directly (if not already solved by https://github.com/laurikari/tre/pull/98). The same applies to any other open TRE issues.
>>>
>>> Best
>>> Tomas

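For readers following the TRE part of the thread, the off-by-one can be illustrated with a small standalone C sketch. This is not TRE's code: only the loop header is taken from the message quoted above, while `struct iter_bounds`, `copies_as_written` and `copies_with_min_special_cased` are hypothetical names introduced here. Clamping -1 to 0 is just one assumed way to special-case the missing minimum, not necessarily what upstream does. The sketch only counts loop iterations and does not reproduce the rest of the expansion logic.

```c
/* Hypothetical stand-in for the bounds of an iteration node.
   As described in the thread, min == -1 denotes "no minimum",
   which is how "{,2}" is parsed: iteration {-1, 2}. */
struct iter_bounds {
    int min;
    int max;
};

/* Counts how often a loop with the header quoted in the thread,
       for (j = iter->min; j < iter->max; j++)
   would run, i.e. how many optional copies it would emit. */
int copies_as_written(const struct iter_bounds *iter)
{
    int copies = 0;
    for (int j = iter->min; j < iter->max; j++)
        copies++;
    return copies;
}

/* Same loop, but with min == -1 treated as 0 before iterating.
   This is an assumed special case for illustration only. */
int copies_with_min_special_cased(const struct iter_bounds *iter)
{
    int start = (iter->min == -1) ? 0 : iter->min;
    int copies = 0;
    for (int j = start; j < iter->max; j++)
        copies++;
    return copies;
}
```

With min = -1 and max = 2, as in the "iteration {-1, 2}" debug output, the loop as written runs three times, which matches the three copies of "literal (0, 9)" seen after tre_expand_ast. With the clamp it runs twice, the same count the sketch gives for "{0,2}".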
______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel