Thanks, Tomas. Note that my original post also mentioned either a bug in PCRE or an error in its documentation for this regexp:

- perl = TRUE does *not* give the documented result on at least one system. The documented result is "123456789": "{,5}" is documented not to be a quantifier, so it should match only the literal string "{,5}".

Duncan

On 2024-08-01 6:49 a.m., Tomas Kalibera wrote:

On 7/29/24 09:37, Ivan Krylov via R-devel wrote:
On Sun, 28 Jul 2024 20:02:21 -0400,
Duncan Murdoch <murdoch.dun...@gmail.com> wrote:

gsub("^([0-9]{,5}).*","\\1","123456789")
[1] "123456"
This is in TRE itself: for "^([0-9]{,1})" tre_regexecb returns {.rm_so
= 0, .rm_eo = 1}, matching "1", but for "^([0-9]{,2})" and above it
returns an off-by-one result, {.rm_so = 0, .rm_eo = 3}.

Compiling with TRE_DEBUG, I see it parsed correctly:

catenation, sub 0, 0 tags
    assertions: bol
    iteration {-1, 2}, sub -1, 0 tags, greedy
      literal (0, 9) (48, 57), pos 0, sub -1, 0 tags

...but after tre_expand_ast I see

catenation, sub 0, 1 tags
    assertions: bol
    catenation, sub -1, 1 tags
      tag 0
      union, sub -1, 0 tags
        literal empty
        catenation, sub -1, 0 tags
          literal (0, 9) (48, 57), pos 2, sub -1, 0 tags
          union, sub -1, 0 tags
            literal empty
            catenation, sub -1, 0 tags
              literal (0, 9) (48, 57), pos 1, sub -1, 0 tags
              union, sub -1, 0 tags
                literal empty
                literal (0, 9) (48, 57), pos 0, sub -1, 0 tags

...which has one too many copies of "literal (0,9)". I think it's due
to the expansion loop on line 942 of src/extra/tre/tre-compile.c being

for (j = iter->min; j < iter->max; j++)

...where 'min' is -1 to denote no minimum. This is further confirmed by
"{0,3}", "{1,3}", "{2,3}", "{3,3}" all working correctly.

Neither the TRE documentation [1] nor POSIX [2] specifies the {,n} syntax:
from my reading, it looks like if the upper boundary is specified, the
lower boundary must be specified too. But if we do want to fix this, it
will have to be a special case for iter->min == -1.
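
To make the counting argument concrete, here is a rough sketch in C. It is a simplified model of the expansion, not the actual TRE code: the function names are hypothetical, and it assumes that the mandatory copies come from a separate loop running from 0 up to 'min', and that 'max' is finite. It only counts how many copies of the sub-expression end up in the expanded AST.

```c
#include <assert.h>

/* Hypothetical model of the expansion as described above.
 * min == -1 stands for "no minimum given", as in "{,n}". */
static int max_copies_naive(int min, int max)
{
    int copies = 0;
    int j;

    /* Assumed: mandatory copies; none are produced when min is -1. */
    for (j = 0; j < min; j++)
        copies++;

    /* Optional copies; starting at -1 yields one iteration too many. */
    for (j = min; j < max; j++)
        copies++;

    return copies;
}

/* The same model with the special case for min == -1:
 * a missing minimum is treated as 0 before counting. */
static int max_copies_special_cased(int min, int max)
{
    int lo = (min < 0) ? 0 : min;
    int copies = 0;
    int j;

    for (j = 0; j < lo; j++)
        copies++;
    for (j = lo; j < max; j++)
        copies++;

    return copies;
}
```

In this model, max_copies_naive(-1, 2) is 3 and max_copies_naive(-1, 5) is 6, in line with the off-by-one results reported above, while "{0,3}" through "{3,3}" all come out as 3. With the special case, "{,2}" and "{,5}" give 2 and 5.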

Thanks. It seems that TRE is now maintained again upstream, so it would
be best to discuss this with TRE maintainers directly (if not already
solved by https://github.com/laurikari/tre/pull/98).

The same applies to any other open TRE issues.

Best,
Tomas


______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
