On Sun, 28 Jul 2024 20:02:21 -0400 Duncan Murdoch <murdoch.dun...@gmail.com> wrote:

> gsub("^([0-9]{,5}).*","\\1","123456789")
> [1] "123456"

This is in TRE itself: for "^([0-9]{,1})" tre_regexecb returns
{.rm_so = 0, .rm_eo = 1}, matching "1", but for "^([0-9]{,2})" and
above it returns an off-by-one result, {.rm_so = 0, .rm_eo = 3}.

Compiling with TRE_DEBUG, I see it parsed correctly:

catenation, sub 0, 0 tags
  assertions: bol
  iteration {-1, 2}, sub -1, 0 tags, greedy
    literal (0, 9) (48, 57), pos 0, sub -1, 0 tags

...but after tre_expand_ast I see

catenation, sub 0, 1 tags
  assertions: bol
  catenation, sub -1, 1 tags
    tag 0
    union, sub -1, 0 tags
      literal empty
      catenation, sub -1, 0 tags
        literal (0, 9) (48, 57), pos 2, sub -1, 0 tags
        union, sub -1, 0 tags
          literal empty
          catenation, sub -1, 0 tags
            literal (0, 9) (48, 57), pos 1, sub -1, 0 tags
            union, sub -1, 0 tags
              literal empty
              literal (0, 9) (48, 57), pos 0, sub -1, 0 tags

...which has one too many copies of "literal (0,9)". I think it's due
to the expansion loop on line 942 of src/extra/tre/tre-compile.c being

    for (j = iter->min; j < iter->max; j++)

...where 'min' is -1 to denote no minimum. This is further confirmed by
"{0,3}", "{1,3}", "{2,3}", "{3,3}" all working correctly.

Neither TRE documentation [1] nor POSIX [2] specify the {,n} syntax:
from my reading, it looks like if the upper boundary is specified, the
lower boundary must be specified too. But if we do want to fix this, it
will have to be a special case for iter->min == -1.

-- 
Best regards,
Ivan

[1] https://laurikari.net/tre/documentation/regex-syntax/
[2] https://pubs.opengroup.org/onlinepubs/9799919799/basedefs/V1_chap09.html#tag_09_03_06

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
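
The loop-count argument in the message above can be checked in isolation. What follows is only a minimal sketch, not TRE code: the two functions are hypothetical names invented here, they count iterations of a loop with the quoted shape instead of building AST nodes, and the "special-cased" variant is just one possible reading of "a special case for iter->min == -1", not the fix that was or will be applied. It assumes -1 means "bound not specified", as in the debug output.

```c
/* Hypothetical helper: counts how many optional copies a loop of the
 * quoted shape
 *     for (j = iter->min; j < iter->max; j++)
 * would emit for the given bounds. A min of -1 stands for "no minimum". */
int optional_copies_as_quoted(int min, int max)
{
    int copies = 0;
    for (int j = min; j < max; j++)
        copies++;
    return copies;
}

/* Hypothetical variant (an assumption, not the actual fix): treat a
 * missing minimum as 0 before running the same loop. */
int optional_copies_special_cased(int min, int max)
{
    int start = (min == -1) ? 0 : min;
    int copies = 0;
    for (int j = start; j < max; j++)
        copies++;
    return copies;
}
```

With bounds {-1, 2} the first function counts three copies, which agrees with the three "literal (0, 9)" nodes in the expanded tree, while the special-cased variant counts two. For bounds with an explicit minimum such as {0, 3} or {1, 3} the two functions agree, consistent with those forms working correctly.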