[adding gnulib] On 11/14/2012 02:26 PM, РоманДонченко wrote: > Here's what POSIX says > <http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html>: > > 9.3.6 BREs Matching Multiple Characters > > <...> The string matched by a contained subexpression shall be within the > string matched by the containing subexpression. If the containing > subexpression does not match, or if there is no match for the contained > subexpression within the string matched by the containing subexpression, then > back-reference expressions corresponding to the contained subexpression shall > not match. <...> For example, <...> the expression "\(a\(b\)*\)*\2" fails to > match 'abab' <...> .
Thanks for the report. Yep, I agree you've found a bug in glibc's regex engine. Other useful context from that POSIX page: If the pattern permits a variable number of matching characters and thus there is more than one such sequence starting at that point, the longest such sequence is matched... Consistent with the whole match being the longest of the leftmost matches, each subpattern, from left to right, shall match the longest possible string. For this purpose, a null string shall be considered to be longer than no match at all. For example, matching the BRE "\(.*\).*" against "abcdef" , the subexpression "(\1)" is "abcdef" , and matching the BRE "\(a*\)*" against "bc" , the subexpression "(\1)" is the null string... A subexpression repeated by an <asterisk> ( '*' ) or an interval expression shall not match a null expression unless this is the only match for the repetition or it is necessary to satisfy the exact or minimum number of occurrences for the interval expression. My read of POSIX is that given the string 'abab', if you let '\(a\(b\)*\)*' match 'abab', then \2 would also have to match 'b', which is not present at that point in the string. Trying successively shorter options, if you try to match 'aba', then the second occurrence of the containing subexpression \(a\(b\)*\) has no matches of the contained subexpression \(b\), so \2 must fail to match. If you try to match 'ab', then \2 would still have to match 'b'. If you let \(b\)* match the null string, then \(a\(b\)*\)* matches 'a'; but in this case, there is no match for the contained subexpression '\(b\)' within the containing subexpression, so any attempt to use \2 in that scenario must fail. Finally, if you let '\(a\(b\)*\)*' match the null string, then you again have a case where \(b\) was not matched, so \2 again fails. So POSIX is right that 'abab' cannot match, no matter how you slice the string; and therefore grep has a bug. More precisely, glibc has a bug; you can trigger the same bug via other programs, such as m4: $ printf %s 'regexp(abab, \(a\(b\)*\)*\2, \&-\1-\2)' | m4 abab-a-b which shows that glibc is treating \2 as the successful backreference to 'b' from the FIRST match of a\(b\)*, even though POSIX requires it to refer to the second match where \(b\) did not match. For reference, on Cygwin (where m4 uses gnulib's regex engine, rather than glibc's) I get correct behavior: $ printf %s 'regexp(abab, \(a\(b\)*\)*\2, \&-\1-\2)' | m4 -1 > And yet... > > $ src/grep '\(a\(b\)*\)*\2' <<< 'abab' > abab > > I'm guessing that it assumes that \1 covers character #3, but \2 covers > character #2, which is against the spec. Yep, your analysis is matching mine. We should get glibc to fix their bug; but we should also get gnulib to work around the bug. -- Eric Blake ebl...@redhat.com +1-919-301-3266 Libvirt virtualization library http://libvirt.org
signature.asc
Description: OpenPGP digital signature