Bug#500501: More detailed analysis

Paolo Bonzini Thu, 19 Nov 2009 10:21:34 -0800

 From http://www.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap09.html


"A period ( '.' ), when used outside a bracket expression, is a BRE
that shall match any character in the supported character set except
NUL."


My point here is that the current implementation of regexes makes '.'
NOT match some sequences.  That is very nasty, because a period is
expected to match anything.

$ printf 'aaa\x80bbb' | sed -e 's/^.*$/x/g' | xxd -
0000000: 6161 6180 6262 62                        aaa.bbb

The line contained '\x80' =>  no match.

That's true, that's why GNU sed 4.2 added the `z' command for the most
common use case s/^.*$// (which is remarkably similar to yours above).
I think if you cannot be sure your UTF-8 is valid you should use LANG=C
(or LC_CTYPE=C LC_COLLATE=C).
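
As a rough sketch of the two workarounds just mentioned, using the same
input as the example above: the first command forces the C locale so
that '.' matches every byte, and the second uses the `z' command. The
use of LC_ALL=C (which overrides both LANG and LC_CTYPE), the octal
escape \200 in place of \x80, and the variable names are choices made
for this illustration, not something taken from the bug report.

```shell
# Input is the line from the report; \200 is the octal form of \x80,
# which also works with printf implementations that lack \x escapes.

# Workaround 1: run sed in the C locale, where '.' matches any byte,
# so the invalid UTF-8 byte no longer prevents ^.*$ from matching.
out_c=$(printf 'aaa\200bbb\n' | LC_ALL=C sed -e 's/^.*$/x/')
printf '%s\n' "$out_c"

# Workaround 2 (assumes GNU sed 4.2 or later): 'z' empties the pattern
# space without going through the regex matcher at all.
printf 'aaa\200bbb\n' | sed -e 'z' | od -c
```

The first form changes how the whole script interprets its input, so it
only suits scripts that do not rely on multibyte character handling.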

FWIW, I agree with Ulrich Drepper on what you mentioned in bug #555922.
There's no point in being extra-conservative about UTF-8, especially
when UTF-8 files with unpaired surrogates are found in the wild and
wchar_t is 32-bit. I might agree with adding a stricter UTF-8 mode, but
not with making it the default.


Paolo



