From http://www.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap09.html

"A period ( '.' ), when used outside a bracket expression, is a BRE
that shall match any character in the supported character set except
NUL."

My point here is that the current regex implementation makes '.' NOT
match some byte sequences.  That is very nasty, because a period is
expected to match anything.

$ printf 'aaa\x80bbb' | sed -e 's/^.*$/x/g' | xxd -
0000000: 6161 6180 6262 62                        aaa.bbb

The line contained the byte '\x80' => no match.

That's true, and that's why GNU sed 4.2 added the `z' command for the most common use case, s/^.*$// (which is remarkably similar to yours above). If you cannot be sure your UTF-8 is valid, I think you should use LANG=C (or LC_CTYPE=C LC_COLLATE=C).
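
A minimal sketch of both workarounds, assuming GNU sed (the `z' command is a GNU extension) and a POSIX shell. The variable names are only illustrative. The sketch uses LC_ALL=C rather than LANG=C because LC_ALL overrides any LC_CTYPE or LANG already set in the environment, and it writes the byte as octal \200 because not every printf understands \x80.

```shell
# Input: one line containing a lone 0x80 byte, which is not valid UTF-8.

# Workaround 1: run sed in the C locale, where every byte is a character,
# so '.' matches 0x80 and the whole line is replaced.
c_locale_out=$(printf 'aaa\200bbb\n' | LC_ALL=C sed -e 's/^.*$/x/')

# Workaround 2: use the GNU sed `z' command to empty the pattern space
# regardless of locale, then substitute on the now-empty line.
zapped_out=$(printf 'aaa\200bbb\n' | sed -e 'z' -e 's/^$/x/')

printf '%s\n' "$c_locale_out" "$zapped_out"
```

In both cases the line should come out as a single "x", instead of passing through unchanged as in the UTF-8 locale run quoted above.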

FWIW, I agree with Ulrich Drepper on what you mentioned in bug #555922. There's no point in being extra-conservative about UTF-8, especially when UTF-8 files with unpaired surrogates are found in the wild and wchar_t is 32-bit. I might agree with adding a stricter UTF-8 mode, but not with making it the default.

Paolo


