Vincent Lefevre wrote:
> Glibc regards it as ASCII:
You're right. Sorry, I was confused. FreeBSD, Solaris, and AIX work
the way that I thought, though. Plus, in GNU regular expressions the
pattern "." works the way that I thought with LC_ALL=C; my guess
(without investigating this) is that this is because whoever wrote the
regex code assumed the BSDish behavior. Arguably this is a glitch in
the GNU regex code, in that for consistency "." should not match
encoding errors in unibyte locales.
Here's a pair of test cases to illustrate the glitch:
$ printf '\200\n' | LC_ALL=en_US.utf8 grep '.' | wc
0 0 0
$ printf '\200\n' | LC_ALL=C grep '.' | wc
1 0 2
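
For anyone who wants to poke at the difference without depending on
which locales are installed, here is a rough Python model of the two
results above. Python is only my choice for illustration; grep does not
work this way internally, and both helper names are made up. The first
helper approximates "." in a multibyte locale by asking whether any
validly encoded character survives on the line; the second treats every
byte as a character, which is the BSDish behavior.

```python
import re


def has_valid_char(line: bytes, encoding: str = "utf-8") -> bool:
    """Model of '.' meaning "one validly encoded character".

    Bytes that do not decode are dropped, so a line consisting only of
    encoding errors is treated as having nothing for '.' to match.
    """
    return len(line.decode(encoding, errors="ignore")) > 0


def matches_any_byte(line: bytes) -> bool:
    """Model of '.' in a unibyte setting where every byte is a character."""
    # re.DOTALL is not needed: the line is passed without its newline.
    return re.search(rb".", line) is not None


if __name__ == "__main__":
    line = b"\x80"  # what printf '\200\n' sends, minus the newline
    print(has_valid_char(line, "utf-8"))  # False: no line selected
    print(matches_any_byte(line))         # True: the line is selected
```

This is only a model of the observable results, not of how the regex
code reaches them.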
> I just mean that "grep ." is a method given by some people, that
> was working before UTF-8.
And it still works, if by "." one means "match one character".
Unfortunately there is no POSIX regular expression that does what you're
looking for (match either one character, or a single byte that is an
encoding error). This is because POSIX says the behavior is undefined
on encoding errors. The GNU syntax for regular expressions extends
POSIX and does not dump core, but it still provides no way to write the
pattern you're asking for, and the behavior is unspecified on encoding
errors. Perhaps this should be improved by fixing the above-mentioned
glitch and by providing a syntax extension for matching encoding errors,
though we'd need a volunteer to do that.
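
To spell out what such a pattern would have to match, here is a minimal
Python sketch of the semantics "either one character, or a single byte
that is an encoding error". It is not a proposal for the regex syntax
and it is not how any existing regex engine works; the function name
split_units, the "char"/"error" labels, and the max_len limit of 4 bytes
(enough for UTF-8) are all assumptions of mine.

```python
def split_units(data: bytes, encoding: str = "utf-8", max_len: int = 4):
    """Split data into units that the hypothetical pattern would match.

    Each unit is ("char", chunk) for one validly encoded character, or
    ("error", byte) for a single byte that starts no valid character.
    """
    units = []
    i = 0
    while i < len(data):
        for width in range(1, max_len + 1):
            chunk = data[i:i + width]
            try:
                text = chunk.decode(encoding)
            except UnicodeDecodeError:
                continue  # too short or invalid; try a longer chunk
            if len(text) == 1:
                units.append(("char", chunk))
                i += len(chunk)
                break
        else:
            # No prefix starting here decodes to a single character,
            # so consume exactly one byte as an encoding error.
            units.append(("error", data[i:i + 1]))
            i += 1
    return units


if __name__ == "__main__":
    print(split_units(b"a\x80\xc3\xa9"))
    # [('char', b'a'), ('error', b'\x80'), ('char', b'\xc3\xa9')]
    print(split_units(b"\x80", encoding="ascii"))
    # [('error', b'\x80')]
```

With encoding="ascii" this mirrors the glibc view of the C locale
mentioned above, where byte 0x80 is an encoding error rather than a
character.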
The situation with libpcre is weirder: there's a pattern '\C' for
matching a single byte even if it's an encoding error, but as far as I
can tell there's no way to use regular expressions safely on arbitrary
data containing encoding errors unless you're in unibyte mode (in which
case '\C' provides no extra power). I.e., \C appears to be useless in
any program for which undefined behavior is unacceptable.