Bug#758105: bug#18266: handling bytes not part of the charset, and other garbage

Paul Eggert Fri, 12 Sep 2014 09:21:21 -0700

Vincent Lefevre wrote:

> Glibc regards it as ASCII:

You're right. Sorry, I was confused. FreeBSD, Solaris, and AIX work the way that I thought, though. Plus, in GNU regular expressions the pattern "." works the way that I thought with LC_ALL=C; my guess (without investigating this) is that this is because whoever wrote the regex code assumed the BSDish behavior. Arguably this is a glitch in the GNU regex code, in that for consistency "." should not match encoding errors in unibyte locales.


Here's a pair of test cases to illustrate the glitch:

$ printf '\200\n' | LC_ALL=en_US.utf8 grep '.' | wc
      0       0       0
$ printf '\200\n' | LC_ALL=C grep '.' | wc
      1       0       2
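
To see why the two results are arguably inconsistent, one can check the byte 0x80 (octal \200) against both encodings directly. The sketch below uses Python's codecs rather than glibc or grep, so it illustrates only the encodings themselves, not how either locale is implemented; the helper name is_encoding_error is made up for this example.

```python
def is_encoding_error(data: bytes, encoding: str) -> bool:
    """Return True if data cannot be decoded in the given encoding."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return True
    return False


byte = b"\x80"  # the same byte that printf '\200' emits

# The byte is invalid in both ASCII and UTF-8, so a "." that means
# "match one character" would reject it in both cases.
for encoding in ("ascii", "utf-8"):
    print(encoding, is_encoding_error(byte, encoding))
```

Under these assumptions the byte is an encoding error in both cases, which is the consistency argument made above.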

> I just mean that "grep ." is a method given by some people, that
> was working before UTF-8.


And it still works, if by "." one means "match one character".

Unfortunately there is no POSIX regular expression that does what you're looking for (match either one character, or a single byte that is an encoding error). This is because POSIX says the behavior is undefined on encoding errors. The GNU syntax for regular expressions extends POSIX and does not dump core, but it still provides no way to write the pattern you're asking for, and the behavior is unspecified on encoding errors. Perhaps this should be improved by fixing the above-mentioned glitch and by providing a syntax extension for matching encoding errors, though we'd need a volunteer to do that.
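
For illustration only, the semantics being asked for (one character, or one byte that is an encoding error) can be sketched outside POSIX and GNU regex. The example below assumes Python, whose "surrogateescape" error handler maps each undecodable byte to a lone surrogate in the range U+DC80..U+DCFF, so that "." then matches it as a single item. The function names are hypothetical, and this is not a proposal for how GNU regex should implement such an extension.

```python
import re


def chars_or_bad_bytes(line: bytes) -> list:
    """Split a line into items, each one character or one invalid byte.

    Invalid bytes are decoded to lone surrogates (U+DC80..U+DCFF) by the
    "surrogateescape" handler, so "." matches each of them individually.
    """
    text = line.decode("utf-8", errors="surrogateescape")
    return re.findall(".", text)


def is_bad_byte(item: str) -> bool:
    """Return True if the item stands for a byte that failed to decode."""
    return "\udc80" <= item <= "\udcff"


# 'a', then the stray byte 0x80, then a valid two-byte UTF-8 character.
items = chars_or_bad_bytes(b"a\x80\xc3\xa9")
print(len(items), [is_bad_byte(item) for item in items])
```

Encoding the decoded text again with the same error handler restores the original bytes, so nothing is lost by matching this way.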

The situation with libpcre is weirder: there's a pattern '\C' for matching a single byte even if it's an encoding error, but as far as I can tell there's no way to use regular expressions safely on arbitrary data containing encoding errors unless you're in unibyte mode (in which case '\C' provides no extra power). I.e., \C appears to be useless in any program for which undefined behavior is unacceptable.



