Bug#758105: bug#18266: handling bytes not part of the charset, and other garbage

Vincent Lefevre Fri, 12 Sep 2014 14:33:23 -0700

On 2014-09-12 09:16:45 -0700, Paul Eggert wrote:
> Vincent Lefevre wrote:
> >I just mean that "grep ." is a method given by some people, that
> >was working before UTF-8.
> 
> And it still works, if by "." one means "match one character".


No, by "working", I mean that "grep ." was matching any non-empty
line. A non-empty line is anything that is not "\n", with valid
characters and/or invalid byte sequences.
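
As a rough illustration of that definition (a sketch in Python, not
how grep is implemented): a bytes pattern treats every byte as one
character, much like a unibyte locale, so "." selects any line that
contains at least one byte. The function name nonempty_lines is made
up for this example.

```python
import re

def nonempty_lines(data: bytes) -> list[bytes]:
    """Return the lines of `data` that contain at least one byte.

    With a bytes pattern, "." matches any single byte except b"\n",
    whether or not that byte is part of a valid UTF-8 sequence.
    """
    return [line for line in data.split(b"\n") if re.search(rb".", line)]
```

Under this byte-oriented reading, a line consisting only of invalid
bytes such as \xff\xfe is still non-empty and would be selected.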

> Unfortunately there is no POSIX regular expression that does what you're
> looking for (match either one character, or a single byte that is an
> encoding error).  This is because POSIX says the behavior is undefined on
> encoding errors.

But since the behavior is undefined, a grep implementation is free
to do anything it likes, such as make "." match invalid bytes. See
below for details.

> The GNU syntax for regular expressions extends POSIX and does not
> dump core, but it still provides no way to write the pattern you're
> asking for, and the behavior is unspecified on encoding errors.
> Perhaps this should be improved by fixing the abovementioned glitch
> and by providing a syntax extension for matching encoding errors,
> though we'd need a volunteer to do that.

I'm not sure that a syntax extension would really be useful. I think
that an option to control what happens on encoding errors would be
better and sufficient. For instance, a choice between the 4 following
behaviors:

1. If an encoding error is encountered, grep returns an error. Some
encoding errors may remain unnoticed, e.g. if -m is used and the
max count has been reached (you can see the behavior of such an error
as being similar to a file read error). The error may be signaled
immediately, even when there is a match before.

2. An encoding error is never matched. I suppose that this is the
current behavior in UTF-8.

3. An encoding error is regarded as a special character different
from the other characters. In particular it will be matched by "."
and "[^...]". Whether a sequence of invalid bytes is regarded as a
single special character or several ones could be specified or not
(in practice, there could be 2 possibilities: either regard each
byte as a special character, or regard each longest valid prefix
as a special character). The properties of this special character
could be specified or not, concerning character classes (I would
say that the character doesn't fall in any class, possibly except
cntrl).

4. Like (3), but the character could be an existing one (such as \0).
The idea behind this behavior is that the user may not really care,
but wants grep to be fast. Now, unless \0 appears in the pattern
under some form, replacing the encoding error by a null character
would be equivalent to "(3) + the special character is in the cntrl
character class".
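
To make the four behaviors concrete, here is a small Python sketch of
what the pattern "." would do on a single line under each of them. It
is only a model under stated assumptions, not a proposal for grep's
code: the function grep_dot and the policy names "error", "never",
"special" and "nul" are invented for this example, the line is assumed
to have its trailing newline already removed, and each invalid byte is
treated as one special character (the first of the two possibilities
mentioned in (3)).

```python
import re

def grep_dot(line: bytes, policy: str) -> bool:
    """Model whether the pattern "." selects `line` under `policy`.

    `policy` is a hypothetical name for behaviors 1-4 above:
    "error", "never", "special", or "nul".
    """
    # surrogateescape turns each undecodable byte into one lone
    # surrogate in U+DC80..U+DCFF, so invalid bytes stay identifiable.
    text = line.decode("utf-8", errors="surrogateescape")

    def is_error(ch: str) -> bool:
        return "\udc80" <= ch <= "\udcff"

    if policy == "error":
        # Behavior 1: an encoding error is reported, like a read error.
        if any(is_error(ch) for ch in text):
            raise ValueError("encoding error in input")
        return re.search(".", text) is not None
    if policy == "never":
        # Behavior 2: "." can only match a validly encoded character.
        return any(not is_error(ch) for ch in text)
    if policy == "special":
        # Behavior 3: each invalid byte is a special character that
        # "." matches (Python's "." matches a lone surrogate).
        return re.search(".", text) is not None
    if policy == "nul":
        # Behavior 4: replace each invalid byte by an existing
        # character, here the null character.
        text = "".join("\0" if is_error(ch) else ch for ch in text)
        return re.search(".", text) is not None
    raise ValueError("unknown policy: " + policy)
```

For a line made only of the byte \xff, this model selects the line
under "special" and "nul", rejects it under "never", and raises an
error under "error".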

> The situation with libpcre is weirder: there's a pattern '\C' for
> matching a single byte even if it's an encoding error, but as far as
> I can tell there's no way to use regular expressions safely on
> arbitrary data containing encoding errors unless you're in unibyte
> mode (in which case '\C' provides no extra power). I.e., \C appears
> to be useless in any program for which undefined behavior is
> unacceptable.

In the context of libpcre (which doesn't support encoding errors,
contrary to Perl if I understand correctly), \C can still be used
and be useful when there are no encoding errors. But note that the
pcresyntax(3) man page says "best avoided", the pcrepattern(3) man
page says that it can yield undefined behavior (but gives a complex
example where it can be used), and the perlre(1) man page says that
\C is deprecated. So, grep could say that \C is not supported.

-- 
Vincent Lefèvre <vinc...@vinc17.net> - Web: <https://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <https://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)


