On 2014-09-12 09:16:45 -0700, Paul Eggert wrote:
> Vincent Lefevre wrote:
> >I just mean that "grep ." is a method given by some people, that
> >was working before UTF-8.
>
> And it still works, if by "." one means "match one character".

No, by "working", I mean that "grep ." was matching any non-empty
line. A non-empty line is anything other than "\n", whether it
contains valid characters, invalid byte sequences, or both.

> Unfortunately there is no POSIX regular expression that does what you're
> looking for (match either one character, or a single byte that is an
> encoding error). This is because POSIX says the behavior is undefined on
> encoding errors.

But since the behavior is undefined, a grep implementation is free to
do anything it likes, such as making "." match invalid bytes. See below
for details.

> The GNU syntax for regular expressions extends POSIX and does not
> dump core, but it still provides no way to write the pattern you're
> asking for, and the behavior is unspecified on encoding errors.
> Perhaps this should be improved by fixing the abovementioned glitch
> and by providing a syntax extension for matching encoding errors,
> though we'd need a volunteer to do that.

I'm not sure that a syntax extension would really be useful. I think
that an option to control what happens on encoding errors would be
better and sufficient. For instance, a choice between the 4 following
behaviors:

1. If an encoding error is encountered, grep returns an error. Some
   encoding errors may remain unnoticed, e.g. if -m is used and the
   max count has been reached (you can see the behavior of such an
   error as being similar to a file read error). The error may be
   signaled immediately, even when there is a match before it.

2. An encoding error is never matched. I suppose that this is the
   current behavior in UTF-8.

3. An encoding error is regarded as a special character different
   from the other characters. In particular it will be matched by
   "." and "[^...]".
   Whether a sequence of invalid bytes is regarded as a single
   special character or as several could be specified or not (in
   practice, there could be 2 possibilities: either regard each byte
   as a special character, or regard each longest valid prefix as a
   special character). The properties of this special character
   with respect to character classes could be specified or not (I
   would say that the character doesn't fall in any class, except
   possibly cntrl).

4. Like (3), but the character could be an existing one (such as \0).
   The idea behind this behavior is that the user may not really
   care, but wants grep to be fast. Now, unless \0 appears in the
   pattern under some form, replacing the encoding error by a null
   character would be equivalent to "(3) + the special character is
   in the cntrl character class".

> The situation with libpcre is weirder: there's a pattern '\C' for
> matching a single byte even if it's an encoding error, but as far as
> I can tell there's no way to use regular expressions safely on
> arbitrary data containing encoding errors unless you're in unibyte
> mode (in which case '\C' provides no extra power). I.e., \C appears
> to be useless in any program for which undefined behavior is
> unacceptable.

In the context of libpcre (which doesn't support encoding errors,
contrary to Perl if I understand correctly), \C can still be used and
be useful when there are no encoding errors. But note that the
pcresyntax(3) man page says "best avoided", the pcrepattern(3) man
page says that it can yield undefined behavior (but gives a complex
example where it can be used), and the perlre(1) man page says that
\C is deprecated. So, grep could say that \C is not supported.

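To make the four behaviors listed above more concrete, here is a rough
Python sketch. It is not grep code and says nothing about how grep
would implement such an option; it only models what "." would do on a
UTF-8 line under each choice. The function name dot_matches and the
policy labels "error", "never", "special" and "nul" are made up for
this illustration. It assumes the "each byte is a special character"
variant of (3), since that is what Python's "surrogateescape" error
handler gives: each undecodable byte becomes one lone surrogate in
U+DC80..U+DCFF.

```python
import re

# Lone surrogates U+DC80..U+DCFF are what the "surrogateescape" handler
# produces for undecodable bytes; here they stand in for encoding errors.
ERROR_RANGE = "\udc80-\udcff"


def dot_matches(line: bytes, policy: str) -> bool:
    """Return True if "." matches somewhere in `line` under `policy`.

    The policy labels are hypothetical names for behaviors 1 to 4:
    "error", "never", "special", "nul".
    """
    if policy == "error":
        # Behavior 1: an encoding error is reported (UnicodeDecodeError).
        text = line.decode("utf-8", errors="strict")
        return re.search(".", text) is not None

    # For the other behaviors, keep invalid bytes as lone surrogates,
    # one surrogate per invalid byte.
    text = line.decode("utf-8", errors="surrogateescape")

    if policy == "never":
        # Behavior 2: "." matches any character except newline and
        # except the stand-ins for encoding errors.
        return re.search("[^\n" + ERROR_RANGE + "]", text) is not None

    if policy == "special":
        # Behavior 3: the stand-in is an ordinary character for ".".
        return re.search(".", text) is not None

    if policy == "nul":
        # Behavior 4: replace each encoding error by an existing
        # character (here the null character) before matching.
        text = re.sub("[" + ERROR_RANGE + "]", "\0", text)
        return re.search(".", text) is not None

    raise ValueError("unknown policy: " + policy)
```

With the line b"\xff\n" (a single invalid byte), "error" raises
UnicodeDecodeError, "never" returns False, and "special" and "nul"
both return True, so only the last two preserve the old meaning of
"grep ." as "match any non-empty line".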
--
Vincent Lefèvre <vinc...@vinc17.net> - Web: <https://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <https://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)