On Thu, 2009-04-23 at 16:52 +0200, Jakub Wilk wrote:
> Package: moreutils
> Version: 0.34
> Severity: normal
> File: /usr/bin/isutf8
> 
> $ man utf-8 | grep -A 2 UTF-16 | sed -e 's/^ *//'
> The UCS code values 0xd800–0xdfff (UTF-16 surrogates) as well as 0xfffe
> and 0xffff (UCS non-characters) should not appear in  conforming  UTF-8
> streams.
> 
> $ s='\xed\xa0\x88\xed\xbd\x85' # 0xd808 + 0xdf45
> $ printf $s | isutf8 && echo $?
> 0

Thanks for the bug report. You report very clear bugs!

Attached is a patch that should fix the issue. Jakub, could you test it
and verify that I've understood things correctly and that it really
fixes the problem?
diff --git a/check-isutf8 b/check-isutf8
index 3abb315..83a4eed 100755
--- a/check-isutf8
+++ b/check-isutf8
@@ -39,5 +39,8 @@ check 1 '\xc2'
 check 1 '\xc2\x20'
 check 1 '\x20\xc2'
 check 1 '\300\200'
+check 1 '\xed\xa0\x88\xed\xbd\x85' # UTF-16 surrogates
+check 1 '\xef\xbf\xbe' # 0xFFFE
+check 1 '\xef\xbf\xbf' # 0xFFFF
 
 exit $failed
diff --git a/isutf8.c b/isutf8.c
index 4306c7d..c5f5eeb 100644
--- a/isutf8.c
+++ b/isutf8.c
@@ -127,6 +127,14 @@ static unsigned long decodeutf8(unsigned char *buf, int nbytes)
                             return INVALID_CHAR;
                 u = (u << 6) | (buf[j] & 0x3f);
         }
+
+        /* Conforming UTF-8 must not contain the codes 0xd800–0xdfff (UTF-16
+           surrogates), 0xfffe, or 0xffff. */
+        if (u >= 0xD800 && u <= 0xDFFF)
+            return INVALID_CHAR;
+        if (u == 0xFFFE || u == 0xFFFF)
+            return INVALID_CHAR;
+
         return u;
 }
 
@@ -145,7 +153,7 @@ static int is_utf8_byte_stream(FILE *file, char *filename, int quiet) {
         int nbytes, nbytes2;
         int c;
         unsigned long code;
-	unsigned long line, col, byteoff;
+        unsigned long line, col, byteoff;
 
         nbytes = 0;
         line = 1;
