Bug#375790: uni2ascii: Failure when read returns a partial result in the middle of a UTF-8 sequence

Dylan Thurston Tue, 27 Jun 2006 22:10:35 -0700

Package: uni2ascii
Version: 3.9-1
Severity: normal
Tags: patch

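uni2ascii fails if read() returns a partial result in the middle of a
multi-byte UTF-8 sequence.  This only happens if the UTF-8 sequence is
at least 3 bytes long: the first byte is read on its own and all the
trailing bytes are then requested in a single read(), so a short read
is only possible when two or more trailing bytes are expected.  Here's
a test case: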


------------------------------------------------------------
[EMAIL PROTECTED]:~$ (echo -ne '\344\270'; echo -e '\200') | uni2ascii
0x4E00
1 tokens converted out of 2 characters
[EMAIL PROTECTED]:~$ (echo -ne '\344\270'; sleep 2; echo -e '\200') | uni2ascii
Truncated UTF-8 sequence encountered at byte 0, character 0.
------------------------------------------------------------

The byte sequence '\344\270\200' is a single UTF-8 sequence (U+4E00).
In the first command, with no pause, the read returns all the bytes at
once and the conversion succeeds.  In the second command we insert a
pause between the 2nd and 3rd bytes, so the read returns only the 2nd
byte and uni2ascii prints a spurious error message.
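
For reference, the expected value 0x4E00 can be checked by hand with a
few lines of C.  This is only an illustrative sketch: decode_utf8_3 is
a hypothetical helper, not a function in uni2ascii, and it does no
validation of its input.

```c
#include <stdint.h>

/* Hypothetical helper (not part of uni2ascii): combine the payload bits
   of a three-byte UTF-8 sequence, laid out as
   1110xxxx 10xxxxxx 10xxxxxx, into a single code point.
   No validation is done; the caller must pass a well-formed sequence. */
uint32_t decode_utf8_3(unsigned char b0, unsigned char b1, unsigned char b2)
{
    return ((uint32_t)(b0 & 0x0F) << 12)   /* low 4 bits of the lead byte   */
         | ((uint32_t)(b1 & 0x3F) << 6)    /* low 6 bits of the second byte */
         |  (uint32_t)(b2 & 0x3F);         /* low 6 bits of the third byte  */
}
```

Calling decode_utf8_3(0344, 0270, 0200) gives 0x4E00, matching the
output of the first command.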

(The bug was found by Chung-Chieh Shan, cc:ed.)

Patch attached.  I note that the file patched, Get_UTF32_From_UTF8.c,
is written by Bill Poser, who is not listed as the author of
uni2ascii; perhaps this file came from some other package, and the
patch should be passed upstream?
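
To show the idea independently of the patched file, here is a minimal
sketch of a "keep reading until we have everything" loop.  The name
read_fully and its signature are my own assumptions for illustration;
this is not the code in Get_UTF32_From_UTF8.c.  It stops at end of
input or on a real error, and retries only when read() is interrupted
by a signal.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: keep calling read() until `want` bytes have been
   stored in `buf`, or until end of input or an error is seen.
   Returns the number of bytes actually stored; a value smaller than
   `want` means the input ended (or failed) early. */
size_t read_fully(int fd, unsigned char *buf, size_t want)
{
    size_t got = 0;

    while (got < want) {
        ssize_t n = read(fd, buf + got, want - got);
        if (n > 0) {
            got += (size_t) n;      /* partial result: ask for the rest */
        } else if (n == 0) {
            break;                  /* end of input */
        } else if (errno != EINTR) {
            break;                  /* real error; give up */
        }
        /* n < 0 with errno == EINTR: interrupted, simply retry */
    }
    return got;
}
```

A caller that needs BytesNeeded trailing bytes would compare the return
value against BytesNeeded and report a truncated sequence only when the
two differ.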

Peace,
        Dylan Thurston

-- System Information:
Debian Release: testing/unstable
  APT prefers unstable
  APT policy: (500, 'unstable'), (500, 'testing')
Architecture: i386 (i686)
Shell:  /bin/sh linked to /bin/bash
Kernel: Linux 2.6.17.1
Locale: LANG=en_US.UTF-8, LC_CTYPE=en_US.UTF-8 (charmap=UTF-8)

Versions of packages uni2ascii depends on:
ii  libc6                         2.3.6-15   GNU C Library: Shared libraries

uni2ascii recommends no packages.

-- no debconf information

--- uni2ascii-3.9/Get_UTF32_From_UTF8.c	2006-05-11 21:20:37.000000000 -0400
+++ uni2ascii-3.9.new/Get_UTF32_From_UTF8.c	2006-06-28 00:44:18.000000000 -0400
@@ -86,6 +86,7 @@
 
 UTF32 Get_UTF32_From_UTF8 (int fd, int *bytes, unsigned char **bstr)
 {
+  int BytesSoFar;
   int BytesRead;
   int BytesNeeded;		/* Additional bytes after initial byte */
   static unsigned char c[6];
@@ -102,9 +103,13 @@
 
   /* Now get the remaining bytes */
   BytesNeeded = (int) TrailingBytesForUTF8[c[0]];
-  BytesRead = read(fd,(void *) &c[1],(size_t) BytesNeeded);
-  if(BytesRead != BytesNeeded) return(UTF8_NOTENOUGHBYTES);
-  *bytes = BytesRead+1;
+  BytesSoFar = 0;
+  do {
+    BytesRead = read(fd,(void *) &c[BytesSoFar+1],(size_t) (BytesNeeded-BytesSoFar));
+    BytesSoFar += BytesRead;
+  } while (BytesRead > 0 && BytesSoFar < BytesNeeded);
+  if(BytesSoFar != BytesNeeded) return(UTF8_NOTENOUGHBYTES);
+  *bytes = BytesNeeded+1;
   *bstr = &c[0];
 
   /* Check validity of source */
