Package: libc6
Version: 2.10.1-5
Severity: normal

libc's decoding of UTF-8 does not conform to the Unicode standard. In
particular, it processes:

* 5- and 6-byte sequences, which are not described in the Unicode standard.
* 4-byte sequences that decode to code points above U+10FFFF.
* surrogates U+D800 .. U+DFFF.

Also, it neither replaces ill-formed sequences with replacement characters
nor reports an error to the calling program.

All these sequences are ill-formed according to the Unicode standard. [1],
pages 92-94, tables 3-6 and 3-7 define well-formed UTF-8 sequences.

glibc is not going to fix it. See [2].

Nevertheless, such behavior makes glibc and eglibc non-conforming to the
Unicode standard. See [1], pages 59-62, conformance clauses C1, C9, C10.
In particular, C7 reads:

> All processes and higher-level protocols are required to abide by conformance
> clause C7 at a minimum.

This non-conforming behavior directly affects all programs that link with
libc and rely on its functions. In particular:

* sed's regexps can't match overlong byte sequences, continuation bytes that
  are not part of a sequence, and first bytes that are not followed by a
  continuation byte.

* sed matches 5- and 6-byte sequences and surrogates in UTF-8.

    $ printf 'a\xf8\x88\x80\x80\x80b' | sed -e 's/./x/g'
    xxx

* the same applies to tac(1) in regexp mode:

    $ printf 'aaa\xf8\x88\x80\x80\x80bbb' | tac -r -s $(printf '\xf8\x88\x80\x80\x80') | xxd -
    0000000: 6262 6261 6161 f888 8080 80  bbbaaa.....
    $ printf 'aaa\xf8\x88\x80\x80\x80bbb' | tac -r -s '.' | xxd -
    0000000: 6262 62f8 8880 8080 6161 61  bbb.....aaa

* iconv() processes some ill-formed sequences, which makes it unusable for
  sanitizing or validating UTF-8 input.

    $ printf '\xf8\x88\x80\x80\x80' | iconv -f UTF-8 -t UTF-8 | xxd -
    0000000: f888 8080 80  .....
    $ printf '\xf8\x88\x80\x80\x80' | iconv -f UTF-8 -t UCS-4 | xxd -
    0000000: 0020 0000  . ..
    $ echo '<?php print iconv("UTF-8", "UTF-8", "\xf8\x88\x80\x80\x80");' | php | xxd -
    0000000: f888 8080 80  .....

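For reference, the well-formedness rules of table 3-7 in [1] are small enough
to check directly. The following is only a minimal sketch of such a check, not
glibc code; the names utf8_sequence_length() and utf8_is_well_formed() are
made up for this illustration.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper: returns the length (1..4) of the well-formed UTF-8
 * sequence starting at s, or 0 if the bytes at s do not start one.
 * n is the number of bytes available at s.
 * The byte ranges follow table 3-7 of the Unicode standard.
 */
size_t utf8_sequence_length(const unsigned char *s, size_t n)
{
	unsigned char lo = 0x80, hi = 0xBF; /* allowed range of the 2nd byte */
	size_t len;

	if (n == 0)
		return 0;

	if (s[0] <= 0x7F)
		return 1;
	else if (s[0] >= 0xC2 && s[0] <= 0xDF)
		len = 2;
	else if (s[0] == 0xE0) {
		len = 3; lo = 0xA0;          /* excludes overlong 3-byte forms */
	} else if (s[0] >= 0xE1 && s[0] <= 0xEC)
		len = 3;
	else if (s[0] == 0xED) {
		len = 3; hi = 0x9F;          /* excludes surrogates U+D800..U+DFFF */
	} else if (s[0] == 0xEE || s[0] == 0xEF)
		len = 3;
	else if (s[0] == 0xF0) {
		len = 4; lo = 0x90;          /* excludes overlong 4-byte forms */
	} else if (s[0] >= 0xF1 && s[0] <= 0xF3)
		len = 4;
	else if (s[0] == 0xF4) {
		len = 4; hi = 0x8F;          /* excludes code points above U+10FFFF */
	} else
		return 0;                    /* 80..C1 and F5..FF never start a sequence */

	if (n < len)
		return 0;                    /* truncated sequence */
	if (s[1] < lo || s[1] > hi)
		return 0;
	for (size_t i = 2; i < len; i++)
		if (s[i] < 0x80 || s[i] > 0xBF)
			return 0;
	return len;
}

/* Hypothetical helper: true if all n bytes of buf are well-formed UTF-8. */
bool utf8_is_well_formed(const char *buf, size_t n)
{
	const unsigned char *s = (const unsigned char *)buf;

	while (n > 0) {
		size_t len = utf8_sequence_length(s, n);
		if (len == 0)
			return false;
		s += len;
		n -= len;
	}
	return true;
}
```

Under these rules every ill-formed string listed in this report is rejected,
including the 5-byte sequence used in the sed, tac and iconv examples above.
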
The described behavior is also unsafe from a security point of view. There
are many possible scenarios, for example:

Malicious input is processed with glibc's regexps and some ill-formed
sequences pass through. The programmer expected the output to be safe in
some sense. This result is passed to another program with a UTF-8 decoder
that simply skips ill-formed sequences (thus violating recommendation [3]
to never delete ill-formed sequences). This can lead to strings joining
unexpectedly at the place where the ill-formed sequence was. Of course, the
second program is somewhat guilty, but it was not expected that it would
ever get ill-formed sequences as input.

Attached is a testcase showing the described behavior for regexps. The same
set of ill-formed strings can be used to test iconv() and all the other
programs mentioned.

In order for these tests and demonstrations to work, please set LC_ALL to
some UTF-8 locale, for example:

    $ export LC_ALL=ru_UA.UTF-8

As a summary: if a program wants to process UTF-8 with libc and conform to
the Unicode standard, it has to invent a sanitizing function that replaces
all ill-formed sequences in the input. Or it would be easier for the
programmer to rely on some other library that conforms, for example libicu.

[1] http://www.unicode.org/versions/Unicode5.2.0/ch03.pdf
[2] http://sources.redhat.com/bugzilla/show_bug.cgi?id=2373
[3] http://unicode.org/reports/tr36/#UTF-8_Exploit

-- System Information:
Debian Release: squeeze/sid
  APT prefers testing
  APT policy: (900, 'testing')
Architecture: amd64 (x86_64)

Kernel: Linux 2.6.32-rc6-04nov2009 (SMP w/2 CPU cores)
Locale: LANG=ru_UA.UTF-8, LC_CTYPE=ru_UA.UTF-8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/dash

Versions of packages libc6 depends on:
ii  libc-bin                      2.10.1-5   GNU C Library: Binaries
ii  libgcc1                       1:4.4.1-4  GCC support library

libc6 recommends no packages.

Versions of packages libc6 suggests:
ii  debconf [debconf-2.0]         1.5.28     Debian configuration management sy
ii  glibc-doc                     2.10.1-5   GNU C Library: Documentation
ii  locales                       2.10.1-5   GNU C Library: National Language (

-- debconf information excluded

/*
 * Compile with gcc -W -Wall -Werror -std=c99
 */

#define _GNU_SOURCE 1

#include <sys/types.h>
#include <regex.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <locale.h>
#include <mcheck.h>

static struct {
	const char *pattern;
	const char *string;
} tests[] = {
	/*
	 * No match.
	 */
	{ "\\(.\\)", "\xc0\xaf" },         /* overlong 2-byte sequence for U+002F */
	{ "\\(.\\)", "\xe0\x80\xaf" },     /* overlong 3-byte sequence for U+002F */
	{ "\\(.\\)", "\xf0\x80\x80\xaf" }, /* overlong 4-byte sequence for U+002F */

	/* continuation byte that is not part of a sequence */
	{ "\\(.\\)", "\x80" },
	{ "\\(.\\)", "\x90" },
	{ "\\(.\\)", "\xaa" },
	{ "\\(.\\)", "\xbf" },

	/* first byte that is not followed by a continuation. \x61 is 'a' */
	{ "\\(.\\)", "\xc2\x61" }, /* 2-byte sequence */
	{ "\\(.\\)", "\xe0\x61" }, /* 3-byte sequence */
	{ "\\(.\\)", "\xf0\x61" }, /* 4-byte sequence */

	/*
	 * Matches, but no substitution.
	 */
	/* UTF-8 only defines 1, 2, 3 and 4-byte sequences. */
	{ "\\(.\\)", "\xf8\x88\x80\x80\x80" },     /* 5-byte sequence, U+200000 */
	{ "\\(.\\)", "\xfc\x84\x80\x80\x80\x80" }, /* 6-byte sequence, U+4000000 */

	{ "\\(.\\)", "\xed\xa0\x80" }, /* high surrogate, U+D800 */
	{ "\\(.\\)", "\xed\xa0\x91" }, /* high surrogate, U+D811 */
	{ "\\(.\\)", "\xed\xaf\xbf" }, /* high surrogate, U+DBFF */
	{ "\\(.\\)", "\xed\xb0\x80" }, /* low surrogate, U+DC00 */
	{ "\\(.\\)", "\xed\xb0\x91" }, /* low surrogate, U+DC11 */
	{ "\\(.\\)", "\xed\xbf\xbf" }, /* low surrogate, U+DFFF */
	{ "\\(.\\)", "\xed\xa0\x80\xed\xb0\x80" }, /* paired surrogates, U+D800 + U+DC00 = U+10000 */

	{ "\\(.\\)", "\xf4\x90\x80\x80" }, /* 4-byte sequence, code point U+110000 > U+10FFFF */
	{ "\\(.\\)", "\xf5\xa0\xa0\xa0" }, /* 4-byte sequence, code point U+160820 > U+10FFFF */
};

int main()
{
	mtrace();

	/* The tests are only meaningful in a UTF-8 locale. */
	if(setlocale(LC_ALL, "ru_UA.UTF-8") == NULL) {
		fprintf(stderr, "setlocale(): locale ru_UA.UTF-8 is not available\n");
		exit(1);
	}
//	setlocale(LC_ALL, "");

	for(size_t test = 0; test < sizeof(tests) / sizeof(tests[0]); test++) {
		printf("--- test %zu\n", test);

		const char *pattern = tests[test].pattern;
		const char *string = tests[test].string;

		struct re_pattern_buffer pattern_buffer;
		pattern_buffer.buffer = NULL;
		pattern_buffer.allocated = 0;
		pattern_buffer.fastmap = NULL;
		pattern_buffer.translate = NULL;
		pattern_buffer.no_sub = 0;

		re_set_syntax(RE_SYNTAX_POSIX_BASIC);
		const char *error = re_compile_pattern(pattern, strlen(pattern), &pattern_buffer);
		if(error) {
			printf("re_compile_pattern(): %s\n", error);
			exit(1);
		}

		struct re_registers regs;
		int errcode = re_match(&pattern_buffer, string, strlen(string), 0, &regs);
		if(errcode == -1) {
			printf("no match\n");
		} else if(errcode == -2) {
			printf("internal error\n");
		} else {
			for(size_t i = 0; i < regs.num_regs; i++) {
				int start = regs.start[i];
				int end = regs.end[i];
				/* Unused registers are reported as -1; print them as empty. */
				size_t length = (start >= 0 && end >= start) ? (size_t)(end - start) : 0;
				char part[length + 1]; /* one extra byte for the terminator */
				if(length > 0)
					memcpy(part, string + start, length);
				part[length] = '\0';
				printf("%zu: %d - %d: [%s]\n", i, start, end, part);
			}
		}

		regfree(&pattern_buffer);
	}
}
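
As for the sanitizing function mentioned in the summary, a program that has
to stay with libc might use something along the following lines. This is a
rough sketch only: the names well_formed_length() and utf8_sanitize() are
hypothetical, and emitting one U+FFFD per offending byte is an assumption of
this sketch, chosen for simplicity, not something the report or the standard
prescribes.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: length of the well-formed UTF-8 sequence at s
 * (n > 0 bytes available), or 0 if there is none. Decodes the code point
 * and rejects overlong forms, surrogates and values above U+10FFFF.
 */
static size_t well_formed_length(const unsigned char *s, size_t n)
{
	/* smallest code point that needs a sequence of the given length */
	static const unsigned long min_cp[5] = { 0, 0, 0x80, 0x800, 0x10000 };
	unsigned long cp;
	size_t len;

	if (s[0] < 0x80)
		return 1;
	else if ((s[0] & 0xE0) == 0xC0) { len = 2; cp = s[0] & 0x1F; }
	else if ((s[0] & 0xF0) == 0xE0) { len = 3; cp = s[0] & 0x0F; }
	else if ((s[0] & 0xF8) == 0xF0) { len = 4; cp = s[0] & 0x07; }
	else
		return 0; /* stray continuation byte, or a 5/6-byte lead byte */

	if (n < len)
		return 0; /* truncated sequence */
	for (size_t i = 1; i < len; i++) {
		if ((s[i] & 0xC0) != 0x80)
			return 0; /* not a continuation byte */
		cp = (cp << 6) | (s[i] & 0x3F);
	}
	if (cp < min_cp[len])
		return 0; /* overlong encoding */
	if (cp >= 0xD800 && cp <= 0xDFFF)
		return 0; /* surrogate */
	if (cp > 0x10FFFF)
		return 0; /* beyond the Unicode code space */
	return len;
}

/*
 * Hypothetical sanitizer: copies n bytes from src to dst, replacing every
 * byte that does not belong to a well-formed sequence with U+FFFD
 * (EF BF BD). dst must have room for 3 * n bytes. Returns the number of
 * bytes written; the output is not NUL-terminated.
 */
size_t utf8_sanitize(const char *src, size_t n, char *dst)
{
	const unsigned char *s = (const unsigned char *)src;
	size_t out = 0;

	while (n > 0) {
		size_t len = well_formed_length(s, n);
		if (len > 0) {
			memcpy(dst + out, s, len);
			out += len;
		} else {
			memcpy(dst + out, "\xEF\xBF\xBD", 3);
			out += 3;
			len = 1; /* skip only the offending byte */
		}
		s += len;
		n -= len;
	}
	return out;
}
```

Nothing is ever deleted from the input, which is the point of recommendation
[3]: the ill-formed bytes stay visible as replacement characters, so strings
on either side of them cannot join.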