[Bug libstdc++/31643] New: Codecvt facets with UTF-8 encoding fail to recognize partial encoding sequences

jcavalla at postini dot com Fri, 20 Apr 2007 14:28:13 -0700

Problem:

When calling the in() method of a codecvt facet for a locale that specifies
UTF-8 encoding, the method fails to recognize a partial (i.e., incomplete)
UTF-8 encoding sequence at the end of the source string.  Instead of returning
the expected std::codecvt_base::partial status code with the returned source
position (arg-4, from_next) indexing the start of the incomplete sequence, the
method returns std::codecvt_base::ok with the returned source position just
past the end of the source string.  As expected, nothing from the partial
sequence ends up in the destination wide string.
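
To make the expected source position concrete, here is a minimal sketch of a
tail check a caller could run on the source buffer.  The function name
complete_utf8_prefix is hypothetical and is not part of libstdc++; it only
inspects the last few bytes, does not validate the rest of the string, and
leaves genuine encoding errors for in() to report.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not a libstdc++ API): returns the length of the
// longest prefix of `s` that does not end in a truncated UTF-8 sequence.
// Only the tail is examined; earlier bytes are not validated.
std::size_t complete_utf8_prefix(const std::string& s) {
    const std::size_t n = s.size();
    std::size_t i = n;
    std::size_t back = 0;
    // Walk back over at most three trailing continuation bytes (10xxxxxx).
    while (i > 0 && back < 3 &&
           (static_cast<unsigned char>(s[i - 1]) & 0xC0) == 0x80) {
        --i;
        ++back;
    }
    if (i == 0) return n;  // no lead byte found; let in() report the error

    const unsigned char lead = static_cast<unsigned char>(s[i - 1]);
    std::size_t need = 0;  // sequence length announced by the lead byte
    if ((lead & 0x80) == 0x00)      need = 1;
    else if ((lead & 0xE0) == 0xC0) need = 2;
    else if ((lead & 0xF0) == 0xE0) need = 3;
    else if ((lead & 0xF8) == 0xF0) need = 4;
    else return n;  // invalid lead byte such as 0xFF; let in() report it

    const std::size_t have = n - (i - 1);  // bytes present from the lead byte on
    return have < need ? i - 1 : n;
}
```

For the partial test string in this report ("\xC2\xA1Hasta ma\xC3", 11 bytes)
the sketch returns 10, which is the source position in() would be expected to
report together with std::codecvt_base::partial.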


Compilation:

gcc -v --save-temps -Wall -ansi -pedantic -g -o localetest localetest.cxx

Compilation output:

Using built-in specs.
Target: i386-redhat-linux
Configured with: ../configure --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --enable-shared --enable-threads=posix --enable-checking=release --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-libgcj-multifile --enable-languages=c,c++,objc,obj-c++,java,fortran,ada --enable-java-awt=gtk --disable-dssi --enable-plugin --with-java-home=/usr/lib/jvm/java-1.4.2-gcj-1.4.2.0/jre --with-cpu=generic --host=i386-redhat-linux
Thread model: posix
gcc version 4.1.1 20070105 (Red Hat 4.1.1-51)
 /usr/libexec/gcc/i386-redhat-linux/4.1.1/cc1plus -E -quiet -v -D_GNU_SOURCE localetest.cxx -mtune=generic -ansi -Wall -pedantic -fworking-directory -fpch-preprocess -o localetest.ii
ignoring nonexistent directory "/usr/lib/gcc/i386-redhat-linux/4.1.1/../../../../i386-redhat-linux/include"
#include "..." search starts here:
#include <...> search starts here:
 /usr/lib/gcc/i386-redhat-linux/4.1.1/../../../../include/c++/4.1.1
 /usr/lib/gcc/i386-redhat-linux/4.1.1/../../../../include/c++/4.1.1/i386-redhat-linux
 /usr/lib/gcc/i386-redhat-linux/4.1.1/../../../../include/c++/4.1.1/backward
 /usr/local/include
 /usr/lib/gcc/i386-redhat-linux/4.1.1/include
 /usr/include
End of search list.
 /usr/libexec/gcc/i386-redhat-linux/4.1.1/cc1plus -fpreprocessed localetest.ii -quiet -dumpbase localetest.cxx -mtune=generic -ansi -auxbase localetest -g -Wall -pedantic -ansi -version -o localetest.s
GNU C++ version 4.1.1 20070105 (Red Hat 4.1.1-51) (i386-redhat-linux)
        compiled by GNU C version 4.1.1 20070105 (Red Hat 4.1.1-51).
GGC heuristics: --param ggc-min-expand=100 --param ggc-min-heapsize=131072
Compiler executable checksum: 4720743fdfefd64206c8550433f6e508
 as -V -Qy -o localetest.o localetest.s
GNU assembler version 2.17.50.0.6-2.fc6 (i386-redhat-linux) using BFD version 2.17.50.0.6-2.fc6 20061020
 /usr/libexec/gcc/i386-redhat-linux/4.1.1/collect2 --eh-frame-hdr -m elf_i386 --hash-style=gnu -dynamic-linker /lib/ld-linux.so.2 -o localetest /usr/lib/gcc/i386-redhat-linux/4.1.1/../../../crt1.o /usr/lib/gcc/i386-redhat-linux/4.1.1/../../../crti.o /usr/lib/gcc/i386-redhat-linux/4.1.1/crtbegin.o -L/usr/lib/gcc/i386-redhat-linux/4.1.1 -L/usr/lib/gcc/i386-redhat-linux/4.1.1 -L/usr/lib/gcc/i386-redhat-linux/4.1.1/../../.. localetest.o -lstdc++ -lm -lgcc_s -lgcc -lc -lgcc_s -lgcc /usr/lib/gcc/i386-redhat-linux/4.1.1/crtend.o /usr/lib/gcc/i386-redhat-linux/4.1.1/../../../crtn.o

Test Source File (localetest.cxx):

//
// This test demonstrates that UTF-8 codecvt facets are ignoring incomplete
// trailing encoding sequences.  The expected behavior is a return of the
// status value std::codecvt_base::partial, with the returned current source
// position at the start of the failed sequence.  The actual behavior is a
// return of std::codecvt_base::ok, with the returned current source position
// at the end of the source string (i.e., the incomplete sequence is ignored).
// 

#include <cstring>              // For memset().
#include <iostream>
#include <string>
#include <locale>

using namespace std;

//
// Some typedefs to help with facet access.
//

typedef codecvt_base::result Result;
typedef string::traits_type::state_type State;
typedef codecvt<wstring::value_type, string::value_type, State> Converter;

wchar_t to[256];                // Destination buffer.

//
// Perform each test iteration fresh, just to make sure that there isn't any
// lingering context between tests.
//

void
dotest(
    const string &test_name,
    const char *const locale_name,
    const string &test_string
) {

    State q;                            // Shift state context.
    const string::value_type *me = 0;   // Multibyte source current position.
    wstring::value_type *we = 0;        // Wide destination current position.
    Result status;                      // Conversion status.

    //
    // Set the current locale.
    //

    locale loc(locale_name);
    locale::global(loc);
    cout.imbue(loc);

    //
    // Start with a clear output buffer.
    //

    memset(to, 0, sizeof(to));

    //
    // Do the conversion from narrow multibyte to wide Unicode.
    //

    const Converter& cvt = use_facet<Converter>(loc);
    memset(&q, 0, sizeof(q));
    string::size_type src_size = test_string.size();

    status =
        cvt.in(
            q,
            test_string.data(), test_string.data() + src_size, me,
            to, to + sizeof(to)/sizeof(to[0]), we
        );

    string::size_type mpos = me - test_string.data();
    wstring::size_type wpos = we - to;

    //
    // Display the results:
    //

    cout << endl;

    cout << test_name << ": " << loc.name() << endl;
    cout << "  Input:";
    for (
        string::const_iterator i = test_string.begin();
        i != test_string.end();
        ++i
    )
        cout << " " << hex << ((*i)&0xFF);
    cout << " \"" << test_string << "\"" << endl;

    cout << dec << "  Result=" << status
        << " Source=" << mpos
        << " Dest=" << wpos
        << endl;

    cout << "  Output:";
    for (size_t i = 0; i < wpos; ++i)
        cout << " " << hex << static_cast<unsigned long>(to[i]);
    cout << endl;

    cout << endl;

    return;

}

//
// Do three tests for each locale: one with a good string, one with a partial
// string, and one with an error string.
//

string from_ok("\xC2\xA1Hasta ma\xC3\xB1\x61na!");
                                // Whole string, with complete lowercase en-yay
                                // sequence (\xC3\xB1).

string from_partial("\xC2\xA1Hasta ma\xC3");
                                // Partial string, with lowercase en-yay cut
                                // off after the first byte of the two-byte
                                // sequence.

string from_error("\xC2\xA1Hasta\xFF ma\xC3\xB1\x61na!");
                                // An error in the middle of the string, for
                                // comparison purposes.

void
dolocale(const char *const locale_name) {
    dotest("Complete", locale_name, from_ok);
    dotest("Partial", locale_name, from_partial);
    dotest("Error", locale_name, from_error);
    return;
}

//
// Do the test across 3 different locales, all with UTF-8 encoding.
//

int
main(int argc, char *argv[]) {

    dolocale("en_US.UTF-8");
    dolocale("es_US.UTF-8");
    dolocale("es_CR.UTF-8");

    return 0;

}

Test Output:

Complete: en_US.UTF-8
  Input: c2 a1 48 61 73 74 61 20 6d 61 c3 b1 61 6e 61 21 "¡Hasta mañana!"
  Result=0 Source=16 Dest=14
  Output: a1 48 61 73 74 61 20 6d 61 f1 61 6e 61 21


Partial: en_US.UTF-8
  Input: c2 a1 48 61 73 74 61 20 6d 61 c3 "¡Hasta ma&#65533;"
  Result=0 Source=11 Dest=9
  Output: a1 48 61 73 74 61 20 6d 61


Error: en_US.UTF-8
  Input: c2 a1 48 61 73 74 61 ff 20 6d 61 c3 b1 61 6e 61 21 "¡Hasta&#65533;
mañana!"
  Result=2 Source=7 Dest=6
  Output: a1 48 61 73 74 61


Complete: es_US.UTF-8
  Input: c2 a1 48 61 73 74 61 20 6d 61 c3 b1 61 6e 61 21 "¡Hasta mañana!"
  Result=0 Source=16 Dest=14
  Output: a1 48 61 73 74 61 20 6d 61 f1 61 6e 61 21


Partial: es_US.UTF-8
  Input: c2 a1 48 61 73 74 61 20 6d 61 c3 "¡Hasta ma&#65533;"
  Result=0 Source=11 Dest=9
  Output: a1 48 61 73 74 61 20 6d 61


Error: es_US.UTF-8
  Input: c2 a1 48 61 73 74 61 ff 20 6d 61 c3 b1 61 6e 61 21 "¡Hasta&#65533;
mañana!"
  Result=2 Source=7 Dest=6
  Output: a1 48 61 73 74 61


Complete: es_CR.UTF-8
  Input: c2 a1 48 61 73 74 61 20 6d 61 c3 b1 61 6e 61 21 "¡Hasta mañana!"
  Result=0 Source=16 Dest=14
  Output: a1 48 61 73 74 61 20 6d 61 f1 61 6e 61 21


Partial: es_CR.UTF-8
  Input: c2 a1 48 61 73 74 61 20 6d 61 c3 "¡Hasta ma&#65533;"
  Result=0 Source=11 Dest=9
  Output: a1 48 61 73 74 61 20 6d 61


Error: es_CR.UTF-8
  Input: c2 a1 48 61 73 74 61 ff 20 6d 61 c3 b1 61 6e 61 21 "¡Hasta&#65533;
mañana!"
  Result=2 Source=7 Dest=6
  Output: a1 48 61 73 74 61


Test Results:

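Note that each of the error cases properly reports the invalid encoding byte
(\xFF) in the source string at source position 7, with Result=2
(std::codecvt_base::error).  The partial test cases, however, improperly
ignore the partial encoding sequence (\xC3) at the end of the partial test
strings: they return Result=0 (ok) with Source=11, where Result=1 (partial)
with Source=10 is expected.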
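
For reading the "Result=" values in the output above, the following sketch
maps the numeric status to its enumerator name.  The helper result_name is a
hypothetical convenience function, not something provided by the library; the
numeric values follow from the declaration order of std::codecvt_base::result
(ok, partial, error, noconv).

```cpp
#include <locale>
#include <string>

// Hypothetical helper: turn a codecvt status into a readable name.
std::string result_name(std::codecvt_base::result r) {
    switch (r) {
        case std::codecvt_base::ok:      return "ok";       // printed as 0
        case std::codecvt_base::partial: return "partial";  // printed as 1
        case std::codecvt_base::error:   return "error";    // printed as 2
        case std::codecvt_base::noconv:  return "noconv";   // printed as 3
    }
    return "unknown";
}
```

Printing result_name(status) in place of the raw status in dotest() would make
the difference between the observed "ok" and the expected "partial" visible at
a glance.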


-- 
           Summary: Codecvt facets with UTF-8 encoding fail to recognize
                    partial encoding sequences
           Product: gcc
           Version: 4.1.1
            Status: UNCONFIRMED
          Severity: major
          Priority: P3
         Component: libstdc++
        AssignedTo: unassigned at gcc dot gnu dot org
        ReportedBy: jcavalla at postini dot com
GCC target triplet: i386-redhat-linux


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31643
