On 14/03/19 22:48, Thiago Macieira wrote:
For

   char16_t text1[] = u"" "\u0102";

It produces, without /utf-8 (see https://msvc.godbolt.org/z/EvtKzq):

?text1@@3PA_SA DB '?', 00H, 00H, 00H                    ; text1

And with /utf-8:

?text1@@3PA_SA DB 0c4H, 00H, 01aH, ' ', 00H, 00H        ; text1

Those two values make no sense. U+0102 is neither 0x003f (question mark) nor
0x00c4 0x201a ("Ä‚"). This is a clear compiler bug. An interpretation of the
C++11 standard could say that the translation is correct for the no-/utf-8
build, but with /utf-8 or /execution-charset:utf-8 it should have produced the
correct result.


Actually, those values do have some connection with the input. It looks like MSVC is double-encoding it:

* "\u0102" under UTF-8 execution charset produces a string containing 0xC4 0x82;

* that string literal is an ordinary narrow string literal (unprefixed). When it is concatenated with a u-prefixed string literal, MSVC somehow treats its bytes as being in its native code page instead of UTF-8...

* so it now re-encodes 0xC4 0x82 from CP1252 to UTF-16, yielding
0x00 0xC4 0x20 0x1A (that is, U+00C4 U+201A), which is what ends up in text1 once the bytes are stored in little-endian order
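
The three steps above can be reproduced by hand. This is only a sketch of the suspected double encoding, not MSVC's actual code: the helper names (utf8_encode, cp1252_to_utf16, misdecode_as_cp1252) are made up for illustration, and the table covers just the CP1252 range 0x80..0x9F that differs from Latin-1.

```cpp
#include <string>

// Hypothetical helper: encode a BMP code point as UTF-8 (1 to 3 bytes).
std::string utf8_encode(char32_t cp) {
    std::string out;
    if (cp < 0x80) {
        out.push_back(static_cast<char>(cp));
    } else if (cp < 0x800) {
        out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else {
        out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
    return out;
}

// Hypothetical helper: map one CP1252 byte to a UTF-16 code unit.
// Only 0x80..0x9F differ from Latin-1; unassigned bytes map to themselves.
char16_t cp1252_to_utf16(unsigned char b) {
    static const char16_t high[32] = {
        0x20AC, 0x0081, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
        0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008D, 0x017D, 0x008F,
        0x0090, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
        0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x009D, 0x017E, 0x0178};
    if (b >= 0x80 && b <= 0x9F)
        return high[b - 0x80];
    return static_cast<char16_t>(b);
}

// The suspected bug: UTF-8 bytes are reinterpreted as CP1252 characters.
std::u16string misdecode_as_cp1252(const std::string& bytes) {
    std::u16string out;
    for (unsigned char b : bytes)
        out.push_back(cp1252_to_utf16(b));
    return out;
}
```

Calling misdecode_as_cp1252(utf8_encode(0x0102)) gives the two code units 0x00C4 and 0x201A, which matches the /utf-8 output quoted above.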

The mapping of \u escape sequences to the execution character set happens before string literal concatenation (translation phases 5 and 6 respectively). But AFAIU the mapping is purely symbolic and has nothing to do with any actual encoding, so is MSVC at fault here?
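
For what it's worth, the result I would expect can be written down as a compile-time check. This is a sketch under the assumption that the concatenated literal ends up holding the single UTF-16 code unit for U+0102; I would expect GCC and Clang to accept it, while a compiler showing the behaviour reported above would fail the second assertion.

```cpp
// Same declaration as in the report, made constexpr so it can be checked.
constexpr char16_t text1[] = u"" "\u0102";

// One code unit for U+0102 plus the terminating null.
static_assert(sizeof(text1) / sizeof(text1[0]) == 2,
              "expected exactly one UTF-16 code unit plus terminator");
static_assert(text1[0] == 0x0102,
              "expected U+0102, not a re-encoded byte sequence");
static_assert(text1[1] == 0, "expected null terminator");
```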

My 2 c,

--
Giuseppe D'Angelo | [email protected] | Senior Software Engineer
KDAB (France) S.A.S., a KDAB Group company
Tel. France +33 (0)4 90 84 08 53, http://www.kdab.com
KDAB - The Qt, C++ and OpenGL Experts

_______________________________________________
Development mailing list
[email protected]
https://lists.qt-project.org/listinfo/development