On 15/03/2019 08.27, Giuseppe D'Angelo via Development wrote:
> On 14/03/19 22:48, Thiago Macieira wrote:
>> For
>>
>>   char16_t text1[] = u"" "\u0102";
>>
>> It produces, without /utf-8 (see https://msvc.godbolt.org/z/EvtKzq):
>>
>>   ?text1@@3PA_SA DB '?', 00H, 00H, 00H ; text1
>>
>> And with /utf-8:
>>
>>   ?text1@@3PA_SA DB 0c4H, 00H, 01aH, ' ', 00H, 00H ; text1
>>
>> Those two values make no sense. U+0102 is neither 0x003f (question
>> mark) nor 0x00c4 0x201a ("Ä‚"). This is a clear compiler bug. An
>> interpretation of the C++11 standard could say that the translation
>> is correct for the no-/utf-8 build,

In fact, I now believe that to be the case (if unfortunate); note
[lex.phases]¶1.5 and also
https://groups.google.com/a/isocpp.org/d/msg/std-discussion/qYf6treuLmY/EeLI6bqTCwAJ.

>> but with /utf-8 or /execution-charset:utf-8 it should have produced
>> the correct result.
>
> Actually, those values have a somehow connection with the input. Looks
> like MSVC is double-encoding it:
>
> * "\u0102" under UTF-8 execution charset produces a string containing
>   0xC4 0x82;
>
> * that string literal is a generic narrow string literal (non prefixed).
>   When concatenating to a u-prefixed string literal, somehow MSVC thinks
>   it's in its native codepage instead of UTF-8...

*That* smells buggy. I think I'll stick to /we4566 and adding the extra
'u' if my QStringLiteral is non-ASCII so that I'm not hitting this case.

> The mapping of \u escape sequences to the execution character set
> happens before string literal concatenation (translation phases 5/6).
> But AFAIU the mapping is purely symbolic, and has nothing to do with any
> actual encoding, so MSVC is at fault here?

Why do you think it's "symbolic"? The standard clearly says "if there is
no corresponding member [of the target character set], [the character]
is converted to an implementation-defined member". That's obviously the
case for the characters in question, so they get mapped to '?'.

AFAICT, in my example (execution character set == CP-1252), MSVC is
doing what the standard requires it to do. It's unfortunate that this
isn't what the user wanted, but I don't see a "solution" except to swap
phases 5 and 6.

(But again, this does *not* apply to the ECS == UTF-8 case.)

(Note: Another solution is to redefine QT_UNICODE_LITERAL_II to
`u ## str`, but that's SIC.)

-- 
Matthew
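
To illustrate the double-encoding hypothesis quoted above, here is a minimal C++ sketch that reproduces the /utf-8 values by hand: encode U+0102 as UTF-8, then reinterpret each resulting byte as CP-1252. It only models the observed output and is not a description of what MSVC does internally. The helper names (encodeUtf8, decodeCp1252, doubleEncode) are made up for this example, and mapping the five bytes that CP-1252 leaves undefined to the C1 controls is an assumption.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: encode one BMP code point as UTF-8.
// Only 1- to 3-byte sequences are handled, which is enough for U+0102.
std::string encodeUtf8(char16_t cp)
{
    // Surrogate code units are not valid scalar values on their own.
    assert(cp < 0xD800 || cp > 0xDFFF);

    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Hypothetical helper: decode one CP-1252 byte to a UTF-16 code unit.
// Bytes 0x80..0x9F use a lookup table; all other bytes match Latin-1.
// Assumption: the undefined bytes (0x81, 0x8D, 0x8F, 0x90, 0x9D) are
// passed through as the corresponding C1 control characters.
char16_t decodeCp1252(unsigned char b)
{
    static const char16_t high[32] = {
        0x20AC, 0x0081, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
        0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008D, 0x017D, 0x008F,
        0x0090, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
        0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x009D, 0x017E, 0x0178,
    };
    if (b >= 0x80 && b <= 0x9F)
        return high[b - 0x80];
    return b;
}

// Model of the suspected bug: the narrow literal is produced as UTF-8,
// but each of its bytes is then widened as if it were CP-1252.
std::u16string doubleEncode(char16_t cp)
{
    std::u16string out;
    for (char c : encodeUtf8(cp))
        out += decodeCp1252(static_cast<unsigned char>(c));
    return out;
}
```

Under these assumptions, doubleEncode(0x0102) yields the two code units 0x00C4 0x201A, which are the values in the /utf-8 listing (0c4H, 00H, 01aH, ' ' in little-endian byte order).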
