On Wednesday, 18 May 2022 20:31:55 PDT Alvin Wong wrote:
> Thanks for pointing me to the test. I compiled the test and created an
> application manifest for it to use UTF-8. It correctly detected the
> system codec to be UTF-8 and reported several failures and a bunch of
> warnings. I've attached the test output.

Thanks. It looks like you were completely right in your inspection of the code.
All of the charByChar test rows that used more than 2 characters failed. The
"nul" entry also did, but that looks like a shortcoming of the Win32 API.

> This gave me an idea: perhaps the easiest way for Qt to fix it would
> be to check `GetACP() == CP_UTF8`, and if it is true then just use
> Qt's built-in UTF-8 support and bypass MultiByteToWideChar completely.

Indeed. And given that the UTF-8 codec is highly optimised, it will definitely
be much faster. I'll make the change for Qt 6.

However, it wouldn't solve the problem for other multibyte locales that may
have more than one continuation character. A quick check of the likely
culprits reveals that:
* Chinese (CP 936) uses GBK, which is limited to two bytes
* Japanese (CP 932) uses a variant of Shift JIS, but is also two-byte only
* Korean (CP 949) uses the Unified Hangul Code, which likewise only goes up to
  two bytes

Wikipedia also says that GB 2312 is the most common encoding for web pages in
Chinese, but that is a one- or two-byte codec too. And it is no longer used
by Windows itself.

So it looks like we've never hit this problem because the codepages used by
Windows were all DBCS. It might not be worth fixing the codec implementation
then.

> > As far as I know, it already does. The Vietnamese locale for Windows has
> > been using UTF-8 for years (probably since forever) and there's no reason
> > that Qt shouldn't support it.
>
> I don't have first-hand experience with Vietnamese Windows but isn't
> Windows-1258 the Vietnamese code page? I know vaguely that
> Unicode-only locales (not Vietnamese) are a thing on Windows, but I
> thought they had no way to use UTF-8 as the ACP until the beta UTF-8
> support landed on Windows 10.

Ok, I'm probably remembering wrong.
I thought it had been possible all along, but you had to switch to a
particular language (which I thought was Vietnamese), which most users were
not in a position to do. The Wikipedia article on CP 1258 has the sentence
"UTF-8 is the preferred encoding for Vietnamese in modern applications." I
guess I was misled by it and thought it meant UTF-8 was in use on Windows.

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Cloud Software Architect - Intel DCAI Cloud Engineering

_______________________________________________
Interest mailing list
Interest@qt-project.org
https://lists.qt-project.org/listinfo/interest
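
To illustrate why feeding input one byte at a time is harder for UTF-8 than for the DBCS code pages listed above, here is a minimal sketch of a byte-at-a-time decoder. It is not Qt's codec and not the Win32 API: the class name `Utf8Stream` and its `feed()` method are hypothetical, and validation is deliberately simplified (overlong forms and surrogates are not rejected, and a byte that breaks a sequence is dropped rather than reprocessed). The point is only that the decoder has to remember up to three pending continuation bytes, whereas a DBCS decoder never needs more than one saved lead byte.

```cpp
// Hypothetical illustration only: a byte-at-a-time UTF-8 decoder.
class Utf8Stream {
public:
    // Feed one byte. Returns true and sets `out` once a complete code point
    // (or U+FFFD for malformed input) is available; returns false while a
    // multi-byte sequence is still incomplete.
    bool feed(unsigned char byte, char32_t &out) {
        if (remaining_ == 0) {
            if (byte < 0x80) {                 // single-byte (ASCII)
                out = byte;
                return true;
            }
            if ((byte & 0xE0) == 0xC0) {       // lead byte of a 2-byte sequence
                value_ = byte & 0x1F;
                remaining_ = 1;
            } else if ((byte & 0xF0) == 0xE0) { // lead byte of a 3-byte sequence
                value_ = byte & 0x0F;
                remaining_ = 2;
            } else if ((byte & 0xF8) == 0xF0) { // lead byte of a 4-byte sequence
                value_ = byte & 0x07;
                remaining_ = 3;
            } else {                            // stray continuation or invalid lead
                out = 0xFFFD;
                return true;
            }
            return false;
        }
        if ((byte & 0xC0) != 0x80) {            // expected a continuation byte
            remaining_ = 0;
            out = 0xFFFD;
            return true;
        }
        value_ = (value_ << 6) | (byte & 0x3F); // accumulate 6 payload bits
        if (--remaining_ == 0) {
            out = value_;
            return true;
        }
        return false;
    }

private:
    char32_t value_ = 0;   // partially assembled code point
    int remaining_ = 0;    // continuation bytes still expected (0 to 3)
};
```

Feeding the three bytes E2 82 AC one at a time yields nothing for the first two calls and U+20AC on the third. A decoder that only keeps state for a single pending byte, which is sufficient for CP 932, CP 936 and CP 949, would lose the sequence at the second byte.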