https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108976
--- Comment #14 from GCC Commits <cvs-commit at gcc dot gnu.org> --- The releases/gcc-13 branch has been updated by Jonathan Wakely <r...@gcc.gnu.org>: https://gcc.gnu.org/g:bd5e672303f5f777e8927a746d3ee42db21d871b commit r13-8788-gbd5e672303f5f777e8927a746d3ee42db21d871b Author: Dimitrij Mijoski <dm...@hotmail.com> Date: Thu Sep 28 21:38:11 2023 +0200 libstdc++: Fix handling of surrogate CP in codecvt [PR108976] This patch fixes the handling of surrogate code points in all standard facets for transcoding Unicode that are based on std::codecvt. Surrogate code points should always be treated as error. On the other hand surrogate code units can only appear in UTF-16 and only when they come in a proper pair. Additionally, it fixes a bug in std::codecvt_utf16::in() when odd number of bytes were given in the range [from, from_end), error was returned always. The last byte in such range does not form a full UTF-16 code unit and we can not make any decisions for error, instead partial should be returned. The testsuite for testing these facets was updated in the following order: 1. All functions that test codecvts that work with UTF-8 were refactored and made more generic so they accept codecvt that works with the char type char8_t. 2. The same functions were updated with new test cases for transcoding errors and now additionally test for surrogates, overlong UTF-8 sequences, code points out of the Unicode range, and more tests for missing leading and trailing code units. 3. New tests were added to test codecvt_utf16 in both of its variants, UTF-16 <-> UTF-32/UCS-4 and UTF-16 <-> UCS-2. libstdc++-v3/ChangeLog: PR libstdc++/108976 * src/c++11/codecvt.cc (read_utf8_code_point): Fix handing of surrogates in UTF-8. (ucs4_out): Fix handling of surrogates in UCS-4 -> UTF-8. (ucs4_in): Fix handling of range with odd number of bytes. (ucs4_out): Fix handling of surrogates in UCS-4 -> UTF-16. (ucs2_out): Fix handling of surrogates in UCS-2 -> UTF-16. (ucs2_in): Fix handling of range with odd number of bytes. (__codecvt_utf16_base<char16_t>::do_in): Likewise. (__codecvt_utf16_base<char32_t>::do_in): Likewise. (__codecvt_utf16_base<wchar_t>::do_in): Likewise. * testsuite/22_locale/codecvt/codecvt_unicode.cc: Renames, add tests for codecvt_utf16<char16_t> and codecvt_utf16<char32_t>. * testsuite/22_locale/codecvt/codecvt_unicode.h: Refactor UTF-8 testing functions for char8_t, add more test cases for errors, add testing functions for codecvt_utf16. * testsuite/22_locale/codecvt/codecvt_unicode_wchar_t.cc: Renames, add tests for codecvt_utf16<whchar_t>. * testsuite/22_locale/codecvt/codecvt_utf16/79980.cc (test06): Fix test. * testsuite/22_locale/codecvt/codecvt_unicode_char8_t.cc: New test. (cherry picked from commit a8b9c32da787ea0bfbfc9118ac816fa7be4b1bc8)