[Bug libstdc++/108976] codecvt for Unicode allows surrogate code points

redi at gcc dot gnu.org via Gcc-bugs Thu, 02 Mar 2023 03:17:41 -0800

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108976


--- Comment #3 from Jonathan Wakely <redi at gcc dot gnu.org> ---
I have some new code for handling UTF-8 for std::print, and using that code
your relaxed u8str gets converted to 12 U+FFFD code points when printed to a
terminal, which I think is correct.

#include <print>

int main()
{
  char u8str[] = "\uC800\uCBFF\uCC00\uCFFF";
  std::println("valid UTF-8: {}", u8str);

  u8str[0] = u8str[3] = u8str[6] = u8str[9] = 0xED; // turn the C into D.
  // now the string is D800, DBFF, DC00 and DFFF encoded in relaxed UTF-8
  // that allows surrogate code points.
  std::vprint_nonunicode("invalid UTF-8 printed raw: {}\n",
                         std::make_format_args(u8str));
  std::println("invalid UTF-8 printed safely: {}", u8str);
}
$ g++ -std=c++23 surr.cc && ./a.out && ./a.out | xxd
valid UTF-8: 저쯿찀쿿
invalid UTF-8 printed raw: ������������
invalid UTF-8 printed safely: ������������
00000000: 7661 6c69 6420 5554 462d 383a 20ec a080  valid UTF-8: ...
00000010: ecaf bfec b080 ecbf bf0a 696e 7661 6c69  ..........invali
00000020: 6420 5554 462d 3820 7072 696e 7465 6420  d UTF-8 printed 
00000030: 7261 773a 20ed a080 edaf bfed b080 edbf  raw: ...........
00000040: bf0a 696e 7661 6c69 6420 5554 462d 3820  ..invalid UTF-8 
00000050: 7072 696e 7465 6420 7361 6665 6c79 3a20  printed safely: 
00000060: efbf bdef bfbd efbf bdef bfbd efbf bdef  ................
00000070: bfbd efbf bdef bfbd efbf bdef bfbd efbf  ................
00000080: bdef bfbd 0a                             .....
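
To illustrate why exactly 12 U+FFFD code points come out, here is a minimal sketch of a strict UTF-8 sanitizer. It is not the new libstdc++ code; the name replace_ill_formed is hypothetical, and the byte ranges are assumed from Table 3-7 of the Unicode Standard ("Well-Formed UTF-8 Byte Sequences"), with each maximal ill-formed subpart replaced by one U+FFFD. Because 0xED may only be followed by 0x80..0x9F, each encoded surrogate ED xx xx breaks into three separate errors, so four surrogates give 12 replacements.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper for illustration only (not libstdc++ code): copy `in`,
// replacing each maximal ill-formed subpart with U+FFFD (bytes EF BF BD).
std::string replace_ill_formed(std::string_view in)
{
  std::string out;
  std::size_t i = 0;
  while (i < in.size())
  {
    const unsigned char b0 = in[i];
    std::size_t len = 0;                 // expected length, 0 = invalid lead byte
    unsigned char lo = 0x80, hi = 0xBF;  // allowed range for the second byte

    if (b0 <= 0x7F)
      len = 1;
    else if (b0 >= 0xC2 && b0 <= 0xDF)
      len = 2;
    else if (b0 >= 0xE0 && b0 <= 0xEF)
    {
      len = 3;
      if (b0 == 0xE0) lo = 0xA0;       // rejects overlong encodings
      else if (b0 == 0xED) hi = 0x9F;  // rejects surrogates U+D800..U+DFFF
    }
    else if (b0 >= 0xF0 && b0 <= 0xF4)
    {
      len = 4;
      if (b0 == 0xF0) lo = 0x90;       // rejects overlong encodings
      else if (b0 == 0xF4) hi = 0x8F;  // rejects values above U+10FFFF
    }

    // n = number of leading bytes that form a valid prefix of a sequence.
    std::size_t n = (len == 0) ? 0 : 1;
    while (n != 0 && n < len && i + n < in.size())
    {
      const unsigned char b = in[i + n];
      if (b < lo || b > hi)
        break;
      lo = 0x80;  // only the second byte has a restricted range
      hi = 0xBF;
      ++n;
    }

    if (len != 0 && n == len)
    {
      out.append(in.substr(i, n));  // well-formed: copy unchanged
      i += n;
    }
    else
    {
      out += "\xEF\xBF\xBD";        // one U+FFFD per ill-formed subpart
      i += (n == 0) ? 1 : n;
    }
  }
  assert(i == in.size());  // every input byte was consumed exactly once
  return out;
}
```

Under these assumptions, passing the 12 bytes ED A0 80 ED AF BF ED B0 80 ED BF BF yields 36 bytes (12 copies of EF BF BD), matching the hexdump above, while the original EC-prefixed string is returned unchanged.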


The new code is also much faster, so I'm thinking of rewriting some of the
src/c++11/codecvt.cc facets to use it. But that's a longer-term project; we
should fix this bug first.
