[Lldb-commits] [PATCH] D66447: Add char8_t support (C++20)

Shafik Yaghmour via Phabricator via lldb-commits Wed, 21 Aug 2019 09:23:11 -0700

shafik added a comment.

In D66447#1638783 <https://reviews.llvm.org/D66447#1638783>, @labath wrote:


> In D66447#1638047 <https://reviews.llvm.org/D66447#1638047>, @JDevlieghere 
> wrote:
>
> > In D66447#1637640 <https://reviews.llvm.org/D66447#1637640>, @labath wrote:
> >
> > > This looks good to me, but why are we using a nul character to test utf8 
> > > support? Shouldn't we insert some funnier characters too? I mean, one of 
> > > the advantages of unicode is that it should not be affected by the system 
> > > code pages and such, so hopefully this would not cause problems even on 
> > > some more exotic setups. (And I am pretty sure I remember already seeing 
> > > some Chinese chars in some of our data formatter tests)
> >
> >
> > I only glanced at the proposal, but unless I misunderstand the type only 
> > fits UTF-8 characters representable in 1 byte, which are basically just 
> > ASCII.
>
>
> I have now glanced at the proposal too (just the cppreference page, really :) 
> ). I think I understand where you got this impression from, but I don't think 
> that is fully correct. It is true that a *single* char8_t variable can hold 
> only 8 bit UTF8 code units (*not* characters), but that is not surprising 
> since UTF8 is a variable length encoding, so you can't have a type that 
> matches one character exactly. However, an *array* of char8_t is a completely 
> different thing, and I am pretty sure that these are intended to hold utf8 
> strings containing any utf8 characters (otherwise, it wouldn't really deserve 
> to call itself a utf8 type), and so we should print (and test) it as regular 
> utf8.
>
> However, this actually surfaces the question of how should we format single 
> char8_t variables. It makes sense to display the character value if the value 
> happens to be ASCII, but I guess we shouldn't print something like "invalid 
> utf8 character" if it does contain one unit of the multibyte characters.


You may find the C++ Evolution Working Group's entry on [N4197 Adding u8 
character literals, [tiny] Why no u8 character 
literals?](http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2015/n4540.html#119)
and the proposal that adds `char8_t` 
<http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0482r5.html> helpful 
in understanding the rationale. The `char8_t` proposal also runs through a 
lot of examples.
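
To make the code unit vs. character distinction discussed above concrete, here 
is a minimal sketch, not the code in this patch. The names `Utf8Unit`, 
`classify`, `count_code_points` and `describe` are hypothetical, and the sketch 
uses `unsigned char` (which `char8_t` shares its size and signedness with) so 
that it also compiles before C++20. The hex fallback in `describe` is only one 
possible answer to the open question of how to format a lone non-ASCII unit.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical classification of a single 8-bit UTF-8 code unit.
enum class Utf8Unit { Ascii, Lead, Continuation, Invalid };

// Classify one code unit by its high bits.
Utf8Unit classify(unsigned char u) {
  if (u < 0x80)
    return Utf8Unit::Ascii;        // 0xxxxxxx: a complete character
  if ((u & 0xC0) == 0x80)
    return Utf8Unit::Continuation; // 10xxxxxx: middle of a sequence
  if (u >= 0xC2 && u <= 0xF4)
    return Utf8Unit::Lead;         // first unit of a 2-4 unit sequence
  return Utf8Unit::Invalid;        // 0xC0, 0xC1, 0xF5..0xFF never occur
}

// Count code points in a NUL-terminated array of code units by counting
// every unit that is not a continuation unit (assumes well-formed input).
std::size_t count_code_points(const unsigned char *s) {
  assert(s != nullptr);
  std::size_t n = 0;
  for (; *s != 0; ++s)
    if (classify(*s) != Utf8Unit::Continuation)
      ++n;
  return n;
}

// One possible rendering of a lone code unit: printable ASCII is shown
// as a character, everything else as a hex value (an assumption, not
// what the patch does).
std::string describe(unsigned char u) {
  if (u >= 0x20 && u < 0x7F)
    return std::string("'") + static_cast<char>(u) + "'";
  static const char digits[] = "0123456789abcdef";
  std::string hex = "0x";
  hex += digits[u >> 4];
  hex += digits[u & 0xF];
  return hex;
}
```

For example, the array `{0xE4, 0xB8, 0xAD, 'a', 0}` holds four code units but 
only two characters, and no single element other than `'a'` is a character on 
its own.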


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D66447/new/

https://reviews.llvm.org/D66447



_______________________________________________
lldb-commits mailing list
lldb-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/lldb-commits
