[Lldb-commits] [PATCH] D66447: Add char8_t support (C++20)

Jonas Devlieghere via Phabricator via lldb-commits Wed, 21 Aug 2019 08:48:24 -0700

JDevlieghere added a comment.

In D66447#1638783 <https://reviews.llvm.org/D66447#1638783>, @labath wrote:


> In D66447#1638047 <https://reviews.llvm.org/D66447#1638047>, @JDevlieghere 
> wrote:
>
> > In D66447#1637640 <https://reviews.llvm.org/D66447#1637640>, @labath wrote:
> >
> > > This looks good to me, but why are we using a nul character to test utf8 
> > > support? Shouldn't we insert some funnier characters too? I mean, one of 
> > > the advantages of unicode is that it should not be affected by the system 
> > > code pages and such, so hopefully this would not cause problems even on 
> > > some more exotic setups. (And I am pretty sure I remember already seeing 
> > > some chinese chars in some of our data formatter tests)
> >
> >
> > I only glanced at the proposal, but unless I misunderstand the type only 
> > fits UTF-8 characters representable in 1 byte, which are basically just 
> > ASCII.
>
>
> I have now too glanced at the proposal (just the cppreference page, really :) 
> ). I think I understand where you got this impression from, but I don't think 
> that is fully correct. It is true that a *single* char8_t variable can hold 
> only 8 bit UTF8 code units (*not* characters), but that is not surprising 
> since UTF8 is a variable length encoding, so you can't have a type that 
> matches one character exactly. However, an *array* of char8_t is a completely 
> different thing, and I am pretty sure that these are intended to hold utf8 
> strings containing any utf8 characters (otherwise, it wouldn't really deserve 
> to call itself a utf8 type), and so we should print (and test) it as regular 
> utf8.


Sounds like I simply misunderstood your earlier comment. I thought you meant 
putting a full UTF-8 *character* in a single `char8_t`.
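
To make the unit-versus-character distinction concrete, here is a rough sketch 
(not code from the patch). The names `unit_t`, `kText`, `countUnits` and 
`countCodePoints` are made up for illustration, and `unit_t` falls back to 
`unsigned char` when the compiler has no `char8_t`:

```cpp
#include <cstddef>

// char8_t needs C++20; otherwise use unsigned char as a stand-in
// (assumption: same size and value range for this illustration).
#if defined(__cpp_char8_t)
using unit_t = char8_t;
#else
using unit_t = unsigned char;
#endif

// "a" followed by U+00E4 (a with diaeresis), written out as UTF-8 code units:
// 'a' is one unit (0x61), U+00E4 is two units (0xC3 0xA4), then the terminator.
constexpr unit_t kText[] = {0x61, 0xC3, 0xA4, 0x00};

// Number of code units before the terminating zero.
std::size_t countUnits(const unit_t *s) {
  std::size_t n = 0;
  while (s[n] != 0)
    ++n;
  return n;
}

// Number of code points: every unit that is not a continuation
// unit (bit pattern 10xxxxxx) starts a new code point.
std::size_t countCodePoints(const unit_t *s) {
  std::size_t n = 0;
  for (; *s != 0; ++s)
    if ((static_cast<unsigned>(*s) & 0xC0u) != 0x80u)
      ++n;
  return n;
}
```

The array holds two characters in three code units, so no single element can 
hold the second character on its own.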

> However, this actually surfaces the question of how should we format single 
> char8_t variables. It makes sense to display the character value if the value 
> happens to be ASCII, but I guess we shouldn't print something like "invalid 
> utf8 character" if it does contain one unit of the multibyte characters.

What about the current implementation that prints both the hex and the ASCII 
value?
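
As a sketch of what "hex plus ASCII" could look like for a single code unit 
(this is not LLDB's formatter and the output format is invented here; 
`formatUnit` and `unit_t` are hypothetical names):

```cpp
#include <cstdio>
#include <string>

// char8_t needs C++20; otherwise use unsigned char as a stand-in.
#if defined(__cpp_char8_t)
using unit_t = char8_t;
#else
using unit_t = unsigned char;
#endif

// Always show the hex value; add the character only when the unit is
// printable ASCII. Lead and continuation units of multibyte sequences
// are shown as hex only, with no "invalid character" message.
std::string formatUnit(unit_t u) {
  const unsigned v = static_cast<unsigned>(u);
  char buf[16];
  if (v >= 0x20 && v < 0x7F)
    std::snprintf(buf, sizeof buf, "0x%02x '%c'", v, static_cast<char>(v));
  else
    std::snprintf(buf, sizeof buf, "0x%02x", v);
  return std::string(buf);
}
```

With this, a unit holding 0x61 would show up with its character next to the 
hex value, while a unit holding 0xC3 would show up as hex only.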


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D66447/new/

https://reviews.llvm.org/D66447



_______________________________________________
lldb-commits mailing list
[email protected]
https://lists.llvm.org/cgi-bin/mailman/listinfo/lldb-commits
