[PATCH] D76291: [Support] Fix formatted_raw_ostream for UTF-8

Kristof Beyls via Phabricator via cfe-commits Mon, 22 Jun 2020 11:12:04 -0700

kristof.beyls added a comment.
Herald added a project: LLVM.

This looks fine to me; I just have a number of nit picks.
The only part where I don't understand the code logic is around the comment 
starting with "If this is the final byte of a multi-byte sequence".




================
Comment at: llvm/include/llvm/Support/FormattedStream.h:23-26
 /// formatted_raw_ostream - A raw_ostream that wraps another one and keeps 
track
 /// of line and column position, allowing padding out to specific column
 /// boundaries and querying the number of lines written to the stream.
 ///
----------------
I think it would be useful to add a description to this comment that:
(a) This class assumes that a UTF-8 encoding is used on the Stream; and
(b) "This doesn't attempt to handle everything unicode can do (combining 
characters, right-to-left markers, ...), but hopefully covers most things 
likely to be common in messages and source code we might want to print."


================
Comment at: llvm/lib/Support/FormattedStream.cpp:30
 
-  // Keep track of the current column and line by scanning the string for
-  // special characters
-  for (const char *End = Ptr + Size; Ptr != End; ++Ptr) {
-    ++Column;
-    switch (*Ptr) {
+  auto ProcessCodePoint = [&Line, &Column](StringRef CP) {
+    int Width = sys::locale::columnWidth(CP);
----------------
<bikeshedding mode>
Given that ProcessCodePoint assumes that the Unicode code point represented in 
the UTF-8 encoding, maybe it be slightly better to name the lambda as 
ProcessUTF8CodePoint rather than ProcessCodePoint?
</bikeshedding mode>


================
Comment at: llvm/lib/Support/FormattedStream.cpp:31
+  auto ProcessCodePoint = [&Line, &Column](StringRef CP) {
+    int Width = sys::locale::columnWidth(CP);
+    // columnWidth returns -1 for non-printing characters.
----------------
I'm wondering if using sys::unicode::columnWidthUTF8 instead of 
sys::locale::columnWidth would be more future-proof and more clearly describe 
the intent that this function only processes UTF-8 and not strings encoded in 
other encodings?


================
Comment at: llvm/lib/Support/FormattedStream.cpp:33
+    // columnWidth returns -1 for non-printing characters.
+    if (Width != -1)
+      Column += Width;
----------------
The documentation for sys::unicode::columnWidthUTF8 documents it returns 
ErrorNonPrintableCharacter (-1) if the text contains non-printable characters.
Maybe it's more self-documenting to compare against ErrorNonPrintableCharacter 
rather than -1 in the above if condition?


================
Comment at: llvm/lib/Support/FormattedStream.cpp:36-37
+
+    // If this is the final byte of a multi-byte sequence, it can't be any of
+    // the special whitespace characters below.
+    if (CP.size() > 1)
----------------
Reading through the code linearly from the top to the bottom,  I'm a bit 
surprised by this comment.
I would expect CP to contain exactly the bytes that when interpreted as a UTF-8 
encoded Unicode character, form exactly one Unicode character.
Therefore, I'm not sure how to interpret "If this is the final byte of a 
multi-byte sequence.".
I'm expecting "this" to refer to "CP" in this context. But that cannot be 
"just" the final byte of a multi-byte sequence, unless my assumption that CP 
contains exactly the bytes forming a single UTF-8 encoded Unicode character is 
wrong.
Could CP contain a partial UTF-8 encoded character? If so, maybe it'd be better 
to change the name ProcessCodePoint to something that suggests that could be 
possible?


================
Comment at: llvm/unittests/Support/formatted_raw_ostream_test.cpp:134
+
+  // This character encodes as three bytes, so will cause the buffer to be
+  // flushed after the first byte (4 byte buffer, 3 bytes already written). We
----------------
I guess "This" refers to \u2468? If so, it'd be easier to read this comment if 
it was written like: "// \u2468 encodes as three bytes, ..."



Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D76291/new/

https://reviews.llvm.org/D76291



_______________________________________________
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits

[PATCH] D76291: [Support] Fix formatted_raw_ostream for UTF-8

Reply via email to