[PATCH] D74731: [Clangd] Fixed assertion when processing extended ASCII characters.

2020-02-17 Thread Yancheng Zheng via Phabricator via cfe-commits
AnakinZheng created this revision.
AnakinZheng added reviewers: kadircet, sammccall.
AnakinZheng added a project: clang.
Herald added subscribers: cfe-commits, usaxena95, arphaman, jkorous, MaskRay, 
ilya-biryukov.

Previously this piece of code asserts when processing extended ASCII. Proposing 
a fix to extend the ascii handling code to take extended ascii as well.


Repository:
  rG LLVM Github Monorepo

https://reviews.llvm.org/D74731

Files:
  clang-tools-extra/clangd/SourceCode.cpp


Index: clang-tools-extra/clangd/SourceCode.cpp
===
--- clang-tools-extra/clangd/SourceCode.cpp
+++ clang-tools-extra/clangd/SourceCode.cpp
@@ -59,7 +59,7 @@
   // Astral codepoints are encoded as 4 bytes in UTF-8, starting with 0xxx.
   for (size_t I = 0; I < U8.size();) {
 unsigned char C = static_cast(U8[I]);
-if (LLVM_LIKELY(!(C & 0x80))) { // ASCII character.
+if (LLVM_LIKELY(!(C & 0x100))) { // ASCII or extended ASCII character.
   if (CB(1, 1))
 return true;
   ++I;


Index: clang-tools-extra/clangd/SourceCode.cpp
===
--- clang-tools-extra/clangd/SourceCode.cpp
+++ clang-tools-extra/clangd/SourceCode.cpp
@@ -59,7 +59,7 @@
   // Astral codepoints are encoded as 4 bytes in UTF-8, starting with 0xxx.
   for (size_t I = 0; I < U8.size();) {
 unsigned char C = static_cast(U8[I]);
-if (LLVM_LIKELY(!(C & 0x80))) { // ASCII character.
+if (LLVM_LIKELY(!(C & 0x100))) { // ASCII or extended ASCII character.
   if (CB(1, 1))
 return true;
   ++I;
___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D74731: [Clangd] Fixed assertion when processing extended ASCII characters.

2020-02-17 Thread Yancheng Zheng via Phabricator via cfe-commits
AnakinZheng updated this revision to Diff 245021.
AnakinZheng added a comment.

Fix the previous stupid mistake.


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D74731/new/

https://reviews.llvm.org/D74731

Files:
  clang-tools-extra/clangd/SourceCode.cpp


Index: clang-tools-extra/clangd/SourceCode.cpp
===
--- clang-tools-extra/clangd/SourceCode.cpp
+++ clang-tools-extra/clangd/SourceCode.cpp
@@ -58,8 +58,8 @@
   // A codepoint takes two UTF-16 code unit if it's astral (outside BMP).
   // Astral codepoints are encoded as 4 bytes in UTF-8, starting with 0xxx.
   for (size_t I = 0; I < U8.size();) {
-unsigned char C = static_cast(U8[I]);
-if (LLVM_LIKELY(!(C & 0x100))) { // ASCII or extended ASCII character.
+unsigned short C = static_cast(U8[I]);
+if (LLVM_LIKELY(C <= 255)) { // ASCII or extended ASCII character.
   if (CB(1, 1))
 return true;
   ++I;


Index: clang-tools-extra/clangd/SourceCode.cpp
===
--- clang-tools-extra/clangd/SourceCode.cpp
+++ clang-tools-extra/clangd/SourceCode.cpp
@@ -58,8 +58,8 @@
   // A codepoint takes two UTF-16 code unit if it's astral (outside BMP).
   // Astral codepoints are encoded as 4 bytes in UTF-8, starting with 0xxx.
   for (size_t I = 0; I < U8.size();) {
-unsigned char C = static_cast(U8[I]);
-if (LLVM_LIKELY(!(C & 0x100))) { // ASCII or extended ASCII character.
+unsigned short C = static_cast(U8[I]);
+if (LLVM_LIKELY(C <= 255)) { // ASCII or extended ASCII character.
   if (CB(1, 1))
 return true;
   ++I;
___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D74731: [Clangd] Fixed assertion when processing extended ASCII characters.

2020-02-17 Thread Yancheng Zheng via Phabricator via cfe-commits
AnakinZheng marked an inline comment as done.
AnakinZheng added a comment.

Update patch.




Comment at: clang-tools-extra/clangd/SourceCode.cpp:62
 unsigned char C = static_cast(U8[I]);
-if (LLVM_LIKELY(!(C & 0x80))) { // ASCII character.
+if (LLVM_LIKELY(!(C & 0x100))) { // ASCII or extended ASCII character.
   if (CB(1, 1))

kadircet wrote:
> `C` is one byte long, it is not possible for it to ever satisfy `C&0x100`.
Updated the patch, now it should be correct.


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D74731/new/

https://reviews.llvm.org/D74731



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D74731: [Clangd] Fixed assertion when processing extended ASCII characters.

2020-02-17 Thread Yancheng Zheng via Phabricator via cfe-commits
AnakinZheng marked an inline comment as done.
AnakinZheng added inline comments.



Comment at: clang-tools-extra/clangd/SourceCode.cpp:62
 unsigned char C = static_cast(U8[I]);
-if (LLVM_LIKELY(!(C & 0x80))) { // ASCII character.
+if (LLVM_LIKELY(!(C & 0x100))) { // ASCII or extended ASCII character.
   if (CB(1, 1))

kadircet wrote:
> AnakinZheng wrote:
> > kadircet wrote:
> > > `C` is one byte long, it is not possible for it to ever satisfy `C&0x100`.
> > Updated the patch, now it should be correct.
> sorry but this is still the same since `C` is an `unsigned char` it is always 
> guaranteed to be less than or equal to 255.
> 
> As @sammccall explained above, we should rather re-arrange the logic to get 
> rid of the assertion below, e.g. when `UTF8Length` is less than 2 or more 
> than 4 we can choose to interpret it as ascii, instead of asserting, while 
> commenting the reason for this choice.
It's unsigned short in the new patch. Ok, I'll try @sammccall 's suggestion.


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D74731/new/

https://reviews.llvm.org/D74731



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D74731: [Clangd] Fixed assertion when processing extended ASCII characters.

2020-02-17 Thread Yancheng Zheng via Phabricator via cfe-commits
AnakinZheng marked an inline comment as done.
AnakinZheng added inline comments.



Comment at: clang-tools-extra/clangd/SourceCode.cpp:62
 unsigned char C = static_cast(U8[I]);
-if (LLVM_LIKELY(!(C & 0x80))) { // ASCII character.
+if (LLVM_LIKELY(!(C & 0x100))) { // ASCII or extended ASCII character.
   if (CB(1, 1))

kadircet wrote:
> AnakinZheng wrote:
> > kadircet wrote:
> > > AnakinZheng wrote:
> > > > kadircet wrote:
> > > > > `C` is one byte long, it is not possible for it to ever satisfy 
> > > > > `C&0x100`.
> > > > Updated the patch, now it should be correct.
> > > sorry but this is still the same since `C` is an `unsigned char` it is 
> > > always guaranteed to be less than or equal to 255.
> > > 
> > > As @sammccall explained above, we should rather re-arrange the logic to 
> > > get rid of the assertion below, e.g. when `UTF8Length` is less than 2 or 
> > > more than 4 we can choose to interpret it as ascii, instead of asserting, 
> > > while commenting the reason for this choice.
> > It's unsigned short in the new patch. Ok, I'll try @sammccall 's suggestion.
> > It's unsigned short in the new patch.
> 
> Sorry I missed that change, but still `U8[I]` is a one byte element(it is of 
> type `char`), you can't get away by just casting it to `unsigned short`.
Yup, just aware of that, @sammccall 's suggestion looks reasonable, 
implementing now.


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D74731/new/

https://reviews.llvm.org/D74731



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D74731: [Clangd] Fixed assertion when processing extended ASCII characters.

2020-02-17 Thread Yancheng Zheng via Phabricator via cfe-commits
AnakinZheng updated this revision to Diff 245036.
AnakinZheng added a comment.

Implement @sammccall 's suggestion and add test.


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D74731/new/

https://reviews.llvm.org/D74731

Files:
  clang-tools-extra/clangd/SourceCode.cpp
  clang-tools-extra/clangd/unittests/SourceCodeTests.cpp


Index: clang-tools-extra/clangd/unittests/SourceCodeTests.cpp
===
--- clang-tools-extra/clangd/unittests/SourceCodeTests.cpp
+++ clang-tools-extra/clangd/unittests/SourceCodeTests.cpp
@@ -57,6 +57,14 @@
   EXPECT_EQ(lspLength("¥"), 1UL);
   // astral
   EXPECT_EQ(lspLength("😂"), 2UL);
+  // Extended ASCII
+  const char extendedASCIIStr[2] = {static_cast(160U), '\0'};
+  EXPECT_EQ(lspLength(extendedASCIIStr), 1UL);
+  // Invalid UTF16
+  const char invalidUTF16[6] = {
+  static_cast(0xFF), static_cast(0xFF), 
static_cast(0xFF),
+  static_cast(0xFF), static_cast(0xFF), '\0'};
+  EXPECT_EQ(lspLength(invalidUTF16), 2UL);
 
   WithContextValue UTF8(kCurrentOffsetEncoding, OffsetEncoding::UTF8);
   EXPECT_EQ(lspLength(""), 0UL);
@@ -66,6 +74,10 @@
   EXPECT_EQ(lspLength("¥"), 2UL);
   // astral
   EXPECT_EQ(lspLength("😂"), 4UL);
+  // Extended ASCII
+  EXPECT_EQ(lspLength(extendedASCIIStr), 1UL);
+  // Invalid UTF16
+  EXPECT_EQ(lspLength(invalidUTF16), 5UL);
 
   WithContextValue UTF32(kCurrentOffsetEncoding, OffsetEncoding::UTF32);
   EXPECT_EQ(lspLength(""), 0UL);
@@ -75,6 +87,10 @@
   EXPECT_EQ(lspLength("¥"), 1UL);
   // astral
   EXPECT_EQ(lspLength("😂"), 1UL);
+  // Extended ASCII
+  EXPECT_EQ(lspLength(extendedASCIIStr), 1UL);
+  // Invalid UTF16
+  EXPECT_EQ(lspLength(invalidUTF16), 1UL);
 }
 
 // The = → 🡆 below are ASCII (1 byte), BMP (3 bytes), and astral (4 bytes).
Index: clang-tools-extra/clangd/SourceCode.cpp
===
--- clang-tools-extra/clangd/SourceCode.cpp
+++ clang-tools-extra/clangd/SourceCode.cpp
@@ -67,15 +67,20 @@
 }
 // This convenient property of UTF-8 holds for all non-ASCII characters.
 size_t UTF8Length = llvm::countLeadingOnes(C);
-// 0xxx is ASCII, handled above. 10xxx is a trailing byte, invalid here.
-// 1xxx is not valid UTF-8 at all. Assert because it's probably our 
bug.
-assert((UTF8Length >= 2 && UTF8Length <= 4) &&
-   "Invalid UTF-8, or transcoding bug?");
 I += UTF8Length; // Skip over all trailing bytes.
-// A codepoint takes two UTF-16 code unit if it's astral (outside BMP).
-// Astral codepoints are encoded as 4 bytes in UTF-8 (0xxx ...)
-if (CB(UTF8Length, UTF8Length == 4 ? 2 : 1))
-  return true;
+// 0xxx is ASCII, handled above. 10xxx is a trailing byte, invalid here.
+// 1xxx is not valid UTF-8 at all. Log it because it's probably our 
bug.
+if(UTF8Length == 1 || UTF8Length > 4) {
+  vlog("File contains invalid UTF-8, or transcoding bug? UTF8Length = {0}, 
string = {1}", UTF8Length, U8);
+  // If UTF8 length is 1, treat as one UTF8, if above 4, treat as one UTF16
+  if (CB(UTF8Length, UTF8Length == 1 ? 1 : 2))
+return true;
+} else {
+  // A codepoint takes two UTF-16 code unit if it's astral (outside BMP).
+  // Astral codepoints are encoded as 4 bytes in UTF-8 (0xxx ...)
+  if (CB(UTF8Length, UTF8Length == 4 ? 2 : 1))
+return true;
+}
   }
   return false;
 }


Index: clang-tools-extra/clangd/unittests/SourceCodeTests.cpp
===
--- clang-tools-extra/clangd/unittests/SourceCodeTests.cpp
+++ clang-tools-extra/clangd/unittests/SourceCodeTests.cpp
@@ -57,6 +57,14 @@
   EXPECT_EQ(lspLength("¥"), 1UL);
   // astral
   EXPECT_EQ(lspLength("😂"), 2UL);
+  // Extended ASCII
+  const char extendedASCIIStr[2] = {static_cast(160U), '\0'};
+  EXPECT_EQ(lspLength(extendedASCIIStr), 1UL);
+  // Invalid UTF16
+  const char invalidUTF16[6] = {
+  static_cast(0xFF), static_cast(0xFF), static_cast(0xFF),
+  static_cast(0xFF), static_cast(0xFF), '\0'};
+  EXPECT_EQ(lspLength(invalidUTF16), 2UL);
 
   WithContextValue UTF8(kCurrentOffsetEncoding, OffsetEncoding::UTF8);
   EXPECT_EQ(lspLength(""), 0UL);
@@ -66,6 +74,10 @@
   EXPECT_EQ(lspLength("¥"), 2UL);
   // astral
   EXPECT_EQ(lspLength("😂"), 4UL);
+  // Extended ASCII
+  EXPECT_EQ(lspLength(extendedASCIIStr), 1UL);
+  // Invalid UTF16
+  EXPECT_EQ(lspLength(invalidUTF16), 5UL);
 
   WithContextValue UTF32(kCurrentOffsetEncoding, OffsetEncoding::UTF32);
   EXPECT_EQ(lspLength(""), 0UL);
@@ -75,6 +87,10 @@
   EXPECT_EQ(lspLength("¥"), 1UL);
   // astral
   EXPECT_EQ(lspLength("😂"), 1UL);
+  // Extended ASCII
+  EXPECT_EQ(lspLength(extendedASCIIStr), 1UL);
+  // Invalid UTF16
+  EXPECT_EQ(lspLength(invalidUTF16), 1UL);
 }
 
 // The = → 🡆 below are ASCII (1 byte), BMP (3 bytes), and astral (4 bytes).
Index: clang-tools-extra/clangd/SourceCode.cpp