[PATCH] D74731: [Clangd] Fixed assertion when processing extended ASCII characters.
AnakinZheng created this revision. AnakinZheng added reviewers: kadircet, sammccall. AnakinZheng added a project: clang. Herald added subscribers: cfe-commits, usaxena95, arphaman, jkorous, MaskRay, ilya-biryukov. Previously this piece of code asserts when processing extended ASCII. Proposing a fix to extend the ascii handling code to take extended ascii as well. Repository: rG LLVM Github Monorepo https://reviews.llvm.org/D74731 Files: clang-tools-extra/clangd/SourceCode.cpp Index: clang-tools-extra/clangd/SourceCode.cpp === --- clang-tools-extra/clangd/SourceCode.cpp +++ clang-tools-extra/clangd/SourceCode.cpp @@ -59,7 +59,7 @@ // Astral codepoints are encoded as 4 bytes in UTF-8, starting with 0xxx. for (size_t I = 0; I < U8.size();) { unsigned char C = static_cast(U8[I]); -if (LLVM_LIKELY(!(C & 0x80))) { // ASCII character. +if (LLVM_LIKELY(!(C & 0x100))) { // ASCII or extended ASCII character. if (CB(1, 1)) return true; ++I; Index: clang-tools-extra/clangd/SourceCode.cpp === --- clang-tools-extra/clangd/SourceCode.cpp +++ clang-tools-extra/clangd/SourceCode.cpp @@ -59,7 +59,7 @@ // Astral codepoints are encoded as 4 bytes in UTF-8, starting with 0xxx. for (size_t I = 0; I < U8.size();) { unsigned char C = static_cast(U8[I]); -if (LLVM_LIKELY(!(C & 0x80))) { // ASCII character. +if (LLVM_LIKELY(!(C & 0x100))) { // ASCII or extended ASCII character. if (CB(1, 1)) return true; ++I; ___ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
[PATCH] D74731: [Clangd] Fixed assertion when processing extended ASCII characters.
AnakinZheng updated this revision to Diff 245021. AnakinZheng added a comment. Fix the previous stupid mistake. CHANGES SINCE LAST ACTION https://reviews.llvm.org/D74731/new/ https://reviews.llvm.org/D74731 Files: clang-tools-extra/clangd/SourceCode.cpp Index: clang-tools-extra/clangd/SourceCode.cpp === --- clang-tools-extra/clangd/SourceCode.cpp +++ clang-tools-extra/clangd/SourceCode.cpp @@ -58,8 +58,8 @@ // A codepoint takes two UTF-16 code unit if it's astral (outside BMP). // Astral codepoints are encoded as 4 bytes in UTF-8, starting with 0xxx. for (size_t I = 0; I < U8.size();) { -unsigned char C = static_cast(U8[I]); -if (LLVM_LIKELY(!(C & 0x100))) { // ASCII or extended ASCII character. +unsigned short C = static_cast(U8[I]); +if (LLVM_LIKELY(C <= 255)) { // ASCII or extended ASCII character. if (CB(1, 1)) return true; ++I; Index: clang-tools-extra/clangd/SourceCode.cpp === --- clang-tools-extra/clangd/SourceCode.cpp +++ clang-tools-extra/clangd/SourceCode.cpp @@ -58,8 +58,8 @@ // A codepoint takes two UTF-16 code unit if it's astral (outside BMP). // Astral codepoints are encoded as 4 bytes in UTF-8, starting with 0xxx. for (size_t I = 0; I < U8.size();) { -unsigned char C = static_cast(U8[I]); -if (LLVM_LIKELY(!(C & 0x100))) { // ASCII or extended ASCII character. +unsigned short C = static_cast(U8[I]); +if (LLVM_LIKELY(C <= 255)) { // ASCII or extended ASCII character. if (CB(1, 1)) return true; ++I; ___ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
[PATCH] D74731: [Clangd] Fixed assertion when processing extended ASCII characters.
AnakinZheng marked an inline comment as done. AnakinZheng added a comment. Update patch. Comment at: clang-tools-extra/clangd/SourceCode.cpp:62 unsigned char C = static_cast(U8[I]); -if (LLVM_LIKELY(!(C & 0x80))) { // ASCII character. +if (LLVM_LIKELY(!(C & 0x100))) { // ASCII or extended ASCII character. if (CB(1, 1)) kadircet wrote: > `C` is one byte long, it is not possible for it to ever satisfy `C&0x100`. Updated the patch, now it should be correct. CHANGES SINCE LAST ACTION https://reviews.llvm.org/D74731/new/ https://reviews.llvm.org/D74731 ___ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
[PATCH] D74731: [Clangd] Fixed assertion when processing extended ASCII characters.
AnakinZheng marked an inline comment as done. AnakinZheng added inline comments. Comment at: clang-tools-extra/clangd/SourceCode.cpp:62 unsigned char C = static_cast(U8[I]); -if (LLVM_LIKELY(!(C & 0x80))) { // ASCII character. +if (LLVM_LIKELY(!(C & 0x100))) { // ASCII or extended ASCII character. if (CB(1, 1)) kadircet wrote: > AnakinZheng wrote: > > kadircet wrote: > > > `C` is one byte long, it is not possible for it to ever satisfy `C&0x100`. > > Updated the patch, now it should be correct. > sorry but this is still the same since `C` is an `unsigned char` it is always > guaranteed to be less than or equal to 255. > > As @sammccall explained above, we should rather re-arrange the logic to get > rid of the assertion below, e.g. when `UTF8Length` is less than 2 or more > than 4 we can choose to interpret it as ascii, instead of asserting, while > commenting the reason for this choice. It's unsigned short in the new patch. Ok, I'll try @sammccall 's suggestion. CHANGES SINCE LAST ACTION https://reviews.llvm.org/D74731/new/ https://reviews.llvm.org/D74731 ___ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
[PATCH] D74731: [Clangd] Fixed assertion when processing extended ASCII characters.
AnakinZheng marked an inline comment as done. AnakinZheng added inline comments. Comment at: clang-tools-extra/clangd/SourceCode.cpp:62 unsigned char C = static_cast(U8[I]); -if (LLVM_LIKELY(!(C & 0x80))) { // ASCII character. +if (LLVM_LIKELY(!(C & 0x100))) { // ASCII or extended ASCII character. if (CB(1, 1)) kadircet wrote: > AnakinZheng wrote: > > kadircet wrote: > > > AnakinZheng wrote: > > > > kadircet wrote: > > > > > `C` is one byte long, it is not possible for it to ever satisfy > > > > > `C&0x100`. > > > > Updated the patch, now it should be correct. > > > sorry but this is still the same since `C` is an `unsigned char` it is > > > always guaranteed to be less than or equal to 255. > > > > > > As @sammccall explained above, we should rather re-arrange the logic to > > > get rid of the assertion below, e.g. when `UTF8Length` is less than 2 or > > > more than 4 we can choose to interpret it as ascii, instead of asserting, > > > while commenting the reason for this choice. > > It's unsigned short in the new patch. Ok, I'll try @sammccall 's suggestion. > > It's unsigned short in the new patch. > > Sorry I missed that change, but still `U8[I]` is a one byte element(it is of > type `char`), you can't get away by just casting it to `unsigned short`. Yup, just aware of that, @sammccall 's suggestion looks reasonable, implementing now. CHANGES SINCE LAST ACTION https://reviews.llvm.org/D74731/new/ https://reviews.llvm.org/D74731 ___ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
[PATCH] D74731: [Clangd] Fixed assertion when processing extended ASCII characters.
AnakinZheng updated this revision to Diff 245036. AnakinZheng added a comment. Implement @sammccall 's suggestion and add test. CHANGES SINCE LAST ACTION https://reviews.llvm.org/D74731/new/ https://reviews.llvm.org/D74731 Files: clang-tools-extra/clangd/SourceCode.cpp clang-tools-extra/clangd/unittests/SourceCodeTests.cpp Index: clang-tools-extra/clangd/unittests/SourceCodeTests.cpp === --- clang-tools-extra/clangd/unittests/SourceCodeTests.cpp +++ clang-tools-extra/clangd/unittests/SourceCodeTests.cpp @@ -57,6 +57,14 @@ EXPECT_EQ(lspLength("¥"), 1UL); // astral EXPECT_EQ(lspLength("😂"), 2UL); + // Extended ASCII + const char extendedASCIIStr[2] = {static_cast(160U), '\0'}; + EXPECT_EQ(lspLength(extendedASCIIStr), 1UL); + // Invalid UTF16 + const char invalidUTF16[6] = { + static_cast(0xFF), static_cast(0xFF), static_cast(0xFF), + static_cast(0xFF), static_cast(0xFF), '\0'}; + EXPECT_EQ(lspLength(invalidUTF16), 2UL); WithContextValue UTF8(kCurrentOffsetEncoding, OffsetEncoding::UTF8); EXPECT_EQ(lspLength(""), 0UL); @@ -66,6 +74,10 @@ EXPECT_EQ(lspLength("¥"), 2UL); // astral EXPECT_EQ(lspLength("😂"), 4UL); + // Extended ASCII + EXPECT_EQ(lspLength(extendedASCIIStr), 1UL); + // Invalid UTF16 + EXPECT_EQ(lspLength(invalidUTF16), 5UL); WithContextValue UTF32(kCurrentOffsetEncoding, OffsetEncoding::UTF32); EXPECT_EQ(lspLength(""), 0UL); @@ -75,6 +87,10 @@ EXPECT_EQ(lspLength("¥"), 1UL); // astral EXPECT_EQ(lspLength("😂"), 1UL); + // Extended ASCII + EXPECT_EQ(lspLength(extendedASCIIStr), 1UL); + // Invalid UTF16 + EXPECT_EQ(lspLength(invalidUTF16), 1UL); } // The = → 🡆 below are ASCII (1 byte), BMP (3 bytes), and astral (4 bytes). Index: clang-tools-extra/clangd/SourceCode.cpp === --- clang-tools-extra/clangd/SourceCode.cpp +++ clang-tools-extra/clangd/SourceCode.cpp @@ -67,15 +67,20 @@ } // This convenient property of UTF-8 holds for all non-ASCII characters. size_t UTF8Length = llvm::countLeadingOnes(C); -// 0xxx is ASCII, handled above. 10xxx is a trailing byte, invalid here. -// 1xxx is not valid UTF-8 at all. Assert because it's probably our bug. -assert((UTF8Length >= 2 && UTF8Length <= 4) && - "Invalid UTF-8, or transcoding bug?"); I += UTF8Length; // Skip over all trailing bytes. -// A codepoint takes two UTF-16 code unit if it's astral (outside BMP). -// Astral codepoints are encoded as 4 bytes in UTF-8 (0xxx ...) -if (CB(UTF8Length, UTF8Length == 4 ? 2 : 1)) - return true; +// 0xxx is ASCII, handled above. 10xxx is a trailing byte, invalid here. +// 1xxx is not valid UTF-8 at all. Log it because it's probably our bug. +if(UTF8Length == 1 || UTF8Length > 4) { + vlog("File contains invalid UTF-8, or transcoding bug? UTF8Length = {0}, string = {1}", UTF8Length, U8); + // If UTF8 length is 1, treat as one UTF8, if above 4, treat as one UTF16 + if (CB(UTF8Length, UTF8Length == 1 ? 1 : 2)) +return true; +} else { + // A codepoint takes two UTF-16 code unit if it's astral (outside BMP). + // Astral codepoints are encoded as 4 bytes in UTF-8 (0xxx ...) + if (CB(UTF8Length, UTF8Length == 4 ? 2 : 1)) +return true; +} } return false; } Index: clang-tools-extra/clangd/unittests/SourceCodeTests.cpp === --- clang-tools-extra/clangd/unittests/SourceCodeTests.cpp +++ clang-tools-extra/clangd/unittests/SourceCodeTests.cpp @@ -57,6 +57,14 @@ EXPECT_EQ(lspLength("¥"), 1UL); // astral EXPECT_EQ(lspLength("😂"), 2UL); + // Extended ASCII + const char extendedASCIIStr[2] = {static_cast(160U), '\0'}; + EXPECT_EQ(lspLength(extendedASCIIStr), 1UL); + // Invalid UTF16 + const char invalidUTF16[6] = { + static_cast(0xFF), static_cast(0xFF), static_cast(0xFF), + static_cast(0xFF), static_cast(0xFF), '\0'}; + EXPECT_EQ(lspLength(invalidUTF16), 2UL); WithContextValue UTF8(kCurrentOffsetEncoding, OffsetEncoding::UTF8); EXPECT_EQ(lspLength(""), 0UL); @@ -66,6 +74,10 @@ EXPECT_EQ(lspLength("¥"), 2UL); // astral EXPECT_EQ(lspLength("😂"), 4UL); + // Extended ASCII + EXPECT_EQ(lspLength(extendedASCIIStr), 1UL); + // Invalid UTF16 + EXPECT_EQ(lspLength(invalidUTF16), 5UL); WithContextValue UTF32(kCurrentOffsetEncoding, OffsetEncoding::UTF32); EXPECT_EQ(lspLength(""), 0UL); @@ -75,6 +87,10 @@ EXPECT_EQ(lspLength("¥"), 1UL); // astral EXPECT_EQ(lspLength("😂"), 1UL); + // Extended ASCII + EXPECT_EQ(lspLength(extendedASCIIStr), 1UL); + // Invalid UTF16 + EXPECT_EQ(lspLength(invalidUTF16), 1UL); } // The = → 🡆 below are ASCII (1 byte), BMP (3 bytes), and astral (4 bytes). Index: clang-tools-extra/clangd/SourceCode.cpp