[PATCH] D112913: Misleading bidirectional detection

2021-11-24 Thread Xidorn Quan via Phabricator via cfe-commits
upsuper added a comment.

I'm not familiar with LLVM / Clang codebase. I was asked by @MaskRay to help 
review the bidi-related part of this patch.

Generally I believe the algorithm here is too naive that it produces both false 
positives and false negatives. I'm not sure how important those edge cases are 
considered, but anyway... details comments below:




Comment at: clang-tools-extra/clang-tidy/misc/MisleadingBidirectional.cpp:48-49
+  // Line break: https://www.unicode.org/reports/tr14/tr14-32.html
+  if (C == '\n' || C == '\r' || C == '\f' || C == '\v' ||
+  C == 0x85 /*next line*/)
+EmbeddingOverride = Isolate = 0;

UAX 14 is probably not the right document to look here.

According to UAX 9 step [[ https://www.unicode.org/reports/tr9/#L1 | L1 ]], the 
embedding level is reset to the paragraph embedding level when hitting segment 
separator and paragraph separator, which are defined in [[ 
https://www.unicode.org/reports/tr9/#Table_Bidirectional_Character_Types | 
table 4 ]] as type B and S, and in UCD [[ 
https://www.unicode.org/Public/UCD/latest/ucd/extracted/DerivedBidiClass.txt | 
DerivedBidiClass.txt ]] you can see type B includes U+000A, U+000D, 
U+001C..001E, U+0085, U+2029, and type S includes U+0009, U+000B, and U+001F.

You have U+000C and U+2028 here which are counted as type WS which doesn't 
affect embedding level, so you may be resetting the counter prematurely. And 
you may want to reset the counter for all other characters above.



Comment at: clang-tools-extra/clang-tidy/misc/MisleadingBidirectional.cpp:64-72
+if (CodePoint == RLO || CodePoint == RLE || CodePoint == LRO ||
+CodePoint == LRE)
+  EmbeddingOverride += 1;
+else if (CodePoint == PDF)
+  EmbeddingOverride = std::min(EmbeddingOverride - 1, EmbeddingOverride);
+else if (CodePoint == RLI || CodePoint == LRI || CodePoint == FSI)
+  Isolate += 1;

The bidi algorithm is more complicated than having two simple counters. 
Basically, a `PDF` cancels a override / embedding character only when they 
match, and `PDI` cancels all override / embedding between it and its matching 
isolate character.

I was thinking whether the counters would be enough in the sense that we may 
accept some false positive for edge cases but definitely no false negative. But 
I'm convinced it's not the case. As an example, a sequence of `RLO LRI PDF PDI` 
will yield no embedding and no isolate in this counting model, but the `PDF` 
here actually has no effect, as shown in step [[ 
https://www.unicode.org/reports/tr9/#X7 | X7 ]], if it does not match an 
embedding initiator, it is ignored, so this is effectively leaving a dangling 
`RLO`.


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D112913/new/

https://reviews.llvm.org/D112913

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D112913: Misleading bidirectional detection

2022-01-06 Thread Xidorn Quan via Phabricator via cfe-commits
upsuper added a comment.

I think the core algorithm looks correct now. I'll leave the code review to 
LLVM reviewers. Thanks.


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D112913/new/

https://reviews.llvm.org/D112913

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D112913: Misleading bidirectional detection

2022-01-11 Thread Xidorn Quan via Phabricator via cfe-commits
upsuper added a comment.

I'd like to clarify that what I think is correct now is the algorithm to detect 
unclosed explicit formatting scopes in a given string.

I haven't been following very closely with the whole spoofing issue, so I can't 
say that there is no other ways to construct a spoof that this algorithm is not 
designed to detect.

As you have found, `RLM`, and `ALM` can be used to confuse code reader, but 
they are not much different than a string with other strong RTL characters 
inside, and I don't quite see how that can be linted without hurting 
potentially legitimate code. Maybe if the compiler supports treating `LRM` as 
whitespace (I'm not sure whether Clang does), a lint may be added to ask 
wrapping any string with outermost strong characters being RTL in the form of 
`{LRM}"string"{LRM}` so that the RTL characters don't affect outside. Other 
than that, I don't think there is anyway to lint against such a confusion.


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D112913/new/

https://reviews.llvm.org/D112913

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits