On Tue, Nov 02, 2021 at 01:20:03PM -0600, Martin Sebor wrote: > On 11/2/21 11:18 AM, Marek Polacek via Gcc-patches wrote: > > On Mon, Nov 01, 2021 at 10:10:40PM +0000, Joseph Myers wrote: > > > On Mon, 1 Nov 2021, Marek Polacek via Gcc-patches wrote: > > > > > > > + /* We've read a bidi char, update the current vector as necessary. > > > > */ > > > > + void on_char (kind k, bool ucn_p) > > > > + { > > > > + switch (k) > > > > + { > > > > + case kind::LRE: > > > > + case kind::RLE: > > > > + case kind::LRO: > > > > + case kind::RLO: > > > > + vec.push (ucn_p ? 3u : 1u); > > > > + break; > > > > + case kind::LRI: > > > > + case kind::RLI: > > > > + case kind::FSI: > > > > + vec.push (ucn_p ? 2u : 0u); > > > > + break; > > > > + case kind::PDF: > > > > + if (current_ctx () == kind::PDF) > > > > + pop (); > > > > + break; > > > > + case kind::PDI: > > > > + if (current_ctx () == kind::PDI) > > > > + pop (); > > > > > > My understanding is that PDI should pop all intermediate PDF contexts > > > outward to a PDI context, which it also pops. (But if it's embedded only > > > in PDF contexts, with no PDI context containing it, it doesn't pop > > > anything.) > > > > > > I think failing to handle that only means libcpp sometimes models there > > > as being more bidirectional contexts open than there should be, so it > > > might give spurious warnings when in fact all such contexts had been > > > closed by end of string or comment. > > > > Ah, you're right. > > https://www.unicode.org/reports/tr9/#Terminating_Explicit_Directional_Isolates > > says that "[PDI] terminates the scope of the last LRI, RLI, or FSI whose > > scope has not yet been terminated, as well as the scopes of any subsequent > > LREs, RLEs, LROs, or RLOs whose scopes have not yet been terminated." > > but PDF doesn't have the latter quirk. > > > > Fixed in the below: I added a suitable truncate into on_char. The new test > > Wbidirectional-14.c exercises the handling of PDI. > > > > Bootstrapped/regtested on x86_64-pc-linux-gnu, ok for trunk? > > > > -- >8 -- > > From a link below: > > "An issue was discovered in the Bidirectional Algorithm in the Unicode > > Specification through 14.0. It permits the visual reordering of > > characters via control sequences, which can be used to craft source code > > that renders different logic than the logical ordering of tokens > > ingested by compilers and interpreters. Adversaries can leverage this to > > encode source code for compilers accepting Unicode such that targeted > > vulnerabilities are introduced invisibly to human reviewers." > > > > More info: > > https://nvd.nist.gov/vuln/detail/CVE-2021-42574 > > https://trojansource.codes/ > > > > This is not a compiler bug. However, to mitigate the problem, this patch > > implements -Wbidirectional=[none|unpaired|any] to warn about possibly > > misleading Unicode bidirectional characters the preprocessor may encounter. > > Birectional sounds very general. Can we come up with a name > that's a bit more descriptive of the problem the warning reports? > From skimming the docs and the tests it looks like the warning > points out uses of bidirectonal characters in the program source > code as well as comments. Would -Wbidirectional-text be better? > Or -Wbidirectional-chars? (If Clang is also adding a warning > for this, syncing up with them one way or the other might be > helpful.)
I dunno, I could go with -Wbidirectional-chars. Does anyone else think I should rename the current name to -Wbidirectional-chars? Other ideas: -Wunicode-bidi / -Wmultibyte-chars / -Wmisleading-bidirectional. The patch for clang-tidy I saw called this misleading-bidirectional. > ... > > diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi > > index c5730228821..9dfb95dc24c 100644 > > --- a/gcc/doc/invoke.texi > > +++ b/gcc/doc/invoke.texi > > @@ -327,7 +327,9 @@ Objective-C and Objective-C++ Dialects}. > > -Warith-conversion @gol > > -Warray-bounds -Warray-bounds=@var{n} -Warray-compare @gol > > -Wno-attributes -Wattribute-alias=@var{n} -Wno-attribute-alias @gol > > --Wno-attribute-warning -Wbool-compare -Wbool-operation @gol > > +-Wno-attribute-warning @gol > > +-Wbidirectional=@r{[}none@r{|}unpaired@r{|}any@r{]} @gol > > +-Wbool-compare -Wbool-operation @gol > > -Wno-builtin-declaration-mismatch @gol > > -Wno-builtin-macro-redefined -Wc90-c99-compat -Wc99-c11-compat @gol > > -Wc11-c2x-compat @gol > > @@ -7674,6 +7676,21 @@ Attributes considered include @code{alloc_align}, > > @code{alloc_size}, > > This is the default. You can disable these warnings with either > > @option{-Wno-attribute-alias} or @option{-Wattribute-alias=0}. > > +@item -Wbidirectional=@r{[}none@r{|}unpaired@r{|}any@r{]} > > +@opindex Wbidirectional= > > +@opindex Wbidirectional > > +@opindex Wno-bidirectional > > +Warn about UTF-8 bidirectional characters. > > I suggest to mention where. If everywhere, enumerate the most > common contexts to make it clear it means everywhere: > > Warn about UTF-8 bidirectional characters in source code, > including string literals, identifiers, and comments. OK, I've updated the text to: Warn about possibly misleading UTF-8 bidirectional characters in comments, string literals, character constants, and identifiers. Marek