On 11/1/21 3:14 PM, David Malcolm via Gcc-patches wrote:
[Resending to get around mailing list size limit; see notes below]

This patch implements a new -Whomoglyph diagnostic, enabled by default.

Internally it implements the "skeleton" algorithm from:
   http://www.unicode.org/reports/tr39/#Confusable_Detection
so that every new identifier is mapped to a "skeleton", and if
the skeleton is already in use by a different identifier, issue
a -Whomoglyph diagnostic.
It uses the data from:
   https://www.unicode.org/Public/security/13.0.0/confusables.txt
to determine which characters are confusable.

For example, given the example of CVE-2021-42694 at
https://trojansource.codes/, with this patch we emit:

t.cc:7:1: warning: identifier ‘sayНello’ (‘say\u041dello’)... [CWE-1007] 
[-Whomoglyph]
     7 | void say<U+041D>ello() {
       | ^~~~
t.cc:3:1: note: ...confusable with non-equal identifier ‘sayHello’ here
     3 | void sayHello() {
       | ^~~~

(the precise location of the token isn't quite right; the
identifiers should be underlined, rather than the "void" tokens)

This takes advantage of:
   "diagnostics: escape non-ASCII source bytes for certain diagnostics"
     https://gcc.gnu.org/pipermail/gcc-patches/2021-November/583020.html
to escape non-ASCII characters when printing a source line for -Whomoglyph,
so that we print "say<U+041D>ello" when quoting the source line, making it
clearer that this is not "sayHello".

In order to implement "skeleton", I had to implement NFD support, so the
patch also contains some UTF-32 support code.

Known issues:
- I'm doing an extra hash_table lookup on every identifier lookup.
   I haven't yet measured the impact on the speed of the compiler.
   If this is an issue, is there a good place to stash an extra
   pointer in every identifier?
- doesn't yet bootstrap, as the confusables.txt data contains ASCII
   to ASCII confusables, leading to warnings such as:
../../.././gcc/options.h:11273:3: warning: identifier ‘OPT_l’... [CWE-1007] 
[-Whomoglyph]
../../.././gcc/options.h:9959:3: note: ...confusable with non-equal identifier 
‘OPT_I’ (‘OPT_I’) here
   Perhaps the option should have levels, where we don't complain about
   pure ASCII confusables at the default level?
- no docs yet
- as noted above the location_t of the token isn't quite right
- map_identifier_to_skeleton and map_skeleton_to_first_use aren't
   yet integrated with the garbage collector
- some other FIXMEs in the patch

[I had to trim the patch for space to get it to pass the size filter on the
mailing list; I trimmed:
   contrib/unicode/confusables.txt,
   gcc/testsuite/selftests/NormalizationTest.txt
which can be downloaded from the URLs in the ChangeLog, and:
   gcc/confusables.inc
   gcc/decomposition.inc
which can be generated using the scripts in the patch ]

Thoughts?

None from me on the actual feature -- even after our discussion
this morning I remain comfortably ignorant of the problem :)
I just have a quick comment on the two new string classes:

...
+
+/* A class for manipulating UTF-32 strings.  */
+
+class utf32_string
+{
...
+ private:
...
+  cppchar_t *m_buf;
+  size_t m_alloc_len;
+  size_t m_len;
+};
+
+/* A class for constructing UTF-8 encoded strings.
+   These are not NUL-terminated.  */
+
+class utf8_string
+{
...
+ private:
+  uchar  *m_buf;
+  size_t m_alloc_sz;
+  size_t m_len;
+};

There are container abstractions both in C++ and in GCC that
these classes look like they could be implemented in terms of:
I'm thinking of std::string, std::vector, vec, and auto_vec.
They have the additional advantage of being safely copyable
and assignable, and of course, of having already been tested.
I see that the classes in your patch provide additional
functionality that the abstractions don't.  I'd expect it
be doable on top of the abstractions and without
reimplementing all the basic buffer management.

Martin

Reply via email to