On Mon, Nov 01, 2021 at 05:14:12PM -0400, David Malcolm via Gcc-patches wrote:
> Thoughts?
Resending my previously internally posted mail:

Thanks for working on this, but I'm afraid this is done at the wrong location.
Consider the attached testcases Whomoglyph1.C and Whomoglyph2.C.

On the Whomoglyph1.C testcase I'd expect a warning, because there is a clear
confusion for the reader, something that isn't visible in any of the emacs,
vim or joe editors or on the terminal: when f3 uses the scope identifier, the
casual reader will expect that it uses N1::N2::scope, but there is no such
variable, only an N1::N2::ѕсоре that visually looks the same but has different
UTF-8 chars in it.  So, name lookup will instead find N1::scope and use that.

But Whomoglyph2.C will emit warnings that are IMHO not appropriate; I believe
there is no confusion at all there.  E.g. for both C and C++, in the f5/f6
case it doesn't really matter how each of the functions names its own
parameter, one can never access another function's parameter.  Ditto for
different namespaces, provided that both namespaces aren't searched in the
same name lookup, and similarly for classes etc.

So, IMNSHO that warning belongs in name lookup (cp/name-lookup.c for the
C++ FE).

And, another important thing is that most users don't really use Unicode in
identifiers; I bet over 99.9% of identifiers don't have any >= 0x80
characters in them, and even when people do use them, confusable identifiers
found during the same lookup are far more unlikely still.  So, I think we
should optimize for the common case, ASCII-only identifiers, and spend as
little compile time as possible on this stuff.

Another thing we've discussed on IRC is that the Unicode confusables.txt has
some canonicalizations that I think most users will not appreciate, or at
least not appreciate being warned about by default.  In particular, the
canonicalizations from ASCII to ASCII:
0 -> O
1 -> l
I -> l
m -> rn
I think most users use monospace fonts when working on C/C++ etc. source
code, anything else ruins indentation (ok, except for projects that indent
everything by tabs only).  In monospace fonts, I think m can't be confused
with rn.  And in most monospace fonts, 0 and O and 1/l/I are also fairly
well distinguishable.  So, I think we should warn about f00 vs. fOO or home
vs. horne or f1 vs. fl vs. fI only with -Whomoglyph=2, a non-default extra
level, rather than by default.

But, given the way confusables.txt is generated, while it is probably easy
to undo the m -> rn canonicalizations (assume everything that maps to rn
actually maps to m), for the 0 vs. O or 1 vs. l vs. I cases I'm afraid
trying to differentiate what each actually maps to will be harder.  So, my
proposal would be by default not to warn about the 0 vs. O, 1 vs. l vs. I
and m vs. rn differences if neither of the identifiers contains any UTF-8
chars, and to warn otherwise.

To spend as little compile time as possible on this, I think libcpp knows
well whether a UTF-8 or UCN character has been encountered in an identifier
(it uses the libcpp/charset.c entrypoints then), so perhaps we could mark
the tokens with a new flag that stands for "identifier contains at least one
UTF-8 (non-ASCII) or UCN character".  Or, if we don't, perhaps just quickly
scanning the identifier for characters with the MSB set might be tolerable
too.  So much for how to identify the identifiers for which we should
compute the canonical homoglyph form at all.
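To illustrate that fallback, here is a minimal, self-contained sketch of the
MSB scan in plain C++ (the function name is made up; this is not an existing
libcpp entrypoint):

#include <cstddef>

/* Sketch of the "quick scan" fallback described above: check whether an
   identifier spelling contains any byte with the MSB set, i.e. any UTF-8
   lead or continuation byte, so the whole homoglyph machinery can be
   skipped for pure-ASCII identifiers.  Hypothetical helper, not part of
   libcpp.  */
static bool
identifier_has_non_ascii_p (const unsigned char *str, size_t len)
{
  for (size_t i = 0; i < len; i++)
    if (str[i] & 0x80)
      return true;
  return false;
}

Whether such a byte scan also covers UCN spellings depends on whether the
identifier has already been converted to UTF-8 at that point, which is why
the flag set by libcpp while lexing would be the more reliable variant.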
How to store it: again, as we should IMHO optimize for the usual case where
even identifiers with UTF-8/UCN characters canonicalize to themselves, I
think the best representation would be not to waste a whole pointer on each
IDENTIFIER_NODE for it, but instead to reserve one bit on the identifiers,
"identifier has a canonical homoglyph version different from itself", and
that bit would mean we have some hash map etc. on the side that maps the
identifier to its canonical form.

As for the actual use of the confusables.txt transformations, do you think
we need to work on UTF-32?  Can't it all be done on UTF-8 instead?  Let
whatever confusables.txt -> something.h generator we write prepare a
decision function that uses UTF-8 both in what to replace and certainly just
a UTF-8 string literal in what to replace it with.  Note, the above would
mean we don't compute the 0 -> O, [1I] -> l or m -> rn canonicalizations in
the default -Whomoglyph mode.

Perhaps we can use some bit/flag on the C FE scopes or C++ namespaces and on
the identifiers we map to through the hash map.  On the scopes etc. it would
mean this scope contains some identifiers that have homoglyph alternatives
and the homoglyph canonical forms have already been added to wherever the
name lookup can find them.  And on the canonical forms it would perhaps mean
this canonical form contains any O, l or rn sequences.

And then during actual name lookup: if the current identifier doesn't have
the alt homoglyph canonicalization flag set and the scope doesn't have the
new bit set either, there would be nothing further to do.  Where the
identifier we search for has the new bit set and the scope has it too, we'd
do two lookups, one normal and one for the alt homoglyph canonical form; and
if the scope doesn't have the new bit set, we'd perform whatever is needed
to figure that out.

	Jakub
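To make the "one bit plus side table" scheme above concrete, here is a rough
self-contained sketch in plain C++ (GCC would hang the bit off
IDENTIFIER_NODE and use its own hash_map; all names below are made up for
illustration):

#include <string>
#include <unordered_map>

struct identifier
{
  std::string spelling;
  bool has_homoglyph_alt = false;  // the proposed "one bit" on the identifier
};

// Side table mapping an identifier to its homoglyph-canonical form; keyed
// by spelling here to keep the sketch self-contained (a real implementation
// would key it by the identifier node itself).
static std::unordered_map<std::string, std::string> homoglyph_canon_map;

// Return the homoglyph-canonical spelling to use for a second lookup, or
// the identifier's own spelling when no different canonical form exists.
static const std::string &
homoglyph_canonical_form (const identifier &id)
{
  if (!id.has_homoglyph_alt)
    return id.spelling;   // common case: no side-table access at all
  auto it = homoglyph_canon_map.find (id.spelling);
  return it != homoglyph_canon_map.end () ? it->second : id.spelling;
}

In the common case the flag test is all that name lookup pays; only when
both the searched identifier and the scope have their respective bits set
would a second lookup with the canonical form be performed, per the scheme
above.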
// Whomoglyph1.C
// { dg-do compile }
// { dg-options "-Whomoglyph" }

namespace N1
{
  int scope;				// ASCII
  int f1 () { return scope++; }		// ASCII
  namespace N2
  {
    int ѕсоре;				// Cyrillic
    int f2 () { return ѕсоре++; }	// Cyrillic
    int f3 () { return scope++; }	// ASCII { dg-warning "Whomoglyph" }
  }
}
// Whomoglyph2.C
// { dg-do compile }
// { dg-options "-Whomoglyph" }

namespace N1
{
  int scope;				// ASCII
  int f1 () { return scope++; }		// ASCII
}
namespace N2
{
  int ѕсоре;				// Cyrillic { dg-bogus "Whomoglyph" }
  int f2 () { return ѕсоре++; }		// Cyrillic
}

struct S1
{
  constexpr static int Hello = 1;	// ASCII
  int f3 () { return Hello; }		// ASCII
};
struct S2
{
  constexpr static int Ηello = 2;	// Greek+ASCII { dg-bogus "Whomoglyph" }
  int f4 () { return Ηello; }		// Greek+ASCII
};

int f5 (int s)				// ASCII
{
  return s;				// ASCII
}

int f6 (int ѕ)				// Cyrillic { dg-bogus "Whomoglyph" }
{
  return ѕ;				// Cyrillic
}