LegalizeAdulthood added inline comments.

================
Comment at: 
clang-tools-extra/clang-tidy/modernize/IntegralLiteralExpressionMatcher.cpp:99
+
+  if (!Current->isLiteral() || isStringLiteral(Current->getKind()) ||
+      !isIntegralConstant(*Current)) {
----------------
aaron.ballman wrote:
> LegalizeAdulthood wrote:
> > aaron.ballman wrote:
> > > aaron.ballman wrote:
> > > > LegalizeAdulthood wrote:
> > > > > LegalizeAdulthood wrote:
> > > > > > aaron.ballman wrote:
> > > > > > > LegalizeAdulthood wrote:
> > > > > > > > aaron.ballman wrote:
> > > > > > > > > LegalizeAdulthood wrote:
> > > > > > > > > > aaron.ballman wrote:
> > > > > > > > > > > LegalizeAdulthood wrote:
> > > > > > > > > > > > aaron.ballman wrote:
> > > > > > > > > > > > > I know this is code moved from elsewhere, but I 
> > > > > > > > > > > > > suppose we never considered the odd edge case where a 
> > > > > > > > > > > > > user does something like `"foo"[0]` as a really awful 
> > > > > > > > > > > > > integer constant. :-D
> > > > > > > > > > > > It's always possible to write crazy contorted code and 
> > > > > > > > > > > > have a check not recognize it.  I don't think it's 
> > > > > > > > > > > > worthwhile to spend time trying to handle torture cases 
> > > > > > > > > > > > unless we can find data that shows it's prevalent in 
> > > > > > > > > > > > real world code.
> > > > > > > > > > > > 
> > > > > > > > > > > > If I was doing a code review and saw this:
> > > > > > > > > > > > ```
> > > > > > > > > > > > enum {
> > > > > > > > > > > >     FOO = "foo"[0]
> > > > > > > > > > > > };
> > > > > > > > > > > > ```
> > > > > > > > > > > > I'd flag that in a code review as bogus, whereas if I 
> > > > > > > > > > > > saw:
> > > > > > > > > > > > ```
> > > > > > > > > > > > enum {
> > > > > > > > > > > >     FOO = 'f'
> > > > > > > > > > > > };
> > > > > > > > > > > > ```
> > > > > > > > > > > > That would be acceptable, which is why character 
> > > > > > > > > > > > literals are accepted as an integral literal 
> > > > > > > > > > > > initializer for an enum in this check.
> > > > > > > > > > > >  I don't think it's worthwhile to spend time trying to 
> > > > > > > > > > > > handle torture cases unless we can find data that shows 
> > > > > > > > > > > > it's prevalent in real world code.
> > > > > > > > > > > 
> > > > > > > > > > > I think I'm okay agreeing to that in this particular 
> > > > > > > > > > > case, but this is more to point out that writing your own 
> > > > > > > > > > > parser is a maintenance burden. Users will hit cases 
> > > > > > > > > > > we've both forgotten about here, they'll file a bug, then 
> > > > > > > > > > > someone has to deal with it. It's very hard to justify to 
> > > > > > > > > > > users "we think you write silly code" because they often 
> > > > > > > > > > > have creative ways in which their code is not actually so 
> > > > > > > > > > > silly, especially when we support "most" valid 
> > > > > > > > > > > expressions.
> > > > > > > > > > Writing your own parser is unavoidable here because we 
> > > > > > > > > > can't just assume that any old thing will be a valid 
> > > > > > > > > > initializer just by looking at the set of tokens present in 
> > > > > > > > > > the macro body.  (There is a separate discussion going on 
> > > > > > > > > > about improving the preprocessor support and parsing things 
> > > > > > > > > > more deeply, but that isn't even to the point of a 
> > > > > > > > > > prototype yet.)  The worst thing we can do is create 
> > > > > > > > > > "fixits" that produce invalid code.
> > > > > > > > > > 
> > > > > > > > > > The worst that happens if your expression isn't recognized 
> > > > > > > > > > is that your macro isn't matched as a candidate for an 
> > > > > > > > > > enum.  You can always make it an enum manually and join it 
> > > > > > > > > > with adjacent macros that were recognized and converted.
> > > > > > > > > > 
> > > > > > > > > > As it stands, the check only recognizes a single literal 
> > > > > > > > > > with an optional unary operator.
> > > > > > > > > > 
> > > > > > > > > > This change expands the check to recognize a broad range of 
> > > > > > > > > > expressions, allowing those macros to be converted to 
> > > > > > > > > > enums.  I opened the issue because running 
> > > > > > > > > > modernize-macro-to-enum on the [[ 
> > > > > > > > > > https://github.com/InsightSoftwareConsortium/ITK | ITK 
> > > > > > > > > > codebase ]] showed some simple expressions involving 
> > > > > > > > > > literals that weren't recognized and converted.
> > > > > > > > > > 
> > > > > > > > > > If an expression isn't recognized and an issue is opened, 
> > > > > > > > > > it will be an enhancement request to support a broader 
> > > > > > > > > > range of expressions, not a bug that this check created 
> > > > > > > > > > invalid code.
> > > > > > > > > > 
> > > > > > > > > > IMO, the more useful thing that's missing from the grammar 
> > > > > > > > > > is recognizing `sizeof` expressions rather than indexing 
> > > > > > > > > > string literals with an integral literal subscript.
> > > > > > > > > > 
> > > > > > > > > > I had planned on doing another increment to recognize 
> > > > > > > > > > `sizeof` expressions.
> > > > > > > > > > Writing your own parser is unavoidable here because we 
> > > > > > > > > > can't just assume that any old thing will be a valid 
> > > > > > > > > > initializer just by looking at the set of tokens present in 
> > > > > > > > > > the macro body. 
> > > > > > > > > 
> > > > > > > > > If you ran the token sequence through clang's parser and got 
> > > > > > > > > an AST node out, you'd have significantly *more* information 
> > > > > > > > > as to whether something is a valid enum constant initializer 
> > > > > > > > > because you can check that it's an actual constant expression 
> > > > > > > > > *and* that it's within a valid range of values. This not only 
> > > > > > > > > fixes edge case bugs with your approach (like the fact that 
> > > > > > > > > you can generate a series of literal expressions that result 
> > > > > > > > > in a value too large to store within an enumerator constant), 
> > > > > > > > > but it enables new functionality your approach currently 
> > > > > > > > > disallows (like using constexpr variables instead of just 
> > > > > > > > > numeric literals).
> > > > > > > > > 
> > > > > > > > > So I don't agree that it's unavoidable to write another 
> > > > > > > > > custom parser.
> > > > > > > > You keep bringing up the idea that the values have to be known, 
> > > > > > > > but so far they don't.
> > > > > > > > 
> > > > > > > > Remember, we are replacing macro identifiers with anonymous 
> > > > > > > > enum identifiers.  We aren't restricting the enum to a 
> > > > > > > > particular underlying type, so as long as the initializer is 
> > > > > > > > a valid integral literal expression, we're not changing any 
> > > > > > > > semantics.  Unscoped enums also allow arbitrary conversions 
> > > > > > > > to/from an underlying integral type chosen by the compiler.
> > > > > > > > 
> > > > > > > > C++20 9.7.1 paragraph 7 says:
> > > > > > > > 
> > > > > > > > > For an enumeration whose underlying type is not fixed, the 
> > > > > > > > > underlying type is an integral type that can
> > > > > > > > > represent all the enumerator values defined in the 
> > > > > > > > > enumeration. If no integral type can represent all the
> > > > > > > > > enumerator values, the enumeration is ill-formed. It is 
> > > > > > > > > implementation-defined which integral type is used
> > > > > > > > > as the underlying type except that the underlying type shall 
> > > > > > > > > not be larger than int unless the value of an
> > > > > > > > > enumerator cannot fit in an int or unsigned int. If the 
> > > > > > > > > enumerator-list is empty, the underlying type is as
> > > > > > > > > if the enumeration had a single enumerator with value 0.
> > > > > > > > 
> > > > > > > > So the compiler is free to pick an underlying type that's large 
> > > > > > > > enough to handle all the explicitly listed initial values.  Do 
> > > > > > > > we actually need to know the values for this check?  I don't 
> > > > > > > > think so, because we aren't changing anything about the type of 
> > > > > > > > the named values.  When the compiler evaluates an integral 
> > > > > > > > literal, it goes through a similar algorithm assigning the 
> > > > > > > > appropriate type to those integral values:
> > > > > > > > 
> > > > > > > > C++20 5.9 paragraph 2:
> > > > > > > > 
> > > > > > > > > A preprocessing number does not have a type or a value; it 
> > > > > > > > > acquires both after a successful conversion to an
> > > > > > > > > integer-literal token or a floating-point-literal token.
> > > > > > > > 
> > > > > > > > C++20 5.13.2 paragraph 3:
> > > > > > > > 
> > > > > > > > > The type of an integer-literal is the first type in the list 
> > > > > > > > > in Table 8 corresponding to its optional integer-suffix
> > > > > > > > > in which its value can be represented. 
> > > > > > > > 
> > > > > > > > The table says the type is int, unsigned int, long int, 
> > > > > > > > unsigned long int, long long int, or unsigned long long int 
> > > > > > > > based on the suffix and the value and that the type is chosen 
> > > > > > > > to be big enough to hold the value if the suffix is unspecified.
> > > > > > > > 
> > > > > > > > > but [using `clangParse`] enables new functionality your 
> > > > > > > > > approach currently disallows (like using constexpr variables 
> > > > > > > > > instead of just numeric literals).
> > > > > > > > 
> > > > > > > > I agree that if we used the full parser, we'd bring in 
> > > > > > > > `constexpr` expressions as valid initializers for the enums.  
> > > > > > > > However, before engaging in all that work, I'd like to see how 
> > > > > > > > likely this is in existing codebases by feedback from users 
> > > > > > > > requesting the support.  Maybe engaging the parser isn't a big 
> > > > > > > > amount of work, I don't actually know.  I've never looked 
> > > > > > > > deeply at the actual parsing code in clang.  Maybe it's easy 
> > > > > > > > enough to throw a bag of tokens at it and get back an AST node, 
> > > > > > > > maybe not.  (I suspect not based on my experience with the code 
> > > > > > > > base so far.)
> > > > > > > > 
> > > > > > > > My suspicion is that code bases that are heavy with macros for 
> > > > > > > > constants //aren't// using modern C++ in the body of those 
> > > > > > > > macros to define the values of those constants.  Certainly this 
> > > > > > > > is 100% true for C code that uses macros to define constants, 
> > > > > > > > by definition.  This check applies equally well to C code as C 
> > > > > > > > has had enums forever but even recent C code still tends to use 
> > > > > > > > macros for constants.
> > > > > > > > 
> > > > > > > > Still, my suspicions aren't data.  I'd like to get this check 
> > > > > > > > deployed in a basic fashion and let user feedback provide data 
> > > > > > > > on what is important.
> > > > > > > > 
> > > > > > > > > So I don't agree that it's unavoidable to write another 
> > > > > > > > > custom parser.
> > > > > > > > 
> > > > > > > > That's a fair point.  //Some// kind of parser is needed to 
> > > > > > > > recognize valid initializer expressions or we run the risk of 
> > > > > > > > transforming valid code into invalid code.  Whether it is a 
> > > > > > > > custom recognizer as I've done or `clangParse` is what we're 
> > > > > > > > debating here.
> > > > > > > > You keep bringing up the idea that the values have to be known, 
> > > > > > > > but so far they don't.
> > > > > > > 
> > > > > > > See comments at the top level.
> > > > > > > 
> > > > > > > > So the compiler is free to pick an underlying type that's large 
> > > > > > > > enough to handle all the explicitly listed initial values. Do 
> > > > > > > > we actually need to know the values for this check? 
> > > > > > > 
> > > > > > > Yes, C requires the enumeration constants to be representable 
> > > > > > > with `int`. But also, because this is in the `modernize` module, 
> > > > > > > it's very likely we'll be getting a request to convert to using a 
> > > > > > > scoped enumeration or an enumeration with the appropriate fixed 
> > > > > > > underlying type in C++ as well.
> > > > > > Oh, I see now, thanks for explaining it.  I didn't realize that C 
> > > > > > restricts the values to `int`.
> > > > > Regarding conversion to a scoped enum, I think that is best handled 
> > > > > by a separate enum-to-scoped-enum check.  I have one I've been 
> > > > > working on separately.  As bad as it is to convert macros (since they 
> > > > > have no respect for structure or scope), it's quite a bit of work to 
> > > > > convert an unscoped enum to a scoped enum because now implicit conversions 
> > > > > enter the picture and expressions involving macros (e.g. `FLAG_X | 
> > > > > FLAG_Y`) also get much more complicated.  Not only that but usages 
> > > > > have to have types updated.  I don't think it's very useful to 
> > > > > upgrade to a scoped enum and then have every use wrapped in 
> > > > > `static_cast<int>()`.  It just creates uglier code than what was 
> > > > > there before and I don't think people would adopt such a check.
> > > > > 
> > > > > Regarding conversion to an appropriate fixed underlying type, that 
> > > > > isn't allowed on unscoped enums, only on scoped enums, so it has all 
> > > > > the above complexity plus selecting the appropriate fixed underlying 
> > > > > type.
> > > > You're welcome, sorry for not pointing it out sooner!
> > > > Regarding conversion to a scoped enum, I think that is best handled by 
> > > > a separate enum-to-scoped-enum check.
> > > 
> > > It's been a while since I checked, but I recall that checks with 
> > > interacting fix-its tend not to play well together. We should probably 
> > > see if that's still the case today. As an example, if the 
> > > enum-to-scoped-enum check runs BEFORE the modernize-macro-to-enum check, 
> > > then the behavior will be worse than if the checks are run in the reverse 
> > > order. Because of issues like that, I'm not quite as convinced that a 
> > > separate check is best (though I do agree it's notionally better).
> > > 
> > > > Regarding conversion to an appropriate fixed underlying type, that 
> > > > isn't allowed on unscoped enums, only on scoped enums, so it has all 
> > > > the above complexity plus selecting the appropriate fixed underlying 
> > > > type.
> > > 
> > > That's incorrect; fixed underlying types and scoped enumerations are 
> > > orthogonal features (though a scoped enumeration always has a fixed 
> > > underlying type): https://godbolt.org/z/sGYsjdnrT
> > You're right that there can be unexpected interactions between checks when 
> > you run multiple of them concurrently, but this has always been the case 
> > and isn't surprising to me.  This doesn't seem to be a situation unique to 
> > these checks though.  As more and more transformations become available 
> > through clang-tidy, it's inevitable that two different checks will want to 
> > modify the same piece of code.  For instance, consider the identifier 
> > naming check and modernize-loop-convert: modernize-loop-convert can 
> > eliminate variables entirely (iterators go poof), while the identifier 
> > naming check wants to rename those same iterators.
> > 
> > Huh.  OK, good to know.  I tried doing an underlying type on an unscoped 
> > enum and I got a compilation error; I must have just done it wrong.
> > 
> > You're right that there can be unexpected interactions between checks when 
> > you run multiple of them concurrently, but this has always been the case 
> > and isn't surprising to me. 
> 
> +1, this isn't a new issue. The reason I brought it up is because we've been 
> bringing up this issue for a few years now and nobody has had the chance to 
> try to fix the fixit infrastructure to improve the behavior of these kinds of 
> interactions. So my fear is that we keep making the situation incrementally 
> worse, and then it gets incrementally harder for anyone to fix it because of 
> odd edge case behavior. That's not a reason for you to change what you're 
> doing in this patch right now, though -- just background on where I'm coming 
> from.
I've already observed that `run-clang-tidy.py` can produce invalid fixits for 
header files, see [[ https://github.com/llvm/llvm-project/issues/54885 | bug 
54885 ]] and [[ 
https://discourse.llvm.org/t/clang-tidy-how-to-get-replacements-properly-de-duplicated-by-clang-apply-replacements/61697
 | this discussion ]].  I haven't yet determined whether it's a bug in the way 
I'm emitting fixits or a bug in the way `clang-apply-replacements` tries to 
de-duplicate them.


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D124500/new/

https://reviews.llvm.org/D124500

_______________________________________________
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
