[PATCH] D124500: [clang-tidy] Support expressions of literals in modernize-macro-to-enum

Aaron Ballman via Phabricator via cfe-commits Mon, 02 May 2022 13:18:08 -0700

aaron.ballman added inline comments.


================
Comment at: 
clang-tools-extra/clang-tidy/modernize/IntegralLiteralExpressionMatcher.cpp:99
+
+  if (!Current->isLiteral() || isStringLiteral(Current->getKind()) ||
+      !isIntegralConstant(*Current)) {
----------------
LegalizeAdulthood wrote:
> aaron.ballman wrote:
> > LegalizeAdulthood wrote:
> > > aaron.ballman wrote:
> > > > LegalizeAdulthood wrote:
> > > > > aaron.ballman wrote:
> > > > > > LegalizeAdulthood wrote:
> > > > > > > aaron.ballman wrote:
> > > > > > > > I know this is code moved from elsewhere, but I suppose we 
> > > > > > > > never considered the odd edge case where a user does something 
> > > > > > > > like `"foo"[0]` as a really awful integer constant. :-D
> > > > > > > It's always possible to write crazy contorted code and have a 
> > > > > > > check not recognize it.  I don't think it's worthwhile to spend 
> > > > > > > time trying to handle torture cases unless we can find data that 
> > > > > > > shows it's prevalent in real world code.
> > > > > > > 
> > > > > > > If I was doing a code review and saw this:
> > > > > > > ```
> > > > > > > enum {
> > > > > > >     FOO = "foo"[0]
> > > > > > > };
> > > > > > > ```
> > > > > > > I'd flag that in a code review as bogus, whereas if I saw:
> > > > > > > ```
> > > > > > > enum {
> > > > > > >     FOO = 'f'
> > > > > > > };
> > > > > > > ```
> > > > > > > That would be acceptable, which is why character literals are 
> > > > > > > accepted as an integral literal initializer for an enum in this 
> > > > > > > check.
> > > > > > >  I don't think it's worthwhile to spend time trying to handle 
> > > > > > > torture cases unless we can find data that shows it's prevalent 
> > > > > > > in real world code.
> > > > > > 
> > > > > > I think I'm okay agreeing to that in this particular case, but this 
> > > > > > is more to point out that writing your own parser is a maintenance 
> > > > > > burden. Users will hit cases we've both forgotten about here, 
> > > > > > they'll file a bug, then someone has to deal with it. It's very 
> > > > > > hard to justify to users "we think you write silly code" because 
> > > > > > they often have creative ways in which their code is not actually 
> > > > > > so silly, especially when we support "most" valid expressions.
> > > > > Writing your own parser is unavoidable here because we can't just 
> > > > > assume that any old thing will be a valid initializer just by looking 
> > > > > at the set of tokens present in the macro body.  (There is a separate 
> > > > > discussion going on about improving the preprocessor support and 
> > > > > parsing things more deeply, but that isn't even to the point of a 
> > > > > prototype yet.)  The worst thing we can do is create "fixits" that 
> > > > > produce invalid code.
> > > > > 
> > > > > The worst that happens if your expression isn't recognized is that 
> > > > > your macro isn't matched as a candidate for an enum.  You can always 
> > > > > make it an enum manually and join it with adjacent macros that were 
> > > > > recognized and converted.
> > > > > 
> > > > > As it stands, the check only recognizes a single literal with an 
> > > > > optional unary operator.
> > > > > 
> > > > > This change expands the check to recognize a broad range of 
> > > > > expressions, allowing those macros to be converted to enums.  I 
> > > > > opened the issue because running modernize-macro-to-enum on the [[ 
> > > > > https://github.com/InsightSoftwareConsortium/ITK | ITK codebase ]] 
> > > > > showed some simple expressions involving literals that weren't 
> > > > > recognized and converted.
> > > > > 
> > > > > If an expression isn't recognized and an issue is opened, it will be 
> > > > > an enhancement request to support a broader range of expression, not 
> > > > > a bug that this check created invalid code.
> > > > > 
> > > > > IMO, the more useful thing that's missing from the grammar is 
> > > > > recognizing `sizeof` expressions rather than indexing string literals 
> > > > > with an integral literal subscript.
> > > > > 
> > > > > I had planned on doing another increment to recognize `sizeof` 
> > > > > expressions.
> > > > > Writing your own parser is unavoidable here because we can't just 
> > > > > assume that any old thing will be a valid initializer just by looking 
> > > > > at the set of tokens present in the macro body. 
> > > > 
> > > > If you ran the token sequence through clang's parser and got an AST 
> > > > node out, you'd have significantly *more* information as to whether 
> > > > something is a valid enum constant initializer because you can check 
> > > > that it's an actual constant expression *and* that it's within a valid 
> > > > range of values. This not only fixes edge case bugs with your approach 
> > > > (like the fact that you can generate a series of literal expressions 
> > > > that result in a value too large to store within an enumerator 
> > > > constant), but it enables new functionality your approach currently 
> > > > disallows (like using constexpr variables instead of just numeric 
> > > > literals).
> > > > 
> > > > So I don't agree that it's unavoidable to write another custom parser.
> > > You keep bringing up the idea that the values have to be known, but so 
> > > far they don't.
> > > 
> > > Remember, we are replacing macro identifiers with anonymous enum 
> > > identifiers.  We aren't specifying a restricting type to the enum, so as 
> > > long as it's a valid integral literal expression, we're not changing any 
> > > semantics.  Unscoped enums also allow arbitrary conversions to/from an 
> > > underlying integral type chosen by the compiler.
> > > 
> > > C++20 9.7.1 paragraph 7 says:
> > > 
> > > > For an enumeration whose underlying type is not fixed, the underlying 
> > > > type is an integral type that can
> > > > represent all the enumerator values defined in the enumeration. If no 
> > > > integral type can represent all the
> > > > enumerator values, the enumeration is ill-formed. It is 
> > > > implementation-defined which integral type is used
> > > > as the underlying type except that the underlying type shall not be 
> > > > larger than int unless the value of an
> > > > enumerator cannot fit in an int or unsigned int . If the 
> > > > enumerator-list is empty, the underlying type is as
> > > > if the enumeration had a single enumerator with value 0.
> > > 
> > > So the compiler is free to pick an underlying type that's large enough to 
> > > handle all the explicitly listed initial values.  Do we actually need to 
> > > know the values for this check?  I don't think so, because we aren't 
> > > changing anything about the type of the named values.  When the compiler 
> > > evaluates an integral literal, it goes through a similar algorithm 
> > > assigning the appropriate type to those integral values:
> > > 
> > > C++20 5.9 paragraph 2:
> > > 
> > > > A preprocessing number does not have a type or a value; it acquires 
> > > > both after a successful conversion to an
> > > > integer-literal token or a floating-point-literal token.
> > > 
> > > C++20 5.13.2 paragraph 3:
> > > 
> > > > The type of an integer-literal is the first type in the list in Table 8 
> > > > corresponding to its optional integer-suffix
> > > > in which its value can be represented. 
> > > 
> > > The table says the type is int, unsigned int, long int, unsigned long 
> > > int, long long int, or unsigned long long int based on the suffix and the 
> > > value and that the type is chosen to be big enough to hold the value if 
> > > the suffix is unspecified.
> > > 
> > > > but [using `clangParse`] enables new functionality your approach 
> > > > currently disallows (like using constexpr variables instead of just 
> > > > numeric literals).
> > > 
> > > I agree that if we used the full parser, we'd bring in `constexpr` 
> > > expressions as valid initializers for the enums.  However, before 
> > > engaging in all that work, I'd like to see how likely this is in existing 
> > > codebases by feedback from users requesting the support.  Maybe engaging 
> > > the parser isn't a big amount of work, I don't actually know.  I've never 
> > > looked deeply at the actual parsing code in clang.  Maybe it's easy 
> > > enough to throw a bag of tokens at it and get back an AST node, maybe 
> > > not.  (I suspect not based on my experience with the code base so far.)
> > > 
> > > My suspicion is that code bases that are heavy with macros for constants 
> > > //aren't// using modern C++ in the body of those macros to define the 
> > > values of those constants.  Certainly this is 100% true for C code that 
> > > uses macros to define constants, by definition.  This check applies 
> > > equally well to C code as C has had enums forever but even recent C code 
> > > still tends to use macros for constants.
> > > 
> > > Still, my suspicions aren't data.  I'd like to get this check deployed in 
> > > a basic fashion and let user feedback provide data on what is important.
> > > 
> > > > So I don't agree that it's unavoidable to write another custom parser.
> > > 
> > > That's a fair point.  //Some// kind of parser is needed to recognize 
> > > valid initializer expressions or we run the risk of transforming valid 
> > > code into invalid code.  Whether it is a custom recognizer as I've done 
> > > or `clangParse` is what we're debating here.
> > > You keep bringing up the idea that the values have to be known, but so 
> > > far they don't.
> > 
> > See comments at the top level.
> > 
> > > So the compiler is free to pick an underlying type that's large enough to 
> > > handle all the explicitly listed initial values. Do we actually need to 
> > > know the values for this check? 
> > 
> > Yes, C requires the enumeration constants to be representable with `int`. 
> > But also, because this is in the `modernize` module, it's very likely we'll 
> > be getting a request to convert to using a scoped enumeration or an 
> > enumeration with the appropriate fixed underlying type in C++ as well.
> Oh, I see now, thanks for explaining it.  I didn't realize that C restricts 
> the values to `int`.
You're welcome, sorry for not pointing it out sooner!


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D124500/new/

https://reviews.llvm.org/D124500

_______________________________________________
cfe-commits mailing list
[email protected]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits

[PATCH] D124500: [clang-tidy] Support expressions of literals in modernize-macro-to-enum

Reply via email to