LegalizeAdulthood added inline comments.

================
Comment at: 
clang-tools-extra/clang-tidy/modernize/IntegralLiteralExpressionMatcher.cpp:99
+
+  if (!Current->isLiteral() || isStringLiteral(Current->getKind()) ||
+      !isIntegralConstant(*Current)) {
----------------
LegalizeAdulthood wrote:
> aaron.ballman wrote:
> > LegalizeAdulthood wrote:
> > > aaron.ballman wrote:
> > > > LegalizeAdulthood wrote:
> > > > > aaron.ballman wrote:
> > > > > > LegalizeAdulthood wrote:
> > > > > > > aaron.ballman wrote:
> > > > > > > > I know this is code moved from elsewhere, but I suppose we 
> > > > > > > > never considered the odd edge case where a user does something 
> > > > > > > > like `"foo"[0]` as a really awful integer constant. :-D
> > > > > > > It's always possible to write crazy contorted code and have a 
> > > > > > > check not recognize it.  I don't think it's worthwhile to spend 
> > > > > > > time trying to handle torture cases unless we can find data that 
> > > > > > > shows it's prevalent in real world code.
> > > > > > > 
> > > > > > > If I were doing a code review and saw this:
> > > > > > > ```
> > > > > > > enum {
> > > > > > >     FOO = "foo"[0]
> > > > > > > };
> > > > > > > ```
> > > > > > > I'd flag that in a code review as bogus, whereas if I saw:
> > > > > > > ```
> > > > > > > enum {
> > > > > > >     FOO = 'f'
> > > > > > > };
> > > > > > > ```
> > > > > > > That would be acceptable, which is why character literals are 
> > > > > > > accepted as an integral literal initializer for an enum in this 
> > > > > > > check.
> > > > > > >  I don't think it's worthwhile to spend time trying to handle 
> > > > > > > torture cases unless we can find data that shows it's prevalent 
> > > > > > > in real world code.
> > > > > > 
> > > > > > I think I'm okay agreeing to that in this particular case, but this 
> > > > > > is more to point out that writing your own parser is a maintenance 
> > > > > > burden. Users will hit cases we've both forgotten about here, 
> > > > > > they'll file a bug, then someone has to deal with it. It's very 
> > > > > > hard to justify to users "we think you write silly code" because 
> > > > > > they often have creative ways in which their code is not actually 
> > > > > > so silly, especially when we support "most" valid expressions.
> > > > > Writing your own parser is unavoidable here because we can't just 
> > > > > assume that any old thing will be a valid initializer just by looking 
> > > > > at the set of tokens present in the macro body.  (There is a separate 
> > > > > discussion going on about improving the preprocessor support and 
> > > > > parsing things more deeply, but that isn't even to the point of a 
> > > > > prototype yet.)  The worst thing we can do is create "fixits" that 
> > > > > produce invalid code.
> > > > > 
> > > > > The worst that happens if your expression isn't recognized is that 
> > > > > your macro isn't matched as a candidate for an enum.  You can always 
> > > > > make it an enum manually and join it with adjacent macros that were 
> > > > > recognized and converted.
> > > > > 
> > > > > As it stands, the check only recognizes a single literal with an 
> > > > > optional unary operator.
> > > > > 
> > > > > This change expands the check to recognize a broad range of 
> > > > > expressions, allowing those macros to be converted to enums.  I 
> > > > > opened the issue because running modernize-macro-to-enum on the [[ 
> > > > > https://github.com/InsightSoftwareConsortium/ITK | ITK codebase ]] 
> > > > > showed some simple expressions involving literals that weren't 
> > > > > recognized and converted.
> > > > > 
> > > > > If an expression isn't recognized and an issue is opened, it will be 
> > > > > an enhancement request to support a broader range of expressions, 
> > > > > a bug that this check created invalid code.
> > > > > 
> > > > > IMO, the more useful thing that's missing from the grammar is 
> > > > > recognizing `sizeof` expressions rather than indexing string literals 
> > > > > with an integral literal subscript.
> > > > > 
> > > > > I had planned on doing another increment to recognize `sizeof` 
> > > > > expressions.
> > > > > Writing your own parser is unavoidable here because we can't just 
> > > > > assume that any old thing will be a valid initializer just by looking 
> > > > > at the set of tokens present in the macro body. 
> > > > 
> > > > If you ran the token sequence through clang's parser and got an AST 
> > > > node out, you'd have significantly *more* information as to whether 
> > > > something is a valid enum constant initializer because you can check 
> > > > that it's an actual constant expression *and* that it's within a valid 
> > > > range of values. This not only fixes edge case bugs with your approach 
> > > > (like the fact that you can generate a series of literal expressions 
> > > > that result in a value too large to store within an enumerator 
> > > > constant), but it enables new functionality your approach currently 
> > > > disallows (like using constexpr variables instead of just numeric 
> > > > literals).
> > > > 
> > > > So I don't agree that it's unavoidable to write another custom parser.
> > > You keep bringing up the idea that the values have to be known, but so 
> > > far they don't.
> > > 
> > > Remember, we are replacing macro identifiers with anonymous enum 
> > > identifiers.  We aren't restricting the enum to a particular 
> > > underlying type, so as 
> > > long as it's a valid integral literal expression, we're not changing any 
> > > semantics.  Unscoped enums also allow arbitrary conversions to/from an 
> > > underlying integral type chosen by the compiler.
> > > 
> > > C++20 9.7.1 paragraph 7 says:
> > > 
> > > > For an enumeration whose underlying type is not fixed, the underlying 
> > > > type is an integral type that can
> > > > represent all the enumerator values defined in the enumeration. If no 
> > > > integral type can represent all the
> > > > enumerator values, the enumeration is ill-formed. It is 
> > > > implementation-defined which integral type is used
> > > > as the underlying type except that the underlying type shall not be 
> > > > larger than int unless the value of an
> > > > enumerator cannot fit in an int or unsigned int. If the 
> > > > enumerator-list is empty, the underlying type is as
> > > > if the enumeration had a single enumerator with value 0.
> > > 
> > > So the compiler is free to pick an underlying type that's large enough to 
> > > handle all the explicitly listed initial values.  Do we actually need to 
> > > know the values for this check?  I don't think so, because we aren't 
> > > changing anything about the type of the named values.  When the compiler 
> > > evaluates an integral literal, it goes through a similar algorithm 
> > > assigning the appropriate type to those integral values:
> > > 
> > > C++20 5.9 paragraph 2:
> > > 
> > > > A preprocessing number does not have a type or a value; it acquires 
> > > > both after a successful conversion to an
> > > > integer-literal token or a floating-point-literal token.
> > > 
> > > C++20 5.13.2 paragraph 3:
> > > 
> > > > The type of an integer-literal is the first type in the list in Table 8 
> > > > corresponding to its optional integer-suffix
> > > > in which its value can be represented. 
> > > 
> > > The table says the type is int, unsigned int, long int, unsigned long 
> > > int, long long int, or unsigned long long int based on the suffix and the 
> > > value and that the type is chosen to be big enough to hold the value if 
> > > the suffix is unspecified.
> > > 
> > > > but [using `clangParse`] enables new functionality your approach 
> > > > currently disallows (like using constexpr variables instead of just 
> > > > numeric literals).
> > > 
> > > I agree that if we used the full parser, we'd bring in `constexpr` 
> > > expressions as valid initializers for the enums.  However, before 
> > > engaging in all that work, I'd like to see how likely this is in existing 
> > > codebases by feedback from users requesting the support.  Maybe engaging 
> > > the parser isn't a big amount of work, I don't actually know.  I've never 
> > > looked deeply at the actual parsing code in clang.  Maybe it's easy 
> > > enough to throw a bag of tokens at it and get back an AST node, maybe 
> > > not.  (I suspect not based on my experience with the code base so far.)
> > > 
> > > My suspicion is that code bases that are heavy with macros for constants 
> > > //aren't// using modern C++ in the body of those macros to define the 
> > > values of those constants.  Certainly this is 100% true for C code that 
> > > uses macros to define constants, by definition.  This check applies 
> > > equally well to C code as C has had enums forever but even recent C code 
> > > still tends to use macros for constants.
> > > 
> > > Still, my suspicions aren't data.  I'd like to get this check deployed in 
> > > a basic fashion and let user feedback provide data on what is important.
> > > 
> > > > So I don't agree that it's unavoidable to write another custom parser.
> > > 
> > > That's a fair point.  //Some// kind of parser is needed to recognize 
> > > valid initializer expressions or we run the risk of transforming valid 
> > > code into invalid code.  Whether it is a custom recognizer as I've done 
> > > or `clangParse` is what we're debating here.
> > > You keep bringing up the idea that the values have to be known, but so 
> > > far they don't.
> > 
> > See comments at the top level.
> > 
> > > So the compiler is free to pick an underlying type that's large enough to 
> > > handle all the explicitly listed initial values. Do we actually need to 
> > > know the values for this check? 
> > 
> > Yes, C requires the enumeration constants to be representable with `int`. 
> > But also, because this is in the `modernize` module, it's very likely we'll 
> > be getting a request to convert to using a scoped enumeration or an 
> > enumeration with the appropriate fixed underlying type in C++ as well.
> Oh, I see now, thanks for explaining it.  I didn't realize that C restricts 
> the values to `int`.
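For illustration, here's a minimal sketch of that difference (the enum names 
and values below are invented, not from any real codebase): in C++, an 
enumeration with no fixed underlying type gets whatever integral type is big 
enough, while C requires each enumerator to be representable as an `int`.

```cpp
#include <type_traits>

// With no fixed underlying type, C++ picks an integral type large enough
// for all the enumerators (C++20 [dcl.enum]p7).  C has no such latitude:
// each enumeration constant must be representable as an int, so a
// macro-to-enum conversion in C has to know the values, not just the
// token structure.  These enums are illustrative only.
enum SmallValues { SV = 1 };              // fits in int
enum BigValues { BV = 0x1'0000'0000LL };  // too big for a 32-bit int;
                                          // C++ picks a 64-bit type,
                                          // C (before C23) rejects this

static_assert(sizeof(std::underlying_type_t<SmallValues>) <= sizeof(int),
              "small enumerators fit in int or smaller");
static_assert(sizeof(std::underlying_type_t<BigValues>) >= 8,
              "large enumerators force a wider underlying type");
```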
Regarding conversion to a scoped enum, I think that is best handled by a 
separate enum-to-scoped-enum check; I have one I've been working on 
separately.  As messy as it is to convert macros (since they have no respect 
for structure or scope), it's quite a bit more work to convert an unscoped 
enum to a scoped enum, because implicit conversions now enter the picture 
and expressions involving the constants (e.g. `FLAG_X | FLAG_Y`) get much 
more complicated.  On top of that, the types at every usage site have to be 
updated.  I don't think it's very useful to upgrade to a scoped enum and 
then have every use wrapped in `static_cast<int>()`; that just creates 
uglier code than what was there before, and I don't think people would adopt 
such a check.
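For what it's worth, the `static_cast` noise is easy to see in a small 
sketch (the flag names are invented for illustration; an enum-to-scoped-enum 
check would have to rewrite every such expression):

```cpp
// Unscoped enumerators convert to int implicitly, so bitwise expressions
// over them just work; scoped enumerators do not, so every use needs a
// cast (or a hand-written operator| overload).  Names are illustrative.
enum { FLAG_X = 1, FLAG_Y = 2 };   // what macro-to-enum would emit
enum class Flag { X = 1, Y = 2 };  // what a scoped-enum upgrade would yield

inline int unscopedMask() { return FLAG_X | FLAG_Y; }  // no casts needed

inline int scopedMask() {                              // casts everywhere
  return static_cast<int>(Flag::X) | static_cast<int>(Flag::Y);
}
```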

Regarding conversion to an appropriate fixed underlying type: C++11 does 
permit a fixed underlying type on unscoped enums, but C does not, and either 
way it has all the above complexity plus the job of selecting an underlying 
type large enough to represent every enumerator value.
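As a sketch of why selecting the type is the hard part (the types and values 
here are invented for illustration): the fixed underlying type must be able 
to represent every enumerator, so a converting check would have to evaluate 
each initializer, not merely recognize its token structure.

```cpp
#include <cstdint>
#include <type_traits>

// A fixed underlying type must represent every enumerator, so choosing
// one requires evaluating the initializer values.  Illustrative only.
enum class Small : std::uint8_t { Lo = 0, Hi = 255 };  // 255 fits in 8 bits
enum class Wide : std::uint32_t { Big = 0x10000 };     // 0x10000 would not

static_assert(std::is_same_v<std::underlying_type_t<Small>, std::uint8_t>,
              "the fixed underlying type is taken verbatim");
```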


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D124500/new/

https://reviews.llvm.org/D124500

_______________________________________________
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits