[PATCH] D88299: [clang-format] Add MacroUnexpander.

Sam McCall via Phabricator via cfe-commits Mon, 12 Oct 2020 11:24:48 -0700

sammccall added a comment.

This is magnificently difficult :-)
I've sunk a fair few hours into understanding it now, and need to send some 
comments based on the interface+tests.
Some of these will be misguided or answered by the implementation, but I'm 
afraid I can't fit it all into my head (on a reasonable timeframe) until I 
understand it better.




================
Comment at: clang/lib/Format/FormatToken.h:449
 
+  /// When macro expansion introduces parents, those are marked as
+  /// \c MacroParent, so formatting knows their children need to be formatted.
----------------
I can't really understand from the comment when this is supposed to be set, and 
there are no tests of it.

(The comment is vague: is a "parent" the inverse of FormatToken::Children here? 
Is this scenario when the parents in question are new, or their children are 
new, or both? What part of the code is "formatting", and why would it otherwise 
skip the children?)


================
Comment at: clang/lib/Format/Macros.h:134
 
+/// Matches formatted lines that were created by macro expansion to a format
+/// for the original macro call.
----------------
"matches formatted lines" probably describes the hard technical problem it has 
to solve, but not so much what it does for the caller:  what the transformation 
is between its inputs and its outputs.

Is it something like:

```
Rewrites expanded code (containing tokens expanded from macros) into spelled 
code (containing tokens for the macro call itself). Token types are preserved, 
so macro arguments in the output have semantically-correct types from their 
expansion. This is the point of expansion/unexpansion: to allow this 
information to be used in formatting.
```

[Is it just tokentypes? I guess it's also Role and MustBreakBefore and some 
other stuff like that?]


================
Comment at: clang/lib/Format/Macros.h:142
+/// When getting the formatted lines of the expansion via the \c addLine 
method:
+/// -> class A {
+/// -> public:
----------------
I'm a bit confused by these arrows. It doesn't seem that they each point to an 
unwrappedline passed to addLine?


================
Comment at: clang/lib/Format/Macros.h:142
+/// When getting the formatted lines of the expansion via the \c addLine 
method:
+/// -> class A {
+/// -> public:
----------------
sammccall wrote:
> I'm a bit confused by these arrows. It doesn't seem that they each point to 
> an unwrappedline passed to addLine?
This example didn't really help me understand the interface of this class, it 
seems to be a special case:
 - the input is a single block construct (rather than e.g. a whole program), 
but it's not clear why that's the case.
 - the output (unexpanded form) consists of exactly a macro call with no 
leading/trailing tokens, which isn't true in general

If the idea is to provide as input the minimal range of lines that span the 
macro, we should say so somewhere. But I would like to understand why we're not 
simply processing the whole file.


================
Comment at: clang/lib/Format/Macros.h:147
+///
+/// Creates the unwrapped lines containing the macro call tokens so that
+/// the macro call tokens fit the semantic structure of the expanded formatted
----------------
this says "creates the unwrapped lines" but getResult() returns only one.
Does the plural here refer to the tree? Maybe just use singular or "the 
unwrappedlinetree"?


================
Comment at: clang/lib/Format/Macros.h:154
+/// -> })
+class MacroUnexpander {
+public:
----------------
I get the symmetry between the expander/unexpander classes, but the name is 
making it harder for me to follow the code.
 - the extra compound+negation in the name makes it hard/slow to understand, as 
I'm always thinking first about expansion
 - the fact that expansion/unexpansion are not actually opposites completely 
breaks my intuition. It also creates two different meanings of "unexpanded" 
that led me down the garden path a bit (e.g. in the test).

Would you consider `MacroCollapser`? It's artificial, but expand/collapse are 
well-known antonyms without being the same word.

(Incidentally, I just realized this is why I find "UnwrappedLine" so slippery - 
the ambiguity between "this isn't wrapped" and "this was unwrapped")


================
Comment at: clang/lib/Format/Macros.h:164
+  /// For the given \p Line, match all occurences of tokens expanded from a
+  /// macro and replace them with the original macro call in \c getResult().
+  void addLine(const UnwrappedLine &Line);
----------------
I find this hard to follow.
- again, the "match" part is an implementation detail that sounds interesting 
but isn't :-)
- "from a macro" sounds like one in particular, but is actually every macro
- the bit about "getResult" is a separate point that feels shoehorned in

What about:
"Replaces tokens that were expanded from macros with the original macro calls. 
The result is stored and available in getResult()"


================
Comment at: clang/lib/Format/Macros.h:175
+  /// Generally, this line tries to have the same structure as the expanded,
+  /// formatted unwrapped lines handed in via \c addLine():
+  /// If a token in a macro argument is a child of a token in the expansion,
----------------
how can this be the case if the input can have multiple lines and the output 
only one?

Is the return value a synthetic parent of the translated lines?
Or is there a hidden requirement on the caller here that we don't keep feeding 
in lines unless we're continuing a macro call and therefore know we'll end up 
with one line?

This stuff could be clarified in docs but again I have to ask, can't we 
sidestep the whole issue by processing the whole file and returning all the 
lines?


(This is somewhat answered in the implementation, though that doesn't seem like 
the right place)


================
Comment at: clang/lib/Format/Macros.h:237
+  // lines as children.
+  // In the end, we stich the lines together so that each subsequent line
+  // is a child of the last token of the previous line. This is necessary
----------------
The explanation here seems to be proving the converse: if we *didn't* use this 
representation, then the code wouldn't work. However what I'm missing is an 
explanation of why it **is** correct/sensible.

After staring at the tests, I'm still not sure, since the tests seem to 
postprocess the "natural" output the same way before asserting equality.

My tentative conclusion is it would be clearest to move this "in the end" step 
to the *caller* of getResult(), as it seems to have more to do with formatting 
than unexpansion. But I haven't looked in detail at that caller, maybe I'm 
wrong...


================
Comment at: clang/unittests/Format/MacroUnexpanderTest.cpp:1
+#include "../../lib/Format/Macros.h"
+#include "../../lib/Format/UnwrappedLineParser.h"
----------------
All of these tests use both the expander and unexpander (so need consider 
behavior/bugs of both). 
Is it possible to test the unexpander in isolation?

(Context: I'm trying to use the tests to understand what the class does, but 
the inputs aren't visible)


================
Comment at: clang/unittests/Format/MacroUnexpanderTest.cpp:1
+#include "../../lib/Format/Macros.h"
+#include "../../lib/Format/UnwrappedLineParser.h"
----------------
sammccall wrote:
> All of these tests use both the expander and unexpander (so need consider 
> behavior/bugs of both). 
> Is it possible to test the unexpander in isolation?
> 
> (Context: I'm trying to use the tests to understand what the class does, but 
> the inputs aren't visible)
Having read all the tests here, they seem to follow exactly the same schema:

1. some macro definitions
2. some sequence of expand() calls, simulating expanding some macro-heavy code
3. verify the expanded token sequence, rearranging it into expected 
UnwrappedLine structure
4. call unexpand() and check that the result has the expected UnwrappedLine 
structure

You do this using various fairly-general helpers and DSLs, but don't really 
combine them in different ways... the tests are somewhat readable and error 
messages are OK, but if these are important tests, it might be worth looking at 
a data-driven test. e.g.

```
Case NestedChildBlocks;
NestedChildBlocks.Macros = {"ID(x)=x", "CALL(x)=f([] { x })"};
NestedChildBlocks.Original = "ID(CALL(CALL(return a * b;)))";
NestedChildBlocks.Expanded = R"cpp(
{
f([] {
  f([] {
    return a * b;
  })
})
}
)cpp"; // indentation shows structure
NestedChildBlocks.Unexpanded = R"cpp(
ID(
  {
  CALL(CALL(return a * b;))
  }
)
)cpp";
NestedChildBlocks.verify();
```

Definitely involves a bit of reinventing wheels though.


================
Comment at: clang/unittests/Format/MacroUnexpanderTest.cpp:20
+
+class Expansion {
+public:
----------------
docs for this class/major members?


================
Comment at: clang/unittests/Format/MacroUnexpanderTest.cpp:28
+    auto *ID = Lex.id(Name);
+    auto UnexpandedLine = std::make_unique<UnwrappedLine>();
+    UnexpandedLine->Tokens.push_back(ID);
----------------
this name (and all the other "Unexpanded*" in this class) is misleading, 
because there's a process called "unexpanding" but these aren't the output of 
it.

I'd suggest "spelled", though I do think renaming the unexpander would also be 
worthwhile.


================
Comment at: clang/unittests/Format/MacroUnexpanderTest.cpp:93
+
+struct Matcher {
+  Matcher(const TokenList &Tokens) : Tokens(Tokens), It(this->Tokens.begin()) 
{}
----------------
this needs docs (it's a cool technique! no need to make the reader decipher it 
though)


================
Comment at: clang/unittests/Format/MacroUnexpanderTest.cpp:96
+
+  Chunk consume(const TokenList &Tokens) {
+    TokenList Result;
----------------
you always consume(lex(...)) - taking a StringRef directly instead would make 
it clearer that the identity of the temporary tokens doesn't matter


================
Comment at: clang/unittests/Format/MacroUnexpanderTest.cpp:125
+  std::unique_ptr<MacroExpander>
+  create(const std::vector<std::string> &MacroDefinitions) {
+    return std::make_unique<MacroExpander>(MacroDefinitions,
----------------
`create()` is a strange name for the *expander* in an *unexpander* test


================
Comment at: clang/unittests/Format/MacroUnexpanderTest.cpp:141
+
+  UnwrappedLine line(llvm::ArrayRef<Chunk> Chunks) {
+    UnwrappedLine Result;
----------------
FWIW, this seems confusing to me - line() has overloads that are simple, and 
then this one that does the same nontrivial stitching that the production code 
does.
The fact that the a/b lines are not sibling sub-lines in `line({tokens("a"), 
tokens("b")})`but they *are* sibling tokens in  `line(lex("a b"))` is hard to 
keep track of in the tests.

If the stitching really is necessary, I think it's important for the expected 
output to also be shown in its stitched form.


================
Comment at: clang/unittests/Format/MacroUnexpanderTest.cpp:155
+
+  Chunk tokens(llvm::StringRef Text) { return Chunk(lex(Text)); }
+
----------------
why is this "tokens" and not "chunk"?


================
Comment at: clang/unittests/Format/MacroUnexpanderTest.cpp:207
+  Unexp.addLine(line(E.consume(lex("};"))));
+  EXPECT_TRUE(Unexp.finished());
+  Matcher U(Call);
----------------
you have lots of assertions that this is true, but none that it is false


================
Comment at: clang/unittests/Format/MacroUnexpanderTest.cpp:242
+  Expansion Exp1(Lex, *Macros);
+  TokenList Call1 = Exp1.expand("ID", {"a *b"});
+  // 2. expand ID({ a *b; })
----------------
the high-level point of this test seems to be that by looking the 
post-expansion context, we can determine that b is a declared pointer so we 
don't put a space before it.

And everywhere these tokens are mentioned/verified here, the spacing is 
correct, as if it were propagated... but of course the spacing is actually 
ignored everywhere and underneath it's just sequences of tokens.
This makes it hard to track how well this it testing what it really wants to 
test.

Is it possible to mark the asterisk with the correct tokentype at the point 
where formatting would occur, and then verify that the unexpanded form (i.e. 
`Chunk1.Tokens[1]`) still has the tokentype set?
(It's probably possible to prove this by inspection by understanding what U1 
contains, how Matcher works etc, but I think it'd still be illustrative)


================
Comment at: clang/unittests/Format/MacroUnexpanderTest.cpp:245
+  TokenList Arg;
+  Arg.push_back(Lex.id("{"));
+  Arg.append(Exp1.getTokens().begin(), Exp1.getTokens().end());
----------------
what does id() mean if not identifier?


================
Comment at: clang/unittests/Format/MacroUnexpanderTest.cpp:406
+  Matcher E(Exp.getTokens());
+  Unexp.addLine(line(E.consume(lex("x;"))));
+  Unexp.addLine(line(E.consume(lex("x y"))));
----------------
because this example isn't realistic, it'd be nice to have the comment here 
showing what formatting you're simulating


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D88299/new/

https://reviews.llvm.org/D88299

_______________________________________________
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits

[PATCH] D88299: [clang-format] Add MacroUnexpander.

Reply via email to