sammccall added inline comments.
================ Comment at: clang/include/clang/Tooling/Syntax/Pseudo/Forest.h:30 +// A node in a forest. +class ForestNode { +public: ---------------- sammccall wrote: > I wonder if we should put these in a namespace `forest::Node` rather than > giving them the "Forest" prefix? alignas(ForestNode*) ================ Comment at: clang/include/clang/Tooling/Syntax/Pseudo/Forest.h:67 + std::string Dump(const Grammar &) const; + std::string DumpRecursive(const Grammar &, bool abbreviated = false) const; + ---------------- dump(), dumpRecursive() ================ Comment at: clang/include/clang/Tooling/Syntax/Pseudo/Forest.h:78 + } + static uint16_t SequenceData(RuleID rule, + llvm::ArrayRef<const ForestNode *> elements) { ---------------- Rule, Elements ================ Comment at: clang/include/clang/Tooling/Syntax/Pseudo/Forest.h:90 + // Terminal - unused + uint16_t Data_; + // A trailing array of Node* . ---------------- remove trailing _ ================ Comment at: clang/include/clang/Tooling/Syntax/Pseudo/Forest.h:91 + uint16_t Data_; + // A trailing array of Node* . +}; ---------------- ForestNode* ================ Comment at: clang/include/clang/Tooling/Syntax/Pseudo/Forest.h:100 +public: + size_t nodeNum() const { return NodeNum; } + size_t bytes() const { return Arena.getBytesAllocated() + sizeof(this); } ---------------- count()? ================ Comment at: clang/include/clang/Tooling/Syntax/Pseudo/GLRParser.h:40 + +// An implementation of a directed acyclic graph (DAG), used as a +// graph-structured stack (GSS) in the GLR parser. ---------------- the fact that this uses a DAG seems less important than what it represents. I think the next sentence is a better introduction, and then we should describe what data the stacks are modelling, finally we should describe the data structure after (see comment below). ================ Comment at: clang/include/clang/Tooling/Syntax/Pseudo/GLRParser.h:43 +// +// GSS is an efficient data structure to represent multiple active stacks, it +// employs a stack-combination optimization to avoid potentially exponential ---------------- as with forest, I think this places too much emphasis on the memory-compactness when it's not the main benefit. (If RAM were free we still wouldn't want a big array of stacks). ================ Comment at: clang/include/clang/Tooling/Syntax/Pseudo/GLRParser.h:43 +// +// GSS is an efficient data structure to represent multiple active stacks, it +// employs a stack-combination optimization to avoid potentially exponential ---------------- sammccall wrote: > as with forest, I think this places too much emphasis on the > memory-compactness when it's not the main benefit. (If RAM were free we still > wouldn't want a big array of stacks). To address the two above comments, maybe something like. ``` A Graph-Structured Stack represents the multiple parse stacks for a generalized LR parser. Each node stores a parse state, the last parsed ForestNode, and its parent(s). There may be several heads (top of stack), and the parser operates by: - shift: pushing terminal symbols on top of the stack - reduce: replace N symbols on top of the stack with one nonterminal The structure is a DAG rather than a stack: - GLR allows multiple actions on the same head, producing forks (nodes with the same parent). - The parser reconciles nodes with the same (state, ForestNode), producing joins (node with multiple parents). The parser is responsible for creating nodes, and keeping track of the set of heads. ``` ================ Comment at: clang/include/clang/Tooling/Syntax/Pseudo/GLRParser.h:46 +// growth of the stack: +// - combing equal stack prefixes -- A new stack doesn't need to have a full +// copy of its parent’s stack. They share a common prefix. ---------------- combing -> combining I wonder if combining is the right explanation here - unlike local ambiguity packing, it's not like we produce two things and then merge them. What about rather: `sharing stack prefixes: when the parser must take two conflicting paths, it creates two new stack head nodes with the same parent.` ================ Comment at: clang/include/clang/Tooling/Syntax/Pseudo/GLRParser.h:48 +// copy of its parent’s stack. They share a common prefix. +// - combing euqal stack suffices -- as there are a finite number of DFA's +// state the parser can be in. A set of heads can be in the same state ---------------- combing -> combining euqal -> equal ================ Comment at: clang/include/clang/Tooling/Syntax/Pseudo/GLRParser.h:48 +// copy of its parent’s stack. They share a common prefix. +// - combing euqal stack suffices -- as there are a finite number of DFA's +// state the parser can be in. A set of heads can be in the same state ---------------- sammccall wrote: > combing -> combining > euqal -> equal the "-- as..." clause is a bit confusing and doesn't seem necessary. I think what's missing here is a high level explanation of what's going on in the parse to trigger this, and explicitly mentioning this is how we get a node with multiple parents. When alternative parses converge on the same interpretation of later tokens, their stack heads end up in the same state. These are merged, resulting in a single head with several parents. ================ Comment at: clang/include/clang/Tooling/Syntax/Pseudo/GLRParser.h:53 +// +// E.g. we have two active stacks: +// 0 -> 1 -> 2 ---------------- this shows forking but not merging and in particular it suggests that #heads == #stacks, which is not the case ================ Comment at: clang/include/clang/Tooling/Syntax/Pseudo/GLRParser.h:54 +// E.g. we have two active stacks: +// 0 -> 1 -> 2 +// | ^ head1, representing a stack [2, 1, 0] ---------------- arrows are pointing the wrong way (at least opposite to our pointers) ================ Comment at: clang/include/clang/Tooling/Syntax/Pseudo/GLRParser.h:58 +// ^ head2, representing a stack [3, 1, 0] +struct Graph { + // Represents a node in the graph. ---------------- the name "Graph" is IMO too general, this isn't a reusable graph or even dag class. I think GSS is fine, it's a distinctive name from the literature, and the expansion of the abbreviation is nice and descriptive (but too verbose to be the actual name) ================ Comment at: clang/include/clang/Tooling/Syntax/Pseudo/GLRParser.h:59 +struct Graph { + // Represents a node in the graph. + struct Node { ---------------- this comment doesn't say anything, remove? ================ Comment at: clang/include/clang/Tooling/Syntax/Pseudo/GLRParser.h:60 + // Represents a node in the graph. + struct Node { + // The parsing state presented by the graph node. ---------------- alignas(Node*) (in practice this will match ForestNode* so is a no-op, but it documents the intent) ================ Comment at: clang/include/clang/Tooling/Syntax/Pseudo/GLRParser.h:61 + struct Node { + // The parsing state presented by the graph node. + LRTable::StateID State : LRTable::StateBits; ---------------- again, drop comment unless there's something to say ================ Comment at: clang/include/clang/Tooling/Syntax/Pseudo/GLRParser.h:64 + static constexpr unsigned PredecessorBits = 3; + // Number of the predecessors of the node. + // u is the predecessor of v, if u -> v. ---------------- This is not what predecessor usually means in graph theory: u is the predecessor of v if there is *some path* from u->v. I think "parent" is a common and well-understood term. ================ Comment at: clang/include/clang/Tooling/Syntax/Pseudo/GLRParser.h:68 + // The forest node for a termina/nonterminal symbol. + // The symbol correponds to the label of edges which leads to current node + // from the predecessor nodes. ---------------- Also not sure what this comment says: - first line just repeats the type: type is a forest node, forest nodes are always for symbols, and terminal/nonterminal are all the possibilities - second line is either referring to or defining (not sure) edge labels, but edge labels aren't defined or referred to anywhere else, so why? Maybe: ``` The parse tree for the last symbol we parsed. This symbol appears on the left of the dot in the parse state. ``` ================ Comment at: clang/include/clang/Tooling/Syntax/Pseudo/GLRParser.h:77 + + bool operator==(const Node &L) const { + return State == L.State && predecessors() == L.predecessors(); ---------------- why is Parsed not part of the identity? (maybe there's a good reason, but there should be a comment) ================ Comment at: clang/include/clang/Tooling/Syntax/Pseudo/GLRParser.h:87 + llvm::ArrayRef<const Node *> Predecessors) { + assert(Predecessors.size() < (1 << Node::PredecessorBits) && + "Too many predecessors to fit in PredecessorBits!"); ---------------- 8 parents isn't that many and it seems like this might be a dynamic property of the input code rather than a static property of the grammar. But I don't think this bitpacking is buying anything, it looks like the layout is: - State : 13 - PredecessorCount : 3 - (padding) : 48 - Parsed : 64 So I think we might as well just use a uint16 PredecessorCount and still have room left over for a uint16 refcount later. Is there anything else we might usefully store in the extra space? ================ Comment at: clang/include/clang/Tooling/Syntax/Pseudo/GLRParser.h:100 + size_t bytes() const { return Arena.getTotalMemory() + sizeof(*this); } + size_t nodeCount() const { return NodeCount; } + ---------------- name consistently with forest arena ================ Comment at: clang/include/clang/Tooling/Syntax/Pseudo/GLRParser.h:112 + + const ForestNode *parse(const TokenStream &Code); + ---------------- as discussed offline, the signature here should involve a token range, or likely an ArrayRef<ForestNode> for the relevant terminals. Probably a FIXME to allow the start symbol to be specified. ================ Comment at: clang/include/clang/Tooling/Syntax/Pseudo/GLRParser.h:114 + + const Graph &getGSS() const { return GSS; } + ---------------- sammccall wrote: > this function never exposes an interesting graph, as it's mostly empty before > + after parse(). > It's only really useful for working out how much memory is allocated. > > If we were to drop this, the public interface of GLRParser would be a single > function (so it need not be a public class) and Graph would disappear > entirely! > > With this in mind, it seems like we could replace this header file with > something like: > > ``` > struct ParseStats { > unsigned GSSBytes = 0; > }; > ForestNode *glrParse(const TokenStream &, const LRTable&, const Grammar&, > ForestArena&, ParseStats *Stats = nullptr); > ``` > > It feels like hiding the GSS might be an obstacle to debugging/testing. > However this really would need a story/API around how to interrupt the parser. > LLVM_DEBUG seems reasonable to me for now. After offline discussion I think we either want: - to hide GSS as mentioned above, and just test the forest output - expose GSS class, and have a two versions of the parse function: one that finalizes the GSS into a result ForestNode, and a "raw" one that just returns the GSS. Then we can get at the GSS state at a point by running the raw parser on some prefix of the code. (Or even *only* have the raw one and make it the caller's responsibility to extract the node from the GSS, but this is probably silly) ================ Comment at: clang/include/clang/Tooling/Syntax/Pseudo/GLRParser.h:129 + // Frontier is to track all avaiable actions from all active stack heads. + struct Frontier { + // A corresponding stack head. ---------------- I love the name frontier! Unfortunately the word refers to the whole boundary/set, rather than individual elements of it. Maybe this? ``` /// The list of actions we're ready to perform. struct { std::vector<Pending> Shift; std::vector<Pending> Reduce; std::vector<Pending> Accept; bool empty(); // accept is an odd-one-out here, but I think it's going away? } Frontier; ``` Also maybe a FIXME that the frontier needs to order reductions so that local ambiguity packing is guaranteed to work as a single pass ================ Comment at: clang/include/clang/Tooling/Syntax/Pseudo/GLRParser.h:133 + // An action associated with the Head. + const LRTable::Action *PerformAction = nullptr; + }; ---------------- just Action? ================ Comment at: clang/lib/Tooling/Syntax/Pseudo/Forest.cpp:30 + } + llvm_unreachable("unhandle node kind!"); +} ---------------- unhandled ================ Comment at: clang/lib/Tooling/Syntax/Pseudo/GLRParser.cpp:33 + ParsedForest.init(Code); + const Token *Lookahead = &Code.tokens().front(); + addActions(GSS.addNode(/*StartState*/ 0, nullptr, {}), *Lookahead); ---------------- sammccall wrote: > It actually currently looks like it would be at least as natural to have the > parser operate on the sequence of terminal ForestNodes rather than on the > Tokens themselves... I'd suggest calling this NextTok instead of Lookahead: - in common english this is a verb rather than a noun - in parsers it more often refers to the *number* of tokens in the tables than the tokens themselves - referring to the token you're *currently shifting* as "ahead" is bizarre! ================ Comment at: clang/lib/Tooling/Syntax/Pseudo/GLRParser.cpp:67 + +std::vector<const Graph::Node *> +GLRParser::performShift(Token::Index Lookahead) { ---------------- if this is a hot loop we shouldn't be creating std::vectors and throwing them away ================ Comment at: clang/lib/Tooling/Syntax/Pseudo/GLRParser.cpp:80 + + // Merge the stack -- if multiple stack heads are going to shift a same + // state, we perform the shift only once by combining these heads. ---------------- we shift *tokens*, not states: the token conceptually moves from the input to the stack. (This seems to be other references not just me!) if multiple stack heads will reach the same state after shifting a token? ================ Comment at: clang/lib/Tooling/Syntax/Pseudo/GLRParser.cpp:85 + // state 2 and state 3: + // 0 -> 1 -> 2 + // ` -> 3 ---------------- I think the arrows directions aren't particularly clear (our pointers go the opposite way) or necessary here, and backticks and up arrows are hard to read. A couple of (widely-supported) box-drawing characters would help a lot I think: ``` 0--1--2 └--3 0--1--2--4 └--3--┘ ``` (I would keep using `-` and `|` rather than making everything pretty boxes, it's readable enough and easier to edit. We may want to consider box-drawing characters for dump functions though...) ================ Comment at: clang/lib/Tooling/Syntax/Pseudo/GLRParser.cpp:93 + // Shifts are partitioned by the shift state, so each partition (per loop + // iteration) corresponds to a "perform" shift. + llvm::sort(ShiftList, [](const Frontier &L, const Frontier &R) { ---------------- I can't parse `a "perform" shift`. Batch shifts by target state so we can merge matching groups? ================ Comment at: clang/lib/Tooling/Syntax/Pseudo/GLRParser.cpp:97 + R.PerformAction->kind() == LRTable::Action::Shift); + return std::forward_as_tuple(L.PerformAction->getShiftState(), L.Head) < + std::forward_as_tuple(R.PerformAction->getShiftState(), R.Head); ---------------- I don't think tiebreaking by `Head` achieves anything. If we really need to ensure deterministic behavior, comparing pointers won't do that as allocation may vary. On the other hand, `stable_sort` should work: we'll tiebreak by order enqueued, so we're deterministic if our inputs are. ================ Comment at: clang/lib/Tooling/Syntax/Pseudo/GLRParser.cpp:108 + // Predecessors of the new head in GSS. + std::vector<const Graph::Node *> Predecessors; + llvm::for_each(Batch, [&Predecessors](const Frontier &F) { ---------------- SmallVector ================ Comment at: clang/lib/Tooling/Syntax/Pseudo/GLRParser.cpp:110 + llvm::for_each(Batch, [&Predecessors](const Frontier &F) { + assert(llvm::find(Predecessors, F.Head) == Predecessors.end() && + "Unexpected duplicated stack heads during shift!"); ---------------- llvm::is_contained ================ Comment at: clang/lib/Tooling/Syntax/Pseudo/GLRParser.cpp:114 + }); + const auto *Head = GSS.addNode(NextState, Leaf, Predecessors); + LLVM_DEBUG(llvm::dbgs() ---------------- if this turns out to be hot, you could avoid the temporary array of predecessors: return a mutable node from addNode ================ Comment at: clang/lib/Tooling/Syntax/Pseudo/GLRParser.cpp:128 +static std::vector<std::string> +getStateString(llvm::ArrayRef<const Graph::Node *> A) { + std::vector<std::string> States; ---------------- name doesn't match return type ================ Comment at: clang/lib/Tooling/Syntax/Pseudo/GLRParser.cpp:131 + for (const auto &N : A) + States.push_back(llvm::formatv("state {0}", N->State)); + return States; ---------------- I suspect `S{0}` will be verbose enough, it's not likely spelling out `state` will make this much less cryptic, but it may be harder to scan ================ Comment at: clang/lib/Tooling/Syntax/Pseudo/GLRParser.cpp:141 + assert(PathStorage.empty() && "PathStorage must be empty!"); + std::function<void(const Graph::Node *, unsigned)> enumPath = + [&CB, &PathStorage, &enumPath](const Graph::Node *Current, ---------------- We're leaning on std::function a lot for code that's supposedly in the hot path. It's probably fine, but I'd like to see this written more directly. As far as I can tell, this could easily be a class, with enumerateReducePath a (recursive) method that concretely calls into handleReducePath() or something at the bottom. Also if it's a class we can easily reset() instead of destroying it if we want to reuse its vectors across separate reductions. ================ Comment at: clang/lib/Tooling/Syntax/Pseudo/GLRParser.cpp:158 +// heads. +void GLRParser::performReduction(const Token &Lookahead) { + if (!ReduceList.empty()) ---------------- (just a note, I need to dig into this part tomorrow!) ================ Comment at: clang/tools/clang-pseudo/ClangPseudo.cpp:33 desc("Print the LR table for the grammar")); +static opt<std::string> ParseFile("parse", desc("Parse a C++ source file"), + init("")); ---------------- nit: a C++ source file => a source file ================ Comment at: clang/tools/clang-pseudo/ClangPseudo.cpp:33 desc("Print the LR table for the grammar")); +static opt<std::string> ParseFile("parse", desc("Parse a C++ source file"), + init("")); ---------------- sammccall wrote: > nit: a C++ source file => a source file Why is this a new flag independent of `Source`? It also seems inconsistent with other flags, which describe what output is requested rather than what computation should be done. Maybe "-print-forest" and "-print-stats"? (which could also potentially control stats of other stages) Repository: rG LLVM Github Monorepo CHANGES SINCE LAST ACTION https://reviews.llvm.org/D121150/new/ https://reviews.llvm.org/D121150 _______________________________________________ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits