Author: Sam McCall Date: 2022-05-18T00:08:47+02:00 New Revision: 79ca4ed3e782e505e8964a3c968de5fd4f09ca7a
URL: https://github.com/llvm/llvm-project/commit/79ca4ed3e782e505e8964a3c968de5fd4f09ca7a DIFF: https://github.com/llvm/llvm-project/commit/79ca4ed3e782e505e8964a3c968de5fd4f09ca7a.diff LOG: [pseudo] Design notes from discussion today. NFC Added: clang-tools-extra/pseudo/DesignNotes.md Modified: Removed: ################################################################################ diff --git a/clang-tools-extra/pseudo/DesignNotes.md b/clang-tools-extra/pseudo/DesignNotes.md new file mode 100644 index 000000000000..421cc02aef75 --- /dev/null +++ b/clang-tools-extra/pseudo/DesignNotes.md @@ -0,0 +1,123 @@ +# Error recovery (2022-05-07) + +Problem: we have two fairly generic cases of recovery bounded within a range: + - sequences: `int x; this is absolute garbage; x++;` + - brackets: `void foo() { this is garbage too }` + +So far, we've thought of these as diff erent, and had precise ideas about +brackets ("lazy parsing") and vague ones about sequences. +Con we unify them instead? + +In both cases we want to recognize the bounds of the garbage item based on +basic token-level features of the surrounding code, and avoid any interference +with the surrounding code. + +## Brackets + +Consider a rule like `compound-stmt := { stmt-seq }`. + +The desired recovery is: +- if we fail at `{ . stmt-seq }` +- ... and we can find for the matching `}` +- then consume up to that token as an opaque broken `stmt-seq` +- ... and advance state to `{ stmt-seq . }` + +We can annotate the rule to describe this: `{ stmt-seq [recovery] }`. +We can generalize as `{ stmt-seq [recovery=rbrace] }`, allowing for diff erent +**strategies** to find the resume point. + +(It's OK to assume we're skipping over one RHS nonterminal, we can always +introduce a nonterminal for the bracket contents if necessary). + +## Sequences + +Can we apply the same technique to sequences? +Simplest case: delimited right-recursive sequence. + +``` +param-list := param +param-list := param , param-list +``` + +We need recovery in **both** rules. +`param` in the first rule matches the *last* param in a list, +in the second rule it matches all others. We want to recover in any position. + +If we want to be able to recovery `int x int y` as two parameters, then we +should extract a `param-and-comma` rule that recovery could operate on. + +### Last vs not-last elements + +Sequences will usually have two rules with recovery, we discussed: + - how to pick the correct one to recover with + - in a left-recursive rule they correspond to last & not-last elements + - the recovery strategy knows whether we're recoverying last or not-last + - we could have the strategy return (pos, symbol parsed), and artificially + require distinct symbols (e.g. `stmt` vs `last_stmt`). + - we can rewrite left-recursion in the grammar as right-recursion + +However, on reflection I think we can simply allow recovery according to both +rules. The "wrong" recovery will produce a parse head that dies. + +## How recovery fits into GLR + +Recovery should kick in at the point where we would otherwise abandon all +variants of an high-level parse. + +e.g. Suppose we're parsing `static_cast<foo bar baz>(1)` and are up to `bar`. +Our GSS looks something like: + +``` + "the static cast's type starts at foo" +---> {expr := static_cast < . type > ( expr )} + | "foo... is a class name" + +---- {type := identifier .} + | "foo... is a template ID" + +---- {type := identifier . < template-arg-list >} +``` + +Since `foo bar baz` isn't a valid class name or template ID, both active heads +will soon die, as will the parent GSS Node - the latter should trigger recovery. + +- we need a refcount in GSS nodes so we can recognize never-reduced node death +- when a node dies, we look up its recovery actions (symbol, strategy). + These are the union of the recovery actions for each item. + They can be stored in the action table. + Here: `actions[State, death] = Recovery(type, matching-angle-bracket)` +- we try each strategy: feeding in the start position = token of the dying node + (`foo`) getting out the end position (`>`). +- We form an opaque forest node with the correct symbol (`type`) spanning + [start, end) +- We create a GSS node to represent the state after recovery. + The new state is found in the Goto table in the usual way. + +``` + "the static cast's type starts at foo" +---> {expr := static_cast < . type > ( expr )} + | "`foo bar baz` is an unparseable type" + +---- {expr := static_cast < type . > (expr)} +``` + +## Which recovery heads to activate + +We probably shouldn't *always* create active recovery heads when a recoverable +node dies (and thus keep it alive). +By design GLR creates multiple speculative parse heads and lets incorrect heads +disappear. + +Concretely, the expression `(int *)(x)` is a valid cast, we probably shouldn't +also parse it as a call whose callee is a broken expr. + +The simplest solution is to only create recovery heads if there are no normal +heads remaining, i.e. if parsing is completely stuck. This is vulnerable if the +"wrong" parse makes slightly more progress than the "right" parse which has +better error recovery. + +A sophisticated variant might record recovery opportunities and pick the one +with the rightmost *endpoint* when the last parse head dies. + +We should consider whether including every recovery in the parse forest might +work after all - this would let disambiguation choose "broken" but likely parses +over "valid" but unlikely ones. + + _______________________________________________ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits