Hi Erik,

At 2025-12-08T11:03:42+0000, dvalin--- via GNU roff typesetting system 
discussion wrote:
> With the obscuring side-effect bug cleared, you're down to the grammar
> issue now.  In the old days I was used to seeing "shift-shift" or
> "shift-reduce" as conflict description, helping to figure out how to
> tweak either the input or the grammar.

I believe you're thinking of shift/reduce versus reduce/reduce
conflicts.  I've never heard of a shift/shift conflict.

If we think of a parser as using a stack, then to _shift_ means to push
an item onto that stack because current token being read (when
considering the contents of the stack) does not suffice to _reduce_ it,
or determine a "node" in the abstract syntax tree.

An example of a reduce/reduce conflict would be a token '*' having
valid meaning both as an integer variable name and as a multiplication
operator, such that one might intend the expression "* * *" to "reduce"
to two variable references and a multiplication operator in between, and
for them to be added to the abstract syntax accordingly.  When the
expression is evaluated, the result would be '*' (the variable) squared,
some integral value.

My training is that reduce/reduce conflicts indicate a "broken" grammar,
one where the parser thinks it knows enough to reduce the symbol, but
has multiple rules for doing so, and can't decide between them.  For
ease of both human and machine parsing, language designers have a bias
toward context-free grammars.

(As I understand it, C famously has a deeply inherent reduce/reduce
conflict; most identifiers that aren't reserved words can reduce as type
names _or_ as symbol names.  How did C survive this?  Well, this was
Bell Labs, where people thought nothing of entangling their lexical
analysis with their parsing, as nroff did and does.  The solution is
known as "the lexer hack".[1])

> Yacc has only single-token lookahead, my 71 yo brain reminds from my
> last venture a decade ago,

The famous LALR(1) "lookahead left/right (1 token)" flavor of parser,
yes.  I believe GNU Bison can produce a GLR (generalized left-right)
parser--I guess that's one that can scan arbitrarily far both left and
right--but I've never needed to exercise that feature in anger.

With limited lookahead, including a limit of zero, a different form of
conflict arises: the shift/reduce conflict.  Unlike reduce/reduce
conflicts, so the story goes, a shift/reduce conflict is not necessarily
fatal to your grammar in the marble halls where Turing and Chomsky sit.

The more lookahead your parser has, the more likely it can eliminate the
conflict through consideration of context--even looking only one token
ahead.

Here is a notional example for contemplation purposes, not necessarily
resembling any real parser.  Without context (or "lookahead"), the C
language would have shift/reduce conflicts with the "*" symbol I chose
above in the following contexts.

  const int b = 4 * 2;
  int d = *c;
  int f = e /* my comment */ + 2;

...but it doesn't, because any C parser has context.  In the first
example an identifier token is a "terminal" symbol and is reduced, or
put onto the syntax tree.[2]

In the second example, the parser encounters a star but the most recent
terminal symbol is not an identifier or literal, so, contextually,
multiplication is foreclosed, and the '*' reduces to a dereferencing
operator, which is unary and right-associative.

In the third example, we're being really tricky.  We've just added the
terminal symbol for the identifier "e" to the syntax tree and we're
ready for an operator.  And we get one!  But we can't be sure the '/' is
really a division operator.  We must look ahead in the input stream,
taking a peek before we decide.  Aha!  A '*' follows the '/', so we know
we don't have an arithmetic operator at all.  In fact, we have the start
of a comment, which means we neither shift nor reduce, but discard.  (I
suppose we could also think of this process as "reducing" the input to
nullity--until the comment terminator is encountered.)

_Without_ lookahead, we'd have a shift/reduce conflict thanks to the
multiple parsing rules involving '/' and '*'.  Notice, though, that if
we imagine a world where nobody ever comments their code--call it
Brogrammer Utopia--this shift/reduce conflict is only notional, and will
not pose a problem for the Bro C compiler.

I have read that a parser for RFC 822 dates, the classic Unix date
format(s), has at least one shift/reduce conflict, but since no human or
program constructs dates in an ambiguous form, we simply live with any
shift/reduce conflicts our parser generators dutifully warn us about.

That's some parser theory as told by a ham-fisted practitioner.
Compiler experts and CS professors might, as we speak, be drawing hot
baths with Calgon to take them away.

> so if the rules for ORDINAL and LAST are ambiguous, then you'll have
> such a conflict. Sight unseen, my WAG would be that the rules for
> ORDINAL and LAST might share tokens or subordinate rules with
> insufficient distinction?

That sounds likely.  But I haven't really learned pic(1) yet, so I don't
understand its grammar, and consequently I'm not prepared to tidy it up.

> The wording of the warning intimates not only such an ambiguity, but
> that it is so fundamental that the yacc maintainer is unimpressed.
> That might indicate it's not too hard to spot? (If we can remember
> this stuff.) Anyone like to email pic.ypp, if convenient? (It'd be
> quite a while before I could find round tuits to figure out git to
> grab the source and do any serious digging, but a look can't hurt, if
> someone hasn't fixed it already.¹)

Here's a hyperlink to "pic.ypp" as of the most recent groff release.
(Selecting that keeps the link stable in the event we ever rearrange the
tree.  Also, this file has seen few non-cosmetic changes since then.)

https://cgit.git.savannah.gnu.org/cgit/groff.git/tree/src/preproc/pic/pic.ypp?h=1.23.0

> ¹ A 1.2T mini electric excavator is arriving this week, for clearing
> 1.7 km of fenceline scrub, then I run a new 5 km fence with
> post-insulated wires for a solar powered 6-sector alarm. The design
> and PCB are done, but I now need to program the ATmega328P (same is in
> Arduino) to report via 433 MHz radio if the firewood pirates make
> another convoy raid into the 200 Ha of native forest. (Police not
> interested after the event.)

I'll bet you a bottle of back country moonshine that some of the
police/sheriff's deputies are among the tree pirates.

> And the tractor shed is only 1/8 built, though I've cut half the poles
> from the forest and dragged them home.  It all takes longer now.

Sounds like you're doing a more honest day's work than I am.  :)

Anyway, I promise myself that some day I'll truly test my mettle against
the Dragon Book cover to cover.  And maybe I will.  Occasionally, I
accomplish a task that I set for myself 20 or 30 years in the past.

But, I admit, not often enough.

Regards,
Branden

[1] https://en.wikipedia.org/wiki/Lexer_hack

    This could have been avoided via the simple expedient of introducing
    a prefix sigil to mark either symbol or type identifiers (the other
    could be left unadorned).

    If I ever get my 1.2T mini electric time machine working, I'm going
    to drive that thing across the country like Alvin F. Straight to
    Murray Hill in 1973 and give Thompson and Ritchie a piece of my
    mind.  Here's how I imagine it going.

    GBR: It's gross to have `cc` check a list the parser has built of
    already-seen type names to disambiguate identifiers.  Just stick a
    dollar sign in front of either type or variable names.  It's easy.

    K&T: That sounds like more typing.  More typing is bad.  Haven't you
    noticed that all the best Unix commands have only two-letter names?
    And that one of us avoids pressing the shift key wherever possible?

    GBR: Fine, use the sigil only for type names.  Types are named less
    often than symbols in most code.

    K&T: No.  It's still too much.

    GBR: Even though you guys don't use anything but a machine integer
    for anything that isn't floating point or a struct?  No Booleans, no
    bounds on enums--

    K&T: That's right.  The very notion of sticking a dollar sign in
    front of a type name is laughable.  There's no money in strong
    typing.

    GBR: ...uh, I'll just be getting back in my time machine now.

    K&T: That's right, kid--nobody should have to live through the
    16-bit address space era more than once.

    <vworp vworp vworp>

[2] A modern compiler infers the existent of `constexpr`s--constant
    expressions, and performs the evaluation itself...in a way that had
    better be 100% compatible with language and target machine
    arithmetic.

Attachment: signature.asc
Description: PGP signature

Reply via email to