Hi Alex,

At 2024-02-16T16:21:45+0100, Alejandro Colomar wrote:
> I've been thinking about a suggestion I've made in the past.  I wanted
> a program that reads man(7) source and produces roff(7) source, so
> that it can later be passed to troff(1), thus splitting the groff(1)
> pipeline a bit more.  The idea is similar to how eqn(1) and other
> pre-troff filters do their job.
Right.  I had a similar desire when I first came to groff development.

> The purposes of this are debugging, and also learning roff(7) starting
> from a decent understanding of a given macro package.
>
> I'll reword my suggestion slightly differently, which maybe makes more
> sense to you:
>
> Would you add a -Troff output device that prints roff(7) source?  Or
> maybe a --roff-out flag that modifies the behavior of troff(1) to
> print roff(7) out instead of its usual job.

I am unlikely to undertake this.  It's not that it's a bad idea; it's
that it's hard.  Please excuse an excursion into parser theory,
presented by a stumbling amateur in the field.

The way a *roff parser works--_any_ *roff parser I know of except
possibly mandoc(1)'s--makes this nearly intractable.  And that in turn
is because the roff language does not have a formal grammar; if someone
did come up with one, it would, I predict, be fearsomely complex.

Recall that Ossanna's nroff, which pretty much established the grammar
of the roff language as we know it today, dates back to 1972 or 1973.
That's before lex.  That's even before yacc.  That's before the first
editions of major texts in parser theory now taught to computer science
undergraduates were even written.

I don't know that roff can be categorized in one of the traditional
grammar classes like LALR(1) or LL(1).  If "GLR" means what it says
(likely), "generalized left-right grammar", and I fully understand what
that means (less likely), I suppose it's one of those.  Barring my
acquisition of greater facility with parser theory, I would describe
the roff language's grammar as "ad hoc".

That, and not performance considerations, is, I think, why everybody
who's written a roff parser has used recursive descent; that approach
is amenable to handling special cases anywhere you want them, at the
significant cost of making a succinct, abstract description of the
grammar progressively less feasible with every exception to the
general rules.
A bit more theory before I get to the concrete.  A fundamental paradigm
in language parsing is, upon consumption of the next token in a
sequence, deciding whether to "shift" it or "reduce" it.  To my
understanding, "shifting" means, roughly, that we can't decide yet how
to represent the pending expression in our abstract syntax tree (AST).
A common approach is to push the token onto a stack, whence we will pop
it later, after seeing more tokens that enable us to "reduce" it.

To generalize and analogize, let's consider an arithmetic expression in
a roff-like language.  (I won't promise that this will precisely match
what GNU troff does.)  Remember that roff arithmetic is evaluated
strictly from left to right, but with parentheses recognized to impose
a precedential ordering.

  20/(3 + 1)

The first thing we see is a numeric literal.  Great.  That has an easy
representation in underlying languages all the way down to the machine
level.  That "reduces", so we put it in our abstract syntax tree.[1]
The next thing we see is a division operator.  That, too, is a
meaningful token by itself, so it "reduces".

The next thing is an open parenthesis.  Wuh-oh.  There are no
parentheses in machine instructions.  We can't turn this into a node of
our abstract syntax tree.  So we "shift" it (onto a stack, say), in
hope that we'll later be able to make sense of it.  The next few
tokens?

  3 REDUCE
  + REDUCE
  1 REDUCE

Now the closing parenthesis.  This is the mate of our opening
parenthesis, which we had hoped for.  Since, for the purposes of this
illustration, we're not doing any arithmetic evaluation ourselves, what
we _don't_ do is further reduce "3", "+", and "1".  Instead we populate
our AST, arranged such that whatever interprets it (a code generator)
will sequence the instructions appropriately.  For a simple stack-based
machine, that would look something like this.
  PUSH 20
  PUSH 3
  PUSH 1
  ADD          (stack now looks like 20, 4)
  DIV          (stack now looks like 5)
  POP
  STORE $8000  (i.e., write 5 to memory at 0x8000)

So, why the big-ass excursion into 2nd/3rd year CS student material?
With any luck you'll have seen a hint.  Because interpolations are
performed recursively at the point they are encountered, the
information you want no longer exists by the time the parser is done
evaluating it.

Reading the stack machine output above, tell me where the parentheses
were in the source language without having seen it.  Can you do it?
Hmm, actually you probably can.  Maybe this is why they teach parsers
for simple arithmetic first.

Okay.  Let's posit a slightly more sophisticated language, like
Kernighan's hoc[2].  Let's have variables.

  a := 3
  b := 1
  c := 20/(a + b)

Let's imagine that this parses into AST nodes like this.  This is not
the same language as my pretend machine language above, but I've
supplied some parenthetical hints.

  ref a
  3
  ref b
  1
  20          (reduce -> PUSH)
  deref a  3  (reduce -> PUSH)
  deref b  1  (reduce -> PUSH)
  +           (reduce -> PUSH)
  /           (reduce -> PUSH)
  pop
  pop
  ref c
  %accum      (notation for a memory location like 0x8000 where our
               language runtime keeps a copy of the top of the stack
               for values of interest)

This will produce much the same machine language.

  PUSH 20
  PUSH 3
  PUSH 1
  ADD          (stack now looks like 20, 4)
  DIV          (stack now looks like 5)
  POP
  STORE $8000  (i.e., write 5 to memory at 0x8000)

Okay.  _Now_ can you reconstruct the original source from these
machine instructions?  No.  The evidence of the use of variables has
been effaced; we don't even know what their names are.  And so it is
with the roff languages.  An arbitrary number of macro expansions may
have taken place.

(Aside: one may wonder why I'm presenting three whole ad hoc languages
to explain the background here.  Good question.  My answer is a
somewhat weak "that's how everybody else presents it".  My
understanding is that, as a rule, this is a result of demand on the
part of programmers.
Generally, they seem to want to maintain programs in languages that are
more "expressive" than an abstract syntax tree.  AIUI, Forth and Lisp
_much_ more closely resemble the AST; in fact this is, if I'm not
mistaken, what's behind one of Alan Kay's famous insights about Lisp:
something about its S-expressions and ASTs being interconvertible, or
an obvious one-to-one mapping between a program's representation in
memory and the visible expression of it on the terminal.  Something
like that.  I confess to embarrassment at how much of a Philistine I
come across as in this paragraph.  No great sophisticate am I.)

I'm not saying it is impossible to do what you want (or what I wanted,
years ago).  By going carefully through the parser and adding "hooks"
to preserve input tokens at key locations, and/or building a list of
them as they are encountered, and disposing of them only once output
is flushed, I imagine it could be achieved.  But it would be a
significant effort and demand deep familiarity with the entire GNU
troff parser, not just the bulk of it in input.cpp.  My own knowledge
of it is far advanced over what it was 5 years ago, but I do not feel
equal to the task of scaling that mountain yet.

So I think this is unlikely to happen soon.  In the meantime, perhaps a
passing CS professor will, with exasperated patience, correct the
numerous howlers I managed to utter above.

Regards,
Branden

[1] *roff parsers don't delegate arithmetic evaluation all the way to
    the machine.  They translate it into C expressions instead and, I
    think, use a separate, temporary stack for evaluation.  But the
    principle holds, in my opinion.

[2] ...of whose presentation in _The Unix Programming Environment_ my
    labored explanation is but a pale shadow.  I wish someone would
    update that book for modern times; the currents of history have
    been particularly cruel to its old-school lex and yacc usage,
    which is some of the most valuable material in it.
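P.S. For anyone who wants to poke at the shift/reduce walk-through
above, here is a minimal sketch in Python.  It is a hypothetical
illustration of the general technique, not GNU troff's actual code:
the tokenizer, the instruction names, and the way I've encoded the
strictly left-to-right evaluation rule are all my own simplifications.

```python
import re

# Hedged sketch, not GNU troff's implementation: a one-pass,
# left-to-right parser for roff-style arithmetic.  Numeric literals
# "reduce" immediately (become instructions); operators and open
# parentheses are "shifted" onto a stack until later tokens let us
# reduce them.  The output is a flat stack-machine instruction list,
# in which (as argued above) the parentheses leave no trace.

OPS = {"+": "ADD", "-": "SUB", "*": "MUL", "/": "DIV"}

def parse(expr):
    code = []     # reduced tokens: the emitted instructions
    shifted = []  # shifted tokens: deferred operators and "("
    for tok in re.findall(r"\d+|[()+*/-]", expr):
        if tok.isdigit():
            code.append(("PUSH", int(tok)))        # reduce
        elif tok == "(":
            shifted.append(tok)                    # shift, hoping for a mate
        elif tok == ")":
            while shifted[-1] != "(":              # reduce back to the mate
                code.append((OPS[shifted.pop()],))
            shifted.pop()                          # discard the "("
        else:
            # Strictly left to right: an operator already pending at
            # this parenthesis depth reduces before the new one shifts.
            if shifted and shifted[-1] != "(":
                code.append((OPS[shifted.pop()],))
            shifted.append(tok)
    while shifted:                                 # reduce what remains
        code.append((OPS[shifted.pop()],))
    return code

def run(code):
    """Interpret the instruction list on a simple stack machine."""
    stack = []
    for insn in code:
        if insn[0] == "PUSH":
            stack.append(insn[1])
        else:
            b, a = stack.pop(), stack.pop()
            stack.append({"ADD": a + b, "SUB": a - b,
                          "MUL": a * b, "DIV": a // b}[insn[0]])
    return stack.pop()

code = parse("20/(3 + 1)")
print(code)       # [('PUSH', 20), ('PUSH', 3), ('PUSH', 1), ('ADD',), ('DIV',)]
print(run(code))  # 5
```

Note that parse("20/3+1") and parse("(20/3)+1") emit identical
instruction lists; the parentheses, like the variable names in the hoc
example, are unrecoverable from the output.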