Re: easing access to bit vectors in the *roff language (was: Need background on design of groff character classes)

G. Branden Robinson Wed, 03 Dec 2025 18:03:59 -0800

Hi Erik,

Some lengthy musings on GNU troff and programming languages follow.

At 2025-12-02T11:08:49+0000, dvalin--- via GNU roff typesetting system 
discussion wrote:
> On 02.12.25 03:42, G. Branden Robinson wrote:
> > Off-topic, but I'm curious...where did this practice of suffixing
> > user names with multiple hyphen-minuses come from?
> 
> Nary the foggiest, Branden - here in your post's quote is the first
> time I've seen it. It does not appear in the Bcc back here, of that
> post. Mystery == Total.

Huh.  I guess it's something Mailman (the mailing list management
software GNU uses) does for some reason.

> > .warn +delim
> > .warn -char
> > .warn =all
> > .warn =w-delim
> > .warn -w+syntax+reg+mac
> >
> > All of these are currently invalid syntax, so this should be a
> > compatible extension.
> 
> That syntax hides the OR/AND in a most user-friendly fashion, I think.
> (And I'm inordinately fond of explicit human-readable syntax.)

I would say that most humans are, but the legions of International
Obfuscated C Code Contest fans, and old-school Perl hackers, would give
the lie to my claim.

> If there's less wrangling of a hand-knitted parsing engine, then
> that's so much more viable, I figure.

Well, it'd take non-trivial initial wrangling to create a new kind of
expression, which we could call a "flag manipulation expression" or
similar.  But if the function interpreting that expression type were
designed well, it would be general-purpose (perhaps taking an array of
structs pairing the configurable flag names with their values, and
returning an integer), and could later be applied to `cflags` and the
notional "next-generation hyphenation control" request I've been
ruminating about for years.

https://savannah.gnu.org/bugs/?55070
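
Here is a minimal sketch in C of what such a general-purpose function
could look like.  Everything in it is an assumption made for
illustration: the names `flag_name`, `warning_flags`, and
`apply_flag_expression` are hypothetical, the table is a small subset,
and the bit values and the membership of `all` and `w` are not taken
from GNU troff's sources.  It deviates slightly from "returning an
integer" by returning a status and updating the flags through a pointer,
so that a malformed expression can be rejected without touching the
current value.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical pairing of a configurable flag name with its bit value. */
struct flag_name {
    const char *name;
    unsigned int value;
};

/* Illustrative subset only; the values are assumptions for this sketch. */
const struct flag_name warning_flags[] = {
    { "char",   1u },
    { "delim",  8u },
    { "syntax", 128u },
    { "mac",    512u },
    { "reg",    1024u },
    { "all",    1u | 8u | 128u },
    { "w",      1u | 8u | 128u | 512u | 1024u },
};

#define WARNING_FLAG_COUNT (sizeof warning_flags / sizeof warning_flags[0])

/* Find a name of exactly `len` bytes; return 1 and store its value. */
static int lookup_flag(const struct flag_name *table, size_t count,
                       const char *name, size_t len, unsigned int *value)
{
    for (size_t i = 0; i < count; i++) {
        if (strlen(table[i].name) == len
            && strncmp(table[i].name, name, len) == 0) {
            *value = table[i].value;
            return 1;
        }
    }
    return 0;
}

/*
 * Interpret an expression such as "+delim", "=w-delim", or
 * "-w+syntax+reg+mac" from left to right.  '+' sets bits, '-' clears
 * them, and '=' replaces the whole value.  Returns 0 on success; on a
 * syntax error or unknown name, returns -1 and leaves *flags untouched.
 * Because '-' is an operator, this sketch cannot handle flag names that
 * themselves contain a hyphen.
 */
int apply_flag_expression(const struct flag_name *table, size_t count,
                          const char *expr, unsigned int *flags)
{
    unsigned int result = *flags;
    const char *p = expr;

    assert(table != NULL);
    while (*p != '\0') {
        char op = *p++;
        if (op != '+' && op != '-' && op != '=')
            return -1;
        size_t len = strcspn(p, "+-=");
        unsigned int value;
        if (len == 0 || !lookup_flag(table, count, p, len, &value))
            return -1;
        if (op == '+')
            result |= value;
        else if (op == '-')
            result &= ~value;
        else
            result = value;
        p += len;
    }
    *flags = result;
    return 0;
}
```

With this table, applying `=w-delim` leaves the flags at 1665 (every bit
of `w` except `delim`), and the same function could be handed a
different table for `cflags` or hyphenation modes.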

> You've won my vote, though it's only that of a groff beginner
> trampling the edges of the maze, when time permits.

Everybody starts somewhere!

> Put it down to C exposure, but "==" for equality test, and "=" seem
> nearly good enough. However, ":=" and "==" is a more secure
> combination, as omission of one "=" doesn't silently convert a test to
> an assignment. (Who has never had to debug that one?)

I have a habit of writing any equality comparison between an lvalue and
a non-lvalue with the non-lvalue first, so that the compiler can help me
catch this problem.

So while people (including myself) feel instinctively that:

        if (ch == 'c')

is idiomatic, I prefer to write:

        if ('c' == ch)

instead.
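
To spell out why the ordering helps, here is a small sketch; the
function names are hypothetical and exist only for this illustration.
With the constant first, a dropped "=" becomes an assignment to a
non-lvalue, which the compiler must reject.  With the variable first,
the same typo is still valid C.

```c
#include <stdbool.h>

bool is_c_constant_first(char ch)
{
    /* Dropping one '=' here would yield `'c' = ch`, which is a
       compile-time error: a character constant is not an lvalue. */
    return 'c' == ch;
}

bool is_c_with_typo(char ch)
{
    /* Intended `ch == 'c'`, but one '=' went missing.  This still
       compiles: it assigns 'c' to ch and tests the nonzero result, so
       the branch is always taken.  The doubled parentheses are only
       here to keep compiler warnings quiet in this demonstration. */
    if ((ch = 'c'))
        return true;
    return false;
}
```

Calling `is_c_with_typo('x')` returns true, which is exactly the silent
test-to-assignment conversion described above.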

> Haven't done that, but I did write a Z80 two-pass assembler,
> pre-millennium. (Don't tell anyone, but that was in BASIC.)

I wonder if anyone ever claimed to you that that was impossible--one of
those "tell me you don't understand Turing-completeness without saying
you don't understand Turing-completeness" moments...

> I've done a bit of lex & yacc - enough to get the hang of its limited
> debugging. A formal grammar is a stricture, but sanity-preserving in
> the long run.

There was an interesting discussion on the TUHS list a while back about
how yacc delivered some drawbacks with its advantages; that its power
and declarative expressiveness led to many good and useful new "little
languages" (and some not so little), but also to some that probably
should never have been made.

As I understand it, since the 1990s or so the trend has been away from
yacc grammars for "production-level" language compilers back toward
hand-written recursive descent parsers (not necessarily including
lexical analysis, which is still advantageous to keep separate, cf.
*roff's ad hoc mixture of the two processes).  My understanding is that
this movement was largely spurred by the difficulty of producing
intelligible diagnostics about semantic errors when using yacc, because
again AIUI, a yacc-generated parser is (typically) so context-free that
it can't tell you much about its input except that it's non-conforming.
I suspect, but do not know, that the addition of templates to C++
compilers in the 1990s, a feature even more powerful than planned
(Turing-completeness rears its head again),[4] made for a burnt-fingers
experience among compiler vendors.
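
As a rough illustration of the diagnostics point, here is a sketch of a
hand-written recursive descent parser in C for a toy grammar (sums of
non-negative integers with parentheses).  The grammar, the `parser`
struct, and the function names are all assumptions of mine and have
nothing to do with *roff.  What it shows is that each parsing function
knows which construct it is in the middle of, so it can say where an
unclosed parenthesis was opened instead of merely rejecting the input.

```c
#include <ctype.h>
#include <stdio.h>

struct parser {
    const char *input;
    size_t pos;
    char error[96];   /* empty string while no error is recorded */
};

static void skip_spaces(struct parser *p)
{
    while (p->input[p->pos] == ' ')
        p->pos++;
}

static long parse_sum(struct parser *p);

/* term := NUMBER | '(' sum ')' */
static long parse_term(struct parser *p)
{
    skip_spaces(p);
    char c = p->input[p->pos];
    if (c == '(') {
        size_t open = p->pos++;   /* remember where the group began */
        long v = parse_sum(p);
        skip_spaces(p);
        if (p->error[0] == '\0') {
            if (p->input[p->pos] == ')')
                p->pos++;
            else
                snprintf(p->error, sizeof p->error,
                         "column %zu: expected ')' to close '(' opened"
                         " at column %zu", p->pos + 1, open + 1);
        }
        return v;
    }
    if (isdigit((unsigned char) c)) {
        long v = 0;
        while (isdigit((unsigned char) p->input[p->pos]))
            v = v * 10 + (p->input[p->pos++] - '0');
        return v;
    }
    if (p->error[0] == '\0')
        snprintf(p->error, sizeof p->error,
                 "column %zu: expected a number or '('", p->pos + 1);
    return 0;
}

/* sum := term { '+' term } */
static long parse_sum(struct parser *p)
{
    long v = parse_term(p);
    for (;;) {
        skip_spaces(p);
        if (p->error[0] != '\0' || p->input[p->pos] != '+')
            return v;
        p->pos++;
        v += parse_term(p);
    }
}

/* Returns 0 and stores the result, or -1 with p->error describing
   the first problem found. */
int parse_expression(struct parser *p, const char *text, long *value)
{
    p->input = text;
    p->pos = 0;
    p->error[0] = '\0';
    long v = parse_sum(p);
    skip_spaces(p);
    if (p->error[0] == '\0' && p->input[p->pos] != '\0')
        snprintf(p->error, sizeof p->error,
                 "column %zu: unexpected '%c'", p->pos + 1,
                 p->input[p->pos]);
    if (p->error[0] != '\0')
        return -1;
    *value = v;
    return 0;
}
```

Given the input `1 + (2 + 3`, this sketch records "column 11: expected
')' to close '(' opened at column 5".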

None of that is to discount the value of a formal grammar for one's
language, which is as important as ever.  No non-toy language should
be designed without a formal grammar and validator thereof ever again.
I take a dim view of Brendan Eich and Rasmus Lerdorf because they were
young enough to have learned from the mistakes of their elders--who
sometimes made their mistakes because they lacked the proper equipment
to avoid them--but didn't.

Anyway, it may be that yacc/Bison are no longer fashionable choices for
expressing a formal grammar, maybe for good reasons.

So, how to keep a formal grammar and the (potential) massive benefit of
a _decidable_ grammar[1] without dropping information about context that
is not necessary to resolve the grammar but which is greatly helpful to
humans when writing in the language?  I don't know.  I think people have
solved this; I just haven't read anything on point or worked on a
language implementation that did.

I'm confident Haskell has the tools for the job, though.

Maybe someone on this list can tell me something more concrete.  :)

> Your courage in facing an inherited hand-knitted if-but-maybe machine
> is awe-inspiring. I'd scream and run.

Thanks!  Though I don't know if "courage" is the right word.
"Cussedness" might be more accurate.  _Structurally_, GNU troff's code
has seldom if ever mortified me.  There's a noticeable aversion to
introducing new types, which led to some IMO hacks, like "character
classes" as a kind of "charinfo" with a flag on it,[5] and
"font-specific fallback characters" as "plain" special characters with
otherwise illegal names.  However, I can understand the choices in light
of the code's pre-STL legacy.  It's just immensely painful to manage
collections of objects in pre-standard C++.

The recurring complaint I have about GNU troff's code when grappling
with--or even just reading--it is what strikes me (again, IMO) as an
insufficient commitment to giving variables, functions, and sometimes
even classes expressive names.  But that's not a big deal to fix.  I've
renamed tons and tons of things in the formatter over the past 5 years
to make the code better say what (I think) it means.  There are still
many things I haven't addressed, or addressed only partially.  As an
example of the latter, I want _every_ request handling function's name
to end with `_request()` to make it obvious to the reader that (1) it's
going to be manipulating the input stream, (2) it will need to validate
its input with maximum paranoia because that comes from "outside", and
(3) it should be audited for any "inner functionality" or manipulation
of language objects that should be factored out.

Observe that if property (3) were rigorously pursued as a goal, we could
decouple the formatting engine from the grammar of its input language.
That in turn would open a few interesting doors, like an ultra-rigorous
AT&T troff simulator[3], the possibility of an Heirloom Doctools
troff-compatible front end to cope with the divergence of its grammar
from GNU troff's (which will likely worsen over time), and the potential
for a reformed GNU troff grammar for the notional groff 2.0.
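
A sketch of the factoring I mean, in C for brevity.  The `environment`
struct and both function names are hypothetical stand-ins, not GNU
troff's actual code, and the handler accepts only a plain decimal
integer rather than a real *roff numeric expression.  The request
handler is the only part that sees untrusted text; the inner function
knows nothing about input syntax and could be called from any front end.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Hypothetical engine state; a stand-in for illustration only. */
struct environment {
    int line_length;
};

/* Inner functionality: manipulates the language object and nothing
   else.  Returns 0 on success, -1 if the value is unusable. */
int set_line_length(struct environment *env, int units)
{
    if (units <= 0)
        return -1;
    env->line_length = units;
    return 0;
}

/* Request handler: validates text that comes from "outside", then
   delegates.  Returns 0 on success, -1 on any malformed argument. */
int line_length_request(struct environment *env, const char *argument)
{
    char *end;

    errno = 0;
    long units = strtol(argument, &end, 10);
    if (end == argument || *end != '\0' || errno == ERANGE
        || units > INT_MAX || units < INT_MIN)
        return -1;
    return set_line_length(env, (int) units);
}
```

A different grammar would need only its own `line_length_request()`
equivalent; `set_line_length()` would stay as it is.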

The jury's out on whether those changes constitute improvements for
anyone but me.  Nobody's told me they find the code easier or harder to
read since I got involved with developing it.

> All power to your arm, and thanks for tolerating notions from those in
> the back row.

I appreciate the feedback.

Regards,
Branden

[1] Not too many years ago I learned that most Unix shells don't have a
    decidable grammar, thanks to at least one feature: "alias".

    Bourne-style shells are so unclean that I keep threatening myself
    to really give rc a serious try "this time".  But I _really_ need
    readline(3) and history(3) and most rc shell mavens seem to suffer
    from a brand of puritanism that eschews anything GNU ever did.
    Since rc shell mavens are a breed both rare and fractious (one of
    the major implementors is determined to strike out on his own,
    rendering some of Duff's original paper strongly non-descriptive of
    his own "rc" shell[2]), I struggle to find a community to work with.
    On the other hand, I seem to get along fine with Chet Ramey.

[2] https://news.ycombinator.com/item?id=34723523

    By Duff's own account, the rc shell's weirdest feature is "if not".
    I was amused when I realized its similarity to *roff's `el` request.
    But in fact *roff's `el` is even weirder, because you can invoke it
    in the absence of an `ie` request before it (its branch will not
    be taken).  Weirder still (but relatedly) *roff's `el` can occur
    with any number of non-conditional requests preceding it.

    Compare:

    $ nroff <<EOF
    ##> .el .tm branch A taken
    ##> .ie 0 .tm branch B taken
    ##> .tm Hi Erik
    ##> .el .tm branch C taken
    ##> EOF
    Hi Erik
    branch C taken

    $ rc
    % if not echo branch A taken
    rc: /dev/stdin: `if not' does not follow `if(...)'
    % if (false) echo branch B taken
    % echo Hi Erik
    Hi Erik
    % if not echo branch C taken
    rc: /dev/stdin: `if not' does not follow `if(...)'
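
    The difference between the two sessions above can be modeled with a
    few lines of C.  This is a toy model under my own assumptions: all
    names are hypothetical, a single flag stands in for whatever
    bookkeeping the real formatter does (nesting is ignored), and rc's
    complaint is modeled as a run-time result.  The point is only that
    an intervening command leaves the *roff state alone but invalidates
    rc's.

```c
#include <stdbool.h>

/* Toy *roff model: one remembered "else is pending" flag. */
struct roff_state {
    bool else_pending;
};

/* `.ie`: reports whether its branch is taken, and remembers the
   outcome for a later `.el`. */
bool roff_ie(struct roff_state *s, bool condition)
{
    s->else_pending = !condition;
    return condition;
}

/* Any non-conditional request: the remembered outcome survives. */
void roff_other(struct roff_state *s)
{
    (void) s;
}

/* `.el`: taken only if an earlier `.ie` failed; consumes the flag. */
bool roff_el(struct roff_state *s)
{
    bool taken = s->else_pending;
    s->else_pending = false;
    return taken;
}

/* Toy rc model: `if not` is valid only directly after an `if`. */
enum rc_result { RC_ERROR, RC_SKIPPED, RC_TAKEN };

struct rc_state {
    bool prev_was_if;
    bool prev_condition;
};

void rc_if(struct rc_state *s, bool condition)
{
    s->prev_was_if = true;
    s->prev_condition = condition;
}

/* Any other command breaks the link to the preceding `if`. */
void rc_other(struct rc_state *s)
{
    s->prev_was_if = false;
}

enum rc_result rc_if_not(struct rc_state *s)
{
    if (!s->prev_was_if)
        return RC_ERROR;
    s->prev_was_if = false;
    return s->prev_condition ? RC_SKIPPED : RC_TAKEN;
}
```

    Replaying the nroff session against the *roff half yields "not
    taken", "not taken", then "taken" for branches A, B, and C; the rc
    half yields an error for both A and C.
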

    "The one bit of large-scale syntax that Bourne unquestionably does
    better than _rc_ is the `if` statement with `else` clause.  _Rc_'s
    `if` has no terminating `fi`-like bracket.  As a result, the parser
    cannot tell whether or not to expect an `else` clause without
    looking ahead in its input.  The problem is that after reading, for
    example

    ```
      if(test -f junk) echo junk found
    ```

    in interactive mode, _rc_ cannot decide whether to execute it
    immediately and print `$prompt(1)`, or to print `$prompt(2)` and
    wait for the `else` to be typed.  In the Bourne shell, this is not a
    problem, because the `if` command must end with a `fi`, regardless
    of whether it contains an `else` or not.

    _Rc_'s admittedly feeble solution is to declare that the `else`
    clause is a separate statement, with the semantic proviso that it
    must immediately follow an `if`, and to call it `if not` rather than
    `else`, as a reminder that something odd is going on. ..."

    https://pdos.csail.mit.edu/6.828/2007/readings/rc-shell.pdf

[3] One might be tempted to conclude that we could then rip
    "compatibility mode" out of the "engine" (or "model") logic of GNU
    troff; AT&T troff mavens could instead run an "atroff" front end, or
    run groff with a hypothetical "--grammar=att" option or similar.
    But no.  Apart from some differences in the set of obscure control
    characters that constitute valid input, GNU troff is, in
    "compatibility mode", already really good at emulating AT&T troff
    _grammar_.  There exist deeper differences, at the "engine" level if
    you will, and for faithful emulation we'd still need to track
    "compatibility mode" state.

[4] http://www.rubinsteyn.com/text/template_insanity.html
    https://rtraba.com/wp-content/uploads/2015/05/cppturing.pdf

[5] This design choice led to the argument between Dave and me that
    spawned the parent thread.  While I now agree with him that
    "character class" objects were intended to have a "character flags"
    property that is separate from those of the individual character
    ("`charinfo`") objects that the class ultimately contains (via
    ranges and nested classes), he hasn't argued, and I don't think
    anyone would, that character classes meaningfully possess _other_
    properties of `charinfo` objects, like a (potential) macro
    definition or an "asciification code".  Character classes can't be
    interpolated or "asciified", so all of the `charinfo` properties
    they have but don't use are dead weight.  This fact indicates that
    they should have been made a separate type.
