Hi Alex,

At 2024-02-16T16:21:45+0100, Alejandro Colomar wrote:
> I've been thinking about a suggestion I've made in the past.  I wanted
> a program that reads man(7) source and produces roff(7) source, so
> that it can later be passed to troff(1), thus splitting the groff(1)
> pipeline a bit more.  The idea is similar to how eqn(1) and other
> pre-troff filters do their job.
Right.  I had a similar desire when I first came to groff development.

> The purposes of this are debugging, and also learning roff(7) starting
> from a decent understanding of a given macro package.
>
> I'll reword my suggestion slightly differently, which maybe makes more
> sense to you:
>
> Would you add a -Troff output device that prints roff(7) source?  Or
> maybe a --roff-out flag that modifies the behavior of troff(1) to
> print roff(7) out instead of its usual job.

I am unlikely to undertake this.  It's not that it's a bad idea; it's
that it's hard.  Please excuse an excursion into parser theory,
presented by a stumbling amateur in the field.

The way a *roff parser works--_any_ *roff parser I know of except
possibly mandoc(1)'s--makes this nearly intractable.  And that in turn
is because the roff language does not have a formal grammar; if someone
did come up with one, it would, I predict, be fearsomely complex.

Recall that Ossanna's nroff, which pretty much established the grammar
of the roff language as we know it today, dates back to 1972 or 1973.
That's before lex.  That's even before yacc.  That's before the first
editions of major texts in parser theory now taught to computer science
undergraduates were even written.

I don't know that roff can be categorized in one of the traditional
grammar classes like LALR(1) or LL(1).  If "GLR" means what it says
(likely), "generalized left-right grammar", and I fully understand what
that means (less likely), I suppose it's one of those.  Barring my
acquisition of greater facility with parser theory, I would describe
the roff language's grammar as "ad hoc".

That, and not performance considerations, is, I think, why everybody
who's written a roff parser has used recursive descent; that approach
is amenable to handling special cases anywhere you want them, at the
significant cost of making a succinct, abstract description of the
grammar progressively less feasible with every exception to the
general rules.
A bit more theory before I get to the concrete.  A fundamental paradigm
in language parsing is, upon consumption of the next token in a
sequence, deciding whether to "shift" it or "reduce" it.  To my
understanding, "shifting" means, roughly, that we can't decide yet how
to represent the pending expression in our abstract syntax tree (AST).
A common approach is to push the token onto a stack, whence we will pop
it later, after seeing more tokens that enable us to "reduce" it.

To generalize and analogize, let's consider an arithmetic expression in
a roff-like language.  (I won't promise that this will precisely match
what GNU troff does.)  Remember that roff arithmetic is evaluated
strictly from left to right, but with parentheses recognized to impose
a precedential ordering.

  20/(3 + 1)

The first thing we see is a numeric literal.  Great.  That has an easy
representation in underlying languages all the way down to the machine
level.  That "reduces", so we put it in our abstract syntax tree.[1]
The next thing we see is a division operator.  That, too, is a
meaningful token by itself, so it "reduces".

The next thing is an open parenthesis.  Wuh-oh.  There are no
parentheses in machine instructions.  We can't turn this into a node of
our abstract syntax tree.  So we "shift" it (onto a stack, say), in
hope that we'll later be able to make sense of it.  The next few
tokens?

  3 REDUCE
  + REDUCE
  1 REDUCE

Now the closing parenthesis.  This is the mate of our opening
parenthesis, which we had hoped for.  Since, for the purposes of this
illustration, we're not doing any arithmetic evaluation ourselves, what
we _don't_ do is further reduce "3", "+", and "1".  Instead we populate
our AST, arranged such that whatever interprets it (a code generator)
will sequence the instructions appropriately.  For a simple stack-based
machine, that would look something like this.
  PUSH 20
  PUSH 3
  PUSH 1
  ADD          (stack now looks like 20, 4)
  DIV          (stack now looks like 5)
  POP
  STORE $8000  (i.e., write 5 to memory at 0x8000)

So, why the big-ass excursion into 2nd/3rd year CS student material?
With any luck you'll have seen a hint.  Because interpolations are
performed recursively at the point they are encountered, the
information you want no longer exists by the time the parser is done
evaluating it.

Reading the stack machine output above, tell me where the parentheses
were in the source language without having seen it.  Can you do it?
Hmm, actually you probably can.  Maybe this is why they teach parsers
for simple arithmetic first.

Okay.  Let's posit a slightly more sophisticated language, like
Kernighan's hoc[2].  Let's have variables.

  a := 3
  b := 1
  c := 20/(a + b)

Let's imagine that this parses into AST nodes like this.  This is not
the same language as my pretend machine language above, but I've
supplied some parenthetical hints.

  ref a
  3
  ref b
  1
  20          (reduce -> PUSH)
  deref a  3  (reduce -> PUSH)
  deref b  1  (reduce -> PUSH)
  +           (reduce -> PUSH)
  /           (reduce -> PUSH)
  pop
  pop
  ref c
  %accum      (notation for a memory location like 0x8000 where our
               language runtime keeps a copy of the top of the stack
               for values of interest)

This will produce much the same machine language.

  PUSH 20
  PUSH 3
  PUSH 1
  ADD          (stack now looks like 20, 4)
  DIV          (stack now looks like 5)
  POP
  STORE $8000  (i.e., write 5 to memory at 0x8000)

Okay.  _Now_ can you reconstruct the original source from these
machine instructions?  No.  The evidence of the use of variables has
been effaced; we don't even know what their names are.  And so it is
with the roff languages.  An arbitrary number of macro expansions may
have taken place.

(Aside: one may wonder why I'm presenting three whole ad hoc languages
to explain the background here.  Good question.  My answer is a
somewhat weak "that's how everybody else presents it".  My
understanding is that, as a rule, this is a result of demand on the
part of programmers.
Generally, they seem to want to maintain programs in languages that are
more "expressive" than an abstract syntax tree.  AIUI, Forth and Lisp
_much_ more closely resemble the AST; in fact this is, if I'm not
mistaken, what's behind one of Alan Kay's famous insights about Lisp:
something about its S-expressions and ASTs being interconvertible, or
an obvious one-to-one mapping between a program's representation in
memory and the visible expression of it on the terminal.  Something
like that.  I confess to embarrassment at how much of a Philistine I
come across as in this paragraph.  No great sophisticate am I.)

I'm not saying it is impossible to do what you want (or what I wanted,
years ago).  By going carefully through the parser and adding "hooks"
to preserve input tokens at key locations, and/or building a list of
them as they are encountered, and disposing of them only once output
is flushed, I imagine it could be achieved.  But it would be a
significant effort and demand deep familiarity with the entire GNU
troff parser, not just the bulk of it in input.cpp.  My own knowledge
of it is far advanced over what it was 5 years ago, but I do not feel
equal to the task of scaling that mountain yet.

So I think this is unlikely to happen soon.  In the meantime, perhaps a
passing CS professor will, with exasperated patience, correct the
numerous howlers I managed to utter above.

Regards,
Branden

[1] *roff parsers don't delegate arithmetic evaluation all the way to
    the machine.  They translate it into C expressions instead and, I
    think, use a separate, temporary stack for evaluation.  But the
    principle holds, in my opinion.

[2] ...of whose presentation in _The Unix Programming Environment_ my
    labored explanation is but a pale shadow.  I wish someone would
    update that book for modern times; the currents of history have
    been particularly cruel to its old-school lex and yacc usage,
    which is some of the most valuable material in it.
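P.S. For anyone who wants to poke at the shift/reduce walk-through
above, here is a minimal sketch in Python.  It is a hypothetical
illustration of the general technique, not GNU troff's actual code:
the tokenizer, the instruction names, and the way I've encoded the
strictly left-to-right evaluation rule are all my own simplifications.

```python
import re

# Hedged sketch, not GNU troff's implementation: a one-pass,
# left-to-right parser for roff-style arithmetic.  Numeric literals
# "reduce" immediately (become instructions); operators and open
# parentheses are "shifted" onto a stack until later tokens let us
# reduce them.  The output is a flat stack-machine instruction list,
# in which (as argued above) the parentheses leave no trace.

OPS = {"+": "ADD", "-": "SUB", "*": "MUL", "/": "DIV"}

def parse(expr):
    code = []     # reduced tokens: the emitted instructions
    shifted = []  # shifted tokens: deferred operators and "("
    for tok in re.findall(r"\d+|[()+*/-]", expr):
        if tok.isdigit():
            code.append(("PUSH", int(tok)))        # reduce
        elif tok == "(":
            shifted.append(tok)                    # shift, hoping for a mate
        elif tok == ")":
            while shifted[-1] != "(":              # reduce back to the mate
                code.append((OPS[shifted.pop()],))
            shifted.pop()                          # discard the "("
        else:
            # Strictly left to right: an operator already pending at
            # this parenthesis depth reduces before the new one shifts.
            if shifted and shifted[-1] != "(":
                code.append((OPS[shifted.pop()],))
            shifted.append(tok)
    while shifted:                                 # reduce what remains
        code.append((OPS[shifted.pop()],))
    return code

def run(code):
    """Interpret the instruction list on a simple stack machine."""
    stack = []
    for insn in code:
        if insn[0] == "PUSH":
            stack.append(insn[1])
        else:
            b, a = stack.pop(), stack.pop()
            stack.append({"ADD": a + b, "SUB": a - b,
                          "MUL": a * b, "DIV": a // b}[insn[0]])
    return stack.pop()

code = parse("20/(3 + 1)")
print(code)       # [('PUSH', 20), ('PUSH', 3), ('PUSH', 1), ('ADD',), ('DIV',)]
print(run(code))  # 5
```

Note that parse("20/3+1") and parse("(20/3)+1") emit identical
instruction lists; the parentheses, like the variable names in the hoc
example, are unrecoverable from the output.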