> Going the reverse direction, groff=>markup language (e.g. HTML),
> it is equally evident that only conversions from a known macroset
> are going to produce semantically clean results
Forget about macro sets, and focus on the Troff pipeline itself.
Here's how I'd do it:

    infer | little-languages | troff -Tps | post-infer | dweb

Where:

- infer(1) <https://lists.gnu.org/archive/html/groff/2020-09/msg00031.html>
  is the preprocessor concept I shared last week,

- little-languages is the usual menagerie of preprocessors, sans
  soelim(1) and preconv(1),

- troff(1) formats the actual Roff code, tackles the author's choice
  of macro package, and tackles every other headache stipulated by
  low-level roff(7) usage,

- post-infer(1) performs its magic (see below), using any well-formed
  stream of device-independent output[1], where the actual typesetting
  device is arbitrary,

- dweb(1) receives a structured reconstruction of the original
  document (i.e., JSON or XML), then proceeds to use the combination
  of rendered markup and restored structural info to generate a clean,
  modern HTML document, with embedded MathML and SVG where appropriate.

Now, about that magic? Here's what post-infer(1) looks for when
processing the intermediate representation:

    x X meta: begin tag
    …
    x X meta: end tag

This corresponds to <tag>…</tag> in any language that quacks like XML.
Any text printed between those lines is assembled into logical blocks,
resolving character order according to the locale's writing
direction[2], while keeping track of tag nesting and word boundaries.
This is the step I'm least sure of, since margins and character metrics
are ultimately how we determine where one blob of text ends and another
begins.
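To make that scanning pass concrete, here's a minimal sketch of what
post-infer(1)'s first stage might look like. To be clear, the
"x X meta: begin/end" device-control convention and the nested-dict
output shape are my own illustrative assumptions, not anything an
existing tool emits; a real implementation would also have to consume
the motion and sizing commands for the metrics heuristics.

```python
# Sketch of post-infer(1)'s first pass: walk groff_out(5) intermediate
# output, match hypothetical "x X meta: begin/end" device controls,
# and rebuild the tag tree.

def post_infer(lines):
    """Fold ditroff output lines into a nested {tag, children} tree."""
    root = {"tag": None, "children": []}
    stack = [root]
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("x X meta: begin "):
            node = {"tag": line.rsplit(" ", 1)[-1], "children": []}
            stack[-1]["children"].append(node)
            stack.append(node)
        elif line.startswith("x X meta: end "):
            if stack[-1]["tag"] != line.rsplit(" ", 1)[-1]:
                raise ValueError("mismatched end tag: " + line)
            stack.pop()
        elif line.startswith("t"):
            # A "t" command carries a run of glyphs (emitted when the
            # device's DESC file enables tcommand); treat it as a word.
            stack[-1]["children"].append(line[1:])
        # Motion, font, and drawing commands (h, v, s, f, D, ...) would
        # feed the margin/metrics heuristics; ignored in this sketch.
    return root
```

A document whose macro package wraps paragraphs in `\X'meta: begin p'`
… `\X'meta: end p'` would then come out as one `{"tag": "p", ...}` node
per paragraph, ready to serialize for dweb(1).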
Thankfully, we have access to the actual font metrics, which can
alleviate problems with:

- Measuring distances between baselines (for subscripts, superscripts,
  and paragraph breaks),

- Handling overstriking, by composing obvious-looking combinations
  into the appropriate Unicode character; e.g., O, / and _ get
  resolved into *Ø*,

- Knowing a glyph's advance width (for identifying unnatural spacing
  between letters that might be a word boundary or track kerning).

Every other tidbit of detail we can glean from device controls gets
scraped as well, such as the various anchor and hyperlink syntaxes
used by gropdf(1), Heirloom Troff, and anything else we know of ahead
of time.

And the best part of all this? Authors don't need to give a damn
whether they're targeting HTML or PDF output. This makes sense once
you realise HTML5 and nroff(1) output are oddly similar: the document
outline is what matters most, and purely presentational characters are
either stripped, or boiled down to whatever the medium supports by way
of styling. For HTML, that means <b>, <i>, <u>, <s>. For terminal
output, that'd be *bold*, *underline*, indentation, and margins.

Right. I've finally gotten this years-old idea off my chest, and I'd
like to hear people's opinions. Thoughts?

*FOOTNOTES I ADDED MANUALLY, BECAUSE I CAN'T G. BRANDEN LIKE HE CAN:*

1. I'll tend to refer to device-independent troff's *"intermediate
   output language"* as just ditroff(7). I don't care if it's
   anachronistic; I need a concise way to refer to the language
   described in groff_out(5) …without implying anything
   Groff-specific.

2. For folks *not* using Neatroff, "writing direction" is the same
   thing as "left-to-right, top-to-bottom". One ironic typesetting
   limitation of both Groff and Heirloom is that only left-to-right
   scripts are supported. So much for Mongolian man page support!
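P.S. In case the overstrike heuristic above sounds hand-wavy: I'd
picture it as a plain lookup table, consulted whenever two glyphs land
at the same position (zero advance between them). The pairs below are
illustrative only; the function name and table are invented for the
sketch, not part of any existing tool.

```python
# Table-driven sketch of the overstrike heuristic: map a (base glyph,
# overstruck glyph) pair to the composed Unicode character.  The
# entries are illustrative, not exhaustive.
OVERSTRIKES = {
    ("O", "/"): "\u00d8",  # Ø LATIN CAPITAL LETTER O WITH STROKE
    ("o", "/"): "\u00f8",  # ø
    ("L", "/"): "\u0141",  # Ł
    ("c", "/"): "\u00a2",  # ¢ (the classic lineprinter trick)
    ("=", "/"): "\u2260",  # ≠
}

def compose_overstrike(base, over):
    """Return the composed character, or None if the pair is unknown
    and should be kept as two separately positioned glyphs."""
    return OVERSTRIKES.get((base, over)) or OVERSTRIKES.get((over, base))
```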
On Tue, 29 Sep 2020 at 05:54, Peter Schaffter <pe...@schaffter.ca> wrote:
> On Mon, Sep 28, 2020, Steve Izma wrote:
> > I think Larry's point here is that it's not that hard to write a
> > script to go from a markup language to groff.
>
> I think the crux of the matter is going from a markup language *to a
> specific groff macroset*, not merely "to groff."
>
> A couple of years ago, Yves Cloutier and I discussed his idea of
> creating an entirely new markup language, which would begin life as
> a markup=>groff converter and proceed from there. Needless to say,
> he discovered quickly that his converter needed a macroset to map
> the markup to; conversion to low-level groff is futile because many
> of the operations groff is expected to perform demand being managed
> by macros.
>
> Going the reverse direction, groff=>markup language (e.g. HTML),
> it is equally evident that only conversions from a known macroset
> are going to produce semantically clean results. Thus, I feel that
> any work done on a grohtml-like device must start by determining an
> appropriate macroset to use for the conversion, then extending it to
> include additional macrosets.
>
> grohtml(1) makes no mention of macrosets, which lacuna can only be
> construed by users in one of two ways: grohtml is macro-agnostic,
> thus it will convert any macroset, or grohtml only converts
> low-level groff ("...converts the output of GNU troff to html.").
> I believe neither is true.
>
> Using -ms as an example, I would dearly love to see the DESCRIPTION
> in grohtml(1) begin:
>
>     "The grohtml front end...translates the output of documents
>     formatted with the groff_ms(1) macros to html. Users should
>     always invoke grohtml via the groff command with the -Thtml
>     and -ms options."
>
> --
> Peter Schaffter
> http://www.schaffter.ca