> Going the reverse direction, groff=>markup language (e.g. HTML),
> it is equally evident that only conversions from a known macroset
> are going to produce semantically clean results
Forget about macro sets, and focus on the Troff pipeline itself.
Here's how I'd do it:

    infer | little-languages | troff -Tps | post-infer | dweb

Where:

- infer(1) <https://lists.gnu.org/archive/html/groff/2020-09/msg00031.html>
  is the preprocessor concept I shared last week,

- little-languages is the usual menagerie of preprocessors, sans
  soelim(1) and preconv(1),

- troff(1) formats the actual Roff code, tackles the author's choice
  of macro package, and tackles every other headache stipulated by
  low-level roff(7) usage,

- post-infer(1) performs its magic (see below), using any well-formed
  stream of device-independent output[1], where the actual typesetting
  device is arbitrary,

- dweb(1) receives a structured reconstruction of the original
  document (i.e., JSON or XML), then proceeds to use the combination
  of rendered markup and restored structural info to generate a clean,
  modern HTML document, with embedded MathML and SVG where appropriate.

Now, about that magic? Here's what post-infer(1) looks for when
processing the intermediate representation:

    x X meta: begin tag
    …
    x X meta: end tag

This corresponds to <tag>…</tag> in any language that quacks like XML.
Any text printed between those lines is assembled into logical blocks,
resolving character order according to the locale's writing
direction[2], while keeping track of tag nesting and word boundaries.
This is the step I'm least sure of, since margins and character metrics
are ultimately how we determine where one blob of text ends and another
begins.
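To make that scanning pass concrete, here's a minimal sketch of what
post-infer(1)'s first stage might look like. To be clear, the
"x X meta: begin/end" device-control convention and the nested-dict
output shape are my own illustrative assumptions, not anything an
existing tool emits; a real implementation would also have to consume
the motion and sizing commands for the metrics heuristics.

```python
# Sketch of post-infer(1)'s first pass: walk groff_out(5) intermediate
# output, match hypothetical "x X meta: begin/end" device controls,
# and rebuild the tag tree.

def post_infer(lines):
    """Fold ditroff output lines into a nested {tag, children} tree."""
    root = {"tag": None, "children": []}
    stack = [root]
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("x X meta: begin "):
            node = {"tag": line.rsplit(" ", 1)[-1], "children": []}
            stack[-1]["children"].append(node)
            stack.append(node)
        elif line.startswith("x X meta: end "):
            if stack[-1]["tag"] != line.rsplit(" ", 1)[-1]:
                raise ValueError("mismatched end tag: " + line)
            stack.pop()
        elif line.startswith("t"):
            # A "t" command carries a run of glyphs (emitted when the
            # device's DESC file enables tcommand); treat it as a word.
            stack[-1]["children"].append(line[1:])
        # Motion, font, and drawing commands (h, v, s, f, D, ...) would
        # feed the margin/metrics heuristics; ignored in this sketch.
    return root
```

A document whose macro package wraps paragraphs in `\X'meta: begin p'`
… `\X'meta: end p'` would then come out as one `{"tag": "p", ...}` node
per paragraph, ready to serialize for dweb(1).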
Thankfully, we have access to the actual font metrics, which can
alleviate problems with:

- Measuring distances between baselines (for subscripts, superscripts,
  and paragraph breaks),

- Handling overstriking, by composing obvious-looking combinations
  into the appropriate Unicode character; e.g., O, / and _ get
  resolved into *Ø*,

- Knowing a glyph's advance width (for identifying unnatural spacing
  between letters that might be a word boundary or track kerning).

Every other tidbit of detail we can glean from device controls gets
scraped as well, such as the various anchor and hyperlink syntaxes
used by gropdf(1), Heirloom Troff, and anything else we know of ahead
of time.

And the best part of all this? Authors don't need to give a damn
whether they're targeting HTML or PDF output. This makes sense once
you realise HTML5 and nroff(1) output are oddly similar: the document
outline is what matters most, and purely presentational characters are
either stripped, or boiled down to whatever the medium supports by way
of styling. For HTML, that means <b>, <i>, <u>, <s>. For terminal
output, that'd be *bold*, *underline*, indentation, and margins.

Right. I've finally gotten this years-old idea off my chest, and I'd
like to hear people's opinions. Thoughts?

*FOOTNOTES I ADDED MANUALLY, BECAUSE I CAN'T G. BRANDEN LIKE HE CAN:*

1. I'll tend to refer to device-independent troff's *"intermediate
   output language"* as just ditroff(7). I don't care if it's
   anachronistic; I need a concise way to refer to the language
   described in groff_out(5) …without implying anything
   Groff-specific.

2. For folks *not* using Neatroff, "writing direction" is the same
   thing as "left-to-right, top-to-bottom". One ironic typesetting
   limitation of both Groff and Heirloom is that only left-to-right
   scripts are supported. So much for Mongolian man page support!
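P.S. In case the overstrike heuristic above sounds hand-wavy: I'd
picture it as a plain lookup table, consulted whenever two glyphs land
at the same position (zero advance between them). The pairs below are
illustrative only; the function name and table are invented for the
sketch, not part of any existing tool.

```python
# Table-driven sketch of the overstrike heuristic: map a (base glyph,
# overstruck glyph) pair to the composed Unicode character.  The
# entries are illustrative, not exhaustive.
OVERSTRIKES = {
    ("O", "/"): "\u00d8",  # Ø LATIN CAPITAL LETTER O WITH STROKE
    ("o", "/"): "\u00f8",  # ø
    ("L", "/"): "\u0141",  # Ł
    ("c", "/"): "\u00a2",  # ¢ (the classic lineprinter trick)
    ("=", "/"): "\u2260",  # ≠
}

def compose_overstrike(base, over):
    """Return the composed character, or None if the pair is unknown
    and should be kept as two separately positioned glyphs."""
    return OVERSTRIKES.get((base, over)) or OVERSTRIKES.get((over, base))
```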
On Tue, 29 Sep 2020 at 05:54, Peter Schaffter <pe...@schaffter.ca> wrote:
> On Mon, Sep 28, 2020, Steve Izma wrote:
> > I think Larry's point here is that it's not that hard to write a
> > script to go from a markup language to groff.
>
> I think the crux of the matter is going from a markup language *to a
> specific groff macroset*, not merely "to groff."
>
> A couple of years ago, Yves Cloutier and I discussed his idea of
> creating an entirely new markup language, which would begin life as
> a markup=>groff converter and proceed from there. Needless to say,
> he discovered quickly that his converter needed a macroset to map
> the markup to; conversion to low-level groff is futile because many
> of the operations groff is expected to perform demand being managed
> by macros.
>
> Going the reverse direction, groff=>markup language (e.g. HTML),
> it is equally evident that only conversions from a known macroset
> are going to produce semantically clean results. Thus, I feel that
> any work done on a grohtml-like device must start by determining an
> appropriate macroset to use for the conversion, then extending it to
> include additional macrosets.
>
> grohtml(1) makes no mention of macrosets, which lacuna can only be
> construed by users in one of two ways: grohtml is macro-agnostic,
> thus it will convert any macroset, or grohtml only converts
> low-level groff ("...converts the output of GNU troff to html.").
> I believe neither is true.
>
> Using -ms as an example, I would dearly love to see the DESCRIPTION
> in grohtml(1) begin:
>
>     "The grohtml front end...translates the output of documents
>     formatted with the groff_ms(1) macros to html. Users should
>     always invoke grohtml via the groff command with the -Thtml
>     and -ms options."
>
> --
> Peter Schaffter
> http://www.schaffter.ca