Re: mini-book manual pages through multi-.so pages (i.e., the old proc(5) page)

G. Branden Robinson Thu, 25 Sep 2025 16:40:57 -0700

Hi Ingo,

At 2025-09-25T18:19:39+0200, Ingo Schwarze wrote:
> G. Branden Robinson wrote on Thu, Sep 25, 2025 at 04:15:02AM -0500:
> > At 2025-09-25T02:02:24+0200, Ingo Schwarze wrote:
> 
> >> On the other hand, for mdoc(7), the situation is much worse than
> >> for man(7) in so far as the macro order .Dd .Dt .Os used to be mere
> >> convention, and any other order of these three macros used to be
> >> equally valid.  Groff-1.23 utterly broke that and now always starts
> >> a new manual page at .Dd, so every manual page with a different
> >> macro order is now totally broken with groff.
> 
> > I broke it, and I broke it for a reason.  When formatting for
> > paginated output devices (anything that isn't a terminal or
> > HTML--the only output formats _mandoc_ natively supports[...]),
[...]
> > [...] As I understand it, _mandoc_(1)'s PDF support comes from using
> >       an external tool to generate it from HTML.  That approach has
> >       significant limitations from a typesetting perspective.
> 
> Absolutely not.  The mandoc -Tps and -Tpdf output modes are
> implemented as a submodule "term_ps.c" of the terminal-output module
> "term.c".  The module "term_ps.c" directly generates valid PostScript
> and PDF code from the abstract man(7) and mdoc(7) syntax trees using
> knowledge about the syntax and semantics of the PostScript and PDF
> stack-based Turing-complete programming languages.  HTML is not
> involved in any way, and the only program that mandoc(1) ever runs
> execve(2) on is the pager, and only in man(1) = "mandoc -a" mode.


Ah!  That's the second time (at least) you've caught me out in error
regarding _mandoc_ implementation details.  I should learn from this.

It's taken me several years to reach even a modest level of familiarity
with GNU troff's source code, meaning that when there is a problem with
the formatter, I feel I can make an intelligent guess at which source
file to look at.  I have no familiarity with _mandoc_ and should not
guess at all.

> I totally agree that generating PDF from HTML would be a bad idea.
> You can be forgiven for the misunderstanding insofar as your final
> conclusion is not that far from the truth: while -T ps and -T pdf
> generate syntactically valid and superficially acceptable code, the
> quality of that code is rather low from a typesetting perspective.

Yes.  I haven't looked in a while, but the output looked so much like
some kind of "html2pdf" process--maybe like a browser's "print to file"
feature--that I hastily presumed you'd outsourced the responsibility.
Sorry about that.

> While mandoc(1) also natively supports the -T tree, -T man,
> and -T markdown output modes, those are not typesetting modes either,

Right.  The good news is that GNU troff, on the Git master branch, is
beginning to catch up with `-T tree`.

     .pline     Report, in JSON syntax to the standard error stream, the
                list of output nodes corresponding to the pending output
                line.  In JSON, a pair of empty brackets “[ ]”
                represents an empty list.  A pending output line has not
                yet undergone adjustment, and lacks a line number and
                margin character (all as applicable).
[...]
     .pm name ...
                Report, to the standard error stream, the JSON‐encoded
                name and contents of each macro, string, or diversion
                name.

I've also toyed with the idea of adding an environment variable that
would cause _every_ output line (in the top-level diversion) to be
dumped in this way.  (I've done it successfully in my own working copy,
but never pushed it.)

And I have notions for an `-A` option exposing alternative output modes
named by the option argument.  One could be an anchor/bookmark
itemizer.

> so you are right that mandoc(1) has poor typesetting support, much
> poorer than it needs to have as a consequence of its development goals
> - even though it will likely never reach groff(1) or Heirloom levels
> of typesetting quality, it could do much better when given some love.

Fair.  Well, this list is here when you need a consultation.  :)

> It does support paginated output in -T ps and -T pdf though, including
> page headers and footers on every page (as opposed to only for each
> *manual* page like in terminal output - mandoc terminal output, in
> groff terminology, is always continuous, but that does not apply to -T
> ps nor to -T pdf).

Acknowledged.

> > If `Dd`, `Dt`, and `Os` can appear in arbitrary order, you risk
> > producing an incorrect page footer, sticking some of document n+1's
> > data at the bottom of the last page of document n.  I know this
> > because I saw it happen.
> 
> Only if you concatenate sources.

Yes, _if_--but that's a long-accepted operation mode for *roff...

https://minnie.tuhs.org/cgi-bin/utree.pl?file=2BSD/doc/pascal/makefile

(2BSD was May 1979) and a pretty familiar idiom for Unix commands for
about as long as there's been an `argv[]` vector.

> > Possibly I could have added support for some kind of transitional
> > state to _groff_'s _mdoc_ package, and deferred the page break until
> > all 3 macros had appeared regardless of ordering,
> 
> That creates new, different problems: what if one of the macros is
> missing?  Then you would never start the new page at all?
> In particular, it is easy to imagine a page where .Dd is missing, if
> a page author (unwisely) decided displaying a date doesn't matter.

Sure, but then the page is ill-formed according to both _groff_ *and*
_mandoc_.  We need not recover from invalid input in the same ways.[1]
One approach would be, upon hitting any of the three macros, reset to
zero a counter of initialization macros seen, break the page, and reset
all of the corresponding "titles" data to something like "UNTITLED", an
empty section, "UNDATED", and, for `Os`, maybe "UNDISTRIBUTED".  Then,
whichever of the three macros is being processed, partially populate the
data per the argument(s) and increment the "initialization macros seen"
counter.  Add a check to `Sh`, `Pp`, and maybe some other macros to emit
a distressing warning diagnostic if the "initialization macros seen"
counter does not equal 3.
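
A rough model of that sketch in Python, not _groff_ macro code.  Every
name in it (`Formatter`, `call`, `placeholders`) is hypothetical.  It
also assumes one thing the sketch leaves open: the reset and page break
happen only for the first initialization macro in a run, that is, when
the preceding macro call was not `Dd`, `Dt`, or `Os`.  Otherwise the
counter could never reach 3.

```python
# Hypothetical model of the sketch above; nothing here is groff's mdoc code.
PREAMBLE = {"Dd", "Dt", "Os"}
CHECKED = {"Sh", "Pp"}  # macros that verify the preamble was complete


def placeholders():
    return {"title": "UNTITLED", "section": "", "date": "UNDATED",
            "os": "UNDISTRIBUTED"}


class Formatter:
    def __init__(self):
        self.seen = 0                 # initialization macros seen
        self.in_preamble = False      # was the previous call Dd/Dt/Os?
        self.titles = placeholders()
        self.events = []              # page breaks and warnings, in order

    def call(self, macro, *args):
        if macro not in PREAMBLE:
            self.in_preamble = False
            if macro in CHECKED and self.seen != 3:
                self.events.append(
                    f"warning: {macro}: saw {self.seen} of Dd, Dt, Os")
            return
        if not self.in_preamble:
            # Assumption: only the first macro of a run starts a document.
            self.seen = 0
            self.titles = placeholders()
            self.events.append("page break")
            self.in_preamble = True
        if macro == "Dd" and args:
            self.titles["date"] = " ".join(args)
        elif macro == "Dt" and args:
            self.titles["title"] = args[0]
            self.titles["section"] = args[1] if len(args) > 1 else ""
        elif macro == "Os" and args:
            self.titles["os"] = " ".join(args)
        self.seen += 1


fmt = Formatter()
# Document 1: unorthodox order, but complete.
for m in [("Dt", "FOO", "1"), ("Os", "ExampleOS"),
          ("Dd", "September 25, 2025"), ("Sh", "NAME")]:
    fmt.call(*m)
# Document 2: Dd is missing, so Sh should complain.
for m in [("Dt", "BAR", "8"), ("Os",), ("Sh", "NAME")]:
    fmt.call(*m)
print(fmt.events)
print(fmt.titles)
```

In this model the second document's date falls back to "UNDATED" rather
than inheriting the first document's, which is the footer spillover the
page break is meant to prevent.  A side effect of the assumption is that
a stray `Os` in the middle of a document body would also start a new
page.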

That's a sketch on a napkin.  I don't think this problem is infeasible
to tackle; the question is whether it's worth the effort.  Do you think
it is?

> > but that would have added
> > complicated logic.  My impression is that you're not a fan of
> > complicated logic, as a rule.
> 
> Yes!  :)
> 
> I think the whole idea of formatting multiple pages in one go
> is misguided because it creates the both intractable and entirely
> unnecessary problem that you describe of finding page starts - also
> note that in a manual page, .TH is not necessarily the first roff(7)
> request.

If it appears at all, it should be the first macro called because
otherwise the package doesn't know what to populate the page header
with.  For groff 1.24.0, I've made _man_(7) a little more forgiving of
such "degenerate" documents per requests from Alex Colomar,[2] but this
support gets even less of a commitment than the "NO WARRANTY" that
normally attaches to our code--less than zero!

> The trouble is unnecessary because you *do* actually know where the
> pages start and end - otherwise you could not concatenate them in the
> first place.

I do know, by checking the value of the `.F` register.  It is
straightforward in *roff to do this, and it's perfectly possible even in
Seventh Edition Unix troff.  When `.F` interpolates an empty value, we
know we've reached the end of input.  That's how I know when _not_ to
draw a separating horizontal rule between documents in "continuous
rendering" mode.  This is implemented in groff 1.23.0 already.

https://cgit.git.savannah.gnu.org/cgit/groff.git/tree/tmac/an.tmac?h=1.23.0#n125
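
To make the idea concrete outside of *roff, here is a loose Python
model; it is not the an.tmac code linked above.  `current_file` is a
hypothetical stand-in for what `.F` interpolates once a document has
ended, with the empty string meaning end of input.  The function name
and the rule text are likewise made up.

```python
# Hypothetical model: decide where to draw a rule between documents in
# continuous rendering, using a stand-in for the `.F` register.
RULE = "-" * 20


def render_continuous(docs):
    """docs: list of (file name, body text) in command-line order."""
    out = []
    for i, (name, body) in enumerate(docs):
        out.append(body)
        # What the `.F` stand-in reports after this document: the next
        # file's name, or "" when the input is exhausted.
        current_file = docs[i + 1][0] if i + 1 < len(docs) else ""
        if current_file != "":
            out.append(RULE)  # more input follows: separate the documents
    return out


result = render_continuous([("true.1", "TRUE(1) ..."),
                            ("false.1", "FALSE(1) ...")])
print("\n".join(result))
```

The rule appears between the two documents but not after the last one,
because the stand-in is empty there.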

> You are artificially wiping out information that you actually have and
> then jump through no end of hoops attempting to recover the lost
> information that in fact can no longer be recovered.

I think you're wrong.  See above regarding use of the `.F` register and,
for _mdoc_(7), a proposed workaround for degenerate or unorthodox
documents.  For _man_(7), there's nothing to work around, because a
single `TH` call (a) unambiguously marks the commencement of a new
document and (b) completely determines the contents of the page headers
and footers.

> Heck, even something as simple as inserting an undocumented,
> implementation-detail private macro
> 
>   .page_start_private_macro_to_please_branden
> 
> between pages instead of just recklessly cat(1)ing them might
> mitigate some of the trouble (though i admit i didn't spend too much
> thought on the idea and could be missing something).

I'm not seeing any trouble to mitigate here, apart from the
low-frequency case of _mdoc_(7) documents with disordered initialization
macros.  Is that a defect?  I'll come back to that below.

> The proper way to create a book from manual pages is to generate
> each manual page separately and then concatenate the resulting
> PostScript or PDF documents with an appropriate external tool other
> than *roff(1).

But then you have to track the page number externally to the formatter,
and correctly initialize it for each rendered document.

groff_man(7):

Options
     -rPn     Start enumeration of pages at n.  The default is 1.
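
For illustration, a Python sketch of the bookkeeping that formatting
each page separately would require.  It assumes each document's page
count is already known (say, from an earlier formatting pass); the
function name, file names, and sample counts are invented, and the
commands are only constructed, not run.  `-rPn` is the option quoted
above.

```python
def batch_commands(docs, first_page=1):
    """docs: list of (file name, page count) pairs, in book order."""
    commands = []
    start = first_page
    for name, pages in docs:
        # -rPn starts page enumeration at n (per groff_man(7)).
        commands.append(["groff", "-man", f"-rP{start}", "-Tpdf", name])
        start += pages
    return commands


cmds = batch_commands([("true.1", 1), ("false.1", 1), ("proc.5", 40)])
for cmd in cmds:
    print(" ".join(cmd))
```

The running total is exactly the state that has to live outside the
formatter under this approach.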

> There is no problem with page numbers or tables of
> contents because formatting manual pages does not print page numbers
> or tables of contents anyway,

They don't?  I guess not, historically.  groff can, if the user wants.

> not even when (unwisely) formatted in one go.

I see nothing unwise about it.  I'm pretty pleased with how
groff-man-pages.pdf has shaped up.  I _would_ like more assistance from
the macro packages and/or the formatter in avoiding widows and orphans.

> > In my opinion, the segregation of `Dd`, `Dt`, and `Os` was a blunder
> > in _mdoc_'s design
> 
> Yes, i mostly agree, though i would weigh the reasons why it was
> a blunder slightly differently.  The worst part is that .Os turned
> out to not be particularly useful at all, for any purpose, and is
> very hard to make useful in any context.

It seems pretty useful to me for any approach to operating system
distribution that admits multiple sources or vendors.  I get that that's
not the BSD way.  But it's not everybody's way.

> And then, it was a blunder
> that Cynthia did not specify a hard requirement on the order of
> these macros.

If she'd done that, it's one short step further to a single macro, `Th`.

> Had such an order been documented and uncompromisingly enforced by the
> code, the segregation would have done little harm, even though not
> being particularly useful - it would have become a mere bikeshed
> whether you prefer one macro with several positional arguments or a
> group of macros with a mnemonic name each.

There would remain the problem of someone interleaving
non-initialization (or non-"preamble", as I think you term it) macro
calls in the midst of `Dd`, `Dt`, and `Os`.

Incidentally it has been my opinion that _mdoc_(7) pages, if they stuck
with this trio (which seems immovable), should have started with `Dt`
instead of `Dd`, because that's the information most important to the
page maintainer.   (It did just occur to me that starting with `Dd`
might be slightly more convenient for generators of _mdoc_ documents,
perhaps via shell commands in a Makefile rule.  That is, you might
compute the date reported in the `Dd` call and tack on the remainder of
the page with _cat_(1).)  But, perceiving decades of prior art with the
existing ordering, I elected not to tilt at this particular windmill.

> > Furthermore, _mdoc_ documents that deviate from the canonical/
> > (conventional?) order seem rare.  In a FreeBSD bug report raising
> > this issue,[...] Wolfram Schneider identified only 15 pages in the
> > base/core/whatever system (all from 1 package, I think: krb5), and
> > 371 out of about 15,000 in the ports collection.
> 
> Wow.  That's _way_ more than i would have expected - almost 400
> real-world pages that got broken in FreeBSD alone?

Hah.  You live in a more regimented world than I do.  My impression is
that for any defect in _man_(7) composition one cares to identify,
you'll see _at least_ 10% of all extant pages manifesting it.

[snip]
> Heh.  You only downloaded the ports Git repo, and even that seemed
> large to you?

Yup.  It was years ago; I was on a slower Internet connection then.  But
I am also impatient.  My gcc Git checkout started from a shallow copy.
That works fine for me; I'm merely a passive spectator of that project.

> That repo doesn't even contain *any* of the code
> nor any of the manuals nor any of the build systems - it purely
> consists of meta instructions how to download the actual code
> including the actual build systems, and it contains build system
> wrappers explaining how to run the diverse build systems.

Ah!  I would have been cross to discover this after waiting so long.

> Do not attempt a bulk build at home, unless you have a powerful
> cluster of fast, modern machines, several days of time, and know
> exactly what you are doing.  It is akin to running a full build of
> Debian, including a full build of *all* Debian packages, including all
> optional packages.
> 
> Even i never attempted an OpenBSD bulk build, and i have no access
> to any build cluster that would even be remotely adequate for trying
> it.  And the FreeBSD ports tree is significantly larger than the
> OpenBSD one, probably at least twice the size.

I see.  If I wanted to grep all the man pages in all of a *BSD's ports,
how would I go about doing it?

> > If someone does actually regard this as a defect in _groff_, they can
> > say so.  I have not yet seen anyone make this claim.
> 
> I dimly recall complaining about the mdoc(7) preamble regressions
> years ago, and i dimly recall your reply as something along the
> lines of "that would be too hard to fix", so i mostly gave up on
> it - and recently marked the related tests in the mandoc regression
> suite as "broken in groff-1.23", to help me move on with other tasks.

Okay.  I didn't regard that as a defect report, but merely a grumble.  If I
misunderstood, you can say so, by filing a Savannah ticket yourself or
asking me to.  As noted above, I think the problem is superable.

> > Have you read the code?
> 
> No.  Why should i?  I have not even read the documentation how it
> is all supposed to work because the user interface design (before
> even starting to think about the implementation) is so complicated
> that i gave up on even reading the documentation, or the discussions
> how it should be designed.

There are no interface changes for "PDF book" support.  That interface
is what it has always been:

groff_man(7):

Synopsis
     groff -man [option ...] [file ...]
     groff -m man [option ...] [file ...]

In other words, the "UI" is the second ellipsis on each line.

I've certainly added new registers and strings to _groff_ _man_ and
_mdoc_, but none of them have any essential connection to what we might
call batch rendering of man pages.

groff_man(7):

Authors
...
     ... G. Branden Robinson ⟨[email protected]⟩ implemented
     the AD and MF strings; CS, CT, and U registers; and the MR macro
     for groff 1.23 (2023), and the BP, PO, and TS registers and a
     revised implementation of the SY/YS macros for groff 1.24 (2025).

None of these features direct or configure batch rendering.

So, tell me--how have I complicated the API to support batch rendering?

> Even merely stubbing out and deactivating only those parts of the
> new API that cause undesirable behaviour already caused
> non-negligible effort for me, even while ignoring the (likely much
> larger) parts that are merely needless for OpenBSD purposes
> but without triggering any obvious harm.

If your efforts are comprehensively recorded at
<https://github.com/ischwarze/groff-port/commits/1.23/>, then I have
responded to them as helpfully as I can.

You haven't replied to me except in one case, where you drew my attention to
a post-1.23.0 regression.  I appreciate it--I fixed it and added a test.

https://github.com/ischwarze/groff-port/commit/b251b75a25d4a86107870d0676d3a7dcbf125db1#commitcomment-163275691
https://savannah.gnu.org/bugs/?67385
https://cgit.git.savannah.gnu.org/cgit/groff.git/commit/?id=bfa14e8894d5a6dbb18f9d0dd5d8490767f4c336

> Actually, with mandoc man(1),
> 
>    $ man -Tpdf true false > tmp.pdf
> 
> does result in an output file that both xpdf(1) and gv(1) display
> just fine as a two-page document (with each manual page on one
> page of the "book").  I'm not entirely sure the syntax is valid:
> 
>    $ grep -n PDF- tmp.pdf
>   1:%PDF-1.1
>   447:%PDF-1.1
> 
> I'm too lazy to check right now whether that is valid syntax,

I'm sure Deri knows.  :)

> reading the 750 page PDF specification is always a bit of challenge,

I hear ya.  The HTML 5 "Living Standard" is even more daunting to me.

> but in case it is not valid syntax, it can certainly be fixed
> *without* requiring concatenation of the input files, making
> sure that mandoc continues to process every input file entirely
> separately, without any spillover from one file to the next,
> and without any ambiguity where one manual page ends and the
> next begins.

As I tried to explain above, I think the distinction you're making is
illusory.  It's a simple, and 45+ year old, application of *roff macro
programming to detect when one's input file has changed.

> Now mandoc(1) does all of that by itself, having a badly
> non-Unix-style monolithic software architecture approach.  The
> traditional way roff operates is in Unix-style through cooperation of
> many small tools that each do one particular job well, so the
> concatenation should almost certainly be done *after* the troff(1) and
> *after* the postprocessor stage.

Yes, and _groff_ still works this way, albeit a little surreptitiously.

[...]
> > To address the gripe you raise above about `Dd`, `Dt`, and `Os`
> > would require--guess what?--more registers (and/or strings) and more
> > complexity.  Is that what you want?
> 
> No.  Then again, what mandoc(1) does - not for its own choice, but
> for compatibility with pre-1.23 groff - is not *that* complicated:
> Mandoc distinguishes (and groff used to distinguish) two parsing
> phases: a preamble phase and a content phase.  The phase transition
> is triggered when the first output happens - in mandoc, that's
> modeled by encountering the first non-preamble mdoc(7) or man(7)
> macro or the first text (i.e. non-request non-macro) input line -
> but that's an implementation detail, roff would likely use some
> concept of traps instead.  When the phase transition happens, the
> first header line is printed, with whatever content is available
> at that time.  The content of footer lines, and of header lines
> on subsequent pages, can still be changed by (some) preamble macros
> occurring late, i.e. when already in the body phase - the details
> are not particularly consistent but have been implemented for
> compatibility with pre-1.23 groff.  I would certainly be open to
> making all this more consistent and even simpler.

I think my napkin sketch above implements this model, without elevating
the notions of "preamble phase" and "content phase" to conceptual
importance.  But they can be elicited from the proposed design if one
wants to think about it that way.

No additional traps necessary, as I envision it.

> By the way, purely internal, undocumented strings and registers
> are not a problem, instead they are mere implementation details
> (well, simple code is a virtue, too, but not nearly as crucial
> as simple UI and documentation).  Documented strings and registers
> and instructions how to use them is what i call "complexity".

Right.  I struggle to see how I increased UI complexity in support of
batch rendering.  Making it Just Work(tm) for the naïve user was in fact
my intention.  One can attempt batch rendering of man pages with groff
1.22.4--that was my starting point, after all--but unless all one's
documents are unrealistically well behaved, the result will, I predict,
disappoint.

Regards,
Branden

[1] https://savannah.gnu.org/bugs/?67372
[2] It also simplifies writing automated tests.
