man(7), the hyperlink tagging challenge, and what's a node? (was: ripgrep author seems happy with groff_man_style(7))

G. Branden Robinson Wed, 22 Jan 2025 18:48:27 -0800

At 2025-01-22T07:10:54+0100, Ingo Schwarze wrote:
> i'll try to focus on the most pertinent, most technical points,
> and only respond selectively to political arguments, mostly where
> i fear that misperceptions might arise.

And in this retitled thread I'll deal with the technical points.  In
another thread, if I make time for it, I'll respond to your and onf's
rebuttals to my historical narrative.  I think onf in particular assumes
that I'm making up historical events.  Not the case, and I've posted
supporting citations in previous excursions of mine to this list.

But that will have to wait, as it's simply opinionated argument as
opposed to a program for development of the man(7) macro package.
And of the GNU troff formatter, as we'll see near the end.

So with that out of the way, I'd like to thank you for taking the time
to respond in so much detail to my tentative thoughts in this area.  You
have practical expertise in the problem domain (or a closely related
one, at least) and I appreciate your sharing of it.

> G. Branden Robinson wrote on Tue, Jan 21, 2025 at 01:19:08PM -0600:
> > At 2025-01-21T16:05:18+0100, onf wrote:
> >> On Tue Jan 21, 2025 at 6:59 AM CET, G. Branden Robinson wrote:
> 
> [...]
> > I want to implement a means to run the formatter in a mode that will
> > dump all of the tags in a document--which need not require anything
> > but changes to "an.tmac"--in a format useful for another tool to
> > consume.  That tool will probably be less(1).
> 
> For what it's worth, that is exactly what mandoc(1) has been doing since
> July 2019, first released with mandoc-1.14.6 in September 2021.

I didn't realize it was that recent; that's 2 years after I joined groff
development.  This reforms my notions of mdoc(7) partisan chronology a
bit, but to say more will have to wait for that other thread.  ;-)

> Except that it does not require a different "mode" but that this always
> happens if and only if a pager is used and that pager is more(1)
> or less(1).

The "mode" I have in mind would be a behavioral artifact solely of the
man(7) package.  Nothing to do with the formatter per se.  Kind of like
my `EXTRACT` string idea,[1] or the existing `CHECKSTYLE` register
feature.

> It does not require the user to provide any new configuration or
> options.  It happens completely automatically and transparently
> whenever the user types "man pagename" or "apropos -a
> search_expression".

Right.  I'll need to expose a mechanism via the man(7) interface (albeit
**NOT** that used for authoring documents) so that the man(1) program
can exercise it.  (Again, see [1].)  mandoc(1) controls both ends of
that pipeline; groff does not.

> > I don't know what tag format it uses; even if it's ctags(1), that
> > shouldn't be a problem, because at the time the document is
> > formatted, it knows what the current output line number is.
> 
> Yes, the format less(1) uses and mandoc(1) produces is ctags(1),
> which is a line-oriented text file using the line format
> "tagname filename linenumber".  Even though the POSIX standard
> for ctags(1) permits "search expressions" as an alternative to
> line numbers, that's irrelevant for the present purpose precisely
> for the reason you point out: we know the line numbers, so using
> search expressions would be nothing but a waste of processor time.

Agreed; less(1) has no notion of addressing within a line (that I know
of), so being character-precise would be a waste of effort.
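
To make the format concrete, here is a minimal Python sketch of writing
and reading such a tags file.  It assumes only the "tagname filename
linenumber" line layout described above; the function names and the
sample path are hypothetical, and none of this is taken from mandoc(1),
less(1), or groff.

```python
def format_tag(name, filename, line):
    """Return one tags-file line in "tagname filename linenumber" form."""
    if line < 1:
        raise ValueError("line numbers are 1-based")
    return f"{name} {filename} {line}"


def parse_tag(text):
    """Split a tags-file line back into (name, filename, line).

    Assumes that neither the tag name nor the file name contains
    whitespace, which holds for the examples in this thread.
    """
    name, filename, line = text.split()
    return name, filename, int(line)


# Duplicate tag names are simply written twice, once per location.
tags = [("echo", 1260), ("echo", 1512)]
lines = [format_tag(name, "/tmp/man.example", line) for name, line in tags]
```

Nothing here needs a search expression; the formatter already knows
the output line number when it emits the tag.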

> [...]
> > Hence the value of automatic tagging.  There is the problem
> > that we don't want the human man page reader to have to key
> > in a fully qualified tag name all the time.
> 
> Try the opposite perspective.
> KISS the tags such that they are trivial to type in.

I agree with that objective--in fact I think it's mandatory for this
thing to ever see adoption by users--but had not finalized my idea of
how to get there.  Your messages are shedding much helpful light.

> > All they need is an unambiguous match.
> 
> No you don't need that.  Do NOT even try to make them unique.
> It won't work in practice because even within the same manual page,
> the same identifier can legitimately be defined at two different
> places, and at both in an authoritative manner.  A mathematician
> would be horrified to have two definitions for the same term -

I knew I was on the side of the angels...

> a practitioner, not so much:
> 
>   https://man.openbsd.org/ksh.1#echo
>   https://man.openbsd.org/ksh.1#echo~2
> 
> The first defines the "echo" builtin command in POSIX mode,
> the second in default mode.  For terminal output, you don't have
> the HTML restriction that anchors must be unique, so you get:
> 
>    $ grep echo /tmp/man.oyHNMTMyV6
>   echo /tmp/man.DNP7iW4MJE 1260
>   echo /tmp/man.DNP7iW4MJE 1512
> 
> I'd say from the end user perspective, that is more usable than
> with unique tags, and additionally does not suffer from the
> problem that editing the manual page might change the meaning
> of "echo~2".

Hmm, yes.  And for output to the terminal, I agree.  But your mandoc(1)
has "man -T html"...how do you solve the problem of ambiguous tags
_there_?  I'm going for a general solution; I'm trying to solve the
tagging/linking problem for HTML, PDF, and the terminal all at once.
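
Judging only from the two URLs quoted above, one possibility is that
repeated names get a "~N" suffix in HTML while the terminal tags file
keeps plain duplicates.  A Python sketch of that guess follows; the
function name is hypothetical, the suffix rule is my inference from
"echo" and "echo~2", and this is not taken from mandoc's source.

```python
from collections import Counter


def unique_anchors(names):
    """Map tag names, in document order, to unique HTML anchor names.

    The first occurrence keeps its plain name; the Nth occurrence
    (N > 1) becomes "name~N".  This does not guard against a tag that
    literally contains "~2"; a real implementation would have to.
    """
    seen = Counter()
    anchors = []
    for name in names:
        seen[name] += 1
        count = seen[name]
        anchors.append(name if count == 1 else f"{name}~{count}")
    return anchors
```

A scheme like this shows the fragility Ingo mentions: inserting a new
"echo" earlier in the page silently changes what "echo~2" refers to.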

> You have another example in ksh(1) defining the tag "h" three
> times, see below:  the \h escape in the $PS1 variable, the -h option
> to the "set" builtin command, and the -h option to the "test"
> builtin command.  In long manual pages, such cases aren't all
> that uncommon.

Agreed.

> > I don't know what less(1)'s facilities are for supporting this.
> 
> The less(1) program does not support partial tags or glob(7)ing
> or regular expression search for tags, neither in :t nor in -t,
> so unless you plan to become a less(1) hacker, you have no choice
> anyway.

Okay.  And while web browsers and PDF readers _could_ support
regex-based metadata searches, I've never seen such a thing in my
experiences as a non-power-user of either.

> > I also don't happen to know how ctags(1) got extended to support
> > C++ name spaces and other means of qualifying colliding identifier
> > names.  But if ctags (perhaps Exuberant ctags, given the original
> > ctags format's advanced age) got extended to cope with that problem,
> > presumably less(1) learned how to interpret the extension.
> 
> Not that i'm aware of anything like that, no.  Also, given that ctags(1)
> is mandated by POSIX and less(1) aims for portability (AFAIK),
> relying on extensions might be a bad idea.

Exuberant ctags's extensions are ubiquitous.  I hacked support for them
into my personal fork of mg(1),[2] and they later showed up in, I think,
troglobit's.[3]

> > Also, presumably mandoc(1) has solved the problem when it renders
> > multiple pages and more than one supports, say, an `-h` option.
> 
>   $ man -ak Fl=h

How man-db man implements support for tag-based searches is an open
question.  That's Colin Watson's project, and while IIRC I have a commit
bit, my status in that project is as junior as one can get.  :)

My objective is to design the macro package side of the feature so that
Colin's job as designer (recognizing that he might foist implementation
onto me) is as easy as possible.

man-db man(1) already sits on `-a` and `-k` so some other option letter
selection will have to be made.

There's no urgency for the man-db man(1) librarian to do anything at
first, beyond tell the man(7) package via the nroff(1) command where to
dump the output of ctags(1).

Back of the napkin:

  nroff -t -man -dTAGS=/tmp/user/whatever-man.$$

If one wants to select a subset of tags of interest, I guess that's
where the analogue of mandoc(1)'s "-ak" comes in.  I have no concrete
ideas about that yet, and I don't think such a feature needs to land
on day one.  Just getting tags at all and enabling, say, ":th" in
less(1) as shown in the next quote will be a win.

> and then typing ":th" and repeatedly pressing 't' gets me to:

[transcript of less(1) session elided]

> > Maybe Ingo (or a mandoc(1) power user) would like to save me the
> > trouble of researching these points?
> 
> Hope this helps.  :-)

It does!  Thank you very much.

> >> I agree it would be nice if one could link to subsections and, more
> >> importantly, terms within other manpages. As a matter of fact
> >> though, man(7) can't even tag terms within the same page.
> 
> > It's the same problem with the same solution, as I conceive it.
> 
> Not unreasonable in so far as all these tasks ought to be considered
> together, and the designed solution ought to be uniform and
> consistent.

Yes.

> However, the final design needs to contain syntactical and operational
> details that may differ depending on whether we are linking into the
> same or into another manual page.  Coming up with a good design is
> certainly hardest for deep linking into other pages - but also much
> less frequently needed in practice than some seem to think, which
> is the reason why mandoc does not support it yet.  There simply isn't
> enough demand to make it a priority, even though it may be desirable
> in a few cases.

I agree with this too.  It seems a much less pressing matter.

> > A recurring theme in my contributions to groff's man(7)/mdoc(7)
> > rendering has been to solve problems when rendering N pages at a
> > time, where N can be 1 but might be greater.
> 
> Actually, the mandoc program demonstrates that from a pure user
> perspective, a very straightforward solution is feasible that
> will look totally trivial to the user.
> 
> Then again, programs like makewhatis(8) in the mandoc package that run
> over lots of manual pages in sequence have been quite good at exposing
> memory leaks and bugs caused by leaking context data from one manual
> page to the next, and i rarely enjoyed the ensuing crashes and
> debugging efforts.

This sounds like the "fun" I enjoyed when getting "groff-man-pages.pdf"
(and its UTF-8 text counterpart) ship-shape.  Many odd corner cases and
oversights.

> Still, running commands like
> 
>    $ man -ak ~.
> 
> can be useful for finding bugs.  By the way:
> 
>    $ man -ak ~. | col -b | grep -c ^NAME
>   8801
>    $ time man -ak ~.
>     0m17.16s real     0m15.07s user     0m01.18s system
>    $ time doas makewhatis 
>     0m10.58s real     0m08.96s user     0m01.35s system
> 
> That's less than two milliseconds for rendering one manual page
> on average, and 1.2 milliseconds on average for the parsing
> involved in makewhatis(8) - which includes building mandoc.db(5)
> database files referencing the mdoc macro arguments in all the
> manual pages on the system.
> 
>  $ ls -alh /usr/*/man/mandoc.db
> -rw-r--r--  1 root  wheel   190K Jan 21 22:22 /usr/X11R6/man/mandoc.db
> -rw-r--r--  1 root  wheel   540K Jan 21 22:22 /usr/local/man/mandoc.db
> -rw-r--r--  1 root  wheel   2.3M Jan 21 22:22 /usr/share/man/mandoc.db
> 
> Those times were taken on my notebook that is almost 10 years old by
> now and still uses a rotating harddisk, not an SSD.
> 
> Try that with groff at your own peril.

How many separate Unix processes are involved in doing that, though? ;-)

nroff(1) and man(1), going back to early editions of CSRC Unix, are
built around the filter model and pipe(2).  It's long been known that
context switches are expensive, mode switches more expensive still
(though if I understand early Unix well enough, there wasn't yet a
difference between them), and that IPC has overheads relative to memory
access within one's address space.

So I don't think it's an apples-to-apples comparison.

User-perceived performance does, of course, matter.
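
For reference, the per-page averages quoted above can be rechecked from
the transcript's own numbers.  A quick Python check, using only the
figures shown there, and assuming that makewhatis(8) visited the same
8801 pages that "man -ak" rendered:

```python
pages = 8801          # count of NAME lines in the "man -ak ~." output
render_s = 17.16      # wall-clock seconds for "man -ak ~."
whatis_s = 10.58      # wall-clock seconds for "doas makewhatis"

# Average wall-clock cost per page, in milliseconds.
render_ms = render_s / pages * 1000
whatis_ms = whatis_s / pages * 1000

print(f"{render_ms:.2f} ms per page rendered")
print(f"{whatis_ms:.2f} ms per page parsed")
```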

(As an aside, I haven't forgotten about Deri's charge that my tag lookup
implementation causes a 9,000% slowdown or whatever it was.  I haven't
followed up because the reproduction procedure is complex.  It involves
reverting Git to some past state, then applying a patch,
then--maybe?--employing some changes to pdf.tmac that Deri hasn't
committed yet.  I can't remember all the details.  I'll get around to
it, but if someone has a reproducer that makes tag searches in _groff
Git HEAD_ slow as molasses, that will increase the temperature of the
fire under my hams.[4])

[postponing Ingo's response to my critique of mdoc(7) to another time]

> FWIW, i think you are doing a decent job maintaining the mdoc
> implementation in groff.  In rare cases, i tend to think "now
> that tweak verges overengineering" but i would classify those
> as minor differences in design style.

Thank you!  I monitor mandoc(1)'s cvsweb closely and can't remember when
I've seen a change that caused me concern.  If I ever do, I'll let you
know.

> Updating the groff port in OpenBSD causes a horrendous amount of work
> - so much so that i still didn't manage to update to groff-1.23 in the
> OpenBSD ports tree, mostly due to large amounts of subtle changes in
> behaviour between 1.22.4 and 1.23.  Before being able to update, i
> have to disentangle and classify all those changes as follows:
> 
>  1. Desirable effects of bugfixes and of changes making groff
>     behaviour saner or more consistent; some of these require
>     adjustments in mandoc, too, and/or maybe in the test suite.

I indeed hope that the great preponderance of problems are of this kind.

>  2. Trivial changes that aren't necessarily intentional, but
>     only affect unimportant edge cases such that they can stay.
>     A few of these may require adjusting mandoc, too, and
>     several require adjusting the test suite.

I'm curious to know about these if you'd care to record them.  They
might not be things I want to change or revert, but they might bring my
attention to consequences I hadn't considered, and maybe I can
anticipate similar impacts in the future.

>  3. Build system hiccups.  These typically require setting
>     configure or make(1) variables in the ports Makefile, but
>     diagnosing what exactly is needed is typically hard.
>     A few extreme cases might require patches.
>     Due to extensive use of GNU autotools and gnulib, this
>     class of problems is particularly large and annoying.

I can't claim credit/blame for restructuring groff around Autotools and
Gnulib, but I'll accept responsibility for not changing it.  For non-BSD
systems they deliver remarkably little hassle.

I should bring your attention to:

https://savannah.gnu.org/bugs/index.php?66518

I would very much like to get libgroff out of the portability library
business.

I have found Bruno Haible and Paul Eggert to be highly responsive and
easy to work with as regards portability problems in Gnulib.  I believe
they care about keeping it working on the BSDs.  I see commits from
Bruno with respect to FreeBSD 11, NetBSD 7.1, and OpenBSD 6.0 within the
past month (in time to make gnulib's 2025-01 stable tag).

I'd happily try {Free,Net,Open}BSD builds myself, but the FSF France's
compiler farm hosts for these OSes are down/unresponsive every time I
try them.

>  4. Regressions.  These require pushing bugfixes to groff,
>     plus patches in the ports tree until they get released.

I definitely want to know about these ASAP.  Or do you mean the ones
I've already noted and committed, kicking them up to "Important"
severity in Savannah?

>  5. Intentional changes that we do not want to have.
>     These require patches in the ports tree that may have
>     to stay even in the longer term.

I think I know one or two of the ones you have in mind, but I'd like to
know where I can look to stay apprised of these.  Even if I don't think
a distributor patch belongs upstream, it tells me important information
about the needs of the user community.  (And, regrettably, sometimes its
cluelessness.[5])

> Each of these classes contains several items.  While i have already
> worked through quite a few issues, there are still several issues that
> aren't even classified yet.
> 
> On OpenBSD, groff-1.23 and newer absolutely doesn't build out of
> the box.  Quite to the contrary, getting it to build at all is a
> serious challenge and getting harder and harder all the time, mostly
> due to the fact that pervasive use of gnulib totally cripples
> portability.  If gnulib would be thrown out entirely and groff
> would merely assume POSIX behaviour without providing any
> fallbacks, porting it would become massively easier, probably almost
> trivial.

...but a major reason we use gnulib is _to_ work around failures of
systems to be POSIX-conforming in myriad respects.

> I have done the following upgrades of groff in the OpenBSD ports tree:
> 
>  * from groff-1.15.4 to groff-1.21 on March 19, 2011
>  * to groff-1.22.2 on March 30, 2013
>  * to groff-1.22.3 on November 6, 2014
>  * to groff-1.22.4 on December 24, 2018
>  * now trying to get from there to groff-1.23.0
> 
> Of these five updates, the last one to 1.23.0 feels like by far
> the hardest, both in terms of behaviour changes and build system
> breakage all over the place, significantly harder even than the
> big leap from 1.15 directly to 1.21.

I'm not happy to hear that.  Apart from not using Gnulib or the GNU
build system, both of which I prefer to retain because I perceive them
as _relieving_ me of maintenance overheads (which is their reason for
existing in the first place), I'm keen to hear your suggestions for
things I can do that will reduce the pain.

> Don't take that as a complaint, though; i still appreciate your work.

Likewise!  Because mandoc(1) is conscientiously maintained, I make
frequent reference to it when advising man page authors/maintainers
regarding portability.  By contrast, specimens like Illumos troff are
seldom worth mentioning.  It's just Solaris 10 troff under a different
name, and its development community seems to have its focus utterly on
other aspects of the system.

> If there are lots of changes, we can at least be sure that work
> is being done!  :-D

I feel like I kill a lot of bugs.  Some, I didn't even create myself.

> >>>         Because mdoc(7) culture is rigidly prescriptive, its
> >>>         section headings are tightly controlled, and I expect that
> >>>         this problem only threatens when subsections are used (and
> >>>         referenced).
> 
> Not entirely accurate:
> 
>    $ man -cT lint ifconfig
>   [... no output whatsoever, not even a style warning ...]
>    $ echo $?
>   0
> 
> So even though the page contains lots of custom sections, there
> isn't a single warning.  For details of how that page looks like, see:
> 
>   https://man.bsd.lv/ifconfig.8

I don't think I was trying to assert anything about mandoc man(1)'s exit
status, just about the tenor of advice that man page authors using it
tend to receive.  And while I know I'm something of a man(7) style
prescriptivist myself, I think there are people farther along the axis
than I--like Alex Colomar.  :)

I don't need to be more Catholic than the Pope.  Just the Pope.  :-P
(For those who are blind to my self-effacing remarks--that's a joke.)

[trimming an exchange between you and onf]
> > Yes, some--not all--of those are unconventional.  I wouldn't say
> > "not standard" because we have no standard to which to point.  Just
> > conventions, some of which have been codified in style guides.
> 
> >> I think the point is more about sticking to conventional section
> >> names if possible than about forbidding non-standard ones.
> 
> > I think I have seen Ingo do the latter, but I could be mistaken.
> 
> That likely wasn't what i meant, i agree more with the former than
> with the latter, and do not want to "forbid" custom sections.
> 
> Maybe what you have in mind is that i abhor a few specific sections
> that are occasionally seen in the wild, most notably OPTIONS
> and NOTES.  Those are indeed always terrible style and deserve
> to be shot on sight, with no warning.  OPTIONS is usually the most
> important part of the DESCRIPTION and splitting it out is at best
> pointless, but usually causes disorganization.

I see your point but disagree.  I think I've made these points before
but I'll recapitulate.

1.  GNU and Linux programs have a tendency to efflorescence in the
    option department.  There are often way too damned many.  Given that
    bloat, as you might put it, a structured document _must_ respond,
    just as historical macro packages listed rosters of symbols
    (registers, strings, macros) at the end of the man page (if they did
    so at all).  Similarly, even a troff of 1976 dimensions needed lists
    of requests and escape sequences.  Maybe OpenBSD doesn't have this
    problem, though that must depend a lot on whether you'll reject a
    program from the ports collection on that basis alone.

    Getting the authors of every executable command that any GNU/Linux
    or POSIX system user can run to reëngineer their tools around a
    principle of slimming down the option list seems a hopeless task.
    Knowing I'll die long before I get people to fix any of the other
    problems I notice with their man pages is enough futility for me.

> NOTES is almost always the hallmark of a totally disorganized page.
> The authors failed to make up their mind which material logically
> belongs together, such that below NOTES, they randomly return to
> aspects that have been discussed before, but not discussed properly.

Heh, as you're probably aware, I use a "Notes" section myself, in 3 of
groff's pages.

1.  In nroff(1), because of the SGR problem.  A lot of the same people
    who insist that groff not produce SGR escape sequences will not ever
    think to look in grotty(1).  (Incidentally, this same page uniquely
    _lacks_ an "Options" section; I piloted your approach to see what I
    thought of it.)

2.  In groff_me(7), to explain the origin of the package's name.  The
    page is otherwise pretty terse and businesslike, and more an
    aide-mémoire than a true reference, so I saw no other good place to
    put it.

3.  In groff_man_style(7), as a kind of FAQ.  I won't apologize for
    this; I need it to dispel myths and confusion.  I _could_ call it
    something else; its present name seems good enough.

[more politics/religion/groff/BSD history elided]

At 2025-01-22T22:04:59+0100, Ingo Schwarze wrote:
> G. Branden Robinson wrote on Mon, Jan 20, 2025 at 11:59:41PM -0600:
> 
> > We need a design for automatic construction of
> > tag/anchor names from the user-specified names of the items to be
> > tagged.  In man(7) documents, those taggable items are probably going to
> > be:
> > 
> > 1.  the identifier of the page itself, with "section" number;
> > 2.  section heading text;
> > 3.  subsection heading text; and
> > 4.  the tag text of tagged paragraphs (`TP`).
> 
> In addition to those, mandoc(1) also tags the tag text of .IP and .TQ.

Ah!  I forgot to mention `TQ`.  Yes, it's completely my intention to
treat `TQ` the same as `TP` in this respect.

But not for `IP`.  That macro seems, somewhat preponderantly, to already
be getting used as a non-semantic, or differently semantic, device, to
mark lists with symbols or enumerators.  (It's more structural than
semantic, we might say.)  That's good.  I want to encourage and
reinforce that practice.

> But for all these, it does some sanitation before creating a tag,
> see this comment in the file man_validate.c:
> 
> /*
>  * Skip leading whitespace, dashes, backslashes, and font escapes,
>  * then create a tag if the first following byte is a letter.
>  * Priority is high unless whitespace is present.
>  */
> 
> The "letter" condition is needed because .IP and .TP are also used
> for bullet and numbered lists, and in those cases, the tag is often
> something like "-", "\(en", "*", "\(bu", "1.", "2)" etc. which
> we clearly don't want to tag.

Yes.  I think I'll dispose of a lot of non-tag tags without trouble with
my `IP`/`TP` dichotomy.  I plan to meet complaints about superfluous or
missing tags with advice to use the correct macro for the purpose.
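
To illustrate the rule in the comment Ingo quotes, here is a Python
sketch of that kind of filter.  It is only my reading of the comment,
not a port of man_validate.c: the function name, the regular
expression, the decision to end the tag at the first whitespace or
backslash, and the "high"/"low" labels are all assumptions of mine.

```python
import re

# Leading material to skip: font escapes such as \fB, \f(CW, and \f[B],
# plus whitespace, dashes, and backslashes.
_LEADING = re.compile(r"(?:\\f(?:\[[^\]]*\]|\(..|.)|[\s\\-])*")


def make_tag(text):
    """Return (tag, priority) for a paragraph tag, or None if untaggable."""
    rest = text[_LEADING.match(text).end():]
    # Only tag when the first remaining character is an ASCII letter;
    # this rejects bullets, "\(en", and enumerators such as "1." or "2)".
    if not rest or not (rest[0].isascii() and rest[0].isalpha()):
        return None
    # Assumption: the tag ends at the first whitespace or backslash.
    tag = re.match(r"[^\s\\]+", rest).group()
    # Priority is high unless whitespace is present in what remains.
    priority = "low" if re.search(r"\s", rest) else "high"
    return tag, priority
```

With a strict `IP`/`TP` split, most of the rejected inputs would never
reach such a filter in the first place, since list markers would arrive
via `IP`.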

Along those lines, here's an item from the forthcoming 1.24.0 NEWS.
Actually two of them.

NEWS:
*  The an (man) macro package now supports a `TS` register to configure
   the minimum space required between the tag of a `TP` paragraph and
   its body.  (If the width of the tag's formatted text plus this space
   exceeds the paragraph indentation, the line is broken after the tag.)
   This parameter, formerly hard-coded as `1n`, now defaults to `2n`.

*  The an (man) macro package's `IP` macro no longer honors the formerly
   hard-coded 1n tag separation noted in the previous item.  This means
   that the first argument to the `IP` macro can abut the text of the
   paragraph with no intervening space.  If you use a word instead of
   punctuation or a list enumerator for `IP`'s first argument, consider
   migrating to `TP`.

I haven't brought this to your attention before now because I expected
you to:

A.  Ignore the `TS` register entirely; and either,
B1. Go along with my tweak to indentation, or
B2. Not, overriding it with a patch.

And none of these causes me any concern.

> > A.  Generation of _unique_ hyperlink tags from #2-#4 above.
> 
> Don't, just don't, for the reasons explained in my other mail.

I'll be giving this considerable thought.  As noted above I want to know
how you solve the unique-tag problem in HTML and PDF.  I acknowledge
that solving it for terminal output doesn't seem worth doing.

> [...]
> > C.  We then need a way to make references to these anchors/tags.
> 
> Please do not rush like that.  As explained in my other mail, that's 
> actually only useful in surprisingly rare cases.  Better first
> get the simpler case of local jumping well-designed and stable
> before progressing to the much, much harder next stage of non-local
> jumping.

Yes, this sounds hard and I haven't "scheduled" it mentally.  "Deep
links" have come up before on this list, years ago even.  Seems like a
tough problem with fragile solutions.

> >     For man(7) the `MR` macro new to groff 1.23 was an obvious site
> >     to add the appropriate machinery for document-level links.
> >     mdoc(7)'s `Xr` is closely analogous and has existed for many
> >     years.
> 
> Yes, both have almost identical semantics and are a likely candidate
> for extension, if we come to the conclusion an extension is needed.
> I didn't consider the details yet, though.

I think you were discussing `SX`/`Sx` here?  As I understand it, `MR`
support is already in mandoc(1) CVS (but not released), as of course is
`Xr`, for some years.

Also, you can't put a price on the pleasure of introducing a macro named
`SX`.  It pleases the 13-year-old in everyone.

> >     i.  No way to hyperlink in a more fine-grained way, that is to
> >     (sub)section headings or, conceivably, to paragraph tags.  This
> >     is a tougher problem because if these are not unique within a
> >     page, the location making the link has to know about the
> >     structure of the document.  Possibly, we'll just punt on the
> >     issue of "deep" cross-document links.
> 
> Punt for now, yes; maybe we can find a good solution later,
> when the easier parts are done.
> 
> One possible solution is to just ask authors to engage their brain
> before deep linking.

Don't take _that_ strategy to the racetrack...

> It should be fairly obvious that deep linking to the tag "h" in the
> ksh(1) manual page is a stupid idea.  Even if there weren't three
> instances of that tag already (for three completely different
> features), everybody will expect that more such instances can easily
> pop up at any time and make your shiny new link point into the woods.

While deep structural links risk encouraging stagnant man page
structures, deep unstructured links will promote retention of features.
And there are enough forces favoring the latter.  Our industry has a bad
problem with not throwing old cruft away.

> On the other hand, linking to the tag "CIPHER_LISTING" or the
> tag "EVP_get_cipherbyname" in EVP_EncryptInit(3) is almost
> certainly fine because it's hard to imagine a scenario where
> those tags might become ambiguous in the future, see
> 
>   https://man.openbsd.org/EVP_EncryptInit.3
>   https://man.openbsd.org/EVP_EncryptInit.3#CIPHER_LISTING
>   https://man.openbsd.org/EVP_EncryptInit.3#EVP_get_cipherbyname
> 
> In any case, it's important that the tag names exactly match
> the actual syntax elements, such that users can type them
> without any prior knowledge.  Invented or constructed tag names
> are next to useless.

I agree.  We can't expect page authors to have precise knowledge of how
other people's pages are structured, or if they do, for that knowledge
to remain time-invariant.

Using English instead of machinery might remain our most robust
technology.  "See the “-h” option in ksh(1)."

> I'm not sure you have exhaustively analyzed cross-document linking,
> mostly because i definitely haven't analyzed cross-document linking
> in manual pages exhaustively myself.

I'm confident you're right about how deeply we've respectively pondered it.

> But i'm aware of at least two aspects you maybe missed:
> 
>  1. While you discussed tag generation (incompletely),
>     tag format (incompletely and IMHO in part misguided)
>     and link display, the purpose of a link is being followed.
>     As the first step, that requires the user to select the
>     link.  For HTML output, it is obvious how that works:
>     in a graphical browser, click the link with the mouse or
>     navigate to it with the keyboard (the latter probably
>     being the method of choice if you are using a screen
>     reader - though i'm not sure because i'm not blind and
>     have not talked that much to blind users).  In a text
>     browser navigate with the keyboard.
>     How is the user supposed to select a link in less(1)?
>     That looks like a problem requiring considerable design
>     and implementation effort even if you are a less(1) hacker.

I think we have most of the machinery for this.

For URLs (and email addresses), it's already in place.  We use OSC 8.

If less(1) strips that (it doesn't) or the terminal doesn't support it,
the user doesn't get hyperlinks.  Okay, so they're no worse off than
before.

Man page cross references work too, today (well, in groff Git) by dint
of having an URL scheme to represent them.

Now, deep links into a man page?  No.  No support.  But if we solve the
tag generation and format problem, the solution to this part should be
straightforward.  Right?

>  2. The purpose of selecting a link is displaying the target.
>     For HTML output, it's obvious how that works because that's
>     what hypertext was designed for in the first place:
>     when a link is selected, close the current document and
>     open the target one, or optionally open the target in a
>     new tab or window if the browser and/or window manager
>     support that and the user wants it.
>     For mandoc(1), implementing a selection mechanism would
>     actually not be all that difficult.  When the user selects
>     a link, mandoc can simply close the current file, look
>     up the desired target in the mandoc.db(5) database to
>     retrieve the file system path to the desired manual page
>     source file in the file system, open that file, parse it,
>     generate a new tags file from it, format it, and spawn
>     a new pager process passing the file names of both
>     temporary files.  Really not rocket science at all.
>     But groff does not know about mandoc.db(5), so even when it
>     knows that it is looking for the "h" tag in the ksh(1)
>     manual page, it will have a very hard time figuring out
>     where in the file system to look for the file "ksh.1"
>     (if that is even the name in the file system!).  Once
>     it has the file, it can maybe do the parsing and formatting
>     to produce the two files, though i'm not sure because so
>     far, i don't think it has infrastructure to manage
>     temporary files for such purposes.  And then what?
>     Spawn less(1)?  At least so far, groff(1) never does that.
>     If all this were solved, wouldn't that make the man-db
>     package obsolete?  Do you really feel that close to
>     obsoleting man-db, or incorporating it into groff?

I have no intention or desire to make man-db obsolete.  I think the
separation of its concerns from the formatter is sound, and I don't look
to disrupt it.

Solving the foregoing problem is something I may be able to help with,
but primary responsibility will lie elsewhere, and no doubt take level
of demand into account.

>  3. Then there is the following particularly interesting
>     special case.  The mandoc implementation of man(1)
>     already supports the command
> 
>       $ man EVP_get_cipherbyname
> 
>     even though there is no file EVP_get_cipherbyname.3 anywhere
>     in the filesystem.  It opens the manual page EVP_EncryptInit(3)
>     at the top, which documents EVP_get_cipherbyname further down.
>     Traditional man(1) implementations like BSD man, Eaton man,
>     man-1.5, man-1.6, and man-db support essentially the same with
>     symbolic or hard links or one-line files containing .so requests
>     on the file system level.

...or, if the page names the symbol in its "Name" section, by
indexing the terms with makewhatis(8) or mandb(8).  This is technology
of long tooth, but perhaps is not that well understood by man page
readers (people) who are familiar with the `so` and link techniques.

groff_man_style(7) (section "Notes", no less ;-) ):

     • What’s the difference between a man page topic and identifier?

       A single man page may document several related but distinct
       topics.  For example, printf(3) and fprintf(3) are often
       presented together.  Moreover, multiple programming languages
       have functions named “printf”, and may document these in a man
       page.  The identifier is intended to (with the section) uniquely
       identify a page on the system; it may furthermore correspond
       closely to the file name of the document.

       The man(1) librarian makes access to man pages convenient by
       resolving topics to man page identifiers.  Thus, you can type
       “man fprintf”, and other pages can refer to it, without knowing
       whether the installed document uses “printf”, “fprintf”, or even
       “c_printf” as an identifier.
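
The resolution step described in that excerpt can be sketched as a
lookup table.  This is a toy model in Python, not what man-db or
mandoc(1) actually do; the `INDEX` contents and the `resolve` function
are hypothetical, chosen to mirror the examples in this thread.

```python
from typing import Optional

# Hypothetical index: (topic, section) -> page identifier.
# A real librarian builds this from "Name" sections, links, or `so` files.
INDEX = {
    ("printf", "3"): "printf",
    ("fprintf", "3"): "printf",
    ("EVP_EncryptInit", "3"): "EVP_EncryptInit",
    ("EVP_get_cipherbyname", "3"): "EVP_EncryptInit",
}


def resolve(topic: str, section: Optional[str] = None) -> Optional[tuple[str, str]]:
    """Return (identifier, section) for `topic`, or None if it is not indexed.

    With no section given, the first match in sorted key order wins.
    """
    for (name, sec), identifier in sorted(INDEX.items()):
        if name == topic and (section is None or sec == section):
            return identifier, sec
    return None


if __name__ == "__main__":
    print(resolve("fprintf"))
    print(resolve("EVP_get_cipherbyname", "3"))
```

The reader types a topic; only the librarian needs to know which
identifier (and hence which file) documents it.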

>     In mandoc, it would be trivial to make
>     "man EVP_get_cipherbyname" jump straight to the location of
>     https://man.openbsd.org/EVP_EncryptInit.3#EVP_get_cipherbyname
>     even in terminal output.  Is that desirable?  Likely not.
>     Does that mean "EVP_get_cipherbyname" is a tag like any
>     other even in the page also known as EVP_get_cipherbyname(3)?
>     Likely neither.  For example, it might be useful for less
>     to assume in that case that the user typed
> 
>       /EVP_get_cipherbyname<ENTER>g
> 
>     To search for the target function name such that it gets highlighted
>     in the text, then return to the top with the less(1) 'g' command.
>     I didn't really think about that yet.  It seems like that
>     will also need careful consideration and design.  How does that
>     (still completely unexplored) picture change with deep linking?

While we don't want to put every tag into the "Name" section so that it
will be indexed by the librarian--indexing all `-h` command line options
would be worse than useless--I think every symbol in a C API should be.
Yes, even macros and non-function objects.  These are "first-class"
concerns just like function calls.  Part of your interface?  Document
it.  Over the past 15 months or so I've been steering the ncurses man
pages in this direction (along with _many_ other changes), to no
resistance as it had already started down this path.  As often happens,
I notice inconsistencies and then cannot rest until I've rectified them.

>  4. I almost certainly did not find all design gaps.
> 
> So i suspect before this can become useful in practice, there
> is still some very serious design work that needs to be done.

Sure.  But once you've pre-worried everything you can on the whiteboard,
it's time to prototype and discover what you failed to think of
beforehand.  ;-)

That said, deep linking is still in the pre-worry stage, I think.

At 2025-01-22T23:04:33+0100, Ingo Schwarze wrote:
> > Possibly I'll formally propose an `SX` macro for man(7) at some point.
> 
> You mean that like mdoc(7) .Tp, not like mdoc(7) .Sx, right?

Uh, groff mdoc(7) doesn't have `Tp`.

Uh.  I don't see it in mdoc(7) from mandoc 1.14.6-1 on my Debian system
either.

What is it?

> > It's not a high priority, nor in the near future because the automatic
> > tagging problem is more fundamental and more important;
> 
> I strongly agree.  For exactly that reason, mandoc(1) supports
> automatic tagging since July 2015 and manual tagging via .Tg only
> since January 2020.  I allowed automatic tagging to mature for half a
> decade before finally deciding that in rare cases, supplementing it
> with manual tagging can be useful.
> 
> After five years of manual tagging being available, nine out of the
> 3472 manual page files in the OpenBSD manual page tree now use manual
> tagging:
> 
>    $ grep -Fl .Tg */*.[1-9]       
>   man1/man.1
>   man1/openssl.1
>   man1/tmux.1
>   man4/ddb.4
>   man5/bgpd.conf.5
>   man7/mdoc.7
>   man8/ifconfig.8
>   man8/pfctl.8
>   man8/route.8
>    $ ls */*.[1-9] | wc -l
>   3472
> 
> So as expected, demand isn't exactly overwhelming: about three
> permille.

Good stuff!  The man page authorship community simulation engine in my
head told me that people would seldom want to bother with manually
specifying tag names/"anchors".  Thank you for presenting empirical
evidence.

> > with it, one could automatically generate a hyperlinked multi-level
> > table of contents for any man(7) document, with no kludges.
> 
> You mean, like
> 
>   https://man.bsd.lv/mdoc.7
>   https://man.bsd.lv/pf.conf.5

Yes.

>   https://man.bsd.lv/openssl.1
>   https://man.bsd.lv/ddb.4
>   https://man.bsd.lv/tmux.1
>   https://man.bsd.lv/bgpd.conf.5
>   https://man.bsd.lv/ifconfig.8

No.  These aren't multi-level.  Even poor maligned grohtml(1) does this
much, and has done it for longer.  ;-)

> > That feature seems, by dint of having seen it done in ad hoc ways by
> > man-to-HTML converters, much more in demand than document-directed ad
> > libitum referencing at a finer-grained level than an entire man(7)
> > or mdoc(7) document.
> 
> TOCs more in demand than cross-document deep linking?
> Yes, i actually agree, that's how i remember interactions with users
> as well.
> 
> All the same, do not overestimate the demand.  Not everybody likes
> or wants TOCs.  For example, Theo de Raadt and some other OpenBSD
> developers hate TOCs so much that they got vetoed from and
> disabled on man.openbsd.org - even though they are completely
> supported by the software running there and only appear in documents
> containing at least two non-standard section titles, which is a tiny
> fraction of documents.

Ah, there's the Theo I've heard so much about.  ;-)

> [...]
> > My plan for resolving _that_ problem is to introduce a string
> > sanitizer, probably in a new macro file "string.tmac", which people
> > can use for common operations on strings.
> 
> You mean, like the recursive function deroff() in
> 
>   http://cvsweb.bsd.lv/~checkout~/mandoc/roff.c?rev=HEAD

My chest lacks the hair to take all that in in the (not so) brief time
I've allotted to composing this email, so...I don't know?

> That one is an internal mandoc API and cannot be called from
> documents, but the purpose is similar: break down a tree of nodes to a
> plain string.
> 
> It is used:
>  - by the mdoc(7) validator to break down the content of the .Os
>    and the first .Nm macro and the heads of .Sh and .Ss macros
>  - by the HTML formatter for section and subsection titles
>  - by the man(7) parser of makewhatis(8) to break down the NAME section
>    and the content of .Nd and .Va macros
> 
> > Little or none of this is anything mandoc will ever have to care
> > about.
> 
> As you see above, mandoc does encounter some of these tasks as well.
> 
> Some of what you said above (not necessarily all of it) sounds as if
> you are about to expose internal implementation details to the end
> user.

Not any more than they already are, or at least not much.  Also, there
may be a subtle difference between the ways we're using the word "node"
here.  They're analogous in most but not all ways, and the difference is
what's important here, because that difference is entangled in GNU
extensions to the troff language (from long ago) that, fortunately, are
not directly involved in man page composition.  (In other words, man
page authors should not ever touch this stuff in their documents.)

Long story short, complicated things happen when one uses diversions, a
feature mandoc(1) doesn't support.

The long version?

[ ---------------- begin advanced *roff user material ---------------- ]

groff.texi:

5.29 Diversions
===============

In 'roff' systems it is possible to format text as if for output, but
instead of writing it immediately, one can "divert" the formatted text
into a named storage area.  It is retrieved later by specifying its name
after a control character.  The same name space is used for such
diversions as for strings and macros; see *note Identifiers::.  Such
text is sometimes said to be "stored in a macro", but this coinage
obscures the important distinction between macros and strings on one
hand and diversions on the other; the former store _unformatted_ input
text, and the latter capture _formatted_ output.  Diversions also do not
interpret arguments.  Applications of diversions include "keeps"
(preventing a page break from occurring at an inconvenient place by
forcing a set of output lines to be set as a group), footnotes, tables
of contents, and indices.  For orthogonality it is said that GNU 'troff'
is in the "top-level diversion" if no diversion is active (that is,
formatted output is being "diverted" immediately to the output device).
The top-level diversion has no name.

...

 -- Request: .asciify div
     "Unformat" the diversion DIV in a way such that Unicode basic Latin
     (US-ASCII) characters, characters translated with the 'trin'
     request, space characters, and some escape sequences that were
     formatted and diverted into DIV are treated like ordinary input
     characters when DIV is interpolated.  Doing so can be useful in
     conjunction with the 'writem' request.

     'asciify' can be also used for gross hacks; ...
...

     'asciify' cannot return all items in a diversion to their source
     equivalent: nodes such as those produced by the '\N' escape
     sequence will remain nodes, so the result cannot be guaranteed to
     be a pure string.  *Note Copy Mode::.  Glyph parameters such as the
     type face and size are not preserved; use 'unformat' to achieve
     that.

 -- Request: .unformat div
     Like 'asciify', unformat the diversion DIV.  However, 'unformat'
     handles only tabs and spaces between words, the latter usually
     arising from spaces or newlines in the input.  Tabs are treated as
     tokens, and spaces become adjustable again.  The vertical sizes of
     lines are not preserved, but glyph information (font, type size,
     space width, and so on) is retained.

[ ----------------- end advanced *roff user material ----------------- ]

So, in the sense I'm talking about, a "node" is a thing that you can
encounter in a macro or, more importantly for our purposes, a string,
that is essentially unrepresentable.  It's there, but you can't do much
with it.

There's a lot of undefined behavior around this phenomenon when you
start exploring it.

We have a `length` request.  Does a "node" contribute to the length of a
string?  Undefined.  (Strictly, what `length` measures is not a string
but a string expression that I have no better name for than "contents".)
And the reason we don't encounter the term "string expression" or even
"string literal" in *roff documentation is that the data type exists
only in a rudimentary and limited form.

If I take a substring of a string, which is bounded by indices into it,
do I get any nodes contained in the range?  Undefined.  How does the
presence of a node at or adjacent to one of the boundary indices affect
the operation?  Undefined.

Now, these nodes _are_, in fact, pretty much what you'd expect from the
name, elements in an abstract syntax tree (more precisely, in AT&T
and GNU troffs, an abstract syntax list of lists).  But the language has
very little visibility into them, mostly for good reasons.
Unfortunately they can cause trouble, as noted above, and GNU troff
implemented some features to try to manage them.

For the forthcoming groff 1.24.0, I've added a `pline` request that
makes them more visible.  This doesn't really expose them any further to
the language proper; all `pline` does is dump the list of pending output
nodes.  I implemented this because I needed it for debugging, and it
rapidly occurred to me that it would (a) be useful to other (relatively
ambitious) groff users who were troubleshooting problems, and (b) give
those same users a fighting chance to concretize this nebulous and
undefined term "node" that we toss off casually in the more difficult
sections of the manual.

Things often get a lot simpler if you can just show people examples.

$ printf "Hi, Ingo.\nDon't \\%%hyphenate that.\n.pline\n" | nroff -z
{type: line_start_node, diversion level: 0},
{type: glyph_node, character: "H", diversion level: 0},
{type: glyph_node, character: "i", diversion level: 0},
{type: glyph_node, character: ",", diversion level: 0},
{type: word_space_node, diversion level: 0},
{type: glyph_node, character: "I", diversion level: 0},
{type: glyph_node, character: "n", diversion level: 0},
{type: glyph_node, character: "g", diversion level: 0},
{type: glyph_node, character: "o", diversion level: 0},
{type: glyph_node, character: ".", diversion level: 0},
{type: word_space_node, diversion level: 0},
{type: glyph_node, character: "D", diversion level: 0},
{type: glyph_node, character: "o", diversion level: 0},
{type: glyph_node, character: "n", diversion level: 0},
{type: glyph_node, character: "'", diversion level: 0},
{type: glyph_node, character: "t", diversion level: 0},
{type: word_space_node, diversion level: 0},
{type: hyphen_inhibitor_node, diversion level: 0},
{type: glyph_node, character: "h", diversion level: 0},
{type: glyph_node, character: "y", diversion level: 0},
{type: glyph_node, character: "p", diversion level: 0},
{type: glyph_node, character: "h", diversion level: 0},
{type: glyph_node, character: "e", diversion level: 0},
{type: glyph_node, character: "n", diversion level: 0},
{type: glyph_node, character: "a", diversion level: 0},
{type: glyph_node, character: "t", diversion level: 0},
{type: glyph_node, character: "e", diversion level: 0},
{type: word_space_node, diversion level: 0},
{type: glyph_node, character: "t", diversion level: 0},
{type: glyph_node, character: "h", diversion level: 0},
{type: glyph_node, character: "a", diversion level: 0},
{type: glyph_node, character: "t", diversion level: 0},
{type: glyph_node, character: ".", diversion level: 0}
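
A dump like this is regular enough for another tool to consume.  The
following Python sketch parses the format exactly as shown above and
recovers the plain text; it assumes the line layout stays as it is in
this pre-release output, and it makes no attempt to handle a glyph whose
character is itself a double quote.  The names `parse_pline` and
`plain_text` are hypothetical.

```python
import re

# One node per match; the "character" field appears only on glyph nodes.
NODE_RE = re.compile(
    r'\{type: (?P<type>\w+)'
    r'(?:, character: "(?P<char>.*?)")?'
    r', diversion level: (?P<level>\d+)\}'
)


def parse_pline(dump: str) -> list[dict]:
    """Turn the text of a `pline` dump into a list of node records."""
    return [
        {"type": m["type"], "char": m["char"], "level": int(m["level"])}
        for m in NODE_RE.finditer(dump)
    ]


def plain_text(nodes: list[dict]) -> str:
    """Keep glyphs, turn word spaces into blanks, drop every other node."""
    out = []
    for node in nodes:
        if node["type"] == "glyph_node":
            out.append(node["char"])
        elif node["type"] == "word_space_node":
            out.append(" ")
        # line_start_node, hyphen_inhibitor_node, etc. carry no text.
    return "".join(out)
```

Note that `plain_text` silently drops the hyphen inhibitor node, which
is one policy among several; whether that is always right is the
question the rest of this message worries about.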

Things like the "hyphen inhibitor node" cause headaches when they appear
in groff strings, because ambitious macro package writers populate
strings with section headings or a list of authors' names and then
blithely assume that they can both _format_ that string and spit it out
as part of a device extension command to add document metadata or inform
the output driver of the content of a hyperlink.

To be fair, the hyphen inhibitor node is a pretty easy case.  If it
appears anywhere other than a text formatting context, you can ignore
it.  Others are harder.  What about vertical motions?  Sure, discard 'em.
Probably.  If your Japanese PDF viewer normally renders text in a
vertical orientation, I'm not so sure.  What about horizontal motions?
Don't assume an answer--we've already hit a point where Deri and I
disagree.

The problems only get worse, and there may not always be a single
correct answer.  But I don't have to pre-worry them all.

That's why I want a string iterator and well-defined operations to
identify nodes so that they can be stripped from strings, without
bespoke formatter features and without hacks.
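
To make the idea concrete, here is a toy model in Python of a string
whose contents mix ordinary characters with opaque nodes, plus an
iterator and a stripping operation over it.  Nothing here reflects GNU
troff's internals or any planned "string.tmac" interface; `Node`,
`iterate`, and `strip_nodes` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Iterator, Union


@dataclass(frozen=True)
class Node:
    """An opaque item with no character representation."""
    kind: str  # e.g. "hyphen_inhibitor_node"


Item = Union[str, Node]  # one ordinary character, or one node


def iterate(contents: list[Item]) -> Iterator[tuple[int, Item]]:
    """Yield (index, item) pairs so a caller can inspect each item in turn."""
    yield from enumerate(contents)


def strip_nodes(contents: list[Item]) -> str:
    """Return only the ordinary characters, discarding every node."""
    return "".join(item for _, item in iterate(contents) if isinstance(item, str))


if __name__ == "__main__":
    heading = list("Don't ") + [Node("hyphen_inhibitor_node")] + list("hyphenate")
    print(len(heading))               # counts the node as an item
    print(len(strip_nodes(heading)))  # counts characters only
    print(strip_nodes(heading))
```

The two lengths differ by one, which is the `length` ambiguity from
earlier in miniature: the answer depends on whether a node counts, and
a defined iterator lets the caller decide instead of guessing.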

Will we ever need or want type-aware node operations in the groff
language?  Good grief, I hope not.  But the mere existence of nodes in
bona fide language objects has already created pain--esoteric pain that
produced diagnostic messages that no one on this mailing list
understood.  (Or if they did, they chose not to share knowledge.)

> If you do that, some fool out there will sooner or later use such
> internals, eventually even in manual pages, and then mandoc(1) may
> indeed be forced to deal with it.

I don't expect to expose much more than is already present.  And I don't
want people writing their own string iterators and such into their man
pages.  Even with the conveniences I aim to add, I expect there'll be
way too much to learn for most man page authors to bother.

I had considered adding a conditional expression operator to disclose
whether an operand was a "node" (as opposed to a character, be it
ordinary or special), but I now think that, if I proceed with my plans
to add other conditional operators to sort out the various _kinds_ of
character (primarily, "user"-defined versus font-provided), then I'll
have what I need.

.if c \*[foo] \" true if it's a glyph OR character (since ~1991)
.if C \*[foo] \" true if it's a user-defined character (`.char` et al.)
.if G \*[foo] \" true if it's a font-defined glyph
.if !c \*[foo] \" none of the above--it's empty or it's a node

When I say that a lot of the problems I'm working on in groff are
entangled with others, this is the sort of thing I mean.

Regards,
Branden

[1] https://lists.gnu.org/archive/html/groff/2024-11/msg00036.html
[2] https://github.com/g-branden-robinson/mg/tree/gbr
[3] https://github.com/troglobit/mg
[4] I have an idea for my own reproducer, so I'm not _blocked_.  But it,
    too, will take a bit of time and effort to craft.
[5] https://gitlab.com/procps-ng/procps/-/merge_requests/213/diffs?commit_id=a3ac4b667929320d4c8012435d63a9d1dd538a8d
