Re: man(7), the hyperlink tagging challenge, and what's a node?

Ingo Schwarze Thu, 23 Jan 2025 22:47:49 -0800

Hi Branden,

G. Branden Robinson wrote on Wed, Jan 22, 2025 at 08:47:20PM -0600:
> At 2025-01-22T07:10:54+0100, Ingo Schwarze wrote:

>> It does not require the user to provide any new configuration or
>> options.  It happens completely automatically and transparently
>> whenever the user types "man pagename" or "apropos -a
>> search_expression".

> Right.  I'll need to expose a mechanism via the man(7) interface (albeit
> **NOT** that used for authoring documents) so that the man(1) program
> can exercise it.  (Again, see [1].)  mandoc(1) controls both ends of
> that pipeline; groff does not.

Correct.  That doesn't mean that mandoc is monolithic, though.
It does have internal structure, albeit all inside the same
single-threaded, completely linear process.

 1. The steering program parses command line options and configuration
    files, consults zero or more mandoc.db(5) databases, produces a
    list of input files to process, and opens the required temporary
    files for writing, if any are needed.
 2. For each top-level input file, the following steps happen in turn.
    First, the parsers generate the syntax tree.
 3. After that, the appropriate validator for the input language
    used in the file validates and in some respects normalizes and
    transforms the syntax tree and collects tagging information.
 4. After that, the appropriate formatter for the desired (input
    language, output format) pair adds the formatted output to one
    of the temporary files and the tags to the other temporary file.
 5. Continue with step 2 until all input files have been processed.
 6. The steering program frees all parser resources.
 7. The steering program spawns the pager and waitpid(3)s for it.
 8. The steering program unlink(2)s the temporary files.

Items 1 and 5-8 (done by the mandoc steering program, main.c)
would probably be the job for man-db.  Items 2 and 3 clearly
need to be done by troff(1) and the macro packages, maybe even
including appending to the tags file.  Item 4 is the job for grotty(1).

So it appears man-db will likely have to generate the two filenames
with mkstemp(3), passing them both to both troff and less
and cleaning them up after the pager exits.
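
The temporary-file lifecycle described above can be sketched roughly
as follows.  This is only an illustration under assumptions, not
man-db or mandoc code: the function name view_with_temp_files is
made up, and so is the convention of appending the two filenames as
the last two arguments of both commands.  How troff and less would
actually receive the names is exactly what still needs to be
designed.

```python
import os
import subprocess
import sys
import tempfile


def view_with_temp_files(formatter_cmd, pager_cmd):
    """Create two temp files, run formatter and pager, then clean up.

    Hypothetical convention: both commands get the output file name
    and the tags file name appended as their last two arguments.
    """
    out_fd, out_path = tempfile.mkstemp(prefix="man.out.")
    tag_fd, tag_path = tempfile.mkstemp(prefix="man.tags.")
    os.close(out_fd)
    os.close(tag_fd)
    try:
        # Formatting step: must fill both files before the pager starts.
        subprocess.run(formatter_cmd + [out_path, tag_path], check=True)
        # Viewing step: wait for the pager to exit before cleaning up.
        status = subprocess.run(pager_cmd + [out_path, tag_path]).returncode
    finally:
        for path in (out_path, tag_path):
            os.unlink(path)
    return status, out_path, tag_path


if __name__ == "__main__":
    # Stand-ins for the real formatter and pager, for demonstration only.
    fake_formatter = [sys.executable, "-c",
                      "import sys\n"
                      "open(sys.argv[1], 'w').write('formatted text\\n')\n"
                      "open(sys.argv[2], 'w').write('tag file\\n')\n"]
    fake_pager = [sys.executable, "-c",
                  "import sys\n"
                  "print(open(sys.argv[1]).read(), end='')\n"]
    view_with_temp_files(fake_formatter, fake_pager)
```

The try/finally makes sure the files get unlinked even when the
formatting step fails, which matches the requirement that the
cleanup happens after the pager exits and not before.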

[...]
> Agreed; less(1) has no notion of addressing within a line (that I
> know of), so being character-precise would be a waste of effort.

Character-precise positioning would also be silly from a human
perspective.  Humans (at least when they are fluent readers, which
can probably be assumed for most of the audience of manual pages)
do not parse words character-by-character but use complex pattern
recognition to instantly recognize whole words or even many
words at once.  So a line of text is hardly longer than what a
human reader instantly recognizes anyway (except when using insane
line lengths like 100-200 characters; i never understood how people
who use such wide terminals manage to keep track of which line they
are currently reading in the first place).

[...]
> But your mandoc(1)
> has "man -T html"...how do you solve the problem of ambiguous tags
> _there_?  I'm going for a general solution; I'm trying to solve the
> tagging/linking problem for HTML, PDF, and the terminal all at once.

Trivially: if a tag occurs for the first time, just use it as-is.
When a tag occurs the second time, append the suffix "~2".
The third time, append "~3", and so on.
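
A minimal sketch of that suffix scheme, for illustration only: this
is not the mandoc source code, and the function name unique_tags is
made up.  It also does not consider an input tag that literally
looks like an already generated name such as "h~2".

```python
from collections import Counter


def unique_tags(tags):
    """Return the tags with "~N" appended to the Nth repeat (N >= 2)."""
    seen = Counter()
    result = []
    for tag in tags:
        seen[tag] += 1
        count = seen[tag]
        # The first occurrence is used as-is; later ones get a suffix.
        result.append(tag if count == 1 else f"{tag}~{count}")
    return result


if __name__ == "__main__":
    print(unique_tags(["h", "v", "h", "h"]))
```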

[...]
>>> I also don't happen to know how ctags(1) got extended to support
>>> C++ name spaces and other means of qualifying colliding identifier
>>> names.  But if ctags (perhaps Exuberant ctags, given the original
>>> ctags format's advanced age) got extended to cope with that problem,
>>> presumably less(1) learned how to interpret the extension.

>> Not that i'm aware of anything like that, no.  Also, given that ctags(1)
>> is mandated by POSIX and less(1) aims for portability (AFAIK),
>> relying on extensions might be a bad idea.

> Exuberant ctags's extensions are ubiquitous.

Is that so?  Personally, i have implemented a generator for ctags(1)
files (in mandoc) and i'm using that frequently.  But i don't
recall ever having heard the word "Exuberant" as a name of a
person or organization.  Also,

   $ grep -RFi Exuberant /usr/src/usr.bin/ctags/                  
   $ grep -RFi Exuberant /usr/src/usr.bin/less/
   $

and the *.c files below /usr/src/usr.bin/ctags/ all have pure "Regents
of the University of California" Copyright notices, no other
contributors being named, and this is from the ctags(1) manual page:

  HISTORY
     The ctags command appeared in 2BSD.

  STANDARDS
     The ctags utility is compliant with the IEEE Std 1003.1-2008
     (“POSIX.1”) specification, though its presence is optional.

     The flags [-BdFuvw] are extensions to that specification.

     Support for Pascal, YACC, lex, and Lisp source files is an IEEE Std
     1003.1-2008 (“POSIX.1”) extension.  The standard notes that ctags is
     "not required to accommodate these languages, although implementors
     are encouraged to do so".

I don't see anything about non-BSD authors there.  I even checked
the complete commit log for the directory and see medium amounts of
maintenance work being done there, but no indication of non-BSD
contributions or feature additions for compatibility.

[...]
>>> Also, presumably mandoc(1) has solved the problem when it renders
>>> multiple pages and more than one supports, say, an `-h` option.

>>   $ man -ak Fl=h

> How man-db man implements support for tag-based searches is an open
> question.  That's Colin Watson's project, and while IIRC I have a commit
> bit, my status in that project is as junior as one can get.  :)
> 
> My objective is to design the macro package side of the feature so that
> Colin's job as designer (recognizing that he might foist implementation
> onto me) is as easy as possible.
> 
> man-db man(1) already sits on `-a` and `-k` so some other option letter
> selection will have to be made.

???

I admit that mandoc supports several options that are POSIX extensions,
but -k is mandated by POSIX:

  https://pubs.opengroup.org/onlinepubs/9799919799/utilities/man.html

  NAME
    man — display system documentation
  SYNOPSIS
    [UP] [Option Start] man [-k] name... [Option End]
  [...]
  -k
    Interpret name operands as keywords to be used in searching a
    utilities summary database [...]

And then:

   $ cat /etc/debian_version
  11.11
   $ dpkg-query -l man
  [...]
  ii  man-db     2.9.4-2    amd64      tools for reading manual pages

With that man(1), "man -a crontab" works similarly to what i would
expect.  It first displays crontab(1), then crontab(5), then exits.
The string "-a" is hard to find in man(1) because there are so many
false positives, but it is documented:

  -a, --all
        By  default,  man  will  exit after displaying the most suitable
        manual page it finds.  Using this option forces man to  display
        all the manual pages with names that match the search criteria.

man-db is quite inconvenient in this respect because "man -a crontab"
first opens only crontab(1), so you cannot compare to crontab(5) because
that is not yet displayed.  When you are done reading crontab(1), you
have to press 'q' and then you need another (!!) keystroke (RETURN)
to get to crontab(5).  From there, you have no way at all to get back
to crontab(1): when you press 'q' there, the whole program terminates
and drops you back to the shell.

With mandoc man(1), "man -a crontab" shows both manual pages at the
same time, crontab(5) below crontab(1), such that you can scroll back
and forth to your heart's content with *no* keystrokes wasted for
switching from one to the other.  Also very useful when viewing
hundreds of manual pages at the same time:

  /^---<ENTER> n n n n n n n n n ...

moves forward by pages, one page per keystroke, until you find one
that interests you.  The commands :t/t/T search for a tag, forwards
and backwards, across all the displayed pages.

Also, in man-db man(1), i just noticed that the normally very useful
combination -ak is useless, even though both are supported and
documented with the same meanings as in mandoc:

   $ man -ak cron
  cron (8)     - daemon to execute scheduled commands (Vixie Cron)
  crontab (1)  - maintain crontab files for individual users (Vixie Cron)
  crontab (5)  - tables for driving cron

That's what i would expect for just -k, not for -ak.
For -ak, mandoc man(1) displays a concatenation of these pages:
crontab(1), jmb(4), jme(4), jmphy(4), crontab(5), cron(8).  (The jm*
pages document JMicron hardware support, notice the "cron" in there.)

So fortunately, man-db and mandoc agree about the basic meaning
of both -a and -k.  That's hardly a coincidence though:

  https://mandoc.bsd.lv/man/man.options.1.html

  -a  display all matching manual pages
      man: 4.3BSD-Tahoe (June 1988), Eaton (before July 7, 1993; 1990/91?);
      OpenBSD, FreeBSD, NetBSD, man-db, man-1.6, illumos, Solaris 9-11
      apropos, whatis, mandoc: OpenBSD 5.7 (August 27, 2014)

      only display items that match all keywords
      apropos: man-db (Aug 29, 2007)

      use all directories and files for mandoc.db(5)
      makewhatis: OpenBSD 5.6 (April 18, 2014)

      [superseded by -T ascii] ASCII output mode
      troff: Version 7 AT&T UNIX (January 1979)
      groff: probably before groff-0.4 (before July 14, 1990)

So, while man-db clashes *with itself* insofar as -a has
two conflicting meanings in man(1) and apropos(1), it actually
agrees with mandoc (as far as that is still possible despite
its internal conflict).

Please avoid adding options to man(1) if you can, and if you
cannot avoid it, then *please* do consult
  https://mandoc.bsd.lv/man/man.options.1.html
before doing so in order to minimize the risk of clashes.

[...]
> If one wants to select a subset of tags of interest, I guess that's
> where the analogue of mandoc(1)'s "-ak" comes in.

Hmmm, no.  I'm not even sure what "subset of tags of interest"
is supposed to mean.  The meaning of -ak is simply "display all
pages matching the following search expression".

[...]
>>> A recurring theme in my contributions to groff's man(7)/mdoc(7)
>>> rendering has been to solve problems when rendering N pages at a
>>> time, where N can be 1 but might be greater.

>> Actually, the mandoc program demonstrates that from a pure user
>> perspective, a very straightforward solution is feasible that
>> will look totally trivial to the user.
>> 
>> Then again, programs like makewhatis(8) in the mandoc package that run
>> over lots of manual pages in sequence have been quite good at exposing
>> memory leaks and bugs caused by leaking context data from one manual
>> page to the next, and i rarely enjoyed the ensuing crashes and
>> debugging efforts.

> This sounds like the "fun" I enjoyed when getting "groff-man-pages.pdf"
> (and its UTF-8 text counterpart) ship-shape.  Many odd corner cases and
> oversights.

When formatting few pages at a time, say up to several hundred at
the same time, i don't recall ever having seen such problems; the
likelihood seems just too low at small scales.  I *could*
sometimes reproduce such problems by formatting just two or three
pages in a specific order, but actually hitting such problems in
practice seems to usually require thousands of pages.  With
thousands of pages, it has happened to me once every few years,
which is annoying enough that i remember and fear it.  =:c/

>> Still, running commands like
>>    $ man -ak ~.
>> can be useful for finding bugs.  By the way:
>>    $ man -ak ~. | col -b | grep -c ^NAME
>>   8801
>>    $ time man -ak ~.
>>     0m17.16s real     0m15.07s user     0m01.18s system
>>    $ time doas makewhatis 
>>     0m10.58s real     0m08.96s user     0m01.35s system

> How many separate Unix processes are involved in doing that, though? ;-)

Exactly one, no parallelization or multithreading whatsoever.

> nroff(1) and man(1), going back to early editions of CSRC Unix, are
> built around the filter model and pipe(2).  It's long been known that
> context switches are expensive, mode switches more expensive still

Admitted.

   $ time for i in $(jot 100); do man -c true; done
      0m00.83s real     0m00.01s user     0m00.02s system

Since the size of the true(1) manual is almost zero, that - eight
milliseconds for one page - is almost entirely process setup overhead,
and that's about four times the CPU time required to format an
average-sized (i.e. much larger) manual page.
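
The arithmetic behind that estimate can be checked as follows.  One
assumption is made here that the text does not state: the
"average-sized" figure is taken from the user plus system time of
the "man -ak ~." run quoted above, divided by its 8801 pages.

```python
# Timings copied from the two shell transcripts above.
loop_real_s = 0.83          # 100 separate "man -c true" invocations
loop_runs = 100
bulk_cpu_s = 15.07 + 1.18   # user + system time of "man -ak ~."
bulk_pages = 8801           # pages counted via "grep -c ^NAME"

# Per-invocation cost of a nearly empty page: almost pure startup.
startup_ms = loop_real_s / loop_runs * 1000

# Average CPU time per page inside one long-running process.
per_page_ms = bulk_cpu_s / bulk_pages * 1000

ratio = startup_ms / per_page_ms

if __name__ == "__main__":
    print(f"startup:  {startup_ms:.1f} ms per invocation")
    print(f"per page: {per_page_ms:.2f} ms")
    print(f"ratio:    {ratio:.1f}")
```

Under that assumption, process setup costs a bit more than four
times as much as formatting an average page, in line with the
estimate above.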

[...]
> I monitor mandoc(1)'s cvsweb closely and can't remember when
> I've seen a change that caused me concern.  If I ever do, I'll
> let you know.

Appreciated!  :)

If you want to receive the mandoc commits automatically via mail,
i can subscribe you to the source@ mailing list, just let me know.
Some information about the lists is at

  https://mandoc.bsd.lv/contact.html

They are much less active than the groff@ list.

>> Updating the groff port in OpenBSD causes a horrendous amount of work
>> - so much so that i still didn't manage to update to groff-1.23 in the
>> OpenBSD ports tree, mostly due to large amounts of subtle changes in
>> behaviour between 1.22.4 and 1.23.  Before being able to update, i
>> have to disentangle and classify all those changes as follows:
>> 
>>  1. Desirable effects of bugfixes and of changes making groff
>>     behaviour saner or more consistent; some of these require
>>     adjustments in mandoc, too, and/or maybe in the test suite.

> I indeed hope that the great preponderance of problems are of this kind.

Not really.  By far the largest class is class 3 (build system
issues).  Class 1 and 2 (good vs. harmless gratuitous changes)
might be about on par - though i'm not completely sure yet, as
i'm still struggling to classify the large amount of issues.

>>  2. Trivial changes that aren't necessarily intentional, but
>>     only affect unimportant edge cases such that they can stay.
>>     A few of these may require adjusting mandoc, too, and
>>     several require adjusting the test suite.

> I'm curious to know about these if you'd care to record them.  They
> might not be things I want to change or revert, but they might bring my
> attention to consequences I hadn't considered, and maybe I can
> anticipate similar impacts in the future.

Some of these are already fixed in mandoc or its test suite,
and i did not keep a list about those.  But i think i can produce
such a list after committing the groff port, because i certainly
note in the commit message when a tweak is motivated by a change
in groff.

>>  3. Build system hiccups.  These typically require setting
>>     configure or make(1) variables in the ports Makefile, but
>>     diagnosing what exactly is needed is typically hard.
>>     A few extreme cases might require patches.
>>     Due to extensive use of GNU autotools and gnulib, this
>>     class of problems is particularly large and annoying.

> I can't claim credit/blame for restructuring groff around Autotools and
> Gnulib, but I'll accept responsibility for not changing it.  For non-BSD
> systems they deliver remarkably little hassle.
> 
> I should bring your attention to:
> 
> https://savannah.gnu.org/bugs/index.php?66518

Yes, i'm aware of that, and that will likely result in further
degradation of portability.  There are three reasons why i didn't
mention my doubts about that earlier:
(1) In theory, the idea of a standardized approach to portability sounds
good, so it's hard to argue with that.  The problem is not the basic
idea, but that gnulib is overzealous *and* overengineered to such an
extent that the end result is nothing short of catastrophic.
(2) libgroff isn't exactly code of stellar quality either, so it's equally
hard to argue that getting rid of it would be a bad idea.  I think the
low code quality in libgroff (IIRC) also caused a very small number of
build failures in the (remote) past, but that was mostly before your
time, and both much rarer and much less severe and much easier to fix
than gnulib issues.  IIRC there may have been on the order of one to three
issues with libgroff, grand total, ever - in groff-1.23.0, there are at
least a dozen issues right now, maybe more.
(3) I spent less time on mandoc and groff lately than in some previous
years and hence missed some stuff - and was less eager to make a fuss.

[...]
> I see commits from Bruno with respect to FreeBSD 11, NetBSD 7.1,
> and OpenBSD 6.0 within the past month (in time to make gnulib's 2025-01
> stable tag).

OpenBSD 6.0?  That's...  hilarious.

   $ uname -a
  OpenBSD isnote.usta.de 7.6 GENERIC.MP#496 amd64

OpenBSD 6.0 is more than 8 years old and has been EOL and unsupported
for more than seven years now.

Supported FreeBSD releases currently are 13.4, 14.1, and 14.2.
FreeBSD 11 was released in 2016 and EOL in 2021.

Supported NetBSD releases currently are 9.4 and 10.1.
NetBSD 7.x was released in 2017 and EOL 2020.

So if your observation summarizes the situation adequately and if
i'm not missing something, that would mean that only bugs are getting
fixed that were reported at least seven years ago and affect
operating system versions that have been unsupported for at least
seven years.  Not sure what to make of that, i didn't try to work
with the gnulib folks and i'm not particularly eager to try, either,
given how their code looks.  In any case, getting groff to build at
all is clearly more important than trying to help fix gnulib - which
might well be a lost cause anyway, even if the maintainers are
friendly and well-intentioned and work hard.  I suspect the trouble
stems more from basic design principles and development goals than
from individual bugs or oversights.

> I'd happily try {Free,Net,Open}BSD builds myself, but the FSF France's
> compiler farm hosts for these OSes are down/unresponsive every time I
> try them.

I have no access to any FreeBSD or NetBSD machines.  I do have some
OpenBSD machines, but none of those can be used for testing purposes.
I have sent a crashing build log in private mail to you, though.

>>  4. Regressions.  These require pushing bugfixes to groff,
>>     plus patches in the ports tree until they get released.

> I definitely want to know about these ASAP.  Or do you mean the ones
> I've already noted and committed, kicking them up to "Important"
> severity in Savannah?

Some may already be fixed, not sure i have reported all yet.
I will definitely show you a complete list of patches we use
once the port is ready, and that includes patches for all bugs
fixed in the port.

I'm also working on reducing patches and instead adding lines
to groff site configuration files, as intended by groff developers,
where possible.

>>  5. Intentional changes that we do not want to have.
>>     These require patches in the ports tree that may have
>>     to stay even in the longer term.

> I think I know one or two of the ones you have in mind, but I'd like to
> know where I can look to stay apprised of these.  Even if I don't think
> a distributor patch belongs upstream, it tells me important information
> about the needs of the user community.  (And, regrettably, sometimes its
> cluelessness.[5])

Yes, those patches will be included in the final list.

[...]
> I'm not happy to hear that.  Apart from not using Gnulib or the GNU
> build system, both of which I prefer to retain because I perceive them
> of _relieving_ me of maintenance overheads (which is their reason for
> existing in the first place), I'm keen to hear your suggestions for
> things I can do that will reduce the pain.

Yes, the plan is definitely to provide feedback in a more specific
and more constructive manner.

I wasn't really expecting that suggesting to ditch gnulib would be
met with enthusiasm.  I suspect that the *real* portability issues
groff needs to deal with require less than 1% of the behemoth of
code included from gnulib, simply because groff portability needs
are very modest, given that groff doesn't exactly use all the most
modern features in its code base.  I'm also quite sure that going
for a simple, straightforward scheme like the one used by mandoc
would significantly improve groff portability *and* significantly
reduce your workload: one static configure script that is written
fully by hand and never regenerated, which produces a grand total
of 52 lines of output, and which takes the replacement
implementations from OpenBSD instead of from glibc, such that they
are typically a fraction of the size and don't typically contain
any preprocessor directives.  Still, i fear that might not fit
very well into GNU philosophy and hence not be all that welcome,
even if it would be undeniably efficient.

I mean, just look at the list of systems on which mandoc runs:

  https://mandoc.bsd.lv/ports.html

>> Don't take that as a complaint, though; i still appreciate your work.

> Likewise!  Because mandoc(1) is conscientiously maintained, I make
> frequent reference to it when advising man page authors/maintainers
> regarding portability.

Thanks.

> By contrast, specimens like Illumos troff are
> seldom worth mentioning.  It's just Solaris 10 troff under a different
> name, and its development community seems to have its focus utterly on
> other aspects of the system.

Is troff in Illumos still relevant for what we are discussing here?

Illumos folks told me in 2014 that they started using mandoc(1) for
manual page formatting at that point instead of *roff, even though,
last time i checked, they still used Solaris man(1) as the viewer.

So like many Linux systems use groff+man-db, for all i know,
Illumos uses mandoc+Solaris-man, and by the way, MacOS now
uses mandoc+FreeBSD-man.

[...]
> 1.  GNU and Linux programs have a tendency to efflorescence

You have a point, trying to make GNU and Linux programs lean
such that their manuals become more manageable probably isn't
the easiest or most rewarding project anyone could pick.
Also, it doesn't hurt much when we agree to disagree on OPTIONS.

> Heh, as you're probably aware, I use a "Notes" section myself, in 3 of
> groff's pages.
> 
> 1.  In nroff(1), because of the SGR problem.  A lot of the same people
>     who insist that groff not produce SGR escape sequences will not ever
>     think to look in grotty(1).  (Incidentally, this same page uniquely
>     _lacks_ an "Options" section; a piloted your approach to see what I
>     thought of it.)

Hmmm, that's indeed a tough case because that information hardly belongs
in the nroff(1) page, and consequently there probably isn't any good
place at all - but i still see why you want it there.  Shallow wrapper
programs (like nroff) are often hard to document properly because
most of the information users need for using them does not belong in
the manual page of the wrapper but in the manual pages of the various
other programs that the wrapper wraps.  It's a dilemma resulting from dubious API
design involving too many abstractions and layers.

> 2.  In groff_me(1), to explain the origin of the package's name.  The
>     page is otherwise pretty terse and businesslike, and more an
>     aide-mémoire than a true reference, so I saw no other good place to
>     put it.

I tend to think the first sentence of those NOTES (the one about
single-letter names) belongs in the HISTORY section of groff_tmac(5).
Most of the second sentence belongs in the AUTHORS section, which
should probably also say something like "reimplemented for groff
by James Clark in 198x".  A few bits near the end of the second
sentence belong in the HISTORY section, which should probably say
something like

  The me macros first appeared in 2BSD.

Since groff is not primarily used in BSD communities, many groff
users might not understand what the somewhat cryptic "2BSD" means,
so maybe something like

  ... in the second Berkeley Software Distribution (2BSD),
  released in May 1979.

So here, i really see no need for NOTES.

> 3.  In groff_man_style(1), as a kind of FAQ.  I won't apologize for
>     this; I need it to dispel myths and confusion.  I _could_ call it
>     something else; its present name seems good enough.

You got me there.  A style guide in manual page form is so unusual
that some unusual decisions might be called for.  Also, a style
guide, in particular when it is long, might prioritize pedagogy
over conciseness and rigour, in which case my argument that every
topic needs to be discussed in exactly one place breaks down and
afterthoughts may become legitimate.  Looks like you may have found
an example unusual enough that even a NOTES section can be defended.

[...]
> At 2025-01-22T22:04:59+0100, Ingo Schwarze wrote:
>> G. Branden Robinson wrote on Mon, Jan 20, 2025 at 11:59:41PM -0600:

>>> We need a design for automatic construction of
>>> tag/anchor names from the user-specified names of the items to be
>>> tagged.  In man(7) documents, those taggable items are probably going to
>>> be:
>>> 
>>> 1.  the identifier of the page itself, with "section" number;
>>> 2.  section heading text;
>>> 3.  subsection heading text; and
>>> 4.  the tag text of tagged paragraphs (`TP`).

>> In addition to those, mandoc(1) also tags the tag text of .IP and .TQ.

> Ah!  I forgot to mention `TQ`.  Yes, it's completely my intention to
> treat `TQ` the same as `TP` in this respect.
> 
> But not for `IP`.  That macro seems, somewhat preponderantly, to already
> be getting used as a non-semantic, or differently semantic, device, to
> mark lists with symbols or enumerators.  (It's more structural than
> semantic, we might say.)  That's good.  I want to encourage and
> reinforce that practice.

Maybe.  Given that both serve almost the same purpose - the main
difference only being that .TP supports macros in the tag and .IP does
not - some style guidance regarding when to use which one might make
sense.  Deprecating .IP outright doesn't seem like a good idea because

  .TP
  \(bu
  text body

is very ugly, and bullet+numbered lists is a reasonable scope that
works well with the .IP syntax.

However, even in the OpenBSD tree, which does not contain particularly
many man(7) manuals, significant numbers of manual pages contain
long tag arguments after .IP macros.  Most of these are GNU manuals.
Even worse, pod2man(1) emits .IP, not .TP, for tagged lists.
So you would be punishing end users for something that was
considered OK in the past and that documentation maintainers and
code generator maintainers need to fix - possibly before knowing
whether documentation maintainers will even agree with your position.

Here is a list of some manual pages affected, where users won't
have tags (or at least fewer tags) because of your policy on .IP:
addr2line(1) ar(1) as(1) c++filt(1) ld.bfd(1) objdump(1) readelf(1)
objcopy(1) strings(1) readline(3) mkhybrid(8) and almost all
Perl manual pages.
On top of that, all FVWM manual pages, editres(1), sessreg(1), twm(1),
xbacklight(1), xedit(1), xpr(1), xrandr(1), xsetroot(1), XF86VM(3),
Xsecurity(7), and almost all X11 section 3 library manuals.
From a random collection of a few ports i have currently installed:
arara(1), bib2gls(1), bzless(1), python3.12.1(1), practically all
FFmpeg manuals, afm2afm(1), albatross(1), autoinst(1), curl(1),
cvs2cl(1), dvipng(1), epstopdf(1), gslp(1), install-tl(1),
luafindfont(1), mk-ca-bundle(1), ofm2opl(1), ovf2ovp(1), pedigree(1),
ps2pk(1), repstopdf(1), thumbpdf(1), ttf2afm(1), unzip(1), updmap(1),
and all the GnuTLS manuals ...

So not providing support before deprecation takes effect may not
be the friendliest move.

[...]
> NEWS:
> *  The an (man) macro package now supports a `TS` register to configure
>    the minimum space required between the tag of a `TP` paragraph and
>    its body.  (If the width of the tag's formatted text plus this space
>    exceeds the paragraph indentation, the line is broken after the tag.)
>    This parameter, formerly hard-coded as `1n`, now defaults to `2n`.
> 
> *  The an (man) macro package's `IP` macro no longer honors the formerly
>    hard-coded 1n tag separation noted in the previous item.  This means
>    that the first argument to the `IP` macro can abut the text of the
>    paragraph with no intervening space.  If you use a word instead of
>    punctuation or a list enumerator for `IP`'s first argument, consider
>    migrating to `TP`.

Regression suite fun on the horizon for the 1.24.0 ports update.

> I haven't brought this to your attention before now because I expected
> you to:
> 
> A.  Ignore the `TS` register entirely; and either,

Certainly, but the change of the default is likely to cause a few
hours of work on the mandoc and test suite sides.

> B1. Go along with my tweak to indentation, or

Probably (because off the top of my head, i suspect it makes .TP more
similar in style to .Bl -tag), but i'm not sure yet.

> B2. Not, overriding it with a patch.

Unlikely, i somewhat dislike patches for trivial tweaks like that.

If it turns out it causes more serious trouble than just churn,
there is a third possibility:

  B3. Harass you for a revert, even if i'm late to the party.

[...]
>>>     For man(7) the `MR` macro new to groff 1.23 was an obvious site
>>>     to add the appropriate machinery for document-level links.
>>>     mdoc(7)'s `Xr` is closely analogous and has existed for many
>>>     years.

>> Yes, both have almost identical semantics and are a likely candidate
>> for extension, if we come to the conclusion an extension is needed.
>> I didn't consider the details yet, though.

> I think you were discussing `SX`/`Sx` here?  As I understand it, `MR`
> support is already in mandoc(1) CVS (but not released), as of course is
> `Xr`, for some years.

I guess you misunderstand.  I did not mean "add .MR as an extension" -
that has indeed been agreed and implemented already.  I meant
"extend the existing .MR and .Xr macros with another argument or
something like that".

> Also, you can't put a price on the pleasure of introducing a macro
> named `SX`.  It pleases the 13-year-old in everyone.

I scratched my head for some time about what you even meant here
before finally getting it - even though it works in German exactly
like in English.  I think i wouldn't have noticed that even as a
teenager, neither as a 13-year-old nor as a 19-year-old.

And no, i wasn't talking about .Sx.  The .Sx macro is for local
links inside one page.  But here you want to design a macro to
link from one page to another.  That's what .MR/.Xr do, so those
are the first candidates that come to mind for extension.

Also, extending .Sx for that purpose doesn't seem easy, since it
already accepts an arbitrary number of arguments (e.g. .Sx SEE ALSO).
Number theorists may feel comfortable reasoning about the number
omega+2 - computers get very nervous when you ask them to handle it,
they don't even like omega itself all that much.

[...]
> While deep structural links risk encouraging stagnant man page
> structures, deep unstructured links will promote retention of features.
> And there are enough forces favoring the latter.  Our industry has a bad
> problem with not throwing old cruft away.

OpenBSD suffers from that problem less than other projects.
Ted Unangst (tedu@) has earned such a high reputation and so much
respect for excising large amounts of code from all over the tree
that "teduing" has become a neologistic synonym for "cleaning up".
LibreSSL is among the most active parts of the OpenBSD tree, and
currently, at least 80% of the work done there is teduing (during
the first two years of development, it felt more like 98%).
That still holds even though Ted U. is currently no longer involved
in LibreSSL (he was years ago and is among the founding members
of LibreSSL precisely because he wanted to tedu there, and indeed
he did so).

I'm not convinced we should let speculations about possible effects
on people's bad habits influence the deep linking design - that
design task is difficult enough without additional non-technical
constraints.  Besides, i doubt that the design of deep linking
can really improve people's willingness to tedu.

> Using English instead of machinery might remain our most robust
> technology.  "See the “-h” option in ksh(1)."

That's exactly what Jason McIntyre (jmc@) and myself have been
doing in OpenBSD for decades, and i'm not aware of any problems
it may have caused.

> For URLs (and email addresses), it's already in place.  We use OSC 8.

Yikes.  ANSI X3.64.  The nightmare.

Reminds me that we ought to disable ANSI X3.64 support in OpenBSD
xterm(1) by default.  It's just too dangerous due to the very
large number of escape codes that make it hard to secure and
the fact that many of them can wreak havoc.  A manual page
viewer is a typical example of a program that must be able to
run securely, even as root, and that must not, under any
circumstances, make the terminal window unusable.  It might
contain the last available shell on a remote machine that is
in trouble, and reading the manual page may be needed to
implement a fix for whatever problem there is.

But manual pages are essentially untrusted data, so allowing a
manual page viewer processing manual pages to send dangerous
escape codes to a terminal is not acceptable.  Maybe you argue
"well, the manual page viewer must not *pass through* escape
codes, but there is no risk in *creating* certain escape codes
from scratch" - but that violates the principle of multi-layer
security.  Sanitize input and reject ANSI X3.64 contained in
the manual page.  AND sanitize output, making sure that you never
create ANSI X3.64.  AND make sure that the pager always ignores
ANSI X3.64, i.e. never run it with -r or -R.  AND make sure
all potentially dangerous ANSI X3.64 codes are disabled in the
terminal emulator - and that certainly includes OSC 8.
I would go so far as to say that not only should the "Operating System Command"
ANSI code (as a whole, not just OSC 8) be disabled by default,
but the possibility to enable it should be patched out of xterm(1)
lest users do that by accident when editing xterm(1) config
files or blissfully ignorant of the risks - as in "oh, colour
sounds nice, let's do that".
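
To make the "sanitize output" layer concrete, here is a minimal
sketch.  It is not code from mandoc; the function name
sanitize_output() and the choice of which control bytes to keep are
assumptions made for illustration only.  It drops every C0 control
byte except newline, tab and backspace (backspace is kept on the
assumption that overstrike formatting relies on it), so ESC can never
reach the terminal and no CSI or OSC sequence can be formed.

```c
#include <stddef.h>

/*
 * Hypothetical helper, not taken from mandoc: copy len bytes from
 * src to dst while dropping every C0 control byte except newline,
 * tab and backspace, and also dropping DEL.  Because ESC (0x1b)
 * is never copied, no escape sequence can start in the output.
 * dst must have room for at least len bytes.
 * Returns the number of bytes written to dst.
 */
static size_t
sanitize_output(char *dst, const char *src, size_t len)
{
	size_t i, n = 0;

	for (i = 0; i < len; i++) {
		unsigned char c = (unsigned char)src[i];

		if (c == '\n' || c == '\t' || c == '\b' ||
		    (c >= 0x20 && c != 0x7f))
			dst[n++] = (char)c;
	}
	return n;
}
```

With this filter, the remainder of a would-be OSC 8 sequence shows up
as harmless visible text rather than being interpreted by the
terminal.  Bytes of 0x80 and above are passed through unchanged so
that UTF-8 text survives; whether that is acceptable depends on the
terminal and locale in use.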

Multi-layered security actually provides security.  Artful
arrangement of slices of swiss cheese such that (hopefully?) the
holes never align is a recipe for disaster.

> If less(1) strips that (it doesn't) or the terminal doesn't support it,
> the user doesn't get hyperlinks.  Okay, so they're no worse off than
> before.

In short, we have no solution for the task we set out to solve,
right?

[...]
> I have no intention or desire to make man-db obsolete.  I think the
> separation of its concerns from the formatter is sound, and I don't look
> to disrupt it.

OK, i get it, that part makes sense now.

[...]
>      • What’s the difference between a man page topic and identifier?
> 
>        A single man page may document several related but distinct
>        topics.  For example, printf(3) and fprintf(3) are often
>        presented together.

What you call "topic" here is called "name" in the mandoc(1)
documentation.  Maybe not a huge problem because the mandoc
documentation does not define a term "topic" at all, so there
is no clash.

>        Moreover, multiple programming languages
>        have functions named “printf”, and may document these in a man
>        page.  The identifier is intended to (with the section) uniquely
>        identify a page on the system; it may furthermore correspond
>        closely to the file name of the document.

What you call "identifier" here is called "title" in the mandoc
documentation.  Mandoc treats the title as an additional name;
for that reason, the man(7) page in the mandoc package takes the
shortcut of presenting this synopsis:

  .TH name section date [source [volume]]

Of course, the title usually matches one of the other names,
and often the first one, though that is not necessary.

The stipulation that the "identifier is unique within the section"
is completely unrealistic in practice, on Linux even more than on
*BSD.  The package manager *can* make sure that two packages do not
install a file into the same file system path, clobbering each
other.  And indeed, package managers often do check that -
sometimes already by flagging such clashes in a centralized
package database in the build system infrastructure of the
operating system developer, requiring packages to be fixed or
marked as conflicting when they clash, such that both cannot be
installed at the same time.  Some other package managers also or only
check at install time that they do not overwrite existing files owned
by other packages, and refuse to install when encountering
conflicts.

But i have not heard of package managers attempting to parse manual
pages and try to detect *logical* conflicts in the *content* of manual
pages.  That would seem very hard to implement and in addition
rather fragile and inefficient.  Even Marc Espie's pkg_add(1)
package manager in OpenBSD, which spends a lot of effort on
handling documentation with special care, does not do that -
and it would also be wrong to do that, because .TH/.Dt clashes
cause no problem and are simply legitimate.

>        The man(1) librarian makes access to man pages convenient by
>        resolving topics to man page identifiers.

That is untrue and misleading.  In your terminology, the correct
statement would be:

  The man(1) librarian makes access to man pages convenient by
  resolving each topic to one or more fully qualified file system
  paths to manual page files.

That is not only true for mandoc, but for all man(1) implementations
i'm aware of, in particular including man-db.

>        Thus, you can type
>        “man fprintf”, and other pages can refer to it, without knowing
>        whether the installed document uses “printf”, “fprintf”, or even
>        “c_printf” as an identifier.

The term "identifier" is badly misleading because multiple
pages with the same "identifier" (or "title" in mandoc terminology)
can exist in the same section of the same manual page tree, and all
those files can even be in the same directory.  This does not
prevent man(1) from finding all these files anyway.

What really "identifies" a manual page is the fully qualified
file system path - because you cannot have two files with the
same filename in the same directory.

[...]
> While we don't want to put every tag into the "Name" section so that it
> will be indexed by the librarian--

Complete agreement, that would make the NAME section totally unreadable.

> indexing all `-h` command line options would be worse than useless--

Not true.  The makewhatis(8) program in the mandoc package does
exactly that, index all -h command line options in all manual pages,
and you can search for them with man(1) -k, as i demonstrated in an
earlier mail.  That's quite useful, too.

> I think every symbol in a C API should be.
> Yes, even macros and non-function objects.

Here, we can happily agree to disagree.

Treating macros that take arguments exactly like functions is fine
because the distinction is a technicality.

Treating constant macros like functions, however, is over the top.
In some pages that only mention one or two constants, it might not
cause much grief, but pages exist that document large numbers of
constants.  Here is an extreme example that would massively bloat
the NAME section:

  https://man.openbsd.org/errno.2

Constant macros are like wolves, they often come in packs.
Often, one constant requires much less documentation than one
function because it has far fewer moving parts: no arguments,
no return value, no semantics, not even any syntax apart from
the literal name itself.  (Exceptions exist where constants do
require massive amounts of documentation because they are
essentially abused like functions.  Yes, i'm looking
at you, EVP_PKEY_CTX_ctrl(3).)
And finally, significant numbers of constants are used by more
than one function, sometimes even by functions in several
manual pages, which makes the question which manual should have
the constant in the name section rather arbitrary.

In mandoc, the whole question is moot anyway because you can say

   $ man -k Er=EINVAL

and get all 221 manual pages referring to it listed, without needing
it in any NAME section.  Similarly for type names: each type is
typically used in many manual pages, even though some exceptions
of specialized types that are only relevant for a single page do
exist.

> These are "first-class" concerns just like function calls.
> Part of your interface?  Document it.

Absolutely!  But not every public symbol needs to be a "topic",
in your parlance (or a manual page name according to mandoc).

[...]
> At 2025-01-22T23:04:33+0100, Ingo Schwarze wrote:
>>> Possibly I'll formally propose an `SX` macro for man(7) at some point.

>> You mean that like mdoc(7) .Tp, not like mdoc(7) .Sx, right?

> Uh, groff mdoc(7) doesn't have `Tp`.
> 
> Uh.  I don't see it in mdoc(7) from mandoc 1.14.6-1 on my Debian system
> either.
> 
> What is it?

Sorry for the typo, i meant .Tg (manual tagging), not .Tp, which
indeed does not exist and is not planned.

[...]
> there may be a subtle difference between the ways we're using the
> word "node" here.

Not subtle at all, it's a drastic difference.

Here are examples of nodes in mandoc:

 * The root node, which contains the syntax tree of the whole document.
 * A section node, for example one that contains the whole DESCRIPTION
   section including its title.
 * A full explicit node, for example a complete list or display
 * A partial explicit node, for example all that results from
   a quotation macro like .Do that has a matching end macro, .Dc,
   including the content.
 * A partial implicit node, for example all that results from
   a quotation macro like .Dq that extends to the end of the input line,
   including the content.
 * An in-line node, for example an emphasis macro including its content
 * An input text line, possibly including escape sequences
 * A text string that is a single argument of a macro,
   possibly including escape sequences
 * Certain low-level roff requests that directly produce output
   or change formatting state in a way similar to what macros do,
   in particular: .br .ce .fi .ft .in .ll .mc .nf .po .rj .sp .ta .ti
   Most roff requests are *not* nodes but get fully resolved at the
   pre-parser stage.
 * Also, so far, no escape sequence can ever be a node - even though
   making escape sequences nodes would be beneficial for some
   purposes, so i might do that at some point

What roff calls a "node" would be called "escape sequence or character"
in mandoc.  There are some artifacts that roff calls "node" that
mandoc does not represent at all, neither as a node nor as an
escape sequence, and that it never needs for anything, for example
"line_start_node".  Furthermore, mandoc does not distinguish between
the ordinary space character (U+0020) and word_space_node.
Instead, mandoc internally represents blanks that are not word spaces
and hence do not allow a line break by a special ASCII code point,
and similar for some other whitespace-related cases, see mandoc.h:

#define ASCII_NBRSP      31  /* non-breaking space */
#define ASCII_NBRZW      30  /* non-breaking zero-width space */
#define ASCII_BREAK      29  /* breakable zero-width space */
#define ASCII_HYPH       28  /* breakable hyphen */
#define ASCII_TABREF     26  /* reset tab reference position */

> The long version?
[about diversions]

Thanks, that was instructive.

> 5.29 Diversions
> ===============
> In 'roff' systems it is possible to format text as if for output, but
> instead of writing it immediately, one can "divert" the formatted text
> into a named storage area.

The reason why mandoc is quite unlikely to ever implement diversions
is that the most central design principle of mandoc is that the
parse tree is guaranteed to be independent of the output device, and
completely finalized before the program even looks at the question
which output device the user selected.  No part of the formatters
can ever be called before the parse tree is fully complete and
immutable.  All substitutions (in particular, of user defined macros,
strings and number registers) must be completed before the parsers
can even be started.  Whatever is stored in any user defined string
or number register after parsing has started is guaranteed to have no
effect on the output.

Hence, storing anything produced by the formatters (which can only
be invoked after parsing is complete) into any string or register
(which no longer have any effect if parsing was even started) is
totally out of the question.  I would have to throw away the
most fundamental parts of the software architecture and start over
completely from scratch.

> That's why I want a string iterator and well-defined operations to
> identify nodes so that they can be stripped from strings, without
> bespoke formatter features and without hacks.

Yes, mandoc also contains string iterators in a number of places -
the iteration itself is so short and straightforward that there
is no abstraction for it even though it's needed at more than one
place.  The tricky part is the handling of escape sequences (what
groff would call nodes), implemented in the recursive functions
in the file roff_escape.c and called whenever needed for iteration
or processing.
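
As a rough illustration of that split between trivial iteration and
non-trivial escape handling, here is a toy sketch.  It is not the
code in roff_escape.c, it is not recursive, and it understands only a
small assumed subset of escape syntax; the names escape_len() and
plain_len() are hypothetical.

```c
#include <stddef.h>
#include <string.h>

/*
 * Toy helper: s points at a backslash; return the number of bytes
 * that belong to the escape sequence starting there.
 * Only these forms are understood:
 *   \(xx      two-character name
 *   \[name]   bracketed name
 *   \fX, \f(XX, \f[name]   font escape with one of the above forms
 *   \x        anything else: one character after the backslash
 */
static size_t
escape_len(const char *s)
{
	const char *p = s + 1;	/* skip the backslash */
	const char *end;

	if (*p == 'f' && p[1] != '\0')
		p++;		/* the argument follows the 'f' */
	switch (*p) {
	case '\0':		/* trailing backslash */
		break;
	case '(':
		p++;
		if (*p != '\0')
			p++;
		if (*p != '\0')
			p++;
		break;
	case '[':
		end = strchr(p, ']');
		p = end != NULL ? end + 1 : p + strlen(p);
		break;
	default:
		p++;
		break;
	}
	return (size_t)(p - s);
}

/* Count the bytes of s that are not part of any escape sequence. */
static size_t
plain_len(const char *s)
{
	size_t n = 0;

	while (*s != '\0') {
		if (*s == '\\')
			s += escape_len(s);
		else {
			s++;
			n++;
		}
	}
	return n;
}
```

The loop in plain_len() is the short and straightforward part; all
the complexity sits in deciding where an escape sequence ends, which
is why a real implementation needs far more cases than this sketch.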

> Will we ever need or want type-aware node operations in the groff
> language?  Good grief, I hope not.  But the mere existence of nodes in
> bona fide language objects has already created pain--esoteric pain that
> produced diagnostic messages that no one on this mailing list
> understood.  (Or if they did, they chose not to share knowledge.)

Oh the joys of in-band meta messaging: escape sequences embedded in
plain text strings.  Didn't i already get riled up about that very
subject earlier in this message?

Yours,
  Ingo
