At 2022-12-24T14:43:44-0800, Russ Allbery wrote:
> I probably should have assumed. One of the things that I've noticed
> over and over about free software is that nothing new ever truly
> replaces something old in a comprehensive sense. I can think of very
> few programs that truly no one is using any more, because once the
> source code is available to keep them alive, someone will keep them
> alive. It makes for a rather interesting diversity of software (and
> other things; for instance, I still use Usenet).
I'd happily get back on USENET if someone has solved the spam problem.
I'm old enough to remember those green-card hawking lawyers who were
the harbingers of death.

> Oh, so I was going to mention: currently, Pod::Man rolls its own
> macros for verbatim text:
>
>     .de Vb \" Begin verbatim text
>     .ft CW
>     .nf
>     .ne \\$1
>     ..
>     .de Ve \" End verbatim text
>     .ft R
>     .fi
>     ..
>
> This looks basically equivalent to .EX/.EE,

Yup. Except for the detail of the name of the constant-width font,
which is not consistently defined across implementations or even
output devices within an implementation (as already discussed).
groff's tmac/an-ext.tmac says these days:

    .\" Define this to your implementation's constant-width typeface.
    .ds mC CW
    .if n .ds mC R

> so I thought about using those macros (and defining my own if they're
> not available, at least until no one is using older implementations
> that don't have them). But the main thing that .EX doesn't support
> that the long-standing Pod::Man behavior does is the .ne invocation,
> which is used like this:
>
>     # Get a count of the number of lines before the first blank line, which
>     # we'll pass to .Vb as its parameter. This tells *roff to keep that many
>     # lines together. We don't want to tell *roff to keep huge blocks
>     # together.
>     my @lines = split (m{ \n }xms, $text);
>     my $unbroken = 0;
>     for my $line (@lines) {
>         last if $line =~ m{ \A \s* \z }xms;
>         $unbroken++;
>     }
>     if ($unbroken > 12) {
>         $unbroken = 10;
>     }
>
> This logic is very long-standing and was designed for troff printing
> of a manual page (and older nroff setups that still did pagination)
> to avoid unnecessary page breaks in the middle of a verbatim block.
> I'm not sure how much this matters given how people use man pages
> these days, but I hate to break it for no reason.

You've managed to wangle a display, and once people get that religion
they're loath to give it up.
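For readers who don't speak Perl, the quoted keep-count logic
translates to roughly the following (Python used purely for
illustration; the function name is mine, not Pod::Man's):

```python
def keep_count(text):
    """Count the lines before the first blank (or whitespace-only)
    line, as Pod::Man passes to .Vb; cap runaway counts so huge
    verbatim blocks aren't kept together on one page."""
    unbroken = 0
    for line in text.split("\n"):
        if line.strip() == "":
            break
        unbroken += 1
    if unbroken > 12:
        unbroken = 10
    return unbroken
```

The cap matters: asking *roff to keep, say, 40 lines together would
force a page break before almost every large block, which is worse
than breaking inside it.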
Despite my commitment to a limited man(7) dialect, I have proven
unable to stop myself from adding `ne` requests to groff's own man
pages to keep our PDF compilation from looking ugly.

> So I think I'd need to add an .ne line after (before?) the .EE macro
> if I switched to it?

Well, you can throw away that line-counting logic in Perl altogether
and simply use `ne` _before_ `EX` (not `EE`).

Another point of detail is that you should break with `br` _before_
the `ne` request. `ne` won't always do what you want if there is a
pending output line.

I have plans to add keep macros `KS`/`KE` to groff man(7) in the near
future; they are probably the least controversial extensions I could
possibly add, because it will always be okay for an implementation to
ignore them entirely. No text will be lost or misformatted; page
breaks will just happen in dumb places, and for the overwhelming
majority of terminal users, who experience the continuous-rendering
default, even that won't apply.

> Okay, fair. :) Although historically people sometimes did, and of
> course once upon a time people would sometimes typeset the full
> manual for something with troff.

They still do. Alex Colomar, the new linux-man maintainer, is shy of
learning ms(7) or any other macro package. If a "full manual" doesn't
need features that man(7) doesn't provide, I see no real harm in using
it for non-man-page documents. Colin Watson's "-l" extension to man(1)
has made this extremely straightforward.

> That output probably isn't as nice as it used to be, since I have
> subsequently dropped a lot of the attempted magic that only applied
> to troff output (replacing paired " quotes with `` '', adding small
> caps to long strings of all capital letters, and things like that)
> because they were all using scary regexes and occasionally broke
> things and mangled things in weird ways, causing lots of maintenance
> issues.

Yes, and there are concerns I would raise with both of those helpful
bits of automagic anyway.

> > Yes.
> > But there are two problems to solve: (1) acceptance of Unicode
> > (probably just UTF-8) input
>
> I was pleasantly surprised at how well this just worked with the
> man-db setup on a Debian system, although I think that may involve a
> fair amount of preprocessing.

Mainly just running preconv(1), I think, which groff has supplied
since 1.20, so for about 14 years, I guess.

> Just to provide additional detail for the record (and this is almost
> certainly the sort of thing you mean by "acceptance of Unicode
> input"), here's the simple document I was using for some testing.
>
> https://raw.githubusercontent.com/rra/podlators/main/t/data/man/encoding.utf8
>
>     % groff -man -Tpdf -k encoding.utf8 > encoding.pdf
>     troff: encoding.utf8:72: warning: can't find special character 'u0308'
>     troff: encoding.utf8:74: warning: can't find special character 'u1F600'
>
> u1F600 is presumably a problem with the output font,

Yes. Try sending that to the terminal (-Tutf8) and it should work.

> but u0308 is a combining accent mark that groff does definitely
> support, just not as a separate character.

Right. It's \[ad].

> (Without preconv, one instead gets mojibake, as I expected.)

I got warnings, too (using -ww):

    troff:EXPERIMENTS/encoding.utf8:72: warning: invalid input character code 136
    troff:EXPERIMENTS/encoding.utf8:74: warning: invalid input character code 159
    troff:EXPERIMENTS/encoding.utf8:74: warning: invalid input character code 152
    troff:EXPERIMENTS/encoding.utf8:74: warning: invalid input character code 128

There is a whole universe of validity problems to cope with even if we
had support for direct input of valid UTF-8. :(

> My theory was that combining accent marks pose a bit of an
> interesting issue for groff because groff probably shouldn't think
> of them as a separate output character that can be mapped in an
> output font, but instead needs to essentially transform them into
> something like \[u0069_0308] during the input processing.
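To make the two behaviors concrete, here is a rough Python sketch of
the mappings being discussed; this is not preconv's actual code, just
an illustration under the assumption that it maps code points one at a
time to \[uXXXX] escapes (which is consistent with the standalone
u0308 warning above), versus the NFD-based composite names that groff
can actually resolve:

```python
import unicodedata

def per_codepoint_escapes(s):
    """Per-code-point mapping (sketch of the failure mode): ASCII
    passes through; every other code point becomes its own \\[uXXXX]
    escape, so a combining U+0308 comes out standalone."""
    return "".join(
        ch if ord(ch) < 0x80 else "\\[u%04X]" % ord(ch) for ch in s
    )

def composite_escape(cluster):
    """Composite name built from the NFD decomposition, e.g. 'i' plus
    combining diaeresis -> \\[u0069_0308], which is the form the
    thread says groff wants for such sequences."""
    nfd = unicodedata.normalize("NFD", cluster)
    return "\\[u" + "_".join("%04X" % ord(c) for c in nfd) + "]"
```

For example, `per_codepoint_escapes` turns "e" + U+0308 into
`e\[u0308]` (the unresolvable standalone form), while
`composite_escape` applied to the same cluster yields
`\[u0065_0308]`.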
> (This may therefore essentially be a preconv bug as opposed to a
> troff bug, and maybe nroff gets away with it because it can just
> copy combining accent marks to the output device and let xterm take
> care of rendering.)

I don't actually know whether xterm performs combinations like this or
expects precomposed characters.

The groff_char(7) man page from groff Git covers some of this stuff in
increased detail, such as the `composite` request and the
Normalization Form D requirement. But the discussion still may not be
complete, as I haven't tried to solve the Unicode input problem
myself. Fortunately, we have a patch pending for CJK/UTF-16 font
support, which promises to give me an excuse to widen groff's internal
character type. Here's hoping I haven't worn out the submitter's
patience while I tried to get 1.23.0 ready...

> It all makes sense when viewed through the lens of the *roff
> language, but of course in the Unicode world one expects to be able
> to just produce a stream of code points and have everything cope.

Yes..."just coping" is achieved with a massive pile of standards
documents that augment the ISO 10646 character encoding. :D

> I am sad that currently Pod::Man is one of the impediments to good
> rendering of manual pages in other formats, since I make use of more
> of the *roff language (mostly to work around bugs) than those tools
> often understand. So I have an incentive to want to simplify the
> output as much as I can, consistent with remaining portable.

Consider me a resource for this effort.

Regards,
Branden