Hi Russ,

At 2025-09-27T09:45:16-0700, Russ Allbery wrote:
> "G. Branden Robinson" <[email protected]> writes:
> > At 2025-08-24T13:55:29-0700, Russ Allbery wrote:
>
> >> The position I have taken in Pod::Man is that people writing pure
> >> ASCII POD should use " for quotation marks (not `` or '' or
> >> anything else). That character is then copied directly to the
> >> output (with escaping as necessary for macro arguments). At
> >> present, this produces reasonable behavior with groff. Earlier
> >> versions of Pod::Man attempted to do other things to enable
> >> typographic paired quotes, and those efforts were a fragile
> >> maintenance disaster with all sorts of failing edge cases. All of
> >> that code was removed in 5.00 and v6.0.0.
>
> > That's fine, and is consistent with the presumption that underlies
> > my proposed change: most of the time, in man pages, authors use `"`
> > for some purpose other than prose quotation.
>
> Well, what I'm trying to say here is that this presumption may be true
> for man pages that authors write directly in *roff and use `` and ''
> as originally intended,
Likely not on Arch (and Bookworm or older Debian) systems, where that
mechanism is defeated. (Or, rather, it merely looks ugly on
Unicode-supporting terminal emulators whereas it doesn't with stock
groff.)[1]
> but in the universe of man pages that include ones generated from POD,
> I am dubious about that presumption. It is common in POD source to
> use `"` for prose quotation, and Pod::Man will preserve that usage in
> the generated *roff.
A plausible claim, and a quantitative one to boot. So, curious, I
resolve to try my own measurement.
First we need a denominator. How many POD-generated pages do I have on
my box?
$ find /usr/share/man -type f -and -not -type l \
| xargs zgrep -li 'generated by Pod::Man' | wc -l
3796
Now let's get a numerator. Before the laborious stage requiring human
judgement, let's see how many pages use `"` on a text line--meaning we
can rule out any line starting with `.`, because (1) POD probably
doesn't generate (many) lines using the no-break control character, (2)
if it does, it's probably not going to do so with request or macro
arguments containing a `"` to be formatted as text, because (3) requests
and macro calls handle `"` distinctly (moreover, some requests handle
`"` differently from others).
First I realize belatedly that I should have saved the file list.
$ find /usr/share/man -type f -and -not -type l \
| xargs zgrep -li 'generated by Pod::Man' > /tmp/PODs
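The listing step above hands file names to xargs, which splits them on whitespace. A more defensive variant might look like the following sketch. The function name `list_pod_pages` is made up for illustration, and the sketch assumes `gzip -dcf` is available so that compressed and uncompressed pages are read alike.

```shell
# List regular files under DIR (default /usr/share/man) whose contents
# mention Pod::Man.  gzip -dcf decompresses gzipped pages and passes
# uncompressed ones through unchanged.  (Hypothetical helper name.)
list_pod_pages() {
    find "${1:-/usr/share/man}" -type f -exec sh -c '
        for f do
            if gzip -dcf -- "$f" 2>/dev/null |
                grep -qi "generated by Pod::Man"; then
                printf "%s\n" "$f"
            fi
        done' sh {} +
}
```

Running `list_pod_pages > /tmp/PODs` would produce the same kind of list as the command above, one file name per line.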
Now, do the count.
$ while read f; do zgrep '^[^.].*"' "$f"; done < /tmp/PODs > /tmp/quotes
$ wc -l /tmp/quotes
36763 /tmp/quotes
That's a lot. Let's gather a sample.
$ shuf /tmp/quotes | head | nl -ba
1 \& return "J\*(Aqai trouv\exE9 $files dans $dirs.";
2 described for \-size in \*(L"\s-1FONT OPTIONS\*(R"\s0 below. Any additional optional
3 \& $self\->pushline( "<p>"
4 the rule for \*(L"borning\*(R" new cells. Higher order bits encode for an
5 \& normalizer("a", "\e034b")
6 \& my $b = "fghij";
7 \& print $z "hello world\en";
8 Binary \f(CW"&"\fR returns its operands ANDed together bit by bit. Although no
9 See also: \*(L"Declaring a Reference to a Variable\*(R" in perlref
10 encountered \*(L"y \*(R". The \f(CW\*(C`*?\*(C'\fR quantifier effectively tells the regular
Items 1, 3, 5, 6, 7 are code examples, not prose quotation.
Items 2, 4, 9, and 10 interpolate *roff strings that POD
defined. I'll have to resample with these excluded.
Item 8 is a prose quotation, of a single character, and not one that is
implicated by end-of-sentence detection.
This crude approach puts the ratio of prose quotation to code literal at
1:5, but the string names inject noise into our sample, so let's redo
this procedure without them.
$ while read f; do zgrep '^[^.].*"' "$f" | grep -v '\\\*([LR]"'; done \
< /tmp/PODs >| /tmp/quotes2; wc -l /tmp/quotes2
18878 /tmp/quotes2
Yes, that cuts our sample size almost in half. _Now_ let's see what we
can see.
$ shuf /tmp/quotes2 | head | nl -ba
1 \& print "Array is now: ", @a, "\en"
2 \& $rule\->new\->name("foo*"),
3 \& "Feynman" => "Richard",
4 \& "exit Pod::Simple::Checker\->filter(shift)\->any_errata_seen" \e
5 \& or die "deflate failed: $DeflateError\en";
6 \& idn_to_ascii("Räksmörgås.Josefßon.ORG") eq
7 \& the object for the "em" element
8 \& "%.6f degrees" 54.989667 degrees
9 \&\f(CW"hv_iternext"\fR.
10 \& "([^\e"\e\e]*(?:\e\e.[^\e"\e\e]*)*)",? # groups the phrase inside the quotes
These _all_ look like code examples to me, or so I infer from the
leading dummy character escape sequences `\&`. Item 7 looks prosey
anyway, though, so let's inspect more closely.
$ zgrep -C2 'the object for the "em" element' $(cat /tmp/PODs)
/usr/share/man/man3/HTML::Tree::AboutTrees.3pm.gz-\& * nodes it contains:
/usr/share/man/man3/HTML::Tree::AboutTrees.3pm.gz-\& the string "I\*(Aqve got "
/usr/share/man/man3/HTML::Tree::AboutTrees.3pm.gz:\& the object for the "em" element
/usr/share/man/man3/HTML::Tree::AboutTrees.3pm.gz-\& the string "!"
/usr/share/man/man3/HTML::Tree::AboutTrees.3pm.gz-\& * its parent:
That _does_ look prosey. Give me a moment to read the actual page.
---snip---
NAME
HTML::Tree::AboutTrees -- article on tree-shaped data structures in
Perl
SYNOPSIS
# This an article, not a module.
...
Trees
-- Sean M. Burke
"AaaAAAaauugh! Watch out for that tree!"
‐‐ George of the Jungle theme
...
But trees are easy to build and manage in Perl, as I’ll demonstrate
by showing off how the HTML::Element class manages elements in an
HTML document tree, and by walking you through a from‐scratch
implementation of game trees. But first we need to nail down what
we mean by a "tree".
Socratic Dialogues: "What is a Tree?"
My first brush with tree‐shaped structures was in linguistics
...
calling "trees" the same as what programmers call "trees"? So I
-- So what is a "tree", a tree‐shaped data structure?
-- A tree is a special case of an acyclic directed graph!
-- What’s a "graph"?
...
A: Links. Also called "arcs". They just symbolize the fact that
each node holds a list of nodes it links to.
...
Q: "generation"? This is a family tree?
A: No, not unless it’s a family tree for just yeast cells or
something else that reproduces asexually. But for sake of having
lots of terms to use, we just pretend that links in the tree
represent the "is a child of" relationship, instead of "is a kind
of" or "is a part of", or "could result from", or whatever the real
relationship is. So we get to borrow a lot of kinship words for
describing trees ‐‐ B and C are "children" (or "daughters") of A; A
is the "parent" (or "mother") of B and C. Node C is a "sibling"
(or "sister") of node C; and so on, with terms like "descendants"
(a node’s children, children’s children, etc.), and "generation"
(all the nodes at the same "level" in the tree, i.e., are either
all grandchildren of the top node, or all great‐grand‐children,
etc.), and "lineage" or "ancestors" (parents, and parent’s parents,
etc., all the way to the topmost node).
...
...and so forth. Okay, I think I see where you got the notion that
literal double quotes are "commonly" used for prose quotation in POD.
They might not strictly be statistically common (maybe 10% of pages),
but where they do occur, the document author employs the hell out of
them. Not, however, always, or even frequently, after sentence-ending
punctuation.
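The ten-line samples above can also be tallied over a whole page or corpus. The following is a minimal sketch, not a rigorous classifier: it treats a leading `\&` as the sign of a code line, which is the inference drawn from the second sample and not a guarantee (item 7 there shows the heuristic misfiring). The function name `tally_quote_lines` is hypothetical.

```shell
# Read roff source on stdin.  Among text lines (not starting with ".")
# that contain a neutral double quote and do not interpolate the
# \*(L" or \*(R" strings, count those that start with the dummy
# character \& separately from the rest.  (Hypothetical helper name.)
tally_quote_lines() {
    grep '^[^.].*"' | grep -v '\\\*([LR]"' |
    awk '
        /^\\&/ { verbatim++; next }
               { other++ }
        END    { printf "verbatim=%d other=%d\n", verbatim, other }
    '
}
```

For one page, something like `zcat "$(man -w HTML::Tree::AboutTrees.3pm)" | tally_quote_lines` would print the two counts; a page like Burke's, which puts prose-like material in lines with a leading `\&`, will overstate the "verbatim" share.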
A worthwhile experiment at this point is to format this very page with
groff twice: once with a man(7) package that includes `cflags 0 "` and
once with one that does not, and see what the differences are.
First, I stick `cflags 32 "` into my
"$HOME/groff-HEAD/share/groff/site-tmac/man.local" to restore the status
quo ante. And verify that it took.
$ echo '.pchar "' | nroff -man
character '"'
is not translated
does not have a macro
special translation: 0
hyphenation code: 0
flags: 32 (is transparent to end of sentence)
Unicode mapping: U+0022
ASCII code: 34
ASCII code: 34
asciify code: 0
is found
is transparently translatable
is not translatable as input
mode: normal
...good.
Now format the page with this nroff.
$ zcat $(man -w HTML::Tree::AboutTrees.3pm) | nroff -man > /tmp/burke1
Rip out the aforementioned stuck-in request and reverify.
$ echo '.pchar "' | nroff -man
character '"'
is not translated
does not have a macro
special translation: 0
hyphenation code: 0
flags: 0 (none)
Unicode mapping: U+0022
ASCII code: 34
ASCII code: 34
asciify code: 0
is found
is transparently translatable
is not translatable as input
mode: normal
Format the page with this nroff again (reflecting groff Git's master
branch).
$ zcat $(man -w HTML::Tree::AboutTrees.3pm) | nroff -man > /tmp/burke2
Confront the moment of truth.
$ cmp /tmp/burke[12] && echo THEY ARE THE SAME
THEY ARE THE SAME
At this point I must ask that you direct me to a specific document that
you think will actually be adversely affected by this change.
(The `pchar` request is another groff 1.24 innovation of mine, and this
exercise increases my confidence that adding it was worthwhile.
groff(7):
.pchar c ...
Report, to the standard error stream, information about
each ordinary, special, or indexed character c. A
character defined by a request (char, fchar, fschar, or
schar) reports its contents as a JSON‐encoded string,
but the output is not otherwise in JSON format.
)
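One way to hunt for a page that the change could affect is to look for text lines where `"` directly follows sentence-ending punctuation and is itself followed by the end of the line or by two spaces. The sketch below is only a rough approximation of the formatter's end-of-sentence rule (it ignores other transparent characters such as `)` and `'`), and the function name `eos_quote_candidates` is made up for illustration.

```shell
# Read roff source on stdin; print text lines (not starting with ".")
# in which a neutral double quote follows ".", "?", or "!" and is then
# followed by two spaces or the end of the line.
# (Hypothetical helper name; a heuristic, not groff's actual logic.)
eos_quote_candidates() {
    grep '^[^.]' | grep -E '[.?!]"(  |$)'
}
```

A page for which this prints nothing is unlikely to format differently under either `cflags` setting; a page for which it prints lines is merely a candidate and still needs to be formatted both ways to confirm.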
Incidentally, here's more context of the "hit" in our search.
---snip---
$ sed -n '423,438p' /usr/share/perl5/HTML/Tree/AboutTrees.pod
So, for example, when HTML::TreeBuilder builds the tree for the above
HTML document source, the object for the "body" element has these pieces of
data:
* element name: "body"
* nodes it contains:
the string "I've got "
the object for the "em" element
the string "!"
* its parent:
the object for the "html" element
* bgcolor: "#d010ff"
Now, once you have this tree of objects, almost anything you'd want to
do with it starts with searching the tree for some bit of information
in some element.
---end snip---
I don't know what indentation is _supposed_ to mean in POD, so I can't
form an opinion as to whether Burke abused the input language or not.
This looks more like an attempt at an itemized list (with nesting) than
a code display. Even so, I reiterate that he _didn't set off_ the land
mine you suspect I am planting.
> >> I think you are making an assumption that " in *roff input is
> >> mostly only used for code.
>
> > Not quite. I'm making the assumption that " _in man(7) and mdoc(7)
> > input_ is mainly used for code.
>
> If by "mainly" you mean 55%, sure, maybe. I have no statistics. If by
> "mainly" you mean more like 90%, I'm dubious.
90% seems like a pretty good guess, as it happens.
And even within the remaining 10%, it appears one can still get pretty
quote-happy without adverse effect.
> We've lived with the rare misformatting of code for many years. It's
> rare in part because the `."` sequence, while not impossible in code,
> is not common. I don't think it's a good approach to trade that
> misformatting for what I believe will be more common misformatting of
> text.
Have I given you enough evidence to update your Bayesian priors?
> I continue to be surprised that .nf does not disable this type of
> reformatting. I believe you that this is how *roff has always worked,
> but it sure feels to me like how *roff has always worked is incorrect.
> :)
`nf` is a mnemonic for "no fill [mode]", not "no formatting".
Let me offer you another exhibit. We can put inter-sentence spacing
aside for a moment, because that was not configurable in AT&T troff.
groff lets you shut it off, and AT&T did not.
When filling is disabled, automatic breaking is disabled, and thus too
are automatic hyphenation and adjustment. What remains? Word spacing.
Yet another AT&T weirdness raises its head here.
groff_diff(7):
AT&T troff ignores the ss request if the output is a terminal
device; GNU troff rounds down the values of minimum inter‐word and
additional inter‐sentence space each to the nearest multiple of 12.
One might presume that, between `nf` and `fi` requests, a word space is
"literal": you get out one word space width (as the font defines it,
typically one-quarter to one-third em for proportional fonts, and one en
for monospaced ones) for each you put in. Consequently, we would expect
the `ss` request to have no effect in "non-formatted" [sic] regions.
Let's format the following document with AT&T troff, using DWB 3.3, a
representative traditional "ditroff".
$ cat EXPERIMENTS/nf-and-ss.roff
.nf
I expect the following two lines to look the same.
Jeffrey Epstein didn't hang himself.
.ss 72
Jeffrey Epstein didn't hang himself.
$ DWBHOME=~/dwb ~/dwb/bin/troff EXPERIMENTS/nf-and-ss.roff \
| DWBHOME=~/dwb ~/dwb/bin/dpost >| /tmp/nf-and-ss.ps
I viewed the output with Okular and took a screenshot; find it attached.
Thus, if word spaces are configurable when filling is disabled, why
wouldn't inter-sentence spaces be?
If you want a "literal mode" for POD output, I daresay you're going to
have to write macros to get it, even if I changed groff, because there
are a few other *roff formatters out there. You will have to decide how
literal you want literal mode to be--do you expect escape sequences
still to be recognized? If so, how true is that to the principle of
"literalness"? (Since your audience is POD authors, this may not
matter, if they can't already "punch through the floor" and inject raw
*roff code into their documents. I don't know this detail of POD.)
> Personally, I'd rather see you change that behavior than to make `"`
> non-transparent for sentence boundaries. I think that change would
> eliminate most of the problem cases for code blocks, since at least
> IMO it's poor style to inline complex expressions involving quotes
> rather than setting them off in what in POD we'd call a verbatim
> paragraph. I'm also not seeing a drawback to that change. Obviously
> that would be a lot more work, though, and I certainly understand why
> you wouldn't want to do it.
Changing the way `nf` works is necessarily more disruptive than any
change to man(7) package behavior.
> Meanwhile, I think I prefer the status quo.
So far, I've been unable to find a real-world POD exhibit where my
proposal _affects_ the status quo.
I'm running out of steam for the evening, but, equipped with my
"/tmp/PODs" file, maybe in the coming days I'll repeat the foregoing
`cflags` experiment with _all_ of those documents. I don't know how
much that will prove, as one can say that my installed POD repertoire is
unrepresentative of a Perl hacker's; it's true that I don't program in
Perl all that much.
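Such a batch run might be sketched as follows. Both function names are hypothetical. The sketch assumes that nroff's `-M` option makes groff find a throwaway "man.local" ahead of the installed one (so the installed file's other settings are absent from both runs alike), and that a `cflags` request placed there takes effect as it did in the single-page experiment above.

```shell
# Format PAGE ($2) with nroff -man, using a temporary man.local that
# sets the character flags of the neutral double quote to FLAGS ($1).
# Assumption: the -M directory is searched before site-tmac, so this
# man.local replaces the installed one for the duration of the run.
format_with_quote_flags() {
    tmac_dir=$(mktemp -d) || return 1
    printf '.cflags %s "\n' "$1" > "$tmac_dir/man.local"
    gzip -dcf -- "$2" | nroff -man -M "$tmac_dir"
    rm -rf "$tmac_dir"
}

# Read page file names from LIST ($1), one per line, and print those
# whose formatted output differs between flags 32 and flags 0.
pages_affected_by_quote_flags() {
    while IFS= read -r page; do
        a=$(format_with_quote_flags 32 "$page" 2>/dev/null | cksum)
        b=$(format_with_quote_flags 0 "$page" 2>/dev/null | cksum)
        [ "$a" = "$b" ] || printf '%s\n' "$page"
    done < "$1"
}
```

Invoked as `pages_affected_by_quote_flags /tmp/PODs`, this would print nothing if no page in the list changes, and otherwise name the pages worth inspecting by hand.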
> > (That's /etc/groff/man.local in Debian-based systems.)
>
> This doesn't help. That solves a different problem, which is to
> display one space after periods when reading man pages on your own
> system.
I think it _would_ help, _because_ the property of the `"` character's
end-of-sentence transparency can directly affect whether one space or
more shows up after a period when reading a man page on one's system.
But I would agree that it's not the best place to attack the quotation
character problem.
> When *writing* POD, one obviously cannot go and add this change to
> *other people's* systems, and therefore one has to take irritating and
> inobvious special measures to format the input such that *roff can
> recognize the sentence boundaries if one wants the sentence spacing to
> be consistent on systems that have not made this change to the
> defaults. Or, if one wants your text to render with one space after
> periods on arbitrary systems that you cannot reconfigure (which, I
> agree, may not be a good thing to want, but I certainly do have users
> who want that and who have complained to me about it), one has to take
> different and even more annoying and inobvious special measures to
> interfere with *roff's ability to recognize sentence boundaries.
Okay. I'm acknowledging this but not seeing how it impacts the proposed
change _concretely_.
> This is tedious and irritating and therefore mostly doesn't happen.
> It's hard enough to get people to write documentation of any kind,
Oh, I hear ya.
> let alone adopt what seems to them as unnatural prose formatting.
> Therefore, many man pages translated from POD (and ones not translated
> from POD, for that matter) are routinely rendered with spacing
> inconsistencies.
Doug McIlroy has also lamented *roff's "artificial intelligence"
regarding sentence boundary detection. Early on, the Bell Labs CSRC
figured out a solution, and Brian Kernighan evangelized it vigorously,
anticipating XKCD 1285 by decades.
But your phrase "adopt what seems to them as unnatural prose formatting"
is the shoal upon which your preferences are breaking. Your users
simultaneously want the AI and don't. Or, rather, they want the AI, and
you want to produce *roff that will behave predictably, and these two
desires are incompatible for some inputs.
> I've somewhat made my peace with this and mostly assume this is
> unfixable because the necessary measures on the author's side are too
> annoying to be viable for most authors. Maybe eventually everyone will
> just feed text through a machine learning model trained to recognize
> sentence boundaries and insert special markup. :)
That may be the only hope here for the populations of the left
two-thirds of XKCD 1285.
> But at least, in the current system, an author who *wants* to get this
> right can get 99% of the way there by adopting a policy of adding a
> line break after each sentence. If I understand the proposal
> correctly, the `cflags 0 "` change would make the problem much worse,
> since there would no longer be a way to tell *roff to consistently
> space a sentence ending with `."`, even by adding line breaks after
> sentences.
True, but who's doing this? Sean Burke evidently isn't. I'm not averse
to adding an escape sequence that explicitly means "sentence ends here",
or that makes the existing (and long-toothed) `\)` GNU troff extension
"sticky" until the end of the word, which would achieve the same end.
Dave Kemper and I have spitballed ideas like this for other reasons.[2]
Such extensions might be adopted by other *roffs slowly or never, but
that won't matter if they don't also incorporate the `cflags 0 "` thing.
Do you need me to look harder for examples of what worries you?
> When groff reflows the paragraph, it will always treat `."` as the end
> of a regular word, not a sentence, and space it accordingly. The
> *only* option for correct sentence spacing consistent with the
> defaults, when writing in POD, would be to use Unicode quotes, with
> all the tedium that involves on many systems and keyboard entry
> methods.
Nevertheless I think Unicode input is the horse to bet on in the long
term. I've spent years accumulating the knowledge I'll need to migrate
GNU troff's internal representation format for characters from `unsigned
char` to `char32_t`.[3]
> Maybe I still misunderstand something?
Your understanding seems pretty good to me. But one (or both!) of us
may have inadequate Bayesian priors. ;-)
Regards,
Branden
[1]
https://gitlab.archlinux.org/archlinux/packaging/packages/groff/-/commit/e474b541a32fc905b4f748de0313acfb8b98c081
groff_man_style(7):
Notes
Some tips on composing and troubleshooting your man pages follow.
[...]
• When and how should I use quotation marks?
As noted above in subsection “Font style macros”, apply quotation
marks to “brief specimens of literal text, such as article
titles, inline examples, mentions of individual characters or
short strings, and (sub)section headings in man pages”. Multi‐
word literals, such as Unix commands with arguments, when set
inline (as opposed to displayed between EX and EE), should be
quoted to ensure that the boundaries of the literal are clear
even when the material is stripped of font styling by, for
example, copy‐and‐paste operations. groff, Heirloom Doctools
troff, neatroff, and mandoc support all of the special characters
\[oq], \[cq], \[lq], \[rq], \[aq], and \[dq] described in
subsection “Portability” above. DWB, Plan 9, and Solaris 10
troffs do not.
Historically, man pages used ` and ' exclusively for directional
single quotation marks. However, in recent years, some
distributors of groff have chosen to override the meanings of
these characters in man pages, remapping them to their Unicode
Basic Latin code points. Unfortunately, ` and ' are the only
reliable means of obtaining directional single quotation marks in
AT&T troff; in that implementation, often no special character
escape sequences exist to obtain them. Further, AT&T troff’s
special character identifiers, like its font names, were device‐
specific. To achieve quotation portably in man pages rendered
both by AT&T and more modern troffs, consider adding a preamble
to your page after the TH call as follows.
.ie \n(.g \{\
. ds oq \[oq]\"
. ds cq \[cq]\"
.\}
.el \{\
. ds oq `\"
. ds cq '\"
.\}
You must then use the \* escape sequence to interpolate the
quotation mark strings.
The command
.RB \*(oq "while !\& git pull; do sleep 10; done" \*(cq
retries an update from the repository until it succeeds.
If this procedure seems complex, petition your distributor to
revert their remapping of the ` and ' characters.
[2] https://savannah.gnu.org/bugs/?67347
[3] Part of this learning has involved refactorings of the code to
consistently _use_ `unsigned char` (Clark's choice to make life
under ISO Latin-1 easy), rather than punning to `int` or a `char` of
undefined signedness all over the place because C++ implementations
still let you get away with this.
Before taking the `char32_t` plunge I think I want to refactor GNU
troff to stop using `unsigned char` for any other purpose, so that
the character handling logic is easy to find. Right now, it also
uses `unsigned char` for short bit masks (like, as it happens,
character flags). These could easily be `typedef`ed, but a more
scrupulous approach might be to make them one-element structs to get
full-powered type checking.
