Re: Need background on design of groff character classes

G. Branden Robinson Mon, 01 Dec 2025 22:45:55 -0800

Hi Dave,

Werner seems not to be available, as it's now 2 weeks since I lit up the
bat-signal, and I haven't heard his cape flutter yet.

I suggest that it falls to us (and any interested list subscribers) to
find the way forward.

(But I'm still CCing him, just in case.)

At 2025-12-01T00:37:42-0600, Dave Kemper wrote:
> On Tue, Nov 18, 2025 at 12:45 AM G. Branden Robinson
> <[email protected]> wrote:
> > Is it really asking too much of the user to write:
> >
> > .class \[EOS] .?!\[em]
> > .cflags 1 \C'[EOS]'
> > .cflags 5 \[em]
> >
> > instead of:
> >
> > .class \[EOS] .?!\[em]
> > .cflags 1 \C'[EOS]'
> >
> > ?
> 
> Yes, it is.  To set a new cflag on arbitrary character x, you're
> saying the user must examine every macro package that document loads
> to see if one of them already sets any flags on character x, so that
> he can do his own arithmetic to add those values to the new one he
> wants.

Not at all.  To set a new character flag on an arbitrary character x,
one just does it.  If one really means for `\[em]` to both end a
sentence (if followed by appropriate whitespace) and for the line to be
eligible for breaking after it, one says so.

.cflags 5 \[em]

Also, if one wants to know what flags a character possesses, one can ask
with the new `pchar` request.  Not programmatically, no--there's no
mechanism for returning a character's assigned flags in a register,
say--but this is a more easily obtained insight than has been available
heretofore.

Do people really not know some of the "character flag" properties they
desire for a special character, but not others?  I think most groff
users don't mess with character flags at all.

I can _imagine_ this partial interest being the case for the horizontal/
vertical overlap flags--has anyone outside of the groff development team
_ever_ manipulated these?  I'll return to this point momentarily.  For
all of the others, which involve the mutually implicating properties of
end-of-sentence significance and breaking eligibility, I think not.

Let's review what all the character flags are.

https://cgit.git.savannah.gnu.org/cgit/groff.git/tree/src/roff/troff/charinfo.h?id=198346d187de9e340bbf9d4f80c2dc4d42f5f74e#n46

(For non-source divers, I've put the full explanation in a footnote.[1])

All of these save two entail breaking or sentence termination
properties.  I assert that these are mutually entailing because breaking
implies (potential) hyphenation, and hyphenation is never applicable
after the end of a sentence.

I therefore challenge you to identify a scenario where one is going to
assign a sentence termination property to a character in _indifference_
to its breaking properties.

Now, then, what about the other two properties--the self-overlapping
glyph properties?

Well, I'd agree that they are orthogonal to the other character flags.
In fact, I think they bear a closer resemblance to the "featural
description" properties declared in font description files.

groff_font(5):
     The directive charset starts the character set subsection.  (On
     typesetters, this directive is misnamed since it starts a list of
     glyphs, not characters.)  It precedes a series of glyph
     descriptions, one per line.  Each such glyph description comprises
     a set of fields separated by spaces or tabs and organized as
     follows.

            name metrics type index [entity‐name] [-- comment]
...
     For fonts used with typesetters, the type field gives a featural
     description of the glyph: it is a bit mask recording whether the
     glyph is an ascender, descender, both, or neither.  When a \w
     escape sequence is interpolated, these values are bitwise or‐ed
     together for each glyph and stored in the ct register.  In font
     descriptions for terminals, all glyphs might have a type of zero,
     regardless of their appearance.

     0      means the glyph lies entirely between the baseline and a
            horizontal line at the “x‐height” of the font, as with “a”,
            “c”, and “x”;

     1      means the glyph descends below the baseline, like “p”;

     2      means the glyph ascends above the font’s x‐height, like “A”
            or “b”); and

     3      means the glyph is both an ascender and a descender——this is
            true of parentheses in some fonts.
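
The bitwise OR described in that excerpt can be sketched in shell
arithmetic.  This is only an illustration: `ct_of` is a name invented
here, and the sample type values are the ones the man page gives for
"a", "c", "x" (0), "p" (1), and "b" (2).

```shell
# OR together per-glyph "type" values, as \w is documented to do when it
# sets the ct register.  ct_of is a hypothetical helper, not a groff tool.
ct_of() (
  ct=0
  for t in "$@"; do
    ct=$(( ct | t ))
  done
  echo "$ct"
)

ct_of 0 0 0   # "a", "c", "x": prints 0
ct_of 0 1     # "a", "p":      prints 1
ct_of 1 2     # "p", "b":      prints 3
```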

I'd say, then, that we would do better to retire character flags 8 and
16, and make them featural "type" bits 4 and 8.  Initially, I'd try to
get away with not adding a request to manipulate these bits, but if we
do, we can thereby make the 2 bits of a glyph's "type" runtime
programmable as well.  (When assigning them, one would have to supply a
"resolved font name".[5])

I'd further add that for the HTML and terminal output devices, these
"overlap" character flags make promises the formatter can't keep.

> Two years later, when he needs to include a new macro package, he has
> to, first, remember that he used a .cflags request in this document
> two years ago, and second, audit the new macro package for any .cflags
> usage affecting character x.

...and this differs from anything else we might announce in the "NEWS"
file how, exactly?  Here's the sum total of `cflags` matches in the
parts of our source that contain macro packages.

$ git grep -w cflags contrib tmac
contrib/mom/ChangeLog:  o Added .cflags 4 /\(en -- was driving me nuts that lines wouldn't
contrib/mom/NEWS:Added .cflags 4 /\(em to om.tmac.  By default, mom now obligingly
contrib/mom/om.tmac:.cflags 4 /\[en]      \" So slash and en-dashes get broken
tmac/an.tmac:.cflags 0 "
tmac/doc.tmac:.cflags 0 "
tmac/dvi.tmac:.cflags 8 \[an]
tmac/html.tmac:.cflags 0 -\[hy]\[em]\[en]
tmac/ja.tmac:.cflags 128 \C'[CJKprepunct]'
tmac/ja.tmac:.cflags 266 \C'[CJKpostpunct]'
tmac/ja.tmac:.cflags 512 \C'[CJKnormal]'
tmac/ps.tmac:.cflags 8 \[an]
tmac/zh.tmac:.cflags 128 \C'[CJKprepunct]'
tmac/zh.tmac:.cflags 266 \C'[CJKpostpunct]'
tmac/zh.tmac:.cflags 512 \C'[CJKnormal]'

(Do I spot a bug in the Chinese and Japanese "postpunct" flags here?[3])

Peter Schaffter struck the `\[en]` character with lightning--in mom
1.1.2, which was over 20 years ago (I see a date of 8 August 2004 on mom
1.2).  He virtuously documented the change.

The `cflags` changes in our man(7) and mdoc(7) are very recent[2] and
might in fact be what got me looking into this stuff.  One might argue
that they illustrate nothing vis-a-vis your argument involving use of
character classes, since they, well, don't, and secondarily, you're
championing a model of flagful character classes that can "OR" (set)
flags on but necessarily can't mask them off (which requires "AND" and
"NOT").  You may challenge me to document this change, which I'd
characterize as recognition of existing implied preference, in our
"NEWS" file (and therefore 1.24.0 release notes), but please see [2]
first.

The others are all pretty old decisions.

$ for f in tmac/{dvi,html,ja,ps,zh}.tmac; do git blame "$f" | grep cflags; done
385c59960d tmac/dvi.tmac (Werner LEMBERG      2002-02-25 17:19:21 +0000  97) .cflags 8 \[an]
be5c0f723d tmac/html.tmac (Werner LEMBERG      2006-02-23 20:07:25 +0000  28) .cflags 0 -\[hy]\[em]\[en]
38e6049d0d (Werner LEMBERG      2010-12-18 09:13:18 +0000 47) .cflags 128 \C'[CJKprepunct]'
38e6049d0d (Werner LEMBERG      2010-12-18 09:13:18 +0000 48) .cflags 266 \C'[CJKpostpunct]'
38e6049d0d (Werner LEMBERG      2010-12-18 09:13:18 +0000 49) .cflags 512 \C'[CJKnormal]'
0fcc774e06 tmac/ps.tmac (Werner LEMBERG      2002-03-24 11:38:34 +0000  28) .cflags 8 \[an]
ab3cf0445c (Werner Lemberg      2015-04-30 09:42:22 +0200 46) .cflags 128 \C'[CJKprepunct]'
ab3cf0445c (Werner Lemberg      2015-04-30 09:42:22 +0200 47) .cflags 266 \C'[CJKpostpunct]'
ab3cf0445c (Werner Lemberg      2015-04-30 09:42:22 +0200 48) .cflags 512 \C'[CJKnormal]'

So our churn rate for cflags changes for a given macro package is 1-3
per somewhere between 10 years and never.[6]

And I'm not even proposing, at this time, to change any of these
`cflags` requests (modulo the bug I think I see in \C'[CJKpostpunct]').
Even the "featural" idea I spitballed above would affect only the "dvi"
and "ps" devices.
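
As a rough sketch of what the "featural" idea would mean in bit terms:
character flags 8 and 16 would become type bits 4 and 8, alongside the
existing descender (1) and ascender (2) bits.  Nothing like this exists
in groff today; `merge_type` and its arguments are hypothetical.

```shell
# Hypothetical mapping from today's overlap cflags to proposed type bits:
#   cflags 8  (horizontal overlap) -> type bit 4
#   cflags 16 (vertical overlap)   -> type bit 8
# merge_type is an invented name; this models the proposal only.
merge_type() (
  gtype=$1 cflags=$2
  # Keep the existing ascender/descender bits (mask 3), then shift the
  # two overlap flags (mask 24) right by one so they land on 4 and 8.
  echo $(( (gtype & 3) | ((cflags & 24) >> 1) ))
)

merge_type 0 8    # type 0, horizontal overlap: prints 4
merge_type 3 16   # type 3, vertical overlap:   prints 11
```

Flags outside the mask (such as 1 or 4) are ignored by this mapping,
which matches the claim that the overlap properties are orthogonal to
the breaking and sentence-ending ones.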

> Whenever you find yourself telling a user he has to manually perform a
> set of algorithmic steps that a computer can do in a matter of
> nanoseconds, you should probably instead be telling yourself there's a
> design flaw.  Computers, not humans, should be doing algorithmic
> tasks.

I think that's a better argument for championing a leading "+/-" syntax
extension to the first argument to `cflags` than for making official a
dark corner of GNU troff usage of which only you, as far as I know, have
made any use.
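
To make the "+/-" idea concrete, here is a minimal sketch of one way
such an argument could be interpreted.  It is entirely hypothetical: no
such syntax exists in groff, `apply_cflags` is an invented name, and the
"plain number replaces" branch reflects only how this thread describes
assigning flags to an individual character.

```shell
# Hypothetical semantics for a signed first argument to `cflags`:
#   N   replace the character's flags with N
#   +N  set the bits in N, leaving the others alone (bitwise OR)
#   -N  clear the bits in N, leaving the others alone (AND NOT)
apply_cflags() (
  current=$1 arg=$2
  case $arg in
    +*) echo $(( current | ${arg#+} )) ;;
    -*) echo $(( current & ~${arg#-} )) ;;
    *)  echo "$arg" ;;
  esac
)

apply_cflags 4 +1   # add end-of-sentence to break-after: prints 5
apply_cflags 5 -4   # drop break-after, keep the rest:    prints 1
apply_cflags 5 0    # plain number replaces everything:   prints 0
```

With something like this, a user could add or remove one property
without first learning what a macro package had already assigned.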

I observe that even Donald Knuth long ago resolved to surrender TeX and
Metafont to Hyrum's Law only at the time of his own death.

https://www.hyrumslaw.com/
https://www-cs-faculty.stanford.edu/~knuth/abcde.html ("Errata")

At 2025-12-01T03:56:39-0600, Dave Kemper wrote:
> On Tue, Nov 18, 2025 at 12:45 AM G. Branden Robinson
> <[email protected]> wrote:
> > However let me illustrate some cans of consistency worms it opens.
> 
> To be clear, these cans are not being opened, but have been sitting
> open for a long time--yet we have no record of anyone complaining
> about them.

I agree with your observation but not with the inference you draw
thence.

> > So what I think can happen is that, via the character class feature,
> > a character can be made to carry contradictory character flags.
> 
> I'm sure you're right.  But that doesn't point to a design flaw, it
> points to a missing sanity check--a check that can be added without
> tossing out the design.

I'd say I'm proposing to _modify_ (and simplify) the design, and make
the formatter do what the most parsimonious reading of the documentation
suggests.  If you don't want me to take my point about what the groff
1.22.3 documentation implied about character class behavior as
unrebutted, I'll have to ask that you rebut it.[4]

> > It might be doing that already.  See Savannah #67570 and #67571.
> 
> #67570 looks like it can be pinned on a recent refactor that snuck in
> an unintended code change.  In any case, the observable behavior
> change only showed up since August and doesn't seem to have any real
> bearing on the design issues in this thread.

Agreed, and I've pushed a fix.

> #67571 remains undiagnosed and may yet reveal more cans that have been
> open long enough that the worms have had time to crawl away and evolve
> into weasels.  Or it could be nothing more sinister than a misplaced
> closing brace.

I'll concede that that bug needs a root-cause analysis.

However, I submit that even short of that, we know enough to conclude,
with substantial if not perfect confidence, that groff developers
neglected to fully audit what was in those cans in the first place.  We
may have thought they were nice beneficent earthworms for composting,
but they may have been cnidarians living symbiotically with botulinum.

Doesn't it look to you like this feature might not have gotten enough
testing when it landed?  It does to me, and that is why I fulsomely
praised Bertrand (and would do so again) for integrating an automated
test harness into groff's build system.  He made it straightforward for
a dopey old shell script programmer like me to feel confident in groff's
behavior, and demonstrate that confidence to others (by asking them to
run "make check" for themselves).

Regards,
Branden

[1]

groff_diff(7):
     .cflags n c1 c2 ...
             Assign properties encoded by the number n to characters c1,
             c2, and so on.  Characters, whether ordinary, special, or
             indexed, have certain associated properties.  The first
             argument is the sum of the desired flags and the remaining
             arguments are the characters to be assigned those
             properties.  Spaces need not separate the cn arguments.
             Any argument cn can be a character class defined with the
             class request rather than an individual character.

             The non‐negative integer n is the sum of any of the
             following.  Some combinations are nonsensical, such as “33”
             (1 + 32).

             1      Recognize the character as ending a sentence if
                    followed by a newline or two spaces.  Initially,
                    characters “.?!” have this property.

             2      Enable breaks before the character.  A line is not
                    broken at a character with this property unless the
                    characters on each side both have non‐zero
                    hyphenation codes.  This exception can be overridden
                    by adding 64.  Initially, no characters have this
                    property.

             4      Enable breaks after the character.  A line is not
                    broken at a character with this property unless the
                    characters on each side both have non‐zero
                    hyphenation codes.  This exception can be overridden
                    by adding 64.  Initially, characters “-\[hy]\[em]”
                    have this property.

             8      Mark the glyph associated with this character as
                    overlapping other instances of itself horizontally.
                    Initially, characters
                    “\[ul]\[rn]\[ru]\[radicalex]\[sqrtex]” have this
                    property.

             16     Mark the glyph associated with this character as
                    overlapping other instances of itself vertically.
                    Initially, the character “\[br]” has this property.

             32     Mark the character as transparent for the purpose of
                    end‐of‐sentence recognition.  In other words, an
                    end‐of‐sentence character followed by any number of
                    characters with this property is treated as the end
                    of a sentence if followed by a newline or two
                    spaces.  This is the same as having a zero space
                    factor in TeX.  Initially, characters
                    “'")]*\[dg]\[dd]\[rq]\[cq]” have this property.

             64     Ignore hyphenation codes of the surrounding
                    characters.  Use this value in combination with
                    values 2 and 4.  Initially, no characters have this
                    property.

             The remaining values were implemented for East Asian
             language support; those who use alphabetic scripts
             exclusively can disregard them.

             128    Prohibit a break before the character, but allow a
                    break after the character.  This works only in
                    combination with values 256 and 512 and has no
                    effect otherwise.  Initially, no characters have
                    this property.

             256    Prohibit a break after the character, but allow a
                    break before the character.  This works only in
                    combination with values 128 and 512 and has no
                    effect otherwise.  Initially, no characters have
                    this property.

             512    Allow a break before or after the character.  This
                    works only in combination with values 128 and 256
                    and has no effect otherwise.  Initially, no
                    characters have this property.

             In contrast to values 2 and 4, the values 128, 256, and 512
             work pairwise.  If, for example, the left character has
             value 512, and the right character 128, no break will be
             automatically inserted between them.  If we use value 6
             instead for the left character, a break after the character
             can’t be suppressed since the neighboring character on the
             right doesn’t get examined.

[2] commit ae0adea18c, 2025-09-18

$ sed -n '1469,+3p' tmac/an.tmac
.\" Man pages more often use the neutral double quote `"` as a "code
.\" literal" than as a quotation character.  Give it the same (empty)
.\" set of character flags as its special character equivalent, \[dq].
.cflags 0 "

[3] Shouldn't these CJK glyphs be "256" instead of "266"?

Factored into powers of 2, 266=256+8+2.
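
The same decomposition can be checked mechanically.  A minimal sketch
in shell arithmetic; `decompose` is a helper invented for this
illustration, not part of groff.

```shell
# Print the individual power-of-2 flags that sum to a cflags value.
# decompose is a hypothetical helper name.
decompose() (
  n=$1 bit=1 out=""
  while [ "$bit" -le "$n" ]; do
    if [ $(( n & bit )) -ne 0 ]; then
      out="$out${out:+ }$bit"
    fi
    bit=$(( bit * 2 ))
  done
  echo "$out"
)

decompose 266   # value in ja.tmac and zh.tmac: prints "2 8 256"
decompose 256   # the suspected intended value:  prints "256"
```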

Let's use groff Git HEAD to ask the Chinese macro package what
characters are in this class.

$ ./build/test-groff -Tutf8 -mzh
.pchar \C'[CJKpostpunct]'
character class '[CJKpostpunct]'
  defined at: file name: "/home/branden/src/GIT/groff/build/../tmac/zh.tmac", line number: 39
  contains ranges: U+201C U+3008 U+300A U+300C U+300E U+3010 U+FF08
  contains nested classes: (none)

Should these CJK post-punctuation glyphs have the flags 2 ("allows
breaks before the character") and 8 ("overlaps copies of itself
horizontally")?  Really?  For U+201C?

$ for c in 201C 3008 300A 300C 300E 3010 FF08; do unicode U+$c | head -n 3; done
U+201C LEFT DOUBLE QUOTATION MARK
UTF-8: e2 80 9c UTF-16BE: 201c Decimal: &#8220; Octal: \020034
“
U+3008 LEFT ANGLE BRACKET
UTF-8: e3 80 88 UTF-16BE: 3008 Decimal: &#12296; Octal: \030010
〈
U+300A LEFT DOUBLE ANGLE BRACKET
UTF-8: e3 80 8a UTF-16BE: 300a Decimal: &#12298; Octal: \030012
《
U+300C LEFT CORNER BRACKET
UTF-8: e3 80 8c UTF-16BE: 300c Decimal: &#12300; Octal: \030014
「
U+300E LEFT WHITE CORNER BRACKET
UTF-8: e3 80 8e UTF-16BE: 300e Decimal: &#12302; Octal: \030016
『
U+3010 LEFT BLACK LENTICULAR BRACKET
UTF-8: e3 80 90 UTF-16BE: 3010 Decimal: &#12304; Octal: \030020
【
U+FF08 FULLWIDTH LEFT PARENTHESIS
UTF-8: ef bc 88 UTF-16BE: ff08 Decimal: &#65288; Octal: \0177410
（

I see yet another bug here.  How about you?

[4] https://lists.gnu.org/archive/html/groff/2025-11/msg00029.html

[5] groff(7):

Using fonts
...
     Like the AT&T troff formatter, GNU troff does not itself load or
     manipulate a digital font file; instead it works with a font
     description file that characterizes it, including its glyph
     repertoire and the metrics (dimensions) of each glyph.  This
     information permits the formatter to accurately place glyphs with
     respect to each other.  Before using a font description, the
     formatter associates it with a mounting position, a place in an
     ordered list of available typefaces.  So that a document need not
     be strongly coupled to a specific font family, in GNU troff an
     output device can associate a style in the abstract sense with a
     mounting position.  Thus the default family can be combined with a
     style dynamically, producing a resolved font name.  A user‐
     specified font name that combines family and style, or refers to a
     font that is not a member of a family, is already “resolved”.

[6] Checking commit history and Werner's commit message after
    introducing the CJK-motivated character flags, I'm increasingly
    confident that a typo snuck in.

commit 38e6049d0d1ad035e6a562c285dc6017530f5745
Author: Werner LEMBERG <[email protected]>
Date:   Sat Dec 18 09:13:18 2010 +0000

    Improve CJK support with new values for `.cflags'.

    This patch introduces three new values to `.cflags':

      don't break before character: 128
      don't break after character:  256
      allow inter-character break:  512

...
-.cflags 2 \C'[CJKprepunct]'
-.cflags 4 \C'[CJKpostpunct]'
-.cflags 66 \C'[CJKnormal]'
+.cflags 128 \C'[CJKprepunct]'
+.cflags 266 \C'[CJKpostpunct]'
+.cflags 512 \C'[CJKnormal]'
...

...because '66' and '266' look more similar in decimal than binary.
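
For comparison, the three values can be printed side by side in binary.
A small sketch; `to_bin` is an invented helper, and the 10-bit width is
chosen only because 512 is the largest documented flag.

```shell
# Render a cflags value as a 10-bit binary string (leftmost bit = 512,
# rightmost bit = 1).  to_bin is a hypothetical helper name.
to_bin() (
  n=$1 i=9 out=""
  while [ "$i" -ge 0 ]; do
    out="$out$(( (n >> i) & 1 ))"
    i=$(( i - 1 ))
  done
  echo "$out"
)

for v in 66 256 266; do
  printf '%3d  %s\n' "$v" "$(to_bin "$v")"
done
#  66  0001000010
# 256  0100000000
# 266  0100001010
```

In decimal, 266 differs from 66 by one leading digit; in binary it is
256 with the extra bits for 8 and 2 set.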
