Re: Need background on design of groff character classes

G. Branden Robinson Mon, 17 Nov 2025 22:45:56 -0800

Hi Dave,

At 2025-11-17T23:43:32-0600, Dave Kemper wrote:
> I don't really envision anything.  I note that the behavior I've
> observed since first using classes in 2011 differs from the model
> you've presented, and in a way that makes the language more useful
> than how you seem to want it to work.


I think it makes character properties and therefore the behavior of the
formatter more difficult to predict.

> I'd use a simpler example than the one you presented to make my point.
> 
> Suppose you want the em dash to have the end-of-sentence property (in
> character flags parlance, having flag 1 set).  You might be tempted to
> say:
> 
> .cflags 1 \[em]
> 
> And this would work to set its end-of-sentence property--but it would
> have the side effect of overwriting the flag (4) that \[em] already
> had set.

Yes.  `cflags` is defined as _assigning_ flags to one or more
characters, not logically or-ing them with whatever flags are already
set on those characters.
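
To make the distinction concrete, here is a minimal sketch of the two
semantics under discussion.  It is not groff's code: the helper names
`assign_flags` and `merge_flags` are hypothetical, and the only facts
taken from this thread are that flag 1 means "ends sentence" and that
\[em] starts out with flag 4 (its breaking behavior).

```cpp
#include <cstdint>

// Flag values as used in this thread: 1 ends a sentence; 4 is the
// break-related flag that \[em] already carries.
constexpr std::uint32_t kEndsSentence = 1;
constexpr std::uint32_t kAllowsBreakAfter = 4;

// Hypothetical helper modeling `cflags` as documented: the new value
// replaces whatever flags the character had.
constexpr std::uint32_t assign_flags(std::uint32_t /* existing */,
                                     std::uint32_t mode)
{
  return mode;
}

// Hypothetical helper modeling the behavior Dave observes when flags
// are applied through a class: the new value is ORed into the old one.
constexpr std::uint32_t merge_flags(std::uint32_t existing,
                                    std::uint32_t mode)
{
  return existing | mode;
}
```

With these definitions, `assign_flags(kAllowsBreakAfter, kEndsSentence)`
yields 1, whereas `merge_flags(kAllowsBreakAfter, kEndsSentence)` yields
5, which is the difference at issue.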

Let's go back to the groff 1.22.3 documentation, in case I messed it up
at some point in the past 8 years.

$ nroff -t -man -P-cbou ~/groff-1.22.3/share/man/man7/groff.7 \
  | sed -n '/\.cflags/,+1p'; \
  nroff -dAD=l -man -P -cbou ~/groff-1.22.3/share/man/man7/groff_diff.7 \
  | sed -n '/\.cflags/,+3p'
       .cflags mode c1 c2 ...
                 Treat characters c1, c2, ... according to mode number.
       .cflags n c1 c2 ...
              Characters c1, c2, ..., have properties determined by n, which
              is ORed from the following:

              be referred to from other requests easily (currently .cflags
              only).  Character ranges (indicated by an intermediate ‘-’) and
              nested classes are possible also.  This is useful to assign
              properties to a large set of characters.

Contrast what is said with what is not said.

"according to mode number", not "according to mode number and whatever
flags characters c1, c2, ... already carry".

"which is ORed from the following", not "which ORs each character's
existing flags with each of the following".

"useful to assign properties to a large set of characters", not "useful
to apply further properties to a large set of characters".

> So instead you might do this:
> 
> .class [EOS] \[em]
> .cflags 1 \C'[EOS]'
> 
> This has, since the first groff release where .class was introduced,
> had the observable effect of making the em dash recognized as ending a
> sentence (flag 1) while retaining its breaking behavior (as encoded in
> flag 4).
> 
> Why is this significant?  Because if you want to add a flag to a
> character's existing flags, there is no other way to do it.  The
> .cflags request only overwrites all existing flags; it does not have a
> syntax to preserve flags while adding a new one.  And the language has
> no mechanism for returning a character's current flags.  (Recent,
> post-1.23 groff builds include a request that lets you dump the flags,
> along with much other info about a character, to stderr, but groff is
> still unable to access that value from within a document.)  So you
> can't compute the flag you want to add ANDed with any existing flags.
> 
> So the current .class behavior provides useful functionality.  Thus I
> presumed the behavior was intended; but even if not, I see no benefit
> to altering it to fit a different model that makes impossible an
> operation that is currently possible.

I don't claim that you haven't identified a possibly desirable feature.
However, let me illustrate some cans of consistency worms it opens.

1.  Under your proposal, you can say `.cflags 512` on whatever character
    classes you like, and reading that request in your document or macro
    package _will not_ suffice to tell you what flags the characters in
    the class carry.  You might think they all have the 09 bit set.

But that might not be a good thing.

2.  Some character flags are mutually contradictory.

static void set_character_flags_request()
{
...
    if (((flags & charinfo::ENDS_SENTENCE)
          && (flags & charinfo::IS_TRANSPARENT_TO_END_OF_SENTENCE))
        || ((flags & charinfo::ALLOWS_BREAK_BEFORE)
          && (flags & charinfo::PROHIBITS_BREAK_BEFORE))
        || ((flags & charinfo::ALLOWS_BREAK_AFTER)
          && (flags & charinfo::PROHIBITS_BREAK_AFTER))) {
      warning(WARN_SYNTAX, "ignoring contradictory character flags: "
              "%1", flags);
      skip_line();
      return;
    }

You won't get those warnings with any version of groff when applying
`cflags` to a character class if the contradiction isn't in the "mode"
argument itself.

$ cat EXPERIMENTS/break-it.groff
.cflags 1 \[em] \" ends sentence
.class [transparent] \[em]
.cflags 32 \C'[transparent]' \" transparent to end of sentence
.\" What flags does `\[em]` have now?
.pchar \[em]

(I attach the output of various versions of groff corresponding to this
input as a semi-illuminating footnote.[1])
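
The gap can be modeled with a short sketch.  This is an illustration
under stated assumptions, not the formatter's implementation: the
namespace `flagbits` and the helpers `has_contradiction` and
`or_into_existing` are hypothetical, the bit values are those given in
the groff documentation for `cflags`, and the OR-merging step models
the class behavior Dave describes rather than any particular groff
release.

```cpp
#include <cstdint>

// Illustrative constants; the names echo the charinfo enumerators in
// the excerpt above, and the values follow the `cflags` documentation.
namespace flagbits {
constexpr std::uint32_t ENDS_SENTENCE = 1;
constexpr std::uint32_t ALLOWS_BREAK_BEFORE = 2;
constexpr std::uint32_t ALLOWS_BREAK_AFTER = 4;
constexpr std::uint32_t IS_TRANSPARENT_TO_END_OF_SENTENCE = 32;
constexpr std::uint32_t PROHIBITS_BREAK_BEFORE = 128;
constexpr std::uint32_t PROHIBITS_BREAK_AFTER = 256;
}

// Hypothetical helper: the same three-way test as the quoted
// condition, applied to an arbitrary flag word.
inline bool has_contradiction(std::uint32_t flags)
{
  using namespace flagbits;
  return ((flags & ENDS_SENTENCE)
          && (flags & IS_TRANSPARENT_TO_END_OF_SENTENCE))
         || ((flags & ALLOWS_BREAK_BEFORE)
             && (flags & PROHIBITS_BREAK_BEFORE))
         || ((flags & ALLOWS_BREAK_AFTER)
             && (flags & PROHIBITS_BREAK_AFTER));
}

// Hypothetical helper: flags ORed into a character's existing flags,
// as assumed for application through a class.
inline std::uint32_t or_into_existing(std::uint32_t existing,
                                      std::uint32_t mode)
{
  return existing | mode;
}
```

Checking only the request's argument, `has_contradiction(32)` is false,
so no warning would be issued; checking the merged result,
`has_contradiction(or_into_existing(1, 32))` is true.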

So what I think can happen is that, via the character class feature, a
character can be made to carry contradictory character flags.  Much of
the formatter's code in "env.cpp"--character flags affect breaking and
hyphenation decisions, except for the two "overlapping instances of
itself" bits--is written on the assumption that a character's flags are
in a well-defined state.

If they're not, that's bad.

That will lead to unpredictable formatter behavior.

It might be doing that already.  See Savannah #67570 and #67571.

What is the use case for applying flags to a character in blindness to
what flags are already present?

Is it really asking too much of the user to write:

.class [EOS] .?!\[em]
.cflags 1 \C'[EOS]'
.cflags 5 \[em]

instead of:

.class [EOS] .?!\[em]
.cflags 1 \C'[EOS]'

?

You note that the formatter lacks a means of querying a character's
flags, say by storing them in a register.  I agree.  In the absence
of my still-being-hammered-out `pchar` request initiative, is there
_any_ way to exfiltrate this information short of launching a debugger
or otherwise inspecting process memory?

I think my model/approach simplifies troubleshooting.

Regards,
Branden

[1]

$ ~/groff-1.22.3/bin/nroff -ww -Wspace EXPERIMENTS/break-it.groff
EXPERIMENTS/break-it.groff:5: warning: macro `pchar' not defined
$ ~/groff-1.22.4/bin/nroff -ww -Wspace EXPERIMENTS/break-it.groff
troff: EXPERIMENTS/break-it.groff:5: warning: macro 'pchar' not defined
$ ~/groff-1.23.0/bin/nroff -ww -Wspace EXPERIMENTS/break-it.groff
troff:EXPERIMENTS/break-it.groff:5: warning: macro 'pchar' not defined
$ ~/groff-HEAD/bin/nroff -ww -Wspace EXPERIMENTS/break-it.groff
special character 'em'
  is not translated
  has a macro: "file name": "tty.tmac", "starting line number": 22, "length": 10, "contents": "\\[em]\\[em]", "node list": [ ]
  special translation: 0
  hyphenation code: 0
  flags: 1 (ends sentence)
  asciify code: 0
  ASCII code: 0
  Unicode mapping: U+2014
  is found
  is transparently translatable
  is not translatable as input
  mode: normal

(The report of the character flags on `\[em]` might not be correct.
This groff is from my working copy and the immediate issues around
character flags are causing me to beat on the code pretty aggressively.)
