Hi Dave,

Werner seems not to be available, as it's now 2 weeks since I lit up the
bat-signal, and I haven't heard his cape flutter yet.
I suggest that it falls to us (and any interested list subscribers) to
find the way forward. (But I'm still CCing him, just in case.)

At 2025-12-01T00:37:42-0600, Dave Kemper wrote:
> On Tue, Nov 18, 2025 at 12:45 AM G. Branden Robinson
> <[email protected]> wrote:
> > Is it really asking too much of the user to write:
> >
> > .class \[EOS] .?!\[em]
> > .cflags 1 \C'[EOS]'
> > .cflags 5 \[em]
> >
> > instead of:
> >
> > .class \[EOS] .?!\[em]
> > .cflags 1 \C'[EOS]'
> >
> > ?
>
> Yes, it is. To set a new cflag on arbitrary character x, you're
> saying the user must examine every macro package that document loads
> to see if one of them already sets any flags on character x, so that
> he can do his own arithmetic to add those values to the new one he
> wants.

Not at all. To set a new character flag on an arbitrary character x,
one just does it. If one really means for `\[em]` to both end a
sentence (if followed by appropriate whitespace) and for the line to be
eligible for breaking after it, one says so.

.cflags 5 \[em]

Also, if one wants to know what flags a character possesses, one can ask
with the new `pchar` request. Not programmatically, no--there's no
mechanism for returning a character's assigned flags in a register,
say--but this is a more easily obtained insight than has been available
heretofore.

Do people really know some of the "character flag" properties they
desire for a special character, but not others? I think most groff
users don't mess with character flags at all.

I can _imagine_ this partial interest being the case for the
horizontal/vertical overlap flags--has anyone outside of the groff
development team _ever_ manipulated these? I'll return to this point
momentarily.

For all of the others, which involve the mutually implicating properties
of end-of-sentence significance and breaking eligibility, I think not.

Let's review what all the character flags are.

https://cgit.git.savannah.gnu.org/cgit/groff.git/tree/src/roff/troff/charinfo.h?id=198346d187de9e340bbf9d4f80c2dc4d42f5f74e#n46

(For non-source divers, I've put the full explanation in a footnote.[1])

All of these save two entail breaking or sentence termination
properties. I assert that these are mutually entailing because
breaking implies (potential) hyphenation, and hyphenation is never
applicable after the end of a sentence.

I therefore challenge you to identify a scenario where one is going to
assign a sentence termination property to a character in _indifference_
to its breaking properties.

Now, then, what about the other two properties--the self-overlapping
glyph properties?

Well, I'd agree that they are orthogonal to the other character flags.
In fact, I think they bear a closer resemblance to the "featural
description" properties declared in font description files.

groff_font(5):

    The directive charset starts the character set subsection. (On
    typesetters, this directive is misnamed since it starts a list of
    glyphs, not characters.) It precedes a series of glyph
    descriptions, one per line. Each such glyph description comprises
    a set of fields separated by spaces or tabs and organized as
    follows.

        name metrics type index [entity‐name] [-- comment]
    ...
    For fonts used with typesetters, the type field gives a featural
    description of the glyph: it is a bit mask recording whether the
    glyph is an ascender, descender, both, or neither. When a \w
    escape sequence is interpolated, these values are bitwise or‐ed
    together for each glyph and stored in the ct register. In font
    descriptions for terminals, all glyphs might have a type of zero,
    regardless of their appearance.

    0    means the glyph lies entirely between the baseline and a
         horizontal line at the “x‐height” of the font, as with “a”,
         “c”, and “x”;

    1    means the glyph descends below the baseline, like “p”;

    2    means the glyph ascends above the font’s x‐height, like “A”
         or “b”); and

    3    means the glyph is both an ascender and a descender--this is
         true of parentheses in some fonts.

I'd say, then, that we would do better to retire character flags 8 and
16, and make them featural "type" bits 4 and 8. Initially, I'd try to
get away with not adding a request to manipulate these bits, but if we
do, we can thereby make the 2 bits of a glyph's "type" runtime
programmable as well. (When assigning them, one would have to supply a
"resolved font name".[5])

I'd further add that for the HTML and terminal output devices, these
"overlap" character flags make promises the formatter can't keep.

> Two years later, when he needs to include a new macro package, he has
> to, first, remember that he used a .cflags request in this document
> two years ago, and second, audit the new macro package for any .cflags
> usage affecting character x.

...and this differs from anything else we might announce in the "NEWS"
file how, exactly?

Here's the sum total of `cflags` matches in the parts of our source that
contain macro packages.

$ git grep -w cflags contrib tmac
contrib/mom/ChangeLog: o Added .cflags 4 /\(en -- was driving me nuts that lines wouldn't
contrib/mom/NEWS:Added .cflags 4 /\(em to om.tmac. By default, mom now obligingly
contrib/mom/om.tmac:.cflags 4 /\[en] \" So slash and en-dashes get broken
tmac/an.tmac:.cflags 0 "
tmac/doc.tmac:.cflags 0 "
tmac/dvi.tmac:.cflags 8 \[an]
tmac/html.tmac:.cflags 0 -\[hy]\[em]\[en]
tmac/ja.tmac:.cflags 128 \C'[CJKprepunct]'
tmac/ja.tmac:.cflags 266 \C'[CJKpostpunct]'
tmac/ja.tmac:.cflags 512 \C'[CJKnormal]'
tmac/ps.tmac:.cflags 8 \[an]
tmac/zh.tmac:.cflags 128 \C'[CJKprepunct]'
tmac/zh.tmac:.cflags 266 \C'[CJKpostpunct]'
tmac/zh.tmac:.cflags 512 \C'[CJKnormal]'

(Do I spot a bug in the Chinese and Japanese "postpunct" flags here?[3])

Peter Schaffter struck the `\[en]` character with lightning--in mom
1.1.2, which was over 20 years ago (I see a date of 8 August 2004 on mom
1.2). He virtuously documented the change.

The `cflags` changes in our man(7) and mdoc(7) are very recent[2] and
might in fact be what got me looking into this stuff. One might argue
that they illustrate nothing vis-a-vis your argument involving use of
character classes, since they, well, don't, and secondarily, you're
championing a model of flagful character classes that can "OR" (set)
flags on but necessarily can't mask them off (which requires "AND" and
"NOT").

You may challenge me to document this change, which I'd characterize as
recognition of existing implied preference, in our "NEWS" file (and
therefore 1.24.0 release notes), but please see [2] first.

The others are all pretty old decisions.

$ for f in tmac/{dvi,html,ja,ps,zh}.tmac; do git blame "$f" | grep cflags; done
385c59960d tmac/dvi.tmac (Werner LEMBERG 2002-02-25 17:19:21 +0000 97) .cflags 8 \[an]
be5c0f723d tmac/html.tmac (Werner LEMBERG 2006-02-23 20:07:25 +0000 28) .cflags 0 -\[hy]\[em]\[en]
38e6049d0d (Werner LEMBERG 2010-12-18 09:13:18 +0000 47) .cflags 128 \C'[CJKprepunct]'
38e6049d0d (Werner LEMBERG 2010-12-18 09:13:18 +0000 48) .cflags 266 \C'[CJKpostpunct]'
38e6049d0d (Werner LEMBERG 2010-12-18 09:13:18 +0000 49) .cflags 512 \C'[CJKnormal]'
0fcc774e06 tmac/ps.tmac (Werner LEMBERG 2002-03-24 11:38:34 +0000 28) .cflags 8 \[an]
ab3cf0445c (Werner Lemberg 2015-04-30 09:42:22 +0200 46) .cflags 128 \C'[CJKprepunct]'
ab3cf0445c (Werner Lemberg 2015-04-30 09:42:22 +0200 47) .cflags 266 \C'[CJKpostpunct]'
ab3cf0445c (Werner Lemberg 2015-04-30 09:42:22 +0200 48) .cflags 512 \C'[CJKnormal]'

So our churn rate for cflags changes for a given macro package is 1-3
per somewhere between 10 years and never.[6]

And I'm not even proposing, at this time, to change any of these
`cflags` requests (modulo the bug I think I see in
\C'[CJKpostpunct]'). Even the "featural" idea I spitballed above would
affect only the "dvi" and "ps" devices.

> Whenever you find yourself telling a user he has to manually perform a
> set of algorithmic steps that a computer can do in a matter of
> nanoseconds, you should probably instead be telling yourself there's a
> design flaw. Computers, not humans, should be doing algorithmic
> tasks.

I think that's a better argument for championing a leading "+/-" syntax
extension to the first argument to `cflags` than to make official a dark
corner of GNU troff usage of which only you, as far as I know, have made
any use.

I observe that even Donald Knuth long ago resolved to surrender TeX and
Metafont to Hyrum's Law only at the time of his own death.

https://www.hyrumslaw.com/
https://www-cs-faculty.stanford.edu/~knuth/abcde.html ("Errata")

At 2025-12-01T03:56:39-0600, Dave Kemper wrote:
> On Tue, Nov 18, 2025 at 12:45 AM G. Branden Robinson
> <[email protected]> wrote:
> > However let me illustrate some cans of consistency worms it opens.
>
> To be clear, these cans are not being opened, but have been sitting
> open for a long time--yet we have no record of anyone complaining
> about them.

I agree with your observation but not with the inference you draw
thence.

> > So what I think can happen is that, via the character class feature,
> > a character can be made to carry contradictory character flags.
>
> I'm sure you're right, But that doesn't point to a design flaw, it
> points to a missing sanity check--a check that can be added without
> tossing out the design.

I'd say I'm proposing to _modify_ (and simplify) the design, and make
the formatter do what the most parsimonious reading of the documentation
suggests. If you don't want me to take my point about what the groff
1.22.3 documentation implied about character class behavior as
unrebutted, I'll have to ask that you rebut it.[4]

> > It might be doing that already. See Savannah #67570 and #67571.
>
> #67570 looks like it can be pinned on a recent refactor that snuck in
> an unintended code change. In any case, the observable behavior
> change only showed up since August and doesn't seem to have any real
> bearing on the design issues in this thread.

Agreed, and I've pushed a fix.

> #67571 remains undiagnosed and may yet reveal more cans that have been
> open long enough that the worms have had time to crawl away and evolve
> into weasels. Or it could be nothing more sinister than a misplaced
> closing brace.

I'll concede that that bug needs a root-cause analysis.
However, I submit that even short of that, we know enough to conclude,
with substantial if not perfect confidence, that groff developers
neglected to fully audit what was in those cans in the first place. We
may have thought they were nice beneficent earthworms for composting,
but they may have been cnidarians living symbiotically with botulinum.

Doesn't it look to you like this feature might not have gotten enough
testing when it landed?

It does to me, and that is why I fulsomely praised Bertrand (and would
do so again) for integrating an automated test harness into groff's
build system. He made it straightforward for a dopey old shell script
programmer like me to feel confident in groff's behavior, and
demonstrate that confidence to others (by asking them to run "make
check" for themselves).

Regards,
Branden

[1] groff_diff(7):

    .cflags n c1 c2 ...
        Assign properties encoded by the number n to characters c1,
        c2, and so on.

    Characters, whether ordinary, special, or indexed, have certain
    associated properties. The first argument is the sum of the
    desired flags and the remaining arguments are the characters to be
    assigned those properties. Spaces need not separate the cn
    arguments. Any argument cn can be a character class defined with
    the class request rather than an individual character.

    The non‐negative integer n is the sum of any of the following.
    Some combinations are nonsensical, such as “33” (1 + 32).

    1    Recognize the character as ending a sentence if followed by a
         newline or two spaces. Initially, characters “.?!” have this
         property.

    2    Enable breaks before the character. A line is not broken at a
         character with this property unless the characters on each
         side both have non‐zero hyphenation codes. This exception can
         be overridden by adding 64. Initially, no characters have
         this property.

    4    Enable breaks after the character. A line is not broken at a
         character with this property unless the characters on each
         side both have non‐zero hyphenation codes.
         This exception can be overridden by adding 64. Initially,
         characters “-\[hy]\[em]” have this property.

    8    Mark the glyph associated with this character as overlapping
         other instances of itself horizontally. Initially, characters
         “\[ul]\[rn]\[ru]\[radicalex]\[sqrtex]” have this property.

    16   Mark the glyph associated with this character as overlapping
         other instances of itself vertically. Initially, the
         character “\[br]” has this property.

    32   Mark the character as transparent for the purpose of
         end‐of‐sentence recognition. In other words, an
         end‐of‐sentence character followed by any number of
         characters with this property is treated as the end of a
         sentence if followed by a newline or two spaces. This is the
         same as having a zero space factor in TeX. Initially,
         characters “'")]*\[dg]\[dd]\[rq]\[cq]” have this property.

    64   Ignore hyphenation codes of the surrounding characters. Use
         this value in combination with values 2 and 4. Initially, no
         characters have this property.

    The remaining values were implemented for East Asian language
    support; those who use alphabetic scripts exclusively can disregard
    them.

    128  Prohibit a break before the character, but allow a break after
         the character. This works only in combination with values 256
         and 512 and has no effect otherwise. Initially, no characters
         have this property.

    256  Prohibit a break after the character, but allow a break before
         the character. This works only in combination with values 128
         and 512 and has no effect otherwise. Initially, no characters
         have this property.

    512  Allow a break before or after the character. This works only
         in combination with values 128 and 256 and has no effect
         otherwise. Initially, no characters have this property.

    In contrast to values 2 and 4, the values 128, 256, and 512 work
    pairwise. If, for example, the left character has value 512, and
    the right character 128, no break will be automatically inserted
    between them.
    If we use value 6 instead for the left character, a break after
    the character can’t be suppressed since the neighboring character
    on the right doesn’t get examined.

[2] commit ae0adea18c, 2025-09-18

$ sed -n '1469,+3p' tmac/an.tmac
.\" Man pages more often use the neutral double quote `"` as a "code
.\" literal" than as a quotation character. Give it the same (empty)
.\" set of character flags as its special character equivalent, \[dq].
.cflags 0 "

[3] Shouldn't these CJK glyphs be "256" instead of "266"? Decomposed
    into powers of 2, 266=256+8+2.

Let's use groff Git HEAD to ask the Chinese macro package what
characters are in this class.

$ ./build/test-groff -Tutf8 -mzh
.pchar \C'[CJKpostpunct]'
character class '[CJKpostpunct]'
  defined at: file name: "/home/branden/src/GIT/groff/build/../tmac/zh.tmac", line number: 39
  contains ranges: U+201C U+3008 U+300A U+300C U+300E U+3010 U+FF08
  contains nested classes: (none)

Should these CJK post-punctuation glyphs have the flags 2 ("allows
breaks before the character") and 8 ("overlaps copies of itself
horizontally")? Really? For U+201C?

$ for c in 201C 3008 300A 300C 300E 3010 FF08; do unicode U+$c | head -n 3; done
U+201C LEFT DOUBLE QUOTATION MARK
UTF-8: e2 80 9c UTF-16BE: 201c Decimal: &#8220; Octal: \020034
“
U+3008 LEFT ANGLE BRACKET
UTF-8: e3 80 88 UTF-16BE: 3008 Decimal: &#12296; Octal: \030010
〈
U+300A LEFT DOUBLE ANGLE BRACKET
UTF-8: e3 80 8a UTF-16BE: 300a Decimal: &#12298; Octal: \030012
《
U+300C LEFT CORNER BRACKET
UTF-8: e3 80 8c UTF-16BE: 300c Decimal: &#12300; Octal: \030014
「
U+300E LEFT WHITE CORNER BRACKET
UTF-8: e3 80 8e UTF-16BE: 300e Decimal: &#12302; Octal: \030016
『
U+3010 LEFT BLACK LENTICULAR BRACKET
UTF-8: e3 80 90 UTF-16BE: 3010 Decimal: &#12304; Octal: \030020
【
U+FF08 FULLWIDTH LEFT PARENTHESIS
UTF-8: ef bc 88 UTF-16BE: ff08 Decimal: &#65288; Octal: \0177410
（

I see yet another bug here. How about you?

[4] https://lists.gnu.org/archive/html/groff/2025-11/msg00029.html

[5] groff(7):

    Using fonts
    ...
    Like the AT&T troff formatter, GNU troff does not itself load or
    manipulate a digital font file; instead it works with a font
    description file that characterizes it, including its glyph
    repertoire and the metrics (dimensions) of each glyph. This
    information permits the formatter to accurately place glyphs with
    respect to each other. Before using a font description, the
    formatter associates it with a mounting position, a place in an
    ordered list of available typefaces. So that a document need not
    be strongly coupled to a specific font family, in GNU troff an
    output device can associate a style in the abstract sense with a
    mounting position. Thus the default family can be combined with a
    style dynamically, producing a resolved font name. A
    user‐specified font name that combines family and style, or refers
    to a font that is not a member of a family, is already “resolved”.

[6] Checking commit history and Werner's commit message after
    introducing the CJK-motivated character flags, I'm increasingly
    confident that a typo snuck in.

commit 38e6049d0d1ad035e6a562c285dc6017530f5745
Author: Werner LEMBERG <[email protected]>
Date: Sat Dec 18 09:13:18 2010 +0000

    Improve CJK support with new values for `.cflags'.

    This patch introduces three new values to `.cflags':

      don't break before character: 128
      don't break after character: 256
      allow inter-character break: 512
...
-.cflags 2 \C'[CJKprepunct]'
-.cflags 4 \C'[CJKpostpunct]'
-.cflags 66 \C'[CJKnormal]'
+.cflags 128 \C'[CJKprepunct]'
+.cflags 266 \C'[CJKpostpunct]'
+.cflags 512 \C'[CJKnormal]'
...

...because '66' and '266' look more similar in decimal than binary.
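
For anyone who wants to check the arithmetic in [3] and [6] without
doing it by hand, here is a minimal Python sketch. It is not part of
groff and does not query the formatter; it only splits a `cflags` sum
into its power-of-2 components. The names `CFLAGS`, `decode_cflags`, and
`describe` are made up for illustration, and the descriptions are my
paraphrase of the groff_diff(7) text in [1].

```python
# Hypothetical helper, not part of groff: split a `cflags` sum into the
# individual flag values listed in the groff_diff(7) excerpt in [1].
CFLAGS = {
    1: "ends a sentence",
    2: "enable break before",
    4: "enable break after",
    8: "overlaps itself horizontally",
    16: "overlaps itself vertically",
    32: "transparent for end-of-sentence recognition",
    64: "ignore surrounding hyphenation codes",
    128: "CJK: prohibit break before",
    256: "CJK: prohibit break after",
    512: "CJK: allow break before or after",
}


def decode_cflags(n):
    """Return the documented flag values whose sum is n, in ascending order.

    Bits outside the documented set are silently ignored.
    """
    if n < 0:
        raise ValueError("cflags values are non-negative")
    return [bit for bit in sorted(CFLAGS) if n & bit]


def describe(n):
    """Render the flags in n as a human-readable, comma-separated string."""
    parts = [f"{bit} ({CFLAGS[bit]})" for bit in decode_cflags(n)]
    return ", ".join(parts) or "(no flags)"


if __name__ == "__main__":
    # 66 and 266 are the values from the 2010 commit; 256 is the value
    # the commit message describes for "don't break after character".
    for value in (66, 266, 256):
        print(f"{value:>3} = {value:#012b}: {describe(value)}")
```

Running it prints the binary form next to the decoded flags, which makes
the point in [6] visible: 66 and 266 differ by one decimal digit but
share only the "2" bit.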
