Hi Keith,

I want to address only one part of your message, as other folks seem to
be handling the remainder just fine.

At 2026-04-12T16:32:55-0700, Keith Thompson wrote:
> Apparently groff doesn't do well with UTF-8 input. I'd like to
> see that changed, but I don't know nearly enough about groff to
> even start that work, or to speculate about whether it would be a
> good idea.

This has been a goal of groff's developers for many years, since well
before I joined them.

https://savannah.gnu.org/bugs/?40720

Also see:

https://www.gnu.org/software/groff/groff-mission-statement.html

The reason that this goal hasn't been achieved yet, in my opinion, is
that James Clark gambled in 1989 on possible future configurations of
character encoding popularity--I suspect to minimize groff programs'
memory requirements and avoid critiques of consequent reduced
performance--and lost, because Unicode happened.

The presumption that a single datum of the C/C++ `unsigned char` type is
adequate to represent any desired character code on input is deeply
woven into groff's architecture.
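
To make the size problem concrete, here is a minimal sketch, not groff
code: the function name `decode_utf8` and its interface are made up for
illustration.  It decodes one UTF-8 sequence into a code point.  Each
input byte fits in an `unsigned char`, but the decoded character code
often does not.  The sketch checks only the sequence structure; it does
not reject overlong encodings or surrogate code points.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper for illustration only; not part of groff.
   Decode one UTF-8 sequence from the `len` bytes at `s` into `*cp`.
   Return the number of bytes consumed, or 0 if the input is empty,
   truncated, or structurally malformed.  Overlong encodings and
   surrogate code points are NOT rejected in this sketch. */
static size_t decode_utf8(const unsigned char *s, size_t len,
                          unsigned long *cp)
{
  assert(s != NULL && cp != NULL);
  if (len == 0)
    return 0;

  unsigned char lead = s[0];
  size_t n;        /* total bytes in this sequence */
  unsigned long v; /* code point accumulated so far */

  if (lead < 0x80) {                /* 0xxxxxxx: ASCII */
    n = 1; v = lead;
  } else if ((lead & 0xE0) == 0xC0) { /* 110xxxxx: 2 bytes */
    n = 2; v = lead & 0x1F;
  } else if ((lead & 0xF0) == 0xE0) { /* 1110xxxx: 3 bytes */
    n = 3; v = lead & 0x0F;
  } else if ((lead & 0xF8) == 0xF0) { /* 11110xxx: 4 bytes */
    n = 4; v = lead & 0x07;
  } else {
    return 0;                       /* stray continuation or bad lead */
  }

  if (len < n)
    return 0;                       /* truncated sequence */

  for (size_t i = 1; i < n; i++) {
    if ((s[i] & 0xC0) != 0x80)      /* must be 10xxxxxx */
      return 0;
    v = (v << 6) | (unsigned long)(s[i] & 0x3F);
  }

  *cp = v;
  return n;
}
```

Given the three bytes E2 82 AC (the euro sign), this returns 3 and
stores 0x20AC, a value that no single `unsigned char` can hold on a
platform with 8-bit bytes.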

I've been working my way through the code to annotate and in some cases
remove barriers to GNU troff's acceptance of UTF-8 input, but it is slow
going and there are many frustrations.  Here's an exhibit that came up
recently.

https://savannah.gnu.org/bugs/?68230

Also, my efforts to date to prepare for a UTF-8 input future have not
gone without occasional complaint.

https://lists.gnu.org/archive/html/groff/2026-03/msg00001.html

One point I could add or clarify here is that as the code base moves
toward expecting UTF-8/Unicode input, inferring character properties
from manually maintained sets of numeric tests of character codes
becomes dramatically less tractable.

When dealing with Unicode input, it's hard to keep one's sanity without
relying on library functions that classify characters according to
various properties.

As a simplified analogy, handling of Unicode character streams makes
writing

  iscntrl(c)

rather than

  if (c < 32)

a matter of survival rather than elegance.
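
Even within ASCII the two forms disagree.  Here is a small sketch; the
wrapper names `is_control_by_range` and `is_control_by_library` are
invented for this illustration and are not groff functions.  It
assumes the default "C" locale and a character code in the range 0
through 255.

```c
#include <ctype.h>

/* Hypothetical wrappers for illustration only; not groff code. */

/* The hand-maintained numeric test: true for codes 0 through 31. */
static int is_control_by_range(int c)
{
  return c >= 0 && c < 32;
}

/* The library classification.  iscntrl() requires an argument that
   is representable as unsigned char (or EOF), so this sketch accepts
   only 0..255 and reports anything else as "not a control". */
static int is_control_by_library(int c)
{
  if (c < 0 || c > 255)
    return 0;
  return iscntrl(c) != 0;
}
```

In the "C" locale both functions agree on codes 0 through 31, but only
the library version reports DEL (127) as a control character.  For
character codes beyond a single byte, the wide-character counterpart
is `iswcntrl()` from <wctype.h>, whose answers depend on the current
locale.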

That's one reason I felt it necessary to disappoint John Gardner, who
values being able to use control characters in the names of his *roff
registers, strings, and macros.

Regards,
Branden
