Re: character translation, hyphenation, and adjustment (was: Do Latin-2-based hyphenation files work with Unicode?)

onf Thu, 14 Nov 2024 13:43:29 -0800

Hi Branden,

On Thu Nov 14, 2024 at 2:08 AM CET, G. Branden Robinson wrote:
> [...]
> > It's not so long ago I saw some mentions of support for
> > the \[u_...] characters being added to some driver,
>
> You might be thinking of this:
>
> commit a6289c1508acf31dce73da2ffa9e7de102986298
> Author: G. Branden Robinson <[email protected]>
> Date:   Wed Aug 21 08:40:27 2024 -0500
>
>     font/devps/ZD: Regen from updated dingbats.map.
>
>     * font/devps/ZD: Regenerate using updated dingbats.map.
>
>     Fixes <https://savannah.gnu.org/bugs/?63018>.  Thanks to Deri James and
>     Dave Kemper for (extensive) consultation.
>
> ...of which part of the commit's diff looks like:
>
> +u27BA  831,579 3       250     a187
> +u27BB  873,578 3       251     a188
> +u27BC  927,542 3       252     a189
> +u27BD  970,616 3       253     a190
> +u27BE  918,593 3       254     a191


I was actually thinking of this:
*  GNU troff now performs some limited processing/transformation of the
   argument to the `\X` escape sequence and its counterpart `device`
   request, to address the requirement that some documents have to pass
   metadata that must encode non-ASCII characters in device extension
   commands.  (For example, a document author may desire a document's
   section headings containing non-ASCII code points to appear correctly
   in PDF bookmarks.  Further, GNU troff encodes its output page
   description language only in ASCII.)  This change is expected to be
   of significance mainly to developers of output drivers for groff;
   groff_diff(7) describes the transformations.  If you have been using
   `\X` or `.device` to pass ASCII data to the output driver as a device
   extension command and require that it remain precisely as-is, use the
   `\!` escape sequence or `output` request, and prefix your data with
   "x X ", the device-independent troff means of expressing a device
   extension command (see groff_out(5)).

I remembered it had something to do with asciify and either grops or
gropdf, but forgot the rest...

> > so I figured it might for some reason be much easier than proper UTF-8
> > support.
>
> That's a different part of the problem.  We can express any Unicode code
> point in GNU troff _output_.  The reason people say "groff doesn't
> support UTF-8" is that GNU troff, the formatter program specifically,
> does not correctly interpret UTF-8-encoded input files.
> [...]

I know, I know. I guess I just don't understand how groff has had
Unicode output for so long and yet input is still lacking. To me,
adding UTF-8 support to a program in C means changing char to
uint32_t and adding conversion from UTF-8 strings to Unicode
codepoints to the parts that read data in
(i.e. char[1..4] -> uint32_t).
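
To be concrete about what I mean by char[1..4] -> uint32_t, here is a
minimal sketch in C. It is only an illustration of the conversion step,
not how groff, Heirloom troff, or neatroff actually do it; the names
utf8_decode and utf8_to_codepoints are made up for this example, and
rejecting malformed input outright (rather than substituting U+FFFD) is
an assumption on my part.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Decode one UTF-8 sequence starting at s, with n bytes available.
 * On success, store the code point in *cp and return the number of
 * bytes consumed (1..4).  Return 0 if the input is empty, truncated,
 * or malformed (bad lead byte, bad continuation byte, overlong form,
 * surrogate, or value above U+10FFFF).  Hypothetical helper. */
static size_t utf8_decode(const unsigned char *s, size_t n, uint32_t *cp)
{
    /* Smallest code point that legitimately needs 2, 3, or 4 bytes. */
    static const uint32_t min_for_len[5] = { 0, 0, 0x80, 0x800, 0x10000 };
    size_t len, i;
    uint32_t c;

    if (n == 0)
        return 0;
    if (s[0] < 0x80) {                    /* 0xxxxxxx: plain ASCII */
        *cp = s[0];
        return 1;
    } else if ((s[0] & 0xE0) == 0xC0) {   /* 110xxxxx: 2-byte sequence */
        len = 2;
        c = s[0] & 0x1F;
    } else if ((s[0] & 0xF0) == 0xE0) {   /* 1110xxxx: 3-byte sequence */
        len = 3;
        c = s[0] & 0x0F;
    } else if ((s[0] & 0xF8) == 0xF0) {   /* 11110xxx: 4-byte sequence */
        len = 4;
        c = s[0] & 0x07;
    } else {
        return 0;                         /* stray continuation or invalid lead */
    }

    if (n < len)
        return 0;                         /* truncated sequence */
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)        /* must be 10xxxxxx */
            return 0;
        c = (c << 6) | (uint32_t)(s[i] & 0x3F);
    }
    if (c < min_for_len[len])             /* overlong encoding */
        return 0;
    if (c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
        return 0;                         /* out of range or surrogate */

    *cp = c;
    return len;
}

/* Convert a NUL-terminated UTF-8 string into an array of code points.
 * Return the number of code points written, or (size_t)-1 if the input
 * is malformed or does not fit in cap entries.  Hypothetical helper. */
static size_t utf8_to_codepoints(const char *s, uint32_t *out, size_t cap)
{
    const unsigned char *p = (const unsigned char *)s;
    size_t left = strlen(s), count = 0;

    while (left > 0) {
        uint32_t cp;
        size_t used = utf8_decode(p, left, &cp);

        if (used == 0 || count == cap)
            return (size_t)-1;
        out[count++] = cp;
        p += used;
        left -= used;
    }
    return count;
}
```

Feeding utf8_to_codepoints the three bytes "a", 0xC5, 0xBE yields two
code points, U+0061 and U+017E. Of course this only covers the reading
side; everything downstream that assumes one char per character would
still have to be switched over to the wider type.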

I realize groff does some pretty complex text processing and it's
C++, but still I wouldn't expect it to be so complex given that
both Heirloom troff and neatroff have UTF-8 input support -- and 
those are essentially one-man projects (especially the latter).

> [...]
> > Perhaps, but you said it works fine for "temporary disablement with
> > `nh`". Disabling hyphenation once and for all does not classify as
> > temporary disablement, imho.
>
> You're kind of confusing me here.  Whether changing the line length with
> `ll` is "temporary" or not depends on whether you issue a subsequent
> request to do so.  In _this_ respect, disabling hyphenation is no more
> or less permanent than most other operations in troff.
> [...]

When you say it "works fine for temporar[ily] disabl[ing]" hyphenation,
I expect there to be some simple way to disable hyphenation and then
restore the exact state it had before. That's not the case, as we've
discussed for a while now. Compare with .na and .ad, which actually DO
work fine even though they can be confusing to beginners.

> [...]
> It's also okay to ask others.  That's one of the reasons this mailing
> list is here.  Also, occasionally something is hard because troff's
> design isn't everything it could be.

The things I tend to get stuck on are either:
 * a complex macro breaking because I made several oversights (or poor
   decisions) when writing it, and its complexity makes fixing it
   at least an hour-long task; these experiences have taught me to
   make macros as simple as possible and to not try to automate
   everything (because fixing it is much harder when it breaks)
   [and as a result I tend to run more into the next one instead...]

 * basic troff syntax breaking inside my several-hundred-line
   macro package, but working just fine when I copy it elsewhere;
   in other words, bugs I can't reproduce separately from the rest
   of the macro file

The latter is worse. I have repeatedly run into an if-elsif-else
conditional not working correctly within a macro file that's loaded
with .so, but working just fine when I paste the macro definition
containing it into a new document.

I guess if it happens again I might seek your support; so far I
never really felt like spending more time messing around with
troff's horrible conditionals.

> [...]
> > My proposal was based on the assumption that maintaining compatibility
> > with other troffs is desired.
>
> I'm concerned mainly with compatibility only with AT&T troff.

I see. I have looked at the adjustment/alignment proposal again.

It makes sense, although I disagree with the addition of .adjust.
It seems unnecessary to me given that .fi doesn't accept a boolean
argument either. To me, the changes which allow .ad/.na to be used
just like .fi/.nf are enough.

Given that these changes make .ad finally true to its mnemonic of
"adjust", I would suggest renaming .align to .al because:
  * it matches the naming scheme used with .ad
  * it seems more natural given the arguments are single characters:
    compare .al r with .align r (one would expect .align right)
  * short names make more sense for basic functions that are expected
    to be used often such as adjustment, alignment, filling, and
    various font properties (and all of them currently have them)
  * even many requests added by groff use aggressively shortened
    names (.als instead of .alias being the most salient example),
    so it cannot be argued that long names are somehow preferred

Yes, I know I could do .als al align. It's just that I wish I didn't
have to type that at the top of each document I write in plain troff.

And given how many other basic functions are provided as two-letter
requests, I don't think making this one easier to remember for
beginners would be of much value; they will have to remember all the
other ones (or create aliases for them) anyway.

> Heirloom Doctools troff and neatroff both came along much later and
> I'm not aware that a large corpus of documents has ever been written
> specifically for them. [...]

There is also Plan 9 troff, which seems to be a descendant of AT&T
troff with UTF-8 support. Its changelog makes for a fun read:

  December 18, 1992:
          Some people have complete novels as comments, so we need
          to skip comments while checking the legality of font files.
          thaks Rixh
  
  May 12, 1993:
  
      Syntax change
  
          Some requests accept tabs as a separator, some don't and
          this can be a nuisance.  Now a tab is also recognized as
          an argument separator for requests, this makes
  
                  .so     /dev/null
  
          works.
  
          To be more precise, any motion character is allowed, so
  
                  .so\h'5i'/dev/null
  
          will work as well, if one really wants that.
  
          It will be a problem for users who really relied on this as in
  
                  .ds x   string
  
          and expect the tab to become part of the string a, but I haven't
          seen any use of that (obscure trick).

... :)

~ onf
