Re: Do Latin-2-based hyphenation files work with Unicode?

G. Branden Robinson Wed, 13 Nov 2024 07:52:41 -0800

Hi onf,

At 2024-11-12T16:43:56+0100, onf wrote:
> On Tue Nov 5, 2024 at 8:15 PM CET, onf wrote:
> > as the title says. If I use UTF-8 via preconv and request
> >   .hy 2
> >   .hpf hyphen.cs
> > will that work, given that the file is using the Latin-2 encoding
> > for characters with diacritics? If not, what changes need to be done?
> 
> I did a little bit of testing. The hyphenation patterns work correctly
> with UTF-8, but ONLY if Latin2 is loaded, like so:
>   .mso latin2.tmac


That's as I would expect.

> and hyphenation codes must be specified, like so for Latin2 Czech:
>   .hcode <e1> <e1>  <c1> <e1>
>   .hcode <e8> <e8>  <c8> <e8>
>   .hcode <ef> <ef>  <cf> <ef>
>   .hcode <e9> <e9>  <c9> <e9>
>   .hcode <ec> <ec>  <cc> <ec>
>   .hcode <ed> <ed>  <cd> <ed>
>   .hcode <f2> <f2>  <d2> <f2>
>   .hcode <f3> <f3>  <d3> <f3>
>   .hcode <f8> <f8>  <d8> <f8>
>   .hcode <b9> <b9>  <a9> <b9>
>   .hcode <bb> <bb>  <ab> <bb>
>   .hcode <fa> <fa>  <da> <fa>
>   .hcode <f9> <f9>  <d9> <f9>
>   .hcode <fd> <fd>  <dd> <fd>
>   .hcode <be> <be>  <ae> <be>

Yes.  That is because GNU troff, the formatter, cannot handle UTF-8
input; it assumes an 8-bit character encoding, and so hyphenation codes
must be assigned within an 8-bit integer range.

> Without loading the latin2.tmac file, it doesn't hyphenate correctly.
> Given that latin2.tmac specifies a bunch of translations which convert
> Latin2 bytes into respective character codes, e.g.:
>   .trin \[char248]\[r ah]
> my guess is that these translations enable the Latin2 bytes in
> hyphen.cs to be converted to their character counterparts, which the
> UTF-8 codes are converted to as well, so that in the end both input
> methods result in the same glyph. [Pardon my inadequate terminology.]

Your explanation sounds correct to me.

I recently spent some time learning these matters the hard way.

https://savannah.gnu.org/bugs/?66051

> Latin1 characters continue working even when loading Latin2 as long as
> they are specified as the respective UTF-8 codes.

And they _should_ continue to hyphenate at appropriate locations because
a set of hyphenation codes is associated with the hyphenation
_language code_ ("en", "cs", "fr", etc.), which can change from
environment to environment.

A long-standing bug made the hyphenation language code inadvertently
global rather than bound to the environment.  But the hyphenation codes
(and patterns) were never thrown away; you simply would have had to
remember to invoke the `hla` request by hand when switching
environments.  groff 1.24 will make that unnecessary.
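
A minimal sketch of that manual workaround, assuming groff 1.23 (where
the English patterns are registered under "en") and a hypothetical
environment named "quote":

```roff
.\" Sketch only.  "quote" is a hypothetical environment name.
.\" Assumes groff 1.23, where the hyphenation language is global.
.mso cs.tmac
Czech text here is hyphenated with the "cs" patterns.
.ev quote
.\" Select English by hand for the text set in this environment.
.hla en
English text here is hyphenated with the "en" patterns.
.br
.ev
.\" Back in the previous environment: restore Czech by hand.
.hla cs
Czech text again.
```

With groff 1.24 the final `hla` should no longer be needed, since the
language code will be bound to the environment.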

https://savannah.gnu.org/bugs/?66387

> My conclusion is that, given the intricacies of all this, loading the
> appropriate localization file is THE way to setup hyphenation
> correctly.

Yes!  Our documentation does actually try to get this idea across.  If
there are spots where you feel it is failing to do so, please bring them
to my attention (but also base your recommendations on groff Git--I
revise documentation all the time).

> I feel like splitting the hyphenation part of localization files off
> (into hycs.tmac etc.) would be beneficial in that one could load the
> hyphenation settings for a given language without all the localization
> strings.

This, I'm less sure about.  The localization strings are namespaced, so
the only real advantage to separating them is a minuscule reduction in
formatter startup time, about which I have never read any complaints.

The one groff locale that would really benefit from your suggestion,
though, is (new to 1.24) "ru" (Russian).  Here's why:

https://git.savannah.gnu.org/cgit/groff.git/commit/?id=f486938c51ca3a39a8e9b46d3422e3a25ae4bd1c

> Groff's documentation of hyphenation could then be updated
> with a simple mention of
>   .mso hycs.tmac
> before specifying the technical details (.hy, .hla, .hpf, ...) which
> ordinary users won't need to deal with.

The existing recommendation for localization is to specify loading of
the groff locale via the command-line `-m` option, _after_ loading any
full-service package.
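
A minimal sketch of that recommendation, assuming the ms package and a
hypothetical input file named declaration.ms.  The command line appears
in a comment because the document itself then needs no `mso` request:

```roff
.\" declaration.ms (hypothetical file name)
.\" Format with:  groff -K utf8 -ms -m fr declaration.ms
.\" The order matters: "-ms" loads the package, then "-m fr" the locale.
.LP
Les représentants du peuple français, constitués en Assemblée nationale.
```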

There is nevertheless _some_ flexibility.  A document could load the
localization file itself:

$ groff -K utf8 -ms <<EOF
.mso fr.tmac
.LP
Les représentants du peuple français, constitués en Assemblée nationale,
considérant que l'ignorance, l'oubli ou le mépris des droits de l'homme
sont les seules causes des malheurs publics et de la corruption des
gouvernements, ont résolu d'exposer, dans une déclaration solennelle,
les droits naturels, inaliénables et sacrés de l'homme, afin que cette
déclaration, constamment présente à tous les membres du corps social,
leur rappelle sans cesse leurs droits et leurs devoirs\~; afin que les
actes du pouvoir législatif, et ceux du pouvoir exécutif, pouvant être à
chaque instant comparés avec le but de toute institution politique, en
soient plus respectés\~; afin que les réclamations des citoyens, fondées
désormais sur des principes simples et incontestables, tournent toujours
au maintien de la Constitution et au bonheur de tous.
EOF

Or it could load both the desired package _and_ the localization file,
in that order:

$ groff -K utf8 <<EOF
.mso s.tmac
.mso fr.tmac
.LP
Les représentants du peuple français, constitués en Assemblée nationale,
considérant que l'ignorance, l'oubli ou le mépris des droits de l'homme
sont les seules causes des malheurs publics et de la corruption des
gouvernements, ont résolu d'exposer, dans une déclaration solennelle,
les droits naturels, inaliénables et sacrés de l'homme, afin que cette
déclaration, constamment présente à tous les membres du corps social,
leur rappelle sans cesse leurs droits et leurs devoirs\~; afin que les
actes du pouvoir législatif, et ceux du pouvoir exécutif, pouvant être à
chaque instant comparés avec le but de toute institution politique, en
soient plus respectés\~; afin que les réclamations des citoyens, fondées
désormais sur des principes simples et incontestables, tournent toujours
au maintien de la Constitution et au bonheur de tous.
EOF

I believe the foregoing is the approach Dave Kemper prefers.

Regards,
Branden
