Re: character translation, hyphenation, and adjustment (was: Do Latin-2-based hyphenation files work with Unicode?)

onf Wed, 13 Nov 2024 15:02:47 -0800

Hi Branden,

On Wed Nov 13, 2024 at 9:48 PM CET, G. Branden Robinson wrote:
> At 2024-11-13T20:42:57+0100, onf wrote:
> > [...]
> > I never said it's not used like that. I just meant to say that groff(7)
> > suggests the translation happens at the moment the character is
> > formatted for output rather than at the moment it is read in:
> >   .tr abcd...
> >       Translate ordinary or special characters a to b, c to d, and
> >       so on PRIOR TO OUTPUT. [emphasis added]
> > 
> > which is why I wondered about the things you quote below.
>
> I guess I interpret those words more generally than you do.  To me,
> "prior to output" can mean _any time_ prior to output (once the
> formatter has started running), and you seem to be inferring some later
> stage of processing.


To me, output seems to imply putting words on a page. I haven't given
the word's meaning in the context of troff much thought before, though.

> [...]
> > I was just trying
> > to make it work that way because that's what we have now, but it
> > would obviously be much better if one could use UTF-8 directly in
> > the hyphenation files (or at least the \[u...] characters) without
> > having to jump through all these hoops.
>
> We get our hyphenation pattern files from various TeX-related projects.
> I observe that these either use a "native" 8-bit encoding or
> (increasingly as years pass) UTF-8.  Implementing support for spelling
> non-ASCII characters with \[uXXXX] escape sequences seems like a detour
> to me.  If we solve the formatter's UTF-8 reading disability, the
> problems with the character encoding of hyphenation pattern files should
> pretty much disappear.

I don't disagree. I just wasn't sure how complex adding full UTF-8
support would be. Not long ago I saw some mentions of support for
the \[u_...] characters being added to some driver, so I figured it
might for some reason be much easier than proper UTF-8 support.

> [...]
> > I eventually realized that .ad is not meant to switch back-and-forth
> > between adjustment modes, but to restore adjustment after it was
> > disabled with .na.
>
> Right!  That's how it was born.  See my email from last month about
> Sixth Edition troff and the telling shape of the `ad` request at that
> time,[1] before Seventh Edition and CSTR #54 came along and a sort of
> religious cult organized around the idea that troff had worked that way
> all the way back to 1971.

They should have called .ad without arguments .ra (restore adjustment,
like .rs) and we could have avoided all that confusion.

> [...]
> > > [...]
> > > I think these are horrible warts in the *roff language that an
> > > iconoclast should have smashed years ago.  But they work fine for the
> > > most common cases (temporary disablement with `nh` and `na`,
> > > respectively) [...]
> > 
> > I would disagree it works fine for temporary disablement with .nh;
> > see above.
>
> It does if you're using AT&T troff with its bespoke hyphenation system,
> or you're a man page author who either hates automatic hyphenation or
> doesn't pay very close attention to where hyphenation breaks occur.

Perhaps, but you said it works fine for "temporary disablement with
`nh`". Disabling hyphenation once and for all does not qualify as
temporary disablement, imho. And it working poorly and the user just
not noticing doesn't mean it works fine, either.

> [...]
> > Sounds like what .hy should have been doing from the beginning :)
>
> Or it should have worked like .ev, .ft, .in, .ll, .ls, .lt, .po, .ps, or
> .vs--yes.  And had an introspection register, damn it.

That sounds too good; I would be happy if it at least worked
like .ad does... (:

> Speaking of introspection, I've added several new requests to the GNU
> troff language for 1.24.  I barrelled ahead with these because they have
> no effect on formatter state--they are there solely to help a person
> troubleshooting groff, a macro package, or a document figure out what's
> going on.

Cool.

> In my opinion, a person who is not developing groff itself
> should never have to launch GDB to discover relevant state about how
> their document is being formatted.  All too often, though, that is the
> case.  Less often in 1.24, I hope.
> [...]

In my case, if I really can't figure out a way to make something work,
I give up, figuring that a less complex approach will be easier to
work with and debug in the future anyway.

> > > I have plans to fix the argumentless `ad` request, but just today I
> > > decided to kick that out past 1.24.
> > >
> > > https://savannah.gnu.org/bugs/?65954
> > 
> > I don't feel like this fixes anything, honestly.
> > [...]
> > I suggest this instead:
> >   .ad
> >       Set adjustment mode to \n[.J] if set, b otherwise.
> >   .ad 0
> >       Disable adjustment.
> >       Update \n[.j] and \n[.J] (previous value of \n[.j]).
> >   .ad MODE
> >       Set adjustment mode to MODE (l,c,r,b,n).
> >       Update \n[.j] and \n[.J].
> >   .na
> >       As .ad 0.
> > 
> > This should make both scenarios work as expected without breaking any
> > other ways in which people currently use it. (At least I hope so.)
>
> It's pretty important to me to detangle adjustment from alignment.
> Continuing to heap complications on the existing `ad` request doesn't
> seem like a promising path forward to me. [...]

The mixing of alignment and adjustment functionality in .ad puzzles me
to this day, especially combined with the existence of .ce and .rj.
My proposal was based on the assumption that maintaining compatibility
with other troffs is desired.
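
To make the proposal quoted above concrete, here is a minimal sketch in
Python of how I read it. It is a toy model, not groff code. The class
and attribute names are made up for illustration, modes are kept as
symbolic strings rather than whatever numeric encoding \n[.j] really
uses, and \n[.J] exists only in the proposal. One more assumption: an
argumentless .ad leaves \n[.J] alone, since the proposal doesn't say
otherwise.

```python
class AdjustState:
    """Toy model of the proposed .ad/.na semantics (hypothetical, not groff)."""

    MODES = {"l", "c", "r", "b", "n"}

    def __init__(self):
        self.j = "b"   # stands in for \n[.j]; symbolic, not the real encoding
        self.J = None  # stands in for the proposed \n[.J]; None means "not set"

    def ad(self, arg=None):
        if arg is None:
            # .ad: restore the saved mode if there is one, otherwise use "b".
            # Assumption: \n[.J] itself is not updated here.
            self.j = self.J if self.J is not None else "b"
        elif arg == "0" or arg in self.MODES:
            # .ad 0 / .ad MODE: remember the old mode, then switch.
            self.J, self.j = self.j, arg
        else:
            raise ValueError("unknown adjustment mode: %r" % (arg,))

    def na(self):
        # .na: as .ad 0.
        self.ad("0")


if __name__ == "__main__":
    s = AdjustState()
    s.ad("l")   # left-align
    s.na()      # temporarily disable adjustment
    s.ad()      # restore: back to "l"
    print(s.j)
```

With this reading, the sequence .ad l, .na, .ad ends up back in mode l,
and a bare .ad in a fresh environment falls back to b.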

~ onf
