Re: Translating manpages into several idioms (gettextization)

G. Branden Robinson Thu, 27 Mar 2025 19:28:32 -0700

Hi Colin,

At 2025-03-27T01:00:17+0000, Colin Watson wrote:
> I still very much don't understand how po4a-translate would work with
> this sort of approach.  My understanding is that the only way that you
> could take a preprocessed version of the document, feed it into po4a,
> and expect to get useful results out of the po4a-translate stage would
> be if you could round-trip from your preprocessed form back to
> something closely resembling the original document - and
> round-tripping entire pages through POD (rather than just the
> translatable bits) seems like an unnecessarily hard problem to solve,
> and probably not viable for a large corpus.
[snip]


Thanks a lot for your thoughtful response.  I can't rebut most of your
points, in large part because I have, it now seems to me, a deficient
grasp of how po4a is used in the field.

I may have gotten carried away by Martin Quinson's enthusiastic response
to my pitch, thinking I was facing down a tangle of hemp while equipped
with a strong sword arm, a sharp blade, and a hungry eye for Asia.

[snip]
> Is that helpful?  I realize that preserving fragments of the original
> markup may not actually be possible with your current implementation
> vision,

Yes, that's intractably hard or even computationally impossible (because
irreversible macro interpolations, et al., have already taken
place)--under the strictly confined alternative-node-output scheme I had
in mind.

It could be that the problem is still solvable with a technique similar
to that used for grohtml, combined with how I envision refactoring the
troff/grohtml relationship, and that is by pushing more "tagging" work
into the macro packages themselves.  In this case, man(7) and mdoc(7),
of course.

This is already done to a large degree, of course, by "devtag.tmac", but
unfortunately pretty much everything about that macro file and the
support infrastructure for its technique inside the formatter (the `tag`
and `taga` requests) is undocumented.  But over time I've developed a
gradually clearing notion of what it is trying to do, by better
understanding the pieces of the system around it.

> but that's exactly why I wanted to outline the sorts of things that I
> think are likely to be needed sooner rather than later.

I appreciate it.

> Alternatively, if the output could include accurate offsets for each
> translatable chunk, that would probably also work: Locale::Po4a::Groff
> could run your new code to get a preprocessed version and then match
> everything up.  I still think we'd need a richer format than just a
> stream of lines of text though; there's the context issue I mentioned
> above, but also translatable chunks don't always match up with lines
> very well.  For example, I'd say that this line of input:
> 
>   .TH curs_beep 3X 2025-02-01 "ncurses @NCURSES_MAJOR@.@NCURSES_MINOR@" "Library calls"
> 
> ... should produce two msgids, "ncurses @NCURSES_MAJOR@.@NCURSES_MINOR@" and
> "Library calls".

That's not hard at all with the package-directed tagging approach.  What
you want to split up are already separate arguments to the `TH` macro,
so it could spit out distinct "tags" for each.  In mdoc(7), these data
are arguments to (or determined by) separate macros, an even stronger
distinction.
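
As a rough illustration of why the split is cheap, here is a minimal Python
sketch. It is not groff or po4a code. It parses the macro call itself, which
is not what a package-directed approach would do (there the macro package
would emit the tags), but it shows that the argument boundaries already carry
the information. The field names in TH_FIELDS are hypothetical labels, and the
argument parser is simplified: it handles only spaces, double quotes, and the
doubled-quote convention, not escape sequences.

```python
# Hypothetical tag names for the five TH arguments; not groff's own names.
TH_FIELDS = ["title", "section", "date", "source", "manual"]


def split_macro_args(line):
    """Split a roff macro call into (name, [args]).

    Simplified: arguments are separated by spaces; a double-quoted
    argument may contain spaces, and "" inside quotes is one literal
    quote.  Escape sequences and continuation lines are not handled.
    """
    if not line.startswith("."):
        raise ValueError("not a macro call")
    name, _, rest = line[1:].strip().partition(" ")
    args, i, n = [], 0, len(rest)
    while i < n:
        if rest[i] == " ":
            i += 1
        elif rest[i] == '"':
            i += 1
            buf = []
            while i < n:
                if rest[i] == '"':
                    if i + 1 < n and rest[i + 1] == '"':
                        buf.append('"')  # doubled quote -> literal quote
                        i += 2
                        continue
                    i += 1  # closing quote ends the argument
                    break
                buf.append(rest[i])
                i += 1
            args.append("".join(buf))
        else:
            j = i
            while j < n and rest[j] != " ":
                j += 1
            args.append(rest[i:j])
            i = j
    return name, args


def th_tags(line):
    """Pair each TH argument with its (hypothetical) tag name."""
    name, args = split_macro_args(line)
    if name != "TH":
        raise ValueError("expected a TH call")
    return list(zip(TH_FIELDS, args))
```

Fed the ncurses line quoted above, th_tags() yields "source" and "manual" as
separate entries, which is the two-msgid split being asked for.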

> > 5.  At this point in the formatting process, the formatter's notion of a
> >    font is an integer referring to a mounting position.  We don't know
> >    what the font "is".  The current font is also a property of the
> >    environment, not of nodes per se.  But: (a) we know when the font
> >    selection _changes_, and (b) for man page formatting I'll bet we can
> >    assume that fonts are mounted in traditional order: 1, 2, 3, 4 -> R,
> >    I, B, BI.[1]
> 
> po4a does exactly that today, FWIW.
> 
>   https://github.com/mquinson/po4a/blob/v0.73/lib/Locale/Po4a/Man.pm#L1800

I don't have a citation handy, unfortunately, but I have read that po4a
sometimes guesses the typeface wrong.  I suspect that's due to having to
track a lot of state when parsing man(7) macros for itself, because font
selection escape sequences can be (and are, for good reasons like the
three-font problem, and bad ones like undisciplined and/or clueless man
page authors) nested inside arguments to font selection macros.  To be
bulletproof, one must also have mastered the semantics of switching
to the "previous" font, and have decided consistently with the formatter
how the identity of the previous font is affected when a font selection
escape sequence attempts to select a nonexistent font.
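
To make the state-tracking burden concrete, here is a minimal Python sketch of
a font tracker. It is not po4a's or groff's code. It assumes the traditional
mounting order (1, 2, 3, 4 -> R, I, B, BI), treats only those four names as
existing, and handles font escapes only, not the font macros they can be
nested in. What happens to the "previous" font on a failed selection is left
as an explicit flag (unknown_updates_previous, a name made up here), because
the sketch does not claim to know which choice matches the formatter.

```python
import re

# Traditional mounting order assumed above: positions 1..4 -> R, I, B, BI.
MOUNTED = {"1": "R", "2": "I", "3": "B", "4": "BI"}
KNOWN = set(MOUNTED.values())

# Matches \fX, \f(XX, and \f[name].
FONT_ESC = re.compile(r"\\f(?:\[([^\]]*)\]|\((..)|(.))")


class FontTracker:
    def __init__(self, unknown_updates_previous=False):
        self.current = "R"
        self.previous = "R"
        # Policy knob (hypothetical): does a failed selection still
        # overwrite the remembered "previous" font?
        self.unknown_updates_previous = unknown_updates_previous

    def select(self, name):
        if name in ("P", ""):  # \fP and \f[] select the previous font
            self.current, self.previous = self.previous, self.current
            return
        name = MOUNTED.get(name, name)
        if name not in KNOWN:
            if self.unknown_updates_previous:
                self.previous = self.current
            return
        self.previous, self.current = self.current, name

    def spans(self, text):
        """Return [(font, chunk), ...] for text containing font escapes."""
        out, pos = [], 0
        for m in FONT_ESC.finditer(text):
            if m.start() > pos:
                out.append((self.current, text[pos:m.start()]))
            bracketed = m.group(1)
            self.select(bracketed if bracketed is not None
                        else (m.group(2) or m.group(3)))
            pos = m.end()
        if pos < len(text):
            out.append((self.current, text[pos:]))
        return out
```

With input such as r"\fBa\f[ZZ]b\fPc", the two policy settings disagree about
the font of "c" (R versus B), which is the kind of divergence that would show
up as a wrong typeface guess.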

> > 6.  Text in a man page that uses special characters (trout/grout:
> >     the "C" command) probably doesn't need to be translated.
> > 
> >     One exception: as usual we'd likely special-case what "groff -a"
> >     renders as `<->` and `<hy>` as good old `-`, and punt (warn on
> >     and ignore) any other special character.
> 
> This seems a bit too simplistic.

Admitted.

> Looking at grout for man(1), for instance, I see a bunch of "Caq"
> commands that correspond to where the page source has "'".

Yes.  Having the giant list of special characters in groff_char(7) in
mind, most of which see use in _no other man page_, I waved my hands too
aggressively.

Specifically, the notorious 7 ASCII code points that *roff formatters
traditionally handle non-literally, as discussed in groff_char(7) and
groff_man_style(7), would all demand handling.
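
A minimal Python sketch of what that handling could look like when reading
grout follows. It is a heavily simplified reading of the format: only the "t"
(typeset word), "C" (special character), and "w" (interword space) commands
are considered, and everything else is skipped. The glyph names in the table
are the groff_char(7) names for those ASCII code points plus "hy"; which of
them actually turn up in the grout for a given page and output device is an
assumption here, and the function name is made up.

```python
# Special character names (per groff_char(7)) for the ASCII code points
# that *roff handles non-literally, plus the hyphen glyph "hy".
ASCII_GLYPHS = {
    "aq": "'",
    "dq": '"',
    "ga": "`",
    "ha": "^",
    "ti": "~",
    "rs": "\\",
    "\\-": "-",
    "hy": "-",
}


def recover_text(grout_lines):
    """Rebuild plain text from a few grout commands.

    Returns (text, ignored), where `ignored` lists special character
    names that were not in ASCII_GLYPHS and were dropped with a warning.
    """
    out, ignored = [], []
    for line in grout_lines:
        if line.startswith("t"):
            out.append(line[1:])          # typeset word
        elif line.startswith("C"):
            fields = line[1:].split()
            if not fields:
                continue
            name = fields[0]
            if name in ASCII_GLYPHS:
                out.append(ASCII_GLYPHS[name])
            else:
                ignored.append(name)      # punt: warn on and ignore
        elif line.startswith("w"):
            out.append(" ")               # interword space
        # all other commands (positioning, fonts, ...) are skipped
    return "".join(out), ignored
```

For example, the (hand-written, hypothetical) input
["tdon", "Caq", "tt", "wh24", "tpanic"] comes back as "don't panic" with
nothing ignored.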

> And I wouldn't be surprised to find other C commands in the grout for
> mostly-English prose; what if somebody described an approach as
> "naïve", for instance?

_That_, I think, on the other hand, will be relatively rare.  Most
people writing man(7) (or mdoc(7), for that matter), seem to be stumped
by how to input such words, so they do something non-portable or just
give up the attempt, and degrade their input to Basic Latin ("ASCII").

And in fact there isn't really any way to represent "naïve" that is
portable to both AT&T troff and GNU troff.  In AT&T troff, you either
overstruck characters (ugly on typesetters and ineffective on video
terminals), or the macro package supplied facilities for overstriking
(somewhat) more artfully, as with the CSRC's and CSRG's competing
"accent mark" implementations in ms(7).  Neither suits today's fonts.

But back on the gripping hand, everybody trying to translate a man page
into a non-English language is either already well aware of this problem
or is not concerned about page portability to AT&T troff, now
unmaintained for 31 years and counting, even while its desiccated corpse
ambulates to this day in lingering commercial System V Unices.

Bottom line: I'll have to think more about this.  Thanks for saving me
from a tail-chase.

Regards,
Branden
