Hello Tanaka-san,

At 2022-07-02T19:39:44+0900, t...@t-lab.opal.ne.jp wrote:
> Dear All,
>
> I wrote a set of patches in order to support CJK fonts in PostScript
> outputs.
>
> https://github.com/t-tk/groff-CJK-font

Thank you for doing this work!

> - It enables to support CJK non-embedded fonts in UTF16 encoding by
>   grops.
> - It enables to define charset by a range of Unicode code points in
>   font definition.  It is useful for CJK fonts because the number of
>   glyphs with the same metric are very large.
>
> The tested examples are shown in tests/* in the GitHub repository.
> They include typeset of Chinese, Japanese, Korean.
>
> I hope the maintainers to consider accepting my request.

I think your changes are well worth consideration, but they will take
some effort to integrate with the current state of groff 1.23.  There
have been literally thousands of commits since 1.22.4 was released.

https://git.savannah.gnu.org/cgit/groff.git/log/

I have looked over the commits at your GitHub site and I have some
comments on a few of them.  Would you prefer to receive that feedback
there, or here on this mailing list?  (If the latter, I'll quote the
diffs, of course.)

Let me highlight a few things that give me pause.  This is meant as
constructive criticism, not as a suggestion that this work isn't worth
doing or integrating into groff.  Also, I want this much at least to
go to the list so that some of the other experts here can correct me
anywhere I need it.  (We have several contributors with knowledge of
PostScript and PDF far superior to mine.)

A. I'm not sure about the change that makes the C1 controls valid
input code points.

https://github.com/t-tk/groff-CJK-font/commit/0f84463ecbe584cdbf91156819cd03e7d71d77f1

As I understand it, this shouldn't be necessary: this table is used to
screen out invalid _input_ code points.  It is not used to forbid
output code points or glyph indices in fonts.  This is an important
distinction because, in the 1-byte code space groff uses for input
processing, nearly all of the invalid input code points are used by
the troff program for internal purposes.  That, in turn, is one of the
first things that needs to change if groff is ever to natively accept
UTF-8 input.  (If I were doing this work, I'd probably relocate
groff's internal-use code points to the Unicode Private Use Area.)

For example, in the Times Roman font we use with grops, code point
decimal 130 encodes "guillemotright", as named in the Adobe Glyph List
(AGL).  Decimal 130 is _not_ valid groff input; this same value (octal
0202) is used to encode the transparent file request.

https://git.savannah.gnu.org/cgit/groff.git/tree/src/roff/troff/input.h#n46

You might try reverting this patch to see if anything breaks.  If my
understanding is correct, and if your input is processed with preconv
(which, as we'll see in the next item, it is) to encode the many CJK
characters it will require, then all your tests should still pass.  If
I'm wrong, then I'd very much like to figure out what I don't
understand!
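To make the distinction concrete, here is the sort of quick check I
have in mind, run against an unpatched groff.  (A sketch only: I am
quoting the diagnostic from memory, so its exact wording may differ
between versions.)

$ printf '\202\n' | groff -T ps > /dev/null
troff: <standard input>:1: warning: invalid input character code 130

The invalid-input screen rejects the raw byte on the way in, even
though the same value is a perfectly good code point on the output
side, in a grops font description.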
B. I like the broad test coverage but I'm a little doubtful of its
architecture on two levels.

https://github.com/t-tk/groff-CJK-font/commit/c3be2e29ae362283f442bc7ab99210a58425edbc

First, it is not integrated into the Automake test harness we
otherwise use.  You could be forgiven for overlooking that in groff
1.22.4, as I think back then we had only a handful of such automated
tests--maybe 4.  But now we have well over 100.  So I would prefer to
see the tests run under the Automake-managed test harness.  I've
written most of those 100+ tests, so I'm happy to help with the
migration.
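To give a sense of the shape of such a test, here is a minimal sketch
in the style of the scripts already in the tree.  (The test-groff
wrapper is what the existing scripts invoke; the Makefile.am wiring is
omitted, and the pattern checked for is a placeholder.)

#!/bin/sh
# Sketch of an Automake-run test: render one CJK glyph and confirm
# that groff produces device-independent output for the ps device.
groff="${abs_top_builddir:-.}/test-groff"

output=$(printf '早\n' | "$groff" -K utf8 -Z -T ps)
echo "$output" | grep -q '^x T ps' || exit 1

A real test would grep for something sharper, like the glyph-writing
command discussed below.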
Secondly, I think the testing is targeted very broadly.  It seems to
use raw groff input and keep entire PostScript files around as test
artifacts to compare against (that is, if I'm understanding what you
have, they are the "expected" output).  Moreover, a later commit adds
PNG conversions of the PostScript output.  I don't see where the
output of patched groff is tested for conformance with these PNGs,
but I think doing so would be misguided; while visually inspectable
output is of interest, it's beside the point of your changes.
Comparing PNG output is really a test of whatever is rendering the
PostScript fonts, and that's not groff.  It is furthermore sensitive
to alteration or replacement of the font files used--but as long as
the fonts maintain the same encoding--and if "UTF-16" is embedded in
their file names, I reckon this should be the case--then such
alterations to the rendering are not groff's problem to solve.  On the
other hand, placement of the glyphs on the page is very important, and
solidly groff's responsibility!

What I am getting at is that these tests appear to skip over an
important level of transformation, and do not target the scope of most
of what your patches change with specificity.  I am referring to the
transformation of device-independent GNU troff output to PostScript.

Were I writing tests for this, I would run groff to produce
device-independent output from an input document that exercises the
code paths and Unicode code point ranges of interest (much as you have
already done), run "groff -Z" to save that device-independent output,
and use _it_ as the test input.  This output format is documented and
fairly stable (see the groff_out(5) man page).  Some knowledge of it
will be helpful for troubleshooting any test failures, but I emphasize
that it is okay to produce it with "groff -Z".  I've written
device-independent *roff output from scratch, but that's only because
I'm a paranoid, crazy documentarian who wants to rewrite the
groff_out(5) man page some day soon.  ;-)

With the PostScript expected output in hand, I would then not check
for an exact match, but simply for the PostScript commands that direct
the renderer/printer to place the relevant encoded glyph(s) on the
page.  This seems to me to be simply a matter of verifying that the
right PostScript command and font code points are in place.

The basic function could be tested with a groff document that puts a
single glyph on the page.  For instance, adapting your "dankeCJK.1", I
might do this.

$ groff -K utf8 -Z > dankeCJK.grout
早
$ cat dankeCJK.grout
x T ps
x res 72000 1 1
x init
x F -
p1
XXX
V24000
H72000
n12000 0
x trailer
V792000
x stop

At about the place "XXX" shows up (this is groff Git HEAD output),
with CJK UTF-16 font support we would instead see commands to mount a
font (like "CSS") and write the glyph, I expect with the "C" command
(documented in groff_out(5)).  It seems to me that almost any other
difference in "grout" or PostScript produced by groff would be
irrelevant.  The scope of your work is to get groff to recognize and
use UTF-16-encoded CJK fonts, correct?  Checking the entire output for
exact matches makes it likely that unrelated changes in groff will
cause spurious test failures.
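Concretely, the PostScript-side check might amount to nothing more
than the following.  (I am guessing at "CSS" as the mounted font's
name; the exact procedures grops emits would need to be confirmed
against its prologue.)

$ grops dankeCJK.grout > dankeCJK.ps
$ grep -q 'CSS' dankeCJK.ps || echo 'CJK font never selected' >&2

Anything beyond that--comments, version strings, unrelated prologue
material--would be noise as far as this feature is concerned.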
C. Of lesser importance, and easier to solve, are some reservations I
have about code style.

I am dubious about indirecting the changed data type for `ps_output'
through the preprocessor, altering its first argument from a `char' to
an `unsigned short'.  I would prefer to have the function tell the
truth about its argument types; "CHAR" is misleading.  Furthermore, my
first inclination would be to go ahead and use an entire machine
register for this datum, since Unicode is a 21-bit code in a 32-bit
space.  Only if reliably measured and verified performance problems
arose with grops would I consider optimizing to a shorter type.

More broadly, to make groff more Unicode-friendly in the future, we're
going to need to widen its internal data type for character and glyph
code points anyway, as discussed in item (A) above.  I'd personally be
perfectly happy to see grops become the first step in such an
improvement.  I also get the feeling it's going to be more manageable
and less disruptive to change that data type everywhere else first,
_before_ we change the core (GNU troff, the program, itself).

Another small style issue is that I don't see any reason to hide this
new feature behind C preprocessor conditionals, _except_ for the fact
that you changed the syntax of the font description file.

https://github.com/t-tk/groff-CJK-font/commit/a34710636508177025ba1c1b9ec979ec191aae53

I think I would prefer to add a new font description file directive,
something like "charset-range", and use that to exercise your new
feature of applying a single set of metrics to a large number of code
points.  This would better maintain continuity with a file format and
syntax that date back to Kernighan's device-independent troff in 1981.
I say "continuity", not "compatibility", because a non-groff *roff
that tries to parse a groff font description file using your new
feature would fail to process either form correctly; but by using a
new keyword, we make it much more obvious to anyone troubleshooting
problems what has gone wrong--an incompatible extension is in use.
Every troff veteran who encounters groff is accustomed to those, so it
should be no great shock.  (On reflection, maybe "ranged-charset"
would be a better name, because it would avoid misinterpretation by
lazy parsers that compare only the first 7 bytes of the keyword--but
maybe the diversity of *roff font description file readers is so low
that this doesn't really matter.)
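To sketch what I mean--and the directive name, field layout, and
metric values here are all hypothetical, shown for discussion only--a
ranged entry in a grops font description might read:

# Hypothetical: one set of metrics (width,height,depth at unitwidth
# 1000) for the whole CJK Unified Ideographs block, with font code
# points starting at 0x4E00.
charset-range
u4E00..u9FFF  1000,880,120  0  0x4E00

A reader that understands the extension expands the range into
individual entries; one that doesn't fails loudly at the unknown
keyword instead of silently misparsing a charset entry.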
It has taken some time to consider these issues and compose this
email; I apologize for making you wait for the feedback.

Can you tell me where I might obtain the font files you are using?  I
am eager to start experimenting and planning the least disruptive
route to integration.

> Thanks,
> Takuji TANAKA

Best regards,
Branden