Hello Tanaka-san,

At 2022-07-02T19:39:44+0900, t...@t-lab.opal.ne.jp wrote:
> Dear All,
>
> I wrote a set of patches in order to support CJK fonts in PostScript
> outputs.
>
> https://github.com/t-tk/groff-CJK-font

Thank you for doing this work!

> - It enables to support CJK non-embedded fonts in UTF16 encoding by
>   grops.
> - It enables to define charset by a range of Unicode code points in
>   font definition.  It is useful for CJK fonts because the number of
>   glyphs with the same metric are very large.
>
> The tested examples are shown in tests/* in the GitHub repository.
> They include typeset of Chinese, Japanese, Korean.
>
> I hope the maintainers to consider accepting my request.

I think your changes are well worth consideration, but they will take
some effort to integrate with the current state of groff 1.23.  There
have been literally thousands of commits since 1.22.4 was released.

https://git.savannah.gnu.org/cgit/groff.git/log/

I have looked over the commits at your GitHub site and I have some
comments on a few of them.  Would you prefer to receive that feedback
there, or here on this mailing list?  (If the latter, I'll quote the
diffs, of course.)

Let me highlight a few things that give me pause.  This is meant as
constructive criticism, not as a suggestion that this work isn't worth
doing or integrating into groff.  Also, I want this much at least to
go to the list so that some of the other experts here can correct me
anywhere I need it.  (We have several contributors with knowledge of
PostScript and PDF far superior to mine.)

A. I'm not sure about the change that makes the C1 controls valid
input code points.

https://github.com/t-tk/groff-CJK-font/commit/0f84463ecbe584cdbf91156819cd03e7d71d77f1

As I understand it, this shouldn't be necessary: this table is used to
screen out invalid _input_ code points.  It is not used to forbid
output code points or glyph indices in fonts.  This is an important
distinction because, in the 1-byte code space groff uses for input
processing, nearly all of the invalid input code points are used by
the troff program for internal purposes.  That, in turn, is one of the
first things that needs to change if groff is ever to natively accept
UTF-8 input.  (If I were doing this work, I'd probably relocate
groff's internal-use code points to the Unicode Private Use Area.)

For example, in the Times Roman font we use with grops, code point
decimal 130 encodes "guillemotright", as named in the Adobe Glyph List
(AGL).  Decimal 130 is _not_ valid groff input; this same value (octal
0202) is used to encode the transparent file request.

https://git.savannah.gnu.org/cgit/groff.git/tree/src/roff/troff/input.h#n46

You might try reverting this patch to see if anything breaks.  If my
understanding is correct, and if your input is processed with preconv
(which, as we'll see in the next item, it is) to encode the many CJK
characters it will require, then all your tests should still pass.  If
I'm wrong, then I'd very much like to figure out what I don't
understand!
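To make the distinction concrete, here is the sort of quick check I
have in mind, run against an unpatched groff.  (A sketch only: I am
quoting the diagnostic from memory, so its exact wording may differ
between versions.)

$ printf '\202\n' | groff -T ps > /dev/null
troff: <standard input>:1: warning: invalid input character code 130

The invalid-input screen rejects the raw byte on the way in, even
though the same value is a perfectly good code point on the output
side, in a grops font description.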
B. I like the broad test coverage but I'm a little doubtful of its
architecture on two levels.

https://github.com/t-tk/groff-CJK-font/commit/c3be2e29ae362283f442bc7ab99210a58425edbc

First, it is not integrated into the Automake test harness we
otherwise use.  You could be forgiven for overlooking that in groff
1.22.4, as I think back then we had only a handful of such automated
tests--maybe 4.  But now we have well over 100.  So I would prefer to
see the tests run under the Automake-managed test harness.  I've
written most of those 100+ tests, so I'm happy to help with the
migration.
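To give a sense of the shape of such a test, here is a minimal sketch
in the style of the scripts already in the tree.  (The test-groff
wrapper is what the existing scripts invoke; the Makefile.am wiring is
omitted, and the pattern checked for is a placeholder.)

#!/bin/sh
# Sketch of an Automake-run test: render one CJK glyph and confirm
# that groff produces device-independent output for the ps device.
groff="${abs_top_builddir:-.}/test-groff"

output=$(printf '早\n' | "$groff" -K utf8 -Z -T ps)
echo "$output" | grep -q '^x T ps' || exit 1

A real test would grep for something sharper, like the glyph-writing
command discussed below.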
Secondly, I think the testing is targeted very broadly.  It seems to
use raw groff input and keep entire PostScript files around as test
artifacts to compare against (that is, if I'm understanding what you
have, they are the "expected" output).  Moreover, a later commit adds
PNG conversions of the PostScript output.  I don't see where the
output of patched groff is tested for conformance with these PNGs,
but I think doing so would be misguided; while visually inspectable
output is of interest, it's beside the point of your changes.
Comparing PNG output is really a test of whatever is rendering the
PostScript fonts, and that's not groff.  It is furthermore sensitive
to alteration or replacement of the font files used--but as long as
the fonts maintain the same encoding--and if "UTF-16" is embedded in
their file names, I reckon this should be the case--then such
alterations to the rendering are not groff's problem to solve.  On the
other hand, placement of the glyphs on the page is very important, and
solidly groff's responsibility!

What I am getting at is that these tests appear to skip over an
important level of transformation, and do not target the scope of most
of what your patches change with specificity.  I am referring to the
transformation of device-independent GNU troff output to PostScript.

Were I writing tests for this, I would run groff to produce
device-independent output from an input document that exercises the
code paths and Unicode code point ranges of interest (much as you have
already done), run "groff -Z" to save that device-independent output,
and use _it_ as the test input.  This output format is documented and
fairly stable (see the groff_out(5) man page).  Some knowledge of it
will be helpful for troubleshooting any test failures, but I emphasize
that it is okay to produce it with "groff -Z".  I've written
device-independent *roff output from scratch, but that's only because
I'm a paranoid, crazy documentarian who wants to rewrite the
groff_out(5) man page some day soon.  ;-)

With the PostScript expected output in hand, I would then not check
for an exact match, but simply for the PostScript commands that direct
the renderer/printer to place the relevant encoded glyph(s) on the
page.  This seems to me to be simply a matter of verifying that the
right PostScript command and font code points are in place.

The basic function could be tested with a groff document that puts a
single glyph on the page.  For instance, adapting your "dankeCJK.1", I
might do this.

$ groff -K utf8 -Z > dankeCJK.grout
早
$ cat dankeCJK.grout
x T ps
x res 72000 1 1
x init
x F -
p1
XXX
V24000
H72000
n12000 0
x trailer
V792000
x stop

At about the place "XXX" shows up (this is groff Git HEAD output),
with CJK UTF-16 font support we would instead see commands to mount a
font (like "CSS") and write the glyph, I expect with the "C" command
(documented in groff_out(5)).  It seems to me that almost any other
difference in "grout" or PostScript produced by groff would be
irrelevant.  The scope of your work is to get groff to recognize and
use UTF-16-encoded CJK fonts, correct?  Checking the entire output for
exact matches makes it likely that unrelated changes in groff will
cause spurious test failures.
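Concretely, the PostScript-side check might amount to nothing more
than the following.  (I am guessing at "CSS" as the mounted font's
name; the exact procedures grops emits would need to be confirmed
against its prologue.)

$ grops dankeCJK.grout > dankeCJK.ps
$ grep -q 'CSS' dankeCJK.ps || echo 'CJK font never selected' >&2

Anything beyond that--comments, version strings, unrelated prologue
material--would be noise as far as this feature is concerned.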
C. Of lesser importance, and easier to solve, are some reservations I
have about code style.

I am dubious about indirecting the changed data type for `ps_output'
through the preprocessor, altering its first argument from a `char' to
an `unsigned short'.  I would prefer to have the function tell the
truth about its argument types; "CHAR" is misleading.  Furthermore, my
first inclination would be to go ahead and use an entire machine
register for this datum, since Unicode is a 21-bit code in a 32-bit
space.  Only if reliably measured and verified performance problems
arose with grops would I consider optimizing to a shorter type.

More broadly, to make groff more Unicode-friendly in the future, we're
going to need to widen its internal data type for character and glyph
code points anyway, as discussed in item (A) above.  I'd personally be
perfectly happy to see grops become the first step in such an
improvement.  I also get the feeling it's going to be more manageable
and less disruptive to change that data type everywhere else first,
_before_ we change the core (GNU troff, the program, itself).

Another small style issue is that I don't see any reason to hide this
new feature behind C preprocessor conditionals, _except_ for the fact
that you changed the syntax of the font description file.

https://github.com/t-tk/groff-CJK-font/commit/a34710636508177025ba1c1b9ec979ec191aae53

I think I would prefer to add a new font description file directive,
something like "charset-range", and use that to exercise your new
feature of applying a single set of metrics to a large number of code
points.  This would better maintain continuity with a file format and
syntax that date back to Kernighan's device-independent troff in 1981.
I say "continuity", not "compatibility", because a non-groff *roff
that tries to parse a groff font description file using your new
feature would fail to process either form correctly; but by using a
new keyword, we make it much more obvious to anyone troubleshooting
problems what has gone wrong--an incompatible extension is in use.
Every troff veteran who encounters groff is accustomed to those, so it
should be no great shock.  (On reflection, maybe "ranged-charset"
would be a better name, because it would avoid misinterpretation by
lazy parsers that compare only the first 7 bytes of the keyword--but
maybe the diversity of *roff font description file readers is so low
that this doesn't really matter.)
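To sketch what I mean--and the directive name, field layout, and
metric values here are all hypothetical, shown for discussion only--a
ranged entry in a grops font description might read:

# Hypothetical: one set of metrics (width,height,depth at unitwidth
# 1000) for the whole CJK Unified Ideographs block, with font code
# points starting at 0x4E00.
charset-range
u4E00..u9FFF  1000,880,120  0  0x4E00

A reader that understands the extension expands the range into
individual entries; one that doesn't fails loudly at the unknown
keyword instead of silently misparsing a charset entry.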
It has taken some time to consider these issues and compose this
email; I apologize for making you wait for the feedback.

Can you tell me where I might obtain the font files you are using?  I
am eager to start experimenting and planning the least disruptive
route to integration.

> Thanks,
> Takuji TANAKA

Best regards,
Branden