At 2020-09-19T16:01:22-0500, Dave Kemper wrote:
> Straying away from man-page considerations and comparing these two
> approaches in general:
>
> The .string* requests also have the advantage of handling alphabetic
> Latin-1 characters and the roff escapes that represent them (though
> .stringup fails kind of messily on roff escapes representing
> nonalphabetic Latin-1 characters, such as \[de]).
Admittedly, yes.

> However, if input must go through preconv, the .string* requests
> remove all non-ASCII characters (alphabetic or otherwise) from strings
> passed to them and emit warnings for each one.  The .tr approach,
> while failing to convert non-ASCII alphabetic characters, does
> preserve them.
>
> .tr is also portable to non-groff roffs.
>
> So there are trade-offs to either approach.

As the implementor of .string{up,down}, I grant that they are feeble.
The only thing that makes them bearable is that they are pretty much
adequate to the man page considerations you're straying away from.

Several weekends ago I started down the road of learning what it would
take to convert the GNU troff engine to use a wide character type for
handling of the input tokens.  That is, you would still read a byte at
a time, but immediately toss it into a wider type and then never have
to worry about its representation format again until emitting
device-independent output, which is already 7-bit ASCII, I think.

32 bits sounded good.  Unicode is only 21 bits, so I figured I'd just
move all the crazy groff enums[1] in src/roff/troff/input.h to the top
end of that space, or count backwards from the halfway point in case
someone made noise about signedness issues.  Either way, tons of
space, and they wouldn't even have to be #ifdef-ed for EBCDIC!

Complexity rapidly ramified.  First I was rewriting groff's built-in
C++ string library to be wchar_t-based, and I was already anticipating
this list getting swarmed by C++ weenies screaming "why are you
reinventing the wheel AGAIN when the C++ STL is RIGHT THERE?"
Fortunately, I think Zack Weinberg answered that question for me in
the meantime:

    This is because the test probes for C++11 library features, and
    the C++ standard library is notoriously heavyweight.
    The test program used by _AC_PROG_CXX_CXX11 is only about 150
    lines long, but it expands to 47,000 lines of gnarly template
    classes after preprocessing, and roughly 30,000 assembly
    instructions after compilation.  With -g enabled (as is the
    default), 770,000 lines of debug information are also emitted into
    the assembly.[2]

There were other problems I don't even remember now.  I should have
written up a report of what I saw that had to be dealt with, but I got
discouraged and did not.  Maybe I'll take another crack at it
sometime.

I don't yet perceive whether there is a way to do the char-width
migration in a modular way, or if everything's so tightly coupled that
you have to break the world and then put it back together.  Right now
you have to break libgroff along with the troff executable, and
breaking libgroff breaks tons of other things in the tree.  Maybe a
good start (probably on a branch) would be to give troff its own copy
of libgroff to which the violence can be done.

Regards,
Branden

[1] They're not really enums, just global integer constants.  But at
    least they're not preprocessor symbols.

[2] https://savannah.gnu.org/support/index.php?110285