I've been quiet for the last several days because I've been working hard on some of the issues I've brought up on this list. I've included the current draft of the report on portable troff requests below. After it, some discussion of what I have planned when the report is finished.
------------------------------------------------------------------------------
Third draft of the report on defining a portable subset of troff requests.

PORTABLE FEATURES:

Portable requests:

.de .ds .fi .ft .ie .if .ig .nf .nr .rm .rn .so .sp

The .sp request is portable in the sense that it can be portably used to generate a visual paragraph break without terminating list markup, but use of an argument to control vertical spacing is not portable.

While .if/.ie is in the portable set, the expression set allowable in conditionals has to be seriously restricted to be portable across rendering programs including doclifter and Unix *roff. The parenthesized form of conditional, and the groff extended logical operators, are not portable.

TODO: DESCRIBE PORTABLE EXPRESSIONS BETTER

Note that .br/.nl, .ti, .ta, and .in are *not* in the portable set. These cannot be translated structurally by doclifter, and man-to-HTML translators tend to ignore them or give useless results as well. Fortunately, these can almost always be replaced by uses of .nf/.fi, .RS/.RE, and tbl markup (which doclifter handles).

Portable escapes:

\. \^ \' \` \- \$ \* \& \| \0 \<SP> \d \e \f \u \n

These are almost all the escapes actually needed to translate the entire 13,447-page Fedora Core 6 corpus into DocBook. The corpus makes sporadic and very rare use of \v, \w, \h, \o, and \k (approximately once each), but these are not essential and can be patched out.

I noted previously that \w is *not* portable. In general, we can't count on the viewer to be able to render horizontal or vertical motions with precision, we can't count on it to know font sizes, and we can't even count on it to know whether its output uses fixed- or variable-width fonts. As it turns out, interpreting \w is not necessary for doclifter, either -- all man-page uses (at least, in my corpus) are either inside macros which are interpreted by other means or part of Synopsis syntax.

Werner Lemberg wanted to know the status of \~.
I found 17 uses within the groff documentation and 4 outside it. Of those 4, two were errors. So it's not much needed for manual pages, which is a good thing as it is not portable. In particular, I was unable to discover any corresponding ISO entity or Unicode character.

Portable glyphs:

The glyphs \*R, \*(Tm, \*(lq and \*(rq (registered, trademark, left quote, and right quote) are described in groff_man(7). Every man-page viewer I examined except the crufty old Perl man2html supports these.

I think we can declare Latin-1 and the intersection of groff glyphs with HTML entities portable as well, but verifying this will need more work, and it will require ignoring the limitations of some obsolete translators such as the Perl man2html.

TODO: NEEDS MORE INVESTIGATION

Portable registers:

After investigating the groff built-in registers, I have concluded that the only portable built-in register is .$, the macro argument count register. Any other time troff markup references a built-in register, it is about to do something that depends on knowing about the physical rendering medium, such as sub-character motion or drawing.

FEATURE SUPPORT IN OTHER MANPAGE-RENDERING PROGRAMS:

More detailed notes on feature support in programs other than groff follow. Programs are listed roughly in decreasing order of groff compatibility.

Heirloom troff:

Gunnar Ritter, the maintainer, says: "supports almost all groff requests; a complete list is in <http://heirloom.sourceforge.net/doctools/troff.pdf>. The exceptions are mainly in areas which are irrelevant in the context of manual pages, like debugging or color support. The only unsupported request which sometimes occurs in manual pages is .fam."

I checked on the last bit; .fam is used in exactly 5 pages in the corpus, two of which are groff documentation. We can safely not support it.
Whatever subset of groff glyphs Heirloom troff supports, it's bound to be larger than that of non-troff-descended rendering programs and thus will not constrain the portable subset. Thus I have not enumerated the supported glyphs here.

Unix troff classic:

Supports all the features described above. Because all the other programs described here were modeled on it, it is not going to be a constraint on the portable set.

doclifter:

doclifter supports the following troff requests: .ab .am .as .bp .c2 .cc .cu .de .ds .em .fi .ft .ie .if .ig .nf .mso .nop .nr .pm .rm .return .rn .rr .shift .so .sp .tm .tr .ul.

doclifter does not support .fam. It treats .do as a no-op, a rather dodgy procedure which (because of the restricted ways .do is used) nevertheless gives good results.

doclifter handles *all* predefined groff glyphs, mapped to ISO escapes and Unicode -- except the old-style Bell Labs bracket-pile characters.

doclifter handles the entire portable set of escapes as described above, and also \c, \<CR>, some cases of \w, and some cases of \o. (The remaining cases are passed through with a warning.)

manServer:

Gunnar also reports: "the manServer script by Rolf Howarth lacks support for .bp, .ul, .cu, .tm, .as, .em, .am, .rr, .pm, .cc, .c2, .ab, and .do, so I think these also do not belong on the list of safe requests. It lacks reasonable support for the \c and \<CR> escape sequences."

Developer's docs are here: <http://www.squarebox.co.uk/users/rolf/download/manServer.shtml>. I read through the source code to determine its capabilities.

manServer handles these troff requests: .ds .nr .ti .rm .rn .de .ig .so .ps .ft .nf .fi .br .sp .ta.

manServer handles escapes \., \', \`, \&, \^, \0, \d, \e, \f, \n, \s, \u.

manServer handles cases of \o that reduce to Latin-1 and Latin-2 accented characters.
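
As a rough cross-check, the doclifter and manServer request lists quoted above can be intersected mechanically. What follows is only a minimal Python sketch, not code from doclifter or manServer. The names DOCLIFTER, MANSERVER, PORTABLE, and common_requests are made up for illustration, and the sets are simply transcribed from this draft (reading the first entry of the doclifter list as .ab).

```python
# Hypothetical cross-check: all names here are illustrative, and the
# request sets are transcribed from the draft (leading dots dropped).

DOCLIFTER = set(
    "ab am as bp c2 cc cu de ds em fi ft ie if ig nf mso nop nr pm "
    "rm return rn rr shift so sp tm tr ul".split()
)
MANSERVER = set("ds nr ti rm rn de ig so ps ft nf fi br sp ta".split())

# The draft's proposed portable request list.
PORTABLE = set("de ds fi ft ie if ig nf nr rm rn so sp".split())


def common_requests(*supported):
    """Return the requests that appear in every one of the given sets."""
    return set.intersection(*supported)


common = common_requests(DOCLIFTER, MANSERVER)

# Requests the draft calls portable that are not in both quoted lists.
missing = PORTABLE - common

print("common: ", " ".join("." + r for r in sorted(common)))
print("missing:", " ".join("." + r for r in sorted(missing)))
```

With the lists exactly as quoted, the common set is .de .ds .fi .ft .ig .nf .nr .rm .rn .so .sp, and the comparison flags .ie and .if as absent from the manServer list. That is only a statement about the lists as written here; whether manServer handles conditionals by some other path is not established by this check.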
The KDE manpage viewer:

Gunnar writes: "At its core, it seems to be a derivative of the man2html program by Richard Verhoeven which is also part of Andries Brouwer's man package: <http://websvn.kde.org/trunk/KDE/kdebase/kioslave/man/man2html.cpp?rev=416894&view=auto>. From [the doclifter list of] requests, it lacks support for .bp, .cu, .do, .em, .pm, .rr, and .ul. It implements all escape sequences you consider as safe, and has a large list of supported special characters which I am too lazy to examine in detail."

The KDE viewer supports built-in registers: n, t, o, e, l, .$, .A, .T, .V.

Supported escapes are: \c, \e, \f, \n, \p, \s, \t, \w, \<SP>, \$, \&, \', \`, \-, \., and others outside the set that man pages actually use. Escapes \0, \~, \|, and \^ are all mapped to an ordinary space, so the latter two cannot really be said to be supported. There is also faked support for \z, \k, \!, \a, \d, \r, \u.

The glyph set includes the Greek alphabet (minuscule and majuscule), the groff Latin-1 characters, the "registered", "copyright", and "trademark" glyphs, and much of the classic troff glyph set.

TODO: CHARACTERIZE THE MAN2HTML GLYPH SET BETTER

man2html:

This is not the C program that the KDE browser is based on, but a crude Perl script that seems to have been written in the mid-1990s and last modified in 2003. There is a Savannah project page, dormant, here: <http://savannah.nongnu.org/projects/man2html/>.

No glyph, escape, or register support at all. It's a good thing this has been obsolesced by more recent converters or it would choke the portable subsets of those right down to nothing.

(This is the man2html I was thinking about when I dismissed its translations as "crappy". I was right... :-))

Xman:

TODO: FIND OUT WHAT XMAN DOES

TKman:

TKman relies on nroff to format pages, then analyzes the generated ASCII looking for section headers, references to manual pages, and other cliches. It does no interpretation of troff markup itself, and is thus not a constraint on the set of portable features.

Rosetta/PolyglotMan:

TODO: FIND OUT WHAT POLYGLOTMAN DOES
------------------------------------------------------------------------------

Once I have the set of portable man-page constructs well characterized, I intend to develop a set of patches for the groff distribution that will do the following:

1) Trim the groff manual pages so they use only the portable subset, plus the .SY and .OP macros that Werner and I have characterized.

2) Add a section on portable *roff requests to groff_man(7), including the recommendation to define .SY, .OP, .EX/.EE and .DS/.DE locally for a while until the new man macros have time to propagate everywhere.

3) Add definitions of .EX/.EE and .DS/.DE to the man macros.

While I am doing these things, I will also be upgrading doclifter in various ways:

1) The next feature to go in will be the ability to recognize ad-hoc tables made with .nf/.ta/.fi and compile them into DocBook table markup.

2) doclifter will be taught to recognize .SY and .OP.

3) I plan to add a validator option to doclifter that will issue warnings on use of any request, escape, or register not in the defined portable set. By this means, man-page authors will be able to conveniently check the conformance of their pages to the portable set.

I want to get these patches out in a 1.20 release in time to make the Fedora 7 development freeze in late January.

Yes, I know, Bernd Warken is in love with the hyperextended macros in groffer.1 and elsewhere, and will go ballistic. Too bad for him; we've established that they break too much software to live.

Are there any other objections, either substantive or procedural, to this work plan? Any constructive criticism or discussion I have not incorporated?

This is going to be a lot of work. There are things I could use help with:

1) I don't have to be the one to implement .SY/.OP/.EX/.EE/.DS/.DE in an-old.tmac; someone else could do that.

2) Any help in filling out the TODOs in the above draft would speed things up measurably. Every hour I don't have to spend on research others could do will be spent on related things only I am presently qualified to work on, like doclifter internals. Gunnar? Anybody?

Here are three related tasks not on my schedule:

1) Once we know what the portable set is, groff itself should issue warnings when a man page uses a non-portable feature. This should be taken on by somebody who understands groff internals better than I do.

2) Patches for .SY/.OP/.EX/.EE/.DS/.DE support should be developed for the KDE help browser and shipped as soon as possible.

3) .SY/.OP/.EX/.EE/.DS/.DE will also be needed in Heirloom troff. This one is pretty obviously Gunnar's baby.

Open issues for discussion:

1) In defining the portable subset, do we want to take a conservative approach that embraces only the intersection of the feature sets of all viewers, or set a floor based on the capabilities of respectable modern viewers like the KDE help browser? In practice, this question comes down to whether we're going to bless Latin-1 as a portable character set and the groff glyphs mapping to Latin-1 as portable. I favor setting a floor that includes Latin-1.

2) When, in the portable-subset description, can we say that .EX/.EE, .SY/.OP, and .DS/.DE should be considered portable and no longer need local definitions? I think two years from when we ship 1.20 seems reasonable. That would give groff-1.20, (hypothetical) KDE help-browser patches, and an update of Heirloom troff time to propagate.
-- 
<a href="http://www.catb.org/~esr/">Eric S. Raymond</a>

_______________________________________________
Groff mailing list
Groff@gnu.org
http://lists.gnu.org/mailman/listinfo/groff