On Tue, Sep 04, 2007 at 10:55:03AM +0200, Giacomo A. Catenazzi wrote:
> Colin Watson wrote:
> >On Mon, Sep 03, 2007 at 05:38:10PM +0200, Giacomo A. Catenazzi wrote:
> >>I don't like the proposal ;-)
> >>It is not very POSIXly and to application specific.
> >
> >Of course it is application-specific; /usr/share/man is
> >application-specific (i.e. specific to the man application). Methods of
> >processing /usr/share/man that don't use /usr/bin/man are already broken
> >in other ways. (man exports a number of specialised interfaces that can
> >be used by frontends, and I'm happy to add more on request.)
>
> But we have the same problem with info, with the HOWTO, with the
> doc, ....

Manual pages are different because:

  * They are not typically read directly, but via a toolset that is
    capable of dealing with such matters as encoding translation in a
    manner appropriate to the user's locale. In other words, we can
    safely recommend UTF-8 in the comfortable knowledge that it can be
    done transparently.

    info may share this property; I'm not sure because I'm not familiar
    with it at the implementation level and I haven't noticed it having
    much in the way of internationalisation support in general.

    I don't particularly object to recommending UTF-8 for HTML
    documentation as such, but it is clearly less convenient as you need
    to adjust the files themselves to declare a character set rather
    than just installing them in a different place.

    Other documentation is often read with a simple pager. UTF-8 is
    probably the most convenient encoding long-term in order that you
    can read documentation in more than one language without
    reconfiguring your software, but I imagine there is plenty of room
    for local exceptions here and it is certainly less clear.

  * As a general rule, manual pages are much better localised than other
    documentation. That is, they actually get localised. We may not be
    anywhere close to completion, but compare it to the other forms of
    documentation you mentioned: info has a handful of translations with
    a variety of naming conventions (is there any client support for
    selecting them automatically?), and random files in /usr/share/doc
    typically aren't localised or at best maybe have one or two
    translations (usually in the upstream author's native language). The
    only other form of documentation I'm aware of with a comparable
    level of localisation is the HOWTOs from the Linux Documentation
    Project.

  * Because our current groff implementation imposes quite strict
    restrictions on what input and output encodings are possible, and
    usually needs to know detailed information about these encodings in
    order to achieve correct typography, it is if anything more
    important than usual for man to have an accurate idea of the
    document's character set.

  * Because manual page encoding is specified by means of file system
    location, and because only a strict subset of the file system is
    allowed, it is important for policy to specify how this is to be
    handled across many packages for interoperability, more so than for
    forms of documentation where file system location is immaterial.

> For this reason, I would like a general policy and solution.
> (The /usr/share/man then it would a follow-up policy)
>
> Or there is fewer problem on other docs?

I don't think it's really reasonable or necessary to create a general
policy covering both /usr/share/man and other documentation in a single
piece of text. The requirements are too different, and several different
documentation formats have their own special requirements and need to
move at their own pace. Current policy wisely does not attempt to treat
them as a single unit, but has subsections for the two major specialised
formats (man and info).

> >POSIX does not specify anything about the layout of /usr/share/man. The
> >FHS makes an attempt, but it's horribly broken (speaking as one who has
> >attempted to implement it), predates widespread deployment of UTF-8, and
> >does not really help with the problem to hand anyway.
>
> Yes, I saw (and there are some strange consideration), but I meant:
> POSIX define locales and how application use locales.
> If we convert manpages with UTF-8, I think we broke posix:
> the user can see wrong encoding.

No, you still don't understand. The conversion is only applied to the
source files, not what users see.

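To make that concrete, here is a minimal sketch in Python of the kind of
recoding I mean. It is an illustration only, not man-db's code (which is
written in C and works through groff). The function name
recode_for_locale and the sample text are made up for the example, and
it assumes the page source is stored as UTF-8.

```python
def recode_for_locale(source: bytes, locale_charset: str) -> bytes:
    """Recode UTF-8 page source into the charset of the reader's locale.

    Hypothetical helper for illustration; not part of man-db.
    """
    text = source.decode("utf-8")
    # Characters the target charset cannot represent are replaced with
    # "?" instead of raising an error.
    return text.encode(locale_charset, errors="replace")


# The same UTF-8 source serves readers in different locales.
page = "Día de la semana: miércoles\n".encode("utf-8")
latin1 = recode_for_locale(page, "iso-8859-1")  # reader in es_ES.ISO-8859-1
utf8 = recode_for_locale(page, "utf-8")         # reader in es_ES.UTF-8
```

The installed file is identical in both cases; only the bytes handed to
the reader's terminal differ.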
POSIX does not impose requirements on the encoding of applications'
data files: each file clearly has to have an encoding, and an
application that can know what encoding is in use and convert it to the
user's locale is clearly doing a better job than one that can't.

> But I was thinking to a possible over-engineering: manpages that
> explain output of the program: the output in an ideal world should
> be written in the user locale (number and dates).

You mean the LC_NUMERIC and LC_TIME locale categories? There is no
support for this in groff and I think this is unlikely to happen. As you
suggest yourself, this is overengineering; a manual page is probably
better advised to explain in prose, as it's not at all impossible for a
user to look at a manual page in a different locale.

In any case, I would appreciate it if you didn't turn this proposal,
which is purely about encodings, into a general debate about wishlists
for locale handling in manual pages.

> So in the policy I would mention the possible triplets
> (for application reading the files),

Triplets? Do you mean language[_territory][.codeset]? Just say "locales"
rather than inventing a new term.

I'm not sure what you want to be mentioned, though. Are you looking for
a complete specification of the possible subdirectory names under
/usr/share/man? Perhaps it would be better to document that in man-db,
and leave policy to recommend the best choice rather than document all
possible choices. After all, the policy group's job is to take
decisions.

> >>It is confusing the "legacy (non-UTF-8) character".
> >
> >Yes, it is, but it is current practice and I merely document it. If we
> >were starting from scratch with the benefit of hindsight then obviously
> >we wouldn't have done it this way.
> >
> >I think it's unambiguous for all languages where we actually have
> >existing manual pages to worry about.
>
> I don't like the wording.
> Now it seems that UTF-8 is superior
> to other encoding, but we should take UTF-8 as the ultimate
> encoding. I propose a simple "non-UTF-8 character".
> Anyway this is a very minor point.

I'm not sure this is the right place to debate UTF-8's superiority to
earlier 8-bit encodings such as ISO-8859-1 or the double-byte character
sets. I think it's self-evident while it's not clear that you do, and
this doesn't seem like the place to reach agreement on that. I also
don't think in this case that we need to be afraid to adopt the best
available encoding now for fear that a better one might come along
later; should that happen, we can simply move along gradually to it and
have man recode on the fly, just as I'm proposing we do here.

Sure, we can say "non-UTF-8" rather than "legacy", though I think policy
should be unafraid to take a strong stance on this. I borrow the
"legacy" term from Unicode advocates such as Markus Kuhn. I think it's
quite an accurate and justified description of the encodings that are
only useful for one or a small number of languages.

> >>>  3. man-db 2.5.0-1 moves into testing.
[...]
> >I should clarify that /usr/share/man/<ll>.UTF-8/ will be used by man for
> >all <ll>* locales, not merely for those where the user requested UTF-8;
> >man will recode to the appropriate character set on the fly.
[...]
> "man will recode to the appropriate character set on the fly.",
> so on point 3, you should mention also a new "man" version.

"3. man-db 2.5.0-1 moves into testing."

  $ ls -l /usr/bin/man
  lrwxrwxrwx 1 root root 17 2007-08-26 23:29 /usr/bin/man -> ../lib/man-db/man
  $ dpkg -S /usr/lib/man-db/man
  man-db: /usr/lib/man-db/man

This is the second time in this thread that you've apparently forgotten
to do basic fact-checking before posting. Could you please adjust your
behaviour here? This is getting a little tedious.

> I like UTF-8, but I don't like that we set UTF-8 as
> predefinite debian encoding.
> And in such case, I would set a default policy (not only
> for manpages, for debian/changelog, ...).

Policy is already moving in the direction of a default here. See the
footnote to section C.2.2 (which recommends UTF-8 for changelogs):

  I think it is fairly obvious that we need to eventually transition to
  UTF-8 for our package infrastructure; it is really the only sane
  char-set in an international environment. Now, we can't switch to
  using UTF-8 for package control fields and the like until dpkg has
  better support, but one thing we can start doing today is requesting
  that Debian changelogs are UTF-8 encoded. At some point in time, we
  can start requiring them to do so.

> Anyway, IIRC there was some negative comment about email
> in UTF-8, in the discussion about DPL vote and wrong
> MUA handling of signed UTF-8 vote.

E-mail is a difficult case because some mail user agents are stuck in a
bygone age, but that is not comparable to the case of a tree of files
for use essentially by a single program under our clear control.

I don't wish to be arrogant here, but I have six years of practical
experience implementing this kind of stuff in man-db (obviously with
lots of help from experts in particular languages etc.). I do not want
to deal with speculative worries that aren't even about the same
subsystem. For the purposes of this proposal, please restrict your
concerns to real examples regarding manual pages, not half-remembered
comments about e-mail.

> Do you think it is feasible to convert manpage on UTF-8,
> from the non-latin alphabet?
> For this point we should see commentary on i18n list

Yes, I do. The Debian CJK patch to groff already implements CJK
encodings (the only case that presents any kind of problem here, to my
knowledge) by converting them to UCS-2 internally and then back to the
source encoding for output. If there is a problem with the conversion,
which as far as I have heard there is not right now, then we would
already be encountering it.

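Anyone who wants to test a particular page can check the round trip
directly. The following is a rough sketch using Python's codecs, not the
groff patch itself, so it only shows whether the mapping between an
encoding and Unicode is lossless for the given bytes. The helper name
round_trips is hypothetical, and EUC-JP is picked purely as an example
of a CJK encoding.

```python
def round_trips(data: bytes, encoding: str) -> bool:
    """Return True if data survives encoding -> Unicode -> encoding intact.

    Hypothetical helper for illustration only.
    """
    try:
        return data.decode(encoding).encode(encoding) == data
    except UnicodeError:
        # The bytes are not valid in this encoding, or the decoded text
        # cannot be encoded back.
        return False


# Sample text as it might appear in a Japanese page stored as EUC-JP.
sample = "日本語のマニュアル".encode("euc_jp")
lossless = round_trips(sample, "euc_jp")
```

Running this over the bytes of an existing page, with that page's
current encoding, would flag any file that needs attention before
conversion.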
The only other non-Latin encoding currently supported by man-db in
Debian is KOI8-R. Since it's a simple 8-bit encoding, I doubt there is
any kind of round-trip problem with Unicode, and I have not heard of
one.

Though the CC hasn't been preserved, I CCed debian-i18n on my initial
bug report, so I hope they're aware of this proposal. I have reinstated
the CC here.

> >>So I propose that manpage specify a charset (i.e. not using the defaul
> >>local with only the language (and territory)).
> >
> >That is what I'm doing here. The character set named in the directory
> >name specifies the encoding for all manual pages installed under that
> >directory; it does not mandate that only users of that character set may
> >use these manual pages. (I understand your confusion since this is not
> >what is implemented in current man-db, but frankly that implementation
> >doesn't benefit anyone.)
>
> But you propose only "UTF-8" encoding.

I propose that policy should standardise that we move to using UTF-8 as
the source encoding for all manual pages since it clearly makes sense to
do so. This will still need to be specified by each manual page (by
means of the directory in which it is installed), and it does *not*
affect what user locales are supported in any way.

The internationalisation changes in man-db 2.5.0 will arrange for users
to see pages in their native language when they did not before; I do not
expect it to cause any users to fail to see pages in their native
language when they previously did. Once man-db 2.5.0 is in place, the
change in policy to recommend installing pages with UTF-8 encoding in a
properly marked directory will have *no* effect on users, no matter what
their locale. It is purely for improved maintenance of the system.

> Unfortunately Debian is no more the upstream of man-db.

Excuse me! I'm sorry, but on this point you seem to be quite rude. *I*
am the upstream for man-db, and I do so wearing my Debian developer hat
and using my @debian.org address.
After Fabrizio's death in 2001, when I took over as Debian maintainer of
man-db, I contacted Graeme Wilford informing him of my wish to take over
as upstream; I received a reply in mid-April giving me permission. I
released man-db 2.3.18 in May 2001, and since then have made seven
further upstream releases, the last one being in February of this year.
I use the Debian bug tracking system for upstream purposes, typically
take account of Debian release cycles when doing upstream development,
and upload new upstream versions to Debian promptly. The only thing I
don't do is use the native packaging format, which was really never a
particularly good idea for man-db and which I don't find helpful in this
case.

If I as a Debian developer am not the upstream maintainer for man-db, I
should very much like to know who is. Please retract this misstatement.
The most cursory examination of /usr/share/doc/man-db/copyright would
have overturned it. What was the point of saying that, anyway?

> In summary, now I'm ok with your proposal.
> I don't like the "hardcoded" UTF-8, and I'm not sure that
> an automatic conversion is featible for some non latin alphabet.
> But it is the only clean and reasonable solution.

Thanks. I hope that my comments above clarify some further confusion. I
would still appreciate concrete information and examples on why you
don't like the idea of manual pages being installed in UTF-8 (noting
that as a package maintainer or a translator you wouldn't have to
actually edit it in that encoding if you didn't want to, it doesn't have
to be done urgently or on any kind of flag day, I have addressed the
non-Latin concern above, and it will not have a negative effect on users
of non-UTF-8 locales).

Regards,

--
Colin Watson                                       [EMAIL PROTECTED]