On Wed, Oct 01, 2008 at 02:10:53AM -0700, Russ Allbery wrote:
> Niko Tyni <[EMAIL PROTECTED]> writes:
> > Any estimate on how widespread this POD problem is? Is the hardcoded
> > 'pod2man --utf8' in the Lenny perldoc going to cause more grief than
> > it's worth?
> >
> > I'm leaning on reverting that and reopening #492037 until the issue is
> > sorted out in Pod-Perldoc upstream. Adding a way to enable or disable
> > the '--utf8' option on the perldoc command line is one possibility,
> > but it might as well cause even further trouble if upstream chooses a
> > different implementation.
>
> I looked at this some more, and there's a deeper problem. If you run the
> current pod2man with --utf8 on an input POD file that doesn't declare an
> =encoding of UTF-8, any use of S<> in that POD file will result in invalid
> UTF-8, even if there's no use of high-bit characters in the input POD at
> all.

Thanks for pointing out =encoding to me; I completely missed that in the
documentation.

> I think the core problem was that Pod::Man is responsible for the output
> through the file handle and was missing an encoding layer. The problem is
> that we can't just call encode() on the output, since that breaks if
> PERL_UNICODE is set or if an encoding was manually set on the file handle.
> You get double-encoding. I think the least bad option is for Pod::Man and
> Pod::Text to force the encoding on their output file handles to UTF-8 when
> --utf8 is given.
>
> The problem with this fix is that this now really will break pod2man
> --utf8 if POD documents don't have their encoding declared properly, since
> it will end up double-encoding the UTF-8 given that, without =encoding,
> Pod::Simple is treating the input as ISO 8859-15. I think it's correct
> according to the specifications, but existing POD text that doesn't
> declare an encoding will get double-encoded output. I can work around
> this by not setting a UTF-8 output encoding unless the input encoding is
> detected as UTF-8, but that's not really correct. You *should* be able to
> have an input POD document with =encoding ISO-8859-1 and run it through
> pod2man --utf8 and get UTF-8 output. But a POD document with no
> =encoding according to perlpodspec has an implicit =encoding ISO-8859-1.

While this is certainly something extra that people have to bear in mind
when using pod2man --utf8, it *is* an option people have to enable
manually (well, except for in perldoc; I suppose I'm more worried about
generated manual pages), and it doesn't seem too unreasonable to just say
that you have to specify =encoding when doing so. If that were mentioned
explicitly in the pod2man manual page then I think that would be good
enough.

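As a rough sketch of how one might find the affected files before turning
on --utf8 (the helper name pods_missing_encoding is made up for
illustration, and it only checks that an =encoding line exists, not that
the declared encoding matches the file's actual contents):

```shell
#!/bin/sh
# Sketch: list POD files that carry no =encoding declaration.
# pods_missing_encoding is a hypothetical helper name; it is not part of
# pod2man, Pod::Man or po4a.
pods_missing_encoding() {
    for pod in "$@"; do
        # =encoding is a POD command paragraph, so it has to start at the
        # beginning of a line.
        if ! grep -q '^=encoding' "$pod"; then
            printf '%s\n' "$pod"
        fi
    done
}

# Example invocation (the path is an assumption):
#   pods_missing_encoding man/*.pod
```

Each file name it prints is a document that, going by the perlpodspec
rule quoted above, has an implicit =encoding ISO-8859-1 and so would need
an explicit declaration before being passed to pod2man --utf8.
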
Assuming that your intent is to run with UTF-8 across the board, then
just sticking "=encoding UTF-8" at the top of all POD files before
passing them to pod2man is sufficient, and that's not too hard. The diff
to debconf looks like this:

Index: doc/Makefile
===================================================================
--- doc/Makefile	(revision 2310)
+++ doc/Makefile	(working copy)
@@ -4,6 +4,9 @@
 pod2man=pod2man -c Debconf -r '' --utf8
 manpages:
 	cd man && po4a po4a/po4a.cfg
+	for pod in man/*.pod; do \
+		perl -pi -e 'if (not $$seen and /^=head1/) { print "=encoding UTF-8\n\n"; $$seen = 1; }' $$pod; \
+	done
 	install -d man/gen
 	for num in 1 3 8; do \
 		find man -maxdepth 1 -type f -name "*.$$num.pod" -printf '%P\n' | \

I'd prefer to do this with a po4a addendum, but it turns out to be an
absolute pain. Also this would break if any of the source documents
contained S<>. Maybe I should just change all the source documents
instead.

Perhaps it would be helpful if po4a inserted an =encoding paragraph?
After all, it understands POD and it knows the encoding.

> I think that for lenny you may want to back out of the --utf8 change and
> give it some time to settle.

Hmm, this would be a shame. With your most recent patch it's now finally
possible for debconf to generate working manual pages for Russian and
French at the same time. I understand the perldoc problem though ...

-- 
Colin Watson [EMAIL PROTECTED]