Niko Tyni <[EMAIL PROTECTED]> writes:

> Hm, this is looking worse the more I stare at it.

I spent four and a half hours on this the other night before producing
the patch that was in my previous message, so I'm sympathetic.  :)  It
gets to be more and more of a headache the more you work through it.

> I've been testing pod2man with the attached .pod file that does have
> '=encoding UTF-8', and the current Debian (from 5.10.0-15) 'pod2man
> --utf8' gives these results:
>
> - the Finnish "a with two dots", i.e. LATIN SMALL LETTER A WITH DIAERESIS,
>   is output as its ISO-8859-1 representation (octal 344)
>
> - the Russian letter "n", CYRILLIC SMALL LETTER EN, is output in UTF-8:
>   octal 320+275. However, there's a warning:
>
>     Wide character in print at /usr/share/perl/5.10/Pod/Man.pm line 717.
>
> - S<one two> gets the ISO-8859-1 NO-BREAK SPACE in between
>
> So the output is ISO-8859-1 where possible and UTF-8 elsewhere.

I was afraid of this.

The problem is that the version of Pod::Man that you have at the moment
doesn't understand anything about output encoding.  It therefore prints
out whatever Pod::Simple hands it.  This is, in Perl's Unicode world,
basically unsupported behavior.  What you get can be fairly random.  It
works in some cases but doesn't work in others.

This is the reason why I thought I needed to do things like remap the
non-breaking space.  The output is very confused, and I didn't understand
at first what I was seeing.

If one is dealing with Unicode in Perl, one is *required* to decode all
input and encode all output.  Nothing else works.  Pod::Simple does
decode input *if* =encoding is used, but doesn't encode output.  Pod::Man
(and Pod::Text for that matter) therefore have to encode output in order
to work properly.  The patch I sent previously does implement that, with
some other consequences.

One of the problems that makes this unnecessarily hard is that Perl
doesn't keep track of whether it's *already* encoded output, so if you
set an output encoding with binmode and also call encode() directly, you
get double-encoded output.
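
A minimal sketch of that double-encoding problem, not taken from Pod::Man
itself.  It assumes an in-memory file handle as a stand-in for the real
output file, and the helper hex_bytes() is a hypothetical name used only
for this illustration:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Render a byte string as space-separated hex values.
# (hex_bytes is a hypothetical helper for this illustration.)
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf('%02x', ord($_)) } split(//, $bytes);
}

# U+00E4 LATIN SMALL LETTER A WITH DIAERESIS as a decoded character string.
my $text = "\x{e4}";

# Encode once by hand: the result is the two bytes 0xC3 0xA4.
my $once = encode('UTF-8', $text);

# Print the already-encoded bytes through a handle that also carries an
# encoding layer.  Perl cannot tell that the string is already encoded,
# so it treats each byte as a character and encodes it again.
my $twice = '';
open(my $fh, '>:encoding(UTF-8)', \$twice) or die "open failed: $!";
print {$fh} $once;
close($fh) or die "close failed: $!";

print "encoded once:  ", hex_bytes($once),  "\n";
print "encoded twice: ", hex_bytes($twice), "\n";
```

Run as written, this should print "c3 a4" for the single encoding and
"c3 83 c2 a4" for the double encoding: each byte of the first result has
been re-encoded as if it were a Latin-1 character.
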
This basically means that, in practice, encode() is unusable if you want
to support the PERL_UNICODE environment variable, since setting
PERL_UNICODE silently adds output encodings to all your file handles
which will then happily double-encode the results of encode().

> Russ, I think the binmode($output, ":utf8") really belongs in pod2man
> instead of Pod::Man.

It turns out, at least based on the experiments that I did, that you
never want to use an encoding of :utf8.  What this does is tell Perl to
just dump its internal encoding to the file handle rather than applying
any encoding.  The only supported thing you can do with that byte stream
is to read it back in via another file handle using the :utf8 encoding.
It is *not* necessarily valid UTF-8, and in practice I was getting all
sorts of really strange things from it when looking at it via something
other than Perl.

You always want to use :encoding(utf-8) instead if the output is for
anything other than Perl.

> Users of Pod::Man should do that themselves for their output file handle
> when they use the 'utf8' option. (This needs documentation, of course.)

I'm not sure I like this as an interface since Pod::Man's supported
interface involves opening the files itself.  This would mean that anyone
who wants Unicode output can't use the API of Pod::Man and Pod::Text that
have been supported for years.  I'd really rather try to transparently
support Unicode using the existing API, even if it means messing with the
state of provided output file handles.

> However, pod2man currently uses the parse_from_file() method, which is
> just a compatibility wrapper in Pod::Simple that does the open() and
> output_fh() calls. I suppose this should go in pod2man itself.
> Something like the attached patch might do, although I see there's some
> deeper magic in Pod::Simple.

This patch looks fine to me as a workaround, although I think my previous
patch is the better long-term fix.

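
For reference, the mixed ISO-8859-1/UTF-8 output described earlier, and
the effect of putting an :encoding(UTF-8) layer on the output handle, can
be reproduced outside Pod::Man with a sketch along these lines.  It
assumes PERL_UNICODE is unset and uses in-memory file handles in place of
real files; octal_bytes() and write_with() are hypothetical helpers, not
part of any module:

```perl
use strict;
use warnings;

# Collect warnings instead of sending them to STDERR.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, $_[0] };

# Render a byte string as space-separated octal values.
sub octal_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf('%03o', ord($_)) } split(//, $bytes);
}

# Print each string with its own print call to an in-memory file opened
# with the given mode, and return the bytes that were written.
sub write_with {
    my ($mode, @strings) = @_;
    my $buffer = '';
    open(my $fh, $mode, \$buffer) or die "open failed: $!";
    print {$fh} $_ for @strings;
    close($fh) or die "close failed: $!";
    return $buffer;
}

my $a_diaeresis = "\x{e4}";     # LATIN SMALL LETTER A WITH DIAERESIS
my $cyrillic_en = "\x{43d}";    # CYRILLIC SMALL LETTER EN

# No output layer: Perl writes whatever representation it happens to have.
my $raw = write_with('>', $a_diaeresis, $cyrillic_en);

# Explicit encoding layer: every character is encoded exactly once.
my $encoded = write_with('>:encoding(UTF-8)', $a_diaeresis, $cyrillic_en);

print "no layer:         ", octal_bytes($raw),     "\n";
print ":encoding(UTF-8): ", octal_bytes($encoded), "\n";
print "warnings seen:    ", scalar(@warnings),     "\n";
```

Without a layer this should give "344 320 275" plus one "Wide character
in print" warning, matching the results quoted above.  With the layer it
should give "303 244 320 275" and no further warning.
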
Note that Pod::Text has related issues; try running pod2text on your same
sample POD file and you'll see that it produces warnings about wide
characters as well.  I'm not sure if that's worth trying to tackle for
lenny, though (it affects perldoc -t).

-- 
Russ Allbery ([EMAIL PROTECTED])             <http://www.eyrie.org/~eagle/>