Niko Tyni <[EMAIL PROTECTED]> writes:

> Hm, this is looking worse the more I stare at it.

I spent four and a half hours on this the other night before producing the
patch that was in my previous message, so I'm sympathetic.  :)  It gets to
be more and more of a headache the more you work through it.

> I've been testing pod2man with the attached .pod file that does have
> '=encoding UTF-8', and the current Debian (from 5.10.0-15) 'pod2man
> --utf8' gives these results:
>
> - the Finnish "a with two dots", i.e. LATIN SMALL LETTER A WITH DIAERESIS,
>   is output as its ISO-8859-1 representation (octal 344)
>
> - the Russian letter "n", CYRILLIC SMALL LETTER EN, is output in UTF-8: 
>   octal 320+275. However, there's a warning:
>
> Wide character in print at /usr/share/perl/5.10/Pod/Man.pm line 717.
>
> - S<one two> gets the ISO-8859-1 NO-BREAK SPACE in between
>
> So the output is ISO-8859-1 where possible and UTF-8 elsewhere.

I was afraid of this.

The problem is that the version of Pod::Man that you have at the moment
doesn't understand anything about output encoding.  It therefore prints
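out whatever Pod::Simple hands it.  This is, in Perl's Unicode world,
basically unsupported behavior.  What you get depends on each string being
printed: a string with no character above U+00FF goes out as ISO-8859-1
bytes, and anything else goes out as UTF-8 along with the wide character
warning.  So it works in some cases but doesn't work in others.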
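
Here is a minimal sketch of that behavior, using only core Perl.  It is
not taken from Pod::Man; raw_bytes_of is a hypothetical helper that prints
a string to an in-memory file handle with no encoding layer and reports
the resulting bytes in octal, to match the numbers quoted above.

```perl
use strict;
use warnings;

# Hypothetical helper: capture what print() writes to a handle that has
# no encoding layer, and return the bytes as space-separated octal.
sub raw_bytes_of {
    my ($string) = @_;
    my $buffer = '';
    open(my $fh, '>', \$buffer) or die "cannot open in-memory handle: $!";
    {
        no warnings 'utf8';    # silence "Wide character in print"
        print {$fh} $string;
    }
    close($fh);
    return join(' ', map { sprintf('%03o', ord($_)) } split(//, $buffer));
}

my $a_diaeresis = "\x{E4}";     # LATIN SMALL LETTER A WITH DIAERESIS
my $cyrillic_en = "\x{43D}";    # CYRILLIC SMALL LETTER EN

print raw_bytes_of($a_diaeresis), "\n";    # one ISO-8859-1 byte: 344
print raw_bytes_of($cyrillic_en), "\n";    # two UTF-8 bytes: 320 275
```

Without the "no warnings" line, the second print is what triggers the
"Wide character in print" warning.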

This is the reason why I thought I needed to do things like remap the
non-breaking space.  The output is very confused, and I didn't understand
at first what I was seeing.

If one is dealing with Unicode in Perl, one is *required* to decode all
input and encode all output.  Nothing else works.  Pod::Simple does decode
input *if* =encoding is used, but doesn't encode output.  Pod::Man (and
Pod::Text for that matter) therefore have to encode output in order to
work properly.  The patch I sent previously does implement that, with some
other consequences.
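
As an illustration of the decode-on-input, encode-on-output rule, here is
a small sketch using the core Encode module.  The sample bytes and
variable names are my own assumptions for the example and have nothing to
do with how Pod::Simple or the patch is actually written.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Bytes as they would arrive from a UTF-8 file: a-diaeresis, Cyrillic en.
my $input_bytes = "\xC3\xA4\xD0\xBD";

# Decode on the way in: bytes become Perl characters.
my $text = decode('UTF-8', $input_bytes);

# Inside the program, work on characters rather than bytes.
my $char_count = length($text);    # 2 characters, although 4 bytes came in

# Encode on the way out: characters become bytes again.
my $output_bytes = encode('UTF-8', $text);

print "characters: $char_count\n";
print "round trip ok\n" if $output_bytes eq $input_bytes;
```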

One of the problems that makes this unnecessarily hard is that Perl
doesn't keep track of whether output has *already* been encoded, so if you
set an output encoding with binmode and also call encode() directly, you
get double-encoded output.  This basically means that, in practice,
encode() is unusable if you want to support the PERL_UNICODE environment
variable, since setting PERL_UNICODE silently adds output encodings to all
your file handles, which will then happily double-encode the results of
encode().
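
The double encoding is easy to reproduce.  The following is a sketch, not
code from the patch: write_through_layer and hex_of are hypothetical
helpers, and the in-memory file handle with an :encoding(UTF-8) layer
stands in for a handle that had its encoding set by binmode or by
PERL_UNICODE.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: print data to a handle that already has an
# output encoding layer, and return the bytes that were written.
sub write_through_layer {
    my ($data) = @_;
    my $buffer = '';
    open(my $fh, '>:encoding(UTF-8)', \$buffer)
        or die "cannot open in-memory handle: $!";
    print {$fh} $data;
    close($fh);
    return $buffer;
}

# Hypothetical helper: show bytes as space-separated hex.
sub hex_of {
    my ($bytes) = @_;
    return join(' ', map { sprintf('%02X', ord($_)) } split(//, $bytes));
}

my $text = "\x{E4}";    # LATIN SMALL LETTER A WITH DIAERESIS

# Encoded once, by the layer alone.
my $once = write_through_layer($text);

# Encoded twice: encode() yields the bytes C3 A4, and the layer then
# treats each byte as a character and encodes it again.
my $twice = write_through_layer(encode('UTF-8', $text));

print hex_of($once), "\n";     # C3 A4
print hex_of($twice), "\n";    # C3 83 C2 A4
```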

> Russ, I think the binmode($output, ":utf8") really belongs in pod2man
> instead of Pod::Man.

It turns out, at least based on the experiments that I did, that you never
want to use an encoding of :utf8.  What this does is tell Perl to just
dump its internal encoding to the file handle rather than applying any
encoding.  The only supported thing you can do with that byte stream is to
read it back in via another file handle using the :utf8 encoding.  It is
*not* necessarily valid UTF-8, and in practice I was getting all sorts of
really strange things from it when looking at it via something other than
Perl.

You always want to use :encoding(utf-8) instead if the output is for
anything other than Perl.
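
One way to see the difference between the lax and strict flavors is
through the Encode module, which names them "utf8" and "UTF-8"
respectively.  This is only a sketch: as far as I understand it,
:encoding(utf-8) goes through the strict encoder while the :utf8 layer
performs no such checks, but the example below exercises Encode directly
rather than the layers themselves.

```perl
use strict;
use warnings;
use Encode qw(encode find_encoding FB_CROAK LEAVE_SRC);

# Encode resolves the two spellings to different encodings.
my $lax_name    = find_encoding('utf8')->name;
my $strict_name = find_encoding('UTF-8')->name;
print "utf8  resolves to $lax_name\n";
print "UTF-8 resolves to $strict_name\n";

# A code point just past the end of the Unicode range.
my $beyond_unicode = chr(0x110000);

# The lax encoder accepts it; the strict encoder refuses it.
my $lax_ok = eval {
    encode('utf8', $beyond_unicode, FB_CROAK | LEAVE_SRC);
    1;
};
my $strict_ok = eval {
    encode('UTF-8', $beyond_unicode, FB_CROAK | LEAVE_SRC);
    1;
};

print "lax encoder accepted it\n"    if $lax_ok;
print "strict encoder rejected it\n" if !$strict_ok;
```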

> Users of Pod::Man should do that themselves for their output file handle
> when they use the 'utf8' option. (This needs documentation, of course.)

I'm not sure I like this as an interface since Pod::Man's supported
interface involves opening the files itself.  This would mean that anyone
who wants Unicode output can't use the API of Pod::Man and Pod::Text that
have been supported for years.  I'd really rather try to transparently
support Unicode using the existing API, even if it means messing with the
state of provided output file handles.
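
To make "messing with the state of provided output file handles" a bit
more concrete, here is one possible shape for it.  This is purely a
sketch under my own assumptions, not what the patch does:
ensure_utf8_output is a hypothetical helper that uses the core
PerlIO::get_layers function to add an encoding layer only when the handle
doesn't already have one, which sidesteps the double encoding.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Pod::Man: add a UTF-8 encoding layer
# to a caller-provided handle unless one is already in place.  Returns 1
# if a layer was added and 0 if the handle was left alone.
sub ensure_utf8_output {
    my ($fh) = @_;
    my @layers = PerlIO::get_layers($fh);
    return 0 if grep { $_ eq 'utf8' || /^encoding\(/ } @layers;
    binmode($fh, ':encoding(UTF-8)')
        or die "cannot set encoding layer: $!";
    return 1;
}

my $buffer = '';
open(my $out, '>', \$buffer) or die "cannot open in-memory handle: $!";

my $added_first  = ensure_utf8_output($out);    # no layer yet, so add one
my $added_second = ensure_utf8_output($out);    # layer present, do nothing

print {$out} "\x{E4}";    # encoded exactly once, by the layer
close($out);
```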

> However, pod2man currently uses the parse_from_file() method, which is
> just a compatibility wrapper in Pod::Simple that does the open() and
> output_fh() calls. I suppose this should go in pod2man itself.
> Something like the attached patch might do, although I see there's some
> deeper magic in Pod::Simple.

This patch looks fine to me as a workaround, although I think my previous
patch is the better long-term fix.

Note that Pod::Text has related issues; try running pod2text on the same
sample POD file and you'll see that it produces warnings about wide
characters as well.  I'm not sure if that's worth trying to tackle for
lenny, though (it affects perldoc -t).

-- 
Russ Allbery ([EMAIL PROTECTED])             <http://www.eyrie.org/~eagle/>


