Bug#787821: libhtml-parser-perl: encode_entities() convert chars to Ã instead of their proper entity

Damyan Ivanov Fri, 05 Jun 2015 05:30:47 -0700

-=| Mathieu Roy, 05.06.2015 13:35:24 +0200 |=-
> Package: libhtml-parser-perl
> Version: 3.71-1+b3
> Severity: important
> 
> Hello,
> 
> According to http://search.cpan.org/dist/HTML-Parser/lib/HTML/Entities.pm
> 
> 
>  use HTML::Entities;
>  $input = "vis-à-vis Beyoncé's naïve\npapier-mâché résumé";
>  print encode_entities($input), "\n"
> 
> prints
> 
>  vis-&agrave;-vis Beyonc&eacute;'s na&iuml;ve
>  papier-m&acirc;ch&eacute; r&eacute;sum&eacute;
> 
> 
> That's correct.
> 
> 
> However, here:
> 
>   $ cat test.pl 
> #!/usr/bin/perl
> 
> use HTML::Entities;
> $input = "vis-à-vis Beyoncé's naïve\npapier-mâché résumé";
> print encode_entities($input), "\n"
> 
> # EOF 
> 
>   $ perl test.pl 
> vis-&Atilde;&nbsp;-vis Beyonc&Atilde;&copy;&#39;s na&Atilde;&macr;ve
> papier-m&Atilde;&cent;ch&Atilde;&copy; r&Atilde;&copy;sum&Atilde;&copy;


I can confirm that. However, adding "use utf8;" to the test script 
fixes the output. So it seems to me that your test file is encoded in 
UTF-8 and you need to tell perl so.

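HTML::Entities encodes characters, so its output depends on how perl 
interprets the source text. Without an explicit 'use utf8', perl 
treats each byte of the source as one Latin-1 character, so the two 
UTF-8 bytes of "é" (0xC3 0xA9) are seen as "Ã" and "©". I think that 
is what leads to the garbage above.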

If you recode the test file in latin1, everything will work as 
expected, since latin1 is the default encoding.
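
The byte-by-byte reading described above can be seen without 
HTML::Entities at all. The following is only a minimal sketch using 
the core Encode module; the helper name codepoints() is made up for 
illustration and is not part of any module.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: list the code points of a string's characters.
sub codepoints {
    my ($string) = @_;
    return join ' ', map { sprintf('U+%04X', ord($_)) } split //, $string;
}

# The two bytes UTF-8 uses for "é", written as escapes so that the
# encoding of this source file does not matter.
my $bytes = "\xC3\xA9";

# Taken byte by byte (the Latin-1 reading): two characters, "Ã" and "©".
print "as Latin-1: ", codepoints($bytes), "\n";                   # U+00C3 U+00A9

# Decoded as UTF-8: the single intended character, "é".
print "as UTF-8:   ", codepoints(decode('UTF-8', $bytes)), "\n";  # U+00E9
```

The first line of output matches the &Atilde;&copy; pairs in the 
report: U+00C3 is Ã and U+00A9 is ©.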

> Where do these &Atilde; come from?
> According to http://www.w3schools.com/charsets/ref_html_entities_4.asp it's 
> for Ã.
> 
> I tested the same script on Debian stable and on some Ubuntu, with the 
> exact same result.
> 
> I don't know what I'm doing wrong here, but a simple copy/paste of the 
> documented example does not work.

I guess the documentation needs 'use utf8;' somewhere or maybe 
something more generic, since the same text may be encoded in latin1.
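
One possible shape for "something more generic" is to decode the 
input explicitly before encoding entities, so the result does not 
depend on how the script itself is saved. This is a sketch under that 
assumption, not a proposed documentation patch; entities_from_bytes() 
is a hypothetical helper, not something HTML::Entities provides.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);
use HTML::Entities qw(encode_entities);

# Hypothetical helper (not part of HTML::Entities): turn raw bytes in a
# named encoding into characters, then entity-encode those characters.
sub entities_from_bytes {
    my ($bytes, $encoding) = @_;
    my $chars = decode($encoding, $bytes);
    return encode_entities($chars);
}

# "résumé" as UTF-8 bytes and as Latin-1 bytes, written as escapes so
# that the encoding of this source file does not matter.
print entities_from_bytes("r\xC3\xA9sum\xC3\xA9", 'UTF-8'), "\n";       # r&eacute;sum&eacute;
print entities_from_bytes("r\xE9sum\xE9",         'ISO-8859-1'), "\n";  # r&eacute;sum&eacute;
```

Both calls give the same entities because encode_entities() only ever 
sees decoded characters, whichever encoding the bytes arrived in.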

> Other similar commands work as expected. For instance:
> 
> echo "vis-à-vis Beyoncé's naïve\npapier-mâché résumé" | recode utf8..html
> vis-&agrave;-vis Beyonc&eacute;'s na&iuml;ve\npapier-m&acirc;ch&eacute; 
> r&eacute;sum&eacute;
> 
> 
> 
> 
> Plus, as a side bug (does it require a report of its own?),
> man HTML::Entities prints
> 
>    For example, this:
> 
>         $input = "vis-a-vis Beyonce's naieve\npapier-mache resume";
>         print encode_entities($input), "\n"
> 
>        Prints this out:
> 
>         [...]
> 
> Yes, the man page example is actually stripped of entities to encode!

Not sure where the problem is here. perldoc works fine:

 perldoc HTML::Entities

pod2man /usr/lib/x86_64-linux-gnu/perl5/5.20/HTML/Entities.pm 
generates stuff like:

 \& $input = "vis\-a\*`\-vis Beyonce\*'\*(Aqs 
 nai\*:ve\enpapier\-ma\*^che\*' re\*'sume\*'";

That, I guess, is *roff notation for the accents.

Adding --utf8 seems to get it right:

 pod2man --utf8 /usr/lib/x86_64-linux-gnu/perl5/5.20/HTML/Entities.pm \
     |   man -l -


