-=| Mathieu Roy, 05.06.2015 13:35:24 +0200 |=-
> Package: libhtml-parser-perl
> Version: 3.71-1+b3
> Severity: important
>
> Hello,
>
> According to http://search.cpan.org/dist/HTML-Parser/lib/HTML/Entities.pm
>
>
> use HTML::Entities;
> $input = "vis-à-vis Beyoncé's naïve\npapier-mâché résumé";
> print encode_entities($input), "\n"
>
> print
>
> vis-&agrave;-vis Beyonc&eacute;'s na&iuml;ve
> papier-m&acirc;ch&eacute; r&eacute;sum&eacute;
>
>
> That's correct.
>
>
> However, here:
>
> $ cat test.pl
> #!/usr/bin/perl
>
> use HTML::Entities;
> $input = "vis-à-vis Beyoncé's naïve\npapier-mâché résumé";
> print encode_entities($input), "\n"
>
> # EOF
>
> $ perl test.pl
> vis-&Atilde;&nbsp;-vis Beyonc&Atilde;&copy;'s na&Atilde;&macr;ve
> papier-m&Atilde;&cent;ch&Atilde;&copy; r&Atilde;&copy;sum&Atilde;&copy;

I can confirm that. However, adding "use utf8;" to the test script fixes
the output.

So it seems that your test file is encoded in UTF-8 and you need to tell
perl so. HTML::Entities encodes characters, so its output depends on how
perl interprets the source text. Without an explicit 'use utf8' the
source is treated as Latin-1, so every byte of a multi-byte UTF-8
sequence becomes a character of its own, which I think is what leads to
the garbage above.

If you recode the test file to Latin-1, everything will work as
expected, since Latin-1 is the default source encoding.

> Where do these &Atilde; come from?
> According to http://www.w3schools.com/charsets/ref_html_entities_4.asp it's
> for Ã.
>
> I tested the same script on a debian stable and on some ubuntu with the exact
> same result.
>
> I dont know what I'm doing wrong here but a simple copy/paste of the
> documented example does not work.

I guess the documentation should mention 'use utf8;' somewhere, or maybe
something more generic, since the same text may just as well be encoded
in Latin-1.

> Other similar commands work as expected. For instance:
>
> echo "vis-à-vis Beyoncé's naïve\npapier-mâché résumé" | recode utf8..html
> vis-&agrave;-vis Beyonc&eacute;'s na&iuml;ve\npapier-m&acirc;ch&eacute;
> r&eacute;sum&eacute;
>
>
>
>
> Plus, as a side bug (require a report on its own?),
> man HTML::Entities prints
>
> For example, this:
>
> $input = "vis-a-vis Beyonce's naieve\npapier-mache resume";
> print encode_entities($input), "\n"
>
> Prints this out:
>
> [...]
>
> Yes, the man page example is actually stripped of entities to encode!

Not sure where the problem is here. perldoc works fine:

 perldoc HTML::Entities

On the other hand,

 pod2man /usr/lib/x86_64-linux-gnu/perl5/5.20/HTML/Entities.pm

generates stuff like:

 \& $input = "vis\-a\*`\-vis Beyonce\*'\*(Aqs nai\*:ve\enpapier\-ma\*^che\*' re\*'sume\*'";

which I guess is *roff speak for accents. Adding --utf8 seems to get it
right:

 pod2man --utf8 /usr/lib/x86_64-linux-gnu/perl5/5.20/HTML/Entities.pm \
   | man -l -
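
To make the bytes-versus-characters point concrete, here is a minimal
sketch. It is not taken from the HTML::Entities documentation, and the
variable names are only illustrative. It assumes HTML::Entities and the
core Encode module are available. The accented letter is written with
\x escapes so that the script itself is pure ASCII and behaves the same
whatever encoding the editor saves it in.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);
use HTML::Entities;

# The UTF-8 encoding of "vis-à-vis": "à" is the two bytes 0xC3 0xA0.
my $bytes = "vis-\xC3\xA0-vis";

# Without decoding, perl sees one character per byte (Latin-1):
# 0xC3 is A-tilde and 0xA0 is a no-break space.
my $as_latin1 = encode_entities($bytes);

# Decoding the bytes first yields the single character U+00E0.
# For literals in a UTF-8 source file, 'use utf8' has the same effect.
my $as_utf8 = encode_entities(decode('UTF-8', $bytes));

print "$as_latin1\n";   # vis-&Atilde;&nbsp;-vis
print "$as_utf8\n";     # vis-&agrave;-vis
```

The first line printed is the same kind of garbage as in the report, the
second is what the documentation example shows.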