XML::LibXML parse_html_string and Encoding problems

Kjetil Kjernsmo Mon, 06 Nov 2006 11:14:53 -0800

DAHUT!

So, inside my AxKit app, but not directly related to AxKit, I'm parsing 
some HTML I get from TinyMCE, to add it properly to the output tree. 
The problem is that after the HTML is parsed, it appears to be 
double-encoded, so the valid UTF-8 that comes in, is borked when it 
gets out.

The relevant code, with debug statements, currently looks like this:

            use Devel::Peek 'Dump';

            my $parser = XML::LibXML->new();
            $parser->recover(1);    # tolerate malformed HTML
            warn "FOO: " . $content;
            Dump($content);         # Dump writes to STDERR itself
            my $parsed = $parser->parse_html_string($content);
            my @fragments = $parsed->findnodes('/html/body/*');
            foreach my $fragment (@fragments) {
                my $tmp = $fragment->toString;
                warn "BAR: " . $tmp;
                Dump($tmp);
                # ... the fragment is then added to the output tree
            }

The Devel::Peek output in the Apache log looks like this before 
parsing:
  FLAGS = (PADBUSY,PADMY,POK,pPOK,UTF8)
  PV = 0x8777c98 "Eivind forteller at 
\"Multitude\":http://www.multitude.no/ arrangerer \303\245pent

So, valid UTF8, the flag is set and the å is correct.

After parsing, it looks like this:
  FLAGS = (PADBUSY,PADMY,POK,pPOK,UTF8)
  PV = 0x8778000 "<p>Eivind forteller at 
\"Multitude\":http://www.multitude.no/ arrangerer \303\203\302\245pent

The UTF8 flag is still set, but the å was apparently broken up into its 
individual bytes somewhere in the process, and each byte was then 
encoded again: a double-encoding problem.
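
For illustration, the byte pattern in the two dumps can be reproduced with the core Encode module alone. This is only a sketch of the suspected mechanism (the UTF-8 bytes being read as ISO-8859-1 and then encoded again), not a confirmed account of what the parser does internally; the variable names and the octal_dump helper are made up for the example.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: show a byte string as octal escapes, like Devel::Peek.
sub octal_dump {
    my ($bytes) = @_;
    return join '', map { sprintf '\\%03o', ord } split //, $bytes;
}

# One character: U+00E5, LATIN SMALL LETTER A WITH RING ABOVE.
my $char = "\x{e5}";

# Correct UTF-8 encoding: two bytes.
my $utf8_bytes = encode('UTF-8', $char);

# The suspected mistake: read those two bytes as two Latin-1 characters...
my $misread = decode('ISO-8859-1', $utf8_bytes);

# ...and encode them as UTF-8 a second time: four bytes.
my $double = encode('UTF-8', $misread);

print octal_dump($utf8_bytes), "\n";   # \303\245
print octal_dump($double), "\n";       # \303\203\302\245
```

The second line of output matches the broken sequence in the dump after parsing, which is consistent with a Latin-1 misreading somewhere between the input string and the serialized fragment.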

I guess the solution is straightforward: just tell the parser 
"don't worry, it is already UTF-8, just leave it alone", but I 
can't figure out how... 
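
One way this might be done, sketched under assumptions: the installed XML::LibXML accepts an options hash as the second argument to parse_html_string (I believe releases from 1.70 onwards do; older ones may not), and the input is handed over as explicit UTF-8 bytes with the encoding named. The sample $content and the @texts array are stand-ins invented for the example, not part of the application.

```perl
use strict;
use warnings;
use Encode qw(encode);
use XML::LibXML;

# Stand-in for the TinyMCE content: a character string containing U+00E5.
my $content = "<p>arrangerer \x{e5}pent</p>";

my $parser = XML::LibXML->new();
$parser->recover(1);

# Hand the parser UTF-8 bytes and state the encoding explicitly, so it
# does not have to guess (assumes the options-hash form is supported).
my $bytes  = encode('UTF-8', $content);
my $parsed = $parser->parse_html_string($bytes, { encoding => 'UTF-8' });

# Collect the text of each top-level body element as character strings.
my @texts = map { $_->textContent } $parsed->findnodes('/html/body/*');

binmode STDOUT, ':encoding(UTF-8)';
print "$_\n" for @texts;
```

If the options hash is not available, a commonly suggested alternative is to prepend a meta element declaring the charset to the HTML before parsing; I have not verified that against this setup.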

So, it is not an AxKit question, but I would guess it is a good place to 
ask anyway? :-)

Friendly Tiddely-pom,

Kjetil
-- 
Kjetil Kjernsmo
Programmer / Astrophysicist / Ski-orienteer / Orienteer / Mountaineer
[EMAIL PROTECTED]
Homepage: http://www.kjetil.kjernsmo.net/     OpenPGP KeyID: 6A6A0BBC
