DAHUT!
So, inside my AxKit app, but not directly related to AxKit, I'm parsing
some HTML I get from TinyMCE, to add it properly to the output tree.
The problem is that after the HTML is parsed, it appears to be
double-encoded, so the valid UTF-8 that comes in is borked when it
comes out.
The relevant code, with debug statements, currently looks like this:
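use Devel::Peek 'Dump';

my $parser = XML::LibXML->new();
$parser->recover(1);    # keep going on malformed HTML

# Inspect the input before parsing. Dump() writes to STDERR itself,
# so it does not need to be wrapped in warn.
warn "FOO: " . $content;
Dump($content);

my $parsed    = $parser->parse_html_string($content);
my @fragments = $parsed->findnodes('/html/body/*');
foreach my $fragment (@fragments) {
    # Inspect each serialized fragment after parsing.
    my $tmp = $fragment->toString;
    warn "BAR: " . $tmp;
    Dump($tmp);
}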
The output of these Devel::Peek dumps in the Apache log looks like this
before parsing:
FLAGS = (PADBUSY,PADMY,POK,pPOK,UTF8)
PV = 0x8777c98 "Eivind forteller at
\"Multitude\":http://www.multitude.no/ arrangerer \303\245pent
So, valid UTF8, the flag is set and the å is correct.
After parsing, it looks like this:
FLAGS = (PADBUSY,PADMY,POK,pPOK,UTF8)
PV = 0x8778000 "<p>Eivind forteller at
\"Multitude\":http://www.multitude.no/ arrangerer \303\203\302\245pent
The UTF8 flag is still set, but the å was apparently broken up into its
individual bytes somewhere in the process, and each of those bytes was
then encoded again, so, a double-encoding problem.
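To check that this really is what the byte pattern says, here is a
minimal standalone sketch using only the core Encode module. It does not
touch XML::LibXML at all; it just assumes the extra step is "UTF-8 bytes
read as Latin-1 and encoded once more", and the helper octal_bytes is a
made-up name for illustration:

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# One character: LATIN SMALL LETTER A WITH RING ABOVE (U+00E5).
my $char = "\x{E5}";

# Correct encoding: U+00E5 becomes the two bytes C3 A5.
my $once = encode('UTF-8', $char);

# Assumed mistake: read those two bytes as Latin-1 characters
# and encode the result as UTF-8 a second time.
my $twice = encode('UTF-8', decode('ISO-8859-1', $once));

# Hypothetical helper: show a byte string in the octal notation
# that Devel::Peek uses in its PV line.
sub octal_bytes { join '', map { sprintf '\\%03o', ord } split //, $_[0] }

print octal_bytes($once),  "\n";    # \303\245
print octal_bytes($twice), "\n";    # \303\203\302\245

# Reversing that one step gives back the original two bytes.
my $repaired = encode('ISO-8859-1', decode('UTF-8', $twice));
```

The two printed lines match the sequences in the dumps before and after
parsing, which is consistent with the å being encoded twice. Reversing
it like this is only a diagnostic, not a fix for whatever happens inside
the parser.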
I guess the solution is straightforward: just tell the parser
"don't worry, it is already UTF-8, just leave it alone", but I
can't figure out how...
So, it is not an AxKit question, but I would guess it is a good place to
ask anyway? :-)
Friendly Tiddely-pom,
Kjetil
--
Kjetil Kjernsmo
Programmer / Astrophysicist / Ski-orienteer / Orienteer / Mountaineer
[EMAIL PROTECTED]
Homepage: http://www.kjetil.kjernsmo.net/ OpenPGP KeyID: 6A6A0BBC