Pablo Barbachano wrote:

> Hi, I have a wiki page like this (simplified for the report):
>
>     cat >fo.mdwn <<EOF
>     
>     EOF
>
> It is valid utf8. When it is converted to html, the 'ó' gets converted to
> ó
>
> I don't know if it is a bug in markdown or ikiwiki.
You can work around this bug by turning off the htmlscrubber module, either
by passing --disable-module htmlscrubber or by removing it from your ikiwiki
setup file.

What it seems to be doing is not treating the input as utf-8, and instead
encoding each of the two bytes of the two-byte utf-8 character separately.
Let's see:

    [EMAIL PROTECTED]:~/src/ikiwiki/doc>markdown < foo.mdwn
    <p><img src="../images/o.jpg" alt="o" title="ó" /> óóóóó</p>

    [EMAIL PROTECTED]:~/src/ikiwiki/doc>markdown < foo.mdwn | perl -e 'use HTML::Scrubber; my $s=HTML::Scrubber->new(allow => [qw{img}], default => [undef, { alt => 1, src => 1, title => 1 }]); while (<>) { print $s->scrub($_) }'
    <img src="../images/o.jpg" alt="o" title="ó"> óóóóó

    [EMAIL PROTECTED]:~/src/ikiwiki/doc>markdown < foo.mdwn | perl -e 'use Encode; use HTML::Scrubber; my $s=HTML::Scrubber->new(allow => [qw{img}], default => [undef, { alt => 1, src => 1, title => 1 }]); while (<>) { print $s->scrub(Encode::decode_utf8($_)) }'
    <img src="../images/o.jpg" alt="o" title="ó"> ���

Not sure what happened to the "óóóóó" there, but on the right track..

    [EMAIL PROTECTED]:~/src/ikiwiki/doc>markdown < foo.mdwn | perl -CSD -e 'use HTML::Scrubber; my $s=HTML::Scrubber->new(allow => [qw{img}], default => [undef, { alt => 1, src => 1, title => 1 }]); while (<>) { print $s->scrub($_) }'
    <img src="../images/o.jpg" alt="o" title="ó"> óóóóó

So running perl with -CSD, as ikiwiki does, should make it work. But it
doesn't in ikiwiki, so I guess that what we get back from markdown in
ikiwiki is not being treated as utf-8 internally before the sanitize hook
is called. I don't understand why, though.

This was changed in Recai's big utf-8 patch in ikiwiki 1.5; if I back that
patch out, things work ok.
Or I could just do this:

    Index: IkiWiki/Render.pm
    ===================================================================
    --- IkiWiki/Render.pm	(revision 795)
    +++ IkiWiki/Render.pm	(working copy)
    @@ -39,9 +39,12 @@
     	}
     
     	if (exists $hooks{sanitize}) {
    +		require Encode;
    +		$content=Encode::decode_utf8($content);
     		foreach my $id (keys %{$hooks{sanitize}}) {
     			$content=$hooks{sanitize}{$id}{call}->($content);
     		}
    +		$content=Encode::encode_utf8($content);
     	}
     
     	return $content;

This patch fixes the problem, but I don't understand why we have to
re-encode the string to utf-8 on the way out. ikiwiki should just be using
decoded utf-8 internally throughout, with perl automatically converting to
utf-8 on output. Beginning to think that Recai's patch wasn't the right
approach.

With my patch above, if displaying a preview page, ikiwiki will now:

- Read it in from CGI as, apparently, raw utf-8.
- decode_utf8 so it's in perl's internal representation.
- htmlize it via markdown, which will include running decode_utf8 again on
  the markdown output as above, and then encode_utf8 so it's back to raw
  utf-8.
- decode_utf8 once again in the preview code.
- Finally turn it back into raw utf-8 again and emit it to the browser.

Yugh. This is becoming far too ugly to live. Maybe Recai can help figure
this out..

-- see shy jo
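For reference, the decode-run-encode pattern the patch applies around the sanitize hooks can be sketched in a few lines of Python (a hypothetical illustration, not ikiwiki code; `run_sanitize_hooks` is an invented name standing in for the loop in Render.pm):

```python
# Hypothetical sketch of the patch's pattern: decode raw UTF-8 bytes
# before calling hooks, let the hooks operate on decoded characters,
# then encode back to bytes for the byte-oriented code around them.
def run_sanitize_hooks(content_bytes, hooks):
    content = content_bytes.decode("utf-8")   # like Encode::decode_utf8
    for hook in hooks:
        content = hook(content)               # hooks see whole characters
    return content.encode("utf-8")            # like Encode::encode_utf8

# A hook that transforms text works per character, not per byte:
out = run_sanitize_hooks("óóó".encode("utf-8"), [lambda s: s.upper()])
assert out == "ÓÓÓ".encode("utf-8")
```

The awkward part, as noted above, is the trailing re-encode: it is only needed because the surrounding code still passes raw bytes around instead of keeping everything decoded until the final output step.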