Pablo Barbachano wrote:
> Hi, I have a wiki page like this (simplified for the report):
> 
> cat >foo.mdwn <<EOF
> ![o](../images/o.jpg "ó")
> EOF
> 
> It is valid utf8. When it is converted to html, the 'ó' gets converted to
> &Atilde;&sup3;
> 
> I don't know if it is a bug in markdown or ikiwiki.

You can work around this bug by turning off the htmlscrubber module,
either by passing --disable-module htmlscrubber or by removing it from
your ikiwiki setup file.
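For instance, something like this (the source and destination paths are
just examples; use your own):

    ikiwiki --disable-module htmlscrubber ~/wiki ~/public_html/wiki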

What it seems to be doing is not treating the input as utf-8, and so
entity-encoding each of the two bytes of the two-byte utf-8 character
separately. Let's see:

[EMAIL PROTECTED]:~/src/ikiwiki/doc>markdown < foo.mdwn 
<p><img src="../images/o.jpg" alt="o" title="ó" />
óóóóó</p>
[EMAIL PROTECTED]:~/src/ikiwiki/doc>markdown < foo.mdwn | perl -e 'use HTML::Scrubber; my $s=HTML::Scrubber->new(allow => [qw{img}], default => [undef, { alt => 1, src => 1, title => 1 }]); while (<>) { print $s->scrub($_) }'
<img src="../images/o.jpg" alt="o" title="&Atilde;&sup3;">
óóóóó
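
The byte-wise mangling is easy to reproduce without HTML::Scrubber at
all. Here is a sketch with HTML::Entities (which I assume is
representative of the escaping being done; the point is just what
happens to an undecoded byte string):

    perl -MHTML::Entities -MEncode -le '
        my $bytes = "\xc3\xb3";                     # the two raw utf-8 bytes of "ó"
        print encode_entities($bytes);              # prints &Atilde;&sup3; (each byte escaped as latin-1)
        print encode_entities(decode_utf8($bytes)); # prints &oacute; (the actual character)
    '

So let's try decoding before scrubbing: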
[EMAIL PROTECTED]:~/src/ikiwiki/doc>markdown < foo.mdwn | perl -e 'use Encode; use HTML::Scrubber; my $s=HTML::Scrubber->new(allow => [qw{img}], default => [undef, { alt => 1, src => 1, title => 1 }]); while (<>) { print $s->scrub(Encode::decode_utf8($_)) }'
<img src="../images/o.jpg" alt="o" title="&oacute;">
���
Not sure what happened to the "óóóóó" there, but this is on the right track.
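My guess: once decoded, those are characters with code points below
256, so perl prints them as single latin-1 bytes (STDOUT has no utf-8
layer in that invocation), and a utf-8 terminal renders that as
garbage. A sketch, assuming that's what is going on:

    perl -MEncode -le '
        my $s = decode_utf8("\xc3\xb3" x 5); # "óóóóó" as perl characters
        print $s;                            # five 0xf3 bytes: mojibake on a utf-8 terminal
        binmode STDOUT, ":utf8";
        print $s;                            # encoded back to utf-8, displays fine
    '

Trying again with -CSD, which puts utf-8 layers on the standard handles: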
[EMAIL PROTECTED]:~/src/ikiwiki/doc>markdown < foo.mdwn | perl -CSD -e 'use HTML::Scrubber; my $s=HTML::Scrubber->new(allow => [qw{img}], default => [undef, { alt => 1, src => 1, title => 1 }]); while (<>) { print $s->scrub($_) }'
<img src="../images/o.jpg" alt="o" title="&oacute;">
óóóóó

So running perl with -CSD, as ikiwiki does, should make it work. But it
doesn't in ikiwiki, so I guess what we get back from markdown in ikiwiki
is not being treated as utf-8 internally before the sanitize hook is
called. I don't understand why, though. This was changed in Recai's big
utf-8 patch in ikiwiki 1.5; if I back that patch out, things work ok.
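
One way to confirm that would be a probe just before the sanitize hooks
run (Encode::is_utf8 only reports perl's internal utf8 flag, but that
flag is exactly what's in question here):

    require Encode;
    warn "content is ".(Encode::is_utf8($content) ? "decoded" : "raw bytes")."\n";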

Or I could just do this:

Index: IkiWiki/Render.pm
===================================================================
--- IkiWiki/Render.pm   (revision 795)
+++ IkiWiki/Render.pm   (working copy)
@@ -39,9 +39,12 @@
        }
 
        if (exists $hooks{sanitize}) {
+               require Encode;
+               $content=Encode::decode_utf8($content);
                foreach my $id (keys %{$hooks{sanitize}}) {
                        $content=$hooks{sanitize}{$id}{call}->($content);
                }
+               $content=Encode::encode_utf8($content);
        }
        
        return $content;

This patch fixes the problem, but I don't understand why we have to
re-encode the string to utf-8 on the way out. ikiwiki should just be
using decoded utf-8 internally throughout, with perl automatically
converting to utf-8 on output.
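
If the decode has to happen somewhere, I'd rather do it once, at the
point where markdown's output enters ikiwiki, instead of round-tripping
in Render.pm. A sketch of what I mean, assuming the htmlize path calls
Markdown::Markdown (adjust to wherever the call actually lives):

    require Encode;
    # decode once here, so everything downstream sees perl character
    # strings and the encode_utf8 on the way out becomes unnecessary
    $content = Encode::decode_utf8(Markdown::Markdown($content));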

Beginning to think that Recai's patch wasn't the right approach. With my
patch above, if displaying a preview page, ikiwiki will now (see the
sketch after this list):

        - Read it in from CGI as, apparently, raw utf-8
        - decode_utf8 so it's in perl's internal representation
        - htmlize it via markdown, which will include running decode_utf8
          again on the markdown output as above, and then encode_utf8 so it's
          back to raw utf-8
        - decode_utf8 once again in the preview code
        - finally turn it back into raw utf-8 again and emit it to the
          browser
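
Spelled out as code, that chain looks roughly like this (names are
illustrative, not the actual ikiwiki variables):

    my $raw   = $q->param('editcontent');   # raw utf-8 bytes in from CGI
    my $chars = Encode::decode_utf8($raw);  # decode number one
    my $html  = htmlize($chars);            # decode + encode again inside, per my patch
    my $page  = Encode::decode_utf8($html); # decode once more, in the preview code
    print Encode::encode_utf8($page);       # and back to raw utf-8 for the browser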

Yugh. This is becoming far too ugly to live. Maybe Recai can help figure
this out..

-- 
see shy jo
