On Jun 16, 2010, at 4:47 PM, Marvin Humphrey wrote:
> On Wed, Jun 16, 2010 at 01:59:33PM -0700, David E. Wheeler wrote:
>> I think what I need is some code to strip non-utf8 characters from a string
>> -- even if that string has the utf8 bit switched on. I thought that Encode
>> would do that for me, but in this case apparently not. Anyone got an
>> example?
>
> Try this:
>
> Encode::_utf8_off($string);
> $string = Encode::decode('utf8', $string);
>
> That will replace any byte sequences which are invalid UTF-8 with the Unicode
> replacement character.
Yeah. Not working for me. See attached script. Devel::Peek says:
SV = PV(0x100801f18) at 0x10082f368
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK,UTF8)
  PV = 0x1002015c0 "<p>Tomas Laurinavi\303\204\302\215ius</p>"\0 [UTF8 "<p>Tomas Laurinavi\x{c4}\x{8d}ius</p>"]
  CUR = 29
  LEN = 32
So the UTF8 flag is enabled, and yet it has "\303\204\302\215" in it. What is
that crap?
Confused and frustrated,
David
#!/usr/local/bin/perl -w
use 5.12.0;
use Encode;
use Devel::Peek;
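# The four problem bytes are written as escapes, taken from the Devel::Peek
# dump (\303\204\302\215), because two of them are invisible control characters.
my $str = "<p>Tomas Laurinavi\xC3\x84\xC2\x8Dius</p>";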
my $utf8 = decode('UTF-8', $str);
say $str;
binmode STDOUT, ':utf8';
say $utf8;
Dump($utf8);
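
A possible reading of the dump, offered as a sketch rather than a confirmed diagnosis: the bytes \303\204\302\215 are perfectly valid UTF-8 (they decode to U+00C4 and U+008D), so decode() has nothing to replace. They look like what you get when the UTF-8 bytes for "č" (C4 8D) are treated as Latin-1 characters and encoded to UTF-8 a second time. Assuming that is what happened, decoding twice recovers the intended character. The helper name fix_double_encoded is hypothetical, and the input bytes are copied from the Devel::Peek output.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Bytes as shown in the Devel::Peek dump: \303\204\302\215
my $bytes = "<p>Tomas Laurinavi\xC3\x84\xC2\x8Dius</p>";

# Hypothetical helper: undo one extra layer of UTF-8 encoding.
# Assumes every character left after the first decode is below U+0100.
sub fix_double_encoded {
    my ($octets) = @_;
    my $once  = decode('UTF-8', $octets);       # characters U+00C4 U+008D
    my $inner = encode('ISO-8859-1', $once);    # back to the bytes C4 8D
    return decode('UTF-8', $inner);             # character U+010D
}

my $fixed = fix_double_encoded($bytes);

binmode STDOUT, ':encoding(UTF-8)';
print "$fixed\n";
```

This only makes sense for input known to be double-encoded. Run on correctly encoded text that contains characters above U+00FF, the ISO-8859-1 step would substitute them, so it is not a general-purpose cleanup.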