I have had little need for Perl's Unicode support, so I tried to ignore
it for as long as possible. That no longer seems to be an option, so I
have started reading through the various docs.
One thing confused me: several sources say that Perl keeps using 8-bit
characters for as long as possible, which seems to contradict what I observe.
"perluniintro", for example, says:
"if all code points in the string are 0xFF or less, Perl uses the
native eight-bit character set. Otherwise, it uses UTF-8."
and the documentation for "Encode":
  · When you decode, the resulting UTF8 flag is on unless you can
    unambiguously represent data. Here is the definition of
    disambiguity.

    After "$utf8 = decode('foo', $octet);",

      When $octet is...                The UTF8 flag in $utf8 is
      ---------------------------------------------------------
      In ASCII only (or EBCDIC only)   OFF
      In ISO-8859-1                    ON
      In any other Encoding            ON
      ---------------------------------------------------------
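
For reference, the perluniintro statement quoted above can be checked
without Devel::Peek. This is only a minimal sketch: it relies on the core
function utf8::is_utf8() to read the flag, and the variable names are my
own. It looks at a plain string literal, not at the result of decode().

```perl
use strict;
use warnings;

# All code points are 0xFF or less, so perluniintro suggests that
# Perl stays with the native eight-bit representation here.
my $narrow      = "\xf6";
my $narrow_flag = utf8::is_utf8($narrow) ? 1 : 0;

# Appending a code point above 0xFF cannot be stored in eight bits,
# so the whole string has to switch to UTF-8 internally.
my $wide      = $narrow . "\x{100}";
my $wide_flag = utf8::is_utf8($wide) ? 1 : 0;

printf "narrow: flag=%d length=%d\n", $narrow_flag, length $narrow;
printf "wide:   flag=%d length=%d\n", $wide_flag,   length $wide;
```

length() counts characters in both cases, so it does not reveal which
internal representation is in use; only the flag does.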
But when I look at it with Devel::Peek, it seems that after "decoding":
- the UTF8 flag is always on
- only ASCII characters are stored as bytes,
everything else is converted to utf-8
> perl -MDevel::Peek -MEncode -e 'Dump(decode latin1 => "\x41")'
SV = PV(0x603e58) at 0x62d620
REFCNT = 1
FLAGS = (TEMP,POK,pPOK,UTF8)
PV = 0x6b3d90 "A"\0 [UTF8 "A"]
> perl -MDevel::Peek -MEncode -e 'Dump("\xf6")'
SV = PV(0x70bda8) at 0x606f10
REFCNT = 1
FLAGS = (PADTMP,POK,READONLY,pPOK)
PV = 0x61ded0 "\366"\0
> perl -MDevel::Peek -MEncode -e 'Dump(decode latin1 => "\xf6")'
SV = PV(0x603e58) at 0x62d620
REFCNT = 1
FLAGS = (TEMP,POK,pPOK,UTF8)
PV = 0x6b3d90 "\303\266"\0 [UTF8 "\x{f6}"]
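
The same observations can be reproduced in a small script, which may be
easier to rerun on other Perl versions than the one-liners. Again only a
sketch: utf8::is_utf8() is a core function, the variable names are mine,
and the flag values it prints may depend on the installed Encode version.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The ASCII-only case from the first Dump above.
my $ascii_decoded = decode('latin1', "\x41");

# The undecoded and decoded Latin-1 cases from the other two Dumps.
my $latin1_raw     = "\xf6";
my $latin1_decoded = decode('latin1', $latin1_raw);

printf "decoded ASCII:   flag=%d\n", utf8::is_utf8($ascii_decoded)  ? 1 : 0;
printf "raw Latin-1:     flag=%d\n", utf8::is_utf8($latin1_raw)     ? 1 : 0;
printf "decoded Latin-1: flag=%d\n", utf8::is_utf8($latin1_decoded) ? 1 : 0;

# eq compares characters, not internal bytes, so the two Latin-1
# strings can be compared regardless of how each one is stored.
my $same = ($latin1_raw eq $latin1_decoded) ? 1 : 0;
printf "raw eq decoded:  %d\n", $same;
```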
So, which is true? Is the Unicode documentation obsolete, and has the
internal representation changed (I know, I should not worry about
internals ;-), or is the output of Devel::Peek::Dump misleading?
Regards,
Peter