UTF-8 and Internal Representation of (Latin1) Characters

Peter Daum Tue, 11 May 2010 03:54:50 -0700

Because I had little need for it, I tried to ignore Perl's Unicode
support for as long as possible. Now it looks like I can't do that
anymore, so I have started reading through the various docs.


One thing confuses me: several sources say that Perl keeps using 8-bit
characters for as long as possible, which seems to contradict what I observe.

"perluniintro" for example says:
  "if all code points in the string are 0xFF or less, Perl uses the
  native eight-bit character set.  Otherwise, it uses UTF-8."

and the documentation for "Encode":
   · When you decode, the resulting UTF8 flag is on unless you can
     unambiguously represent data.  Here is the definition of dis-
     ambiguity.
      After "$utf8 = decode('foo', $octet);",
        When $octet is...   The UTF8 flag in $utf8 is
       ---------------------------------------------
       In ASCII only (or EBCDIC only)            OFF
       In ISO-8859-1                              ON
       In any other Encoding                      ON
       ---------------------------------------------
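
The perluniintro statement can be checked for strings that were never
decoded. A minimal sketch, assuming only the core utf8::is_utf8 function,
which reports the internal UTF8 flag of a scalar (the variable names are
made up for illustration):

```perl
use strict;
use warnings;

# Strings built directly from code points; no decoding involved.
my $narrow = chr(0xF6);     # code point fits in a single byte
my $wide   = chr(0x100);    # code point above 0xFF

# utf8::is_utf8 returns true when the scalar's UTF8 flag is set.
printf "chr(0xF6):  UTF8 flag %s\n", utf8::is_utf8($narrow) ? "on" : "off";
printf "chr(0x100): UTF8 flag %s\n", utf8::is_utf8($wide)   ? "on" : "off";
```

This only covers strings created natively; it says nothing yet about what
decode() returns.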

But when I look at the result with Devel::Peek, it seems that after
decoding:
- the UTF8 flag is always on
- only ASCII characters are stored as single bytes;
  everything else is converted to UTF-8

> perl -MDevel::Peek -MEncode -e 'Dump(decode latin1 => "\x41")'
SV = PV(0x603e58) at 0x62d620
  REFCNT = 1
  FLAGS = (TEMP,POK,pPOK,UTF8)
  PV = 0x6b3d90 "A"\0 [UTF8 "A"]

> perl -MDevel::Peek -MEncode -e 'Dump("\xf6")'
SV = PV(0x70bda8) at 0x606f10
  REFCNT = 1
  FLAGS = (PADTMP,POK,READONLY,pPOK)
  PV = 0x61ded0 "\366"\0

> perl -MDevel::Peek -MEncode -e 'Dump(decode latin1 => "\xf6")'
SV = PV(0x603e58) at 0x62d620
  REFCNT = 1
  FLAGS = (TEMP,POK,pPOK,UTF8)
  PV = 0x6b3d90 "\303\266"\0 [UTF8 "\x{f6}"]
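
Whatever the internal form, the literal and the decoded string can also be
compared at the character level. A minimal sketch, assuming Encode's decode
and the core utf8::upgrade and utf8::is_utf8 functions; the variable names
are hypothetical:

```perl
use strict;
use warnings;
use Encode qw(decode);

my $native  = "\xf6";                      # plain literal, as in the second run
my $decoded = decode('latin1', "\xf6");    # decoded, as in the third run

# Compare as characters, not as internal bytes.
print "equal as strings: ", ($native eq $decoded ? "yes" : "no"), "\n";
print "lengths: ", length($native), " and ", length($decoded), "\n";

# utf8::upgrade switches a copy to the UTF-8 internal form
# without changing the characters it holds.
my $copy = $native;
utf8::upgrade($copy);
print "upgraded copy flagged: ", (utf8::is_utf8($copy) ? "yes" : "no"), "\n";
print "upgraded copy still equal: ", ($copy eq $native ? "yes" : "no"), "\n";
```

If both comparisons report "yes", the difference seen in the dumps is one of
storage only, not of string content.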

So, which is true? Is the Unicode documentation obsolete because the
internal representation has changed (I know, I should not worry about
internals ;-) or is the output of Devel::Peek::Dump misleading?

Regards,
                      Peter

