Karl Berry wrote:
> "All possible characters have a UTF-8 representation so this function
> [encode_utf8] cannot fail."
>
> What about non-characters, i.e., byte sequences that are invalid UTF-8?

Each individual byte is treated as a character in the range 0x00..0xFF
and encoded as UTF-8: 0x00..0x7F are an identity map, while 0x80..0xFF
are translated to 2-octet sequences. /Decoding/ UTF-8 can blow up or
produce bogus results (I think Perl might just drop in the "substitute"
character and emit a warning) but /encoding/ UTF-8 always works, even on
input that is already UTF-8 (you just get it encoded a second time).
Remember that Perl could handle arbitrary binary data long before it had
Unicode support.
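
A minimal sketch of that asymmetry, assuming Encode is installed (it is
core since Perl 5.8). The three sample octets are made up for
illustration, and the fallback behavior shown is what I understand the
default and FB_CROAK modes of decode() to do:

```perl
use strict;
use warnings;
use Encode qw(encode_utf8 decode);

# Hypothetical sample: three octets that are not valid UTF-8.
my $octets = "\xFF\xFE\x41";

# Encoding treats each octet as a character in the range 0..255.
my $encoded = encode_utf8($octets);
my $hex = join ' ', map { sprintf('%02X', ord($_)) } split //, $encoded;
print "encoded: $hex\n";    # encoded: C3 BF C3 BE 41

# Lenient decoding (no CHECK argument) substitutes U+FFFD for bad input.
my $lenient = decode('UTF-8', $octets);
my $substituted = ($lenient =~ /\x{FFFD}/) ? 1 : 0;

# Strict decoding dies on the same input. Work on a copy, because
# decode() may modify its argument when CHECK is set.
my $copy = $octets;
my $strict_ok = eval { decode('UTF-8', $copy, Encode::FB_CROAK()); 1 } ? 1 : 0;
print "substituted=$substituted strict_ok=$strict_ok\n";
```

The encode step never has a failure path here; only the two decode
calls have to decide what to do with bad input.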

> What I found was that using \N{...} implies a Unicode string. From the
> charnames(3) man page (strangely not named "perlcharnames"):
>
>     Otherwise, any string that includes a "\N{charname}" or "\N{U+code
>     point}" will automatically have Unicode rules (see "Byte and
>     Character Semantics" in perlunicode).

That page is named "charnames" because it documents the "charnames"
pragmatic module. The man page version was generated from the module's
POD documentation when perl was built/installed/packaged. The
"perlunicode" page documents general Unicode support in Perl.
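
To see what the quoted charnames text means in practice, something like
the following can be run. It is only a sketch: the variable names are
mine, and I have deliberately left out the expected flag values so you
can compare them on your own perl:

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

my $plain = "\xE9";        # one character, code point 0xE9
my $named = "\N{U+E9}";    # the same character, written with \N{...}

# Both spellings denote the same character string.
print "equal\n" if $plain eq $named;

# utf8::is_utf8() reports the internal flag; compare the two by eye.
printf "plain flag: %d\n", utf8::is_utf8($plain) ? 1 : 0;
printf "named flag: %d\n", utf8::is_utf8($named) ? 1 : 0;

# Either way, the UTF-8 encoding is the same two octets.
my $len_plain = length encode_utf8($plain);
my $len_named = length encode_utf8($named);
print "$len_plain $len_named\n";    # 2 2
```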

> Maybe pack("C") somehow can get to the bytes from a Unicode string?

All strings in Perl are Unicode now, internally stored as UTF-8 or, as
an optimization if no codepoints exceed 255, raw octets. (A string of
raw octets is considered to be a sequence of characters in the range
[0,255].) The "utf8 flag" on a string indicates which of those forms is
in use on any particular string. Using encode_utf8 simply gives you the
internal encoding, converting an octet string to UTF-8 if needed, marked
as an octet string. If the string is already UTF-8, encode_utf8 simply
clears the utf8 flag so you get access to the raw bytes. (Brain twisted
yet? Mine was when I first looked at this...)
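
A small sketch of the two internal forms, using the core utf8::upgrade
and utf8::is_utf8 functions to force and inspect the flag (the sample
string and variable names are just for illustration):

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

my $octet_form = "caf\xE9";    # 4 characters, stored as raw octets
my $utf8_form  = $octet_form;
utf8::upgrade($utf8_form);     # same 4 characters, stored as UTF-8

# The internal form differs, but Perl sees identical strings.
my $same  = ($octet_form eq $utf8_form) ? 1 : 0;
my @flags = map { utf8::is_utf8($_) ? 1 : 0 } ($octet_form, $utf8_form);
print "same=$same flags=@flags\n";    # same=1 flags=0 1

# encode_utf8 returns the same UTF-8 octets, flag off, in both cases.
my @hex;
for my $str ($octet_form, $utf8_form) {
    my $bytes = encode_utf8($str);
    die "result should be an octet string" if utf8::is_utf8($bytes);
    push @hex, join(' ', map { sprintf('%02X', $_) } unpack('C*', $bytes));
}
print "$_\n" for @hex;    # 63 61 66 C3 A9 (printed twice)
```

Note that unpack("C*") is applied to the result of encode_utf8 here, so
it is only ever looking at an octet string.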
Perl's Unicode handling is fun because Perl could always handle binary
data, and Unicode support was more-or-less retrofitted on top of that
support for binary data. In other words, if your program does not
handle Unicode properly (or if you are running on Perl 5.6 and your
program does not do the Perl 5.6 magic Unicode dances), Perl will treat
"Unicode" data as its underlying octet sequence; thus my earlier advice
to conditionally import Encode and shim encode_utf8 with an identity
function if Encode is not available.
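
For reference, that shim could look roughly like this. It is a sketch
rather than a drop-in patch, and the identity fallback is only
reasonable under the assumption above that a perl without Encode is
already handing you the underlying octets:

```perl
use strict;
use warnings;

BEGIN {
    if (eval { require Encode; 1 }) {
        # Encode is present: use the real thing.
        Encode->import('encode_utf8');
    }
    else {
        # No Encode (e.g. Perl 5.6): pass strings through unchanged.
        *encode_utf8 = sub { return $_[0] };
    }
}

my $out = encode_utf8("plain ASCII");
print "$out\n";    # plain ASCII
```

Doing this in a BEGIN block means encode_utf8 is defined before the
rest of the file is compiled, whichever branch is taken.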
-- Jacob