Karl Berry wrote:
"All possible characters have a UTF-8 representation so this function [encode_utf8] cannot fail."

What about non-characters, i.e., byte sequences that are invalid UTF-8?

Each individual byte is treated as a character in the range [0,255] and encoded as UTF-8. 0x00..0x7F are an identity map, while 0x80..0xFF are translated to 2-octet sequences. /Decoding/ UTF-8 can blow up or produce bogus results (I think Perl might just drop in the "substitute" character and emit a warning) but /encoding/ UTF-8 always works, even on input that is already UTF-8 (which simply gets encoded a second time). Remember that Perl could handle arbitrary binary data long before it had Unicode support.
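To make the byte mapping concrete, here is a minimal sketch, assuming a Perl with the Encode module installed. The helper name hex_bytes is only for illustration; it is not part of Encode.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

# Illustrative helper: render a string as space-separated hex octets.
sub hex_bytes { return join ' ', map { sprintf '%02X', ord($_) } split //, $_[0] }

# Octets 0x00..0x7F pass through unchanged.
my $ascii = encode_utf8("abc");

# An octet in 0x80..0xFF becomes a two-octet sequence.
my $latin = encode_utf8("\xE9");

# Bytes that already form UTF-8 do not make encoding fail;
# each byte is simply encoded again.
my $twice = encode_utf8($latin);

print hex_bytes($ascii), "\n";   # 61 62 63
print hex_bytes($latin), "\n";   # C3 A9
print hex_bytes($twice), "\n";   # C3 83 C2 A9
```

The last line shows why "cannot fail" is not the same as "does what you wanted": passing already-encoded data through encode_utf8 double-encodes it.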

What I found was that using \N{...} implies a Unicode string. From the
charnames(3) man page (strangely not named "perlcharnames"):

     Otherwise, any string that includes a "\N{charname}" or "\N{U+code
     point}" will automatically have Unicode rules (see "Byte and
     Character Semantics" in perlunicode).

That page is named "charnames" because it documents the "charnames" pragmatic module. The man page version was translated from the perldoc system when perl was built/installed/packaged. The "perlunicode" page documents general Unicode support in Perl.
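As a small illustration of the charnames passage quoted above, this sketch uses a "\N{U+code point}" escape and compares the character count with the octet count after encoding. It assumes Encode is available; the code point U+263A was picked arbitrarily for the example.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

# A string written with \N{U+...}; this code point is above 255,
# so the string cannot be held as raw octets.
my $smiley = "\N{U+263A}";

my $chars  = length $smiley;                  # counted in characters
my $flag   = utf8::is_utf8($smiley) ? 1 : 0;  # internal utf8 flag
my $octets = length encode_utf8($smiley);     # counted in octets

print "characters: $chars, utf8 flag: $flag, octets: $octets\n";
# characters: 1, utf8 flag: 1, octets: 3
```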

Maybe pack("C") somehow can get to the bytes from a Unicode string?

All strings in Perl are Unicode now, internally stored as UTF-8 or, as an optimization if no codepoints exceed 255, raw octets. (A string of raw octets is considered to be a sequence of characters in the range [0,255].) The "utf8 flag" on a string indicates which of those forms is in use on any particular string. Using encode_utf8 simply gives you the UTF-8 encoding of the string, marked as an octet string. If the string is stored as raw octets, it is converted to UTF-8 first. If the string is already stored as UTF-8, encode_utf8 effectively just clears the utf8 flag so you get access to the raw bytes. (Brain twisted yet? Mine was when I first looked at this...)
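A minimal sketch of the two internal forms, assuming Encode is installed. It uses the core utf8::upgrade and utf8::is_utf8 functions to switch and inspect the internal representation; the variable names are just for the example.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

my $octet_form = "\xE9";     # one character, stored as a raw octet
my $utf8_form  = "\xE9";
utf8::upgrade($utf8_form);   # same character, now stored internally as UTF-8

# The flag differs, but Perl still considers the strings equal.
my $flag_a = utf8::is_utf8($octet_form) ? 1 : 0;
my $flag_b = utf8::is_utf8($utf8_form)  ? 1 : 0;
my $same   = $octet_form eq $utf8_form  ? 1 : 0;

# Either way, encode_utf8 yields the same two octets with the flag off.
my $enc_a = encode_utf8($octet_form);
my $enc_b = encode_utf8($utf8_form);

print "flags: $flag_a $flag_b, equal: $same\n";   # flags: 0 1, equal: 1
printf "encoded lengths: %d %d\n", length $enc_a, length $enc_b;   # 2 2
```

The utf8 flag is an implementation detail of storage; the character content of the string is the same in both forms, which is why the comparison succeeds.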

Perl's Unicode handling is fun because Perl could always handle binary data, and Unicode support was more-or-less retrofitted on top of that support for binary data. In other words, if your program does not handle Unicode properly (or if you are running on Perl 5.6 and your program does not do the Perl 5.6 magic Unicode dances), Perl will treat "Unicode" data as its underlying octet sequence; thus my earlier advice to conditionally import Encode and shim encode_utf8 with an identity function if Encode is not available.
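The conditional import mentioned above could look roughly like this. It is a sketch rather than a drop-in: the fallback is only sensible on a Perl old enough that strings are effectively their underlying octets anyway.

```perl
use strict;
use warnings;

BEGIN {
    if (eval { require Encode; 1 }) {
        # Encode is available: import the real encode_utf8.
        Encode->import('encode_utf8');
    }
    else {
        # No Encode: shim with an identity function, since the data
        # is then already being handled as its octet sequence.
        *encode_utf8 = sub { return $_[0] };
    }
}

my $bytes = encode_utf8("caf\xE9");
print length($bytes), " octets\n";
```

With Encode present this reports 5 octets; with the identity shim it reports 4, because the string is passed through untouched.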


-- Jacob


