[bug#67841] [PATCH] Clarify error messages for misuse of m4_warn and --help for -W.

Jacob Bachmeyer Tue, 19 Dec 2023 19:06:06 -0800

Karl Berry wrote:

"All possible characters have a UTF-8 representation so this function [encode_utf8] cannot fail."
What about non-characters, i.e., byte sequences that are invalid UTF-8?

Each individual byte gets encoded as UTF-8. 0x00..0x7F are an identity
map, while 0x80..0xFF are translated to 2-octet sequences. /Decoding/
UTF-8 can blow up or produce bogus results (I think Perl might just drop
in the "substitute" character and emit a warning) but /encoding/ UTF-8
always works, even on UTF-8. Remember that Perl could handle arbitrary
binary data long before it had Unicode support.
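As a minimal sketch of the byte-by-byte behavior described above (the
variable names are illustrative only), this feeds every possible octet
value through encode_utf8 and counts what comes out:

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

# Build an octet string holding every byte value 0x00..0xFF once.
my $octets = join '', map { chr } 0 .. 255;

# Encoding never fails: each input octet is treated as a character
# in the range [0,255] and encoded on its own.
my $encoded = encode_utf8($octets);

# 0x00..0x7F stay one octet each (128 octets);
# 0x80..0xFF become two octets each (256 octets).
printf "input:  %d characters\n", length $octets;
printf "output: %d octets\n",     length $encoded;

# 'A' (0x41) is unchanged; 0xE9 becomes the two octets C3 A9.
print join(' ', map { sprintf '%02X', ord } split //, encode_utf8("A\xE9")), "\n";
```

The same call on a string that already happens to contain UTF-8 octets
also succeeds; it just encodes those octets a second time.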

What I found was that using \N{...} implies a Unicode string. From the
charnames(3) man page (strangely not named "perlcharnames"):

     Otherwise, any string that includes a "\N{charname}" or "\N{U+code
     point}" will automatically have Unicode rules (see "Byte and
     Character Semantics" in perlunicode).

That page is named "charnames" because it documents the "charnames"
pragmatic module. The man page version was translated from the perldoc
system when perl was built/installed/packaged. The "perlunicode" page
documents general Unicode support in Perl.
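To see the effect the quoted charnames passage describes, a small
sketch (variable names are illustrative) can write the same code point
once as an octet escape and once with \N{U+...}, then inspect the
internal flag on each string:

```perl
use strict;
use warnings;

my $by_escape = "\xE9";      # plain escape, code point 0xE9
my $by_name   = "\N{U+E9}";  # same code point, written with \N{...}

# utf8::is_utf8 reports the internal storage form; it needs no "use utf8".
printf "\\xE9:     flag=%d length=%d\n",
    utf8::is_utf8($by_escape) ? 1 : 0, length $by_escape;
printf "\\N{U+E9}: flag=%d length=%d\n",
    utf8::is_utf8($by_name) ? 1 : 0, length $by_name;

# Whatever the storage form, both are the same one-character string.
print "strings compare equal\n" if $by_escape eq $by_name;
```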

Maybe pack("C") somehow can get to the bytes from a Unicode string?

All strings in Perl are Unicode now, internally stored as UTF-8 or, as
an optimization if no codepoints exceed 255, raw octets. (A string of
raw octets is considered to be a sequence of characters in the range
[0,255].) The "utf8 flag" on a string indicates which of those forms is
in use on any particular string. Using encode_utf8 simply gives you the
internal encoding, converting an octet string to UTF-8 if needed, marked
as an octet string. If the string is already UTF-8, encode_utf8 simply
clears the utf8 flag so you get access to the raw bytes. (Brain twisted
yet? Mine was when I first looked at this...)
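A short sketch may help untwist this. It assumes nothing beyond the
core Encode module and the always-available utf8::is_utf8 function; the
variable names are made up for the example:

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

my $latin = "caf\xE9";           # every code point <= 255: raw octets
my $wide  = "caf\xE9 \x{263A}";  # U+263A forces UTF-8 internal storage

printf "latin: flag=%d length=%d\n",
    utf8::is_utf8($latin) ? 1 : 0, length $latin;   # 4 characters
printf "wide:  flag=%d length=%d\n",
    utf8::is_utf8($wide) ? 1 : 0, length $wide;     # 6 characters

# encode_utf8 hands back the UTF-8 octets as an octet string:
# c a f (3) + U+00E9 (2) + space (1) + U+263A (3) = 9 octets.
my $bytes = encode_utf8($wide);
printf "bytes: flag=%d length=%d\n",
    utf8::is_utf8($bytes) ? 1 : 0, length $bytes;
```

Note that length counts characters while the flag is set and octets
once it is cleared, which is why the last line reports 9 rather than 6.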

Perl's Unicode handling is fun because Perl could always handle binary
data, and Unicode support was more-or-less retrofitted on top of that
support for binary data. In other words, if your program does not
handle Unicode properly (or if you are running on Perl 5.6 and your
program does not do the Perl 5.6 magic Unicode dances), Perl will treat
"Unicode" data as its underlying octet sequence; thus my earlier advice
to conditionally import Encode and shim encode_utf8 with an identity
function if Encode is not available.
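One way such a shim might look is sketched below. This is an
illustration of the idea, not necessarily the code suggested earlier in
the thread, and the $msg string is invented for the example:

```perl
use strict;
use warnings;

BEGIN {
    # Try to load Encode at compile time. If it is missing, install an
    # identity function under the same name so that callers can use
    # encode_utf8 unconditionally.
    if (eval { require Encode; 1 }) {
        Encode->import('encode_utf8');
    } else {
        *encode_utf8 = sub { return $_[0] };
    }
}

# Hypothetical message containing non-ASCII quotation marks.
my $msg = "warning: \x{2018}quoted\x{2019}";
print encode_utf8($msg), "\n";
```

Doing the check in a BEGIN block means the name is already defined by
the time the rest of the file is compiled, whichever branch was taken.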



-- Jacob
