On Mon, 02 Apr 2018, rhkra...@gmail.com wrote:
> The wikipedia article is rather interesting, in a quick skim, I learned
> some interesting things about UTF-8, especially the property of
> self-synchronization.
Yes, UTF-8 is a brilliant design.

> I had trouble reading that large table--but if I simply take the red
> boxes at face value, maybe there are 10 or so bytes that are not valid
> UTF-8.  I'll probably first consider the bytes that tomas also
> mentions, i.e., decimal 254 and 255.

On that table, columns are the least significant bits (second hex digit)
and rows are the most significant bits (first hex digit) of a byte; so
C1 is row C, column 1.

The "2-byte", "3-byte" and "4-byte" labels are comments that remind you
of the self-synchronizing nature of UTF-8: those bytes would be invalid
outside of that position in a UTF-8 sequence that encodes a single code
point (but they are valid in the correct position).

The bytes in red on that table are always invalid for Unicode: if you
find one of those in a data file, that file is *not* valid UTF-8 (but it
could be valid UTF-16, valid UTF-32, valid ISO-8859-*, etc.).

> I guess I have a followup question--are those two bytes (or either one
> of them) also unused in all possible "code pages"?

For Unicode, yes, because Unicode can't go past code point 0x10ffff.
And that isn't about to change anytime soon: lots of software hardcodes
that limit somehow, e.g., by limiting the number of UTF-8 bytes that can
be used to encode a single code point.  I have not read the Unicode
standard to check what it says about future expansion of the valid code
point range, though.

> The problem is that I copy snippets of text from all kinds of sources
> into those text files (which are formatted like mbox files), so I might
> find one or both of those bytes in the file already.

Then it isn't a valid Unicode text file in UTF-8 format, and it needs to
be converted (or fixed) first to be encoded in UTF-8 :-)

-- 
Henrique Holschuh
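PS: for the curious, here is a quick Python sketch (mine, not from the
wikipedia table) that brute-forces which byte values can never appear
anywhere in well-formed UTF-8.  It finds thirteen of them: 0xC0 and 0xC1
(which could only start overlong encodings) plus 0xF5 through 0xFF,
which includes the decimal 254 and 255 discussed above:

```python
# A byte value is "never valid" if no well-formed UTF-8 encoding of any
# code point contains it.  Python's str.encode('utf-8') emits only
# well-formed sequences, so collect every byte it can produce.
seen = set()
for cp in range(0x110000):          # all Unicode code points
    if 0xD800 <= cp <= 0xDFFF:     # surrogates cannot be encoded
        continue
    seen.update(chr(cp).encode('utf-8'))

never_valid = sorted(set(range(256)) - seen)
print([hex(b) for b in never_valid])
# 0xC0, 0xC1, and 0xF5 through 0xFF -- 13 bytes in all
```

So "maybe 10 or so bytes" was a good eyeball estimate of the red boxes.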
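PPS: one crude way to fix such a file, sketched in Python.  It assumes
the non-UTF-8 snippets came from a Latin-1 source, which may well not
match your actual data; note also that the fallback applies to the whole
input, so any valid UTF-8 portions of a mixed file would get mangled
(doing this per line would be gentler):

```python
def to_utf8(raw: bytes) -> bytes:
    """Re-encode possibly-mixed text as valid UTF-8.

    Assumption: bytes that are not valid UTF-8 came from a Latin-1
    source.  Adjust the fallback encoding to match your real data.
    """
    try:
        text = raw.decode('utf-8')      # already valid UTF-8, keep as-is
    except UnicodeDecodeError:
        text = raw.decode('latin-1')    # every byte maps to a code point
    return text.encode('utf-8')

# A 0xFE byte (decimal 254) is never valid UTF-8, so the fallback
# kicks in and re-encodes it as U+00FE:
print(to_utf8(b'caf\xfe'))              # b'caf\xc3\xbe'
```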