Thanks to tomas and Henrique! The Wikipedia article is rather interesting; in a quick skim, I learned some interesting things about UTF-8, especially the property of self-synchronization.
I had trouble reading that large table--but if I simply take the red boxes at face value, maybe there are 10 or so bytes that are not valid UTF-8. I'll probably first consider the bytes that tomas also mentions, i.e., decimal 254 and 255.

I guess I have a followup question--are those two bytes (or either one of them) also unused in all possible "code pages"? The problem is that I copy snippets of text from all kinds of sources into those text files (which are formatted like mbox files), so I might find one or both of those bytes in the file already.

I guess it's not a big deal, as I will either:

   * search the file (with a hex editor, I guess) to see if decimal 254 or 255 is in use already--if there are only one or a few cases, I might replace them with something else before adding additional instances to serve as a temporary record separator (for use by msort), or

   * use one of the other utilities that I've since found which can apparently sort mbox files while keeping emails intact. I have to read up on those (or it?) again as there were, iirc, also some limitations there that might not let me accomplish what I want (usually, sorting the emails by the title in the mbox "From " header, and usually not by the email From: header).

Thanks again!

On Monday, April 02, 2018 09:05:52 AM Henrique de Moraes Holschuh wrote:
> On Mon, 02 Apr 2018, rhkra...@gmail.com wrote:
> > A few weeks ago, I was looking for a byte that, in UTF-8, would be a
> > totally invalid byte (not an invalid sequence of bytes). At the time, I
> > tried some googling, but it looked rather hopeless (maybe it was my
> > googling that was hopeless).
>
> 0xff should work. But any of those in RED on the wikipedia article
> about UTF-8 would do for Unicode text:
>
> https://en.wikipedia.org/wiki/UTF-8
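
P.S. As an alternative to a hex editor, the check for existing uses of decimal 254 or 255 could be sketched in a few lines of Python. This is only a rough sketch, not a tested tool: the names `find_byte_offsets` and `scan_file` are made up for illustration, and it assumes the file is small enough to read into memory in one go. It reads raw bytes without decoding, so it does not care what encoding the pasted snippets were in.

```python
from pathlib import Path

# Candidate separator bytes from the discussion: decimal 254 and 255.
CANDIDATES = (0xFE, 0xFF)


def find_byte_offsets(data, candidates=CANDIDATES):
    """Return a dict mapping each candidate byte value to the offsets where it occurs."""
    offsets = {value: [] for value in candidates}
    # Iterating over a bytes object yields integers in the range 0..255.
    for position, value in enumerate(data):
        if value in offsets:
            offsets[value].append(position)
    return offsets


def scan_file(path):
    """Read the whole file as raw bytes (no decoding) and scan it for the candidates."""
    return find_byte_offsets(Path(path).read_bytes())
```

Calling `scan_file` on one of the mbox-style files would give a list of byte offsets for each candidate; an empty list for a candidate suggests that byte is not yet in use in that file and could serve as the temporary record separator.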