On Monday, April 02, 2018 03:39:05 AM Andre Majorel wrote:
> > Why? UTF (especially UTF-8) is vastly superior for all purposes:
> I wouldn't say that. UTF-8 breaks a number of assumptions. For
> instance,
> 1) every character has the same size,
> 2) every byte sequence is a valid character,

A few weeks ago, I was looking for a byte that, in UTF-8, would be a totally invalid byte (not an invalid sequence of bytes). At the time, I tried some googling, but it looked rather hopeless (maybe it was my googling that was hopeless). I know that your statement does not imply there is such a byte, but maybe you (or someone else reading this) know(s)?

(The reason I wanted such a byte was to use it as a record separator in a set of text files that I use as an askSam "workalike" (or "worksimilar"), so that I could use msort, which depends on a 1-byte record separator to separate the records ;-) while sorting.)

(Some of the files already include UTF-8, and, in the future, I anticipate all will be in UTF-8.)

> 3) the equality or inequality of two characters comes down to
> the equality or inequality of the bytes they encode to.
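
The question above (is there a single byte that can never occur in valid UTF-8?) can be checked by brute force. What follows is only a minimal sketch in Python, assuming Python 3 and its built-in UTF-8 codec. The function name `bytes_never_in_utf8` is made up for illustration. It encodes every encodable code point and reports the byte values that never show up:

```python
import sys

def bytes_never_in_utf8():
    """Return the byte values that appear in no valid UTF-8 encoding."""
    used = set()
    for cp in range(sys.maxunicode + 1):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogate code points cannot be encoded as UTF-8
        used.update(chr(cp).encode("utf-8"))  # add each byte of the encoding
    return sorted(set(range(256)) - used)

unused = bytes_never_in_utf8()
print(" ".join(f"{b:02X}" for b in unused))
# prints: C0 C1 F5 F6 F7 F8 F9 FA FB FC FD FE FF

# Hypothetical use: 0xFF as a record separator between UTF-8 records.
records = ["first récord".encode("utf-8"), "second record".encode("utf-8")]
blob = b"\xff".join(records)
assert blob.split(b"\xff") == records
```

Any of those thirteen values should therefore be safe from collisions with well-formed UTF-8 text. Whether msort (or any other tool in the chain) accepts such a byte as its separator is a separate matter that this sketch does not check.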