Hi Ralph,

Ralph Corderoy wrote on Sun, Jul 23, 2017 at 12:38:16PM +0100:
> UTF-8 comes along and groff can't adopt it because it's already
> taken an incompatible fork.

In theory, that's true.  Any two-, three-, or four-byte sequence
that forms a valid UTF-8 character could theoretically also be a
sequence of two, three, or four ISO-LATIN-1 characters.  In practice,
those combinations of ISO-LATIN-1 characters are nonsensical and
simply do not occur in real-world files: for example, the two-byte
UTF-8 encoding of 'e' with an acute accent (0xC3 0xA9) would have
to be the ISO-LATIN-1 string "A with tilde, copyright sign".

So you can simply read the file up to the first byte with the high
bit set.  If that starts a valid UTF-8 sequence, the file is UTF-8.
Otherwise, it is ISO-LATIN-1.  I'm doing exactly that in mandoc,
and I have never seen a misclassification in practice.  Groff could
do the same and remain backward-compatible.  That doesn't even
require a heavy, sophisticated library like uchardet.

If somebody insists on processing a maliciously crafted ISO-LATIN-1
file where the first non-ASCII byte sequence looks like UTF-8, they
will have to put a charset annotation into the file or use a -K
option.  But that won't get in the way of processing historical
files, because those just won't contain such nonsense.

Of course, to process native UTF-16 on Windows, preconv will be
needed just like now.  No change there.  Oh, and maybe things even
get easier on Windows, because you gain the additional option of
using "iconv -t UTF-8" just as for any other text file, so you no
longer strictly need the special preconv(1) tool.

Yours,
  Ingo
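
P.S.
To illustrate the idea, here is a minimal sketch of such a sniffing
function.  The name guess_utf8() is made up and this is not the
actual mandoc code; for brevity it also accepts overlong forms and
out-of-range code points, which a careful implementation would
reject:

/*
 * Minimal sketch of the heuristic described above, assuming the
 * input is either UTF-8 or ISO-LATIN-1: scan up to the first byte
 * with the high bit set and test whether it starts a well-formed
 * UTF-8 sequence.
 * Returns 1 for UTF-8, 0 for ISO-LATIN-1, -1 for pure ASCII.
 */
#include <stddef.h>

int
guess_utf8(const unsigned char *buf, size_t len)
{
	size_t	 i, need;

	for (i = 0; i < len; i++) {
		if (buf[i] < 0x80)		/* plain ASCII */
			continue;
		if ((buf[i] & 0xe0) == 0xc0)	/* 110xxxxx: 2 bytes */
			need = 1;
		else if ((buf[i] & 0xf0) == 0xe0)  /* 1110xxxx: 3 bytes */
			need = 2;
		else if ((buf[i] & 0xf8) == 0xf0)  /* 11110xxx: 4 bytes */
			need = 3;
		else		/* stray continuation or invalid byte */
			return 0;
		if (i + need >= len)	/* sequence truncated at EOF */
			return 0;
		while (need--)	/* each must look like 10xxxxxx */
			if ((buf[++i] & 0xc0) != 0x80)
				return 0;
		return 1;	/* first non-ASCII sequence is UTF-8 */
	}
	return -1;		/* pure ASCII: either encoding works */
}

A program would call this once on the input buffer before parsing,
and the user would only reach for a -K option or an in-file charset
annotation to override the guess in the pathological case described
above.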