Hi Ralph,

Ralph Corderoy wrote on Sun, Jul 23, 2017 at 12:38:16PM +0100:
> UTF-8 comes along and groff can't adopt it because it's already
> taken an incompatible fork.

In theory, that's true.  Any two-, three-, or four-byte sequence
that forms a valid UTF-8 character could theoretically also be a
sequence of two, three, or four ISO-LATIN-1 characters.  In practice,
those combinations of ISO-LATIN-1 characters are nonsensical and
simply do not occur in real-world files: for example, the two-byte
UTF-8 encoding of 'e' with an acute accent (0xC3 0xA9) would have
to be the ISO-LATIN-1 string "A with tilde, copyright sign".

So you can simply read the file up to the first byte with the high
bit set.  If that starts a valid UTF-8 sequence, the file is UTF-8.
Otherwise, it is ISO-LATIN-1.  I'm doing exactly that in mandoc,
and I have never seen a misclassification in practice.  Groff could
do the same and remain backward-compatible.  That doesn't even
require a heavy, sophisticated library like uchardet.

If somebody insists on processing a maliciously crafted ISO-LATIN-1
file where the first non-ASCII byte sequence looks like UTF-8, they
will have to put a charset annotation into the file or use a -K
option.  But that won't get in the way of processing historical
files, because those just won't contain such nonsense.

Of course, to process native UTF-16 on Windows, preconv will be
needed just like now.  No change there.  Oh, and maybe things even
get easier on Windows, because you gain the additional option of
using "iconv -t UTF-8" just as for any other text file, so you no
longer strictly need the special preconv(1) tool.

Yours,
  Ingo
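
P.S.
To illustrate the idea, here is a minimal sketch of such a sniffing
function.  The name guess_utf8() is made up and this is not the
actual mandoc code; for brevity it also accepts overlong forms and
out-of-range code points, which a careful implementation would
reject:

/*
 * Minimal sketch of the heuristic described above, assuming the
 * input is either UTF-8 or ISO-LATIN-1: scan up to the first byte
 * with the high bit set and test whether it starts a well-formed
 * UTF-8 sequence.
 * Returns 1 for UTF-8, 0 for ISO-LATIN-1, -1 for pure ASCII.
 */
#include <stddef.h>

int
guess_utf8(const unsigned char *buf, size_t len)
{
	size_t	 i, need;

	for (i = 0; i < len; i++) {
		if (buf[i] < 0x80)		/* plain ASCII */
			continue;
		if ((buf[i] & 0xe0) == 0xc0)	/* 110xxxxx: 2 bytes */
			need = 1;
		else if ((buf[i] & 0xf0) == 0xe0)  /* 1110xxxx: 3 bytes */
			need = 2;
		else if ((buf[i] & 0xf8) == 0xf0)  /* 11110xxx: 4 bytes */
			need = 3;
		else		/* stray continuation or invalid byte */
			return 0;
		if (i + need >= len)	/* sequence truncated at EOF */
			return 0;
		while (need--)	/* each must look like 10xxxxxx */
			if ((buf[++i] & 0xc0) != 0x80)
				return 0;
		return 1;	/* first non-ASCII sequence is UTF-8 */
	}
	return -1;		/* pure ASCII: either encoding works */
}

A program would call this once on the input buffer before parsing,
and the user would only reach for a -K option or an in-file charset
annotation to override the guess in the pathological case described
above.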