Re: `texindex` output depends on locale settings

Gavin Smith Sun, 06 Nov 2022 10:35:21 -0800

On Sun, Nov 06, 2022 at 05:05:00PM +0200, Eli Zaretskii wrote:
> Sure, but is the issue only with lower-case letters?  What about
> collation order or even determining what is and isn't a character (as
> opposed to incomplete byte sequence)?

As you say, this information (either @documentencoding or
@documentlanguage) isn't in the input index files for texindex.

I imagine it may be difficult to put this information in the index files
in a backwards compatible way (e.g. newer texinfo.tex with old texindex).
Workarounds (maybe involving texi2dvi too) may be possible, but this
is something we'd have to investigate.

It would make sense for texindex to use the UTF-8 encoding in a UTF-8
locale, where awk supports this.  Then in the usual case of "UTF-8
everywhere", the encoding wouldn't be an issue.
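
To illustrate the difference between byte-wise and character-wise
handling, here is a minimal sketch.  It is in Python rather than awk, so
it only shows the general idea and says nothing about what texindex or
any particular awk actually does; the names are made up for the example.

```python
# Illustration only: a Python stand-in for byte-oriented vs. UTF-8-aware tools.
entry = "\u00c9cole"                 # "École", starting with E acute
raw = entry.encode("utf-8")          # the bytes as stored in an index file

# Byte-oriented view: E acute is two bytes, and ASCII-only case folding
# leaves those two bytes untouched.
byte_length = len(raw)               # 6
byte_lowered = raw.lower()           # unchanged: b'\xc3\x89cole'

# Character-aware view: E acute is one character and has a lower-case form.
char_length = len(entry)             # 5
char_lowered = entry.lower()         # "\u00e9cole"

def is_complete_utf8(data: bytes) -> bool:
    """Return True if data decodes as valid UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True

truncated = raw[:1]                  # only the first byte of E acute
```

Here is_complete_utf8(raw) is true while is_complete_utf8(truncated) is
false, which is the "incomplete byte sequence" case mentioned above.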

texinfo.tex does have limited support for UTF-8 input, but just for
codepoints for which glyphs are available in the default TeX fonts.
So our problem is not necessarily "perfect sorting for arbitrary
UTF-8 strings", only correct sorting for those languages that are
currently supported (which use the Latin alphabet, and possibly Greek,
although I've never heard of a Texinfo document in Greek).  Notably,
the Cyrillic alphabet (for e.g. Russian) is not supported.

Japanese, and recently Chinese, are supported with special files
and particular variants of TeX (XeTeX and/or LuaTeX).  However, I
don't know what happens with indices in these languages, which do not
use an alphabetic script.

Collation order may or may not be an issue, depending on the languages
of the Texinfo documents.  I guess this affects questions like, should
words starting with "É" (E acute) be included in the "E" section of the
index.
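
One possible way to answer "yes" to that question, sketched in Python
purely as an illustration (this is an assumption about a policy one
could choose, not how texindex works, and index_initial is a
hypothetical helper):

```python
import unicodedata

def index_initial(entry: str) -> str:
    """Return an index section heading for an entry, ignoring accents."""
    first = entry.lstrip()[:1]
    # NFD splits E acute into "E" followed by a combining acute accent.
    decomposed = unicodedata.normalize("NFD", first)
    # Drop the combining marks and keep the base letter.
    base = "".join(c for c in decomposed if not unicodedata.combining(c))
    return base.upper()

entries = ["Emacs", "\u00c9cole", "eval", "Zebra"]   # "\u00c9cole" is "École"
sections = {}
for e in entries:
    sections.setdefault(index_initial(e), []).append(e)
```

With this policy "École" lands in the "E" section alongside "Emacs" and
"eval".  Whether that is the right answer depends on the language: some
alphabets treat an accented letter as a separate letter with its own
place in the order.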

It might be interesting to make a list of which languages texinfo.tex
supports.