On Sun, Nov 06, 2022 at 05:05:00PM +0200, Eli Zaretskii wrote:
> Sure, but is the issue only with lower-case letters?  What about
> collation order or even determining what is and isn't a character (as
> opposed to incomplete byte sequence)?
>
As you say, this information (either @documentencoding or @documentlanguage) isn't in the input index files for texindex. I imagine it may be difficult to put this information in the index files in a backwards-compatible way (e.g. a newer texinfo.tex with an old texindex). Workarounds (maybe involving texi2dvi too) may be possible, but this is something we'd have to investigate.

It would make sense for texindex to use the UTF-8 encoding in a UTF-8 locale, where awk supports this. Then in the usual case of "UTF-8 everywhere", the encoding wouldn't be an issue.

texinfo.tex does have limited support for UTF-8 input, but just for codepoints for which glyphs are available in the default TeX fonts. So our problem is not necessarily "perfect sorting for arbitrary UTF-8 strings", only sorting for those languages that are currently supported (which use the Latin alphabet, and possibly Greek, although I've never heard of a Texinfo document in Greek). Notably, the Cyrillic alphabet (for e.g. Russian) is not supported.

Japanese, and recently Chinese, are supported with special files and particular variants of TeX (XeTeX and/or LuaTeX). However, I don't know what happens with indices in these non-alphabetic languages.

Collation order may or may not be an issue, depending on the languages of the Texinfo documents. I guess this affects questions like: should words starting with "É" (E acute) be included in the "E" section of the index?

It might be interesting to make a list of which languages texinfo.tex supports.
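
To make the "É" question concrete, here is a minimal sketch in Python. It is not how texindex works (texindex runs under awk), and the helper name index_initial and the sample entries are made up for illustration. It assumes the simplest possible policy: decompose the first character and drop combining marks, so that "É" is filed under "E". Real collation rules differ by language, so treat this only as one possible answer to the question above. It also shows why the encoding matters: "É" is one character but two bytes in UTF-8.

```python
import unicodedata


def index_initial(entry: str) -> str:
    """Return the index section letter for an entry (hypothetical helper).

    Assumed policy: strip accents from the first character, then upper-case.
    """
    first = entry.lstrip()[:1]
    # NFD splits U+00C9 (E acute) into "E" followed by U+0301 (combining acute).
    decomposed = unicodedata.normalize("NFD", first)
    # Drop the combining marks, keeping only the base letter.
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return base.upper()


# Sample entries; "\u00c9" is E acute, written as an escape to avoid any
# ambiguity about how this file itself is encoded or normalized.
entries = ["Zebra", "\u00c9cole", "eagle", "\u00c9mile", "fox"]

# Group entries by section letter. Within a section this uses plain
# codepoint order, which is NOT language-aware collation.
sections = {}
for entry in sorted(entries, key=lambda s: (index_initial(s), s)):
    sections.setdefault(index_initial(entry), []).append(entry)

# One character, but two bytes once encoded as UTF-8: a tool that works
# byte by byte sees an incomplete sequence if it takes only the first byte.
e_acute_bytes = "\u00c9".encode("utf-8")
```

With this policy the entries starting with "É" land in the same section as "eagle". Whether that is the desired result for a given @documentlanguage is exactly the open question.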