On Fri, Aug 30, 2019 at 10:20 AM Jean Louis <bugs@gnu.support> wrote:
>
> Hello Bernhard,
>
> Thank you, just one question:
>
> * Bernhard Voelker <m...@bernhard-voelker.de> [2019-08-30 00:44]:
> > The updatedb script now operates in the C locale only. This means
> > that character encoding issues are now not likely to cause sort to
> > fail. It also honours the TMPDIR environment variable if that was
> > set, and no longer sorts file names case-insensitively.
>
> Does that still allows to index unicode file names?

In reality there is no such thing as a "Unicode file name", because
there is no mechanism to record or specify the character encoding of a
path name (i.e. only the bytes are saved in the file system, not the
associated encoding or encodings). Nothing guarantees that the
character encoding in use when a path name is created is the same as
the encoding the user has in effect when the path name is later used.
Indeed, the sub-directory names making up a path name can each be in a
different, incompatible encoding.

If this seems messy, then yes, it is. POSIX hasn't dealt with this
very well. If there were a do-over, path names might be specified as
UTF-8 in all cases, but that's not how it actually happened.

In practical terms, updatedb treats path names as byte sequences when
it generates the locate database, which is what they in fact are. It
preserves all valid byte sequences (path names may not contain the NUL
character). The locate utility prints these. Whether that results in
an intelligible string in your UI depends on which character encodings
are in use when the path name is created and when it is displayed. If
you have consistent settings at both points, you will get the
behaviour you expect. Otherwise, not. But updatedb just passes the
bytes through; it doesn't change them.

You will see locale-dependent behaviour from locate, though, since
regular expressions need to understand what the characters mean (to
offer character classes, for example, and even for things like '.',
which needs to match a single character).

James.
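
P.S. In case it helps, here is a minimal sketch of the point about bytes
versus characters. It is plain Python 3, not anything taken from
findutils; the file name "café" and all variable names are made up for
illustration. It only shows that the same stored bytes read differently
under different encodings, and that '.' means something different
depending on whether you match bytes or characters.

```python
import re

# Hypothetical file name "café", written with an escape so the
# example does not depend on the encoding of this message.
name = "caf\u00e9"

# What a UTF-8 user would have stored in the file system: just bytes.
raw = name.encode("utf-8")              # b'caf\xc3\xa9'

# Same bytes, two interpretations. The bytes themselves never change.
as_utf8 = raw.decode("utf-8")           # the name the creator intended
as_latin1 = raw.decode("latin-1")       # what a Latin-1 user would see

# A name created under Latin-1 is not valid UTF-8 at all...
latin1_raw = name.encode("latin-1")     # b'caf\xe9'
try:
    latin1_raw.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

# ...but it can still be carried around losslessly as opaque bytes.
roundtrip = (latin1_raw.decode("utf-8", "surrogateescape")
                       .encode("utf-8", "surrogateescape"))

# '.' matches one byte in a bytes pattern, one character in a str pattern.
byte_match = re.fullmatch(rb"caf.", raw)      # the last character is two bytes
char_match = re.fullmatch(r"caf.", as_utf8)   # the last character is one character
```

Here the byte-wise match fails and the character-wise match succeeds,
which is roughly why a matcher has to know the encoding even though the
database itself does not.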