KH> Aleksey,

AC>> Before looking for word in .index file dictd converts it to lower
AC>> case and removes non-alphanumeric characters from the word (if no
AC>> 00-database-allchars is found of course). This is necessary to
AC>> ignore non-alphanumeric characters in search and make the search
AC>> case-insensitive. 'dictfmt' builds .index file the same way. This
AC>> is why 00-database-short could not be found in your databases.

KH> I am curious as to how this is special for utf8. Don't the same
KH> requirements of a case-insensitive search apply to non-utf8? So, why
KH> then is the full 00-database-short allowed in a non-utf8 index even
KH> when 00-database-allchars is omitted?

'sort -df -k 1,3' is used for sorting an ASCII dictionary.
This allows us to keep non-alphanumeric characters in .index.
Also, all characters stay in their original case.

Some info from the sort manual:

    -d, --dictionary-order
        consider only blanks and alphanumeric characters
    -f, --ignore-case
        fold lower case to upper case characters

dictd in turn uses the matching compare function,
see index.c:compare_alnumspace for details.
This is how dict/dictfmt was designed by Rick.

The same method is possible for a UTF-8 dictionary
(and the very first version worked this way),
but later (before releasing anything) I changed the sorting order
in both dictfmt and dictd.
Now all words in .index are "normalized",
i.e. lowercased and with only alnum chars kept.

Benefits:
 - The 'sort' utility doesn't need to be aware of UTF-8.
 - Sorting order is trivial, byte-to-byte.
 - Much simpler and much faster compare function in dictd,
   see index.c:compare_allchars

Disadvantages:
 - The MATCH command returns "normalized" words, not the original ones.
   I plan to implement a fourth column in the .index file
   to keep the original word.

P.S.
Here is where the correct compare function is selected:

    static int compare(
       const char *word,
       const dictIndex *dbindex,
       const char *start, const char *end )
    {
       ...
       if (dbindex &&
           (dbindex -> flag_allchars ||
            dbindex -> flag_utf8 ||
            dbindex -> flag_8bit))
       {
          return compare_allchars( word, start, end );
       }else{
          return compare_alnumspace( word, dbindex, start, end );
       }
    }

Upper level functions call 'tolower_alnumspace' to "normalize" the query.

--
Best regards, Aleksey Cheusov.
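
For illustration only, the "normalize, then compare byte-to-byte" scheme described above could be sketched roughly as follows. This is not the dictd source: normalize_word and compare_bytes are hypothetical names, the sketch handles ASCII only (non-ASCII bytes are simply dropped), and keeping spaces is an assumption based on the name 'tolower_alnumspace'.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not from dictd): copy `src` into `dst`, lowercasing
 * ASCII letters and dropping every byte that is neither an ASCII
 * alphanumeric nor a space.  Output is truncated to fit `dst_size`.
 * Returns the length of the normalized string. */
static size_t normalize_word(const char *src, char *dst, size_t dst_size)
{
    size_t n = 0;

    assert(src != NULL && dst != NULL && dst_size > 0);
    for (; *src != '\0' && n + 1 < dst_size; ++src) {
        unsigned char c = (unsigned char)*src;
        if (c < 128 && (isalnum(c) || c == ' '))
            dst[n++] = (char)tolower(c);
    }
    dst[n] = '\0';
    return n;
}

/* Hypothetical bytewise comparison of a NUL-terminated, already normalized
 * query against an index headword stored in [start, end).
 * Returns <0, 0 or >0, like strcmp. */
static int compare_bytes(const char *word, const char *start, const char *end)
{
    size_t wlen = strlen(word);
    size_t elen = (size_t)(end - start);
    size_t min = wlen < elen ? wlen : elen;
    int r = memcmp(word, start, min);

    if (r != 0)
        return r;
    /* Common prefix is equal: the shorter string sorts first. */
    return (wlen > elen) - (wlen < elen);
}
```

With this sketch, a query for "00-database-short" is normalized to "00databaseshort" before the lookup, so it can only match an index entry that was stored in the same normalized form.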