KH> Aleksey,

 AC>> Before looking up a word in the .index file, dictd converts it to
 AC>> lower case and removes non-alphanumeric characters from it (unless
 AC>> 00-database-allchars is found, of course).  This is necessary to
 AC>> ignore non-alphanumeric characters in search and make the search
 AC>> case-insensitive. 'dictfmt' builds the .index file the same way.
 AC>> This is why 00-database-short could not be found in your databases.

 KH> I am curious as to how this is special for utf8.  Don't the same
 KH> requirements of a case-insensitive search apply to non-utf8?  So, why
 KH> then is the full 00-database-short allowed in a non-utf8 index even
 KH> when 00-database-allchars is omitted?

'sort -df -k 1,3' is used to sort an ASCII dictionary.
This lets us keep non-alphanumeric characters in the .index file,
and all characters stay in their original case.
Some info from the sort manual:
       -d, --dictionary-order
              consider only blanks and alphanumeric characters
       -f, --ignore-case
              fold lower case to upper case characters
dictd in turn uses a matching comparison function;
see index.c:compare_alnumspace for details.
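
To illustrate the idea (this is only a rough sketch, not the code from
index.c), a comparison in the spirit of 'sort -df' could look like the
following. The names is_significant and cmp_dict_fold are made up for
this example, and it assumes plain ASCII input in the "C" locale.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: under "sort -d" only blanks and
   alphanumeric characters take part in the comparison. */
static int is_significant(unsigned char c)
{
    return isalnum(c) || c == ' ' || c == '\t';
}

/* Hypothetical comparison in the spirit of "sort -df":
   skip insignificant bytes, fold case, compare what is left.
   Returns <0, 0 or >0 like strcmp. ASCII only (assumption). */
int cmp_dict_fold(const char *a, const char *b)
{
    const unsigned char *p = (const unsigned char *)a;
    const unsigned char *q = (const unsigned char *)b;

    for (;;) {
        /* Step over characters that "-d" ignores. */
        while (*p && !is_significant(*p)) p++;
        while (*q && !is_significant(*q)) q++;

        /* One or both strings are exhausted. */
        if (!*p || !*q)
            return (*p != 0) - (*q != 0);

        /* "-f": fold lower case to upper case before comparing. */
        int cp = toupper(*p);
        int cq = toupper(*q);
        if (cp != cq)
            return cp - cq;

        p++;
        q++;
    }
}
```

With such a function "00-database-short" and "00DATABASESHORT" compare
equal, although both stay untouched in the .index file.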

This is how dict/dictfmt was designed by Rick.

The same method is possible for a UTF-8 dictionary
(and the very first version worked this way), but
later (before releasing anything) I changed the sorting order
in both dictfmt and dictd.
Now all words in .index are "normalized", i.e. lowercased,
and only alnum chars are kept in them.

Benefits:
- The 'sort' utility doesn't need to be aware of UTF-8.
- The sorting order is trivial, byte-to-byte.
- A much simpler and much faster compare function in dictd,
  see index.c:compare_allchars
Disadvantages:
- The MATCH command returns "normalized" words, not the original ones.
  I plan to add a fourth column to the .index file
  to keep the original word.
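
To show why the byte-to-byte order makes the compare function so simple,
here is a minimal sketch. It is not the real compare_allchars; the name
cmp_bytes is made up, and it assumes that the headword of an index entry
runs from 'start' up to the first tab (or up to 'end') and that the
query has already been normalized.

```c
#include <stddef.h>

/* Hypothetical sketch: compare a NUL-terminated, already normalized
   query with the headword of one index entry. The headword is assumed
   to end at the first tab or at `end`, whichever comes first.
   Returns <0, 0 or >0. */
int cmp_bytes(const char *word, const char *start, const char *end)
{
    const unsigned char *w = (const unsigned char *)word;
    const unsigned char *s = (const unsigned char *)start;
    const unsigned char *e = (const unsigned char *)end;

    /* Plain unsigned byte comparison: no case folding, no skipping,
       no knowledge of UTF-8 is needed. */
    while (*w && s < e && *s != '\t') {
        if (*w != *s)
            return (int)*w - (int)*s;
        w++;
        s++;
    }

    int entry_done = (s >= e || *s == '\t');
    if (*w == '\0' && entry_done)
        return 0;               /* both ended together */
    return *w ? 1 : -1;         /* the longer string sorts later */
}
```

Because UTF-8 keeps code point order under unsigned byte comparison,
this gives the same order as a byte-wise 'sort' of the normalized words.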

P.S.
Here is where the correct compare function is selected:

static int compare(
   const char *word,
   const dictIndex *dbindex,
   const char *start, const char *end )
{
...
   if (dbindex &&
       (dbindex -> flag_allchars || dbindex -> flag_utf8 ||
        dbindex -> flag_8bit))
   {
      return compare_allchars( word, start, end );
   }else{
      return compare_alnumspace( word, dbindex, start, end );
   }
}

Upper-level functions call 'tolower_alnumspace' to "normalize" the query.
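
For illustration only, the normalization step might be sketched like
this. It is not the real tolower_alnumspace; normalize_query is a
made-up name. Keeping spaces is an assumption taken from the function
name, and passing bytes >= 0x80 through unchanged is a simplification
of mine so that multi-byte UTF-8 sequences are not broken.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical ASCII-only normalizer: lowercases letters, keeps
   digits and spaces, drops other ASCII characters.
   Writes at most dstsize-1 bytes plus a terminating NUL and
   returns the number of bytes written (without the NUL). */
size_t normalize_query(const char *src, char *dst, size_t dstsize)
{
    size_t n = 0;

    if (dstsize == 0)
        return 0;

    for (const unsigned char *p = (const unsigned char *)src;
         *p && n + 1 < dstsize; p++) {
        if (*p >= 0x80)
            dst[n++] = (char)*p;           /* assumption: keep non-ASCII bytes as is */
        else if (isalnum(*p))
            dst[n++] = (char)tolower(*p);  /* lowercase letters, keep digits */
        else if (*p == ' ')
            dst[n++] = ' ';                /* assumption: spaces are kept */
        /* all other ASCII characters are dropped */
    }

    dst[n] = '\0';
    return n;
}
```

With this sketch the query "00-Database-Short" becomes
"00databaseshort", which can then be compared byte-to-byte with the
normalized words in .index.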

-- 
Best regards, Aleksey Cheusov.


