Ondrej Bilka wrote:
> I looked more into source and discovered fnmatch doesn't work as I imagined.
> By default it converts strings into widechars and match there.
> utf8 allows searching be done bitwise. Its in most cases faster.

fnmatch converts to wide characters because it often makes several passes
across many characters of the string, and at each pass it would otherwise
have to call mbrtowc to look up the extent of each character. And while
UTF-8 is the most common encoding, there are other ones, such as ISO-8859-2
or GB18030, for which mbrtowc is really expensive.

> Is ok just use original fnmatch if pattern contains extended wildcard or []
> with nonascii symbol?

No. If the encoding is GB18030 and the pattern is "*5*", and you search for
the '5' byte by byte, you will find a match where there is actually none,
because multibyte characters in GB18030 can contain values in the range
0x30..0x39 in bytes 2..4. The same problem exists for the BIG5, BIG5-HKSCS,
GBK, and SHIFT_JIS encodings.

> Here is casefold patch for fnmatch. (abusing wchar=u32)

wchar_t == ucs4_t is only generally true on glibc systems, not on Solaris,
FreeBSD, AIX, etc.

> +#ifdef _LIBC

The symbol _LIBC is only defined when compiling glibc. It is not defined
when compiling gnulib source code on any system.

> -
> - res = internal_fnwmatch (wpattern, wstring, wstring + strsize - 1,
> + wchar_t *wfoldpattern,*wfoldstring;
> + wfoldpattern=wpattern;wfoldstring=wstring;

If you want me to review some code, please present it with the same friendly
indentation, space-after-comma, space-around-operators,
one-variable-per-declaration, one-statement-per-line, GNU-style brace
placement, max line length of 80, etc. that you find in the rest of the
gnulib source code.

Regarding indentation: a tab's width is 8 columns. It looks like you're
using a different tab width. If that is so, and you cannot change it, please
try to avoid tabs altogether.

Bruno
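
The GB18030 false match described above can be illustrated with a small sketch. It is not code from glibc or gnulib: the helper names gb18030_char_len and find_ascii_char are hypothetical, and the length rule is a simplified reading of the GB18030 byte ranges (lead byte 0x81..0xFE, then either a second byte 0x40..0xFE for a two-byte character, or 0x30..0x39, 0x81..0xFE, 0x30..0x39 for a four-byte character).

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: length in bytes of the GB18030 character that
   starts at S, with N bytes remaining.  Malformed or truncated input
   counts as one byte, so that a scan always advances.  */
static size_t
gb18030_char_len (const unsigned char *s, size_t n)
{
  if (n == 0)
    return 0;
  if (s[0] < 0x81 || s[0] == 0xFF)
    return 1;                   /* single byte, or not a valid lead byte */
  if (n >= 4
      && s[1] >= 0x30 && s[1] <= 0x39
      && s[2] >= 0x81 && s[2] <= 0xFE
      && s[3] >= 0x30 && s[3] <= 0x39)
    return 4;                   /* four-byte character */
  if (n >= 2 && s[1] >= 0x40 && s[1] != 0x7F && s[1] != 0xFF)
    return 2;                   /* two-byte character */
  return 1;
}

/* Hypothetical helper: find the ASCII character C in S, stepping
   character by character so that trail bytes are never inspected.  */
static const unsigned char *
find_ascii_char (const unsigned char *s, size_t n, unsigned char c)
{
  size_t i = 0;

  while (i < n)
    {
      size_t len = gb18030_char_len (s + i, n - i);

      if (len == 1 && s[i] == c)
        return s + i;
      i += len;
    }
  return NULL;
}
```

For the six bytes 'a', 0x81, 0x35, 0x81, 0x30, 'b', a plain memchr for '5' returns a pointer to index 2, which is the second byte of a four-byte character, while find_ascii_char returns NULL because the string contains no '5' character.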