Hi Pádraig,

> However, the first byte of a multibyte
> UTF-8 char is the same for a lot of characters

Yes. The last byte is equidistributed across the range 0x80..0xBF, whereas
the first byte is often the same. I'm applying the commit below to exploit
it for speed.

> I was wondering myself about what parts of gnulib/unistring could take
> advantage of assuming valid UTF-8 strings. From my own notes on this
> function, I have:
>
> "Some possible optimizations would need to
> be conditional on CONFIG_UNICODE_SAFETY (see u8_mblen).
> Note also u8_mbtouc_unsafe() and u8_mbtouc(), the latter
> detecting invalid utf-8 chars even without --enable-safety
> So given the above I'm assuming that most of gnulib/unistring
> assumes valid UTF-8 (which users can enforce on input with u8_check()),
> and if a safe but inefficient implementation option is possible
> then it should be within CONFIG_UNICODE_SAFETY. Note I found
> no mention of --enable-safety in the gnulib/libunistring configure scripts."

Generally, it's better to go for safety by default. --enable-safety is for
cases where a user wants to trade safety for speed. I doubt that's
reasonable in general. It's for this reason that I provided
u8_mbtouc_unsafe under a different function name, so that programmers can
use it at those places where they know that the input is well-formed.

Bruno


2010-07-18  Bruno Haible  <br...@clisp.org>

        unistr/u8-strchr: Optimize non-ASCII argument case.
        * lib/unistr/u8-strchr.c (u8_strchr): Compare the last byte first,
        because the first byte often matches anyway.
        Reported by Pádraig Brady <p...@draigbrady.com>.

--- lib/unistr/u8-strchr.c.orig Sun Jul 18 17:16:07 2010
+++ lib/unistr/u8-strchr.c      Sun Jul 18 17:12:17 2010
@@ -68,7 +68,7 @@
             {
               if (s[1] == 0)
                 goto notfound;
-              if (*s == c0 && s[1] == c1)
+              if (s[1] == c1 && *s == c0)
                 break;
             }
           return (uint8_t *) s;
@@ -86,7 +86,7 @@
             {
               if (s[2] == 0)
                 goto notfound;
-              if (*s == c0 && s[1] == c1 && s[2] == c2)
+              if (s[2] == c2 && s[1] == c1 && *s == c0)
                 break;
             }
           return (uint8_t *) s;
@@ -105,7 +105,7 @@
             {
               if (s[3] == 0)
                 goto notfound;
-              if (*s == c0 && s[1] == c1 && s[2] == c2 && s[3] == c3)
+              if (s[3] == c3 && s[2] == c2 && s[1] == c1 && *s == c0)
                 break;
             }
           return (uint8_t *) s;
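
For readers who want to try the idea outside gnulib, here is a minimal
standalone sketch of the two-byte case. It is not the code from
lib/unistr/u8-strchr.c: the names find_2byte_last_first and self_check are
hypothetical, and the empty-string check at the top is an assumption added
so that the sketch is safe on its own. Only the order of the comparisons
mirrors the patch above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not the gnulib function: return a pointer to the
   first occurrence of the two-byte UTF-8 sequence (c0, c1) in the
   NUL-terminated string S, or NULL if it does not occur.  */
const uint8_t *
find_2byte_last_first (const uint8_t *s, uint8_t c0, uint8_t c1)
{
  /* Assumption of this sketch: handle the empty string up front, so that
     s[1] is never read past the terminating NUL.  */
  if (*s == 0)
    return NULL;
  for (;; s++)
    {
      if (s[1] == 0)
        return NULL;
      /* Trailing byte first: it varies over 0x80..0xBF, so most
         mismatches are rejected here without comparing *s at all.  */
      if (s[1] == c1 && *s == c0)
        break;
    }
  return s;
}

/* Small self-check: "a", U+00E9 (C3 A9), U+00E8 (C3 A8).  Both non-ASCII
   characters share the lead byte 0xC3 and differ only in the last byte.  */
void
self_check (void)
{
  static const uint8_t text[] = { 'a', 0xC3, 0xA9, 0xC3, 0xA8, 0 };

  assert (find_2byte_last_first (text, 0xC3, 0xA9) == text + 1);
  assert (find_2byte_last_first (text, 0xC3, 0xA8) == text + 3);
  assert (find_2byte_last_first (text, 0xC3, 0xAA) == NULL);
}
```

Both orders of the && operands give the same result; the only difference
is which comparison short-circuits first. In text where many characters
share a lead byte, testing the trailing byte first should reject most
positions with a single comparison.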