On 11/07/10 15:20, Paolo Bonzini wrote:
> On 07/07/2010 03:44 PM, Pádraig Brady wrote:
>> Subject: [PATCH] unistr/u8-strchr: speed up searching for ASCII
>> characters
>>
>> * lib/unistr/u8-strchr.c (u8_strchr): Use strchr() for
>> the single byte case as it was measured to be 50% faster
>> than the existing code on x86 Linux. Also add a comment
>> on why not to use memmem() for the moment for the multibyte case.
>
> If p is surely a valid UTF-8 string, you can do better in general like
> this. Say [q, q+q_len) points to a UTF-8 representation of uc:
>
>   for (; (p = strchr (p, *q)) && memcmp (p+1, q+1, q_len-1); p += q_len)
>     ;
>
>   return p;
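For concreteness, here's how I read that loop fleshed out into a
compilable function (just a sketch, not the existing u8_strchr();
the function name is only for illustration, I'm using u8_uctomb()
to build the byte sequence for uc, and I'm assuming the input is a
NUL-terminated, valid UTF-8 string):

  #include <stdint.h>
  #include <string.h>
  #include "unistr.h"   /* for ucs4_t, u8_uctomb() */

  /* Sketch: find UC in the NUL-terminated, valid UTF-8 string STR,
     using strchr() on the lead byte and memcmp() on the rest.  */
  static uint8_t *
  u8_strchr_via_strchr (const uint8_t *str, ucs4_t uc)
  {
    uint8_t q[6];
    int q_len = u8_uctomb (q, uc, sizeof q);
    const char *p = (const char *) str;

    if (q_len < 0)              /* uc not representable in UTF-8 */
      return NULL;

    /* Validity of STR guarantees that when *p == q[0] the character
       at p is exactly q_len bytes long, so the memcmp() never reads
       past the terminating NUL and p += q_len never skips a match.  */
    for (; (p = strchr (p, q[0])) != NULL
           && memcmp (p + 1, q + 1, q_len - 1) != 0;
         p += q_len)
      continue;

    return (uint8_t *) p;
  }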
That would be an improvement if strchr() skipped over large parts of
p at a time, enough to counter the function call overhead. However,
the first byte of a multibyte UTF-8 character is the same for many
characters, so I'm guessing there would be lots of false positives
in practice?

> That's because once the first byte has matched, the length of the
> UTF-8 character is known to be q_len. It's better than memmem if the
> startup cost of strchr is low enough (of course memcmp has to be
> inlined/unrolled/unswitched to get decent performance).
>
> Does the argument of u8_strchr have this guarantee? If not, the above
> code can read arbitrary memory.

I was wondering myself about what parts of gnulib/unistring could
take advantage of assuming valid UTF-8 strings. From my own notes on
this function, I have:

  "Some possible optimizations would need to be conditional on
  CONFIG_UNICODE_SAFETY (see u8_mblen). Note also u8_mbtouc_unsafe()
  and u8_mbtouc(), the latter detecting invalid UTF-8 chars even
  without --enable-safety.

  So given the above I'm assuming that most of gnulib/unistring
  assumes valid UTF-8 (which users can enforce on input with
  u8_check()), and if a safe but inefficient implementation option
  is possible then it should be within CONFIG_UNICODE_SAFETY. Note
  I found no mention of --enable-safety in the gnulib/libunistring
  configure scripts."

cheers,
Pádraig.
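P.S. Since I mentioned u8_check() above, here's a quick sketch of
validating input once up front (u8_check() returns NULL when the
buffer is valid UTF-8, otherwise a pointer to the first invalid
unit; the buffer contents are just an example):

  #include <stdint.h>
  #include <stdio.h>
  #include "unistr.h"   /* for u8_check() */

  int
  main (void)
  {
    /* "café" encoded as UTF-8; sizeof - 1 excludes the NUL.  */
    static const uint8_t buf[] = "caf\xC3\xA9";
    const uint8_t *bad = u8_check (buf, sizeof buf - 1);

    if (bad != NULL)
      {
        fprintf (stderr, "invalid UTF-8 at offset %td\n", bad - buf);
        return 1;
      }

    /* From here on it's safe to use the functions that assume
       valid UTF-8 input.  */
    return 0;
  }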
