On 07/12/2010 01:38 AM, Pádraig Brady wrote:
On 11/07/10 15:20, Paolo Bonzini wrote:
On 07/07/2010 03:44 PM, Pádraig Brady wrote:
Subject: [PATCH] unistr/u8-strchr: speed up searching for ASCII
characters
* lib/unistr/u8-strchr.c (u8_strchr): Use strchr() for
the single byte case as it was measured to be 50% faster
than the existing code on x86 linux. Also add a comment
on why not to use memmem() for the moment for the multibyte case.
If p is known to be a valid UTF-8 string, you can do better in general like
this. Say [q, q+q_len) holds the UTF-8 representation of uc:
for (; (p = strchr (p, *q)) && memcmp (p+1, q+1, q_len-1); p += q_len)
  ;
return p;
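
For reference, here is a self-contained sketch of that loop. It is only an illustration under stated assumptions, not the code in lib/unistr/u8-strchr.c: the names encode_utf8 and sketch_u8_strchr are hypothetical, uc is assumed to be a valid Unicode scalar value, and s is assumed to be valid NUL-terminated UTF-8. Validity matters because the lead byte then fixes the character length, which makes both the memcmp and the p += q_len step safe.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: write the UTF-8 form of uc into buf and return
   its length.  Assumes uc is a valid Unicode scalar value.  */
static size_t
encode_utf8 (uint32_t uc, unsigned char buf[4])
{
  if (uc < 0x80)
    {
      buf[0] = (unsigned char) uc;
      return 1;
    }
  if (uc < 0x800)
    {
      buf[0] = (unsigned char) (0xC0 | (uc >> 6));
      buf[1] = (unsigned char) (0x80 | (uc & 0x3F));
      return 2;
    }
  if (uc < 0x10000)
    {
      buf[0] = (unsigned char) (0xE0 | (uc >> 12));
      buf[1] = (unsigned char) (0x80 | ((uc >> 6) & 0x3F));
      buf[2] = (unsigned char) (0x80 | (uc & 0x3F));
      return 3;
    }
  buf[0] = (unsigned char) (0xF0 | (uc >> 18));
  buf[1] = (unsigned char) (0x80 | ((uc >> 12) & 0x3F));
  buf[2] = (unsigned char) (0x80 | ((uc >> 6) & 0x3F));
  buf[3] = (unsigned char) (0x80 | (uc & 0x3F));
  return 4;
}

/* Sketch of the proposed search.  s must be valid UTF-8: a matching
   lead byte then guarantees q_len - 1 continuation bytes follow it,
   so memcmp never reads past the terminator.  */
static const unsigned char *
sketch_u8_strchr (const unsigned char *s, uint32_t uc)
{
  unsigned char q[4];
  size_t q_len = encode_utf8 (uc, q);
  const char *p = (const char *) s;

  /* Stop when strchr fails (p == NULL) or the trailing bytes match.
     On a false positive, skip the whole character that was found.  */
  for (; (p = strchr (p, q[0])) != NULL
         && memcmp (p + 1, q + 1, q_len - 1) != 0;
       p += q_len)
    ;
  return (const unsigned char *) p;
}
```

Like strchr, this returns a pointer to the terminator when uc is 0, since q_len is then 1 and the memcmp compares zero bytes.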
That would be an improvement if strchr() skipped a lot of p at a time,
to counter the function call overhead. However, the first byte of a multibyte
UTF-8 char is the same for a lot of characters, so I'm guessing there would
be lots of false positives in practice?
I guess it depends. Absolutely awful for Greek/Arabic/etc., probably
not too bad for European languages. Also probably not too bad when
searching in mixed single-/multi-byte text (e.g. code with foreign
language comments).
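
To put a rough number on the false-positive concern, one could count how often the lead byte of the sought character occurs in a sample text. The sketch below is only an illustration; count_lead_byte_hits is a hypothetical name, and the input is assumed to be valid UTF-8, where a lead byte (0xC2..0xF4) can never appear as a continuation byte.

```c
#include <stddef.h>
#include <string.h>

/* Count the positions where strchr would stop when asked for `lead`.
   Every hit beyond the real matches is a false positive that costs one
   strchr call plus one memcmp in the loop discussed above.  */
static size_t
count_lead_byte_hits (const char *s, unsigned char lead)
{
  size_t hits = 0;

  if (lead == 0)   /* strchr would find the terminator; nothing to count */
    return 0;
  for (const char *p = s; (p = strchr (p, lead)) != NULL; p++)
    hits++;
  return hits;
}
```

For example, every letter of the Greek text "αβγδε" (CE B1 CE B2 CE B3 CE B4 CE B5) starts with 0xCE, so a search for any one of them stops at all five characters, while in the French "déjà vu" only the two accented letters share the lead byte 0xC3.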
A lot of the startup overhead of strchr is to align to a word and
multiply the sought character by 0x01010101. All of this could be done
only once. I wonder if a completely inlined fast strchr would be too
complex to be worth the improvement...
Paolo
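
A minimal sketch of what hoisting that setup could look like, assuming 32-bit words. This is not glibc's strchr: the names byte_finder, byte_finder_init and byte_finder_next are hypothetical, and instead of aligning to a word it loads through memcpy and takes an explicit end pointer, so it never reads past the caller's buffer. A real strchr has no end pointer, which is why it aligns first.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ONES  0x01010101u
#define HIGHS 0x80808080u

/* Nonzero iff at least one byte of w is zero.  */
static uint32_t
has_zero_byte (uint32_t w)
{
  return (w - ONES) & ~w & HIGHS;
}

/* Per-character setup, computed once and reused for every call.  */
struct byte_finder
{
  uint32_t pattern;   /* the sought byte copied into all four bytes */
  unsigned char c;
};

static struct byte_finder
byte_finder_init (unsigned char c)
{
  struct byte_finder f = { ONES * c, c };
  return f;
}

/* Find f->c in the NUL-terminated string at s, where end points one
   past the last readable byte (an assumption plain strchr does not
   have).  Returns NULL if the terminator comes first.  */
static const char *
byte_finder_next (const struct byte_finder *f, const char *s, const char *end)
{
  /* Skip whole words that contain neither the sought byte nor NUL.  */
  while ((size_t) (end - s) >= sizeof (uint32_t))
    {
      uint32_t w;
      memcpy (&w, s, sizeof w);   /* load without alignment requirements */
      if (has_zero_byte (w) || has_zero_byte (w ^ f->pattern))
        break;
      s += sizeof w;
    }
  /* Locate the exact byte within the last word or the tail.  */
  for (; s < end; s++)
    {
      if ((unsigned char) *s == f->c)
        return s;
      if (*s == '\0')
        return NULL;
    }
  return NULL;
}
```

The caller would run byte_finder_init once per search and then call byte_finder_next after each false positive, so the multiply is paid only once. Whether that beats a tuned libc strchr would have to be measured.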