On 29/07/10 10:26, Paolo Bonzini wrote: > On 07/29/2010 11:05 AM, Pádraig Brady wrote: >>> I would better recommend to use >>> the u8_strstr function. >> >> I wonder could we speed that up for UTF-8 >> by just deferring to strstr() ? >> I've not tested this so feel free to bin it. > > An alternative idea: check if there is more than 1 character using > u8_mbtouc. If there is one, use the faster u8_strchr algorithm (you can > just use u8_strchr, even though that does a useless conversion back to > UTF-8). If there is more than one, use strstr directly.
I thought of that but thought it would be too much overhead for the general case since it seemed like it would need strlen, u8_mbsnlen & u8_mbtouc, with conversions back and forth. > And this gives another opportunity: we can now define u8_strstr and > u16_strstr to always return a pointer to a valid UTF-8 sequence: > > /* Find the first occurrence of NEEDLE in HAYSTACK. For UTF-8 and > UTF-16, the returned value will never point in the middle of a > multibyte sequence or surrogate pair. If this cannot be satisfied, > for example due to an invalid NEEDLE, NULL is returned. */ > > Why? Because the only cases when you can get a false positive are: 1) > when you have illegal sequences at the beginning of the needle, which > now we could easily check; 2) when both the needle and haystack have the > same illegal UTF-8 in it, which I think could be ignored. > > For u16_strstr you need to look for an invalid surrogate character at > the beginning of the needle. > > Thoughts? Would anybody give a shot to the above? But with those benefits it might be worth it? cheers, Pádraig.