On 07/29/2010 11:05 AM, Pádraig Brady wrote:
I would better recommend to use
the u8_strstr function.
I wonder could we speed that up for UTF-8
by just deferring to strstr() ?
I've not tested this so feel free to bin it.
An alternative idea: check if there is more than 1 character using
u8_mbtouc. If there is one, use the faster u8_strchr algorithm (you can
just use u8_strchr, even though that does a useless conversion back to
UTF-8). If there is more than one, use strstr directly.
And this gives another opportunity: we can now define u8_strstr and
u16_strstr to always return a pointer to a valid UTF-8 sequence:
/* Find the first occurrence of NEEDLE in HAYSTACK. For UTF-8 and
UTF-16, the returned value will never point in the middle of a
multibyte sequence or surrogate pair. If this cannot be satisfied,
for example due to an invalid NEEDLE, NULL is returned. */
Why? Because the only cases when you can get a false positive are: 1)
when you have illegal sequences at the beginning of the needle, which
now we could easily check; 2) when both the needle and haystack have the
same illegal UTF-8 in it, which I think could be ignored.
For u16_strstr you need to look for an invalid surrogate character at
the beginning of the needle.
Thoughts? Would anybody give a shot to the above?
Paolo