On 28/07/10 22:32, Bruno Haible wrote:
> Pádraig Brady wrote:
>> I would suggest a new function due to the
>> way I see this function called most often.
>>
>> /* definitely not sure of this name */
>> uint8_t *
>> u8_str_u8_chr (const uint8_t *s, const uint8_t *c, size_t size)
>> {
>>   switch (size):
>>   {
>>     case 1:
>>       return (uint8_t *) strchr ((const char *) s, *c);
>>     case 2:
>>       //use logic from current u8_strchr()
>>     case 3:
>>       ...
>>     case 4:
>>       ...
>>   }
>> }
>> ...
>> while ((f=u8_str_u8_chr (s, "–", 3));
>
> Such an API does not appear very robust to me: it is quite easy to
> mistakenly pass a string consisting of more or less than 1 character as
> second argument. If the argument to be searched for is given as an
> UTF-8 string rather than as an ucs4_t

It's not that confusing to me, but fair enough.

> I would better recommend to use
> the u8_strstr function.

I wonder could we speed that up for UTF-8
by just deferring to strstr() ?
I've not tested this so feel free to bin it.

cheers,
Pádraig.

commit 8b154a3421de21254e628085ccf22ce736947635
Author: Pádraig Brady <p...@draigbrady.com>
Date:   Thu Jul 29 08:16:20 2010 +0100

    unistr/u8-strstr: simplify and probably speedup the UTF-8 case

    * lib/unistr/u-strstr.h (UTF8_MODE): A new define so we
    can do a compile time check for code to use for the UTF-8 case.
    * lib/unistr/u8-strstr.c (u8_strstr): Use strstr() for UTF-8
    and needles bigger than 1 byte as it's simpler and probably faster.
    Also add a comment about when using u8_strchr() may be faster.
    * modules/unistr/u8-strstr: Depend on strstr-simple so that
    we don't access out of bounds memory on glibc-2.10 on 64 bit platforms.

diff --git a/ChangeLog b/ChangeLog
index 897387c..d3f8ccc 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,3 +1,8 @@
+2010-07-29  Pádraig Brady  <p...@draigbrady.com>
+
+	* lib/unistr/u8-strstr.c (u8_strstr): Use strstr() as it's probably
+	faster.
+
 2010-07-26  Paul R. Eggert  <egg...@cs.ucla.edu>
 
 	timespec: use cast and not conditional, as truncation isn't possible
diff --git a/lib/unistr/u-strstr.h b/lib/unistr/u-strstr.h
index df32be8..9fb64cd 100644
--- a/lib/unistr/u-strstr.h
+++ b/lib/unistr/u-strstr.h
@@ -28,6 +28,13 @@ FUNC (const UNIT *haystack, const UNIT *needle)
   if (needle[1] == 0)
     return U_STRCHR (haystack, first);
 
+#if UTF8_MODE
+  /* Optimize/simplify the UTF-8 case.
+     Note to users of u8_strstr(), if passing a single multibyte character
+     as a needle, then it may be faster to convert the needle to ucs4_t
+     and use u8_strchr(), for longer haystacks.  */
+  return (uint8_t *) strstr ((const char *) haystack, (const char *) needle);
+#else
   /* Search for needle's first unit.  */
   for (; *haystack != 0; haystack++)
     if (*haystack == first)
@@ -44,6 +51,7 @@ FUNC (const UNIT *haystack, const UNIT *needle)
             return (UNIT *) haystack;
         }
       }
+#endif
 
   return NULL;
 }
diff --git a/lib/unistr/u8-strstr.c b/lib/unistr/u8-strstr.c
index cce37ad..37f2aa4 100644
--- a/lib/unistr/u8-strstr.c
+++ b/lib/unistr/u8-strstr.c
@@ -20,9 +20,12 @@
 /* Specification.  */
 #include "unistr.h"
 
+#include <string.h>
+
 /* FIXME: Maybe walking the string via u8_mblen is a win?  */
 
 #define FUNC u8_strstr
 #define UNIT uint8_t
 #define U_STRCHR u8_strchr
+#define UTF8_MODE 1
 #include "u-strstr.h"
diff --git a/modules/unistr/u8-strstr b/modules/unistr/u8-strstr
index 5996917..2531ec1 100644
--- a/modules/unistr/u8-strstr
+++ b/modules/unistr/u8-strstr
@@ -7,6 +7,7 @@ lib/unistr/u-strstr.h
 
 Depends-on:
 unistr/base
+strstr-simple
 
 configure.ac:
 gl_LIBUNISTRING_MODULE([0.9], [unistr/u8-strstr])
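
For anyone wanting to try the idea outside of gnulib, here is a rough
standalone sketch of what the patch relies on. It assumes both strings are
well-formed, NUL-terminated UTF-8; in that case lead bytes and continuation
bytes occupy disjoint ranges, so a byte-wise strstr() match can only begin
and end on a character boundary. The names sketch_u8_find and
sketch_u8_count are hypothetical helpers made up for this illustration and
are not part of gnulib or libunistring.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not a gnulib/libunistring function): return a
   pointer to the first occurrence of NEEDLE in HAYSTACK, or NULL.
   Assumption: both arguments are well-formed NUL-terminated UTF-8, so
   a byte-wise match is also a character-wise match.  */
const uint8_t *
sketch_u8_find (const uint8_t *haystack, const uint8_t *needle)
{
  assert (haystack != NULL && needle != NULL);
  return (const uint8_t *) strstr ((const char *) haystack,
                                   (const char *) needle);
}

/* Hypothetical helper: count the non-overlapping occurrences of NEEDLE
   in HAYSTACK, in the style of the while loop quoted above.  An empty
   needle is treated as occurring zero times, to avoid looping forever.  */
size_t
sketch_u8_count (const uint8_t *haystack, const uint8_t *needle)
{
  size_t needle_len = strlen ((const char *) needle);
  size_t count = 0;
  const uint8_t *p;

  if (needle_len == 0)
    return 0;

  /* Resume each search just past the previous match.  */
  for (p = sketch_u8_find (haystack, needle);
       p != NULL;
       p = sketch_u8_find (p + needle_len, needle))
    count++;

  return count;
}
```

With this shape the caller passes the needle as an ordinary UTF-8 string
and never has to state its byte length, which sidesteps the robustness
concern about the explicit size argument. Whether it actually beats
converting a single character to ucs4_t and calling u8_strchr() would
still need measuring.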