Re: [PATCH v2 0/5] Speed up uNN_chr and uNN_strchr with Boyer-Moore algorithm

Pádraig Brady Thu, 29 Jul 2010 02:06:25 -0700

On 28/07/10 22:32, Bruno Haible wrote:
> Pádraig Brady wrote:
>> I would suggest a new function due to the
>> way I see this function called most often.
>>
>> /* definitely not sure of this name */
>> uint8_t *
>> u8_str_u8_chr (const uint8_t *s, const uint8_t *c, size_t size)
>> {
>>   switch (size)
>>     {
>>     case 1:
>>       return (uint8_t *) strchr ((const char *) s, *c);
>>     case 2:
>>       //use logic from current u8_strchr()
>>     case 3:
>>       ...
>>     case 4:
>>       ...
>>     }
>> }
>> ...
>> while ((f = u8_str_u8_chr (s, "–", 3)))
> 
> Such an API does not appear very robust to me: it is quite easy to
> mistakenly pass a string consisting of more or less than 1 character as
> second argument. If the argument to be searched for is given as an
> UTF-8 string rather than as an ucs4_t


It's not that confusing to me, but fair enough.

> I would better recommend to use
> the u8_strstr function.

I wonder whether we could speed that up for UTF-8
by just deferring to strstr()?
I've not tested the patch below, so feel free to bin it.
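
The reason a plain byte search should be safe here is that UTF-8 is
self-synchronizing: lead bytes and continuation bytes (10xxxxxx) come from
disjoint ranges, so a well-formed needle can only match starting at a
character boundary of a well-formed haystack. A minimal sketch of that idea
follows. The helper names utf8_strstr_bytes and utf8_is_char_start are
hypothetical, made up for illustration, and are not part of gnulib or
libunistring; both arguments are assumed to be well-formed, NUL-terminated
UTF-8.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not a gnulib/libunistring function): find the first
   occurrence of NEEDLE in HAYSTACK by comparing bytes with strstr().
   Assumes both are well-formed, NUL-terminated UTF-8 strings.  */
static uint8_t *
utf8_strstr_bytes (const uint8_t *haystack, const uint8_t *needle)
{
  assert (haystack != NULL && needle != NULL);
  return (uint8_t *) strstr ((const char *) haystack, (const char *) needle);
}

/* Hypothetical helper: return nonzero if P does not point at a UTF-8
   continuation byte (10xxxxxx), i.e. P is at the start of a character
   or at the terminating NUL.  */
static int
utf8_is_char_start (const uint8_t *p)
{
  return (*p & 0xC0) != 0x80;
}
```

For example, searching "a–b" (with "–" encoded as the bytes E2 80 93) for
"–" with utf8_strstr_bytes would return a pointer to the E2 byte, for which
utf8_is_char_start is true. Whether this is actually faster than the current
loop depends on the platform's strstr(), which is untested here.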

cheers,
Pádraig.

commit 8b154a3421de21254e628085ccf22ce736947635
Author: Pádraig Brady <p...@draigbrady.com>
Date:   Thu Jul 29 08:16:20 2010 +0100

    unistr/u8-strstr: simplify and probably speed up the UTF-8 case

    * lib/unistr/u-strstr.h (UTF8_MODE): A new define so we can
    do a compile time check for code to use for the UTF-8 case.
    * lib/unistr/u8-strstr.c (u8_strstr): Use strstr() for UTF-8 and
    needles bigger than 1 byte as it's simpler and probably faster.
    Also add a comment about when using u8_strchr() may be faster.
    * modules/unistr/u8-strstr: Depend on strstr-simple so that we don't
    access out of bounds memory on glibc-2.10 on 64 bit platforms.

diff --git a/ChangeLog b/ChangeLog
index 897387c..d3f8ccc 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,3 +1,8 @@
+2010-07-29  Pádraig Brady  <p...@draigbrady.com>
+
+       * lib/unistr/u8-strstr.c (u8_strstr): Use strstr() as it's probably
+       faster.
+
 2010-07-26  Paul R. Eggert  <egg...@cs.ucla.edu>

        timespec: use cast and not conditional, as truncation isn't possible
diff --git a/lib/unistr/u-strstr.h b/lib/unistr/u-strstr.h
index df32be8..9fb64cd 100644
--- a/lib/unistr/u-strstr.h
+++ b/lib/unistr/u-strstr.h
@@ -28,6 +28,13 @@ FUNC (const UNIT *haystack, const UNIT *needle)
   if (needle[1] == 0)
     return U_STRCHR (haystack, first);

+#if UTF8_MODE
+  /* Optimize/simplify the UTF-8 case.
+     Note to users of u8_strstr(), if passing a single multibyte character
+     as a needle, then it may be faster to convert the needle to ucs4_t
+     and use u8_strchr(), for longer haystacks.  */
+  return (uint8_t *) strstr ((const char *) haystack, (const char *) needle);
+#else
   /* Search for needle's first unit.  */
   for (; *haystack != 0; haystack++)
     if (*haystack == first)
@@ -44,6 +51,7 @@ FUNC (const UNIT *haystack, const UNIT *needle)
               return (UNIT *) haystack;
           }
       }
+#endif

   return NULL;
 }
diff --git a/lib/unistr/u8-strstr.c b/lib/unistr/u8-strstr.c
index cce37ad..37f2aa4 100644
--- a/lib/unistr/u8-strstr.c
+++ b/lib/unistr/u8-strstr.c
@@ -20,9 +20,12 @@
 /* Specification.  */
 #include "unistr.h"

+#include <string.h>
+
 /* FIXME: Maybe walking the string via u8_mblen is a win?  */

 #define FUNC u8_strstr
 #define UNIT uint8_t
 #define U_STRCHR u8_strchr
+#define UTF8_MODE 1
 #include "u-strstr.h"
diff --git a/modules/unistr/u8-strstr b/modules/unistr/u8-strstr
index 5996917..2531ec1 100644
--- a/modules/unistr/u8-strstr
+++ b/modules/unistr/u8-strstr
@@ -7,6 +7,7 @@ lib/unistr/u-strstr.h

 Depends-on:
 unistr/base
+strstr-simple

 configure.ac:
 gl_LIBUNISTRING_MODULE([0.9], [unistr/u8-strstr])
