Hi Pádraig, > > But at the same time, can you also try to apply the O(n) algorithm to > > u16_strstr and u32_strstr? > > Attached is a first attempt at this.
Thanks for working on this. > Questions I have are: > > Is it OK to include str-mp.h and malloca > in unistr/ modules like I do? Yes, sure. Why not? Possible namespacing problems are solved by libunistring's Makefile. Let me go through your patch. 1) It's a good choice to use the Knuth-Morris-Pratt algorithm for u16_strstr and u32_strstr. It's O(n). Eric's two-way algorithm is O(n) as well, but is more complicated code and requires to set up a table whose size is 1 << CHAR_BIT, which would be 1 << 16 for uint16_t or 1 << 32 for uint32_t - infeasible. 2) It's a good choice to parameterize str-kmp.h, rather than to duplicate code. But while doing so, please try to avoid gcc warnings. Always test your patches with CPPFLAGS="-Wall" before submitting them: mbscasestr.c:376: warning: pointer targets in passing argument 1 of ‘knuth_morris_pratt’ differ in signedness mbscasestr.c:376: warning: pointer targets in passing argument 2 of ‘knuth_morris_pratt’ differ in signedness mbscasestr.c:376: warning: passing argument 4 of ‘knuth_morris_pratt’ from incompatible pointer type mbsstr.c:347: warning: pointer targets in passing argument 1 of ‘knuth_morris_pratt’ differ in signedness mbsstr.c:347: warning: pointer targets in passing argument 2 of ‘knuth_morris_pratt’ differ in signedness And there's no reason to remove part of the specification of CANON_ELEMENT in lib/str-kmp.h. Removing reasonable comments is a good way to have your patches rejected. 3) About the module descriptions: Makefile.am: if LIBUNISTRING_COMPILE_UNISTR_U16_STRSTR -lib_SOURCES += unistr/u16-strstr.c +lib_SOURCES += unistr/u16-strstr.c str-kmp.h endif You don't need to add .h files to lib_SOURCES. 4) The line gl_LIBUNISTRING_MODULE([0.9], [unistr/u32-strstr]) needs to be updated, because with a libunistring from versions 0.9 .. 0.9.3 the new unit tests will fail. So when a libunistring version < 0.9.4 is found, you need to enforce compilation of the module from source. This is done by changing the line to gl_LIBUNISTRING_MODULE([0.9.4], [unistr/u32-strstr]) 5) The unit test tests/unistr/test-u-strstr.h: It is good to share code among tests. rather than write down the same code 3 times. But please follow GNU coding style: indentation by 2 spaces, not 4, and put spaces around arithmetic operators. More importantly, though, the test code that you put there is part of tests/test-strstr.c and tests/test-mbsstr1.c. Why only a part? u16_strstr and u32_strstr should have all the properties that strstr and mbsstr have. So the right thing to do is to adapt all of tests/test-strstr.c into tests/unistr/test-u-strstr.h. I've done this, and now I get test failures: /bin/sh: line 5: 24308 Alarm clock EXEEXT='' srcdir='.' MAKE='make' ${dir}$tst FAIL: test-u16-strstr /bin/sh: line 5: 24325 Alarm clock EXEEXT='' srcdir='.' MAKE='make' ${dir}$tst FAIL: test-u32-strstr 6) In unit tests, the order of the #includes is the following: - First comes <config.h>. - Then comes the header file supposed to declare the particular functionality. This comes immediately after <config.h> in order to check that the header is self-contained. - Then come system includes <...>. - Then come gnulib includes "...". 7) Also please go into more details when writing a ChangeLog entry. Always mention the function names, especially when you're renaming a function. So, the first part that can go in is this: 2011-01-21 Pádraig Brady <p...@draigbrady.com> Bruno Haible <br...@clisp.org> Prepare for faster uN_strstr functions. * lib/str-kmp.h: Support definable UNITs. (knuth_morris_pratt): Renamed from knuth_morris_pratt_unibyte. Add needle_len argument. * lib/mbsstr.c (mbsstr): Adjust for the changed str-kmp.h. * lib/mbscasestr.c (mbscasestr): Likewise. --- lib/mbscasestr.c.orig Fri Jan 21 13:29:32 2011 +++ lib/mbscasestr.c Fri Jan 21 13:05:06 2011 @@ -30,6 +30,7 @@ #define TOLOWER(Ch) (isupper (Ch) ? tolower (Ch) : (Ch)) /* Knuth-Morris-Pratt algorithm. */ +#define UNIT unsigned char #define CANON_ELEMENT(c) TOLOWER (c) #include "str-kmp.h" @@ -368,10 +369,12 @@ if (needle_last_ccount == NULL) { /* Try the Knuth-Morris-Pratt algorithm. */ - const char *result; + const unsigned char *result; bool success = - knuth_morris_pratt_unibyte (haystack, needle - 1, - &result); + knuth_morris_pratt ((const unsigned char *) haystack, + (const unsigned char *) (needle - 1), + strlen (needle - 1), + &result); if (success) return (char *) result; try_kmp = false; --- lib/mbsstr.c.orig Fri Jan 21 13:29:32 2011 +++ lib/mbsstr.c Fri Jan 21 13:05:06 2011 @@ -27,6 +27,7 @@ #include "mbuiter.h" /* Knuth-Morris-Pratt algorithm. */ +#define UNIT unsigned char #define CANON_ELEMENT(c) c #include "str-kmp.h" @@ -339,10 +340,12 @@ if (needle_last_ccount == NULL) { /* Try the Knuth-Morris-Pratt algorithm. */ - const char *result; + const unsigned char *result; bool success = - knuth_morris_pratt_unibyte (haystack, needle - 1, - &result); + knuth_morris_pratt ((const unsigned char *) haystack, + (const unsigned char *) (needle - 1), + strlen (needle - 1), + &result); if (success) return (char *) result; try_kmp = false; --- lib/str-kmp.h.orig Fri Jan 21 13:29:32 2011 +++ lib/str-kmp.h Fri Jan 21 13:05:06 2011 @@ -1,4 +1,4 @@ -/* Substring search in a NUL terminated string of 'char' elements, +/* Substring search in a NUL terminated string of UNIT elements, using the Knuth-Morris-Pratt algorithm. Copyright (C) 2005-2011 Free Software Foundation, Inc. Written by Bruno Haible <br...@clisp.org>, 2005. @@ -18,21 +18,26 @@ Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA. */ /* Before including this file, you need to define: + UNIT The element type of the needle and haystack. CANON_ELEMENT(c) A macro that canonicalizes an element right after - it has been fetched from one of the two strings. - The argument is an 'unsigned char'; the result - must be an 'unsigned char' as well. */ + it has been fetched from needle or haystack. + The argument is of type UNIT; the result must be + of type UNIT as well. */ /* Knuth-Morris-Pratt algorithm. See http://en.wikipedia.org/wiki/Knuth-Morris-Pratt_algorithm + HAYSTACK is the NUL terminated string in which to search for. + NEEDLE is the string to search for in HAYSTACK, consisting of NEEDLE_LEN + units. Return a boolean indicating success: Return true and set *RESULTP if the search was completed. Return false if it was aborted because not enough memory was available. */ static bool -knuth_morris_pratt_unibyte (const char *haystack, const char *needle, - const char **resultp) +knuth_morris_pratt (const UNIT *haystack, + const UNIT *needle, size_t needle_len, + const UNIT **resultp) { - size_t m = strlen (needle); + size_t m = needle_len; /* Allocate the table. */ size_t *table = (size_t *) nmalloca (m, sizeof (size_t)); @@ -66,14 +71,14 @@ The inequality needle[x..i-1] != needle[0..i-1-x] is known to hold for x < table[i-1], by induction. Furthermore, if j>0: needle[i-1-j..i-2] = needle[0..j-1]. */ - unsigned char b = CANON_ELEMENT ((unsigned char) needle[i - 1]); + UNIT b = CANON_ELEMENT (needle[i - 1]); for (;;) { /* Invariants: The inequality needle[x..i-1] != needle[0..i-1-x] is known to hold for x < i-1-j. Furthermore, if j>0: needle[i-1-j..i-2] = needle[0..j-1]. */ - if (b == CANON_ELEMENT ((unsigned char) needle[j])) + if (b == CANON_ELEMENT (needle[j])) { /* Set table[i] := i-1-j. */ table[i] = i - ++j; @@ -108,17 +113,16 @@ /* Search, using the table to accelerate the processing. */ { size_t j; - const char *rhaystack; - const char *phaystack; + const UNIT *rhaystack; + const UNIT *phaystack; *resultp = NULL; j = 0; rhaystack = haystack; phaystack = haystack; /* Invariant: phaystack = rhaystack + j. */ - while (*phaystack != '\0') - if (CANON_ELEMENT ((unsigned char) needle[j]) - == CANON_ELEMENT ((unsigned char) *phaystack)) + while (*phaystack != 0) + if (CANON_ELEMENT (needle[j]) == CANON_ELEMENT (*phaystack)) { j++; phaystack++;