Re: [PATCH] improve string find algorithm

Jakub Jelinek Fri, 06 Jan 2017 07:29:13 -0800

On Fri, Jan 06, 2017 at 03:16:12PM +0000, Jonathan Wakely wrote:
> On 06/01/17 16:09 +0100, Jakub Jelinek wrote:
> > On Fri, Jan 06, 2017 at 08:42:15AM -0600, Aditya Kumar wrote:
> > > Yes, we do.
> > > Sorry for the mistake, it happened because I first wrote this for
> > > libcxx (https://reviews.llvm.org/D27068) and while porting that line
> > > got missed.
> > 
> > Shouldn't find at least in the case where it is narrow char string
> > just use C library memmem?  That implements a Two-Way searching algorithm
> > with some improvements from Boyer-Moore.
> > Otherwise, consider what even your modified version will do for
> > 
> > #include <string>
> > int main() {
> > (std::string(10000000, 'a')+"b").find(std::string(1000000, 'a')+"b");
> > }
> 
> Yes, I have an incomplete local patch that checks for memmem and uses
> it where possible. This change would still be valuable for non-GNU
> targets though.


The description of Two-Way algorithm is in
http://www-igm.univ-mlv.fr/~lecroq/string/node26.html#SECTION00260
Boyer-Moore in
http://en.wikipedia.org/wiki/Boyer-Moore_string_search_algorithm

Of course, as you know the lengths of both strings, you can quite cheaply
decide whether it is worth spending some time building data structures
to speed up the following search or not (but so does memmem already,
e.g. handle with no searching zero length needle, or needle longer than
haystack).  Or for short needles start with memchr of the first char, while
for longer needles don't do that etc.

> > Or does the C++ library need to reinvent everything implemented in the C
> > library?
> 
> In this case yes, because strstr doesn't take the length of the
> strings into account, and memmem isn't standard. Because std::string
> knows its length we can do better than strstr for some cases, such as
> std::string(1000, 'a').find(std::string(1001, 'a')).

        Jakub

Re: [PATCH] improve string find algorithm

Reply via email to