Re: A subroutine to find every occurrence of a substring in a string?

Chas. Owens Thu, 10 Jun 2010 08:45:09 -0700

On Thu, Jun 10, 2010 at 10:26, Rob Dixon wrote:
snip
> It is possible that index() is faster than regular expressions, but I would
> write the code below.
snip


It looks like, at least on Perl 5.12.0, iterated_regex is faster than
index for finding non-overlapping substrings when the substring is
small, but index starts to win as the substring grows in size.  For
short substrings the real winner is split, though index overtakes it
once the substring reaches 1000 sets of "abab".  split is basically
the same algorithm as iterated_regex, but the loop is implemented in C
instead of Perl.
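
All four approaches benchmarked here count non-overlapping matches.
If overlapping occurrences matter, a minimal sketch (count_overlapping
is a hypothetical helper, not something from this thread) is to
advance index by one character instead of by the substring length:

```perl
use strict;
use warnings;

# count_overlapping is a hypothetical helper for illustration only
sub count_overlapping {
    my ($string, $substring) = @_;

    # an empty substring would match forever, so treat it as zero hits
    return 0 if $substring eq '';

    my $n   = 0;
    my $pos = index $string, $substring;
    while ($pos != -1) {
        $n++;
        # step forward by one character, not by length($substring)
        $pos = index $string, $substring, $pos + 1;
    }

    return $n;
}

print count_overlapping("ababababab", "abab"), "\n";   # prints 4
```

The same test string gives 2 with the non-overlapping versions, since
each match there resumes after the end of the previous one.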

I would probably still go with an inline pure_regex type
implementation.  I don't see the value of making a function to do this
unless profiling shows that I have a bottleneck in that section of the
code and need the extra performance of something that can't be easily
inlined.
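
By inline I mean something like the following sketch ($text and
$needle are placeholder names):

```perl
use strict;
use warnings;

# $text and $needle are placeholder names for this sketch
my $text   = "ababababab";
my $needle = "abab";

# Assigning the match to an empty list forces list context; assigning
# that list assignment to a scalar yields the number of matches.
# \Q quotes any regex metacharacters in $needle.
my $count = () = $text =~ /\Q$needle/g;

print "$count\n";   # prints 2
```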

Perl version 5.012000
index => 2
iterated_regex => 2
pure_regex => 2
split => 2
                    Rate   pure_regex         index iterated_regex         split
pure_regex      608548/s           --           -6%           -46%          -74%
index           649176/s           7%            --           -42%          -73%
iterated_regex 1124391/s          85%           73%             --          -52%
split          2360644/s         288%          264%           110%            --
                    Rate   pure_regex         index iterated_regex         split
pure_regex      613304/s           --           -6%           -44%          -75%
index           649177/s           6%            --           -41%          -73%
iterated_regex 1101602/s          80%           70%             --          -54%
split          2406041/s         292%          271%           118%            --
                    Rate   pure_regex         index iterated_regex         split
pure_regex      590160/s           --           -9%           -46%          -75%
index           651987/s          10%            --           -40%          -73%
iterated_regex 1092266/s          85%           68%             --          -55%
split          2406041/s         308%          269%           120%            --
                    Rate   pure_regex         index iterated_regex         split
pure_regex      601510/s           --           -7%           -47%          -75%
index           649176/s           8%            --           -42%          -73%
iterated_regex 1124391/s          87%           73%             --          -54%
split          2429401/s         304%          274%           116%            --
                    Rate   pure_regex         index iterated_regex         split
pure_regex      590160/s           --           -8%           -48%          -76%
index           643109/s           9%            --           -43%          -74%
iterated_regex 1137401/s          93%           77%             --          -54%
split          2477508/s         320%          285%           118%            --

substring of 10 sets
                 Rate    pure_regex iterated_regex          index          split
pure_regex      128/s            --           -47%           -56%           -91%
iterated_regex  241/s           88%             --           -17%           -84%
index           290/s          126%            20%             --           -80%
split          1478/s         1050%           513%           409%             --

substring of 100 sets
                 Rate iterated_regex    pure_regex          index          split
iterated_regex  522/s             --          -35%           -64%           -74%
pure_regex      807/s            55%            --           -44%           -60%
index          1437/s           175%           78%             --           -28%
split          1999/s           283%          148%            39%             --

substring of 1000 sets
                 Rate iterated_regex    pure_regex          split          index
iterated_regex  599/s             --          -62%           -71%           -75%
pure_regex     1569/s           162%            --           -24%           -36%
split          2055/s           243%           31%             --           -16%
index          2443/s           308%           56%            19%             --

substring of 10000 sets
                 Rate iterated_regex    pure_regex          split          index
iterated_regex  570/s             --          -56%           -68%           -77%
pure_regex     1309/s           130%            --           -26%           -48%
split          1761/s           209%           34%             --           -30%
index          2510/s           340%           92%            43%             --

#!/usr/bin/perl

use strict;
use warnings;

use Benchmark;

print "Perl version $]\n";

my $string    = "ababababab";
my $substring = "abab";

my %subs = (
        pure_regex => sub {
                return scalar( ()= $string =~ /\Q$substring/g );
        },
        iterated_regex => sub {
                my $n = 0; # return 0, not undef, when nothing matches
                $n++ while $string =~ /\Q$substring/g;
                return $n;
        },
        index => sub {
                my $offset = 0;
                my $result = index $string, $substring, $offset;
                my $length = length $substring;

                my $n = 0; # return 0, not undef, when nothing matches
                while ($result != -1) {
                        $n++;
                        $offset = $result + $length;
                        $result = index $string, $substring, $offset;
                }

                return $n;
        },
        split => sub {
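                # a limit of -1 keeps trailing empty fields, so matches
                # at the very end of the string are still counted
                my $c = split /\Q$substring/, $string, -1;
                #handle the empty string case
                return $c ? $c - 1 : 0;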
        }
);

for my $sub (sort keys %subs) {
        print "$sub => ", $subs{$sub}->(), "\n";
}

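# Loop over a separate variable and assign to $string: the closures in
# %subs captured $string itself, and a foreach that uses $string as its
# loop variable would leave them looking at the original value.
for my $input ("abab", "abab" x 10, ("x" x 1000) . "abab", "x" x 10_000,
               "ab" x 10_000) {
        $string = $input;
        Benchmark::cmpthese -1, \%subs;
}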

$string = "abab" x 100_000;

for my $n (10, 100, 1_000, 10_000) {
        print "\nsubstring of $n sets\n";
        $substring = "abab" x $n;
        Benchmark::cmpthese -1, \%subs;
}
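
One thing to watch when counting with split: without a LIMIT, trailing
empty fields are dropped, so a string made up entirely of matches
produces no fields at all.  A small sketch of the difference (the
array names are mine):

```perl
use strict;
use warnings;

my $string    = "abab" x 3;   # nothing but matches, so every field is empty
my $substring = "abab";

# No LIMIT: trailing empty fields are stripped, which here is all of them.
my @stripped = split /\Q$substring/, $string;

# LIMIT of -1: trailing empty fields are kept.
my @kept = split /\Q$substring/, $string, -1;

print scalar(@stripped), " ", scalar(@kept), "\n";   # prints "0 4"
```

With the fields kept, the match count is the number of fields minus
one, which is 3 here.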


-- 
Chas. Owens
wonkden.net
The most important skill a programmer can have is the ability to read.

