David Storrs <[EMAIL PROTECTED]> writes:

<snip>

> Here are two versions that do work:
> 
> #!/usr/bin/perl 
> 
> use Benchmark;
> 
> my @array = qw( bar baz foo);
> 
> timethese(1_000_000, 
>         {
>             'short_circuit' => \&short_circuit,
>             'keys_compare'  => \&keys_compare,
>         }
> );
> 
> sub short_circuit {
>     # return 1 if dupe, 0 if unique
> 
>     my %seen;
>     my $is_dupe = 0;
>     for(@array) {
>       $seen{$_}++; 
>       if ($seen{$_} > 1) {
>           $is_dupe++; 
>           last;
>       }
>     } 
>     return $is_dupe;
> }
> 
> sub keys_compare {
>     # return 1 if dupe, 0 if unique
> 
>     my %seen;
>     $seen{$_} = 1 for(@array) ;
> 
>     if (scalar keys %seen != scalar @array) {
>       return 1;
>     } else {
>       return 0;
>     }
> }
> __END__
> 
> Out of curiousity, I benchmarked them a couple times.  Here are the
> results (note that I'm only testing with 4 elements; the timings could
> change greatly with large datasets):
> 
> The executive summary is that short_circuit has a significantly better
> best case, a worse worst case, and considerably more variability than
> keys_compare.  So, you're probably better off using keys_compare.

keys_compare might be slightly better for small sets, but for large
sets, short_circuit is the way to go.

I initialized my @array like so:

my @array;

for (1 .. 5000) {
  push @array, int((rand) * 10000);
}


And I only ran 10,000 iterations, since I wanted it to finish before I
got bored with the whole thing.

Results:

Benchmark: timing 10000 iterations of grep_compare, keys_compare, short_circuit...

grep_compare: 140 wallclock secs (136.80 usr +  0.37 sys = 137.17 CPU) @ 72.90/s 
(n=10000)

keys_compare: 127 wallclock secs (126.35 usr +  0.21 sys = 126.56 CPU) @ 79.01/s 
(n=10000)

short_circuit:  5 wallclock secs ( 4.67 usr +  0.01 sys =  4.68 CPU) @ 2136.75/s 
(n=10000)


As you can see, grep_compare performs somewhat worse than
keys_compare, and short_circuit cleans up.

Of course, short_circuit won't perform quite as well as the other two
when dealing with arrays that are usually unique, but as far as I can
tell, short_circuit's worst case should be about the same as the
normal case for the other two algorithms.

-RN

-- 
Robin Norwood
Red Hat, Inc.

"The Sage does nothing, yet nothing remains undone."
-Lao Tzu, Te Tao Ching

-- 
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Reply via email to