On Tue, Jul 01, 2003 at 06:26:08AM +0200, Janek Schleicher wrote:
> B. Fongo wrote at Mon, 30 Jun 2003 23:46:19 +0200:
> 
> > Is there any other way to check array uniqueness except the Module
> > Array::Unique?
> > I'm having problems in installing these two Modules: Array::Unique and
> > Tie.  So I'm thinking of another way to check array for replicates.
> > 
> > I help will be appreciate. Thanks a lot!
> 
> You can also hardcode it e.g.
> 
> {
>    my %seen;
>    if (grep {$seen{$_}} @array) {
>       # at least one element is twice
>    } else {
>       # unique
>    }
> }
> 
> 
> 
> Greetings,
> Janek


Umm...this doesn't actually work.

Here are two versions that do work:

#!/usr/bin/perl 

use Benchmark;

my @array = qw( bar baz foo);

timethese(1_000_000, 
          {
              'short_circuit' => \&short_circuit,
              'keys_compare'  => \&keys_compare,
          }
);

sub short_circuit {
    # return 1 if dupe, 0 if unique

    my %seen;
    my $is_dupe = 0;
    for(@array) {
        $seen{$_}++; 
        if ($seen{$_} > 1) {
            $is_dupe++; 
            last;
        }
    } 
    return $is_dupe;
}

sub keys_compare {
    # return 1 if dupe, 0 if unique

    my %seen;
    $seen{$_} = 1 for(@array) ;

    if (scalar keys %seen != scalar @array) {
        return 1;
    } else {
        return 0;
    }
}
__END__

Out of curiousity, I benchmarked them a couple times.  Here are the
results (note that I'm only testing with 4 elements; the timings could
change greatly with large datasets):

The executive summary is that short_circuit has a significantly better
best case, a worse worst case, and considerably more variability than
keys_compare.  So, you're probably better off using keys_compare.



Case #1
(  my @array = qw( foo bar baz jaz);  )


Benchmark: timing 1000000 iterations of keys_compare, short_circuit...

keys_compare: 19 wallclock secs (17.77 usr +  0.01 sys = 17.77 CPU) @
56263.74/s (n=1000000)

short_circuit: 19 wallclock secs (19.31 usr +  0.01 sys = 19.32 CPU) @
51759.00/s (n=1000000)

With a unique array of 4 elements, they run at identical wallclock
speeds.  keys_compare wins on CPU time, though.

---------------

Case #2
(  my @array = qw( foo bar baz foo);  )

Benchmark: timing 1000000 iterations of keys_compare, short_circuit...

keys_compare: 16 wallclock secs (15.98 usr +  0.00 sys = 15.98 CPU) @
62561.09/s (n=1000000)

short_circuit: 20 wallclock secs (16.85 usr +  0.00 sys = 16.85 CPU) @
59341.68/s (n=1000000)

This one is the worst-case for short_circuit; the duplicate is at the
far end of the array.  In this case, it runs slower than keys_compare
by 25%.  Counterintuitively, the gap between CPU times has gotten smaller.

---------------

Case #3
(  my @array = qw( foo foo bar baz );  )

Benchmark: timing 1000000 iterations of keys_compare, short_circuit...

keys_compare: 17 wallclock secs (15.94 usr +  0.00 sys = 15.94 CPU) @
62745.10/s (n=1000000)

short_circuit: 10 wallclock secs (10.77 usr +  0.00 sys = 10.77 CPU) @
92888.24/s (n=1000000)

This is the ideal case for short_circuit; the duplicate is immediate
so that it can take advantage of its optimization early on.  In this
case, it wins hands-down, both in wallclock and CPU time.


The really interesting part, to me, is that keys_compare has extremely
stable performance...averaging 17.3333 wallclock seconds and never
straying far from that.  short_circuit, OTOH, varies from 10 to 20
seconds depending on its data...and that was with a tiny data
set. (OTGH, I'm testing with a trivial dataset and only performed
three tests, so my results probably aren't very authoritative.)

In any case, these performance numbers surprise me, because both
algorithms should be dominated by the O(n) step (initializing the
hash), so should run about the same speed except for when
short_circuit gets to quit early.  

Anyone have any thoughts on this?  (I suppose I should deparse it to
see what's really going on; that might answer my question.)


--Dks

-- 
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Reply via email to