Matthew Sacks wrote:
> I am unabashedly posting a quiz question I have about regular expressions:
>
> Looking for suggestions.
>
> I am thinking
>
> 1) make the set of regular expressions into one big expression?
> 2) search the seach strings, somehow, for common substrings. "acme.org"
> would be example. Each hit on acme.org would indicate a match on one of the
> original search strings?
> comments invited.
>
>
> You have 100,000 strings which are regex patterns
> intended to match URLs (including hostname and possibly a URI). When you
> receive a message, you need to see if it matches at least one of these
> patterns.
>
> The naive approach would be to go through your list of
> patterns linearly and attempt a regex match on each one. Suggest an
> alternative that would be more
> efficient.
Trade memory for speed.
Below is my solution: create a hash of hashes of hashes, etc. Each
level has one letter in the pattern or a wildcard marker or a successful
match marker. For each string, break it into characters and do a
breath-wide search down through the tree. If at the end of the string,
there is a match amrker, the string is matched.
Known bugs: foo***bar will fail if the number of characters between foo
and bar is less than 3.
To do: record the pattern in the hash so if a match occurs, you know
what pattern matched.
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
# Make Data::Dumper pretty
$Data::Dumper::Sortkeys = 1;
$Data::Dumper::Indent = 1;
# Set maximum depth for Data::Dumper, zero means unlimited
$Data::Dumper::Maxdepth = 0;
binmode STDOUT, ':utf8';
my %patterns = ();
print "\nPatterns\n";
while( <DATA> ){
last if m{ __DATA__ }msx;
print;
chomp;
my $p = \%patterns;
while( m{ ( \\? . ) }gmsx ){
my $c = $1;
$c = '-wildcard' if $c eq '*';
$c = $1 if $c =~ m{ \\ (.) }msx;
$p->{$c} = {} unless $p->{$c};
$p = $p->{$c};
# print "\x{ab}$c\x{bb} ";
}
$p->{-matches} = 1;
# print "\n";
}
# print 'patterns = ', Dumper \%patterns;
$Data::Dumper::Maxdepth = 3;
print "\nStrings\n";
while( <DATA> ){
chomp;
my @patterns = ( [ \%patterns, 0 ] );
for my $c ( split // ){
# print "\x{ab}$c\x{bb} ", Dumper \...@patterns; <STDIN>;
my @next_patterns = ();
for my $set ( @patterns ){
push @next_patterns, $set if $set->[1];
my $p = $set->[0];
if( exists $p->{$c} ){
push @next_patterns, [ $p->{$c}, 0 ];
}
if( exists $p->{-wildcard} ){
push @next_patterns, [ $p->{-wildcard}, 1 ];
}
}
@patterns = @next_patterns;
}
# print "\n";
my $matches = 0;
for my $p ( @patterns ){
# print 'checking ', Dumper $p; <STDIN>;
if( exists $p->[0]{-matches} || exists $p->[0]{-wildcard}{-matches} ){
print "$_ matches\n";
$matches = 1;
last;
}
}
print "$_ no match\n" unless $matches;
}
__DATA__
*amazon.com*
*craigslist.*
acme.org*sub=7*
a1.vcl.com*
ebay.com/sports*
foo\*bar
foo
food
foo\*blah
__DATA__
www.amazon.com?x=123
books.amazon.com?y=123
acme.org
acme.org/stuff?sub=7
acme.org?w=1&sub=7
vcl.com/suba/subb?qs=abc
a1.vcl.com?w=3
a2.vcl.com/xyz?w=1
--
Just my 0.00000002 million dollars worth,
Shawn
Programming is as much about organization and communication
as it is about coding.
I like Perl; it's the only language where you can bless your
thingy.
--
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
http://learn.perl.org/