Hi,
I suck at regex, but I'm getting better. :)
I'm probably reinventing the wheel here, but I tried to get along with
HTML::Parser and just couldn't get it to do anything. Too confusing, I
think.
I simply want to get a list of real words from an HTML string, minus all
the HTML stuff. For example:
$a = 'This is a line of HTML:people write strange things here<br>
and hardly ever follow proper<p>
syntax A&B suck at spelling as well<br>
So I need to clean it up and strip out all<br>
words less then 3 characters in length.<p>
Later the words will go into an indexer for<br>
searching a database';
$a =~ s/<[^>]*>//gs;    # strip anything that looks like a tag
$a =~ s/&amp;/&/gs;     # probably need to add more like this
@data = split (/ /,$a);
foreach $b (@data) {
    foreach $b (split (/\n/,$b)){
        foreach $b (split (/:/,$b)){
            $b =~ s/^\s+//;
            $b =~ s/\s+$//;
            $b =~ s/\n//g;
            $b =~ s/\r//g;          # drop stray carriage returns
            $b =~ s/[,.;?-]//gs;    # '-' goes last so it is a literal, not a range
            if (length($b) >= 3){   # keep words of 3 or more characters
                print "D$b\n";
            }
        }
    }
}
Is there a better, maybe more elegant, way to do this? I don't mind
using HTML::Parser if I could only figure out how.
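For comparison, here is a shorter variant I have been toying with, as a rough sketch only. It strips the tags, decodes the one entity, and then pulls out runs of letters with a single match instead of the nested splits. The sub name words_from_html, the choice to turn each tag into a space, and the "letters and apostrophes" idea of a word are all my own assumptions, and it still does nothing about malformed tags or other entities.

```perl
use strict;
use warnings;

# Hypothetical helper: return the words of 3 or more characters in an HTML string.
sub words_from_html {
    my ($html) = @_;
    $html =~ s/<[^>]*>/ /g;    # turn each tag into a space so neighbouring words stay apart
    $html =~ s/&amp;/&/g;      # only this one entity is handled; others would need adding
    my @words = $html =~ /[A-Za-z']+/g;    # assumption: a word is a run of letters or apostrophes
    return grep { length($_) >= 3 } @words;
}

my $html = "This is a line of HTML:people write<br>\nstrange A&amp;B things<p>";
print "$_\n" for words_from_html($html);
```

With that sample string it should print This, line, HTML, people, write, strange and things, one per line. Digits and anything else outside the character class get dropped, which may or may not be what an indexer wants.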
Cheers.
--
Scott