How to delete duplicate headlines in perl with HTML::TokeParser

Gary Nielson Thu, 14 Sep 2000 15:23:54 -0700

Hi,

I am trying to figure out how to do something and, frankly, I don't know
where to begin. I am using the Perl module HTML::TokeParser to extract a
list of URLs and headlines. I then get rid of the headlines that are
garbage, but several times a day the same story comes over with a
different URL but the same headline. I need only the latest version of the
story, but how do I check for duplicate headlines and get rid of all but
the first one in the list?

My code thus far is:

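use strict;
use warnings;
use HTML::TokeParser;

# Headlines matching this pattern are garbage and get skipped.
my $ignoreItems = '^.*Schedule$|Bc-Fbc-|Eds:|^\(|By The';
my $p = HTML::TokeParser->new(shift || "testwires.htm")
    or die "Cannot open input file: $!";
while (my $token = $p->get_tag("a")) {
   my $url = $token->[1]{href} || "-";       # href attribute of the <a> tag
   my $text = $p->get_trimmed_text("/a");    # link text up to the closing </a>
   next if $text =~ /$ignoreItems/;          # skip garbage headlines
   print "$url\t$text\n";
}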

which produces something like the following:

20000913-w4apf/f6590.html       Former Dolphins Qb Strock Named Coach
20000913-w4apf/f6591.html       Former Dolphins Qb Strock Named Coach
20000913-w4apf/k3225.html       Illinois Qb An Example For Boller
20000913-w4apf/k3242.html       Cardinals Revenge-Minded

Any advice on how to check the $text variable against the previous entry
and not print out the $url and $text if the previous entry for $text is the
same? Any pointers or suggestions appreciated.
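
For reference, one common way to approach this kind of check is to record
each headline in a hash as it is seen and skip any headline that is already
there. The following is only a minimal sketch under stated assumptions, not
a tested fix for the script above: dedupe_by_headline is a hypothetical
helper name, and the sample data is simply the output listed earlier.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of HTML::TokeParser): given a list of
# [url, headline] pairs, keep only the first pair seen for each headline.
sub dedupe_by_headline {
    my @items = @_;
    my %seen;    # headline => number of times it has been seen so far
    return grep { !$seen{ $_->[1] }++ } @items;
}

# Sample data copied from the output shown above.
my @items = (
    [ '20000913-w4apf/f6590.html', 'Former Dolphins Qb Strock Named Coach' ],
    [ '20000913-w4apf/f6591.html', 'Former Dolphins Qb Strock Named Coach' ],
    [ '20000913-w4apf/k3225.html', 'Illinois Qb An Example For Boller' ],
    [ '20000913-w4apf/k3242.html', 'Cardinals Revenge-Minded' ],
);

# Print the surviving entries in the same tab-separated format.
print join("\t", @$_), "\n" for dedupe_by_headline(@items);
```

Inside the existing while loop, the same idea would presumably reduce to
declaring "my %seen;" before the loop and adding "next if $seen{$text}++;"
ahead of the print. If the last occurrence is the one wanted instead of the
first, the list could be reversed before and after the grep.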

Gary




_______________________________________________
Redhat-list mailing list
[EMAIL PROTECTED]
https://listman.redhat.com/mailman/listinfo/redhat-list
