Hi Gary
A simple way would be to put them in a hash before printing them,
using $text as the key and $url as the value. Duplicates will disappear,
with the latest being kept. You could keep the first one instead by
checking whether the key exists before inserting.
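Roughly, the hash idea might look like the sketch below. The @items list
is only stand-in data copied from your sample output (in your script the
pairs would come from the TokeParser loop), and %latest and %first are
names I made up for illustration:

```perl
use strict;
use warnings;

# Stand-in data: (url, headline) pairs as the TokeParser loop would yield them.
my @items = (
    [ '20000913-w4apf/f6590.html', 'Former Dolphins Qb Strock Named Coach' ],
    [ '20000913-w4apf/f6591.html', 'Former Dolphins Qb Strock Named Coach' ],
    [ '20000913-w4apf/k3225.html', 'Illinois Qb An Example For Boller' ],
    [ '20000913-w4apf/k3242.html', 'Cardinals Revenge-Minded' ],
);

my %latest;   # headline => url; a later duplicate overwrites the earlier one
my %first;    # headline => url; only the first occurrence is stored

for my $item (@items) {
    my ($url, $text) = @$item;
    $latest{$text} = $url;
    $first{$text}  = $url unless exists $first{$text};
}

# Hash order is arbitrary, so sort the headlines to get stable output.
for my $text (sort keys %latest) {
    print "$latest{$text}\t$text\n";
}
```

Note that the output order here comes from the sort, not from the order
in which the stories arrived.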
If you need to preserve the order, then you will also need to push them
onto an array, after first checking whether you have done so before (by
looking in the hash). This may get a bit tricky if you are keeping the
latest and want the order of the latest occurrence rather than the
first, though.
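A sketch of both orderings, under the assumption that the pairs are
already collected in a list. The data and the names (@items, %url_for,
@order, @by_first, @by_latest) are invented for illustration. For the
"order of the latest" case, one way around the trickiness is to walk the
list backwards and keep the first headline seen from that end:

```perl
use strict;
use warnings;

# Invented data: "Story One" appears twice, with another story in between.
my @items = (
    [ 'a.html', 'Story One' ],
    [ 'b.html', 'Story Two' ],
    [ 'c.html', 'Story One' ],
);

# Variant 1: latest url, listed in order of FIRST occurrence.
my (%url_for, @order);
for my $item (@items) {
    my ($url, $text) = @$item;
    push @order, $text unless exists $url_for{$text};  # remember position once
    $url_for{$text} = $url;                            # later url overwrites
}
my @by_first = map { "$url_for{$_}\t$_" } @order;

# Variant 2: latest url, listed in order of LATEST occurrence.
# Walking backwards, the first time a headline is seen is its latest copy.
my (%seen, @by_latest);
for my $item (reverse @items) {
    my ($url, $text) = @$item;
    next if $seen{$text}++;                 # skip older duplicates
    unshift @by_latest, "$url\t$text";      # unshift restores forward order
}

print "By first occurrence:\n",  map { "$_\n" } @by_first;
print "By latest occurrence:\n", map { "$_\n" } @by_latest;
```

With this data the first variant lists Story One before Story Two, while
the second lists Story Two first, since the latest copy of Story One
arrived after it.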
It's high time you newspaper boys moved to XML!
hth
charles
On Thu, 14 Sep 2000, Gary Nielson wrote:
> Hi,
>
> I am trying to figure out how to do something and frankly, don't know
> where to begin. I am using the perl module HTML::TokeParser to extract a
> list of urls and headlines. I then get rid of those headlines that are
> garbage, but several times a day the same story comes over with a
> different url but the same headline. I need only the latest version of the
> story, but how do I check for duplicate headlines and get rid of all but
> the first one in the list?
>
> My code thus far is:
>
> $ignoreItems = '^.*Schedule$|Bc-Fbc-|Eds:|^\(|By The';
> use HTML::TokeParser;
> $p = HTML::TokeParser->new(shift||"testwires.htm");
> while (my $token = $p->get_tag("a")) {
> my $url = $token->[1]{href} || "-";
> my $text = $p->get_trimmed_text("/a");
> if ($text =~ $ignoreItems) { print ""; } else { print "$url\t$text\n"; }
> }
>
> which produces something like the following:
>
> 20000913-w4apf/f6590.html Former Dolphins Qb Strock Named Coach
> 20000913-w4apf/f6591.html Former Dolphins Qb Strock Named Coach
> 20000913-w4apf/k3225.html Illinois Qb An Example For Boller
> 20000913-w4apf/k3242.html Cardinals Revenge-Minded
>
> Any advice on how to check the $text variable for the previous entry and
> not print out the $url and $text if the previous entry for $text is the
> same? Any pointers, suggestions appreciated.
>
> Gary
_______________________________________________
Redhat-list mailing list
[EMAIL PROTECTED]
https://listman.redhat.com/mailman/listinfo/redhat-list