On Mon, 2001-11-12 at 16:31, Steve Tattersall wrote:
> For example I want to extract the line: (see the html code below)
> GB 0152 MSS.126/NUDL
>
> and also the title which is:
>
> National Union of Dock, Riverside and General Workers in Great
> Britain and Ireland
>
> Does anyone know how to go about this, please? I would be extremely grateful.
We've had a regexp answer, but for readability I'd use the
HTML::TokeParser module. It'd work like this.
use HTML::TokeParser;
# Prep an object. $html contains the html to parse.
my $p = HTML::TokeParser->new( \$html ) or die "Can't create parser: $!";
# Find an <a> tag, and get the text between it and </a>.
my $token = $p->get_tag("a");
my $reference = $p->get_trimmed_text("/a");
# From there, find a </b> tag, and snarf everything up to <br>.
# (Reuse $token; a second "my" would mask the first declaration.)
$token = $p->get_tag("/b");
my $title = $p->get_trimmed_text("br");
You'll have some small tidying up to do on both, but it's a /much/ more
readable (and maintainable) way of parsing the HTML.
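To make that concrete, here's a rough, self-contained sketch you can run as-is. I don't have your actual page in front of me, so the HTML in $html is my guess at the layout (the reference inside an <a>, the title sitting between a </b> and a <br>), and extract_fields is just a name I made up for the example. Adjust the tag names to match the real markup.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical markup: an assumption standing in for the real page.
my $html = <<'HTML';
<a href="#">GB 0152 MSS.126/NUDL</a><br>
<b>Title:</b> National Union of Dock, Riverside and General Workers
in Great Britain and Ireland<br>
HTML

# Returns (reference, title), or an empty list if an expected tag is missing.
sub extract_fields {
    my ($doc) = @_;
    my $p = HTML::TokeParser->new(\$doc) or die "Can't create parser";

    # Skip ahead to the first <a>, then collect the text up to </a>.
    $p->get_tag("a") or return;
    my $reference = $p->get_trimmed_text("/a");

    # Skip ahead to the next </b>, then collect the text up to <br>.
    $p->get_tag("/b") or return;
    my $title = $p->get_trimmed_text("br");

    return ($reference, $title);
}

my ($reference, $title) = extract_fields($html);
print "$reference\n$title\n";
```

get_trimmed_text strips leading and trailing whitespace and squashes internal runs of whitespace (including newlines) down to single spaces, so a title that wraps across lines in the source comes back as one line.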
Hope this helps, (from one Manchester perl bod to another ;-)
~C.
--
$a="printf.net"; Chris Ball | chris@void.$a | www.$a | finger: chris@$a
"In the beginning there was nothing, which exploded."