On 26/01/2013 20:44, Jeswin wrote:
> Hi,
> I'm trying to parse out the email addresses from a webpage and I'm
> using the HTML::TreeBuilder::XPath module. I don't really understand
> XML and it's been a while since I worked with perl*. So far I mashed
> up a code by looking through past examples online. The HTML portion
> for the email is like:
>
> <li class="ii">Email: <a href="mailto:[email protected]">[email protected]</a></li>
>
> The code I put together is:
>
> #!/usr/bin/perl
> use strict;
> use warnings;
>
> use HTML::TreeBuilder::XPath;
>
> my $html = HTML::TreeBuilder::XPath->new;
> my $root = $html->parse_file( 'file.htm' );
>
> my @email = $root ->findnodes(q{//a} );
>
> for my $email(@email) {
>
> print $email->attr('href');
> }
>
> The problem is that it also outputs the link found in another portion
> of the HTML ( <a href="http://sites.place.yyy/name">). So I get a list
> of websites and emails, one after another. How can I just output the
> email section?
>
> I also don't understand how the path for "findnodes(q{//a} )" works.
> What's the "q" for? How do I understand the structure of nodes?
>
> Thanks for any advice,
> JJ
>
> *I'm not a programmer; I have a list to compile for work and thought I
> might automate it to make my life easier.
Hi Jeswin,

q{...} is just another way of writing single quotes. '//a' will do just
fine.
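
A minimal sketch of that equivalence, using only core Perl. The variable names are made up for illustration, and the second XPath string is just an example of a string that contains single quotes of its own.

```perl
use strict;
use warnings;

# q{...} and '...' are both non-interpolating string literals.
my $single  = '//a';
my $q_brace = q{//a};

print "same string\n" if $single eq $q_brace;

# q{} is mostly handy when the string itself contains single quotes,
# because nothing inside it needs escaping.
my $xpath = q{//li[@class='ii']};
print "$xpath\n";
```

Note that @class is not interpolated inside q{}, just as it would not be inside single quotes.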

The // means "descendant at any depth", so '//a' finds every <a> element
anywhere beneath the root of the document.
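
As a rough illustration of how such paths select nodes, here is a sketch that parses a small inline page instead of file.htm. The page content, the address someone@example.com and the variable names are placeholders, not taken from the real page.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical sample page; the address and link text are placeholders.
my $tree = HTML::TreeBuilder::XPath->new_from_content(<<'HTML');
<html><body>
<ul>
<li class="ii">Email: <a href="mailto:someone@example.com">someone@example.com</a></li>
<li class="ii">Site: <a href="http://sites.place.yyy/name">Home page</a></li>
</ul>
</body></html>
HTML

# '//a' matches <a> elements at any depth.
my @all_links = $tree->findnodes('//a');

# A path without '//' must spell out every step from the root.
my @list_links = $tree->findnodes('/html/body/ul/li/a');

# A predicate in [...] keeps only the nodes that satisfy a condition.
my @mail_links = $tree->findnodes('//a[starts-with(@href, "mailto:")]');

print scalar(@all_links), " links in total\n";
print $_->attr('href'), "\n" for @mail_links;
```

The last query is one way to keep the original findnodes/attr approach while skipping the website links.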

I'm not sure what you mean by "How do I understand the structure of
nodes?" Do you know any HTML? If not then this is going to be very
difficult.

Since you're using HTML::TreeBuilder::XPath, there are some easier
options open to you. You can write

    my $html = HTML::TreeBuilder::XPath->new_from_file('file.htm');
    my @links = $html->findnodes_as_strings('//@href[starts-with(., "mailto:")]');

which will find all the href="..." attributes that start with 'mailto:'
and put their values into @links. Then you can print them all out using

    print "$_\n" for @links;
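
One thing to be aware of: '//@href' looks at href attributes on every element, not only on <a>. If that matters for your page, a possible variation is to name the element in the path. The sketch below assumes a made-up page in which a <link> element also carries a mailto: address; both addresses are placeholders.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical page: the <link> in the head also has a mailto: href.
my $html = HTML::TreeBuilder::XPath->new_from_content(<<'HTML');
<html>
<head><link rel="author" href="mailto:webmaster@example.com"></head>
<body>
<ul><li class="ii">Email: <a href="mailto:someone@example.com">someone@example.com</a></li></ul>
</body>
</html>
HTML

# Every mailto: href in the document, whatever element carries it.
my @any_href = $html->findnodes_as_strings('//@href[starts-with(., "mailto:")]');

# Only the mailto: hrefs that sit on <a> elements.
my @a_href = $html->findnodes_as_strings('//a/@href[starts-with(., "mailto:")]');

print "any element: $_\n" for @any_href;
print "<a> only:    $_\n" for @a_href;
```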

If you want to go further and remove the 'mailto:' from the beginning,
then it's just

    for (@links) {
        my $mail = s/^mailto://r;
        print "$mail\n";
    }

(the /r flag needs Perl 5.14 or later).
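
If your perl is older than 5.14, the same result can be had by copying the string first and substituting on the copy. A small sketch with placeholder addresses standing in for the values the XPath query would return:

```perl
use 5.014;    # needed only for the /r line below
use strict;
use warnings;

# Placeholder data standing in for the values found by the XPath query.
my @links = ('mailto:someone@example.com', 'mailto:other@example.org');

# Perl 5.14+: /r returns the changed copy and leaves the original alone.
my @with_r = map { s/^mailto://r } @links;

# Works on older perls too when used on its own:
# copy first, then substitute on the copy.
my @portable;
for my $link (@links) {
    (my $address = $link) =~ s/^mailto://;
    push @portable, $address;
}

print "$_\n" for @portable;
```

Either way @links itself is left unchanged, so the original 'mailto:' values are still available afterwards.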

HTH,
Rob