Re: trying to understand HTML::TreeBuilder::XPath

Rob Dixon Tue, 29 Jan 2013 06:11:37 -0800

On 26/01/2013 20:44, Jeswin wrote:
> Hi,
> I'm trying to parse out the email addresses from a webpage and I'm
> using the HTML::TreeBuilder::XPath module. I don't really understand
> XML and it's been a while since I worked with perl*. So far I mashed
> up a code by looking through past examples online. The HTML portion
> for the email is like:
> 
> <li class="ii">Email: <a href="mailto:[email protected]">[email protected]</a></li>
> 
> The code I put together is:
> 
> #!/usr/bin/perl
> use strict;
> use warnings;
> 
> use HTML::TreeBuilder::XPath;
> 
> my $html  = HTML::TreeBuilder::XPath->new;
> my $root  = $html->parse_file( 'file.htm' );
> 
> my @email = $root ->findnodes(q{//a} );
> 
> for my $email(@email) {
> 
> print $email->attr('href');
> }
> 
> The problem is that it also outputs the link found in another portion
> of the HTML (<a href="http://sites.place.yyy/name">). So I get a list
> of websites and emails, one after another. How can I just output the
> email section?
> 
> I also don't understand how the path for "findnodes(q{//a} )" works.
> What's the "q" for? How do I understand the structure of nodes?
> 
> Thanks for any advice,
> JJ
> 
> *I'm not a programmer; I have a list to compile for work and thought I
> might automate it to make my life easier.


Hi Jeswin

q{...} is just another way of writing single quotes. '//a' will do just
fine.
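
As a quick illustration (the variable names here are just made up for the example), the two quoting forms give exactly the same string, and q{...} mostly earns its keep when the string itself contains single quotes:

```perl
use strict;
use warnings;

# q{...} is the generic single-quote operator: nothing is interpolated inside it.
my $with_q      = q{//a};
my $with_quotes = '//a';

print "same string\n" if $with_q eq $with_quotes;

# Handy when the XPath expression contains single quotes of its own.
my $xpath = q{//li[@class='ii']};
print "$xpath\n";
```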

The // is XPath shorthand for "any descendant", so '//a' finds every <a>
element beneath the root of the document, however deeply it is nested.

I'm not sure what you mean by "How do I understand the structure of
nodes?" Do you know any HTML? If not then this is going to be very
difficult.

Since you're using HTML::TreeBuilder::XPath there are some easier
options open to you. You can write

    my $html  = HTML::TreeBuilder::XPath->new_from_file('file.htm');

    my @links = $html->findnodes_as_strings('//@href[starts-with(., "mailto:")]');

which will find all the href="..." attributes that start with 'mailto:'
and put their values into @links. Then you can print them all out using

    print "$_\n" for @links;
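
If it helps to try this without your real page, here is a minimal self-contained sketch. The markup and the example.com addresses are invented for the demonstration, and it parses a string with new_from_content rather than reading a file:

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Sample markup invented for illustration; the addresses are placeholders.
my $content = <<'HTML';
<html><body><ul>
<li class="ii">Email: <a href="mailto:someone@example.com">someone@example.com</a></li>
<li class="ii">Site: <a href="http://www.example.com/name">home page</a></li>
</ul></body></html>
HTML

my $html = HTML::TreeBuilder::XPath->new_from_content($content);

# Select only the href attributes whose value begins with "mailto:".
my @links = $html->findnodes_as_strings('//@href[starts-with(., "mailto:")]');

print "$_\n" for @links;

$html->delete;    # free the parse tree once finished with it
```

The http:// link is skipped because the predicate in square brackets filters on the attribute's value.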

If you want to go further and remove the 'mailto:' from the beginning,
then it's just

    for (@links) {
        my $mail = s/^mailto://r;
        print "$mail\n";
    }
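
One caveat: the /r flag on s/// only exists in Perl 5.14 and later. On an older perl, a rough equivalent is to copy the string first and edit the copy, which leaves @links untouched in the same way. The list contents and the @addresses name below are placeholders for the example:

```perl
use strict;
use warnings;

my @links = ('mailto:someone@example.com');    # placeholder data

my @addresses;
for my $link (@links) {
    # Copy first, then substitute on the copy; $link itself is not changed.
    (my $mail = $link) =~ s/^mailto://;
    push @addresses, $mail;
}

print "$_\n" for @addresses;
```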

HTH,

Rob
