Hi,
Please check my reply below.
On Fri, May 3, 2013 at 12:59 PM, Edward and Erica Heim <
[email protected]> wrote:
> Hi all,
>
> I'm using LWP::UserAgent to access a website. One of the methods returns
> HTML data e.g.
>
> my $data = $response->content;
>
> I.e. $data contains the HTML content. I want to be able to parse it line
> by line e.g.
>
> foreach (split /pattern/, $data) {
> my $line = $_;
> .....
>
> If I print $data, I can see the individual lines of the HTML data but I'm
> not clear on the "pattern" that I should use in split or if there is a
> better way to do this.
>
What exactly are you splitting, and what pattern are you using?
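If the goal is only to walk through the content one line at a time, the pattern is just the line terminator. Here is a minimal sketch, assuming the content uses LF or CRLF line endings. The sample string in $data is made up for illustration and stands in for what $response->content would return:

```perl
#!/usr/bin/perl
use strict;
use warnings;

## hypothetical sample standing in for the fetched HTML content
my $data = "<html>\r\n<body>\n<p>Hello</p>\n</body>\n</html>\n";

## split on LF, tolerating an optional CR before it (CRLF line endings)
my @lines = split /\r?\n/, $data;

## print each line with its number
my $n = 0;
foreach my $line (@lines) {
    $n++;
    print "$n: $line\n";
}
```

Keep in mind that line breaks in HTML carry no structural meaning, so this only shows you the text as it was laid out in the source, not the elements themselves.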
> I understand that there are packages to parse HTML code but this is also a
> learning exercise for me.
>
Please don't parse HTML with regular expressions. It's not that it can't be
done or hasn't been done, but it is an exercise in futility. Instead, learn
modules such as HTML::TreeBuilder and the others on CPAN that can help you do
what you want.
Secondly, parse the content first, before doing any "splitting".
For example, say you want to fetch http://www.perl.org and print the trimmed
text of that page. You could do it like so:
[CODE]
#!/usr/bin/perl
use warnings;
use strict;
use LWP::UserAgent;
use HTML::TreeBuilder 5 -weak;
## url to get
my $url = 'http://www.perl.org';
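## get the file, and stop with the HTTP status if the request failed
my $ua = LWP::UserAgent->new;
my $resp = $ua->get($url);
die "Can't get $url: ", $resp->status_line, "\n" unless $resp->is_success;
## parse the HTML file
my $tree = HTML::TreeBuilder->new;
## parse_content() parses the whole string and signals end of input
$tree->parse_content( $resp->decoded_content );
print $tree->as_trimmed_text, "\n";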
[/CODE]
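Once the content is parsed, you work on elements rather than on lines. As a rough sketch of that idea, the following collects every link in a document with look_down. The markup in $html is a made-up sample so that the script runs without a network connection, and it assumes HTML::TreeBuilder version 5 or later is installed:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder 5 -weak;

## hypothetical sample markup, used here instead of a live request
my $html = <<'HTML';
<html><body>
<h1>Links</h1>
<p><a href="http://www.perl.org">Perl</a> and
<a href="http://www.cpan.org">CPAN</a></p>
</body></html>
HTML

## parse the whole string in one go
my $tree = HTML::TreeBuilder->new;
$tree->parse_content($html);

## collect [href, text] for every <a> element in the tree
my @links = map { [ $_->attr('href'), $_->as_trimmed_text ] }
            $tree->look_down( _tag => 'a' );

## print one link per line: URL, a tab, then the link text
print "$_->[0]\t$_->[1]\n" for @links;
```

The same look_down call accepts other tag names or attribute criteria, so you can pick out just the parts of the page you care about instead of matching patterns against raw text.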
Hope this helps.
>
> Thanks in advance, Edward
--
Tim