On May 3, 2013, at 4:59 AM, Edward and Erica Heim wrote:
> Hi all,
>
> I'm using LWP::UserAgent to access a website. One of the methods returns
> HTML data e.g.
>
> my $data = $response->content;
>
> I.e. $data contains the HTML content. I want to be able to parse it line by
> line e.g.
>
> foreach (split /pattern/, $data) {
> my $line = $_;
> ......
>
> If I print $data, I can see the individual lines of the HTML data but I'm not
> clear on the "pattern" that I should use in split or if there is a better way
> to do this.
If the lines are separated by newlines ("\n"), then the pattern is /\n/:
for my $line ( split(/\n/,$data) ) {
…
The lines could also end with carriage return + line feed, which is /\r\n/
(CR first, then LF). The pattern /\r?\n/ handles both endings and keeps blank
lines as empty strings. The pattern /[\r\n]+/ also handles both, but it gobbles
up blank lines, because a run of successive line-ending characters is treated
as a single separator.
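As a minimal sketch of the line-by-line approach, assuming the data may mix
"\n" and "\r\n" endings: the pattern /\r?\n/ splits on either ending and keeps
blank lines as empty strings. The sample string below is made up and only
stands in for $response->content.

```perl
use strict;
use warnings;

# Made-up sample standing in for $response->content; it mixes CRLF and LF
# endings and contains one blank line.
my $data = "<html>\r\n<body>\n\n<p>Hello</p>\r\n</body>\n</html>\n";

# /\r?\n/ matches a single LF, optionally preceded by a CR, so each line
# ending is one separator and blank lines come back as empty strings.
# split drops trailing empty fields, so the final "\n" adds no extra line.
my @lines = split /\r?\n/, $data;

my $n = 0;
for my $line (@lines) {
    $n++;
    printf "%2d: [%s]\n", $n, $line;
}
```

Keep in mind that HTML does not care about line breaks, so a tag can span
several lines; splitting on lines is fine for a learning exercise but fragile
for real parsing.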
>
> I understand that there are packages to parse HTML code but this is also a
> learning exercise for me.
>
I am currently using HTML::TokeParser to parse HTML files. It is pretty easy to
use:
use HTML::TokeParser;
…
# assuming $data contains the HTML text to be parsed
my $parser = HTML::TokeParser->new(\$data);
while ( my $token = $parser->get_token() ) {
    my $type = $token->[0];
    if ( $type eq 'S' ) {
        my $tag = $token->[1];
        print "Start of tag $tag\n";
    } elsif ( $type eq 'E' ) {
        print "End of tag $token->[1]\n";
    } elsif ( $type eq 'T' ) {
        my $text = $token->[1];
        print "Text: $text\n";
    } elsif ( $type eq 'C' ) {
        print "Comment: $token->[1]\n";
    } elsif ( $type eq 'D' ) {
        print "Declaration: $token->[1]\n";
    } else {
        # e.g. 'PI' (processing instruction) tokens end up here
        print "Unknown type $type!!!\n";
    }
}
See 'perldoc HTML::TokeParser' for details.
There are lots of other parsers out there. Some have special uses, like
HTML::LinkExtor for extracting links, and HTML::TableExtract for extracting
information from HTML tables. Some modules, like HTML::TreeBuilder, build an
in-memory model of the HTML page that you can traverse or search for
information.
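As a rough sketch of one of the special-purpose parsers, here is how
HTML::LinkExtor might be used to list the links in a page. The HTML string is
made up for illustration, and the @found array is just a hypothetical place to
collect the results. With no callback passed to new(), the module collects the
links itself and returns them from links().

```perl
use strict;
use warnings;
use HTML::LinkExtor;

# Made-up HTML standing in for $response->content.
my $data = '<html><body><p>See <a href="http://learn.perl.org/">Learn Perl</a>'
         . ' <img src="logo.png"> and <a href="/faq">the FAQ</a>.</p></body></html>';

# No callback given, so the parser accumulates links internally.
my $extor = HTML::LinkExtor->new;
$extor->parse($data);
$extor->eof;

# Each link is an array ref: [ $tag, $attr_name => $url, ... ].
my @found;
for my $link ( $extor->links ) {
    my ( $tag, %attr ) = @$link;
    for my $name ( sort keys %attr ) {
        push @found, [ $tag, $name, $attr{$name} ];
        print "$tag $name=$attr{$name}\n";
    }
}
```

No base URL is passed here, so relative links such as /faq are reported as
they appear in the page.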
Good luck.