David Eason wrote:
>
> John W. Krahn wrote:
> > According to HTML::Entities
> >
> > # Some extra Latin 1 chars that are listed in the HTML3.2 draft (21-May-96)
> > copy => '©', # copyright sign
> > reg => '®', # registered sign
> > nbsp => "\240", # non breaking space
>
> Thanks, John, I had no idea where to look. I didn't know a non-breaking
> space was an actual character, I thought it was just a directive to the
> browser.
AFAIK it is.
> I have corrected the code below accordingly and it prints "line
> 1line 3" as desired.
FWIW on my computer "\240" prints a "space". :-)
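A quick way to check that "\240" really is a separate character from an ordinary space, even when the terminal draws both the same way. This is only a sketch using core Perl; the variable name $nbsp is mine.

```perl
use strict;
use warnings;

# "\240" is an octal escape: 0240 octal == 160 decimal == 0xA0 hex.
my $nbsp = "\240";

printf "code point: %d (0x%X)\n", ord($nbsp), ord($nbsp);
print "same as \\xA0\n"              if $nbsp eq "\xA0";
print "differs from a plain space\n" if $nbsp ne " ";
```

An ordinary space is code point 32, so the two only look alike on screen.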
> use strict;
> use warnings;
> use HTML::TokeParser;
>
> my $p = HTML::TokeParser->new(*DATA) or die "Can't open: $!";
> while (my $tag = $p->get_tag())
> {
> if ($tag->[0] eq "dd")
> {
> my $text = $p->get_trimmed_text();
> $text =~ s/^[\s\240]*(.*?)[\s\240]*$/$1/;
If you are going to do that then you might as well call get_text and do
all the trimming yourself.
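my $text = $p->get_text();
for ( $text ) {
    s/^[\s\240]+//;     # strip leading whitespace and non-breaking spaces
    s/[\s\240]+$//;     # strip trailing whitespace and non-breaking spaces
    s/[\s\240]+/ /g;    # collapse each remaining run to a single space
}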
> print "$text";
> }
> }
>
> __DATA__
>
> <DD>line 1</DD>
> <DD> </DD>
> <DD>line 3</DD>
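If you want to try the trimming on its own, without the parser, something like the following should do. It is a minimal sketch: trim_nbsp is a hypothetical helper name, not part of HTML::TokeParser, and the sample strings are made-up stand-ins for what get_text might return.

```perl
use strict;
use warnings;

# Hypothetical helper (not from HTML::TokeParser): strip leading and trailing
# whitespace and non-breaking spaces, then collapse inner runs to one space.
sub trim_nbsp {
    my ($text) = @_;
    for ($text) {
        s/^[\s\240]+//;
        s/[\s\240]+$//;
        s/[\s\240]+/ /g;
    }
    return $text;
}

# Assumed sample values, not real parser output.
my @samples = ( "  line 1\n", "\240", "line\240\240 3 " );
print "[", trim_nbsp($_), "]\n" for @samples;
```

This prints [line 1], [] and [line 3], one per line. The "\240" is listed explicitly in the character class because \s does not reliably match it for plain byte strings.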
John
--
use Perl;
program
fulfillment