Re: parse xml with invalid chars

Chas. Owens Fri, 04 Jun 2010 10:02:39 -0700

On Fri, Jun 4, 2010 at 12:23, Roman Makurin <[email protected]> wrote:
> Hi, here it is http://pastebin.org/307289
>
> On Fri, Jun 04, 2010 at 12:06:24PM -0400, Chas. Owens wrote:
>> On Fri, Jun 4, 2010 at 10:16, Roman Makurin <[email protected]> wrote:
>> > Hi all
>> >
>> > Lately I have had a big problem: I need to parse XML files
>> > that contain invalid XML characters outside of CDATA, and the XML
>> > parser hangs every time on such files. Is there any way
>> > to parse such files?
>> snip
>>
>> Can you give an example of these invalid characters?
>>
>> --
>> Chas. Owens
>> wonkden.net
>> The most important skill a programmer can have is the ability to read.
>
> --
> If you think of MS-DOS as mono, and Windows as stereo,
>  then Linux is Dolby Digital and all the music is free...
>


Given that this is RSS, you should be able to get away with using a
regex to percent-encode the contents of the <link> elements before
parsing, then decoding them again afterwards.  This works for me:

#!/usr/bin/perl

use strict;
use warnings;

use XML::RSS::Parser;
use URI::Escape qw/uri_escape uri_unescape/;

my $filename = shift
        or die "usage: $0 file.rss\n";

# slurp the whole file into one string
my $xml = do {
        open my $fh, "<", $filename
                or die "could not open $filename: $!";
        local $/;
        <$fh>;
};

# percent-encode the contents of every <link> element so that
# characters that are not valid in XML text never reach the parser
$xml =~ s{<link>(.*?)</link>}{"<link>" . uri_escape($1)  . "</link>"}seg;

my $p = XML::RSS::Parser->new
        or die "could not create parser\n";

my $feed = $p->parse_string($xml)
        or die "could not parse $filename: ", $p->errstr, "\n";

for my $item ( $feed->query('//item') ) {
        my $title = $item->query('title')->text_content;
        # undo the percent-encoding to get the original link back
        my $link  = uri_unescape $item->query('link')->text_content;
        printf "%60.60s: %s\n", $title, $link;
}
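
If the invalid characters turn out to be bare ampersands in the URLs
(that is an assumption; nothing in this thread confirms it), a
variation on the same regex idea is to turn each bare & into &amp;
and leave existing entity references alone.  This is only a minimal
sketch: escape_bare_ampersands and fix_links are names made up for
illustration, and the sample link is invented.

```perl
#!/usr/bin/perl

use strict;
use warnings;

# Hypothetical helper: replace every "&" that does not already start
# an entity or character reference (&amp; &lt; &#38; &#x26; ...)
# with "&amp;".
sub escape_bare_ampersands {
        my ($text) = @_;
        $text =~ s/&(?!(?:amp|lt|gt|quot|apos|#[0-9]+|#x[0-9A-Fa-f]+);)/&amp;/g;
        return $text;
}

# Hypothetical helper: apply the fix only inside <link>...</link>,
# leaving the rest of the feed untouched.
sub fix_links {
        my ($xml) = @_;
        $xml =~ s{<link>(.*?)</link>}{"<link>" . escape_bare_ampersands($1) . "</link>"}seg;
        return $xml;
}

# invented sample data, for illustration only
my $sample = "<link>http://example.com/?a=1&b=2&amp;c=3</link>";

print fix_links($sample), "\n";
```

With this variation the parser decodes &amp; back to & itself, so the
uri_unescape step should not be needed, but it only helps when the
ampersand really is the character causing the trouble.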


-- 
Chas. Owens
wonkden.net
The most important skill a programmer can have is the ability to read.

