Dermot Paikkos wrote:
> Hi,
>
> I am trying to parse the data out of an XML file. The file is below.
> Most of the data is easily grabbed but the keywords stretch over
> several lines and there can be anywhere between 0 and 20 entries. I
> have tried using /m and /s but these don't seem to work. I have set
> $/="<image>", I don't know if this is impacting on my attempts. But
> changing it doesn't help either.
>
> Here is what I am using at the moment:
> ==============
> my $datafile = "news.xml";
> open(FH,$datafile)|| die "Can't open $datafile: $!\n";
> while (defined($i=<FH>)) {
> $/ = </image>;
> if ( $i =~ /\?xml version*/ ) {
> next;
> }
> (my $splnum) = ($i =~ /<image number=.(\w\d+\/\d+)/i);
> (my $title) = ($i =~ /<title>(.*)<\/title>/);
> (my $date ) = ($i =~ /<date>(.*)<\/date>/);
> (my $credit) = ($i =~ /<credit>(.*)<\/credit>/);
> (my $caption) = ($i =~ /<caption>(.*)<\/caption>/);
> (my $keywords) = ($i =~ /<keyword>(.*)<\/keyword>/);
> chomp($splnum,$title,$date,$credit);
> print "$splnum $title $date $credit $keywords\n";
> }
> ===============
>
> This only grabs the first keyword (NERVE FIBRE, OVERLAPPING) and I
> need them all. Also the processing seems to stop after two records
> when there are 470 in $datafile!! I can't work that out either.
>
> Any ideas? There are a lot of XML modules out there but I don't know if
> any would help.
> Thanx.
> Dp.
>
>
> =========== news.xml ============
> <?xml version='1.0'?>
> <images>
> <image number='P350/041'>
> <title>Coloured SEM of two overlapping nerve fibres</title>
> <date>09-Jul-98</date>
> <credit>CREDIT: JUERGEN BERGER, MAX-PLANCK
> INSTITUTE/SCIENCE PHOTO LIBRARY</credit>
> <caption>CREDIT: JUERGEN BERGER, MAX-PLANCK INSTITUTE/
> SCIENCE PHOTO LIBRARY Nerve fibres. Coloured scanning electron
> micrograph (SEM) of overlapping nerve fibres. Each fibre is made up
> of several individual axons. An axon is a long extension from a nerve
> cell (or neurone) which is the main output process of the cell. Some
> small neurone cell bodies (rounded) can be seen here alongside the
> axons. Nerve fibres rapidly relay signals between the central nervous
> system (the brain and spinal cord) and muscles and organs in the
> body. This allows the body to react quickly to any situation.
> Magnification unknown.</caption>
> <keywords>
> <keyword>NERVE FIBRE, OVERLAPPING</keyword>
> <keyword>AXON, NERVE FIBRE, OVERLAPPING</keyword>
> <keyword>FIBRE, NERVE, OVERLAPPING</keyword>
> <keyword>NERVE CELL, WITH FIBRES</keyword>
> <keyword>NEURONE, WITH NERVE FIBRES</keyword>
> <keyword>HUMAN BODY, ANATOMY, NERVOUS</keyword>
> <keyword>SYSTEM, NERVE FIBRE, FIBRES</keyword>
> </keywords>
> </image>
> </images>
> ~~
> Dermot Paikkos * [EMAIL PROTECTED]
> Network Administrator @ Science Photo Library
> Phone: 0207 432 1100 * Fax: 0207 286 8668
trying to do this with regular expressions is unwise. there are a number of
modules out there that can help you quickly find what you need in an XML
file. one of those modules is XML::Parser. you can use it like this:
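#!/usr/bin/perl -w
use strict;
use XML::Parser;

my $kw  = 0;   # true while we are inside a <keyword> element
my $kws = '';  # text collected so far for the current keyword

my $xml = XML::Parser->new(Handlers => {Start => \&start,
                                        End   => \&end,
                                        Char  => \&string});
open(my $fh, '<', 'your.xml') || die "Can't open your.xml: $!\n";
$xml->parse($fh);
close($fh);

# called for every opening tag; $_[1] is the element name
sub start{
    $kw = 1 if($_[1] eq 'keyword');
}

# called for every closing tag; print and reset the collected keyword
sub end{
    if($_[1] eq 'keyword'){
        print "get one keyword: $kws\n";
        $kws = '';
        $kw = 0;
    }
}

# called for character data, possibly several times per element
sub string{
    $kws .= $_[1] if($kw && $_[1] =~ /\S/);
}

__END__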
the above only extracts the text inside the <keyword> tags from the XML file,
but you can apply the same technique to the other tags. i didn't really
test the above, but i hope it gives you something to look into.
much easier than writing tons of regular expressions, right? :-)
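
to show what "the same technique for the other tags" might look like, here is a rough sketch that collects one record per <image>. it is only an illustration: the parse_images() helper, the %wanted table and the shortened inline sample (caption left out) are my own assumptions, not anything from your script, and multi-line text is folded onto one line as a guess at what you want.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Parser;

# Shortened copy of the news.xml sample from the question (caption omitted).
my $sample = <<'XML';
<?xml version='1.0'?>
<images>
<image number='P350/041'>
<title>Coloured SEM of two overlapping nerve fibres</title>
<date>09-Jul-98</date>
<credit>CREDIT: JUERGEN BERGER, MAX-PLANCK
INSTITUTE/SCIENCE PHOTO LIBRARY</credit>
<keywords>
<keyword>NERVE FIBRE, OVERLAPPING</keyword>
<keyword>AXON, NERVE FIBRE, OVERLAPPING</keyword>
</keywords>
</image>
</images>
XML

# Elements whose text we want to keep (hypothetical helper table).
my %wanted = map { $_ => 1 } qw(title date credit caption keyword);

# Hypothetical helper: returns a list of hash refs, one per <image>.
sub parse_images {
    my ($xml_text) = @_;
    my (@images, $image, $text);

    my $parser = XML::Parser->new(Handlers => {
        Start => sub {
            my ($expat, $tag, %attr) = @_;
            if ($tag eq 'image') {
                $image = { number => $attr{number}, keywords => [] };
            }
            $text = '' if $wanted{$tag};      # start collecting text
        },
        Char => sub {
            my ($expat, $chunk) = @_;
            $text .= $chunk if defined $text; # may be called several times
        },
        End => sub {
            my ($expat, $tag) = @_;
            if ($wanted{$tag} && defined $text) {
                $text =~ s/\s+/ /g;           # fold embedded newlines
                $text =~ s/^ | $//g;          # trim the ends
                if ($tag eq 'keyword') { push @{ $image->{keywords} }, $text }
                else                   { $image->{$tag} = $text }
                undef $text;                  # stop collecting
            }
            push @images, $image if $tag eq 'image';
        },
    });
    $parser->parse($xml_text);
    return @images;
}

for my $img (parse_images($sample)) {
    print "$img->{number} | $img->{title} | $img->{date} | $img->{credit}\n";
    print "  keyword: $_\n" for @{ $img->{keywords} };
}
```

for the real file you would hand parse() an open filehandle (or call parsefile) instead of the inline string. keeping the keywords in an array per image means it does not matter whether a record has 0 or 20 of them.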
david