On 09/06/2011 09:48, venkates wrote:
> Hi,
>
> data snippet:
>
> ENTRY       K00002 KO
> NAME        E1.1.1.2, adh
> DEFINITION  alcohol dehydrogenase (NADP+) [EC:1.1.1.2]
> PATHWAY     ko00010 Glycolysis / Gluconeogenesis
>             ko00561 Glycerolipid metabolism
>             ko00930 Caprolactam degradation
> CLASS       Metabolism; Carbohydrate Metabolism; Glycolysis / Gluconeogenesis
>             [PATH:ko00010]
>             Metabolism; Lipid Metabolism; Glycerolipid metabolism [PATH:ko00561]
>             Metabolism; Xenobiotics Biodegradation and Metabolism; Caprolactam
>             degradation [PATH:ko00930]
> DBLINKS     RN: R00746 R01041 R05231
>             COG: COG0656
>             GO: 0008106
> GENES       HSA: 10327(AKR1A1)
>             PTR: 741418(AKR1A1)
>             PON: 100173796(AKR1A1)
>             MCC: 693380(AKR1A1)
>             MMU: 58810(Akr1a4)
>             RNO: 78959(Akr1a1)
>             CFA: 610537
> ///
> ENTRY       K00730 KO
> NAME        OST4
> DEFINITION  oligosaccharyl transferase complex subunit OST4
> PATHWAY     ko00510 N-Glycan biosynthesis
>             ko00513 Various types of N-glycan biosynthesis
>             ko04141 Protein processing in endoplasmic reticulum
> MODULE      M00072 Oligosaccharyltransferase
> CLASS       Metabolism; Glycan Biosynthesis and Metabolism; N-Glycan
>             biosynthesis [PATH:ko00510]
>             Metabolism; Glycan Biosynthesis and Metabolism; Various types of
>             N-glycan biosynthesis [PATH:ko00513]
>             Genetic Information Processing; Folding, Sorting and Degradation;
>             Protein processing in endoplasmic reticulum [PATH:ko04141]
> DBLINKS     GO: 0008250
> GENES       SCE: YDL232W(OST4)
>             AGO: AGOS_ABL170C
>             KLA: KLLA0A01287g
>             VPO: Kpol_1054p35
>             SSL: SS1G_13465
> REFERENCE   PMID:15001703
>   AUTHORS   Zubkov S, Lennarz WJ, Mohanty S
>   TITLE     Structural basis for the function of a minimembrane protein subunit
>             of yeast oligosaccharyltransferase.
>   JOURNAL   Proc Natl Acad Sci U S A 101:3821-6 (2004)
> ///
>
> I need to retrieve all the gene entries and add them to a hash ref. My code
> does that for the first record, but for the second it also pulls in
> the REFERENCE information. I have provided the code below. If someone
> could tell me where exactly I am going wrong (is it in the regex, or
> elsewhere?), I would be glad!
>
> code :
>
> use strict;
> use warnings;
> use Carp;
> use Data::Dumper;
>
> my $set = parse("/home/venkates/workspace/KEGG_Parser/data/ko");
>
> sub parse {
>     my $kegg_file_path = shift;
>     my $keggData;    # Hash ref
>
>     open my $fh, '<', $kegg_file_path
>         or croak("Cannot open file '$kegg_file_path': $!");
>     local $/ = "\n///\n";
>     while (<$fh>) {
>         chomp;
>         my $record = $_;
>         $record =~ m/^ENTRY\s{7}(.+?)\s+/xms;
>         my $entries = $1;
>         if ($record =~ m/^GENES\s{7}(.+)$/xms) {
>             my $gene = $1;
>             ${$keggData}{$entries}{'GENE'} = $gene;
>             my @genes = split ('\s{13}', $gene);
>             foreach my $gene_element (@genes) {
>                 my $taxon_label = substr($gene_element, 0, 3);
>                 my $gene_label = substr($gene_element, 5);
>                 my @gene_label_array = split '\s', $gene_label;
>                 push @{${$keggData}{$entries}{'GENES'}{$taxon_label}}, @gene_label_array;
>             }
>         }
>     }
>     print Dumper($keggData);
>     close $fh;
> }
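The regex is the culprit: under the /s modifier, '.' also matches newlines,
so the greedy (.+) after GENES runs on to the end of the record. That is
harmless in the first record, where GENES is the last field, but in the
second it swallows the REFERENCE block as well.

I would prefer to read the file a line at a time. The code below seems
to do what you want.

HTH,

Rob
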
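use strict;
use warnings;
use Data::Dumper;

my $kegg_file = '/home/venkates/workspace/KEGG_Parser/data/ko';

# Fall back to the sample records under __DATA__ if the file cannot be opened.
my $fh;
unless (open $fh, '<', $kegg_file) {
    warn "Failed to open file: $!. Defaulting to DATA.";
    $fh = \*DATA;
}

parse($fh);

sub parse {
    my $kegg_file_handle = shift;

    my $keggData;    # hash ref: entry => GENES => taxon => [gene labels]
    my $entry;       # ID of the record currently being read
    my $key;         # most recent field name; continuation lines inherit it

    while (<$kegg_file_handle>) {
        next unless /\S/;

        # '///' ends a record, so forget the current entry and field.
        if (m|///|) {
            undef $entry;
            undef $key;
            next;
        }

        chomp;

        # The field name, if present, starts the line; the value follows
        # after whitespace. Continuation lines have an empty field name.
        next unless m|^(.{0,11}?)\s+(.+)|;
        $key = $1 if $1;
        my $val = $2;
        next unless defined $key;

        if ($key eq 'ENTRY') {
            ($entry) = $val =~ /(\S+)/;
        }
        elsif ($key eq 'GENES') {
            die "No current entry" unless $entry;
            my ($taxon_label, @gene_label_array) = split /:?\s+/, $val;
            push @{$keggData->{$entry}{$key}{$taxon_label}}, @gene_label_array;
        }
    }

    print Dumper($keggData);
}
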
__DATA__
ENTRY       K00002 KO
NAME        E1.1.1.2, adh
DEFINITION  alcohol dehydrogenase (NADP+) [EC:1.1.1.2]
PATHWAY     ko00010 Glycolysis / Gluconeogenesis
            ko00561 Glycerolipid metabolism
            ko00930 Caprolactam degradation
CLASS       Metabolism; Carbohydrate Metabolism; Glycolysis / Gluconeogenesis
            [PATH:ko00010]
            Metabolism; Lipid Metabolism; Glycerolipid metabolism [PATH:ko00561]
            Metabolism; Xenobiotics Biodegradation and Metabolism; Caprolactam
            degradation [PATH:ko00930]
DBLINKS     RN: R00746 R01041 R05231
            COG: COG0656
            GO: 0008106
GENES       HSA: 10327(AKR1A1)
            PTR: 741418(AKR1A1)
            PON: 100173796(AKR1A1)
            MCC: 693380(AKR1A1)
            MMU: 58810(Akr1a4)
            RNO: 78959(Akr1a1)
            CFA: 610537
///
ENTRY       K00730 KO
NAME        OST4
DEFINITION  oligosaccharyl transferase complex subunit OST4
PATHWAY     ko00510 N-Glycan biosynthesis
            ko00513 Various types of N-glycan biosynthesis
            ko04141 Protein processing in endoplasmic reticulum
MODULE      M00072 Oligosaccharyltransferase
CLASS       Metabolism; Glycan Biosynthesis and Metabolism; N-Glycan
            biosynthesis [PATH:ko00510]
            Metabolism; Glycan Biosynthesis and Metabolism; Various types of
            N-glycan biosynthesis [PATH:ko00513]
            Genetic Information Processing; Folding, Sorting and Degradation;
            Protein processing in endoplasmic reticulum [PATH:ko04141]
DBLINKS     GO: 0008250
GENES       SCE: YDL232W(OST4)
            AGO: AGOS_ABL170C
            KLA: KLLA0A01287g
            VPO: Kpol_1054p35
            SSL: SS1G_13465
REFERENCE   PMID:15001703
  AUTHORS   Zubkov S, Lennarz WJ, Mohanty S
  TITLE     Structural basis for the function of a minimembrane protein subunit
            of yeast oligosaccharyltransferase.
  JOURNAL   Proc Natl Acad Sci U S A 101:3821-6 (2004)
///
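
If you would rather keep the record-at-a-time approach, the capture can be bounded instead. What follows is only a minimal sketch, assuming the 12-column key field that the \s{7} patterns in the question imply, and assuming every GENES line starts with a taxon label. The helper name extract_genes and the shortened sample record are made up for illustration. The lazy match stops at the next line that begins with a non-space character (the next field) or at the end of the record.

```perl
use strict;
use warnings;

# Hypothetical helper: return a hash ref of taxon => [gene labels]
# for a single KEGG record (the text between two '///' separators).
sub extract_genes {
    my ($record) = @_;
    my %genes;

    # Lazy capture, bounded by a lookahead: stop before the next line
    # that starts with a non-space character, or at the end of the record.
    return \%genes unless $record =~ m/^GENES\s+(.*?)(?=^\S|\z)/ms;

    for my $line (split /\n/, $1) {
        # split ' ' skips leading whitespace on continuation lines.
        my ($taxon, @labels) = split ' ', $line;
        next unless defined $taxon;
        $taxon =~ s/:\z//;    # 'SCE:' becomes 'SCE'
        push @{ $genes{$taxon} }, @labels;
    }
    return \%genes;
}

# Shortened sample record (an assumption, cut down from the data above).
my $record = join "\n",
    'NAME        OST4',
    'GENES       SCE: YDL232W(OST4)',
    '            AGO: AGOS_ABL170C',
    'REFERENCE   PMID:15001703';

my $genes = extract_genes($record);
print join(',', sort keys %$genes), "\n";    # prints "AGO,SCE"
```

This leaves the REFERENCE lines out because the lookahead ends the capture before them. A line-at-a-time parser is still easier to extend if you later need the other fields.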