Hi,
I'm trying to parse a table containing information about genes in a
bacterial chromosome. Below is a sample for two genes; there are about
4500 such blocks in the file:
gene_oid Locus Tag Source Cluster Information Gene Information E-value
642745051 SeSA_B0001 COG_category [T] Signal transduction mechanisms
642745051 SeSA_B0001 COG_category [K] Transcription
642745051 SeSA_B0001 COG1974 SOS-response transcriptional repressors (RecA-mediated autopeptidases) 2.0e-29
642745051 SeSA_B0001 pfam00717 Peptidase_S24 1.7e-13
642745051 SeSA_B0001 EC:3.4.21.- Hydrolases. Acting on peptide bonds (peptide hydrolases). Serine endopeptidases.
642745051 SeSA_B0001 KO:K03503 DNA polymerase V [EC:3.4.21.-] 0.0e+00
642745051 SeSA_B0001 ITERM:03797 SOS response UmuD protein. Serine peptidase. MEROPS family S24
642745051 SeSA_B0001 Locus_type CDS
642745051 SeSA_B0001 NCBI_accession YP_002112883
642745051 SeSA_B0001 Product_name protein SamA
642745051 SeSA_B0001 Scaffold NC_011092
642745051 SeSA_B0001 Coordinates 34..459(+)
642745051 SeSA_B0001 DNA_length 426bp
642745051 SeSA_B0001 Protein_length 141aa
642745051 SeSA_B0001 GC .52
642745052 SeSA_B0002 COG_category [L] Replication, recombination and repair
642745052 SeSA_B0002 COG0389 Nucleotidyltransferase/DNA polymerase involved in DNA repair 4.0e-71
642745052 SeSA_B0002 pfam00817 IMS 2.7e-36
642745052 SeSA_B0002 pfam11798 IMS_HHH 6.8e-06
642745052 SeSA_B0002 pfam11799 IMS_C 4.0e-11
642745052 SeSA_B0002 KO:K03502 DNA polymerase V 0.0e+00
642745052 SeSA_B0002 Locus_type CDS
642745052 SeSA_B0002 NCBI_accession YP_002112884
642745052 SeSA_B0002 Product_name protein UmuC
642745052 SeSA_B0002 Scaffold NC_011092
642745052 SeSA_B0002 Coordinates 459..1730(+)
642745052 SeSA_B0002 DNA_length 1272bp
642745052 SeSA_B0002 Protein_length 423aa
642745052 SeSA_B0002 GC .57
642745052 SeSA_B0002 Fused_gene Yes
I want to extract the Locus Tag, Source, and Cluster Information for each
gene so that the output table looks like this:
locus COG_category COG_category COGID Cluster_Information
SeSA_B0001 [T] Signal transduction mechanisms [K] Transcription COG1974 SOS-response transcriptional repressors (RecA-mediated autopeptidases)
SeSA_B0002 [L] Replication, recombination and repair COG0389 Nucleotidyltransferase/DNA polymerase involved in DNA repair
My problem is that some genes have 2 entries for COG_category, some only one
and others none. I took a look at perldsc and tried to fit the table into
one of the complex structures but didn't get far. Below is the code I came
up with so far:
#!/usr/bin/perl
# parse_IMG_gene_info.pl
use strict;
use warnings;

@ARGV == 1 or die "Usage: $0 gene_info_file\n";
open( my $in, "<", $ARGV[0] ) or die "Failed to open $ARGV[0]: $!\n";

print "locus\tCOG_category\tCOG_category\tCOGID\tCluster_Information\n\n";

my( %cog_cat, %cog_id, @cogs );
while( my $line = <$in> ) {
    chomp $line;
    my( $oid, $locus, $source, $cluster_info ) = split /\t/, $line;
    next unless defined $cluster_info;    # skip short or empty lines
    if( $source =~ /COG_category/ ) {
        # a second COG_category line for the same locus overwrites the first
        $cog_cat{ $locus } = $cluster_info;
        push( @cogs, { %cog_cat } );      # copies the whole hash on every match
    } elsif ( $source =~ /COG\d+/ ) {
        $cog_id{ $locus } = $cluster_info;
    }
}
close $in;

#print scalar @cogs, "\n";
# only loci that have a COG_category entry are printed
for my $test( sort keys %cog_cat ) {
    my $id = defined $cog_id{ $test } ? $cog_id{ $test } : '';
    print "$test\t$cog_cat{ $test }\t$id\n";
}
print "\n";
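A hash of arrays keyed by locus (one of the structures described in perldsc) is one possible way to cope with genes that have zero, one, or two COG_category entries. What follows is only a rough sketch, not a finished script. It assumes the file really is tab-separated with the column order shown in the header, and that only the first COG ID per gene is wanted in the output. The sub names collect_cog_info and format_row are made up for illustration, and the demo reads three lines from an in-memory string instead of the real file.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Collect COG categories and COG IDs per locus from tab-separated lines.
# Returns a hash ref: locus => { cats => [...], cogs => [ [id, description], ... ] }
sub collect_cog_info {
    my ($fh) = @_;
    my %gene;
    while ( my $line = <$fh> ) {
        chomp $line;
        my ( $oid, $locus, $source, $info ) = split /\t/, $line;
        # skip the header, empty lines, and lines with fewer than four fields
        next unless defined $info && $oid =~ /^\d+$/;
        # register every locus, so genes without COG data still get a row
        $gene{$locus} ||= { cats => [], cogs => [] };
        if ( $source eq 'COG_category' ) {
            push @{ $gene{$locus}{cats} }, $info;
        }
        elsif ( $source =~ /^COG\d+$/ ) {
            push @{ $gene{$locus}{cogs} }, [ $source, $info ];
        }
    }
    return \%gene;
}

# Build one output row: always two COG_category columns (padded with ''),
# then the first COG ID and its description (or '' if there is none).
sub format_row {
    my ( $locus, $rec ) = @_;
    my @cats = @{ $rec->{cats} };
    push @cats, '' while @cats < 2;
    my ( $id, $desc ) = @{ $rec->{cogs}[0] || [ '', '' ] };
    return join "\t", $locus, @cats[ 0, 1 ], $id, $desc;
}

# Demo on three lines from the sample above. The empty fifth field before
# the E-value is an assumption about the real file layout.
my $sample = join '',
    "642745051\tSeSA_B0001\tCOG_category\t[T] Signal transduction mechanisms\n",
    "642745051\tSeSA_B0001\tCOG_category\t[K] Transcription\n",
    "642745051\tSeSA_B0001\tCOG1974\tSOS-response transcriptional repressors (RecA-mediated autopeptidases)\t\t2.0e-29\n";
open( my $fh, '<', \$sample ) or die "Cannot read sample: $!\n";
my $genes = collect_cog_info($fh);
close $fh;

print join( "\t", qw(locus COG_category COG_category COGID Cluster_Information) ), "\n";
print format_row( $_, $genes->{$_} ), "\n" for sort keys %$genes;
```

To run this against the real file, the in-memory handle would be replaced by one opened on $ARGV[0]. Pushing onto an array per locus, rather than assigning to a plain hash value, is what keeps a second COG_category line from overwriting the first.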
Your insight is greatly appreciated!
galeb