Hi guys,
Thanks a lot for this issue and with Anjan's help here I got the
wonderful script for my work and I post below, hopefully it would be helpful
for others--works perfectly for fasta file.
Also thanks to others who are interested in the discussion, but I think
it takes time for me to digest those "weird" symbols-----Tack Så Mycket!!!
Have a nice day
Changrong
usage: perl seqComp.pl <input_fasta_file> > <output_file>
#! /usr/bin/perl -w
use strict;
open (S, "$ARGV[0]") || die "cannot open FASTA file to read: $!";
my %s;# a hash of arrays, to hold each line of sequence
my %seq; #a hash to hold the AA sequences.
my $key;
while (<S>){ #Read the FASTA file.
chomp;
if (/>/){
s/>//;
$key= $_;
}else{
push (@{$s{$key}}, $_);
}
}
foreach my $a (keys %s){
my $s= join("", @{$s{$a}});
$seq{$a}=$s;
#print("$a\t$s\n");
}
my @aa= qw(A R N D C Q E G H I L K M F P S T W Y V);
my $aa= join("\t", @aa);
print ("Sequence\t$aa\n");
foreach my $k (keys %seq){
my %count; # a hash to hold the count for each amino acid in the
protein.
my @seq= split(//, $seq{$k});
foreach my $r(@seq){
$count{$r}++;
}
my @row;
push(@row, $k);
foreach my $a (@aa){
$count{$a}||=0;
$count{$a}= sprintf("%0.2f", $count{$a}/length($seq{$k}));
push(@row,$count{$a});
}
my $row= join("\t",@row);
print("$row\n");
}
On Wed, Dec 1, 2010 at 8:56 PM, ANJAN PURKAYASTHA <
[email protected]> wrote:
> here is the script. give it a name, say, seqComp.pl. usage: perl seqComp.pl
> <input_FASTA_file>.
> HTH,
> Anjan
>
> #! /usr/bin/perl -w
> use strict;
>
> open (S, "$ARGV[0]") || die "cannot open FASTA file to read: $!";
>
> my %s;# a hash of arrays, to hold each line of sequence
> my %seq; #a hash to hold the AA sequences.
> my $key;
>
> while (<S>){ #Read the FASTA file.
> chomp;
>
> if (/>/){
> s/>//;
> $key= $_;
> }else{
> push (@{$s{$key}}, $_);
> }
>
> }
>
> foreach my $a (keys %s){
> my $s= join("", @{$s{$a}});
> $seq{$a}=$s;
> #print("$a\t$s\n");
> }
>
> my @aa= qw(A R N D C Q E G H I L K M F P S T W Y V);
>
> foreach my $k (keys %seq){
> my %count; # a hash to hold the count for each amino acid in the
> protein.
> my @seq= split(//, $seq{$k});
> foreach my $r(@seq){
> $count{$r}++;
> }
> print("$k\n");
> foreach my $a (@aa){
> $count{$a}||=0;
> $count{$a}= sprintf("%0.2f", $count{$a}/length($seq{$k}));
> print("$a\t$count{$a}\n");
> }
>
> }
>
> On Wed, Dec 1, 2010 at 2:31 PM, Rob Dixon <[email protected]> wrote:
>
>> On 01/12/2010 08:44, Changrong Ge wrote:
>>
>>>
>>> I am quite new to this perl language-I am from biochemistry field.
>>> Now trying to write a script for my current work but could not make
>>> it. The idea is to calculate the composition (percentage) of amino
>>> acids in a protein sequence.
>>>
>>> Input is a series of fasta format (protein sequence) output is a tab
>>> delimited format like below: >
>>>
>>> Name A T C D N Q E .......
>>> protein1 0.23 0.40 0.20 ...
>>> protein2 0.52 0.01 ....
>>> protein3
>>> ......
>>> Could somebody help me with this? I tried reading some books like
>>> perl for bioinformatics, but still not into it.
>>>
>>
>> Hi Changrong
>>
>> The problem doesn't seem difficult, but I'm afraid we don't have much
>> knowledge of bioinformatics between us. If you post a sample of input
>> data and the corresponding output you desire then I am sure we can help.
>>
>> Regards,
>>
>> Rob
>>
>>
>> --
>> To unsubscribe, e-mail: [email protected]
>> For additional commands, e-mail: [email protected]
>> http://learn.perl.org/
>>
>>
>>
>
>
> --
> ===================================
> anjan purkayastha, phd.
> research associate
> fas center for systems biology,
> harvard university
> 52 oxford street
> cambridge ma 02138
> phone-703.740.6939
> ===================================
>
--
Changrong Ge, PhD student,
Center for Biomembrane Research,
Dept.of Biochemistry & Biophysics,
Svante Arrheniusväg 16,
Stockholm University, SE-10691 Stockholm, Sweden.
Tel: +46-(0)8-16 24 87, Mobile: +46-(0)76-2176 119
Fax: +46-(0)8-16 2487, Email: [email protected] or [email protected]