Re: sequence composition

Changrong Ge Thu, 02 Dec 2010 00:16:15 -0800

Hi guys,
    Thanks a lot for this issue and with Anjan's help here I got the
wonderful script for my work and I post below, hopefully it would be helpful
for others--works perfectly for fasta file.
    Also thanks to others who are interested in the discussion, but I think
it takes time for me to digest those "weird" symbols-----Tack Så Mycket!!!
    Have a nice day
    Changrong




usage: perl seqComp.pl <input_fasta_file> > <output_file>


#! /usr/bin/perl -w
use strict;

open (S, "$ARGV[0]") || die "cannot open FASTA file to read: $!";

my %s;# a hash of arrays, to hold each line of sequence
my %seq; #a hash to hold the AA sequences.
my $key;

while (<S>){ #Read the FASTA file.
    chomp;

    if (/>/){
        s/>//;
        $key= $_;
    }else{
        push (@{$s{$key}}, $_);
    }

}

foreach my $a (keys %s){
    my $s= join("", @{$s{$a}});
    $seq{$a}=$s;
    #print("$a\t$s\n");
}

my @aa= qw(A R N D C Q E G H I L K M F P S T W Y V);
my $aa= join("\t", @aa);
print ("Sequence\t$aa\n");

foreach my $k (keys %seq){
    my %count; # a hash to hold the count for each amino acid in the
protein.
    my @seq= split(//, $seq{$k});
    foreach my $r(@seq){
        $count{$r}++;
    }
my @row;
push(@row, $k);
    foreach my $a (@aa){
        $count{$a}||=0;
        $count{$a}= sprintf("%0.2f", $count{$a}/length($seq{$k}));
        push(@row,$count{$a});
    }
my $row= join("\t",@row);
print("$row\n");
}



On Wed, Dec 1, 2010 at 8:56 PM, ANJAN PURKAYASTHA <
[email protected]> wrote:

> here is the script. give it a name, say, seqComp.pl. usage: perl seqComp.pl
> <input_FASTA_file>.
> HTH,
> Anjan
>
> #! /usr/bin/perl -w
> use strict;
>
> open (S, "$ARGV[0]") || die "cannot open FASTA file to read: $!";
>
> my %s;# a hash of arrays, to hold each line of sequence
> my %seq; #a hash to hold the AA sequences.
> my $key;
>
> while (<S>){ #Read the FASTA file.
>     chomp;
>
>     if (/>/){
>         s/>//;
>         $key= $_;
>     }else{
>         push (@{$s{$key}}, $_);
>     }
>
> }
>
> foreach my $a (keys %s){
>     my $s= join("", @{$s{$a}});
>     $seq{$a}=$s;
>     #print("$a\t$s\n");
> }
>
> my @aa= qw(A R N D C Q E G H I L K M F P S T W Y V);
>
> foreach my $k (keys %seq){
>     my %count; # a hash to hold the count for each amino acid in the
> protein.
>     my @seq= split(//, $seq{$k});
>     foreach my $r(@seq){
>         $count{$r}++;
>     }
>     print("$k\n");
>     foreach my $a (@aa){
>         $count{$a}||=0;
>         $count{$a}= sprintf("%0.2f", $count{$a}/length($seq{$k}));
>         print("$a\t$count{$a}\n");
>     }
>
> }
>
> On Wed, Dec 1, 2010 at 2:31 PM, Rob Dixon <[email protected]> wrote:
>
>> On 01/12/2010 08:44, Changrong Ge wrote:
>>
>>>
>>> I am quite new to this perl language-I am from biochemistry field.
>>> Now trying to write a script for my current work but could not make
>>> it. The idea is to calculate the composition (percentage) of amino
>>> acids in a protein sequence.
>>>
>>> Input is a series of fasta format (protein sequence) output is a tab
>>> delimited format like below: >
>>>
>>>     Name            A      T       C        D      N     Q      E .......
>>>     protein1     0.23   0.40  0.20 ...
>>>     protein2     0.52   0.01   ....
>>>     protein3
>>>     ......
>>> Could somebody help me with this? I tried reading some books like
>>> perl for bioinformatics, but still not into it.
>>>
>>
>> Hi Changrong
>>
>> The problem doesn't seem difficult, but I'm afraid we don't have much
>> knowledge of bioinformatics between us. If you post a sample of input
>> data and the corresponding output you desire then I am sure we can help.
>>
>> Regards,
>>
>> Rob
>>
>>
>> --
>> To unsubscribe, e-mail: [email protected]
>> For additional commands, e-mail: [email protected]
>> http://learn.perl.org/
>>
>>
>>
>
>
> --
> ===================================
> anjan purkayastha, phd.
> research associate
> fas center for systems biology,
> harvard university
> 52 oxford street
> cambridge ma 02138
> phone-703.740.6939
> ===================================
>



-- 
Changrong Ge, PhD student,
Center for Biomembrane Research,
Dept.of Biochemistry & Biophysics,
Svante Arrheniusväg 16,
Stockholm University, SE-10691 Stockholm, Sweden.
Tel:  +46-(0)8-16 24 87,  Mobile: +46-(0)76-2176 119
Fax: +46-(0)8-16 2487,  Email: [email protected] or [email protected]

Re: sequence composition

Reply via email to