Re: [Tutor] Merging Text Files

2010-10-14 Thread Adam Lucas
Either way works: nest the for loops and compare protein IDs, or read one
file into a dictionary and write the other out with its matches from the
dictionary:

non-python pseudocode:

for every line in TWO:
    get the first protein ID
    for every line in ONE:
        if the second protein ID is the same as the first:
            perform the string merging and write it to the file
        else:
            pass to the next protein ID in ONE
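
The nested-loop pseudocode above might look something like this in Python 3
(the thread itself uses Python 2). This is only a sketch: merge_nested is a
made-up name, and it assumes ONE carries the protein ID in its first
comma-separated column and TWO in its second, as in the sample data quoted
later in the thread. The columns from ONE are simply appended to the TWO line.

```python
def split_fields(line):
    """Split a comma-separated line and trim spaces and any trailing comma."""
    return [field.strip() for field in line.strip().rstrip(",").split(",")]


def merge_nested(one_lines, two_lines):
    """Inner join by brute force: compare every line of TWO with every line of ONE."""
    merged = []
    for two_line in two_lines:
        two_fields = split_fields(two_line)
        two_id = two_fields[1]  # assumption: protein ID is the 2nd column of TWO
        for one_line in one_lines:
            one_fields = split_fields(one_line)
            if one_fields[0] == two_id:  # assumption: protein ID is the 1st column of ONE
                merged.append(", ".join(two_fields + one_fields[1:]))
            # otherwise just move on to the next line in ONE
    return merged
```

Every line of TWO rescans all of ONE, so this gets slow on big files; the
dictionary version avoids that.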

--OR--

for every line in ONE:
    add a dictionary entry with key = the protein ID and value = the rest
    of the line

for every line in TWO:
    if the dictionary has the same protein ID:
        perform the string merging and write to the file
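
A rough Python 3 sketch of the dictionary version follows. The name
merge_with_dict is invented for illustration, and the column positions
(protein ID first in ONE, second in TWO) are an assumption based on the
sample data quoted later in the thread.

```python
def split_fields(line):
    """Split a comma-separated line and trim spaces and any trailing comma."""
    return [field.strip() for field in line.strip().rstrip(",").split(",")]


def merge_with_dict(one_lines, two_lines):
    """Inner join: index ONE by protein ID, then look up each line of TWO."""
    lookup = {}
    for line in one_lines:
        fields = split_fields(line)
        lookup[fields[0]] = fields[1:]  # key = protein ID, value = the rest

    merged = []
    for line in two_lines:
        fields = split_fields(line)
        protein_id = fields[1]  # assumption: protein ID is the 2nd column of TWO
        if protein_id in lookup:  # non-matches are dropped
            merged.append(", ".join(fields + lookup[protein_id]))
    return merged
```

ONE is read once and each lookup is a single dictionary access, so the cost
grows with the combined size of the two files rather than with their product.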

I'm assuming an 'inner join' (drop non-matches). For an 'outer/right join'
(keep everything in TWO), initialize a 'matchmade' flag before the inner
loop, set it when a match is found, and if no match was made, write the
protein to the merged file with null values.
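
The 'matchmade' idea could be sketched like this in Python 3. The name
right_join and the "NA" placeholder are made up, and the two null columns
assume ONE contributes exactly two fields (locus tag and start/stop) beyond
the protein ID.

```python
def split_fields(line):
    """Split a comma-separated line and trim spaces and any trailing comma."""
    return [field.strip() for field in line.strip().rstrip(",").split(",")]


def right_join(one_lines, two_lines, null="NA"):
    """Keep every line of TWO; pad with null values when ONE has no match."""
    merged = []
    for two_line in two_lines:
        two_fields = split_fields(two_line)
        match_made = False  # reset once per line of TWO
        for one_line in one_lines:
            one_fields = split_fields(one_line)
            if one_fields[0] == two_fields[1]:
                merged.append(", ".join(two_fields + one_fields[1:]))
                match_made = True
        if not match_made:
            # assumption: two missing columns (locus tag, start/stop)
            merged.append(", ".join(two_fields + [null, null]))
    return merged
```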

If you plan on querying or sharing the newly organized dataset, use a
database. If this file is going into a workflow, it probably wants text.
I'd probably do both.
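
If the database route is taken, one possible shape is the sqlite3 module from
the Python standard library. This is a sketch only: the table and column
names (genes, hits, protein_id, ...) are invented here, and the last row of
hits uses a made-up ID to show that an inner join drops non-matches.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute(
    "CREATE TABLE genes (protein_id TEXT PRIMARY KEY, locus_tag TEXT, location TEXT)"
)
conn.execute(
    "CREATE TABLE hits (species TEXT, protein_id TEXT, e_value REAL, length INTEGER)"
)
conn.executemany(
    "INSERT INTO genes VALUES (?, ?, ?)",
    [
        ("ZP_05482482", "StAA4_010100030484", "complement(NZ_ACEV0178.1:25146..40916)"),
        ("ZP_07281899", "SSMG_05939", "complement(NZ_GG657746.1:6565974..6581756)"),
    ],
)
conn.executemany(
    "INSERT INTO hits VALUES (?, ?, ?, ?)",
    [
        ("Streptomyces sp. AA4", "ZP_05482482", 2.82936001e-140, 5256),
        ("Streptomyces sp. AA4", "ZP_07281899", 2.92539001e-140, 5260),
        ("Streptomyces sp. AA4", "ZP_00000000", 1.0e-5, 100),  # hypothetical, no match
    ],
)

# Inner join: only hits whose protein ID also appears in genes are returned.
rows = conn.execute(
    "SELECT h.species, h.protein_id, g.locus_tag, h.e_value, h.length, g.location "
    "FROM hits AS h JOIN genes AS g ON g.protein_id = h.protein_id "
    "ORDER BY h.e_value"
).fetchall()
conn.close()
```

Swapping JOIN for LEFT JOIN would keep the unmatched hit, with NULL in the
columns taken from genes.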
___
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor


Re: [Tutor] Merging Text Files

2010-10-14 Thread Adam Lucas
Whoops:

1) dictionary.has_key() ???
2) I don't know if it's a typo or oversight, but there's a comma in your
dictionary key; use line.split(',')[0].
3) Forget the database if it's part of a larger workflow unless your job is
to adapt a biological workflow database for your lab.



On Thu, Oct 14, 2010 at 09:48, Ara Kooser  wrote:

> Morning all,
>
>   I took the pseudocode that Emile provided and tried to write a python
> program. I may have taken the pseudocode too literally.
>
> So what I wrote was this:
> xml = open("final.txt",'r')
> gen = open("final_gen.txt",'r')
>
> PIDS = {}
> for proteinVals in gen:
>
>     ID = proteinVals.split()[0]
>     PIDS[ID] = proteinVals
>
> print PIDS
>
> for line in xml:
>     ID = proteinVals.split()[1]
>     rslt = "%s,%s"% (line,PIDS[ID])
>     print rslt
>
> So the first part I get. I read in gen that has this format as a text file:
>
> *Protein ID, Locus Tag, Start/Stop*
> ZP_05482482, StAA4_010100030484, complement(NZ_ACEV0178.1:25146..40916)
> ZP_07281899, SSMG_05939, complement(NZ_GG657746.1:6565974..6581756)
> ZP_05477599, StAA4_01015861, NZ_ACEV0113.1:86730..102047
> ...
> Put that into a dictionary with a key that is the Protein ID at position 0
> in the dictionary.
>
> The second part reads in the file xml which has this format:
>
> *Species, Protein ID, E Value, Length*
> Streptomyces sp. AA4, ZP_05482482, 2.82936001e-140, 5256,
> Streptomyces sp. AA4, ZP_05482482, 8.03332997e-138, 5256,
> Streptomyces sp. AA4, ZP_05482482, 1.08889e-124, 5256,
> Streptomyces sp. AA4, ZP_07281899, 2.92539001e-140, 5260,
> Streptomyces sp. AA4, ZP_07281899, 8.23695995e-138, 5260,
> 
> *same protein id multiple entries
>
> The program splits the file and does something with the 1 position which is
> the protein ID in the xml file. After that I am not really sure what is
> happening. I can't remember what the %s means. Something with a string?
>
> When this runs I get the following error:
> Traceback (most recent call last):
>   File "/Users/ara/Desktop/biopy_programs/merge2.py", line 18, in <module>
>     rslt = "%s,%s"% (line,PIDS[ID])
> KeyError: 'StAA4_010100017400,'
>
> From what I can tell it's not happy about the dictionary key.
>
> In the end I am looking for a way to merge these two files and for each
> protein ID add the locus tag and start/stop like this:
> *Species, Protein ID, Locus Tag, E Value, Length*, *Start/Stop*
>
> Streptomyces sp. AA4, ZP_05482482, StAA4_010100030484, 2.82936001e-140, 5256, complement(NZ_ACEV0178.1:25146..40916)
> Streptomyces sp. AA4, ZP_05482482, StAA4_010100030484, 8.03332997e-138, 5256, complement(NZ_ACEV0178.1:25146..40916)
> Streptomyces sp. AA4, ZP_05482482, StAA4_010100030484, 1.08889e-124, 5256, complement(NZ_ACEV0178.1:25146..40916)
> Streptomyces sp. AA4, ZP_07281899, SSMG_05939, 2.92539001e-140, 5260, complement(NZ_GG657746.1:6565974..6581756)
> Streptomyces sp. AA4, ZP_07281899, SSMG_05939, 8.23695995e-138, 5260, complement(NZ_GG657746.1:6565974..6581756)
>
> Do you have any suggestions for how to proceed? It feels like I am getting
> closer. :)
>
>
> Note:
> When I change this part of the code to 0
> for line in xml:
>     ID = proteinVals.split()[0]
>     rslt = "%s,%s"% (line,PIDS[ID])
>     print rslt
>
> I get the following output:
> Streptomyces sp. AA4, ZP_05482482, 8.03332997e-138, 5256,
> ,ZP_05479896, StAA4_010100017400, NZ_ACEV0143.1:241968..>242983
>
>
> Streptomyces sp. AA4, ZP_05482482, 1.08889e-124, 5256,
> ,ZP_05479896, StAA4_010100017400, NZ_ACEV0143.1:241968..>242983
>
>
> Streptomyces sp. AA4, ZP_07281899, 2.92539001e-140, 5260,
> ,ZP_05479896, StAA4_010100017400, NZ_ACEV0143.1:241968..>242983
>
> Which seems closer but all it's doing is repeating the same Locus Tag and
> Start/Stop for each entry.
>
> Thank you!
>
> Ara
>
>
> --
> Quis hic locus, quae regio, quae mundi plaga. Ubi sum. Sub ortu solis an
> sub cardine glacialis ursae.
>


-- 
Data is not information, information is not knowledge, knowledge is not
understanding, understanding is not wisdom.
--Clifford Stoll


Re: [Tutor] Merging Text Files

2010-10-14 Thread Adam Lucas
I sent both emails and may have confused things:

1. PIDS.has_key(ID) returns True/False. You need to make sure the dictionary
has the key before you fetch PIDS[NotAKey] and get a KeyError.
2. line.split() splits at and removes whitespace, leaving commas.
line.split(",") splits at and removes commas.
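
Both points can be seen in a small Python 3 sketch of the program quoted
earlier in the thread. The file contents are replaced by two short lists, and
the second xml line uses a made-up ID so the membership test has something to
reject. Note that dict.has_key() exists only in Python 2; `ID in PIDS` does
the same check and works in both versions. The sketch also reads the ID from
`line`; the quoted loop reads it from `proteinVals`, which still holds the
last line of the first file, and that would explain why the same locus tag
repeats in the posted output.

```python
gen_lines = [
    "ZP_05482482, StAA4_010100030484, complement(NZ_ACEV0178.1:25146..40916)",
    "ZP_07281899, SSMG_05939, complement(NZ_GG657746.1:6565974..6581756)",
]
xml_lines = [
    "Streptomyces sp. AA4, ZP_05482482, 2.82936001e-140, 5256,",
    "Streptomyces sp. AA4, ZP_00000000, 1.0e-5, 100,",  # hypothetical ID, no match
]

# split() cuts at whitespace only, so the commas stay attached to the tokens.
whitespace_tokens = xml_lines[0].split()
# split(",") cuts at the commas; strip() then removes the leftover spaces.
comma_fields = [field.strip() for field in xml_lines[0].split(",")]

PIDS = {}
for protein_vals in gen_lines:
    PIDS[protein_vals.split(",")[0].strip()] = protein_vals

results = []
for line in xml_lines:
    ID = line.split(",")[1].strip()  # read the ID from `line`, not from the first loop
    if ID in PIDS:  # same check as PIDS.has_key(ID) in Python 2
        # "%s" is a placeholder that the % operator fills with str() of each value
        results.append("%s %s" % (line, PIDS[ID]))
```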


Re: [Tutor] triple-nested for loop not working

2011-05-04 Thread Adam Lucas
On Wed, May 4, 2011 at 13:31, Spyros Charonis  wrote:

> Hello everyone,
>
> I have written a program, as part of a bioinformatics project, that
> extracts motif sequences (programmatically just strings of letters) from a
> database and writes them to a file.
> I have written another script to annotate the database file (in plaintext
> ASCII format) by replacing every match of a motif with a sequence of tildes
> (~).  Primitive I know, but not much more can be done with ASCII files.  The
> code goes as follows:
>
>
> motif_file = open('myfolder/pythonfiles/final motifs_11SGLOBULIN', 'r')
> # => final motifs_11sglobulin contains the output of my first program
> align_file = open('myfolder/pythonfiles/11sglobulin.seqs', 'a+')
> # => 11sglobulin.seqs is the ASCII sequence alignment file which I want to
> # "annotate" (modify)
>
> finalmotif_seqs = []
> finalmotif_length = []  # store length of each motif
> finalmotif_annot = []
>
> for line in finalmotifs:
>     finalmotif_seqs.append(line)
>     mot_length = len(line)
>     finalmotif_length.append(mot_length)
>
> for item in finalmotif_length:
>     annotation = '~' * item
>     finalmotif_annot.append(annotation)
>
> finalmotifs = motif_file.readlines()
> seqalign = align_file.readlines()
>
> for line in seqalign:
>     for i in len(finalmotif_seqs):  # for item in finalmotif_seqs:
>         for i in len(finalmotif_annot): # for item in finalmotif_annot:
>             if finalmotif_seqs[i] in line:  # if item in line:
>                 newline = line.replace(finalmotif_seqs[i], finalmotif_annot[i])
>                 #sys.stdout.write(newline)   # => print the lines out on the shell
>                 align_file.writelines(newline)
>
>

Pay attention to scope with the elements of your iteration loops. If you
call everything 'item' you can confuse yourself, others, and the interpreter
as to which 'item' you're talking about.
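
As a sketch of what distinct loop names buy you, here is one way the
annotation step could be written in Python 3 (the original post targets
Python 2.6.5). The function name annotate is invented, file handling is left
out so the logic can be tried on plain lists, and stripping the trailing
newline from each motif before matching is my assumption about what the
motif file looks like.

```python
def annotate(seq_lines, motif_lines):
    """Replace every occurrence of each motif with a run of tildes of equal length."""
    # assumption: one motif per line; drop newlines and skip blank lines
    motifs = [motif.strip() for motif in motif_lines if motif.strip()]

    annotated = []
    for seq_line in seq_lines:  # one name per loop level
        new_line = seq_line
        for motif in motifs:
            new_line = new_line.replace(motif, "~" * len(motif))
        annotated.append(new_line)
    return annotated
```

Only two loops are needed because the annotation is derived from the motif
itself, so there is no second index to keep in step with the first. For
example, annotate(["MKVLAGGT\n"], ["VLA\n"]) gives ["MK~~~GGT\n"] (the
sequence here is made up).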


>
> motif_file.close()
> align_file.close()
>
>
> My coding issue is that although the script runs, there is a logic error
> somewhere in the triple-nested for loop: when I check the file I'm
> supposedly modifying, there is no change. All three lists are built correctly
> (I've confirmed this on the Python shell). Any help would be much
> appreciated!
> I am running Python 2.6.5
>
>