Morning all, I took the pseudocode that Emile provided and tried to write a Python program. I may have taken the pseudocode too literally.

So what I wrote was this:

    xml = open("final.txt",'r')
    gen = open("final_gen.txt",'r')

    PIDS = {}
    for proteinVals in gen:
        ID = proteinVals.split()[0]
        PIDS[ID] = proteinVals

    print PIDS

    for line in xml:
        ID = proteinVals.split()[1]
        rslt = "%s,%s"% (line,PIDS[ID])
        print rslt

So the first part I get. I read in gen, which is a text file with this format:

    *Protein ID, Locus Tag, Start/Stop*
    ZP_05482482, StAA4_010100030484, complement(NZ_ACEV01000078.1:25146..40916)
    ZP_07281899, SSMG_05939, complement(NZ_GG657746.1:6565974..6581756)
    ZP_05477599, StAA4_010100005861, NZ_ACEV01000013.1:86730..102047
    ...

and put each line into a dictionary whose key is the Protein ID, the item at position 0 of the split line.

The second part reads in the file xml, which has this format:

    *Species, Protein ID, E Value, Length*
    Streptomyces sp. AA4, ZP_05482482, 2.8293600000000001e-140, 5256,
    Streptomyces sp. AA4, ZP_05482482, 8.0333299999999997e-138, 5256,
    Streptomyces sp. AA4, ZP_05482482, 1.08889e-124, 5256,
    Streptomyces sp. AA4, ZP_07281899, 2.9253900000000001e-140, 5260,
    Streptomyces sp. AA4, ZP_07281899, 8.2369599999999995e-138, 5260,
    ....
    *same protein id multiple entries

The program splits the line and does something with position 1, which is the protein ID in the xml file. After that I am not really sure what is happening. I can't remember what the %s means. Something with a string?

When this runs I get the following error:

    Traceback (most recent call last):
      File "/Users/ara/Desktop/biopy_programs/merge2.py", line 18, in <module>
        rslt = "%s,%s"% (line,PIDS[ID])
    KeyError: 'StAA4_010100017400,'

From what I can tell it's not happy about the dictionary key.

In the end I am looking for a way to merge these two files and, for each protein ID, add the locus tag and start/stop like this:

    *Species, Protein ID, Locus Tag, E Value, Length*, *Start/Stop*
    Streptomyces sp. AA4, ZP_05482482, StAA4_010100030484, 2.8293600000000001e-140, 5256, complement(NZ_ACEV01000078.1:25146..40916)
    Streptomyces sp. AA4, ZP_05482482, StAA4_010100030484, 8.0333299999999997e-138, 5256, complement(NZ_ACEV01000078.1:25146..40916)
    Streptomyces sp. AA4, ZP_05482482, StAA4_010100030484, 1.08889e-124, 5256, complement(NZ_ACEV01000078.1:25146..40916)
    Streptomyces sp. AA4, ZP_07281899, SSMG_05939, 2.9253900000000001e-140, 5260, complement(NZ_GG657746.1:6565974..6581756)
    Streptomyces sp. AA4, ZP_07281899, SSMG_05939, 8.2369599999999995e-138, 5260, complement(NZ_GG657746.1:6565974..6581756)

Do you have any suggestions for how to proceed? It feels like I am getting closer. :)

Note: When I change this part of the code to use position 0:

    for line in xml:
        ID = proteinVals.split()[0]
        rslt = "%s,%s"% (line,PIDS[ID])
        print rslt

I get the following output:

    Streptomyces sp. AA4, ZP_05482482, 8.0333299999999997e-138, 5256,
    ,ZP_05479896, StAA4_010100017400, NZ_ACEV01000043.1:241968..>242983
    Streptomyces sp. AA4, ZP_05482482, 1.08889e-124, 5256,
    ,ZP_05479896, StAA4_010100017400, NZ_ACEV01000043.1:241968..>242983
    Streptomyces sp. AA4, ZP_07281899, 2.9253900000000001e-140, 5260,
    ,ZP_05479896, StAA4_010100017400, NZ_ACEV01000043.1:241968..>242983

Which seems closer, but all it's doing is repeating the same Locus Tag and Start/Stop for each entry.

Thank you!

Ara

--
Quis hic locus, quae regio, quae mundi plaga. Ubi sum. Sub ortu solis an sub cardine glacialis ursae.
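
One possible way to do the merge described above, offered as a rough sketch rather than a definitive fix. It assumes that both files are comma-separated exactly as in the samples, that species names contain no commas, and that rows whose Protein ID has no match should simply be skipped. The function names parse_gen and merge_rows are made up for illustration. Two things differ from the posted loop: the ID is taken from the current line of the xml file (not from proteinVals, which still holds the last line of the gen file), and fields are split on commas and stripped so that keys such as 'ZP_05482482' carry no trailing comma. The "%s" placeholders are filled in with the string form of each value given after the % operator.

```python
def parse_gen(lines):
    """Map Protein ID -> (locus tag, start/stop) for gen-style lines."""
    pids = {}
    for line in lines:
        # Split on the first two commas only, so the location stays in one piece.
        fields = [f.strip() for f in line.split(",", 2)]
        if len(fields) < 3 or not fields[0]:
            continue  # skip blank or malformed lines
        pids[fields[0]] = (fields[1], fields[2])
    return pids


def merge_rows(xml_lines, pids):
    """Yield 'Species, Protein ID, Locus Tag, E Value, Length, Start/Stop' rows."""
    for line in xml_lines:
        fields = [f.strip() for f in line.split(",")]
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        species, pid, evalue, length = fields[:4]
        if pid not in pids:
            continue  # assumption: rows without a matching ID are dropped
        locus, location = pids[pid]
        # Each %s is replaced by the string form of the matching value.
        yield "%s, %s, %s, %s, %s, %s" % (
            species, pid, locus, evalue, length, location)


# Sample lines copied from the post; with real files, pass the open file
# objects (for example open("final_gen.txt") and open("final.txt")) instead.
gen_sample = [
    "ZP_05482482, StAA4_010100030484, complement(NZ_ACEV01000078.1:25146..40916)\n",
    "ZP_07281899, SSMG_05939, complement(NZ_GG657746.1:6565974..6581756)\n",
]
xml_sample = [
    "Streptomyces sp. AA4, ZP_05482482, 2.8293600000000001e-140, 5256,\n",
    "Streptomyces sp. AA4, ZP_05482482, 1.08889e-124, 5256,\n",
    "Streptomyces sp. AA4, ZP_07281899, 2.9253900000000001e-140, 5260,\n",
]

merged = list(merge_rows(xml_sample, parse_gen(gen_sample)))
for row in merged:
    print(row)
```

Because the dictionary is looked up with the ID from each xml line, every row picks up its own locus tag and start/stop instead of repeating the last entry of the gen file.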