I have never used difflib or anything similar and have a few questions. I am working with DNA sequences of length 25. I have a list of 230,000 of them and need to look for each one in the entire genome (Toxoplasma parasite). I am not sure how large the genome is, but it is larger than 230,000 sequences. There are programs that do this really fast, and they even do partial matches, but not quite what I need. So I am looking to build a custom solution. I need to look for each of my 25-character sequences, for example AGCCTCCCATGATTGAACAGATCAT. The genome is formatted as one continuous string (CATGGGAGGCTTGCGGAGCCTGAGGGCGGAGCCTGAGGTGGGAGGCTTGCGGAG.........)
I don't care where or how many times a sequence occurs, only whether it exists. This is simple, I think: str.find("AGCCTCCCATGATTGAACAGATCAT"). But I also want to find a close match, defined as wrong at only 1 location, and I want to record the location of the mismatch. I am not sure how to do this. The only thing I can think of is using a wildcard and performing the search with a wildcard in each position, i.e. 25 times.

For example:
AGCCTCCCATGATTGAACAGATCAT
AGCCTCCCATGATAGAACAGATCAT
is a close match with a mismatch at position 13 (counting from 0).

Vincent Davis
720-301-3003
vinc...@vincentdavis.net
my blog <http://vincentdavis.net> | LinkedIn <http://www.linkedin.com/in/vincentdavis>
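
A minimal sketch of the wildcard idea described above, in case it helps frame the question. It assumes the genome is already loaded as one plain string with no newlines. The function name find_exact_or_one_mismatch is made up for illustration, and the regular-expression approach is just one way to express "a wildcard in each position", not necessarily the fastest.

```python
import re


def find_exact_or_one_mismatch(seq, genome):
    """Classify seq against genome.

    Returns ("exact", None) if seq occurs unchanged,
    ("close", i) if it occurs with a single mismatch at
    0-based position i, and ("none", None) otherwise.
    """
    # Exact match: a plain substring test is enough.
    if seq in genome:
        return ("exact", None)

    # One mismatch: try a wildcard in each position in turn.
    for i in range(len(seq)):
        # "." matches any single character, so position i is free.
        pattern = re.escape(seq[:i]) + "." + re.escape(seq[i + 1:])
        if re.search(pattern, genome):
            return ("close", i)

    return ("none", None)


# Hypothetical toy genome containing the close match from the example.
toy_genome = "CATGGGAGG" + "AGCCTCCCATGATAGAACAGATCAT" + "TTGCGG"
result = find_exact_or_one_mismatch("AGCCTCCCATGATTGAACAGATCAT", toy_genome)
print(result)
```

Each call scans the genome up to 26 times (one exact test plus 25 wildcard searches), so running it for all 230,000 sequences on a full genome is likely to be slow. It is meant only to pin down the matching rule.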