Vincent Davis wrote:
> I have never used difflib or anything similar and have a few questions.
> I am working with DNA sequences of length 25. I have a list of 230,000
> and need to look for each sequence in the entire genome (toxoplasma
> parasite). I am not sure how large the genome is, but it is more than
> 230,000 sequences.
> There are programs that do this really fast, and they even do partial
> matches, but not quite what I need. So I am looking to build a custom
> solution.
> I need to look for each of my sequences of 25 characters, for
> example (AGCCTCCCATGATTGAACAGATCAT).
> The genome is formatted as a continuous string
> (CATGGGAGGCTTGCGGAGCCTGAGGGCGGAGCCTGAGGTGGGAGGCTTGCGGAG.........)
>
> I don't care where or how many times, only if it exists. This is simple,
> I think: str.find(AGCCTCCCATGATTGAACAGATCAT)
>
> But I also want to find a close match, defined as wrong at only 1
> location, and I want to record the location. I am not sure how to do
> this. The only thing I can think of is using a wildcard and performing
> the search with a wildcard in each position, i.e. 25 times.
> For example
> AGCCTCCCATGATTGAACAGATCAT
> AGCCTCCCATGATAGAACAGATCAT
> close match with a mismatch at position 13
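
Untested:

    import fnmatch

    genome = 'CATGGGAGGCTTGCGGAGCCTGAGGGCGGAGCCTGAGGTGGGAGGCTTGCGGAG........'
    sequence = 'AGGCTTGCGGAGCCTGAGGGCGGAG'

    # Put a single-character wildcard ('?') at each position in turn.
    for i in range(len(sequence)):
        match = '*' + sequence[0:i] + '?' + sequence[i+1:] + '*'
        # fnmatchcase skips the platform-dependent case folding of fnmatch()
        if fnmatch.fnmatchcase(genome, match):
            print('It matches (wildcard at position %d)' % i)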
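
The wildcard loop only tells you that a match exists somewhere, not where the mismatch is. If you also want to record the location, one option is to slide a 25-character window along the genome and count the differing positions directly. Below is a minimal sketch of that idea; the function name find_near_matches and the max_mismatches parameter are made up for illustration, and the toy genome is not real data.

```python
def find_near_matches(genome, sequence, max_mismatches=1):
    """Return a list of (offset, mismatch_positions) for every window of
    genome that differs from sequence in at most max_mismatches places.
    Positions are 0-based indexes into sequence. (Hypothetical helper.)"""
    hits = []
    width = len(sequence)
    for offset in range(len(genome) - width + 1):
        mismatches = []
        for i in range(width):
            if genome[offset + i] != sequence[i]:
                mismatches.append(i)
                if len(mismatches) > max_mismatches:
                    break  # too many differences, give up on this window
        else:
            # The inner loop finished without break: window is close enough.
            hits.append((offset, mismatches))
    return hits


# Toy genome: the near match from the question, padded on both sides.
genome = 'GG' + 'AGCCTCCCATGATAGAACAGATCAT' + 'CC'
print(find_near_matches(genome, 'AGCCTCCCATGATTGAACAGATCAT'))
# prints [(2, [13])]
```

An exact match shows up with an empty mismatch list. This is plain Python and checks every window, so it will probably be slow for 230,000 sequences against a whole genome; treat it as a way to check results on small inputs rather than a finished solution.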