Vincent Davis wrote:
> I have never used difflib or similar and have a few questions.
> I am working with DNA sequences of length 25. I have a list of 230,000
> and need to look for each sequence in the entire genome (toxoplasma
> parasite). I am not sure how large the genome is, but more than 230,000
> sequences.
> There are programs that do this and really fast, and they even do partial
> matches, but not quite what I need. So I am looking to build a custom
> solution.
> I need to look for each of my sequences of 25 characters,
> for example (AGCCTCCCATGATTGAACAGATCAT).
> The genome is formatted as a continuous string
> (CATGGGAGGCTTGCGGAGCCTGAGGGCGGAGCCTGAGGTGGGAGGCTTGCGGAG.........)
>
> I don't care where or how many times, only whether it exists. This is
> simple, I think: str.find(AGCCTCCCATGATTGAACAGATCAT)
>
> But I also want to find a close match, defined as wrong at only 1
> location, and I want to record the location. I am not sure how to do
> this. The only thing I can think of is using a wildcard and performing
> the search with a wildcard in each position, i.e. 25 times.
> For example
> AGCCTCCCATGATTGAACAGATCAT
> AGCCTCCCATGATAGAACAGATCAT
> is a close match with a mismatch at position 13
Also:

    import fnmatch

    sequence = 'AGGCTTGCGGAGCCTGAGGGCGGAG'
    # One pattern per position: '?' matches any single character there,
    # and the surrounding '*' let the sequence sit anywhere in the genome.
    seqList = ['*' + sequence[0:i] + '?' + sequence[i+1:] + '*'
               for i in range(len(sequence))]

    genome = 'CATGGGAGGCTTGCGGAGCCTGAGGGCGGAGCCTGAGGTGGGAGGCTTGCGGAG........'
    if any(fnmatch.fnmatch(genome, i) for i in seqList):
        print('It matches')

Which might be better if the sequence is fixed and the genome changes inside a loop.

HTH
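P.S. The wildcard patterns only tell you whether something matched. If you also need to record where the mismatch is, a plain sliding-window comparison might be easier to adapt. The following is only a rough sketch: `find_near_matches` and `max_mismatches` are names I made up for illustration, not part of any library, and it assumes the genome is already loaded as one Python string. It checks every window, so it will likely be slow for 230,000 sequences against a whole genome.

```python
def find_near_matches(genome, sequence, max_mismatches=1):
    """Yield (offset, mismatch_positions) for every window of `genome`
    that differs from `sequence` in at most `max_mismatches` places.

    Hypothetical helper for illustration; positions are 0-based.
    """
    n = len(sequence)
    for offset in range(len(genome) - n + 1):
        mismatches = []
        for i in range(n):
            if genome[offset + i] != sequence[i]:
                mismatches.append(i)
                if len(mismatches) > max_mismatches:
                    break  # too many differences, give up on this window
        else:
            # The inner loop finished without hitting the limit.
            yield offset, mismatches


if __name__ == '__main__':
    # The two sequences from the question, embedded in a made-up genome.
    sequence = 'AGCCTCCCATGATTGAACAGATCAT'
    genome = 'GG' + 'AGCCTCCCATGATAGAACAGATCAT' + 'CC'
    for offset, mismatches in find_near_matches(genome, sequence):
        print(offset, mismatches)
```

An empty mismatch list means an exact match at that offset, so one pass covers both the exact and the one-off case.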
_______________________________________________
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor