> Ok, what I have is a RNA sequence (about 5 million nucleotides > [characters] long) and have (4100) subsequences (from another sequence) > and the sub-sequences are 6 characters long, that I want to find in it.
Hi Lauren, I am positive that this problem has been tackled before. Have you talked to any of your fellow bioinformaticists? It might not be effective to ask on Python-tutor because, for typical problems, regexes are sufficient. For your particular scenario, regexes may not be, so you may want to ask specialists like those on the Biopython mailing lists: http://biopython.org/wiki/Mailing_lists It may seem like a simple extension of regular expression search, but as you may have noticed, feeding 4100 regex searches across that RNA sequence is going to take some time. And trying to feed all of those subsequences as a single regular expression (using ORs) is probably not going to work too well either: the engine has limitations on how large the pattern can be before it starts behaving very badly. To scale to this problem, we need to do something different. What you're probably looking for is more typically known as the keyword matching problem, and there are algorithms specifically used for keyword matching. For example, take a look at Nicholas Lehuen's Pytst package, which implements ternary search trees: http://nicolas.lehuen.com/index.php/category/Pytst I've also written a different package that uses the "Aho-Corasick" automata matcher, but to tell the truth I've let the code stale a bit, and can't support it (too busy with other work). But if you're interested in tinkering with it, it's here: http://hkn.eecs.berkeley.edu/~dyoo/python/ahocorasick/. If you're going to do more string-matching stuff, I'd recommend a look into "Algorithms on Strings, Trees, and Sequences". http://www.cambridge.org/uk/catalogue/catalogue.asp?isbn=0521585198 It's one of the better books on bioinformatics string algorithms, and it specifically covers this class of sequence-searching problems. _______________________________________________ Tutor maillist - Tutor@python.org http://mail.python.org/mailman/listinfo/tutor