> So, what makes regex wrong for this job? question still remains: does the
> search start at the beginning of the line each time or does it step forward
> from the last search? I will check out beautiful soup as suggested in a
> subsequent mail; I'd still like to finish this process:<}}

Mathematically, regular expressions can capture a certain class of text
called the "regular languages". Regular languages have a few
characteristics, and those characteristics imply limits on what a regular
expression can recognize. As a concrete example of a limitation: you can't
write a pattern that properly does parentheses matching with a regular
expression alone.

This isn't a challenge to your machismo: it's a matter of mathematics! For
the precise details on the impossibility proof, you'd need to take a CS
theory class, and in particular, learn about the "pumping lemma for regular
languages". Sipser's "Introduction to the Theory of Computation" has a good
presentation. This is one reason why CS theory matters: it can tell you
when some approach is not a good idea. :P

HTML is not a regular language: it has nested substructure. The problem of
matching start and end tags is essentially the same as the problem of
matching balanced parentheses.

So those are the objections from the purely mathematical point of view.

This is not to say that regular expressions are useless: they work well for
breaking down HTML into a sequence of tokens. If you only care about
processing individual tokens one at a time, regexes might be appropriate.
They're just not the best tool for everything.

From a practical point of view: HTML parsing libraries such as Beautiful
Soup are nicer to work with than plain regular expressions.
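
To make the tokens-versus-nesting point concrete, here is a minimal sketch
in Python. The regex only splits the markup into tokens; the nesting check
is done by an explicit stack, which is exactly the bookkeeping a regular
expression alone cannot do. The names TOKEN, tokenize and is_balanced are
made up for this illustration, and the sketch assumes very simple markup: no
void elements like <br>, no comments, and no ">" inside attribute values.
Note also that re.finditer steps forward from the end of the previous match
rather than restarting at the beginning of the string.

```python
import re

# Hypothetical token pattern: an opening or closing tag, or a run of text.
TOKEN = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>|([^<]+)")


def tokenize(markup):
    """Yield ('open', name), ('close', name) or ('text', data) tokens."""
    # finditer scans left to right; each match starts where the last ended.
    for m in TOKEN.finditer(markup):
        slash, name, text = m.groups()
        if name is not None:
            yield ("close" if slash else "open", name.lower())
        else:
            yield ("text", text)


def is_balanced(markup):
    """Check that every closing tag matches the most recent open tag."""
    stack = []  # the memory a plain regular expression does not have
    for kind, value in tokenize(markup):
        if kind == "open":
            stack.append(value)
        elif kind == "close":
            if not stack or stack.pop() != value:
                return False
    return not stack


if __name__ == "__main__":
    print(list(tokenize("<b>x</b>")))
    print(is_balanced("<div><p>hi</p></div>"))
    print(is_balanced("<div><p>hi</div></p>"))
```

For real pages you would still want a proper parser such as Beautiful Soup,
since real HTML breaks every one of the simplifying assumptions above.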