> >>> import re
> >>> file = open("file1.html")
> >>> data = file.read()
> >>> catRe = re.compile(r'<strong>Title:</strong>(.*?)<br><strong>')
>
> # I searched around the docs on regexes I have and found that the "r"
> # after the re.compile(' will detect repeating words.

Hi Oswaldo,

Actually, no. What you're seeing is a "raw" string literal. See:

    http://www.amk.ca/python/howto/regex/regex.html#SECTION000420000000000000000

for more details about this. The idea is that we often want to make strings where backslashes are just literally backslashes, rather than treated by Python as escape characters.

The Regular Expression HOWTO itself is pretty good and talks about some of the stuff you've been running into, so here's a link to the base url that you may want to look at:

    http://www.amk.ca/python/howto/regex/

> I want to read the whole string even if it has repeating words. #Also,
> I dont understand the actual regex (.*?) . If I want to match
> #everything inside </strong> and <br><strong> , shouldn`t I just put a
> "*" #?

You're confusing the "globbing" notation used in Unix shells with the miniature pattern language used in regular expressions. They both use similar symbols, but with totally different interpretations. Be aware of this context, as it's easy to get confused because of their surface similarities.

For example,

    "ab*"

under a globbing interpretation means: 'a' and 'b', followed by any number of characters. But under a regular expression interpretation, this means: 'a', followed by any number of 'b's.

As a followup: to express the idea: "'a' and 'b', followed by any number of characters," as a regular expression pattern, we'd write:

    "ab.*"

Your own (.*?) is built the same way: '.' matches any single character, '*' repeats it any number of times, the trailing '?' makes that repetition non-greedy, so it stops at the first '<br><strong>' rather than the last, and the parentheses capture whatever was matched.

So any globbing pattern can be translated fairly easily to a regular expression pattern. However, going the other way doesn't usually work: it's often not possible to take an arbitrary regular expression, like "ab*", and make it work as a glob. So regular expressions are more expressive than globs, but with that power comes great resp... err, I mean, more complexity. *grin*

> I also found that on some of the strings I want to extract, when python
> reads them using file.read(), there are newline characters and other
> stuff that doesn`t show up in the actual html source.

Not certain that I understand what you mean there. Can you show us? read() should not adulterate the byte stream that comes out of your files.

> Do I have to take these in to account in the regex or will it
> automatically include them?

Newlines are, by default, handled differently than other characters: the '.' metacharacter matches any character except a newline. You can add an 're.DOTALL' flag so that newlines are also matched by '.'; see the Regex HOWTO above to see how this might work.

As an aside: the problems you're running into are very much why we encourage folks not to process HTML with regular expressions: REs also come with their own somewhat-high learning curve.

Good luck to you.

_______________________________________________
Tutor maillist - Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor
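
P.S. In case it helps to see the points above side by side, here is a small sketch. It is only an illustration, not your actual data: the 'html' string is made up, the variable names are my own, I'm using the standard fnmatch module as a stand-in for shell globbing, and it assumes Python 3 (re.fullmatch is not available in older versions).

```python
import fnmatch
import re

# Raw strings: the r prefix keeps backslashes literal.
plain_len = len("\n")    # one character: a newline
raw_len = len(r"\n")     # two characters: a backslash and an 'n'

# The same text "ab*" read as a glob and as a regular expression.
glob_hit = fnmatch.fnmatchcase("abxyz", "ab*")           # 'ab', then anything
regex_hit = re.fullmatch(r"ab*", "abxyz") is not None    # 'a', then only 'b's
regex_bs = re.fullmatch(r"ab*", "abbb") is not None
regex_glob = re.fullmatch(r"ab.*", "abxyz") is not None  # regex spelling of the glob

# Made-up HTML with a newline inside the title (an assumption for this demo).
html = ("<strong>Title:</strong>Learning\nPython<br>"
        "<strong>Author:</strong>Someone<br><strong>")

pattern = r"<strong>Title:</strong>(.*?)<br><strong>"

# Without re.DOTALL, '.' will not cross the newline, so nothing matches.
no_flag = re.search(pattern, html)

# With re.DOTALL, '.' also matches newlines; the non-greedy (.*?) stops
# at the first '<br><strong>'.
lazy = re.search(pattern, html, re.DOTALL)

# A greedy (.*) runs on to the last '<br><strong>' instead.
greedy = re.search(r"<strong>Title:</strong>(.*)<br><strong>", html, re.DOTALL)

print(plain_len, raw_len)
print(glob_hit, regex_hit, regex_bs, regex_glob)
print(no_flag)
print(repr(lazy.group(1)))
print(repr(greedy.group(1)))
```

The last two lines are the ones to compare: the non-greedy version captures only the title, while the greedy one swallows the Author field as well.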