OK, before getting into the loop in the script I decided to try it with one file first. I have some doubts about some parts of the script, and I also got an error:

>>> import re
>>> file = open("file1.html")
>>> data = file.read()
>>> catRe = re.compile(r'<strong>Title:</strong>(.*?)<br><strong>')
# I searched the regex docs I have and found that the "r" after
# re.compile( will detect repeating words. Why is this useful in my
# case? I want to read the whole string even if it has repeating words.
# Also, I don't understand the actual regex (.*?). If I want to match
# everything between </strong> and <br><strong>, shouldn't I just put
# a "*"? I tried that and it gave me an error, of course.
>>> m = catRe.search(data)
>>> category = m.group(1)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
AttributeError: 'NoneType' object has no attribute 'group'
>>>

I also found that in some of the strings I want to extract, when Python reads them using file.read(), there are newline characters and other stuff that doesn't show up in the actual HTML source. Do I have to take these into account in the regex, or will it include them automatically?

> --- Original Message ---
> From: Kent Johnson <[EMAIL PROTECTED]>
> To: Python Tutor <tutor@python.org>
> Subject: Re: [Tutor] Extracting data from HTML files
> Date: Thu, 29 Dec 2005 14:18:38 -0500
>
> Try something like this:
>
> def process(data):
>     # this is a function you define to process the data from one file
>     pass
>
> maxFileIndex = ... # whatever the max count is
> for i in range(1, maxFileIndex+1):  # i will take on each value
>                                     # from 1 to maxFileIndex
>     name = 'article%s.html' % i     # make a file name
>     f = open(name)                  # open the file and read its contents
>     data = f.read()
>     f.close()
>     process(data)
>
> Kent
>
> PS Please reply to the list
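
For reference, here is a minimal sketch of the single-file lookup from the session above. It relies on a few points that are standard Python behaviour: the r prefix only marks a raw string literal (backslashes are kept as typed) and has nothing to do with repeated words; (.*?) is a non-greedy capture group; and search() returns None when nothing matches, which is what produces the AttributeError. The sample string and the extract_title helper are made up for illustration, and the re.DOTALL flag is only an assumption about why the real file does not match (by default "." does not match newlines). The markup in file1.html may also differ in other ways, such as <br /> or uppercase tags.

```python
import re

# Hypothetical sample text; the real file1.html may be laid out differently.
sample = ('<strong>Title:</strong> Some\narticle title<br>'
          '<strong>Author:</strong> Someone<br><strong>')

# '.'  matches any character ('\n' too, but only because of re.DOTALL)
# '*'  repeats the preceding item zero or more times (a bare '*' has nothing to repeat)
# '?'  after '*' makes the repeat non-greedy: stop at the first '<br><strong>'
title_re = re.compile(r'<strong>Title:</strong>(.*?)<br><strong>', re.DOTALL)

def extract_title(data):
    """Return the captured title text, or None if the pattern is not found."""
    m = title_re.search(data)
    if m is None:  # search() found nothing, so there is no group to read
        return None
    return m.group(1).strip()

title = extract_title(sample)
print(repr(title))
```

Checking for None before calling group() avoids the traceback and makes it easy to report which files did not match once this runs inside the loop.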