> From: Danny Yoo <[EMAIL PROTECTED]>
> [...]
> The Regular Expression HOWTO itself is pretty good and talks about some of
> the stuff you've been running into, so here's a link to the base url that
> you may want to look at:
>
> http://www.amk.ca/python/howto/regex/

Ah yes, I've been reading that same doc and got confused about the use of the "r", I guess.

[......]

> > I also found that on some of the strings I want to extract, when python
> > reads them using file.read(), there are newline characters and other
> > stuff that doesn't show up in the actual html source.
>
> Not certain that I understand what you mean there. Can you show us?
> read() should not adulterate the byte stream that comes out of your
> files.

>>> file = open("file1.html")
>>> file.read()
'<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"\r\n"http://www.w3.org/TR/html4/loose.dtd">\r\n<html>\r\n<head>\r\n<!-- Script for select box changes -->\r\n<script type="text/javascript">\r\n
[...]

That's just a snippet from the HTML code. I'm guessing it won't cause any problems, since it's just the newlines from reading the HTML code and not actually *in* the code.

[...]

> As an aside: the problems you're running into are very much why we
> encourage folks not to process HTML with regular expressions: RE's also
> come with their own somewhat-high learning curve.
>
> Good luck to you.

Yes, I'm seeing this right now, hehe... but since all the files I have to process have the same structure (they were generated by a script), I think it might be easier to use REs here. Do you have any idea what other tool I could use? I took a look at BeautifulSoup, but it seemed a bit overkill and very much beyond my current Python knowledge.

Thanks!
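
A minimal sketch of the two things that came up above: the "r" prefix and the \r\n showing up in the read() output. It assumes Python 3 syntax and uses a made-up sample string in place of file1.html; the pattern and the variable names are only for illustration. The \r\n pairs are most likely the file's own Windows-style line endings, which the interactive prompt displays as escapes because it shows the repr() of the string.

```python
import re

# Hypothetical sample standing in for the contents of file1.html;
# the real file is not reproduced here.
sample = (
    '<html>\r\n<head>\r\n<script type="text/javascript">\r\n'
    '</script>\r\n</head>\r\n</html>\r\n'
)

# The interactive prompt shows repr(), which spells a carriage return and
# a newline as the escapes \r and \n. print(sample) would write the
# characters themselves instead.
shown_at_prompt = repr(sample)
crlf_count = sample.count("\r\n")

# The "r" prefix only changes how a string literal is read from source code:
# in "\n" the escape becomes a newline, in r"\n" the backslash stays.
plain_len = len("\n")   # one character: a newline
raw_len = len(r"\n")    # two characters: a backslash and the letter n

# With a raw-string pattern, \s reaches the re module untouched and matches
# any whitespace, including the \r\n between the tags.
pattern = re.compile(r'<head>\s*<script type="([^"]+)">')
match = pattern.search(sample)
script_type = match.group(1) if match else None

print(shown_at_prompt)
print(crlf_count, plain_len, raw_len, script_type)
```

Writing patterns as raw strings and using \s* wherever a line break may sit between tags should keep the line endings from getting in the way, at least for files that all share the same generated structure.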

_______________________________________________
Tutor maillist - Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor