Hello,

I took a look at BeautifulSoup as recommended by Bob, and from what I can tell it's just what I'm looking for, but I'm afraid it's a bit beyond my current Python skills. I'll keep playing with it and see what I can come up with.

I'll also take a look at regexes, as recommended by Kent Johnson, to see if they'll work here. My guess is that this is the way to go, since the data I need is always on the same line numbers in the HTML source. So I could just go to those specific lines, look for my data and strip out the unnecessary tags.

Thanks for the help guys, if anyone's got more tips they are more than welcome :)

Thanks again and happy holidays!
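For anyone following along, here is a minimal sketch of the two ideas mentioned above: reading a field from a fixed line number and stripping its tags, and matching fields by their label with a regex. It is only an illustration under assumptions. SAMPLE_HTML is a made-up stand-in for the real template, and the names TAG_RE, FIELD_RE, field_from_line and fields_by_label are hypothetical helpers, not part of any library. The label regex also assumes every field looks like <strong>Label:</strong>value<br><br>, as in the example from the original question.

```python
import re

# Hypothetical sample mirroring the template from the original question;
# the real files and their line numbers will differ.
SAMPLE_HTML = """<html><body>
<strong>Category:</strong>Category1<br><br>
<p>other markup</p>
<strong>Name:</strong>Filename.exe<br><br>
<p>more markup</p>
<strong>Description:</strong>Description1.<br><br>
</body></html>"""

# Matches any single HTML tag such as <strong> or <br>.
TAG_RE = re.compile(r"<[^>]+>")

# Matches <strong>Label:</strong>value<br><br> and captures label and value.
FIELD_RE = re.compile(r"<strong>(\w+):</strong>(.*?)<br><br>", re.DOTALL)


def field_from_line(html, line_number):
    """Return the value found on a 1-based line number, with tags removed.

    The text before the first colon is treated as the label and dropped.
    If the line has no colon, the whole stripped line is returned.
    """
    line = html.splitlines()[line_number - 1]
    text = TAG_RE.sub("", line).strip()
    _label, sep, value = text.partition(":")
    return value.strip() if sep else text


def fields_by_label(html):
    """Return a dict mapping each field label to its value."""
    return {label: value.strip() for label, value in FIELD_RE.findall(html)}


if __name__ == "__main__":
    print(field_from_line(SAMPLE_HTML, 4))
    print(fields_by_label(SAMPLE_HTML))
```

The line-number version breaks as soon as a file has one extra or missing line, so the label-based regex is probably the safer of the two if the template ever varies.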
> --- Original Message ---
> From: Kent Johnson <[EMAIL PROTECTED]>
> To: Python Tutor <tutor@python.org>
> Subject: Re: [Tutor] Extracting data from HTML files
> Date: Wed, 28 Dec 2005 22:16:47 -0500
>
> [EMAIL PROTECTED] wrote:
> > I'm trying to make a Python script for extracting certain data from HTML
> > files. These files are from a template so they all have the same
> > formatting. I just want to extract the data from certain fields. It would
> > also be nice to insert it into a MySQL database, but I'll leave that for
> > later since I'm stuck in just reading the files.
> > Say for example the HTML file has the following format:
> >
> > <strong>Category:</strong>Category1<br><br>
> > [...]
> > <strong>Name:</strong>Filename.exe<br><br>
> > [...]
> > <strong>Description:</strong>Description1.<br><br>
>
> Since your data is all in the same form, I think a regex will easily
> find this data. Something like
>
> import re
> catRe = re.compile(r'<strong>Category:</strong>(.*?)<br><br>')
> data = ...read the HTML file here
> m = catRe.search(data)
> category = m.group(1)
>
> > I also thought regexes might be useful for this but I suck at using
> > regexes so that's another problem.
>
> Regexes take some effort to learn but it is worth it, they are a very
> useful tool in many contexts, not just Python. Have you read the regex
> HOW-TO?
> http://www.amk.ca/python/howto/regex/
>
> Kent
>
> _______________________________________________
> Tutor maillist - Tutor@python.org
> http://mail.python.org/mailman/listinfo/tutor