My 2 cents:
I agree, BeautifulSoup is great for HTML parsing.
It's a little weird and unintuitive at first, but once you get it, it
becomes quite an asset.
>When you have time, do try going through a few examples with
>BeautifulSoup. The web page there comes with some interesting examples,
>and
> > category = m.group(1)
> > >
> > > Traceback (most recent call last):
> > > File "", line 1, in ?
> > > AttributeError: 'NoneType' object has no attribute 'group'
> >
> > In this case the match failed, so m is None and m.group(1) gives an
> >error.
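
A minimal sketch of guarding against a failed match, so the AttributeError above cannot occur. The pattern and the helper name extract_category are assumptions made for illustration; they are not taken from the original script.

```python
import re

# Hypothetical pattern for a line such as "Category:Category1".
cat_re = re.compile(r"Category:(.*)")

def extract_category(text):
    """Return the captured category, or None when the pattern does not match."""
    m = cat_re.search(text)
    if m is None:
        # search() returns None on failure, so calling m.group(1) here would raise.
        return None
    return m.group(1).strip()

print(extract_category("Category:Category1"))
print(extract_category("Title:Something else"))
```

Checking the result of search() before calling group() turns a traceback into a value the caller can test for.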
>
> So my problem is in the actual regex?
> > > I also found that on some of the strings I want to extract, when
> > > Python reads them using file.read(), there are newline characters
> > > and other stuff that doesn't show up in the actual html source.
> >
> > Not certain that I understand what you mean there. Can you show us?
> > read(
Kent Johnson wrote:
> >
> import re
> file = open("file1.html")
> data = file.read()
> catRe = re.compile(r'Title:(.*?)')
>
> This regex does not agree with the data you originally posted. Your
> original data was
> Category:Category1
>
> Do you see the difference? Your regex has
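
A small sketch of the mismatch, using the pattern exactly as it appears in this thread (the archive may have cut part of it off). The Category pattern is a hypothetical replacement, not something quoted from the posts. It also shows that a lazy group at the very end of a pattern captures an empty string, because nothing after it forces it to grow.

```python
import re

data = "Category:Category1"   # the line format from the original post

title_re = re.compile(r"Title:(.*?)")        # the pattern as shown in this thread
category_re = re.compile(r"Category:(.*)")   # hypothetical pattern using the posted label

title_match = title_re.search(data)          # the data has no "Title:" label
category_match = category_re.search(data)
lazy_capture = title_re.search("Title:Some title").group(1)

print(title_match)
print(category_match.group(1))
print(repr(lazy_capture))
```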
> From: Danny Yoo <[EMAIL PROTECTED]>
>
[...]
> The Regular Expression HOWTO itself is pretty good and talks about some of
> the stuff you've been running into, so here's a link to the base url that
> you may want to look at:
>
> http://www.amk.ca/python/howto/regex/
Ah yes I've been readin
Oswaldo Martinez wrote:
> OK, before I got into the loop in the script I decided to try first with one
> file, and I have some doubts about some parts of the script, plus I got an
> error:
>
>
import re
file = open("file1.html")
data = file.read()
catRe = re.compile(r'Title:(.*?)
> >>> import re
> >>> file = open("file1.html")
> >>> data = file.read()
> >>> catRe = re.compile(r'Title:(.*?)')
>
> # I searched around the docs on regexes I have and found that the "r"
> # after the re.compile(' will detect repeating words.
Hi Oswaldo,
Actually, no. What you're seeing is a "
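
For reference, a short sketch of what the r prefix does in Python: it marks a raw string literal, in which backslashes are kept as written. Detecting a repeated word is the job of a backreference such as \1 inside the pattern. The doubled-word pattern below follows the well-known example in the Regular Expression HOWTO, and that this is what caused the confusion is only a guess.

```python
import re

plain = "\bword\b"    # without r, Python turns \b into a backspace character
raw = r"\bword\b"     # with r, the backslashes reach the regex engine as \b

print(len(plain), len(raw))

# The word-boundary pattern only works in its raw form.
raw_found = re.search(raw, "a word here") is not None
plain_found = re.search(plain, "a word here") is not None
print(raw_found, plain_found)

# Repeated words are found by the backreference \1, not by the r prefix.
doubled = re.search(r"(\b\w+)\s+\1", "Paris in the the spring").group()
print(doubled)
```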
these into
account in the regex, or will it automatically include them?
> --- Original Message ---
> From: Kent Johnson <[EMAIL PROTECTED]>
> To: Python Tutor
> Subject: Re: [Tutor] Extracting data from HTML files
> Date: Thu, 29 Dec 2005 14:18:38 -0500
>
> Try so
[EMAIL PROTECTED] wrote:
> The HTML comes from a bunch of files which are saved on my computer. They
> were generated by a PHP script and I want to extract certain fields for
> insertion into a MySQL db.
> I'm trying to get the hang of correctly opening the files first :)
> There are about a thousa
loop in the script
since the files are named article1.html,article2.html,etc.
Thanks for the help!
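
One possible way to loop over files named article1.html, article2.html and so on is sketched below. The directory, the file count, and the encoding are assumptions, and both helper names are made up for this illustration.

```python
from pathlib import Path

def article_paths(directory, count):
    """Build the expected paths article1.html .. article<count>.html."""
    return [Path(directory) / f"article{i}.html" for i in range(1, count + 1)]

def read_existing(paths):
    """Yield (path, text) for each path that exists, skipping gaps in the numbering."""
    for path in paths:
        if path.is_file():
            # The encoding is a guess; errors="replace" keeps odd bytes from aborting the run.
            yield path, path.read_text(encoding="utf-8", errors="replace")

# Example: the first three expected file names in the current directory.
for p in article_paths(".", 3):
    print(p.name)
```

Building the names from a counter avoids depending on directory listing order, and the is_file() check lets the loop survive a missing number.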
> --- Original Message ---
> From: Kent Johnson <[EMAIL PROTECTED]>
> To: unknown
> Cc: tutor@python.org
> Subject: Re: [Tutor] Extracting data from HTML files
> Date
[EMAIL PROTECTED] wrote:
> I'll also take a look at regexes as recommended by Kent Johnson to see if
> it'll work here. My guess is this is the way to go since the data I need is
> always on the same line number in the HTML source. So I could just go to the
> specific line numbers, look for my data a
tips they are more than
welcome :)
Thanks again and happy holidays!
> --- Original Message ---
> From: Kent Johnson <[EMAIL PROTECTED]>
> To: Python Tutor
> Subject: Re: [Tutor] Extracting data from HTML files
> Date: Wed, 28 Dec 2005 22:16:47 -0500
>
> [EMAI
[EMAIL PROTECTED] wrote:
> I'm trying to make a Python script for extracting certain data from HTML
> files. These files are from a template so they all have the same formatting. I
> just want to extract the data from certain fields. It would also be nice to
> insert it into a MySQL database, but I'll
At 01:26 PM 12/28/2005, [EMAIL PROTECTED] wrote:
>[snip]
>I'm trying to make a Python script for extracting certain data from HTML
>files. Say, for example, the HTML file has the following format:
>Category:Category1
>[...]
>Name:Filename.exe
>[...]
>Description:Description1.
>
>Taking in to accoun
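
A rough sketch of pulling the three fields out of text in the format posted above. It assumes each label sits at the start of its own line, exactly as shown here; the real files presumably contain HTML tags around these values that the archive did not preserve, so the pattern would need adjusting. FIELD_RE and extract_fields are names invented for this example.

```python
import re

# One label per line, value running to the end of that line.
FIELD_RE = re.compile(
    r"^(Category|Name|Description):[ \t]*(.*?)[ \t]*$",
    re.MULTILINE,
)

def extract_fields(text):
    """Return a dict mapping each label found in text to its value."""
    return {label: value for label, value in FIELD_RE.findall(text)}

sample = """Category:Category1
[...]
Name:Filename.exe
[...]
Description:Description1.
"""

print(extract_fields(sample))
```

The resulting dict could then be handed to whatever database code does the MySQL insert, which is left out here.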