HTML parsing confusion
Sorry for the noob question, but I've gone through the documentation
on python.org, tried some of the diveintopython and Boddie's examples,
and looked through some of the numerous posts in this group on the
subject and I'm still rather confused. I know that there are some
great tools out there for doing this (BeautifulSoup, lxml, etc.), but
I am trying to accomplish a simple task while adding a minimal (as in
nil) number of modules that aren't "stock" 2.5, and writing a huge
class of my own (or copying one from diveintopython) seems like
overkill for what I want to do.
Here's what I want to accomplish... I want to open a page, identify a
specific point in the page, and turn the information there into
plaintext. For example, on the www.diveintopython.org page, I want to
turn the paragraph that starts "Translations are freely
permitted" (and ends with "let me know") into a string variable.
Opening the file seems pretty straightforward.
>>> import urllib
>>> page = urllib.urlopen("http://diveintopython.org/")
>>> source = page.read()
>>> page.close()
gets me a string variable containing the unparsed contents of the
page.
Now things get confusing, though, since there appear to be several
approaches.
One that I read somewhere was:
>>> from xml.dom.ext.reader import HtmlLib
>>> reader = HtmlLib.Reader()
>>> doc = reader.fromString(source)
This gets me doc as a DOM Document object, and
>>> paragraphs = doc.getElementsByTagName('p')
gets me all of the paragraph elements, and the one I specifically want
can then be referenced with paragraphs[5]. This method seems pretty
straightforward, but what do I do with that element to get its text
into a string cleanly?
>>> from xml.dom.ext import PrettyPrint
>>> PrettyPrint(paragraphs[5])
shows me the text, but still in html, and I can't seem to get it
turned into a string variable; in any case, the PrettyPrint function
seems unnecessary for what I want to do. Formatter seems to do what I
want, but I can't figure out how to link the Element node at
paragraphs[5] with the formatter functions to produce the string I
want as output. I tried some of the htmllib.HTMLParser(formatter, ...)
examples, but while I can supposedly get that to work with formatter
a little more easily, I can't figure out how to get HTMLParser to
drill down specifically to the 6th paragraph's contents.
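Something like this is what I imagine it should boil down to (a
minimal sketch, assuming paragraphs[5] exposes the standard DOM
nodeType/childNodes attributes; untested against the PyXML reader):

from xml.dom import Node

def node_to_text(node):
    # text nodes carry the characters; recurse into everything else
    if node.nodeType == Node.TEXT_NODE:
        return node.data
    return "".join(node_to_text(child) for child in node.childNodes)

clean_text = node_to_text(paragraphs[5])

But I don't know whether that is the intended way to do it.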
Thanks in advance.
- A.
Re: HTML parsing confusion
> Pardon me, but the standard issue Python 2.n (for n in range(5, 2,
> -1)) doesn't have an xml.dom.ext ... you must have the mega-monstrous
> 200-modules PyXML package installed. And you don't want the 75Kb
> BeautifulSoup?
I wasn't aware that I had PyXML installed, and can't find a reference
to having it installed in pydocs. And that highlights the problem I
have at the moment with using other modules. I move from computer to
computer regularly, and while all of them have a recent copy of
Python, each has different (or no) extra modules, and I don't always
have the luxury of downloading extras.
That being said, if there's a simple way of doing it with
BeautifulSoup, please show me an example. Maybe I can figure out a
way to carry the extra modules I need around with me.
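For example, I gather the BeautifulSoup 3 version would be roughly
the following (a sketch I haven't been able to test, assuming the
single-file BeautifulSoup.py sits next to the script):

import urllib
from BeautifulSoup import BeautifulSoup

page = urllib.urlopen("http://diveintopython.org/")
soup = BeautifulSoup(page.read())
page.close()

# the sixth <p> on the page, then every text node inside it
paragraph = soup.findAll('p')[5]
clean_text = " ".join("".join(paragraph.findAll(text=True)).split())

If it really is that short, maybe carrying the module around with me
is worth it.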
Re: Problem with processing XML
On Jan 22, 9:11 am, John Carlyle-Clarke <[EMAIL PROTECTED]> wrote:
> By the way, is pyxml a live project or not? Should it still be used?
> It's odd that if you go to http://www.python.org/ and click the link
> "Using python for..." XML, it leads you to
> http://pyxml.sourceforge.net/topics/
>
> If you then follow the download links to
> http://sourceforge.net/project/showfiles.php?group_id=6473 you see
> that the latest file is 2004, and there are no versions for newer
> pythons. It also says "PyXML is no longer maintained". Shouldn't the
> link be removed from python.org?
I was wondering that myself. Any answer yet?
Re: HTML parsing confusion
On Jan 22, 8:44 am, Alnilam <[EMAIL PROTECTED]> wrote:
> > Pardon me, but the standard issue Python 2.n (for n in range(5, 2,
> > -1)) doesn't have an xml.dom.ext ... you must have the mega-monstrous
> > 200-modules PyXML package installed. And you don't want the 75Kb
> > BeautifulSoup?
>
> I wasn't aware that I had PyXML installed, and can't find a reference
> to having it installed in pydocs. ...
Ugh. Found it. Sorry about that, but I still don't understand why
there isn't a simple way to do this without using PyXML, BeautifulSoup
or libxml2dom. What's the point in having sgmllib, htmllib,
HTMLParser, and formatter all built in if I have to use someone
else's modules to write a couple of lines of code that achieve the
simple thing I want? I get the feeling that this would be easier if I
just broke down and wrote a couple of regular expressions, but it
hardly seems a 'pythonic' way of going about things.
# get the source (assuming you don't have it locally and have an
internet connection)
>>> import urllib
>>> page = urllib.urlopen("http://diveintopython.org/")
>>> source = page.read()
>>> page.close()
# set up some regex to find tags, strip them out, and correct some
formatting oddities
>>> import re
>>> p = re.compile(r'(<p>.*?</p>)',re.DOTALL)
>>> tag_strip = re.compile(r'>(.*?)<',re.DOTALL)
>>> fix_format = re.compile(r'\n +',re.MULTILINE)
# achieve clean results.
>>> paragraphs = re.findall(p,source)
>>> text_list = re.findall(tag_strip,paragraphs[5])
>>> text = "".join(text_list)
>>> clean_text = re.sub(fix_format," ",text)
This works, and is small and easily reproduced, but it seems like it
would break easily, and it feels like a waste of the *ML-specific
parsers that are already built in.
Re: HTML parsing confusion
On Jan 22, 11:39 am, "Diez B. Roggisch" <[EMAIL PROTECTED]> wrote:
> Alnilam wrote:
> > Ugh. Found it. Sorry about that, but I still don't understand why
> > there isn't a simple way to do this without using PyXML,
> > BeautifulSoup or libxml2dom. ...
>
> This is simply a gross misunderstanding of what BeautifulSoup or lxml
> accomplish. Dealing with mal-formatted HTML whilst trying to make
> _some_ sense of it is by no means trivial. And just because you can
> come up with a few lines of code using regexes that work for your
> current use-case doesn't mean that they serve as a general
> html-fixing routine. Or do you think the rather long history and 75Kb
> of code for BS are because its creator wasn't aware of regexes?
>
> And it also makes no sense stuffing everything remotely useful into
> the standard lib. That would force development and release cycles to
> be aligned, resulting in fewer features and less stability than can
> otherwise be achieved.
>
> And to be honest: I fail to see where your problem is. BeautifulSoup
> is a single Python file. So whatever you carry with you from machine
> to machine, if it's capable of holding a file of your own code, you
> can simply put BeautifulSoup beside it - even if it was a floppy
> disk.
>
> Diez
I am by no means trying to trivialize the work that goes into
creating the numerous modules out there. However, as a relatively
novice programmer trying to figure something out, the fact that these
modules are pushed on people with such zealous devotion that you take
offense at my desire not to use them gives me a bit of pause. I use
non-included modules for tasks that require them, when what I need
clearly can't be done easily another way (e.g. MySQLdb).
I am sure that there will be plenty of times when I will use
BeautifulSoup. In this instance, however, I was trying to solve a
specific problem, which I attempted to lay out clearly from the
outset. I was asking this community if there was a simple way to use
only the tools included with Python to parse a bit of html. If the
answer is no, that's fine. Confusing, but fine. If the answer is yes,
great. I look forward to learning from someone's example. If you
don't have an answer, or a positive contribution, then please don't
interject your angst into this thread.
Re: HTML parsing confusion
On Jan 22, 7:29 pm, "Gabriel Genellina" <[EMAIL PROTECTED]>
wrote:
>
> > I was asking this community if there was a simple way to use only the
> > tools included with Python to parse a bit of html.
>
> If you *know* that your document is valid HTML, you can use the HTMLParser
> module in the standard Python library. Or even the parser in the htmllib
> module. But a lot of HTML pages out there are invalid, some are grossly
> invalid, and those parsers are just unable to handle them. This is why
> modules like BeautifulSoup exist: they contain a lot of heuristics and
> trial-and-error and personal experience from the developers, in order to
> guess more or less what the page author intended to write and make some
> sense of that "tag soup".
> Guesswork like that is not suitable for the std lib ("Errors should
> never pass silently" and "In the face of ambiguity, refuse the temptation
> to guess.") but makes a perfect 3rd party module.
>
> If you want to use regular expressions, and that works OK for the
> documents you are handling now, fine. But don't complain when your RE's
> match too much or too little or don't match at all because of unclosed
> tags, improperly nested tags, nonsense markup, or just a valid combination
> that you didn't take into account.
>
> --
> Gabriel Genellina
Thanks, Gabriel. That does make sense, both what the benefits of
BeautifulSoup are and why it probably won't become std lib anytime
soon.
The pages I'm trying to write this code to run against aren't in the
wild, though. They are static html files on my company's lan, are very
consistent in format, and are (I believe) valid html. They just have
specific paragraphs of useful information, located in the same place
in each file, that I want to 'harvest' and put to better use. I used
diveintopython.org as an example only (and in part because it had good
clean html formatting). I am pretty sure that I could craft some
regular expressions to do the work -- which of course would not be the
case if I was screen scraping web pages in the 'wild' -- but I was
trying to find a way to do that using one of those std libs you
mentioned.
I'm not sure if HTMLParser or htmllib would work better to achieve the
same effect as the regex example I gave above, or how to get them to
do that. I thought I'd come close, but as someone pointed out early
on, I'd accidentally tapped into PyXML, which is installed where I was
testing code but not necessarily where I need it. It may turn out
that the regex way works faster, but falling back on methods I'm
comfortable with doesn't help expand my Python knowledge.
So if anyone can tell me how to get HTMLParser or htmllib to grab a
specific paragraph, and then provide the text in that paragraph in a
clean, markup-free format, I'd appreciate it.
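To show the sort of thing I mean, this is as far as I've gotten with
HTMLParser (a minimal sketch, untested against the real pages: it
counts <p> start tags and keeps the character data inside the sixth
one, though entities like &amp; would still need a handle_entityref
method):

from HTMLParser import HTMLParser

class ParagraphGrabber(HTMLParser):
    """Collect the character data inside the Nth <p> element."""
    def __init__(self, wanted):
        HTMLParser.__init__(self)
        self.wanted = wanted   # 0-based index of the <p> to keep
        self.count = -1        # index of the most recent <p> seen
        self.inside = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == 'p':
            self.count += 1
            self.inside = True

    def handle_endtag(self, tag):
        if tag == 'p':
            self.inside = False

    def handle_data(self, data):
        # also fires for text between inline tags, e.g. inside <a>
        if self.inside and self.count == self.wanted:
            self.chunks.append(data)

parser = ParagraphGrabber(5)
parser.feed(source)   # 'source' read with urllib as above
parser.close()
clean_text = " ".join("".join(parser.chunks).split())

Is something along these lines reasonable, or is there a cleaner way?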
Re: HTML parsing confusion
On Jan 23, 3:54 am, "M.-A. Lemburg" <[EMAIL PROTECTED]> wrote:
> >> I was asking this community if there was a simple way to use only
> >> the tools included with Python to parse a bit of html.
>
> There are lots of ways of doing HTML parsing in Python. A common one
> is e.g. using mxTidy to convert the HTML into valid XHTML and then
> use ElementTree to parse the data.
>
> http://www.egenix.com/files/python/mxTidy.html
> http://docs.python.org/lib/module-xml.etree.ElementTree.html
>
> For simple tasks you can also use the HTMLParser that's part of the
> Python std lib.
>
> http://docs.python.org/lib/module-HTMLParser.html
>
> Which tools to use really depends on what you are trying to solve.
>
> --
> Marc-Andre Lemburg
> eGenix.com
Thanks. So far that makes 3 votes for BeautifulSoup, and one vote
each for libxml2dom, pyparsing, and mxTidy. I'm sure those would all
be great solutions if I were looking to solve my coding question with
external modules.
Several folks have now mentioned that if I have files that are valid
XHTML, I could use htmllib, HTMLParser, or ElementTree (all of which
are part of the standard libraries in v 2.5). Skipping past html
validation and html-to-xhtml 'cleaning', and instead starting with
the assumption that I have files that are valid XHTML, can anyone
give me a good example of how I would use htmllib, HTMLParser, or
ElementTree to parse out the text of one specific childNode, similar
to the examples I provided above using regex?
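To make the request concrete, here is roughly what I imagine the
ElementTree version would look like (a minimal, untested sketch;
'page.xhtml' is a made-up local file name, and as far as I can tell
the 2.5 ElementTree has no itertext(), so the text is gathered by
hand):

from xml.etree import ElementTree

def element_text(elem):
    # gather an element's text, each child's text, and each child's tail
    parts = [elem.text or '']
    for child in elem:
        parts.append(element_text(child))
        parts.append(child.tail or '')
    return ''.join(parts)

tree = ElementTree.parse('page.xhtml')   # hypothetical local file
# valid XHTML puts every element in the XHTML namespace
NS = '{http://www.w3.org/1999/xhtml}'
paragraphs = tree.getiterator(NS + 'p')
clean_text = ' '.join(element_text(paragraphs[5]).split())

If getiterator is the wrong tool for this, corrections are welcome.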
