Re: How extract data from XHTML Transitional web pages? got xml.dom.minidom troubles..

Paul Boddie Fri, 02 Mar 2007 15:56:03 -0800

[EMAIL PROTECTED] wrote:
> I'm trying to extract some data from an XHTML Transitional web page.
>
> What is best way to do this?


An XML parser should be sufficient. However...

> xml.dom.minidom.parseString("text of web page") gives errors about it
> not being well formed XML.
>
> Do I just need to add something like <?xml ...?> or what?

If the page isn't well-formed, then it isn't proper XHTML, since the
XHTML specification [1] says...

    4.1. Documents must be well-formed

Yes, it's a heading, albeit in an "informative" section describing how
XHTML differs from HTML 4. See "3.2. User Agent Conformance" for a
"normative" mention of well-formedness.
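
To see exactly where a page stops being well-formed, it may help to catch
the parser's exception rather than let it propagate. The following is only
a minimal sketch: the helper name is_well_formed and the two sample strings
are made up for illustration, and the code assumes a current Python 3
standard library.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError


def is_well_formed(text):
    """Return (True, None) if text parses as XML, else (False, message)."""
    try:
        minidom.parseString(text)
    except ExpatError as exc:
        # The message includes the line and column where parsing failed.
        return False, str(exc)
    return True, None


# Hypothetical sample documents, not taken from the original page.
good = "<html><body><p>Hello<br/></p></body></html>"
bad = "<html><body><p>Hello<br></p></body></html>"  # <br> is never closed

print(is_well_formed(good))
print(is_well_formed(bad))
```

The second call reports a failure because the unclosed <br> no longer
matches the closing </p>, which is the kind of HTML-style markup that
typically trips up an XML parser.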

You could try libxml2dom (or other libxml2-based solutions) for some
fairly effective HTML parsing:

    libxml2dom.parseString("text of document here", html=1)

See http://www.python.org/pypi/libxml2dom for more details.
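
If installing a libxml2-based package is not an option, a different
technique (not libxml2dom, and not a DOM at all) is the event-driven
html.parser module in the Python 3 standard library, which tolerates markup
that is not well-formed. A rough sketch, assuming the goal is to collect
link targets; the class name LinkCollector and the sample page are invented
for illustration:

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags, tolerating sloppy markup."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


# Hypothetical page with an unclosed <a> and <br>, so not well-formed XML.
page = ('<html><body><p>See <a href="http://www.w3.org/TR/xhtml1/">'
        'the spec<br></p></body></html>')

collector = LinkCollector()
collector.feed(page)
collector.close()
print(collector.links)
```

This gives callbacks rather than a tree, so it suits simple extraction
jobs; for DOM-style navigation a tree-building HTML parser such as the one
mentioned above is the better fit.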

Paul

[1] http://www.w3.org/TR/xhtml1/

-- 
http://mail.python.org/mailman/listinfo/python-list
