[issue2124] xml.sax and xml.dom fetch DTDs by default

2009-02-03 Thread Damien Neil

Damien Neil  added the comment:

I just ran into this problem.  I was very surprised to realize that
every time the code I was working on parsed a DocBook file, it generated
several HTTP requests to oasis-open.org to fetch the DocBook DTDs.

I attempted to fix the issue by adding an EntityResolver that would
cache fetched DTDs.  (The documentation on how to do this is not, by the
way, very clear.)
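
A resolver of the kind described here might look roughly like the
following.  This is a minimal sketch in Python 3, not the code from the
original report: the names CachingEntityResolver and fetch_url, the fetch
parameter, and the in-memory dict cache are all assumptions made for
illustration.

```python
import io
import urllib.request
import xml.sax.handler
import xml.sax.xmlreader


def fetch_url(url):
    """Hypothetical default fetcher: download the entity body once."""
    with urllib.request.urlopen(url) as response:
        return response.read()


class CachingEntityResolver(xml.sax.handler.EntityResolver):
    """Hypothetical resolver that fetches each systemId at most once."""

    def __init__(self, fetch=fetch_url):
        self._fetch = fetch   # injectable so the cache can be tested offline
        self._cache = {}      # systemId -> raw bytes of the entity

    def resolveEntity(self, publicId, systemId):
        if systemId not in self._cache:
            self._cache[systemId] = self._fetch(systemId)
        # Hand the parser a byte stream so it has no reason to fetch anything.
        source = xml.sax.xmlreader.InputSource(systemId)
        source.setPublicId(publicId)
        source.setByteStream(io.BytesIO(self._cache[systemId]))
        return source
```

The cache is keyed on the systemId exactly as the parser reports it, so
a relative identifier would be handed to the fetcher unresolved.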

Unfortunately, this proves not to be possible.  The main DocBook DTD
includes subsidiary DTDs using relative system identifiers.  For
example, the main DTD at:

publicId: -//OASIS//DTD DocBook V4.1//EN
systemId: http://www.oasis-open.org/docbook/xml/4.4/docbookx.dtd

...includes this second DTD:

publicId: -//OASIS//ENTITIES DocBook Notations V4.4//EN
systemId: dbnotnx.mod

However, the EntityResolver's resolveEntity() method receives only the
publicId and systemId.  It is not passed the base URL against which a
relative systemId must be resolved.

This makes it impossible to properly implement a parser that caches
fetched DTDs.
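
To make the missing piece concrete: turning the relative systemId above
into something fetchable requires the URL of the entity that referenced
it.  A minimal sketch using the standard library's urljoin (Python 3
spelling); the helper name resolve_system_id is hypothetical and exists
only for this illustration.

```python
from urllib.parse import urljoin


def resolve_system_id(base, system_id):
    """Resolve a possibly relative systemId against the including entity's URL."""
    return urljoin(base, system_id)


main_dtd = "http://www.oasis-open.org/docbook/xml/4.4/docbookx.dtd"
print(resolve_system_id(main_dtd, "dbnotnx.mod"))
# http://www.oasis-open.org/docbook/xml/4.4/dbnotnx.mod
```

Without the first argument, which resolveEntity() is never given, the
second argument alone cannot be turned into an absolute URL.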

--
nosy: +damien

___
Python tracker 
<http://bugs.python.org/issue2124>
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue2124] xml.sax and xml.dom fetch DTDs by default

2009-02-03 Thread Damien Neil

Damien Neil  added the comment:

On Feb 3, 2009, at 1:42 PM, Martin v. Löwis wrote:
> Sure. But ContentHandler.setDocumentLocator receives it, and you are
> supposed to store it for the entire parse, to always know what entity
> is being processed if you want to.

Where in the following sequence am I supposed to receive the document 
locator?

import xml.sax
import xml.dom.minidom

parser = xml.sax.make_parser()
parser.setEntityResolver(CachingEntityResolver())
doc = xml.dom.minidom.parse('file.xml', parser)

The content handler is being created deep inside xml.dom.  It does, in 
fact, store the document locator, but not in any place that I can easily 
access without breaking several layers of abstraction.
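
For a pure SAX parse, where the caller supplies the content handler, the
suggestion quoted above could be sketched as follows.  One object acts as
both content handler and entity resolver, keeps the locator it is given,
and uses it as the base for relative system identifiers.  The class name
LocatorAwareResolver is hypothetical, and the sketch assumes that the
locator's getSystemId() reports the entity currently being parsed, as the
quoted reply implies.  It does not help in the xml.dom.minidom sequence
above, because there the content handler is not the caller's.

```python
from urllib.parse import urljoin
import xml.sax
import xml.sax.handler


class LocatorAwareResolver(xml.sax.handler.ContentHandler,
                           xml.sax.handler.EntityResolver):
    """Hypothetical handler that resolves systemIds against the locator."""

    def __init__(self):
        super().__init__()
        self._locator = None

    def setDocumentLocator(self, locator):
        # Called by the parser; kept for the duration of the parse.
        self._locator = locator

    def resolveEntity(self, publicId, systemId):
        base = self._locator.getSystemId() if self._locator else None
        # Returning a string tells the parser which system identifier to read.
        return urljoin(base, systemId) if base else systemId


# The same object has to be registered in both roles.
parser = xml.sax.make_parser()
handler = LocatorAwareResolver()
parser.setContentHandler(handler)
parser.setEntityResolver(handler)
```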

Or, as a more general question: How can I get a DOM tree that includes 
external entities?  If there's an easy way to do it, the documentation 
does not make it clear at all.




[issue2124] xml.sax and xml.dom fetch DTDs by default

2009-02-03 Thread Damien Neil

Damien Neil  added the comment:

I just discovered another really fun wrinkle in this.

Let's say I want to have my entity resolver return a reference to my 
local copy of a DTD.  I write:

source = xml.sax.InputSource()
source.setPublicId(publicId)
source.setSystemId(systemId)
source.setCharacterStream(open(path_to_local_copy))
return source

This will appear to work.

However, the parser will still silently fetch the DTD over the network!
I needed to call source.setByteStream() instead; character streams are
silently ignored.
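
The byte-stream variant described here could be sketched like this, in
Python 3 spelling.  The function name local_dtd_source is hypothetical;
the point is only that the local copy is opened in binary mode and
attached with setByteStream() rather than setCharacterStream().

```python
import xml.sax.xmlreader


def local_dtd_source(publicId, systemId, path_to_local_copy):
    """Build an InputSource that serves a DTD from a local file as bytes."""
    source = xml.sax.xmlreader.InputSource()
    source.setPublicId(publicId)
    source.setSystemId(systemId)
    # Binary mode: the parser is given raw bytes, not decoded text.
    source.setByteStream(open(path_to_local_copy, "rb"))
    return source
```

An entity resolver's resolveEntity() method would return the result of
this function for the identifiers it has a local copy of.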

I'd never have noticed this if I hadn't used strace on my process and 
noticed a slew of recvfrom() calls that shouldn't have been there.




[issue2124] xml.sax and xml.dom fetch DTDs by default

2009-02-03 Thread Damien Neil

Damien Neil  added the comment:

On Feb 3, 2009, at 11:23 AM, Martin v. Löwis wrote:
> I don't think this is actually the case. Did you try calling getSystemId
> on the locator?

EntityResolver.resolveEntity() is called with the publicId and systemId as 
arguments. It does not receive a locator.

- Damien




[issue2124] xml.sax and xml.dom fetch DTDs by default

2009-02-03 Thread Damien Neil

Damien Neil  added the comment:

On Feb 3, 2009, at 3:12 PM, Martin v. Löwis wrote:
> This is DOM parsing, not SAX parsing.

1) The title of this ticket begins with "xml.sax and xml.dom...".
2) I am creating a SAX parser and passing it to xml.dom, which uses it.

> So break layers of abstraction, then. Or else, use dom.expatbuilder,
> and ignore SAX/pulldom for DOM parsing.

Is that really the answer?

Read the source code of xml.dom.* and write hacks based on what I find
there?  Note also that xml.dom.expatbuilder does not appear to be a
public API: there is no mention of it in the documentation for
xml.dom.*.

> This tracker is really not the place to ask questions; use python-list
> for that.

That was a rhetorical question.

The answer is, as best I can tell, "You can't do that."
