[issue2124] xml.sax and xml.dom fetch DTDs by default
Damien Neil added the comment:

I just ran into this problem. I was very surprised to realize that every time the code I was working on parsed a docbook file, it generated several HTTP requests to oasis-open.org to fetch the docbook DTDs.

I attempted to fix the issue by adding an EntityResolver that would cache fetched DTDs. (The documentation on how to do this is not, by the way, very clear.)

Unfortunately, this proves to not be possible. The main docbook DTD includes subsidiary DTDs using relative system identifiers. For example, the main DTD at:

    publicId: -//OASIS//DTD DocBook V4.1//EN
    systemId: http://www.oasis-open.org/docbook/xml/4.4/docbookx.dtd

...includes this second DTD:

    publicId: -//OASIS//ENTITIES DocBook Notations V4.4//EN
    systemId: dbnotnx.mod

The EntityResolver's resolveEntity() method is not, however, passed the base path to resolve the relative systemId from.

This makes it impossible to properly implement a parser which caches fetched DTDs.

--
nosy: +damien

___
Python tracker <http://bugs.python.org/issue2124>
___
___
Python-bugs-list mailing list
Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
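
[Editorial note] The relative-identifier problem described in the comment above can be illustrated with a small sketch. resolveEntity() receives only the publicId and systemId, so a resolver has to guess the base itself. The class below (BaseTrackingResolver is a hypothetical name, not part of xml.sax) remembers the last absolute systemId it saw and joins relative ones against it. This is a heuristic under that assumption, not a reliable fix: it gives wrong answers as soon as DTDs in different directories include each other.

```python
from urllib.parse import urljoin, urlparse
from xml.sax.handler import EntityResolver


class BaseTrackingResolver(EntityResolver):
    """Hypothetical resolver that guesses the base of relative system ids."""

    def __init__(self):
        self.base = None  # last absolute systemId seen (an assumption-based guess)

    def resolveEntity(self, publicId, systemId):
        if urlparse(systemId).scheme:
            # Absolute identifier: remember it as the base for later relative ones.
            self.base = systemId
            return systemId
        if self.base is not None:
            # Relative identifier: resolve against the remembered base.
            return urljoin(self.base, systemId)
        # No base known yet: hand the identifier back unchanged.
        return systemId
```

With the two identifiers quoted above, the second call would turn dbnotnx.mod into a URL in the same directory as docbookx.dtd, which is the value a caching resolver would need as its cache key.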
[issue2124] xml.sax and xml.dom fetch DTDs by default
Damien Neil added the comment:

On Feb 3, 2009, at 1:42 PM, Martin v. Löwis wrote:
> Sure. But ContentHandler.setDocumentLocator receives it, and you are
> supposed to store it for the entire parse, to always know what entity
> is being processed if you want to.

Where in the following sequence am I supposed to receive the document locator?

    parser = xml.sax.make_parser()
    parser.setEntityResolver(CachingEntityResolver())
    doc = xml.dom.minidom.parse('file.xml', parser)

The content handler is being created deep inside xml.dom. It does, in fact, store the document locator, but not in any place that I can easily access without breaking several layers of abstraction.

Or, as a more general question: How can I get a DOM tree that includes external entities? If there's an easy way to do it, the documentation does not make it clear at all.
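
[Editorial note] For comparison, this is roughly what the quoted suggestion looks like when the caller owns the ContentHandler, i.e. in plain SAX parsing. It is a minimal sketch only: LocatingHandler and its seen list are hypothetical names, and it does not address the DOM case raised above, where xml.dom creates the handler internally.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class LocatingHandler(ContentHandler):
    """Hypothetical handler that keeps the locator for the whole parse."""

    def __init__(self):
        super().__init__()
        self.locator = None
        self.seen = []  # (element name, system id, line number) per start tag

    def setDocumentLocator(self, locator):
        # Called by the parser before any other event; store it for later use.
        self.locator = locator

    def startElement(self, name, attrs):
        # The stored locator reports which entity and line is being processed.
        self.seen.append(
            (name, self.locator.getSystemId(), self.locator.getLineNumber())
        )


handler = LocatingHandler()
xml.sax.parseString(b"<book>\n  <title>Example</title>\n</book>", handler)
```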
[issue2124] xml.sax and xml.dom fetch DTDs by default
Damien Neil added the comment:

I just discovered another really fun wrinkle in this.

Let's say I want to have my entity resolver return a reference to my local copy of a DTD. I write:

    source = xml.sax.InputSource()
    source.setPublicId(publicId)
    source.setSystemId(systemId)
    source.setCharacterStream(file(path_to_local_copy))
    return source

This will appear to work. However, the parser will still silently fetch the DTD over the network! I needed to call source.setByteStream()--character streams are silently ignored.

I'd never have noticed this if I hadn't used strace on my process and noticed a slew of recvfrom() calls that shouldn't have been there.
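
[Editorial note] The workaround described above, written out as a complete resolver in Python 3 syntax. This is a sketch under stated assumptions: LocalDTDResolver and its local_copies mapping are hypothetical names, and the character-stream behaviour reported in the comment dates from 2009, so later Python releases may handle it differently. The point shown is only that the local copy is attached with setByteStream() on a file opened in binary mode.

```python
from xml.sax.handler import EntityResolver
from xml.sax.xmlreader import InputSource


class LocalDTDResolver(EntityResolver):
    """Hypothetical resolver that serves known DTDs from local files."""

    def __init__(self, local_copies):
        # local_copies: dict mapping a publicId to the path of a local copy.
        self.local_copies = local_copies

    def resolveEntity(self, publicId, systemId):
        path = self.local_copies.get(publicId)
        if path is None:
            # Unknown entity: return the system id, as the default resolver does.
            return systemId
        source = InputSource()
        source.setPublicId(publicId)
        source.setSystemId(systemId)
        # Attach a byte stream (binary mode), not a character stream.
        source.setByteStream(open(path, "rb"))
        return source
```

It would be installed with parser.setEntityResolver(LocalDTDResolver({...})) on a parser obtained from xml.sax.make_parser().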
[issue2124] xml.sax and xml.dom fetch DTDs by default
Damien Neil added the comment:

On Feb 3, 2009, at 11:23 AM, Martin v. Löwis wrote:
> I don't think this is actually the case. Did you try calling getSystemId
> on the locator?

EntityResolver.resolveEntity() is called with the publicId and systemId as arguments. It does not receive a locator.

- Damien
[issue2124] xml.sax and xml.dom fetch DTDs by default
Damien Neil added the comment:

On Feb 3, 2009, at 3:12 PM, Martin v. Löwis wrote:
> This is DOM parsing, not SAX parsing.

1) The title of this ticket begins with "xml.sax and xml.dom...".
2) I am creating a SAX parser and passing it to xml.dom, which uses it.

> So break layers of abstraction, then. Or else, use dom.expatbuilder,
> and ignore SAX/pulldom for DOM parsing.

Is that really the answer? Read the source code to xml.dom.*, and write hacks based on what I find there? Note also that xml.dom.expatbuilder does not appear to be an external API--there is no mention of it in the documentation for xml.dom.*.

> This tracker is really not the place to ask questions; use python-list
> for that.

That was a rhetorical question.

The answer is, as best I can tell, "You can't do that."