New submission from Raynard Sandwick:
Note that URIs in the following are only meant as links when in parentheses;
otherwise, they are identifiers and mostly will not yield useful results. I
have only worked with xml.sax in Python 2.6 and 2.7, so I cannot speak to its
current state in later versions.
The condition described in Python issue #2124
(http://bugs.python.org/issue2124) may yet be a defect, and is at the least a
reasonably important enhancement, but apparently was not sufficiently
specified, so I will attempt to clarify. As an aside, it is similar to a
libxml2 issue on which I have also commented today
(https://bugzilla.gnome.org/show_bug.cgi?id=162776), whose statement of issue
actually contains what I would expect to be correct behavior if the toggling
action were setting an option/feature rather than importing an additional
module.
The most common case, and the reason W3C has been inundated with the described
requests, is that every time any user anywhere uses xml.sax in its default form
to parse an XHTML document containing a doctype declaration, a request is sent
to www.w3.org for the contents of that DTD from the URI in its system
identifier. This is not documented anywhere (which would be the primary reason
to call this a defect), and is confusing because it has the effect of using the
terms parser and validator (or "validating parser," whichever is the preferred
name) interchangeably.
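One way to observe this lookup without generating any traffic is to install an
entity resolver that records what the parser asks for and answers with an empty
stream. The following is a minimal sketch for Python 3, not a statement of how
xml.sax is meant to be used. The names RecordingResolver, ElementLogger and
parse_offline, and the sample document, are hypothetical. The sketch turns the
external-general-entities feature on explicitly, on the assumption that the
default may differ between Python versions.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, EntityResolver, feature_external_ges
from xml.sax.xmlreader import InputSource

# Hypothetical sample document; the DOCTYPE is the interesting part.
XHTML_DOC = (
    b'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" '
    b'"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">'
    b'<html xmlns="http://www.w3.org/1999/xhtml">'
    b'<head><title>t</title></head><body/></html>'
)


class RecordingResolver(EntityResolver):
    """Record each external entity lookup and answer it with no data."""

    def __init__(self):
        self.requests = []

    def resolveEntity(self, publicId, systemId):
        self.requests.append((publicId, systemId))
        empty = InputSource()
        # An already-open, empty byte stream means no URL has to be opened.
        empty.setByteStream(io.BytesIO(b""))
        return empty


class ElementLogger(ContentHandler):
    """Collect element names to show that parsing itself still succeeds."""

    def __init__(self):
        super().__init__()
        self.names = []

    def startElement(self, name, attrs):
        self.names.append(name)


def parse_offline(document):
    """Parse bytes; return (entity lookups requested, element names seen)."""
    parser = xml.sax.make_parser()
    # Assumption: force the lookup on, whatever this interpreter's default is.
    parser.setFeature(feature_external_ges, True)
    resolver = RecordingResolver()
    logger = ElementLogger()
    parser.setEntityResolver(resolver)
    parser.setContentHandler(logger)
    parser.parse(io.BytesIO(document))
    return resolver.requests, logger.names
```

Calling parse_offline(XHTML_DOC) should return the public and system
identifiers from the doctype among the recorded lookups, which is the request
that would otherwise have gone to www.w3.org.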
The W3C is largely to blame, since its own definition document for XML
(http://www.w3.org/TR/REC-xml/#sec-external-ent) defines the DTD as a "special
kind of external entity," and then goes on to say that XML processors *MAY* use
any combination of pubid+sysid to find an alternative method of resolving the
reference, but otherwise *MUST* use the URI.
However, this is only necessary when *validating* XML. The DTD is a "mostly
useless, but required" (http://en.wikipedia.org/wiki/Document_Type_Declaration)
entity in HTML5, for example, but is not required in XML generally. Even when
present, the only time a processor should consult the DTD is during validation,
not parsing. If the default parser exposed by xml.sax is a validator rather than
just a parser, that should be communicated clearly to the user. When we discuss
a CSV parser, we expect it to accept lines separated by some character, each
with columns separated by commas. We do not expect it to verify that certain
values are found in certain columns of the first line unless we specify that it
should. In specifying that it should, we have asked for a validator rather than
a parser. This issue is related to the XML analogue of that distinction.
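The CSV analogy can be made concrete with Python's csv module. This is only an
illustrative sketch: parse_rows and validate_header are hypothetical helpers,
and the point is that the second step runs only when the caller asks for it.

```python
import csv
import io


def parse_rows(text):
    """Parsing: split the text into rows and columns, nothing more."""
    return list(csv.reader(io.StringIO(text)))


def validate_header(rows, expected):
    """Validation: an explicit, separate check on the first row."""
    if not rows or rows[0] != expected:
        raise ValueError("unexpected header: %r" % (rows[:1],))
    return rows
```

A caller who only needs the rows calls parse_rows and never pays for, or is
surprised by, the check in validate_header.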
The most valid and important complaint in the referenced blog post is: "don't
fetch stuff unless you actually need it," which is what xml.sax users may be
unwittingly doing if validation is the default behavior. Further, if xml.sax
were actually *not* conducting validation by default, there would be no reason
whatsoever to retrieve the DTD, since any external entity references can remain
unresolved in well-formed XML prior to validation.
Note that the features, http://xml.org/sax/features/external-general-entities,
.../external-parameter-entities, and .../validation have no specified defaults
(http://www.saxproject.org/apidoc/org/xml/sax/package-summary.html#package_description).
Making these enabled by default causes network-required side effects, which I
would contend is improper: unless a user asks for network activity, none should
occur. An implicit request for network activity, such as validation, should be
fully and prominently documented as a legitimate side effect.
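Because the defaults are unspecified, a cautious caller can query and set these
three features explicitly instead of relying on them. The sketch below assumes
Python 3 and whichever driver xml.sax.make_parser() returns; report_features
and make_offline_parser are hypothetical helper names, and the values reported
will depend on the Python version and driver in use.

```python
import xml.sax
from xml.sax import SAXNotRecognizedException, SAXNotSupportedException
from xml.sax.handler import (
    feature_external_ges,   # .../external-general-entities
    feature_external_pes,   # .../external-parameter-entities
    feature_validation,     # .../validation
)

FEATURES = (feature_external_ges, feature_external_pes, feature_validation)


def report_features(parser):
    """Map each feature URI to True/False, or None if the driver rejects it."""
    states = {}
    for uri in FEATURES:
        try:
            states[uri] = bool(parser.getFeature(uri))
        except (SAXNotRecognizedException, SAXNotSupportedException):
            states[uri] = None
    return states


def make_offline_parser():
    """Create a parser with every network-related feature requested off."""
    parser = xml.sax.make_parser()
    for uri in FEATURES:
        try:
            parser.setFeature(uri, False)
        except (SAXNotRecognizedException, SAXNotSupportedException):
            pass  # this driver has no such switch to turn off
    return parser
```

Printing report_features(xml.sax.make_parser()) on a given installation shows
what that installation actually does by default, which is the information the
documentation currently leaves out.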
The set of primary use cases for the xml.sax parsers certainly includes
validation, but users will often be unaware that it is the default, and more
importantly be unaware that the parser will therefore request the DTD from its
URI. While the feature, .../external-general-entities, partially solves the
problem, it is not a full solution, because a well-formed XML document can
contain external entities regardless of the location of DTD subsets. The W3C's
description ("special kind of external entity") is important here - the DTD is
special for a reason, and has its own tag/specifier as a result: resolving
general external entities after intentionally omitting an external DTD subset
is an acceptable use case, especially in a non-validating parser.
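Whether a driver offers a separate switch for the external DTD subset can be
probed by trying to set one. The sketch below uses the Xerces feature URI cited
in the proposal that follows, purely as an example of such a switch; xml.sax
does not define it, and the bundled expat driver is expected to report it as
unrecognized. probe_feature is a hypothetical helper name.

```python
import xml.sax
from xml.sax import SAXNotRecognizedException, SAXNotSupportedException
from xml.sax.handler import feature_external_ges

# Xerces-specific URI, used here only to test for a DTD-loading switch.
LOAD_EXTERNAL_DTD = (
    "http://apache.org/xml/features/nonvalidating/load-external-dtd"
)


def probe_feature(parser, uri):
    """Try to switch a feature off and describe how the driver responds."""
    try:
        parser.setFeature(uri, False)
    except SAXNotRecognizedException:
        return "not recognized"
    except SAXNotSupportedException:
        return "recognized but not supported"
    return "disabled"
```

With the default driver, probe_feature(xml.sax.make_parser(),
feature_external_ges) should succeed, while the same call with
LOAD_EXTERNAL_DTD should not, leaving the general-entities switch as the only
control available.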
My proposal would be to enhance/fix xml.sax by doing the following:
1) allow toggling of external DTD subset loading via a feature such as
http://apache.org/xml/features/nonvalidating/load-external-dtd
(http://xerces.apache.org/xerces-j/features.html),
2) cause the feature, http://xml.org/sax/features/validation, to automatically
enable the DTD loading feature as well, just as it does for the two currently
implemented external entity features,
3) document the default behavior, especially n