[issue17318] xml.sax and xml.dom fetch DTDs by default

2013-02-27 Thread Raynard Sandwick

New submission from Raynard Sandwick:

Note that URIs in the following are only meant as links when in parentheses; 
otherwise, they are identifiers and mostly will not yield useful results. I 
have only worked with xml.sax in Python 2.6 and 2.7, so I cannot speak to its 
current state in later versions.

The condition described in Python issue #2124 
(http://bugs.python.org/issue2124) may yet be a defect, and is at the least a 
reasonably important enhancement, but apparently was not sufficiently 
specified, so I will attempt to clarify. As an aside, it is similar to a 
libxml2 issue on which I have also commented today 
(https://bugzilla.gnome.org/show_bug.cgi?id=162776), whose statement of issue 
actually contains what I would expect to be correct behavior if the toggling 
action were setting an option/feature rather than importing an additional 
module.

The most common case, and the reason w3c has been inundated with the described 
requests, is that every time any user anywhere uses xml.sax in its default form 
to parse an XHTML document containing a doctype declaration, a request is sent 
to www.w3.org for the contents of that DTD from the URI in its system 
identifier. This is not documented anywhere (which would be the primary reason 
to call this a defect), and is confusing because it has the effect of using the 
terms parser and validator (or "validating parser," whichever is the preferred 
name) interchangeably.
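
A minimal sketch of the case above, not taken from the original report: the standard library exposes the SAX feature as xml.sax.handler.feature_external_ges, and with the expat-based reader, switching it off keeps the parser from dereferencing the doctype's system identifier. The XHTML string, the ElementNames handler, and the function name are hypothetical. The report concerns Python 2.6 and 2.7, where the feature is on by default; Python 3.7.1 and later ship with it off, so the explicit call below only matters on older versions.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, feature_external_ges

# Hypothetical input: an XHTML document whose doctype names the W3C DTD.
XHTML = b"""<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"><head><title>t</title></head><body/></html>
"""


class ElementNames(ContentHandler):
    """Hypothetical handler that records element names in document order."""

    def __init__(self):
        super().__init__()
        self.names = []

    def startElement(self, name, attrs):
        self.names.append(name)


def parse_without_external_entities(data):
    parser = xml.sax.make_parser()
    # Explicitly refuse external entities, which for the expat reader
    # also covers the external DTD subset named in the doctype.
    parser.setFeature(feature_external_ges, False)
    handler = ElementNames()
    parser.setContentHandler(handler)
    parser.parse(io.BytesIO(data))
    return handler.names
```

Calling parse_without_external_entities(XHTML) should walk the document without contacting www.w3.org.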

The w3c is largely to blame, since their own definition document for XML 
(http://www.w3.org/TR/REC-xml/#sec-external-ent) defines the DTD as a "special 
kind of external entity," and then goes on to say that XML processors *MAY* use 
any combination of pubid+sysid to find an alternative method of resolving the 
reference, but otherwise *MUST* use the URI.

However, this is only necessary when *validating* XML. The DTD is a "mostly 
useless, but required" (http://en.wikipedia.org/wiki/Document_Type_Declaration) 
entity in HTML5, for example, but is not required in XML generally. Even when present, 
the only time a processor should consult the DTD is during validation, not 
parsing. If the default parser revealed by xml.sax is a validator rather than 
just a parser, that should be communicated clearly to the user. When we discuss 
a CSV parser, we expect it to accept lines separated by some character, each 
with columns separated by commas. We do not expect it to verify that certain 
values are found in certain columns of the first line unless we specify that it 
should. In specifying that it should, we have asked for a validator rather than 
a parser. This issue is related to the XML analogue of that distinction.
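
To make the CSV analogy concrete, here is a small sketch using Python's csv module. The function names and the expected-header check are hypothetical; the point is only that parsing succeeds regardless of content, while validation is a separate step the caller asks for.

```python
import csv
import io


def parse_csv(text):
    """Parsing: split lines into columns, with no opinion about the values."""
    return list(csv.reader(io.StringIO(text)))


def validate_header(rows, expected):
    """Validation: an explicit, opt-in check that the first line has the expected columns."""
    return bool(rows) and rows[0] == list(expected)
```

A caller who only wants the rows calls parse_csv; one who also wants the header checked calls validate_header on the result.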

The most valid and important complaint in the referenced blog post is: "don't 
fetch stuff unless you actually need it," which is what xml.sax users may be 
unwittingly doing if validation is the default behavior. Further, if xml.sax 
were actually *not* conducting validation by default, there is no reason 
whatever to retrieve the DTD, since any external entity references can remain 
unresolved in well-formed XML prior to validation.

Note that the features http://xml.org/sax/features/external-general-entities, 
.../external-parameter-entities, and .../validation have no specified defaults 
(http://www.saxproject.org/apidoc/org/xml/sax/package-summary.html#package_description).
Enabling them by default causes side effects that require network access, which I 
would contend is improper: unless a user asks for network activity, none should 
occur. An implicit request for network activity, such as validation, should be 
fully and prominently documented as a legitimate side effect.
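
The state of these three features can be inspected directly. The following is a sketch assuming the default reader returned by xml.sax.make_parser() is the expat-based one, as in a standard CPython install; the helper names are hypothetical. The expat reader is non-validating, so asking it to enable validation raises SAXNotSupportedException.

```python
import xml.sax
from xml.sax import SAXNotSupportedException
from xml.sax.handler import (
    feature_external_ges,
    feature_external_pes,
    feature_validation,
)


def report_features():
    """Return the current setting of each feature on a fresh default parser."""
    parser = xml.sax.make_parser()
    names = (feature_external_ges, feature_external_pes, feature_validation)
    return {name: bool(parser.getFeature(name)) for name in names}


def validation_supported():
    """Return True only if the default parser accepts a request to validate."""
    parser = xml.sax.make_parser()
    try:
        parser.setFeature(feature_validation, True)
    except SAXNotSupportedException:
        return False
    return True
```

The value reported for external-general-entities depends on the Python version in use, which is part of why documenting the default matters.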

The set of primary use cases for the xml.sax parsers certainly includes 
validation, but users will often be unaware that it is the default, and more 
importantly be unaware that the parser will therefore request the DTD from its 
URI. While the feature, .../external-general-entities, partially solves the 
problem, it is not a full solution, because a well-formed XML document can 
contain external entities regardless of the location of DTD subsets. The w3c's 
description ("special kind of external entity") is important here - the DTD is 
special for a reason, and has its own tag/specifier as a result: resolving 
general external entities after intentionally omitting an external DTD subset 
is an acceptable use case, especially in a non-validating parser.

My proposal would be to enhance/fix xml.sax by doing the following:

1) allow toggling of external DTD subset loading via a feature such as 
http://apache.org/xml/features/nonvalidating/load-external-dtd 
(http://xerces.apache.org/xerces-j/features.html),
2) cause the feature, http://xml.org/sax/features/validation, to automatically 
enable the DTD loading feature as well, just as it does for the two currently 
implemented external entity features,
3) document the default behavior, specially n

[issue2124] xml.sax and xml.dom fetch DTDs by default

2013-02-27 Thread Raynard Sandwick

Raynard Sandwick added the comment:

I have opened issue #17318 to try to specify the problem better. While I do 
think that catalogs are the correct fix for the validation use case (and thus 
would like to see something more out-of-the-box in that vein), the real trouble 
is that users are often unaware that they're sending requests to DTD URIs, so 
some combination of fixes in default behavior and/or documentation is 
definitely needed.
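
For illustration only, a catalog-like lookup can be approximated today with an EntityResolver. This sketch is not an implementation of XML Catalogs: LOCAL_CATALOG, CatalogResolver, and parse_with_catalog are hypothetical names, the DTD content is a stand-in comment rather than the real DTD, and unknown identifiers resolve to an empty stream instead of a network fetch.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, EntityResolver, feature_external_ges
from xml.sax.xmlreader import InputSource

# Hypothetical catalog: public identifier -> locally stored DTD bytes.
LOCAL_CATALOG = {
    "-//W3C//DTD XHTML 1.0 Strict//EN": b"<!-- local stand-in for the real DTD -->",
}

DOC = b"""<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"><head><title>t</title></head><body/></html>
"""


class CatalogResolver(EntityResolver):
    """Serve external entities from a local mapping and log every request."""

    def __init__(self, catalog):
        self.catalog = catalog
        self.requested = []

    def resolveEntity(self, publicId, systemId):
        self.requested.append((publicId, systemId))
        source = InputSource(systemId)
        source.setPublicId(publicId)
        # Supplying a byte stream means the parser never opens the URI itself.
        source.setByteStream(io.BytesIO(self.catalog.get(publicId, b"")))
        return source


def parse_with_catalog(data, catalog):
    parser = xml.sax.make_parser()
    # Turn external entity handling on so the resolver is actually consulted.
    parser.setFeature(feature_external_ges, True)
    resolver = CatalogResolver(catalog)
    parser.setEntityResolver(resolver)
    parser.setContentHandler(ContentHandler())
    parser.parse(io.BytesIO(data))
    return resolver.requested
```

The list returned by parse_with_catalog(DOC, LOCAL_CATALOG) shows which identifiers the parser asked for, which is also a cheap way to make the otherwise silent DTD requests visible.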

The external_ges feature does help, in a way, but is poorly communicated to new 
users, and moreover does not respect the difference between external DTD 
subsets and external general entities (there's a reason "DOCTYPE" isn't spelled 
"ENTITY").

The default behavior is not well documented, and the constraining behavior of 
DTDs is frequently unnecessary. Either a user should have to explicitly enable 
validation, or it should be unmistakably obvious to a user that validation is 
the default behavior, and in both cases it should be blatantly documented that 
validation may cause network side effects. I think the input has been 
reasonable all around, and yet I find it rather insane that this issue didn't 
eventually at least result in a documentation fix, thanks to what looks like 
push-back for push-back's sake, though I will gladly admit the conclusion that 
it was underspecified is entirely valid.

Anyway, further info in the new issue...

--
nosy: +rsandwick3

___
Python tracker 
<http://bugs.python.org/issue2124>
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue17239] XML vulnerabilities in Python

2013-03-25 Thread Raynard Sandwick

Changes by Raynard Sandwick:


--
nosy: +rsandwick3

___
Python tracker 
<http://bugs.python.org/issue17239>
___