[issue5752] xml.dom.minidom does not escape CR, LF and TAB characters within attribute values
Tomalak added the comment: @devon: Thanks for pointing & linking back here. -- ___ Python tracker <http://bugs.python.org/issue5752> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue5752] xml.dom.minidom does not handle newline characters in attribute values
New submission from Tomalak:

Current behavior upon toxml() is:

Upon reading the document again, the new line is normalized and collapsed into a space (according to the XML spec, section 3.3.3), which means that it is lost.

Better behavior would be something like this (within attribute values only):

--
components: XML
messages: 85964
nosy: Tomalak
severity: normal
status: open
title: xml.dom.minidom does not handle newline characters in attribute values
versions: Python 2.4, Python 2.5, Python 2.6, Python 2.7, Python 3.0, Python 3.1
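The normalization mentioned in this report can be observed on the parsing side alone. A minimal sketch, assuming only the standard xml.dom.minidom parser; the element name `elem` and attribute name `attr` are made up for illustration and are not taken from the report:

```python
from xml.dom.minidom import parseString

# Literal newline inside the attribute value: the parser normalizes it to a space.
literal = parseString('<elem attr="multiline\nvalue"/>')
literal_value = literal.documentElement.getAttribute("attr")

# Same value with the newline written as a character reference: it survives parsing.
escaped = parseString('<elem attr="multiline&#10;value"/>')
escaped_value = escaped.documentElement.getAttribute("attr")

print(repr(literal_value))
print(repr(escaped_value))
```

Only the second form lets a newline stored in an attribute come back unchanged, which is why the serializer would have to write it that way.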
[issue5752] xml.dom.minidom does not handle newline characters in attribute values
Changes by Tomalak : -- type: -> behavior
[issue5752] xml.dom.minidom does not handle newline characters in attribute values
Tomalak added the comment:

@Francesco Sechi: Would it not just require a minimal change to the _write_data() method? Something along the lines of (sorry, no Python expert, maybe I am way off):

    def _write_data(writer, data, is_attrib=False):
        "Writes datachars to writer."
        if is_attrib:
            data = data.replace("\r", "&#13;").replace("\n", "&#10;")
        data = data.replace("&", "&amp;").replace("<", "&lt;")
        data = data.replace("\"", "&quot;").replace(">", "&gt;")
        writer.write(data)

and in Element.writexml():

    #[...]
    for a_name in a_names:
        writer.write(" %s=\"" % a_name)
        _write_data(writer, attrs[a_name].value, True)
    #[...]
[issue5752] xml.dom.minidom does not handle newline characters in attribute values
Tomalak added the comment:

Of course it should be:

    def _write_data(writer, data, is_attrib=False):
        "Writes datachars to writer."
        data = data.replace("&", "&amp;").replace("<", "&lt;")
        data = data.replace("\"", "&quot;").replace(">", "&gt;")
        if is_attrib:
            data = data.replace("\r", "&#13;").replace("\n", "&#10;")
        writer.write(data)
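Why the order of the replacements matters can be checked in isolation. The sketch below is not minidom's code: `escape_attr` and `escape_attr_wrong_order` are hypothetical stand-alone helpers that mirror the two orderings posted in this thread, and the decimal reference forms are an assumption.

```python
def escape_attr(data):
    """Escape markup characters first, then CR/LF (the corrected order)."""
    data = data.replace("&", "&amp;").replace("<", "&lt;")
    data = data.replace('"', "&quot;").replace(">", "&gt;")
    return data.replace("\r", "&#13;").replace("\n", "&#10;")


def escape_attr_wrong_order(data):
    """Escape CR/LF first; the later '&' replacement then mangles the references."""
    data = data.replace("\r", "&#13;").replace("\n", "&#10;")
    data = data.replace("&", "&amp;").replace("<", "&lt;")
    return data.replace('"', "&quot;").replace(">", "&gt;")


good = escape_attr("a&b\nc")
bad = escape_attr_wrong_order("a&b\nc")
print(good)
print(bad)
```

In the wrong order, the ampersand of the freshly written reference is itself escaped, so a parser would read back the reference as plain text rather than as a newline.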
[issue5752] xml.dom.minidom does not handle newline characters in attribute values
Tomalak added the comment: Hmm... I thought toxml() is the part that needs to be fixed, not the parsing/reading. I mentioned the reading only to outline the data loss that occurs eventually. My point is: The toxml() (i.e. _write_data) *actually writes* the newline to the output. And within parameters, it just shouldn't.
[issue5752] xml.dom.minidom does not handle newline characters in attribute values
Tomalak added the comment: Attaching a patch that fixes the problem. -- keywords: +patch Added file: http://bugs.python.org/file13919/minidom.patch
[issue5752] xml.dom.minidom does not handle newline characters in attribute values
Tomalak added the comment:

Attaching a test file that outlines the problem. Output on my system (Windows / Python 3.0) is:

Without the patch:

    C:\Python30>python.exe c:\minidom_test.py
    False
    1 -->"multiline value"
    2 -->"multiline value"

With the patch:

    C:\Python30>python.exe c:\minidom_test.py
    True
    1 -->"multiline value"
    2 -->"multiline value"

--
Added file: http://bugs.python.org/file13920/toxml_test.py
[issue5752] xml.dom.minidom does not handle newline characters in attribute values
Changes by Tomalak : Removed file: http://bugs.python.org/file13920/toxml_test.py
[issue5752] xml.dom.minidom does not handle newline characters in attribute values
Changes by Tomalak : Added file: http://bugs.python.org/file13921/minidom_test.py
[issue5752] xml.dom.minidom does not handle newline characters in attribute values
Tomalak added the comment:

Francesco, I think you are missing the point. :-)

The problem has two sides. If I create an XML document using the DOM (not by parsing it from a string!), then I can put newline characters into an attribute value. This is allowed and conforms to the XML spec.

However, *literal* newlines in an attribute value (i.e. when the document is parsed from a string) have no meaning. The parser treats them as if they were insignificant whitespace -- they are converted to a single space. This is also valid and conforms to the XML spec.

The catch: This leads to actual data loss if I *wanted* to store newline characters in an attribute -- unless the newline characters are properly encoded. Encoding the newline characters is also valid and conforms to the spec, so the DOM implementation should do it.

In other words - the parsing process you refer to is actually working fine. If an attribute contains a literal newline, it is indeed okay to collapse it into a space.

It's only the document serializing that is broken. Minidom is clearly missing functionality here, and it does not conform to the XML spec. If I store a string of data in an XML document, it must be ensured that upon reading the document again, I get the *same* data back. This is what I check with my test script.
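A round-trip check of the kind described here might look like the following. This is a sketch, not the attached test script; `roundtrips` is a hypothetical helper, the element and attribute names are illustrative, and what it reports for values containing newlines depends on whether the minidom in use escapes them on output.

```python
from xml.dom.minidom import getDOMImplementation, parseString


def roundtrips(value):
    """Build a document via the DOM, serialize it, re-parse it, compare."""
    doc = getDOMImplementation().createDocument(None, "elem", None)
    doc.documentElement.setAttribute("attr", value)
    reparsed = parseString(doc.toxml())
    return reparsed.documentElement.getAttribute("attr") == value


print(roundtrips("single line"))
print(roundtrips("multiline\nvalue"))
```

A serializer that writes the newline literally makes the second call report a mismatch, because the parser then normalizes that newline to a space.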
[issue5752] xml.dom.minidom does not escape newline characters within attribute values
Changes by Tomalak : -- title: xml.dom.minidom does not handle newline characters in attribute values -> xml.dom.minidom does not escape newline characters within attribute values
[issue5752] xml.dom.minidom does not escape newline characters within attribute values
Tomalak added the comment:

Francesco,

> if you want to encode the newline character,
> this should be done by both parseString and
> setAttribute methods. Otherwise, the
> behaviour is not symmetric.

I believe you still don't see the issue. The behaviour is not symmetric *now*. You store a '\n' in an attribute value with setAttribute(), save the document to XML, load it again and out comes a space where the '\n' should have been.

The point is that parseString() behaves correctly, but serializing does not. There is only one side to fix, because only one side is broken.

> If you want to encode the newline in different
> manner, you should develop a patch that
> introduces this kind of encoding in both
> parseString and setAttribute methods.

It would be pointless to do the encoding in setAttribute(). The valid ways to XML-encode a '\n' character are '&#10;', '&#xA;' or '&#x0A;'. Doing so in setAttribute() would produce doubly encoded output, like this: '&amp;#10;'. This is even more wrong.

However, if parseString() encounters a '&#10;' in the input, it correctly translates this to '\n' in the DOM. As I said, there is nothing to fix in parsing, this exercise is about getting minidom to actually *output* a '&#10;' where appropriate. :-)
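The double-encoding effect can be reproduced directly. In this sketch (element and attribute names are illustrative), the character reference is passed to setAttribute() as ordinary text, so the serializer escapes its ampersand:

```python
from xml.dom.minidom import getDOMImplementation, parseString

doc = getDOMImplementation().createDocument(None, "elem", None)
# Pre-encoding by hand: the DOM sees five ordinary characters, not a newline.
doc.documentElement.setAttribute("attr", "multiline&#10;value")

xml_text = doc.documentElement.toxml()
print(xml_text)

# After re-parsing, the attribute holds the literal reference text, still no newline.
value = parseString(xml_text).documentElement.getAttribute("attr")
print(repr(value))
```

The value that comes back is the text of the reference itself, so encoding at the setAttribute() level cannot stand in for escaping at serialization time.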
[issue5752] xml.dom.minidom does not escape newline characters within attribute values
Tomalak added the comment:

Daniel Diniz: The proposed behaviour is correct:

http://www.w3.org/TR/2000/WD-xml-c14n-2119.html#charescaping

"In attribute values, the character information items TAB (#x9), newline (#xA), and carriage-return (#xD) are represented by "&#x9;", "&#xA;", and "&#xD;" respectively."

Since the behaviour is correct, it is also desirable. :-)

I don't think that this change could cause existing solutions to break, since the current inconsistency in handling these characters makes it impossible to rely on this anyway.

Thanks for putting up the unit test diff.
[issue5752] xml.dom.minidom does not escape newline characters within attribute values
Changes by Tomalak : Removed file: http://bugs.python.org/file13919/minidom.patch
[issue5752] xml.dom.minidom does not escape newline characters within attribute values
Changes by Tomalak : Added file: http://bugs.python.org/file13977/minidom.patch
[issue5752] xml.dom.minidom does not escape newline characters within attribute values
Tomalak added the comment: I changed the patch to include support for TAB characters, which were also left unencoded before. Also I switched encoding from '&#10;' etc. to '&#xA;'. This is equivalent, but the spec uses the latter variant.
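That decimal and hexadecimal character references are interchangeable to a parser can be confirmed with a short check. A sketch using the standard parser; `parse_attr` is a hypothetical helper and the element and attribute names are illustrative:

```python
from xml.dom.minidom import parseString

# (character, decimal reference, hexadecimal reference)
FORMS = [
    ("\t", "&#9;", "&#x9;"),
    ("\n", "&#10;", "&#xA;"),
    ("\r", "&#13;", "&#xD;"),
]


def parse_attr(ref):
    """Parse a one-element document whose attribute value contains `ref`."""
    doc = parseString('<elem attr="a%sb"/>' % ref)
    return doc.documentElement.getAttribute("attr")


# For each character: value parsed from the decimal form, from the hex form,
# and the value expected if the character is preserved.
results = [
    (parse_attr(dec), parse_attr(hexa), "a" + ch + "b")
    for ch, dec, hexa in FORMS
]
for dec_value, hex_value, expected in results:
    print(repr(dec_value), repr(hex_value), dec_value == hex_value == expected)
```

Either spelling restores the original character on parsing, so the choice between them only affects how the serialized text looks.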
[issue5752] xml.dom.minidom does not escape CR, LF and TAB characters within attribute values
Changes by Tomalak : -- title: xml.dom.minidom does not escape newline characters within attribute values -> xml.dom.minidom does not escape CR, LF and TAB characters within attribute values