On 11/8/07, Walter Dörwald <[EMAIL PROTECTED]> wrote: > Martin v. Löwis wrote: > > >> Then how about the suggested "xml-auto-detect"? > > > > That is better. > > OK. > > >>> Then, I'd claim that the problem that the codec solves doesn't really > >>> exist. IOW, most XML parsers implement the auto-detection of encodings, > >>> anyway, and this is where architecturally this functionality belongs. > >> But not all XML parsers support all encodings. The XML codec makes it > >> trivial to add this support to an existing parser. > > > > I would like to question this claim. Can you give an example of a parser > > that doesn't support a specific encoding > > It seems that e.g. expat doesn't support UTF-32: > > from xml.parsers import expat > > p = expat.ParserCreate() > e = "utf-32" > s = (u"<?xml version='1.0' encoding=%r?><foo/>" % e).encode(e) > p.Parse(s, True) > > This fails with: > > Traceback (most recent call last): > File "gurk.py", line 6, in <module> > p.Parse(s, True) > xml.parsers.expat.ExpatError: not well-formed (invalid token): line 1, > column 1 > > Replace "utf-32" with "utf-16" and the problem goes away. > > > and where adding such a codec > > solves that problem? > > > > In particular, why would that parser know how to process Python Unicode > > strings? > > It doesn't have to. You can use an XML encoder to reencode the unicode > string into bytes (forcing an encoding that the parser knows): > > import codecs > from xml.parsers import expat > > ci = codecs.lookup("xml-auto-detect") > p = expat.ParserCreate() > e = "utf-32" > s = (u"<?xml version='1.0' encoding=%r?><foo/>" % e).encode(e) > s = ci.encode(ci.decode(s)[0], encoding="utf-8")[0] > p.Parse(s, True) > > >> Furthermore encoding-detection might be part of the responsibility of > >> the XML parser, but this decoding phase is totally distinct from the > >> parsing phase, so why not put the decoding into a common library? > > > > I would not object to that - just to expose it as a codec. Adding it > > to the XML library is fine, IMO. > > But it does make sense as a codec. The decoding phase of an XML parser > has to turn a byte stream into a unicode stream. That's the job of a codec.
Yes, an XML parser should be able to use UTF-8, UTF-16, UTF-32, etc codecs to do the encoding. There's no need to create a magical mystery codec to pick out which though. It's not even sufficient for XML: 1) round-tripping a file should be done in the original encoding. Containing the auto-detected encoding within a codec doesn't let you see what it picked. 2) the encoding may be specified externally from the file/stream[1]. The xml parser needs to handle these out-of-band encodings anyway. [2] http://mail.python.org/pipermail/xml-sig/2004-October/010649.html -- Adam Olsen, aka Rhamphoryncus _______________________________________________ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com