[EMAIL PROTECTED] wrote:
>
> I was playing with python encodings and noticed this:
>
> [EMAIL PROTECTED]:~$ python2.4
> Python 2.4 (#2, Dec 3 2004, 17:59:05)
> [GCC 3.3.5 (Debian 1:3.3.5-2)] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
>>>> unicode('\x9d', 'iso8859_1')
> u'\x9d'
>>>>
>
> U+009D is NOT a valid unicode character (it is not even a valid
> iso8859_1 character)
It *IS* a valid unicode and iso8859-1 character, so the behaviour of the
python decoder is correct. The range U+0080 - U+009F is used for various
control characters. There's rarely a valid use for these characters in
documents, so you can be pretty sure that a document containing them is
actually windows-1252. Such a document is still valid iso-8859-1, but for
a heuristic guess it's probably safer to assume windows-1252.
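
For example (same kind of session as above; the 0x93 byte is just a
sample - in windows-1252 it is a curly quote, while 0x9D is one of the
five bytes that cp1252 leaves undefined):

>>> unicode('\x93', 'iso8859_1')   # decodes, but to a C1 control char
u'\x93'
>>> unicode('\x93', 'cp1252')      # the byte was probably a curly quote
u'\u201c'
>>> unicode('\x9d', 'cp1252')      # 0x9D is not defined in cp1252
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 0: character maps to <undefined>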
If you want an exception to be raised, you'll need to implement your own
codec, something like 'iso8859_1_nocc' - hmm, I might try this myself,
because I do such a test in one of my projects, too ;)
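
Something along these lines might do it (untested sketch, written for
Python 2 to match the session above; the codec name 'iso8859_1_nocc' is
made up):

import codecs

# Decode as latin-1, then reject C1 control characters (U+0080-U+009F).
def decode_nocc(data, errors='strict'):
    text, length = codecs.latin_1_decode(data, errors)
    for i, ch in enumerate(text):
        if u'\x80' <= ch <= u'\x9f':
            raise UnicodeDecodeError('iso8859_1_nocc', data, i, i + 1,
                                     'C1 control character')
    return text, length

# Search function: Python 2 expects a 4-tuple of
# (encoder, decoder, streamreader, streamwriter).
def search(name):
    if name == 'iso8859_1_nocc':
        return (codecs.latin_1_encode, decode_nocc, None, None)
    return None

codecs.register(search)

unicode('hello', 'iso8859_1_nocc')   # fine
unicode('\x9d', 'iso8859_1_nocc')    # raises UnicodeDecodeError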
> The same happens if I use 'latin-1' instead of 'iso8859_1'.
>
> This caught me by surprise, since I was doing some heuristic guessing
> of string encodings, and 'iso8859_1' gave no errors even when the input
> encoding was different.
>
> Is this a known behaviour, or have I discovered a terrible unknown bug
> in Python's encoding implementation that should be reported and fixed
> immediately? :-)
>
>
> happy new year,
>
--
Benjamin Niemann
Email: pink at odahoda dot de
WWW: http://www.odahoda.de/