On Wed, 07 Nov 2012 23:47:13 +0100, Victor Stinner <victor.stin...@gmail.com> wrote:
> 2012/11/7 Alexandre Vassalotti <alexan...@peadrop.com>:
> > The Unicode code points in the U+DC00-DFFF range (low surrogate area) can't
> > be encoded in UTF-8. Quoting from RFC 3629:
> >
> > The definition of UTF-8 prohibits encoding character numbers between U+D800
> > and U+DFFF, which are reserved for use with the UTF-16 encoding form (as
> > surrogate pairs) and do not directly represent characters.
> >
> >
> > It looks like this test was doing something specific with regards to this.
> > So, I am curious as well about this change.
>
> os.fsencode() uses the surrogateescape error handler (PEP 383) on UNIX.
>
> >>> os.fsencode('\udcf1\udcea\udcf0\udce8\udcef\udcf2')
> b'\xf1\xea\xf0\xe8\xef\xf2'
>
> I replaced this arbitrary string (and other similar constant strings)
> with support.FS_NONASCII which is more portable (should be available
> on all locale encodings... except ASCII) and documented.
>
> I rewrote test_cmd_line_script.test_non_ascii() (and other tests) in
> Python 3.4 to use support.FS_NONASCII.
>
> This change should improve code coverage on heterogeneous environments.

Alexandre's point was that the string did not appear to be arbitrary, but rather appeared to specifically be a string containing surrogates. Is this not the case?

--David
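
For reference, here is a minimal sketch of the round trip under discussion. It is not the code from the test suite. It passes the 'utf-8' codec and the 'surrogateescape' handler explicitly so that the result does not depend on the locale, whereas os.fsencode() picks the filesystem encoding for you. The variable names are only illustrative.

```python
# The six bytes produced by os.fsencode() in the quoted example.
raw = b'\xf1\xea\xf0\xe8\xef\xf2'

# None of these bytes is valid UTF-8 in this sequence, so surrogateescape
# maps each undecodable byte 0xNN to the lone low surrogate U+DCNN.
text = raw.decode('utf-8', 'surrogateescape')

# Encoding with the same handler restores the original bytes.
round_trip = text.encode('utf-8', 'surrogateescape')

# Without the handler, the lone surrogates cannot be encoded in UTF-8,
# which is the restriction RFC 3629 describes.
try:
    text.encode('utf-8')
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True
```

Seen this way, the surrogates in the old test string are what surrogateescape produces for bytes that do not decode as UTF-8, so the string stands for a specific byte sequence rather than for six encodable characters.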