In article <[email protected]>,
 Ben Finney <[email protected]> wrote:
> Ned Deily <[email protected]> writes:
> > $ python2.6 -c 'import sys; print sys.stdout.encoding, \
> >     sys.stdout.isatty()'
> > UTF-8 True
> > $ python2.6 -c 'import sys; print sys.stdout.encoding, \
> >     sys.stdout.isatty()' > foo ; cat foo
> > None False
>
> So shouldn't the second case also detect UTF-8? The filesystem knows
> it's UTF-8, the shell knows it too. Why doesn't Python know it?
The filesystem knows what is UTF-8?  While the setting of the locale
environment variables may influence how the file system interprets the
*name* of a file, it has no direct influence on what the *contents* of
a file are or are supposed to be.  Remember, in python 2.x, a file is
just a sequence of bytes.  If you want to encode Unicode and write it
to a file, you need to use something like codecs.open to wrap the file
object with the proper StreamWriter encoder (sketched at the end of
this message).

What confuses matters in 2.x is the print statement's under-the-covers
implicit Unicode encoding for files connected to a terminal:

http://bugs.python.org/issue612627
http://bugs.python.org/issue4947
http://wiki.python.org/moin/PrintFails

>>> x = u'\u0430\u0431\u0432'
>>> print x
[nice looking characters here]
>>> sys.stdout.write(x)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position
0-2: ordinal not in range(128)
>>> sys.stdout.encoding
'UTF-8'

In python 3.x, of course, the encoding happens automatically, but you
still have to tell python, via the "encoding" argument to open, what
the encoding of the file's contents is (or accept python's default,
which may not be very useful):

>>> open('foo1','w').encoding
'mac-roman'

WTF, indeed.
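To make the 2.x advice concrete, here is a minimal sketch of the
codecs.open approach (the file name and strings are made up for
illustration; there are other ways to arrange this):

# Python 2.x: codecs.open wraps the file object in a StreamWriter,
# so unicode objects are encoded to bytes on the way out.
import codecs

x = u'\u0430\u0431\u0432'

f = codecs.open('foo.txt', 'w', encoding='utf-8')   # hypothetical file
f.write(x)    # encoded to UTF-8 by the StreamWriter
f.close()

# The same trick rescues a redirected stdout, whose encoding is None
# and for which print falls back to ASCII:
import sys
out = codecs.getwriter('utf-8')(sys.stdout)
out.write(x + u'\n')    # works whether or not stdout is a tty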
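And the 3.x counterpart, passing "encoding" explicitly to open instead
of trusting whatever locale-dependent default produced 'mac-roman'
above (again, the file name is just an example):

# Python 3.x: open returns a TextIOWrapper that does the encoding,
# but only if you tell it which codec to use.
with open('foo1', 'w', encoding='utf-8') as f:
    f.write('\u0430\u0431\u0432')

with open('foo1', encoding='utf-8') as f:
    print(f.read())    # round-trips regardless of the locale default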
-- 
 Ned Deily,
 [email protected]