On Wed, Jan 28, 2015 at 10:35 AM, Sunil Tech <sunil.tech...@gmail.com> wrote:
> Hi All,
>
> When i copied a text from web and pasted in the python-terminal, it
> automatically coverted into unicode(i suppose)
>
> can anyone tell me how it does?
> Eg:
> >>> p = "你好"
> >>> p
> '\xe4\xbd\xa0\xe5\xa5\xbd'
> >>> o = 'ªîV'
> >>> o
> '\xc2\xaa\xc3\xaeV'
> >>>

No, it didn't. You created a bytestring that contains some bytes. Python does NOT think of `p` as a unicode string of 2 characters; it's a bytestring of 6 bytes. You cannot use that bytestring to reliably get only the first character, for example: `p[0]` will get you garbage ('\xe4', which will render as a question mark on a UTF-8 terminal).

In order to get a real unicode string, you must do one of the following:

(a) Prefix the literal with u. This works only if your locale is set correctly and Python knows you use UTF-8. For example:

>>> p = u"你好"
>>> p
u'\u4f60\u597d'

(b) Use decode on the bytestring, which is safer because you name the encoding yourself instead of depending on a properly configured system:

>>> p = "你好".decode('utf-8')
>>> p
u'\u4f60\u597d'

However, this does not apply in Python 3. Python 3 defaults to Unicode strings, so you can do:

>>> p = "你好"

and have proper Unicode handling, assuming your system locale is set correctly. If it isn't, decode the raw bytes explicitly. Note that a Python 3 bytes literal may contain only ASCII characters, so b"你好" is a SyntaxError and the bytes have to be written as escapes:

>>> p = b'\xe4\xbd\xa0\xe5\xa5\xbd'.decode('utf-8')

-- 
Chris Warrick <https://chriswarrick.com/>
PGP: 5EAAEA16
_______________________________________________
Tutor maillist - Tutor@python.org
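
For anyone following along in Python 3, here is a minimal sketch of the byte/character distinction described above. The names `raw` and `text` are illustrative and do not come from the thread, and the bytes are written as escapes so the example does not depend on how the terminal is configured.

```python
# Python 3 sketch: the same text seen as bytes and as a str.
# `raw` and `text` are illustrative names, not from the original thread.

# UTF-8 encoding of the two characters from the question (3 bytes each).
raw = b'\xe4\xbd\xa0\xe5\xa5\xbd'

# Decode explicitly; this does not consult the system locale.
text = raw.decode('utf-8')

print(len(raw))    # number of bytes
print(len(text))   # number of characters

# Slicing the bytes gives a fragment of a character, not a character.
first_byte = raw[:1]
print(first_byte)

# Indexing the str gives a whole character.
first_char = text[0]
print(first_char == '\u4f60')

# Encoding the str again round-trips to the original bytes.
print(text.encode('utf-8') == raw)
```

The byte length is 6 and the character length is 2, which is the mismatch the original `p[0]` example runs into in Python 2.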