Paul Johnston wrote:
> Hi
> I have a string which I convert into a list then read through it
> printing its glyph and numeric representation
>
> #-*- coding: utf-8 -*-
>
> thestring = "abcd"
> thelist = list(thestring)
>
> for c in thelist:
>     print c,
>     print ord(c)
>
> Works fine for latin characters but when I put in a unicode character
> a two byte character gives me two characters. For example an arabic
> alef returns
>
> * 216
> * 167
>
> ( the first asterix is the empty set symbol the second a double "s")
>
> Putting in sequential characters i.e. alef, beh, teh mabuta, gives me
> sequential listings i.e.
> 216 167
> 216 168
> 216 169
> So it is reading the correct details.
>
>
> Is there anyway to get the c in the for loop to recognise it is
> reading a multiple byte character.
> I have followed the info in PEP 0263 and am using Python 2.4.3 Build
> 12 on a Windows box within Eclipse 3.2.0 and Python plugins 1.2.2

Use unicode objects instead of byte strings. The string literal above is
_not_ affected by the coding: header at all. The header applies only to
u"some text" literals, which it decodes into unicode objects. Normal
string literals are just bytes: because the encoding is set correctly in
your editor, a multibyte character you type is stored as its individual
bytes, and iterating over the string yields those bytes one at a time.

In a nutshell: try the above using u"abcd".

Diez
--
http://mail.python.org/mailman/listinfo/python-list
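
To illustrate the difference between bytes and unicode objects described in the reply, here is a minimal sketch. It is written in Python 3 syntax, which is an assumption on my part: the thread itself concerns Python 2.4, where the print statements and the u"..." prefix look different. In Python 3, str plays the role of Python 2's unicode and bytes plays the role of Python 2's plain str. The helper name describe is hypothetical and only there for the example.

```python
# Sketch in Python 3 syntax (an assumption; the thread uses Python 2.4).
# Python 3 'str' ~ Python 2 'unicode'; Python 3 'bytes' ~ Python 2 'str'.

def describe(seq):
    """Return (item, number) pairs for each element of a str or bytes object.

    'describe' is a hypothetical helper, not part of any library.
    """
    pairs = []
    for c in seq:
        # Iterating bytes yields ints; iterating str yields 1-character strings.
        number = c if isinstance(c, int) else ord(c)
        pairs.append((c, number))
    return pairs

text = "\u0627\u0628\u0629"   # alef, beh, teh marbuta as Unicode text
raw = text.encode("utf-8")    # the same text as UTF-8 encoded bytes

unicode_numbers = [n for _, n in describe(text)]
byte_numbers = [n for _, n in describe(raw)]

print(unicode_numbers)  # [1575, 1576, 1577]
print(byte_numbers)     # [216, 167, 216, 168, 216, 169]
```

The byte values match the pairs reported in the question (216 167, 216 168, 216 169), while the text version gives one number per character. In Python 2 the same effect should be obtainable either with a u"..." literal, as suggested above, or by decoding an existing byte string with thestring.decode("utf-8").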
