Re: [Tutor] unichr not working as expected

Steven D'Aprano Mon, 22 Jul 2013 21:03:36 -0700

On 23/07/13 04:14, Jim Mooney wrote:

  I tried translating the odd chars I found in my dos tree /f listing to
symbols, but I'm getting this error. The chars certainly aren't over
10000. The ord is only 13, so what's wrong here?


def main():
     zark = ''
     for x in "ÀÄÄÄ":
         zark += unichr(ord(x)-45)



This is broken in three ways that I can see.

Firstly, assuming you are using Python 2.7 (as you have said in the past), 
"ÀÄÄÄ" does not mean what you think it means.

In Python 3, this is a Unicode string containing four individual characters:

LATIN CAPITAL LETTER A WITH GRAVE
LATIN CAPITAL LETTER A WITH DIAERESIS
LATIN CAPITAL LETTER A WITH DIAERESIS
LATIN CAPITAL LETTER A WITH DIAERESIS

Why you have duplicates, I do not know :-)

But in Python 2, that's not what you will get. What you get depends on your 
environment, and is unpredictable. For example, on my system, using a Linux 
terminal interactively with the terminal set to UTF-8, I get:

py> for c in "ÀÄ":  # removing duplicates
...     print c, ord(c)
...
� 195
� 128
� 195
� 132


Yes, that's right, I get FOUR (not two) "characters" (actually bytes). But if I 
change the terminal settings to, say, ISO-8859-7:

py> for c in "ΓΓ":
...     print c, ord(c)
...
Γ 195
 128
Γ 195
 132

the bytes stay the same (195, 128, 195, 132) but the *meaning* of those bytes 
changes completely.
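
The same effect can be reproduced without touching any terminal settings. This 
is only a sketch: it spells out the four bytes explicitly and decodes them 
under the two encodings mentioned above, using the standard codec names 
'utf-8' and 'iso-8859-7'. The variable names are mine.

```python
# The four bytes from the session above: 195, 128, 195, 132.
raw = bytearray([195, 128, 195, 132])

# Same bytes, two different interpretations.
as_utf8 = bytes(raw).decode('utf-8')
as_greek = bytes(raw).decode('iso-8859-7')

print(len(as_utf8))    # number of characters when read as UTF-8
print(len(as_greek))   # number of characters when read as ISO-8859-7
print(repr(as_utf8))
print(repr(as_greek))
```

Read as UTF-8 the bytes pair up into two characters; read as ISO-8859-7 each 
byte is a character of its own, and 195 becomes a Greek capital gamma.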

So the point is, if you are running Python 2.7, what you get from a byte 
string like "ÀÄ" is unpredictable. What you need is a Unicode string u"ÀÄ", 
which will be exactly what it looks like.
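
For instance, writing the two characters with explicit \u escapes takes the 
terminal out of the picture altogether. A small sketch (the names are mine) 
which should behave the same in Python 2.7 and Python 3:

```python
# Two characters, written with escapes so no encoding guesswork is involved.
text = u"\u00c0\u00c4"   # A WITH GRAVE, A WITH DIAERESIS

code_points = [ord(c) for c in text]

# Encoding is now an explicit step, and only then do we get bytes.
utf8_bytes = list(bytearray(text.encode('utf-8')))

print(len(text))
print(code_points)
print(utf8_bytes)
```

The string has two characters no matter where it runs, and the bytes only 
appear once you ask for a specific encoding.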

That's the first issue.

Second issue, you build up a string using this idiom:

zark = ''
for c in something:
    zark += c


Even though this works, it is a bad habit to get into and you should avoid 
it: it risks being unpredictably slower than continental drift, and in a way 
that is *really* hard to diagnose. I've seen a case of this fool the finest 
Python core developers for *weeks*, regarding a reported bug where Python was 
painfully slow, but only for SOME Windows users and not others.

Accumulating strings using + can be slow when there are a lot of strings 
because it is a Shlemiel the painter's algorithm:

http://www.joelonsoftware.com/articles/fog0000000319.html
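
To put rough numbers on that, here is a back-of-the-envelope sketch. It 
assumes the worst case, where every + builds a brand new string by copying 
everything accumulated so far; the function name is made up for the 
illustration.

```python
def chars_copied(n):
    """Count characters copied when appending n one-character
    strings, assuming each append copies the whole result so far."""
    total = 0
    length = 0
    for _ in range(n):
        length += 1       # the new string is one character longer
        total += length   # and all of it has to be written out again
    return total

print(chars_copied(10))
print(chars_copied(1000))
```

The total works out to n*(n+1)/2, so ten times as many pieces means roughly 
a hundred times as much copying.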


The reason why sometimes it is *not* slow is that CPython 2.4 and beyond 
includes a clever optimization trick which can *sometimes* fix this issue, but 
it depends on details of the operating system's memory handling, and of course 
it doesn't apply to other implementations like Jython, IronPython, PyPy and 
Nuitka.

So do yourself a favour and get out of the habit of accumulating strings in a 
for loop using + since it will bite you one day. (Adding one or two strings is 
fine.)
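
One common alternative, sketched here rather than prescribed, is to collect 
the pieces in a list and join them once at the end. The function name is 
hypothetical.

```python
def build_string(pieces):
    """Accumulate the pieces in a list, then join them once."""
    collected = []
    for piece in pieces:
        collected.append(piece)
    return ''.join(collected)

print(build_string(['spam', ' ', 'eggs']))
```

When no per-item work is needed, ''.join(pieces) on its own does the same 
job.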


Problem number three: you generate characters using this:

unichr(ord(x)-45)

but that gives you a negative number if ord(x) is less than 45, which gives you 
exactly the result you see:

py> unichr(-1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: unichr() arg not in range(0x10000) (narrow Python build)


(By the way, you're very naughty. The code you show *cannot possibly generate 
the error you claim it generates*. Bad Jim, no biscuit!)


I don't understand what the ord(x)-45 is intended to do. The effect is to give 
the 45th previous character, e.g. the 45th character before 'n' is 'A'. But 
characters below chr(45) don't have anything 45 characters previous, so you 
need to rethink what you are trying to do.
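
If shifting code points really is what you want, one option is to check the 
range before converting, so that the failure explains itself. This is only a 
guess at the intent: shift_back is a hypothetical helper, and the try/except 
is there so the sketch also runs on Python 3, where unichr is spelled chr.

```python
def shift_back(char, offset=45):
    """Return the character `offset` code points before `char`."""
    try:
        to_char = unichr   # Python 2
    except NameError:
        to_char = chr      # Python 3: chr already returns Unicode
    code = ord(char) - offset
    if code < 0:
        raise ValueError(
            "%r (ord %d) has no character %d places before it"
            % (char, ord(char), offset))
    return to_char(code)

print(shift_back(u'n'))   # ord('n') is 110, and 110 - 45 is 65
```

Calling it on a carriage return (ord 13) raises the ValueError with a message 
that names the offending character, instead of the range complaint from 
unichr.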


--
Steven
