spir wrote:
[back to the list after a rather long break]

Hello,

I stepped on a unicode issue ;-) (one more)
Below an illustration:

============================
class U(unicode):
        def __str__(self):
                return self

# if you can't properly see the string below,
# 128<ordinals<255
c0 = "¶ÿµ"
c1 = U("¶ÿµ","utf8")
c2 = unicode("¶ÿµ","utf8")

for c in (c0,c1,c2):
        try:
                print "%s" %c,
        except UnicodeEncodeError:
                print "***",
        try:
                print c.__str__(),
        except UnicodeEncodeError:
                print "***",
        try:
                print str(c)
        except UnicodeEncodeError:
                print "***"

==>

¶ÿµ ¶ÿµ ¶ÿµ
¶ÿµ ¶ÿµ ***
¶ÿµ *** ***
==============================

The last line shows that a regular unicode object cannot be passed to str()
(more or less OK), nor can __str__() be called on it (not OK at all).
Maybe I am overlooking some obvious point (again). If not, then this in fact
means two issues:

-1- The old ambiguity of str() meaning both "create an instance of type str
from the given data" and "build a textual representation of the given object,
through __str__", which has always been a semantic flaw for me, becomes
concretely problematic when we have text that is not str.
Well, I'm very surprised by this. Actually, how come this point doesn't seem
to be well known? How is it even possible to use unicode without stepping on
this problem? I guess this breaks years or even decades of habit for coders
used to writing str() when they mean __str__().

-2- How is it possible that __str__ does not work on a unicode object? It seems
that the method is simply not implemented on the unicode type, nor is __repr__,
so that it falls back to str().
Strangely enough, % interpolation works, which means that for both types of
text a shortcut is used, namely returning the text itself as is. I would have
bet my last cents that % would simply delegate to __str__, or maybe that they
were in fact the same function, but obviously I was wrong!

Looking for workarounds, I first tried to overload (or rather create) __str__
as in the U type above. But this solution is far from ideal because we still
cannot use str() (I mean my fingers can type it while my head is
who-knows-where). Also, it is in fact unusable, for the following reason:
=================================
print c1.__class__
print c1[1].__class__
c3 = c1 ; print (c1+c3).__class__
==>
<class '__main__.U'>
<type 'unicode'>
<type 'unicode'>
==================================
Any operation returns a plain unicode instead of the original type. So the
said type would have to overload all possible operations on text, which is a
lot indeed, to convert the results back. Not to mention performance issues.
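
A minimal sketch of what that overloading would involve, assuming a subclass that wraps two operations by hand. The name text_type is an illustrative alias (unicode on Python 2, str on Python 3) and is not part of the original code; only __add__ and __getitem__ are wrapped here, so every other method still hands back the base type:

```python
# Illustrative alias, not from the original post: the text type of the
# running interpreter (unicode on Python 2, str on Python 3).
try:
    text_type = unicode
except NameError:
    text_type = str


class U(text_type):
    def __str__(self):
        return self

    # Each operation has to be wrapped by hand to get a U back.
    def __add__(self, other):
        return U(text_type.__add__(self, other))

    def __getitem__(self, index):
        return U(text_type.__getitem__(self, index))


c1 = U(u"\xb6\xff\xb5")
print(type(c1 + c1).__name__)    # wrapped: result is a U
print(type(c1[1]).__name__)      # wrapped: result is a U
print(type(c1.upper()).__name__) # not wrapped: result is the base text type
```

Anything not listed (upper(), replace(), join(), formatting, and so on) would need the same treatment, which is the maintenance burden described above.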

So, the only solution seems to me to use % everywhere, hunt all str and __str__ 
and __repr__ and such in all code.

I hope I'm wrong on this. Please, give me a better solution ;-)



------
la vita e estrany



I'm not the one to help with this, because my unicode experience is rather limited. But I think I know enough to ask a few useful questions.

1) What version of Python are you doing this on, what OS, and what code page is your stdout using?

2) What coding declaration do you have in your source file? Without it, I can't even define those literals. I added the line
#-*- coding: utf-8 -*-
as line 2 of my source file to get past that one. But I really don't know much about this literal string that I pasted from your email.

3) Could you give us the hex equivalent of the 3 character string you're trying to give us in the email? The only clue you gave us was that the bytes were between 129 and 254, which they aren't, on my machine, at least with a utf-8 coding declaration.
repr(u"¶ÿµ") -->  u'\xb6\xff\xb5'   length= 3
repr(c0) -->  '\xc2\xb6\xc3\xbf\xc2\xb5'  length = 6
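
The difference between those two lengths can be checked without depending on the source file's encoding. This is a small sketch, assuming the three characters are the code points U+00B6, U+00FF and U+00B5 as the repr above suggests; the variable names are only illustrative:

```python
# Write the characters as escapes so the file's coding declaration
# does not matter.
text = u"\xb6\xff\xb5"

# Each of these code points needs two bytes in UTF-8.
encoded = text.encode("utf-8")

print(len(text))                       # number of code points
print(len(encoded))                    # number of UTF-8 bytes
print([hex(ord(ch)) for ch in text])   # the ordinals, in hex
```

So the ordinals are indeed between 128 and 255, but the byte string literal in a utf-8 source file holds six bytes, not three.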

You say that __str__() isn't defined on Unicode objects, but that's not the case, at least in 2.6.2. Works fine on ASCII characters, but something causes an exception for your strings. Since you're eating the exception, all you know is something went wrong, not what went wrong. And since my environment is probably totally different, ... I get the exception text: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)
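
To see what went wrong rather than only that something went wrong, the handler can keep the exception and print it. A minimal sketch, assuming (as the message above indicates) that the failing step is an encode with the 'ascii' codec; the explicit .encode("ascii") call stands in for the implicit conversion that str() performs on a unicode object in Python 2:

```python
text = u"\xb6\xff\xb5"

message = None
try:
    # Stand-in for the implicit ASCII encoding done by str() in Python 2.
    text.encode("ascii")
except UnicodeEncodeError as exc:
    # Keep the exception text instead of replacing it with "***".
    message = str(exc)
    print(message)
```

Encoding explicitly with a codec that can represent the characters, for example text.encode("utf-8"), does not raise.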

Incidentally, you'll probably save yourself a lot of grief in the long run if you change your editor to always expand tabs to spaces (4-per). It's dangerous, and not recommended, to mix tabs and spaces in the same file, and it's surprising how often spaces get mixed in by accident. In Python 3.x it's illegal to mix them.

DaveA

