unicode data - accessing codepoints > FFFF on narrow python builds

vbr Wed, 18 Apr 2007 02:47:10 -0700

Hi all,
I'd like to ask about using unicode data on a narrow python build.
Unicode string literals with \N{name} work even without an (explicit) import of
unicodedata, and they also correctly handle the "wider" unicode planes, i.e.
codepoints above FFFF:


>>> u"\N{LATIN SMALL LETTER E}"
u'e'
>>> u"\N{GOTHIC LETTER AHSA}"
u'\U00010330'

The unicodedata functions work analogously in the basic plane, but behave
differently otherwise:

>>> import unicodedata
>>> unicodedata.lookup("LATIN SMALL LETTER E")
u'e'
>>> unicodedata.lookup("GOTHIC LETTER AHSA")
u'\u0330'

(the leading 0001 gets trimmed, so U+10330 comes back as U+0330)

Is it a bug in unicodedata, or is this the expected behaviour on a narrow build?

Another problem I have is accessing the "characters" and their properties by
their respective codepoints:
below FFFF it is possible to use unichr(), which isn't valid for higher
values on a narrow build.
It is possible to derive the codepoint from the surrogate pair, which would
also be usable for the wider codepoints.
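
For reference, the surrogate pair arithmetic mentioned above could be sketched roughly as follows. This is only the standard UTF-16 calculation on plain integers; the function names are made up for illustration and are not part of unicodedata or any other module.

```python
# Hypothetical helper names; only the UTF-16 arithmetic itself is standard.

def surrogates_to_codepoint(high, low):
    """Combine a UTF-16 surrogate pair (two integers) into one codepoint."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def codepoint_to_surrogates(cp):
    """Split a codepoint above FFFF into its (high, low) surrogate pair."""
    if not (0x10000 <= cp <= 0x10FFFF):
        raise ValueError("codepoint is outside the supplementary planes")
    offset = cp - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)
```

On a narrow build the two integers would presumably come from ord(s[0]) and ord(s[1]) of the two-unit string, and unichr(high) + unichr(low) would build the string back from a codepoint.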

Currently, I'm using a kind of parallel database for some unicode ranges above
FFFF, but I don't think this is the most effective way.
I actually found something similar at
http://inamidst.com/phenny/modules/codepoint.py which uses UnicodeData.txt
directly,

but I was wondering if there is a simpler way of doing that; it seems
obvious that the data are present, since they can be used for constructing
unicode literals.
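
One possible way to reach the same data that the literals use might be to build the \N{name} escape at runtime and decode it with the "unicode_escape" codec. This is only a sketch: the helper name is hypothetical, and I am assuming (not certain) that the codec produces the same surrogate pair on a narrow build as the literal does. It also covers only the name-to-character direction, not the properties.

```python
def lookup_via_escape(name):
    """Return the character for a Unicode name by decoding a \\N{...} escape.

    Hypothetical helper; it relies on the "unicode_escape" codec resolving
    names the same way as \\N{...} in a unicode string literal.
    """
    escape = "\\N{" + name + "}"
    return escape.encode("ascii").decode("unicode_escape")
```

On a narrow build the result for a name such as GOTHIC LETTER AHSA would then be expected to have length 2 (the surrogate pair), rather than the truncated single character that unicodedata.lookup() returned above.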

Any hints are welcome,   thanks.

vbr
-- 
http://mail.python.org/mailman/listinfo/python-list
