Bug#279000: Python curses bindings and UTF-8

Martin Michlmayr Thu, 26 May 2005 11:53:35 -0700

I have a question about the Python curses bindings and UTF-8 support.
I tried to do my homework but couldn't find a good answer so I'm
asking you.  I know that ncurses supports UTF-8 but it seems that the
Python bindings don't, or maybe I have to manually encode UTF-8
strings in some way, but there's no documentation.


I have attached a Python test program which uses the curses bindings
to output some characters.  Some of them are displayed correctly in
curses, but others (the bullet) are not.

Can someone tell me:
 - is this a bug in my program?
 - is this a bug in the Python curses bindings?

I'm not talking about wide-character support here.  I've seen there's
an open request for this, see
http://mail.python.org/pipermail/python-bugs-list/2003-March/016679.html
I'm talking about normal UTF-8 support.

I'd be really glad if someone could give me an answer to this
because there are at least two Python and curses based programs I use
(cplay, jack) that get this wrong and I'd like to get them fixed.


I had a previous conversation with Martin v. Löwis, but I could not
quite make sense of it:

* "Martin v. Löwis" <[EMAIL PROTECTED]> [2005-01-19 23:48]:
> I'm uncertain as to how ncurses deals with multi-byte characters - one
> might think that UTF-8 is just a multi-byte encoding (where a single
> character can require multiple bytes), and that they can be sent to
> the terminal as-is. This, of course, would require that control
> sequences are sent to the terminal only at character boundaries -
> something that the application would have to guarantee.
> 
> The other issue with multi-byte characters is columns - you cannot
> equate "single byte == single column". OTOH, in "true" non-ASCII
> applications, you cannot thus equate, anyway, since some characters
> (e.g. Hanji full-width characters) take two columns, anyway. Not
> sure how curses deals with that phenomenon.
> 
> >I guess once there are UTF-8 aware Python curses bindings, I have to
> >change cplay to use UTF-8 internally (however that may work), but
> >right now I'm wondering about the bindings themselves.
> 
> The libncursesw5 is certainly not about UTF-8, and Python does not
> support it at the moment.
> 
> It probably should, which would mean that, on the C API, you don't
> pass char/char* anymore, but wchar_t/wchar_t*. On the Python API,
> you would pass Unicode objects, instead of string objects.
> 
> IOW, there seem to be two options:
> 1. Use char*, and UTF-8, and Python byte strings. Make sure you
>    always keep the multiple bytes of a byte string together;
>    this is easiest to achieve by converting them to Unicode
>    temporarily. So instead of
> 
>    for c in data:
>        if condition: output escape sequence
>        output c
> 
>    do
> 
>    udata = data.decode("UTF-8")
>    for c in udata:
>        if condition: output escape sequence
>        output c.encode("UTF-8")

This doesn't seem to work.
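
For reference, the string handling described in option 1 can be
sketched as follows.  This is only an illustration in Python 3 syntax
(an assumption; the thread itself uses Python 2 byte strings), and the
helper name encode_per_character is hypothetical.  It shows how to keep
the bytes of each character together; it says nothing about whether
curses then displays them correctly.

```python
# Minimal sketch, assuming Python 3: bytes and str are distinct types here.

def encode_per_character(data: bytes) -> list:
    """Decode UTF-8 bytes, then re-encode one whole character at a time.

    Each returned chunk is a complete UTF-8 sequence, so an escape
    sequence emitted between two chunks never splits a character.
    """
    udata = data.decode("utf-8")
    return [ch.encode("utf-8") for ch in udata]


# The three test characters from the attached program.
data = "ä•人".encode("utf-8")

# Iterating over the raw bytes yields one item per byte (8 here),
# which is where a multi-byte character could be torn apart.
byte_count = len(data)

# Iterating per character yields 3 chunks of 2, 3 and 3 bytes.
chunks = encode_per_character(data)
```

Joining the chunks gives back the original byte string, so nothing is
lost by going through Unicode temporarily.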

> 2. Implement a true Unicode API for curses, using libncursesw.
>    This would check the actual parameters to see whether they
>    are byte strings or Unicode strings, and invoke the appropriate
>    curses library (assuming you can mix curses and cursesw in single
>    terminal - or choke if somebody tries to mix byte strings and
>    Unicode strings in a single terminal).
> 
>    Then, the above loop becomes
> 
>    udata = data.decode("UTF-8")
>    for c in udata:
>        if condition: output escape sequence
>        output c
> 
>    So the difference would be that you can directly send Unicode
>    characters, instead of encoding them as UTF-8 first.

Who would be in a position to implement this?  I thought that Python
has been UTF-8/Unicode ready for years but it seems this is not the
case for the curses bindings.

> In either case, you might need to deal with the issue of full-width
> characters (i.e. characters that consume horizontally twice as
> much space as the latin letters). Not all terminals support full-width
> in the first place; xterm is an example of a terminal that does.
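
The column issue raised above can be illustrated with a rough width
estimate.  This is a sketch in Python 3 (an assumption, as is the
helper name display_width), using the standard unicodedata module.  It
treats East Asian "wide" and "fullwidth" characters as two columns and
everything else as one, which is a simplification: "ambiguous"
characters are counted as narrow here, and a real terminal may
disagree.

```python
import unicodedata


def display_width(text: str) -> int:
    """Estimate how many terminal columns a string occupies.

    Assumption: wide ("W") and fullwidth ("F") characters take two
    columns, combining marks take none, everything else takes one.
    """
    width = 0
    for ch in text:
        if unicodedata.combining(ch):
            # Combining marks are drawn over the preceding character.
            continue
        if unicodedata.east_asian_width(ch) in ("W", "F"):
            width += 2
        else:
            width += 1
    return width


# Byte length and column count differ for all three test characters.
for sample in ("ä", "•", "人"):
    print(sample, len(sample.encode("utf-8")), display_width(sample))
```

Neither the byte count nor the character count matches the column
count in general, so layout code needs a width function of this kind.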

-- 
Martin Michlmayr
http://www.cyrius.com/

#!/usr/bin/python
# -*- coding: utf-8 -*-

import curses
import time

a = 'ä'
b = '•'
c = '人'
u_a = unicode(a, "utf-8")
u_b = unicode(b, "utf-8")
u_c = unicode(c, "utf-8")

print a
print u_a
print b
print u_b
print c
print u_c
time.sleep(3)

w = curses.initscr()
w.addstr('umlaut a: ' + a)
w.addstr("\n")
w.addstr('bullet: ' + b)
w.addstr("\n")
w.addstr("Chinese ren: " + c)
w.addstr("\n")
w.addstr("\n")

w.addstr('umlaut a: ' + u_a.encode("utf-8"))
w.addstr("\n")
w.addstr('bullet: ' + u_b.encode("utf-8"))
w.addstr("\n")
w.addstr("Chinese ren: " + u_c.encode("utf-8"))
w.addstr("\n")
w.addstr("\n")
w.refresh()
time.sleep(3)
curses.endwin()
