I have a question about the Python curses bindings and UTF-8 support. I tried to do my homework but couldn't find a good answer so I'm asking you. I know that ncurses supports UTF-8 but it seems that the Python bindings don't, or maybe I have to manually encode UTF-8 strings in some way, but there's no documentation.
I have attached a Python test program which uses the curses bindings to output some characters. While some are displayed in curses, some aren't (the bullet). Can someone tell me:

 - is this a bug in my program?
 - is this a bug in the Python curses bindings?

I'm not talking about wide-character support here. I've seen there's an open request for this, see
http://mail.python.org/pipermail/python-bugs-list/2003-March/016679.html
I'm talking about normal UTF-8 support.

I'd be really glad if someone could give me an answer to this because there are at least two Python and curses based programs I use (cplay, jack) that get this wrong and I'd like to get them fixed.

I had a previous conversation with Martin v. Löwis, but I cannot quite make sense of it:

* "Martin v. Löwis" <[EMAIL PROTECTED]> [2005-01-19 23:48]:
> I'm uncertain as to how ncurses deals with multi-byte characters - one
> might think that UTF-8 is just a multi-byte encoding (where a single
> character can require multiple bytes), and that they can be sent to
> the terminal as-is. This, of course, would require that control
> sequences are sent to the terminal only at character boundaries -
> something that the application would have to guarantee.
>
> The other issue with multi-byte characters is columns - you cannot
> equate "single byte == single column". OTOH, in "true" non-ASCII
> applications, you cannot thus equate, anyway, since some characters
> (e.g. Hanji full-width characters) take two columns, anyway. Not
> sure how curses deals with that phenomenon.
>
> >I guess once there are UTF-8 aware Python curses bindings, I have to
> >change cplay to use UTF-8 internally (however that may work), but
> >right now I'm wondering about the bindings themselves.
>
> The libncursesw5 is certainly not about UTF-8, and Python does not
> support it at the moment.
>
> It probably should, which would mean that, on the C API, you don't
> pass char/char* anymore, but wchar_t/wchar_t*. On the Python API,
> you would pass Unicode objects, instead of string objects.
>
> IOW, there seem to be two options:
> 1. Use char*, and UTF-8, and Python byte strings. Make sure you
>    always keep the multiple bytes of a byte string together;
>    this is easiest to achieve by converting them to Unicode
>    temporarily. So instead of
>
>    for c in data:
>      if condition: output escape sequence
>      output c
>
>    do
>
>    udata = data.decode("UTF-8")
>    for c in udata:
>      if condition: output escape sequence
>      output c.encode("UTF-8")

This doesn't seem to work.

> 2. Implement a true Unicode API for curses, using libncursesw.
>    This would check the actual parameters to see whether they
>    are byte strings or Unicode strings, and invoke the appropriate
>    curses library (assuming you can mix curses and cursesw in single
>    terminal - or choke if somebody tries to mix byte strings and
>    Unicode strings in a single terminal).
>
>    Then, above loop becomes
>
>    udata = data.decode("UTF-8")
>    for c in udata:
>      if condition: output escape sequence
>      output c
>
>    So the difference would be that you can directly send Unicode
>    characters, instead of encoding them as UTF-8 first.

Who would be in a position to implement this? I thought that Python has been UTF-8/Unicode ready for years but it seems this is not the case for the curses bindings.

> In either case, you might need to deal with the issue of full-width
> characters (i.e. characters that consume horizontally twice as
> much space as the latin letters). Not all terminals support full-width
> in the first place; xterm is an example for a terminal that does.

-- 
Martin Michlmayr
http://www.cyrius.com/
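
For reference, option 1 from the quoted reply and the column remark can be sketched together. This is only an illustration under assumptions: it is written for Python 3 (the attached script targets Python 2), the helper names `char_width` and `encode_per_character` are made up for this sketch, and the width rule (two columns for East Asian wide or fullwidth characters, one otherwise) is a simplification that ignores combining characters and whatever the terminal actually does.

```python
import unicodedata


def char_width(ch):
    """Assumed column count: 2 for East Asian wide/fullwidth, else 1.

    Hypothetical helper; combining characters are not handled here.
    """
    return 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1


def encode_per_character(data):
    """Decode UTF-8 bytes, then re-encode one character at a time.

    Returns (encoded_bytes, columns) pairs, so the bytes belonging to one
    multi-byte character always stay together.
    """
    udata = data.decode("utf-8")
    return [(c.encode("utf-8"), char_width(c)) for c in udata]


# The three test characters from the attached script: umlaut a, bullet, ren.
sample = "\u00e4\u2022\u4eba".encode("utf-8")
chunks = encode_per_character(sample)

for encoded, columns in chunks:
    # An escape sequence could be emitted here, between whole characters.
    print(encoded, columns)
```

Each chunk can be written to the terminal as a unit, and the column count says how far the cursor is expected to move, which is not the same as the number of bytes.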
#!/usr/bin/python
# -*- coding: utf-8 -*-

import curses
import time

# UTF-8 encoded byte strings (Python 2 str)
a = 'ä'
b = '•'
c = '人'

# The same characters as unicode objects
u_a = unicode(a, "utf-8")
u_b = unicode(b, "utf-8")
u_c = unicode(c, "utf-8")

# Plain stdout output, before curses takes over the terminal
print a
print u_a
print b
print u_b
print c
print u_c

time.sleep(3)

w = curses.initscr()

# Byte strings passed to curses as-is
w.addstr('umlaut a: ' + a)
w.addstr("\n")
w.addstr('bullet: ' + b)
w.addstr("\n")
w.addstr("Chinese ren: " + c)
w.addstr("\n")
w.addstr("\n")

# unicode objects encoded back to UTF-8 before being passed to curses
w.addstr('umlaut a: ' + u_a.encode("utf-8"))
w.addstr("\n")
w.addstr('bullet: ' + u_b.encode("utf-8"))
w.addstr("\n")
w.addstr("Chinese ren: " + u_c.encode("utf-8"))
w.addstr("\n")
w.addstr("\n")

w.refresh()
time.sleep(3)
curses.endwin()
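
One thing that may be worth ruling out with a variant of the script above: the Python curses documentation notes that ncurses (since 5.4) uses the process locale to decide how to interpret non-ASCII data, so an application is expected to call `locale.setlocale()` before starting curses. The script above never does. Whether that explains the missing bullet is an assumption, not something this thread confirms, and it also presumes the bindings are linked against a UTF-8 capable ncurses. A minimal sketch in Python 3 syntax follows; `pick_encoding`, `draw` and `main` are hypothetical names.

```python
import curses
import locale


def pick_encoding():
    """Adopt the user's locale and return the encoding it prefers."""
    try:
        locale.setlocale(locale.LC_ALL, "")
    except locale.Error:
        # The environment names a locale this system lacks; keep the default.
        pass
    return locale.getpreferredencoding()


def draw(stdscr, encoding):
    """Write the three test characters, encoded for the current locale."""
    samples = (
        ("umlaut a", "\u00e4"),
        ("bullet", "\u2022"),
        ("Chinese ren", "\u4eba"),
    )
    for label, ch in samples:
        stdscr.addstr("%s: " % label)
        # Characters the locale cannot represent are replaced with "?".
        stdscr.addstr(ch.encode(encoding, "replace"))
        stdscr.addstr("\n")
    stdscr.refresh()
    stdscr.getch()  # wait for a key press before restoring the terminal


def main():
    encoding = pick_encoding()  # must run before curses is initialised
    curses.wrapper(draw, encoding)
```

Nothing runs on import; calling `main()` from a real terminal with a UTF-8 locale is what would exercise it. `curses.wrapper` is used here instead of bare `initscr()`/`endwin()` so the terminal is restored even if drawing raises.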