Marcus Brinkmann <[EMAIL PROTECTED]> writes:

[I'm quoting this list for reference]
> > A1. Chop the unicode stream up into graphemes.
> > A2. Convert each grapheme into the local encoding, resulting in one or
> >     more bytes each. (I think you can do this with iconv).
> > A3. Pass each grapheme to the term input handling (libtermserver),
> >     using the local encoding.
> >
> > B1. Convert stream into local encoding.
> > B2. Chop up the stream into graphemes, using rules that depend on
> >     the local encoding. (I don't think iconv can do this easily).
> > B3. Pass the graphemes on to the term input handling.

> A2: I wonder if it is really true that one Unicode grapheme always encodes
> in at most one grapheme in the local encoding, but I guess that is a
> somewhat reasonable assumption. Nevertheless, this feels a bit unpleasant.

If you define a "grapheme" as something that occupies exactly one
horizontal space, then it has to be the same thing regardless of
encoding. But of course there may be some exceptions.

> B2: This looks ugly, but it isn't so bad, ...

> No, neither would be acceptable, nor does libreadline do it this way. After
> all, any program that wants to handle multibyte encodings cleanly has to do
> it, term is not really special in that regard.

As I see it, the basic problem is as follows: I (the term) receive a
string containing a newline, some characters, and a TAB. All in my
local multibyte charset. I want to echo them back, so I send back a
newline sequence, and I send the characters, and then I want to expand
the TAB. To do that I need to compute my current position on the line,
from the given string. I may have missed some useful functions, but I
don't know any easy way to do that.

I'm not terribly familiar with the mb* functions, but from the
documentation of the conversion functions (I looked at mbrtowc) it
looks like it assumes that one "multibyte character" corresponds to a
single "wide character".
If "wide" means unicode, then this is not necessarily a complete
grapheme (corresponding to one unicode base char and some combining
chars). The documentation of mbrlen says

 - Function: size_t mbrlen (const char *restrict S, size_t N,
          mbstate_t *PS)
     The `mbrlen' function ("multibyte restartable length") computes
     the number of at most N bytes starting at S which form the next
     valid and complete multibyte character.

If I feed it A + combining diaeresis above, encoded as utf8, will
mbrlen treat that as one or two "complete multibyte characters"? It
will only occupy one horizontal space when displayed. (utf8 is the only
multibyte encoding I have any familiarity with. This particular example
could be fixed by saying that strings should always use normalization
form C, but I don't think that's possible for other multibyte
encodings.) What does libreadline do?

> > stream of graphemes (in local encoding) to term. One could also move
> > some of the work even further away from term, into the input client.
>
> Ideally, the input drivers don't need to be configured and adapted to your
> local encoding environment. I would like to keep it that way if at all
> possible.

I agree completely with that. What one could do is to say that the
input driver should feed the console a sequence of unicode graphemes,
not just unicode characters. I.e. the "chopping up" part, A1, can take
place already in the input driver.

Or one could move part of the terminal input processing to the input
driver; for instance, that may be the easiest way to make sure that
backspace and dead keys interact in the right way.

/Niels

_______________________________________________
Bug-hurd mailing list
[EMAIL PROTECTED]
http://mail.gnu.org/mailman/listinfo/bug-hurd
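
As a footnote to the TAB-expansion question in the message above, here
is a rough sketch of what "chop into graphemes, then count columns"
could look like. It is only an illustration under loud assumptions, not
what term or libreadline does: the local encoding is taken to be utf8,
the decoder is hand-written so that nothing depends on the locale, only
the block U+0300..U+036F is treated as combining, every other grapheme
is assumed to occupy one column, and tab stops are every 8 columns. The
names utf8_decode, is_combining, grapheme_len and column_after are all
made up for this example.

```c
#include <stddef.h>

/* Hypothetical helper: decode one utf8 sequence of at most n bytes at s.
   Stores the code point in *cp and returns the bytes consumed,
   or 0 if the input is empty, truncated or malformed. */
static size_t utf8_decode(const unsigned char *s, size_t n, unsigned long *cp)
{
    size_t len, i;

    if (n == 0)
        return 0;
    if (s[0] < 0x80) { *cp = s[0]; return 1; }
    else if ((s[0] & 0xE0) == 0xC0) { len = 2; *cp = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { len = 3; *cp = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { len = 4; *cp = s[0] & 0x07; }
    else return 0;
    if (len > n)
        return 0;
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;
        *cp = (*cp << 6) | (s[i] & 0x3F);
    }
    return len;
}

/* Simplifying assumption: only U+0300..U+036F count as combining. */
static int is_combining(unsigned long cp)
{
    return cp >= 0x0300 && cp <= 0x036F;
}

/* Byte length of the next grapheme: one base character plus any
   combining marks that follow it.  Control characters never take
   combining marks here.  Returns 0 at end of input or on bad input. */
static size_t grapheme_len(const char *s, size_t n)
{
    const unsigned char *p = (const unsigned char *) s;
    unsigned long cp;
    size_t used = utf8_decode(p, n, &cp);
    size_t k;

    if (used == 0 || cp < 0x20)
        return used;
    while (used < n
           && (k = utf8_decode(p + used, n - used, &cp)) != 0
           && is_combining(cp))
        used += k;
    return used;
}

/* Column reached after echoing the n bytes at s, starting at column 0.
   Newline resets the column, TAB moves to the next multiple of 8, and
   every other grapheme is assumed to be one column wide.  Counting
   stops at the first malformed sequence. */
static size_t column_after(const char *s, size_t n)
{
    size_t col = 0, i = 0, k;

    while (i < n && (k = grapheme_len(s + i, n - i)) != 0) {
        if (s[i] == '\n')
            col = 0;
        else if (s[i] == '\t')
            col = (col / 8 + 1) * 8;
        else
            col++;
        i += k;
    }
    return col;
}
```

With these definitions, grapheme_len("A\xCC\x88", 3) is 3, i.e. A plus
the combining diaeresis is treated as one unit even though it is two
characters. A real implementation would need the full Unicode rules for
combining and double-width characters, and a different chopping routine
for every other local encoding, which is exactly the ugly part of B2.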