Re: term, utf-8 and cooked mode, combining characters

Niels Möller Wed, 18 Sep 2002 08:45:19 -0700

Marcus Brinkmann <[EMAIL PROTECTED]> writes:

[I'm quoting this list for reference.]


> > A1. Chop the unicode stream up into graphemes.
> > A2. Convert each grapheme into the local encoding, resulting in one or
> >     more bytes each. (I think you can do this with iconv).
> > A3. Pass each grapheme to the term input handling (libtermserver),
> >     using the local encoding.
> > 
> > B1. Convert stream into local encoding.
> > B2. Chop up the stream into graphemes, using rules that depend on
> >     the local encoding. (I don't think iconv can do this easily).
> > B3. Pass the graphemes on to the term input handling.

> A2: I wonder if it is really true that one Unicode grapheme always encodes
> in at most one grapheme in the local encoding, but I guess that is a
> somewhat reasonable assumption.  Nevertheless, this feels a bit unpleasant.

If you define a "grapheme" as something that occupies exactly one
horizontal space, then it has to be the same thing regardless of
encoding. But of course there may be some exceptions.

> B2: This looks ugly, but it isn't so bad,
...
> No, neither would be acceptable, nor does libreadline do it this way.  After
> all, any program that wants to handle multibyte encodings cleanly has to do
> it, term is not really special in that regard.

As I see it, the basic problem is as follows: I (the term) receive a
string containing a newline, some characters, and a TAB. All in my
local multibyte charset. I want to echo them back, so I send back a
newline sequence, and I send the characters, and then I want to expand
the TAB. To do that I need to compute my current position on the line, from
the given string. I may have missed some useful functions, but I don't
know any easy way to do that.
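
For reference, a minimal sketch of how the column computation could be
attempted with mbrtowc and wcwidth. The function name column_after, the
fixed tab stops every 8 columns, and the choice to stop at the first
invalid sequence are assumptions made for illustration. It also only
counts what wcwidth reports per wide character, so it is only as good
as the locale's width tables.

```c
#define _XOPEN_SOURCE 700
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: return the column the cursor ends up in after
   echoing the N bytes at S, starting from column START.  Assumes tab
   stops every 8 columns (an assumption, not something term defines).  */
size_t
column_after (const char *s, size_t n, size_t start)
{
  mbstate_t ps;
  size_t col = start;

  memset (&ps, 0, sizeof ps);
  while (n > 0)
    {
      wchar_t wc;
      size_t len = mbrtowc (&wc, s, n, &ps);

      if (len == (size_t) -1 || len == (size_t) -2)
        break;                  /* Invalid or truncated sequence: stop.  */
      if (len == 0)
        len = 1;                /* An embedded NUL occupies one byte.  */

      if (wc == L'\n' || wc == L'\r')
        col = 0;                /* Back to the start of the line.  */
      else if (wc == L'\t')
        col += 8 - col % 8;     /* Advance to the next tab stop.  */
      else
        {
          int w = wcwidth (wc); /* 0 for combining, -1 if unprintable.  */
          if (w > 0)
            col += (size_t) w;
        }
      s += len;
      n -= len;
    }
  return col;
}
```

If wcwidth returns 0 for combining characters in the current locale,
a base character plus its combining marks advances the column by one,
which is the behaviour wanted here.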

I'm not terribly familiar with the mb* functions, but from the
documentation of the conversion functions (I looked at mbrtowc) it
looks like it assumes that one "multibyte character" corresponds to a
single "wide character". If "wide" means unicode, then this is not
necessarily a complete grapheme (corresponding to one unicode base
char and some combining chars). The documentation of mbrlen says

 - Function: size_t mbrlen (const char *restrict S, size_t N, mbstate_t
          *PS)
     The `mbrlen' function ("multibyte restartable length") computes
     the number of at most N bytes starting at S which form the next
     valid and complete multibyte character.

Suppose I feed it A followed by combining diaeresis above, encoded as
utf8 (the only multibyte encoding I have any familiarity with). Will
mbrlen treat that as one or two "complete multibyte characters"? It
will only occupy one horizontal space when displayed. This particular
example could be fixed by saying that strings should always use
normalization form C, but I don't think that's possible for other
multibyte encodings.
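
A small experiment would settle the question. The sketch below counts
how many "complete multibyte characters" mbrlen finds in a buffer. The
helper name count_mbchars is made up for illustration, and the result
depends on the LC_CTYPE locale in effect when it is called.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: count the multibyte characters that mbrlen
   sees in the N bytes at S, using the current locale.  Returns
   (size_t) -1 on an invalid or truncated sequence.  */
size_t
count_mbchars (const char *s, size_t n)
{
  mbstate_t ps;
  size_t count = 0;

  memset (&ps, 0, sizeof ps);
  while (n > 0)
    {
      size_t len = mbrlen (s, n, &ps);

      if (len == (size_t) -1 || len == (size_t) -2)
        return (size_t) -1;
      if (len == 0)
        len = 1;                /* A NUL character is one byte long.  */
      s += len;
      n -= len;
      count++;
    }
  return count;
}
```

After setlocale (LC_CTYPE, "") in a utf8 locale, I would expect
count_mbchars ("A\xcc\x88", 3) to return 2, since mbrlen works on one
coded character at a time and knows nothing about combining. I have
not checked what happens in other multibyte encodings.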

What does libreadline do?

> > stream of graphemes (in local encoding) to term. One could also move
> > some of the work even further away from term, into the input client.
> 
> Ideally, the input drivers don't need to be configured and adapted to your
> local encoding environment.  I would like to keep it that way if at all
> possible.

I agree completely with that. What one could do is to say that the
input driver should feed the console a sequence of unicode graphemes,
not just unicode characters. I.e. the "chopping up" part, A1, can take
place already in the input driver. Or one could move part of the
terminal input processing to the input driver; that may, for instance,
be the easiest way to make sure that backspace and dead keys interact
in the right way.
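
To make the "chopping up" in A1 concrete, here is a rough sketch that
works on an array of unicode code points. Both function names are
hypothetical, and the combining test is a deliberate simplification:
it only knows the block U+0300..U+036F, where a real input driver
would need the full unicode tables.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified assumption: only U+0300..U+036F (Combining Diacritical
   Marks) count as combining.  Real code needs the complete tables.  */
static int
is_combining (uint32_t cp)
{
  return cp >= 0x0300 && cp <= 0x036F;
}

/* Hypothetical helper: return the number of code points that make up
   the grapheme starting at CP[0], i.e. one base character plus any
   combining characters that follow it.  Returns 0 if N is 0.  */
size_t
grapheme_len (const uint32_t *cp, size_t n)
{
  size_t len;

  if (n == 0)
    return 0;
  len = 1;
  while (len < n && is_combining (cp[len]))
    len++;
  return len;
}
```

Note that on a live input stream the driver cannot tell that a
grapheme is complete until it has seen the next base character (or
decided to stop waiting), so some buffering would be needed.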

/Niels


_______________________________________________
Bug-hurd mailing list
[EMAIL PROTECTED]
http://mail.gnu.org/mailman/listinfo/bug-hurd
