Hi Bruno, thanks for your insights, valuable as always.
On Sat, Jan 1, 2022 at 13:57, Bruno Haible <br...@clisp.org> wrote:
>
> Hi Marc,
>
> > The demand to read a file (in local encoding) and to decode it
> > incrementally seems a typical one.
>
> There are five ways to satisfy this demand.
>
> (A) Using a pipe at the shell level:
>       iconv -t UTF-8 | my-program
>
> (B) Using a programming language that has a coroutines concept.
>     This way, both the decoder and the consumer can be programmed in
>     a straightforward manner.
>
> (C) In C, with multiple threads.
>
> (D) In C, with a decoder programmed in a straightforward manner
>     and a consumer that is written as a callback with state.
>
> (E) In C, with a decoder written as a callback with state
>     and a consumer programmed in a straightforward manner.
>
> > Thus, I am wondering whether it makes sense to offer a stateful
> > decoder that takes byte by byte and signals as soon as a decoded byte
> > sequence is ready.
>
> It seems that you are thinking of approach (D).
> I think (D) is the worst, because writing application code in a callback
> style with state is hard and error-prone. I would favour (E) instead,
> if (A) is not possible.

If I understand your classification correctly, I meant something more like
(E) than (D). The interface I would propose is something along the
following lines:

  decoder_t d = decoder_create (iconveh_t *cd);
  ...
  switch (decoder_push (d, byte))
    {
    case DECODER_BYTE_READ:
      char *res = decoder_result (d);
      size_t len = decoder_length (d);
      ...
    case DECODER_EOF:
      ...
    case DECODER_INCOMPLETE:
      ...
    case DECODER_ERROR:
      ...
    }
  ...
  decoder_destroy (d);

> (B) means to use a different programming language.

I can't recommend C++ [1]. The main problem I see with C++'s coroutines is
that they are stackless; their expressiveness is tiny compared to that of
languages with full coroutine support, to say nothing of languages like
Scheme with its first-class continuations.

> (C) is possible, but complex. See e.g.
> gnulib's pipe-filter-ii.c or
> pipe-filter-gi.c. Generally, threads are overkill when all you need are
> coroutines.

I agree. Unfortunately, POSIX's response to dropping makecontext and
friends seems to be "use threads". It would be great if C had a
lightweight context-switching mechanism.

> Now, when implementing (E), it will be useful to have some kind of "abstract
> input stream" data type. Such a thing does not exist in C, for historical
> reasons. But it can be done similarly to the "abstract output stream" data
> type that is at the heart of GNU libtextstyle [2][3][4].

I will have to take a closer look at that library.

> > On top of that, a decoding Unicode mbfile interface can be built, say
> > ucfile.
>
> One of the problems of byte-by-byte decoding is that it's inefficient. It's
> way more efficient to do the same task (decoding, consuming) on an entire
> buffer of, say, at least 1 KiB. Buffering minimizes the context switches and
> time spent in function entry/exit. That needs to be considered in the design.

The mbfile interface tries hard not to read more than necessary in advance,
in order to support interactive streams. That possibility should be
preserved, I think. In my API proposal above, decoder_push could be
redesigned to look as follows:

  int decoder_push (decoder_t decoder, char *src, size_t srclen);

By the way, libunistring's u??_conv_from_encoding does not seem to be suited
to consuming buffers incrementally. The problem is that one does not know in
advance where the boundaries of multi-byte sequences are, so
u??_conv_from_encoding will likely signal a decoding error. More helpful
would be a variant of u??_conv_from_encoding that returns the decoded part
of the string before the invalid sequence and reports the position of that
sequence. Even then, piping would not be very comfortable, because one would
have to copy the undecoded tail of the string to the beginning of the buffer
by hand and refill the rest of the buffer from the source.
Marc