Re: Bytewise u??_conv_from_encoding

Bruno Haible Wed, 05 Jan 2022 12:59:34 -0800

Hello Marc,

> > (A) Using a pipe at the shell level:
> >       iconv -t UTF-8 | my-program
> >
> > (B) Using a programming language that has a coroutines concept.
> >     This way, both the decoder and the consumer can be programmed in
> >     a straightforward manner.
> >
> > (C) In C, with multiple threads.
> >
> > (D) In C, with a decoder programmed in a straightforward manner
> >     and a consumer that is written as a callback with state.
> >
> > (E) In C, with a decoder written as a callback with state
> >     and a consumer programmed in a straightforward manner.
> >
> > > Thus, I am wondering whether it makes sense to offer a stateful
> > > decoder that takes byte by byte and signals as soon as a decoded byte
> > > sequence is ready.
> >
> > It seems that you are thinking of approach (D).
> 
> > I think (D) is the worst, because writing application code in a callback
> > style with state is hard and error-prone. I would favour (E) instead,
> > if (A) is not possible.
> 
> If I understand your classification correctly, I meant something more
> like (E) than (D), I think. As an interface, I would propose
> something along the following lines:
> 
> decoder_t d = decoder_create (iconveh_t *cd);
> switch (decoder_push (d, byte))
>   {
>   case DECODER_BYTE_READ:
>     char *res = decoder_result (d);
>     size_t len = decoder_length (d);
>     ...


What does the programmer do here with res and len? This is where things
get complex.

>   case DECODER_EOF:
>     ...
>   case DECODER_INCOMPLETE:
>     ...
>   case DECODER_ERROR:
>     ...
>   }
> ...
> decoder_destroy (d);

What you describe here is (D), in my view.

(E) would look like this:

  extern decoder_t create_decoder_context (void);
  extern void push_bytes_into_decoder (const char *p, size_t n, decoder_t);
  extern void free_decoder_context (decoder_t);

> > (B) means to use a different programming language. I can't recommend
> > C++ [1].
> 
> The main problem I see with C++'s coroutines is that they are
> stackless coroutines; their expressiveness is tiny compared to
> languages with full coroutine support, to say nothing of programming
> languages like Scheme with its first-class continuations.

It doesn't surprise me. 'constexpr', another new addition to C++, similarly
does only a fraction of what would be useful.

> > (C) is possible, but complex. See e.g. gnulib's pipe-filter-ii.c or
> > pipe-filter-gi.c. Generally, threads are overkill when all you need are
> > coroutines.
> 
> I agree. Unfortunately, POSIX's response to dropping makecontext and
> friends seems to be to use threads. It would be great if C had a
> lightweight context-swapping mechanism.

Maybe. I think setcontext() has a severe problem; see
<https://www.gnu.org/software/gnulib/manual/html_node/setcontext.html>.

> By the way, libunistring's u??_conv_from_encoding does not seem to be
> adapted to consuming buffers. The problem is that one doesn't know in
> advance where boundaries of multi-byte sequences are so
> u??_conv_from_encoding will likely signal a decoding error.

Yes, u??_conv_from_encoding is made for converting entire strings.
If you want to restart conversion after some bytes that are part of
a multibyte character, you need the low-level iconv().
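
A minimal sketch of such a restart with iconv(), under stated assumptions:
the names stream_decoder, feed_chunk and convert_split_example are made up
for illustration and belong to neither libunistring nor gnulib, and the
buffer sizes are arbitrary. When a chunk ends in the middle of a multibyte
character, iconv() fails with EINVAL and leaves the input pointer at the
incomplete sequence; the sketch carries those bytes over and prepends them
to the next chunk.

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <string.h>

#define MAX_CHUNK 256   /* arbitrary limit for this sketch */
#define MAX_CARRY 16    /* room for one incomplete multibyte sequence */

/* Hypothetical state: conversion descriptor plus carried-over bytes.  */
struct stream_decoder
{
  iconv_t cd;
  char carry[MAX_CARRY];
  size_t ncarry;
};

/* Convert one chunk.  Bytes of a trailing incomplete character are
   saved in s->carry.  Returns 0 on success, -1 on any other failure
   (invalid input, output buffer too small, chunk too large).  */
static int
feed_chunk (struct stream_decoder *s, const char *chunk, size_t n,
            char **out, size_t *outleft)
{
  char buf[MAX_CARRY + MAX_CHUNK];
  char *in = buf;
  size_t inleft;

  assert (s->ncarry <= MAX_CARRY);
  if (n > MAX_CHUNK)
    return -1;
  /* Prepend what was left over from the previous chunk.  */
  memcpy (buf, s->carry, s->ncarry);
  memcpy (buf + s->ncarry, chunk, n);
  inleft = s->ncarry + n;
  s->ncarry = 0;

  if (iconv (s->cd, &in, &inleft, out, outleft) == (size_t) -1)
    {
      /* EINVAL: incomplete multibyte sequence at the end of the input.
         Anything else (EILSEQ, E2BIG) is reported as an error here.  */
      if (errno != EINVAL || inleft > MAX_CARRY)
        return -1;
      memcpy (s->carry, in, inleft);
      s->ncarry = inleft;
    }
  return 0;
}

/* Feed "a", U+00E9, "b" in UTF-8, split in the middle of U+00E9.
   Returns the number of UTF-16LE bytes produced, or -1 on failure.  */
long
convert_split_example (void)
{
  struct stream_decoder s;
  char outbuf[64];
  char *out = outbuf;
  size_t outleft = sizeof outbuf;
  int ok;

  s.cd = iconv_open ("UTF-16LE", "UTF-8");
  s.ncarry = 0;
  if (s.cd == (iconv_t) -1)
    return -1;
  ok = feed_chunk (&s, "a\xC3", 2, &out, &outleft) == 0
       && feed_chunk (&s, "\xA9" "b", 2, &out, &outleft) == 0
       && s.ncarry == 0;   /* leftover bytes at EOF = truncated input */
  iconv_close (s.cd);
  return ok ? (long) (out - outbuf) : -1;
}
```

With an iconv that knows both encoding names (glibc does), a call to
convert_split_example () should return 6, i.e. three UTF-16 code units.
A real program would also have to handle E2BIG by draining the output
buffer, and decide what to do on EILSEQ.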

Bruno
