Hi Bruno, thanks for your insights, valuable as always.
On Sat, Jan 1, 2022 at 13:57, Bruno Haible <br...@clisp.org> wrote:
>
> Hi Marc,
>
> > The demand to read a file (in local encoding) and to decode it
> > incrementally seems a typical one.
>
> There are five ways to satisfy this demand.
>
> (A) Using a pipe at the shell level:
>       iconv -t UTF-8 | my-program
>
> (B) Using a programming language that has a coroutines concept.
>     This way, both the decoder and the consumer can be programmed in
>     a straightforward manner.
>
> (C) In C, with multiple threads.
>
> (D) In C, with a decoder programmed in a straightforward manner
>     and a consumer that is written as a callback with state.
>
> (E) In C, with a decoder written as a callback with state
>     and a consumer programmed in a straightforward manner.
>
> > Thus, I am wondering whether it makes sense to offer a stateful
> > decoder that takes byte by byte and signals as soon as a decoded byte
> > sequence is ready.
>
> It seems that you are thinking of approach (D).
> I think (D) is the worst, because writing application code in a callback
> style with state is hard and error-prone. I would favour (E) instead,
> if (A) is not possible.

If I understand your classification correctly, I meant something more like
(E) than (D). The interface I would propose is something along the
following lines:

  decoder_t d = decoder_create (iconveh_t *cd);
  ...
  switch (decoder_push (d, byte))
    {
    case DECODER_BYTE_READ:
      char *res = decoder_result (d);
      size_t len = decoder_length (d);
      ...
    case DECODER_EOF:
      ...
    case DECODER_INCOMPLETE:
      ...
    case DECODER_ERROR:
      ...
    }
  ...
  decoder_destroy (d);

> (B) means to use a different programming language.

I can't recommend C++ [1]. The main problem I see with C++'s coroutines is
that they are stackless; their expressiveness is tiny compared to that of
languages with full coroutine support, to say nothing of languages like
Scheme with its first-class continuations.

> (C) is possible, but complex. See e.g.
> gnulib's pipe-filter-ii.c or
> pipe-filter-gi.c. Generally, threads are overkill when all you need are
> coroutines.

I agree. Unfortunately, POSIX's response to dropping makecontext and
friends seems to be "use threads". It would be great if C had a
lightweight context-switching mechanism.

> Now, when implementing (E), it will be useful to have some kind of "abstract
> input stream" data type. Such a thing does not exist in C, for historical
> reasons. But it can be done similarly to the "abstract output stream" data
> type that is at the heart of GNU libtextstyle [2][3][4].

I will have to take a closer look at that library.

> > On top of that, a decoding Unicode mbfile interface can be built, say
> > ucfile.
>
> One of the problems of byte-by-byte decoding is that it's inefficient. It's
> way more efficient to do the same task (decoding, consuming) on an entire
> buffer of, say, at least 1 KiB. Buffering minimizes the context switches and
> time spent in function entry/exit. That needs to be considered in the design.

The mbfile interface tries hard not to read more than necessary in advance,
in order to support interactive streams. That possibility should be
preserved, I think. In my API proposal above, decoder_push could be
redesigned to look as follows:

  int decoder_push (decoder_t decoder, char *src, size_t srclen);

By the way, libunistring's u??_conv_from_encoding does not seem to be suited
to consuming buffers incrementally. The problem is that one does not know in
advance where the boundaries of multi-byte sequences are, so
u??_conv_from_encoding will likely signal a decoding error. More helpful
would be a variant of u??_conv_from_encoding that returns the decoded part
of the string before the invalid sequence and reports the position of that
sequence. Even then, piping would not be very comfortable, because one would
have to copy the undecoded tail of the string to the beginning of the buffer
by hand and refill the rest of the buffer from the source.
Marc