Paul Eggert wrote:
> > It gets this info from mbrtoc32, which on most platforms gets this info
> > from mbrtowc. This multibyte scanner knows when the bytes it has seen
> > so far constitute
> >    - a complete character, or
> >    - an invalid character, or
> >    - an incomplete character (i.e. if additional bytes may lead to a
> >      complete character).
> 
> Ah, I had thought that the idea was to treat all the bytes of a byte 
> sequence from 10646-1[1] R.2 Table 1 as a single invalid "character" 
> (i.e., not a real character) if the byte sequence is not valid UTF-8.

An arbitrary sequence of invalid bytes (which therefore could be
arbitrarily long) is not meant here. This would not produce good results
for the user, and would not be implementable in O(1) space.

> That's what Kuhn seems to be suggesting in [2].

Note also that Markus Kuhn suggested many things in his test file,
and only a few of them have been widely adopted (some because an
API, such as mbrtowc(), made them easy to adopt, some because they
were security-relevant). The majority have not been adopted. xterm
is probably the only terminal emulator that renders the entire section 3
of [2] as Markus Kuhn proposed.

> For example, as I understand it, the byte sequence F4 90 80 80, which I 
> had thought you were saying would be treated as a single byte sequence 
> [F4 90 80 80] because that's in R.2 Table 1, would instead be treated as 
> [F4 90] [80] [80], because [F4 90] is not an incomplete character 
> (additional bytes cannot lead to a complete character).

Yes, here [F4 90] is not an incomplete character, but rather an invalid
multibyte sequence, per [3] p. 125 table 3-7.

The discussion of "maximal subparts" in [3] p. 127 is therefore not
relevant for F4 90 80 80.

Per [3] p. 127 paragraph 3, decoders can decompose it to
  [F4 90] [80] [80]
or to
  [F4] [90] [80] [80]

Since mbrtowc() returns (size_t)(-1) for this sequence, without telling
how long the invalid sequence was, decoders/scanners that are based on
mbrtowc() (or mbrtoc32()) will decompose it like this:
  [F4] [90] [80] [80]
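
To illustrate, here is a minimal sketch of such a scanner. It does not call
mbrtowc() itself (whose result depends on the locale); instead it uses a
hypothetical helper, utf8_classify(), that I made up for this example and
that follows the well-formedness ranges of [3] table 3-7 while mimicking the
mbrtowc() return convention. The names utf8_classify() and decompose() are
assumptions, not part of any library, and the handling of a NUL byte is
simplified.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not a library function.  Mimics the mbrtowc()
   return convention for UTF-8 input (simplified: a NUL byte yields 1):
     > 0          length of the complete character at the start of s,
     (size_t)-1   the bytes seen so far are invalid,
     (size_t)-2   the bytes seen so far are an incomplete character.  */
static size_t
utf8_classify (const unsigned char *s, size_t n)
{
  if (n == 0)
    return (size_t) -2;

  unsigned char b0 = s[0];
  unsigned char lo = 0x80, hi = 0xBF;   /* allowed range of the 2nd byte */
  size_t len;

  if (b0 < 0x80)
    return 1;
  else if (b0 >= 0xC2 && b0 <= 0xDF)
    len = 2;
  else if (b0 >= 0xE0 && b0 <= 0xEF)
    {
      len = 3;
      if (b0 == 0xE0) lo = 0xA0;        /* reject overlong forms */
      if (b0 == 0xED) hi = 0x9F;        /* reject surrogates */
    }
  else if (b0 >= 0xF0 && b0 <= 0xF4)
    {
      len = 4;
      if (b0 == 0xF0) lo = 0x90;        /* reject overlong forms */
      if (b0 == 0xF4) hi = 0x8F;        /* reject values > U+10FFFF */
    }
  else
    return (size_t) -1;                 /* 80..C1 and F5..FF never lead */

  for (size_t i = 1; i < len; i++)
    {
      if (i >= n)
        return (size_t) -2;             /* more bytes could complete it */
      unsigned char min = (i == 1 ? lo : 0x80);
      unsigned char max = (i == 1 ? hi : 0xBF);
      if (s[i] < min || s[i] > max)
        return (size_t) -1;             /* no continuation can fix this */
    }
  return len;
}

/* Splits s[0..n-1] into units and stores each unit's length in unit_len[],
   which must have room for n entries.  Returns the number of units.
   Since the classifier does not report how long an invalid sequence was,
   an invalid (or trailing incomplete) sequence is consumed one byte at a
   time.  */
static size_t
decompose (const unsigned char *s, size_t n, size_t *unit_len)
{
  assert (s != NULL && unit_len != NULL);
  size_t count = 0;
  while (n > 0)
    {
      size_t ret = utf8_classify (s, n);
      size_t step = (ret == (size_t) -1 || ret == (size_t) -2 ? 1 : ret);
      unit_len[count++] = step;
      s += step;
      n -= step;
    }
  return count;
}
```

With this sketch, decompose() splits F4 90 80 80 into four units of one byte
each, whereas the valid sequence F0 90 80 80 (U+10000) comes out as a single
unit of four bytes.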

Bruno

> [1]: https://www.cl.cam.ac.uk/~mgk25/ucs/ISO-10646-UTF-8.html
> [2]: https://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt

[3] https://www.unicode.org/versions/Unicode15.0.0/ch03.pdf