On Sat, 24 Jun 2017 19:48:07 +0100
Daniel Boles <dboles....@gmail.com> wrote:
> On 24 June 2017 at 19:12, Chris Vine <vine35792...@gmail.com> wrote:
>
> > On Sat, 24 Jun 2017 19:08:36 +0100
> > Chris Vine <vine35792...@gmail.com> wrote:
> >
> > > It is because UTF-8 is a multibyte encoding, and any one
> > > character may require between 1 and 4 bytes to represent it.  If
> > > you were allowed to change a byte at will you would be able to
> > > introduce invalid encoding sequences.  As to the absence of
> > > documentation, maybe it is because this was thought to be
> > > self-evident, dunno.
> >
> > And I should perhaps also make the point that these operators
> > return a 32-bit Unicode character, not a byte, which is consequent
> > on the same point.  If you allowed mutation, the length of the
> > string (in bytes) might change.
>
> Right, of course. It does seem very obvious now. It seemed to
> completely slip my mind that we're dealing with characters of
> arbitrary width, not e.g. UTF-16. :( Thanks for the comprehensive
> answer to a stupid question!
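
To make the width point concrete, here is a minimal sketch in plain
standard C++ (no glibmm involved). The names encode_utf8, utf16_units
and replace_char are made up for this illustration and are not part of
any library. It encodes a code point in the 1 to 4 bytes that current
UTF-8 (RFC 3629) allows, counts the 16-bit units UTF-16 would need, and
shows that replacing one character can change the byte length of the
string. It assumes the input string is already valid UTF-8.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Encode one Unicode scalar value as UTF-8.
// Precondition (checked): cp <= 0x10FFFF and cp is not a surrogate.
std::string encode_utf8(char32_t cp) {
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    std::string out;
    if (cp < 0x80) {                       // 1 byte: ASCII
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes: rest of the BMP
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 4 bytes: outside the BMP
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Number of 16-bit code units UTF-16 needs for cp:
// one inside the BMP, a surrogate pair (two) outside it.
int utf16_units(char32_t cp) {
    return cp < 0x10000 ? 1 : 2;
}

// Return a copy of the UTF-8 string s with its n-th character
// (counted in characters, not bytes) replaced by cp.
// If s has fewer than n + 1 characters, s is returned unchanged.
std::string replace_char(const std::string& s, std::size_t n, char32_t cp) {
    auto is_continuation = [](unsigned char b) { return (b & 0xC0) == 0x80; };

    // Find the byte offset of the lead byte of character n.
    std::size_t start = 0;
    for (std::size_t seen = 0; start < s.size(); ++start) {
        if (!is_continuation(s[start])) {
            if (seen == n) break;
            ++seen;
        }
    }
    if (start >= s.size()) return s;

    // Skip the continuation bytes that belong to that character.
    std::size_t end = start + 1;
    while (end < s.size() && is_continuation(s[end])) ++end;

    return s.substr(0, start) + encode_utf8(cp) + s.substr(end);
}
```

With this sketch, replace_char("abc", 1, 0xE9) turns a 3-byte string
into a 4-byte one, because U+00E9 needs two bytes in UTF-8. That is why
assigning through an index cannot be a simple in-place byte write.
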

UTF-16 is also a variable-width encoding, with surrogate pairs for
anything outside the Basic Multilingual Plane, which is why UTF-16 is
regarded by many as a fairly unhelpful encoding.  It does have the
feature that average Japanese text occupies slightly less space in
UTF-16 than in UTF-8.  The same is not true of Chinese text though.

_______________________________________________
gtkmm-list mailing list
gtkmm-list@gnome.org
https://mail.gnome.org/mailman/listinfo/gtkmm-list