On Sat, 24 Jun 2017 19:48:07 +0100
Daniel Boles <dboles....@gmail.com> wrote:
> On 24 June 2017 at 19:12, Chris Vine <vine35792...@gmail.com> wrote:
> 
> > On Sat, 24 Jun 2017 19:08:36 +0100
> > Chris Vine <vine35792...@gmail.com> wrote:
> >  
> > > It is because UTF-8 is a multibyte encoding, and any one
> > > character may require between 1 and 4 bytes to represent it.  If
> > > you were allowed to change a byte at will you would be able to
> > > introduce invalid encoding sequences.  As to the absence of
> > > documentation, maybe it is because this was thought to be
> > > self-evident, dunno.  
> >
> > And I should perhaps also make the point that these operators
> > return a 32-bit Unicode character, not a byte, which is consequent
> > on the same point.  If you allowed mutation, the length of the
> > string (in bytes) might change.  
> 
> Right, of course. It does seem very obvious now. It seemed to
> completely slip my mind that we're dealing with characters of
> arbitrary width, not e.g. UTF-16. :( Thanks for the comprehensive
> answer to a stupid question!

UTF-16 is also a variable-width encoding, with surrogate pairs for
anything outside the Basic Multilingual Plane, which is why UTF-16 is
regarded by many as a fairly unhelpful encoding.  It does have the
feature that average Japanese text occupies slightly less space in
UTF-16 than in UTF-8.  The same is not true of Chinese text, though.
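
To make the width arithmetic concrete, here is a minimal sketch in
standard C++.  The helper names (utf8_length, utf16_length,
high_surrogate, low_surrogate) are made up for illustration and are not
part of glibmm or any other library; the sketch also assumes the input
is a valid Unicode scalar value and does no validation.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helpers for illustration only; no input validation.

// Bytes needed to encode one code point in UTF-8 (1 to 4).
constexpr std::size_t utf8_length(std::uint32_t cp) {
    return cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
}

// Bytes needed in UTF-16: one 16-bit unit inside the Basic
// Multilingual Plane, two units (a surrogate pair) outside it.
constexpr std::size_t utf16_length(std::uint32_t cp) {
    return cp < 0x10000 ? 2 : 4;
}

// Surrogate pair for a code point above U+FFFF: subtract 0x10000,
// then split the remaining 20 bits into two 10-bit halves.
constexpr std::uint16_t high_surrogate(std::uint32_t cp) {
    return static_cast<std::uint16_t>(0xD800 + ((cp - 0x10000) >> 10));
}

constexpr std::uint16_t low_surrogate(std::uint32_t cp) {
    return static_cast<std::uint16_t>(0xDC00 + ((cp - 0x10000) & 0x3FF));
}
```

With these, U+3042 (a hiragana letter) takes 3 bytes in UTF-8 and 2 in
UTF-16, while U+1F600, which lies outside the Basic Multilingual Plane,
takes 4 bytes in both and becomes the pair 0xD83D 0xDE00 in UTF-16.
Plain ASCII goes the other way: 1 byte in UTF-8, 2 in UTF-16.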