There seem to be some claims that you cannot random-access a UTF-8 string with errors in it. This is false if you restrict errors to strict patterns that never contain a valid encoding, and it is easiest with my recommendation that every error be exactly 1 byte long.

To keep this sample code simple, the buffer has a 0 byte before the first actual byte and another after the last one. This avoids the need to pass the buffer ends to the functions. Real implementations may need to pass a pointer to one or both ends.
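
For illustration, here is one way such a guarded buffer could be built. This is only a sketch: utf8_guarded_copy and utf8_guarded_free are hypothetical names that do not appear in the sample code, and a real program might simply reserve the guard bytes in whatever buffer it already owns.

```c
#include <stdlib.h>
#include <string.h>

// Hypothetical helper (an assumption, not part of the sample code): copies
// n bytes of text into a new allocation with a 0 byte on each side and
// returns a pointer to the first real byte, or NULL if allocation fails.
unsigned char* utf8_guarded_copy(const char* text, size_t n)
{
  unsigned char* raw = malloc(n + 2);
  if (!raw) return NULL;
  raw[0] = 0;                // guard byte before the first actual byte
  memcpy(raw + 1, text, n);
  raw[n + 1] = 0;            // guard byte after the last actual byte
  return raw + 1;
}

// Releases a buffer returned by utf8_guarded_copy().
void utf8_guarded_free(unsigned char* first)
{
  if (first) free(first - 1);
}
```

The returned pointer can be handed to the functions below; p[-1] and p[n] are then the two guard bytes.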

// Returns the length of the UTF-8 code point starting at p,
// or returns 0 if the bytes at p are not a valid encoding. The rest
// of this code treats 0 as a 1-byte-long "code point".
int utf8_length(const unsigned char* p)
{
  if (*p < 0x80) return 1; // ASCII
  else if (*p < 0xC2) return 0; // continuation bytes and overlong lead bytes
  else ... // multi-byte codes
}
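
The multi-byte branch is left out above. One way it could be filled in is sketched below. The is_cont helper is a hypothetical name, and rejecting surrogates and values above U+10FFFF is my assumption (it follows the usual definition of well-formed UTF-8); the message itself does not say which sequences should count as errors. The sketch relies on the 0 byte after the buffer: every check stops at the first byte that is not a continuation byte, so it never reads past that terminator.

```c
// Hypothetical helper: nonzero if c is a continuation byte (10xxxxxx).
static int is_cont(unsigned char c) { return (c & 0xC0) == 0x80; }

// Sketch of a complete utf8_length(): returns 1..4 for a valid encoding
// starting at p, or 0 if the bytes at p are not a valid encoding.
int utf8_length(const unsigned char* p)
{
  if (p[0] < 0x80) return 1;                     // ASCII
  if (p[0] < 0xC2) return 0;                     // continuation or overlong lead
  if (p[0] < 0xE0)                               // 2-byte lead: C2..DF
    return is_cont(p[1]) ? 2 : 0;
  if (p[0] < 0xF0) {                             // 3-byte lead: E0..EF
    if (!is_cont(p[1])) return 0;
    if (p[0] == 0xE0 && p[1] < 0xA0) return 0;   // overlong 3-byte form
    if (p[0] == 0xED && p[1] >= 0xA0) return 0;  // surrogates (assumed invalid)
    return is_cont(p[2]) ? 3 : 0;
  }
  if (p[0] < 0xF5) {                             // 4-byte lead: F0..F4
    if (!is_cont(p[1])) return 0;
    if (p[0] == 0xF0 && p[1] < 0x90) return 0;   // overlong 4-byte form
    if (p[0] == 0xF4 && p[1] >= 0x90) return 0;  // above U+10FFFF
    if (!is_cont(p[2])) return 0;
    return is_cont(p[3]) ? 4 : 0;
  }
  return 0;                                      // F5..FF are never lead bytes
}
```

Because a truncated or malformed sequence returns 0 for its lead byte, each of its bytes ends up being treated as a separate 1-byte error, which matches the recommendation at the top.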

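// Returns the start of the UTF-8 code point that contains the byte at p.
// If that byte is not part of a valid encoding, returns p itself
// (a 1-byte error).
const unsigned char* utf8_start(const unsigned char* p)
{
  for (int i = 0; i < 4; i++) {
    if (utf8_length(p-i) > i) return p-i;
    // Only a continuation byte can have its lead byte further back, so
    // stop here. This also keeps the scan from going past the leading 0.
    if ((p[-i] & 0xC0) != 0x80) break;
  }
  return p;
}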

// p is assumed to point at the start of a code point, return the next
// one, or the 0 off the end of the buffer
const unsigned char* utf8_next(const unsigned char* p)
{
  int n = utf8_length(p);
  return p + (n ? n : 1);
}

// p is assumed to point at the start of a code point, return the
// previous one, or the 0 before the start of the buffer
const unsigned char* utf8_prev(const unsigned char* p)
{
  return utf8_start(p-1);
}
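
For implementations that cannot add guard bytes, the same idea works when the buffer ends are passed in. The following is a rough sketch under my own assumptions: the names utf8_length_in and utf8_start_in and their signatures are hypothetical, [begin, end) is taken to be the valid byte range, and surrogates and values above U+10FFFF are treated as errors.

```c
#include <stddef.h>

// Nonzero if p is inside the buffer and points at a continuation byte.
static int cont_at(const unsigned char* p, const unsigned char* end)
{
  return p < end && (*p & 0xC0) == 0x80;
}

// Length of the valid encoding starting at p (p < end), or 0 if none.
int utf8_length_in(const unsigned char* p, const unsigned char* end)
{
  unsigned char c = *p;
  int n;
  if (c < 0x80) return 1;                  // ASCII
  if (c < 0xC2) return 0;                  // continuation or overlong lead
  n = c < 0xE0 ? 2 : c < 0xF0 ? 3 : c < 0xF5 ? 4 : 0;
  if (n == 0) return 0;                    // F5..FF are never lead bytes
  for (int i = 1; i < n; i++)
    if (!cont_at(p + i, end)) return 0;    // truncated or malformed
  if (c == 0xE0 && p[1] < 0xA0) return 0;  // overlong 3-byte form
  if (c == 0xED && p[1] > 0x9F) return 0;  // surrogates (assumed invalid)
  if (c == 0xF0 && p[1] < 0x90) return 0;  // overlong 4-byte form
  if (c == 0xF4 && p[1] > 0x8F) return 0;  // above U+10FFFF
  return n;
}

// Snaps any byte position p in [begin, end) to the start of the code
// point containing it; a byte that belongs to no valid encoding is its
// own 1-byte "code point".
const unsigned char* utf8_start_in(const unsigned char* begin,
                                   const unsigned char* p,
                                   const unsigned char* end)
{
  for (ptrdiff_t i = 0; i < 4 && i <= p - begin; i++) {
    if (utf8_length_in(p - i, end) > i) return p - i;
    if ((p[-i] & 0xC0) != 0x80) break;     // nothing earlier can reach p
  }
  return p;
}
```

Random access then amounts to picking any byte offset and snapping it with utf8_start_in(buf, buf + offset, buf + size). A stray continuation byte that follows a complete code point snaps to itself, since no valid encoding covers it.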