Re: [Rd] [PATCH] Improve utf8clen and remove utf8_table4

Duncan Murdoch Sun, 19 Mar 2017 05:39:47 -0700

On 19/03/2017 2:31 AM, Sahil Kang wrote:

Given a char `c' which should be the start byte of a utf8 character,
the utf8clen function returns the byte length of the utf8 character.


Before this patch, the utf8clen function would return either:
     * 1 if `c' was an ascii character or a utf8 continuation byte
     * An int in the range [2, 6] indicating the byte length of the utf8
       character

With this patch, the utf8clen function will now return either:
     * -1 if `c' is not a valid utf8 start byte
     * The byte length of the utf8 character (the number of leading 1's,
       really)

I believe returning -1 for continuation bytes makes utf8clen less error
prone.
The utf8_table4 array is no longer needed and has been removed.
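
For readers without the patch at hand, the behaviour described above can be sketched roughly as follows. This is not the patch itself: the name utf8clen_sketch is hypothetical, and the handling of ASCII bytes (zero leading 1's, length 1) and of 0xFE/0xFF (more than six leading 1's, treated as invalid) are assumptions inferred from the description.

```c
/* Hypothetical sketch, not the actual patch: derive the length of a
   UTF-8 sequence from the number of leading 1 bits in its start byte. */
int utf8clen_sketch(char c)
{
    unsigned char u = (unsigned char) c;
    int ones = 0;

    /* Count leading 1 bits, most significant bit first. */
    while (ones < 8 && (u & (0x80u >> ones)))
        ones++;

    if (ones == 0) return 1;   /* 0xxxxxxx: ASCII, one byte (assumed) */
    if (ones == 1) return -1;  /* 10xxxxxx: continuation byte, not a start byte */
    if (ones > 6)  return -1;  /* 0xFE and 0xFF: assumed invalid */
    return ones;               /* 2..6: length encoded by the start byte */
}
```

Under these assumptions 'a' gives 1, 0xC3 gives 2, 0xE2 gives 3, and a continuation byte such as 0x80 gives -1 rather than 1.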

utf8clen is used internally by R in more than a dozen places, and is likely used in packages as well. Have you checked that this change in semantics won't break any of those uses?
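
To make the concern concrete: a caller that advances a pointer by the returned length would step backwards on -1 unless it checks for it. The following is only an illustrative sketch; clen_patched and count_chars_guarded are hypothetical names and are not taken from R's sources or from the patch.

```c
#include <stddef.h>

/* Hypothetical stand-in for the patched function: -1 for bytes that
   cannot start a UTF-8 character, otherwise the sequence length. */
static int clen_patched(char c)
{
    unsigned char u = (unsigned char) c;
    if (u < 0x80) return 1;    /* ASCII */
    if (u < 0xC0) return -1;   /* continuation byte */
    if (u < 0xE0) return 2;
    if (u < 0xF0) return 3;
    if (u < 0xF8) return 4;
    if (u < 0xFC) return 5;
    if (u < 0xFE) return 6;
    return -1;                 /* 0xFE, 0xFF: assumed invalid */
}

/* Hypothetical caller: counts characters by stepping through the string.
   Without the guard, "s += len" with len == -1 would move backwards. */
size_t count_chars_guarded(const char *s)
{
    size_t n = 0;
    while (*s) {
        int len = clen_patched(*s);
        if (len < 1)
            len = 1;           /* invalid start byte: consume a single byte */
        while (len-- > 0 && *s)
            s++;               /* never step past the terminating NUL */
        n++;
    }
    return n;
}
```

Any existing call site that relies on the old return value of 1 for continuation bytes would need an equivalent guard.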


Duncan Murdoch

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
