Eric Blake <[email protected]> writes:
> On 08/08/2018 07:02 AM, Markus Armbruster wrote:
>> We reject bytes that can't occur in valid UTF-8 (\xC0..\xC1,
>> \xF5..\xFF) in the lexer. That's insufficient; there's plenty of
>> invalid UTF-8 not containing these bytes, as demonstrated by
>> check-qjson:
>>
>> * Malformed sequences
>>
>>   - Unexpected continuation bytes
>>
>>   - Missing continuation bytes after start bytes other than
>>     \xC0..\xC1, \xF5..\xFD.
>>
>> * Overlong sequences with start bytes other than \xC0..\xC1,
>>   \xF5..\xFD.
>>
>> * Invalid code points
>>
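For concreteness, a few byte sequences that slip past the lexer's byte
check even though they're invalid (illustrative examples of the
categories above, not lifted from check-qjson):

    /* None of these contain a byte in \xC0..\xC1 or \xF5..\xFF: */
    const char unexpected_cont[] = "\x80";         /* lone continuation byte */
    const char missing_cont[]    = "\xC2!";        /* start byte, no continuation */
    const char overlong[]        = "\xE0\x80\xAF"; /* overlong encoding of '/' */
    const char surrogate[]       = "\xED\xA0\x80"; /* U+D800, an invalid code point */
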
>> Fixing this in the lexer would be bothersome. Fixing it in the parser
>> is straightforward, so do that.
>>
>> Signed-off-by: Markus Armbruster <[email protected]>
>> ---
>
>> @@ -193,12 +198,15 @@ static QString *qstring_from_escaped_str(JSONParserContext *ctxt,
>>                  goto out;
>>              }
>>          } else {
>> -            char dummy[2];
>> -
>> -            dummy[0] = *ptr;
>> -            dummy[1] = 0;
>> -
>> -            qstring_append(str, dummy);
>> +            cp = mod_utf8_codepoint(ptr, 6, &end);
>
> Why are you hard-coding 6 here, rather than computing min(6,
> strchr(ptr,0)-ptr)? If the user passes an invalid sequence at the end
> of the string, can we end up making mod_utf8_codepoint() read beyond
> the end of our string? Would it be better to just always pass the
> remaining string length (mod_utf8_codepoint() only cares about
> stopping short of 6 bytes, but never reads beyond there even if you
> pass a larger number)?
mod_utf8_codepoint() never reads beyond '\0'. The second parameter
exists only so you can further limit reads. I like to provide that
capability, because it sometimes saves a silly substring copy.
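For illustration, here's the kind of call that cap enables
(peek_codepoint() is a made-up helper, not part of the patch; the
signature is the one from include/qemu/unicode.h):

    /*
     * Decode the code point starting at buf[pos], looking at most
     * @limit bytes ahead.  mod_utf8_codepoint() stops at '\0' or
     * after @limit bytes, whichever comes first, so there's no need
     * to NUL-terminate a substring copy just to bound the read.
     */
    static int peek_codepoint(const char *buf, size_t pos, size_t limit)
    {
        char *end;

        return mod_utf8_codepoint(buf + pos, limit, &end);
    }
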
>> +            if (cp <= 0) {
>> +                parse_error(ctxt, token, "invalid UTF-8 sequence in string");
>> +                goto out;
>> +            }
>> +            ptr = end - 1;
>> +            len = mod_utf8_encode(utf8_buf, sizeof(utf8_buf), cp);
>> +            assert(len >= 0);
>> +            qstring_append(str, utf8_buf);
>>          }
>>      }
>>
>
>> +++ b/util/unicode.c
>> @@ -13,6 +13,21 @@
>>  #include "qemu/osdep.h"
>>  #include "qemu/unicode.h"
>>
>
>> +ssize_t mod_utf8_encode(char buf[], size_t bufsz, int codepoint)
>> +{
>> +    assert(bufsz >= 5);
>> +
>> +    if (!is_valid_codepoint(codepoint)) {
>> +        return -1;
>> +    }
>> +
>> +    if (codepoint > 0 && codepoint <= 0x7F) {
>> +        buf[0] = codepoint & 0x7F;
>
> Dead use of binary &. But acceptable for symmetry with the other code
> branches.
Exactly as dead as ...
>> +        buf[1] = 0;
>> +        return 1;
>> +    }
>> +    if (codepoint <= 0x7FF) {
>> +        buf[0] = 0xC0 | ((codepoint >> 6) & 0x1F);
... this one, and ...
>> +        buf[1] = 0x80 | (codepoint & 0x3F);
>> +        buf[2] = 0;
>> +        return 2;
>> +    }
>> +    if (codepoint <= 0xFFFF) {
>> +        buf[0] = 0xE0 | ((codepoint >> 12) & 0x0F);
... this one, and ...
>> +        buf[1] = 0x80 | ((codepoint >> 6) & 0x3F);
>> +        buf[2] = 0x80 | (codepoint & 0x3F);
>> +        buf[3] = 0;
>> +        return 3;
>> +    }
>> +    buf[0] = 0xF0 | ((codepoint >> 18) & 0x07);
... even this one.
The last one is dead only because is_valid_codepoint() rejects
codepoints > 0x10FFFFu, which is admittedly a non-local argument.
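Spelled out (my arithmetic, not in the patch):

    /*
     * cp <= 0x7F     =>  cp       <= 0x7F, so "& 0x7F" is a no-op
     * cp <= 0x7FF    =>  cp >> 6  <= 0x1F, so "& 0x1F" is a no-op
     * cp <= 0xFFFF   =>  cp >> 12 <= 0x0F, so "& 0x0F" is a no-op
     * cp <= 0x10FFFF =>  cp >> 18 <= 0x04, so "& 0x07" is a no-op
     *
     * The continuation bytes' "& 0x3F" masks are live, though: they
     * select the low six bits out of a wider value.
     */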
I'm debating whether to keep or drop the redundant masking. Got a
preference?
>> +    buf[1] = 0x80 | ((codepoint >> 12) & 0x3F);
>> +    buf[2] = 0x80 | ((codepoint >> 6) & 0x3F);
>> +    buf[3] = 0x80 | (codepoint & 0x3F);
>> +    buf[4] = 0;
>> +    return 4;
>> +}
>>
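In case it helps, a quick usage sketch (not from the patch):

    char utf8_buf[5];   /* up to 4 UTF-8 bytes plus NUL, hence bufsz >= 5 */
    ssize_t len = mod_utf8_encode(utf8_buf, sizeof(utf8_buf), 0xE9);
    /* returns 2, utf8_buf now holds "\xC3\xA9" ("é"), NUL-terminated */
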
>
> Overall, looks nice.
Thanks!