Eric Blake <[email protected]> writes:
> On 08/17/2018 10:05 AM, Markus Armbruster wrote:
>> For input 0123, the lexer produces the tokens
>>
>> JSON_ERROR 01
>> JSON_INTEGER 23
>>
>> Reporting an error is correct; 0123 is invalid according to RFC 7159.
>> But the error recovery isn't nice.
>>
>> Make the finite state machine eat digits before going into the error
>> state. The lexer now produces
>>
>> JSON_ERROR 0123
>>
>> Signed-off-by: Markus Armbruster <[email protected]>
>> Reviewed-by: Eric Blake <[email protected]>
>
> Did you also want to reject invalid attempts at hex numbers, by adding
> [xXa-fA-F] to the set of characters eaten by IN_BAD_ZERO?
I put one foot on a slippery slope with this patch...

In review of v1, we discussed whether to also try matching non-integer
numbers with a redundant leading zero.  Doing that tightly in the lexer
requires duplicating six states.  A simpler alternative is to have the
lexer eat "digit salad" after a redundant leading zero: 0[0-9.eE+-]+.
Your suggestion for hexadecimal numbers is digit salad with different
digits: [0-9a-fA-FxX].  Another option is their union: [0-9a-fA-FxX.+-]
(note e and E are already covered by a-f and A-F).  Even more radical
would be eating anything but whitespace and structural characters:
[^][}{:, \t\n\r].  That idea pushed to the limit results in a two-stage
lexer: the first stage finds token strings, where a token string is
either a structural character or a sequence of non-structural,
non-whitespace characters; the second stage rejects invalid token
strings.
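To make "digit salad" concrete, here is a minimal self-contained sketch
(not QEMU's json-lexer.c; the helper names are made up) of what eating
[0-9a-fA-FxX.+-] after a redundant leading zero amounts to:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical sketch, not the QEMU lexer: return true if @c belongs
 * to "digit salad" after a redundant leading zero, i.e. the union
 * [0-9a-fA-FxX.+-].  Note a-f/A-F already cover e/E, so exponent
 * characters are included. */
static bool is_digit_salad(char c)
{
    return (c >= '0' && c <= '9')
        || (c >= 'a' && c <= 'f')
        || (c >= 'A' && c <= 'F')
        || c == 'x' || c == 'X'
        || c == '.' || c == '+' || c == '-';
}

/* Given input @s starting with the redundant leading '0', return the
 * length of the single JSON_ERROR token the lexer would now produce. */
static size_t bad_zero_token_len(const char *s)
{
    size_t i = 1;               /* s[0] is the leading '0' */

    while (is_digit_salad(s[i])) {
        i++;
    }
    return i;
}
```

With this, input "0123" yields one error token of length 4 instead of
an error token "01" followed by integer "23", and "0x1f" is likewise
swallowed whole.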
Hmm, we could try to recover from lexical errors more smartly in
general: instead of ending the JSON error token after the first
offending character, end it before the first whitespace or structural
character following the offending character.

I can try that, but I'd prefer to try it in a follow-up patch.
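Expressed outside the state-machine framework, that recovery rule would
look roughly like this (a sketch with made-up names, not a patch):

```c
#include <stdbool.h>
#include <stddef.h>

/* Sketch of the proposed recovery: an error token ends before the next
 * whitespace or structural character (or end of input). */
static bool ends_error_token(char c)
{
    return c == '\0'
        || c == ' ' || c == '\t' || c == '\n' || c == '\r'
        || c == '[' || c == ']' || c == '{' || c == '}'
        || c == ':' || c == ',';
}

/* @s is the start of the token, @offender the offset of the first
 * offending character; return the length of the JSON error token. */
static size_t error_token_len(const char *s, size_t offender)
{
    size_t i = offender + 1;    /* include the offending character */

    while (!ends_error_token(s[i])) {
        i++;
    }
    return i;
}
```

For instance, in "01x2," with the offense at offset 2, the error token
would cover all of "01x2" and stop before the comma.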
>> + [IN_BAD_ZERO] = {
>> + ['0' ... '9'] = IN_BAD_ZERO,
>> + },
>> +