Re: [PATCH 2/5] json-parser: replace with a push parser

Paolo Bonzini Mon, 09 Feb 2026 02:57:57 -0800

On 2/9/26 10:36, Markus Armbruster wrote:

+typedef struct JSONParserContext {
+    Error *err;
+    GQueue *stack;


This tells us we have a stack of something, but not what the something
is.  Further down, we'll see what the something is.

The commit message explains what the stack means.

Worth a comment here?


Yes, makes sense.

We know the stack is at most 1024 deep (MAX_NESTING).  Have you
considered using an array instead of GQueue?

It'd be only 8K, but I preferred to keep MAX_NESTING hidden withinjson-streamer.c. Not having bounds checking makes me a bit nervousabout having fixed-size arrays.


See more below about responsibility split between parser and streamer.

+    va_list *ap;
+} JSONParserContext;
+


Having to move struct definitions to headers is always a bit sad.  Could
this go into json-parser-int.h instead?

No because it's embedded in JSONMessageParser, just like JSONLexer. Itwas previously hidden only because the parser was created and destroyedafter parentheses were balanced, but not doing that is kind of the wholestory here. :)

+/*
+ * Objects: { } | { members }
+ * - Empty: { -> AFTER_LCURLY -> }
+ * - Non-empty: { -> AFTER_LCURLY -> BEFORE_KEY -> string -> END_OF_KEY -> : ->
+ *              BEFORE_VALUE -> value -> END_OF_VALUE -> , -> BEFORE_KEY -> ... 
-> }
+ *
+ * Arrays: [ ] | [ elements ]
+ * - Empty: [ -> AFTER_LSQUARE -> ]
+ * - Non-empty: [ -> AFTER_LSQUARE -> BEFORE_VALUE -> value -> END_OF_VALUE -> , 
->
+ *              BEFORE_VALUE -> ... -> ]
+ *
+ * The two cases for END_OF_VALUE are distinguished based on the type of 
QObject at
+ * top-of-stack.
+ */


I'm afraid I need a comment telling me how to read this comment.

That's basically the automaton that you request below (with epsilontransitions), I'll expand on it.

+typedef enum JSONParserState {
+    AFTER_LCURLY,
+    AFTER_LSQUARE,
+    BEFORE_KEY,
+    BEFORE_VALUE,
+    END_OF_KEY,
+    END_OF_VALUE,
+} JSONParserState;
+
+typedef struct JSONParserStackEntry {
+    /* A QString with the last parsed key, or a QList/QDict for the current 
container.  */
+    QObject *partial;
+
+    /* Needed to distinguish whether the parser is waiting for a colon or 
comma.  */
+    JSONParserState state;


If the comment was 100% accurate, a bool would do.

Hmm when I wrote I did think that a bool + the type of the top QObjectwould do. But really there are three cases that the top QObject doesnot distinguish:


- "after opening bracket, closed parenthesis or next element accepted",

- "after comma, next element required"

- "after element, closed parenthesis or comma accepted"

So, a bool would not be enough.  Will fix.

+/* Note that entry->partial does *not* lose its reference count even if value 
== NULL.  */
+static JSONParserStackEntry *pop_entry(JSONParserContext *ctxt, QObject 
**value)
+{
+    JSONParserStackEntry *entry = g_queue_pop_tail(ctxt->stack);
+    if (value) {
+        *value = entry->partial;
+    }
+    g_free(entry);
+    return current_entry(ctxt);
+}


This is a rather peculiar function.  If you disagree, try writing a
contract for it :)

Hmm... yes, better pull the "value = entry->partial" to the callers evenif there are several of them. The code is overall simpler.

The parser's interesting part follows.  The parser is a pushdown
automaton.  The pushdown automaton is coded in C.  On the one hand,
d'oh!  Of course it is.  On the other hand, it's hard to see the actual
automaton in C.  May I have a comment explaining state and state
transitions?  Perhaps we better start with an informal description,
discuss it, then distill the discussion into a comment.

That was the purpose of the mysterious comment above. If a mix betweenBNF and automaton is okay for you, it could be something like


input := value        -> END_OF_VALUE
         END_OF_INPUT -> check stack is empty, return parsed value

// entered on BEFORE_VALUE; after any of these rules are processed, the
// parser has completed a QObject and is in the END_OF_VALUE state.
//
// When the parser reaches the END_OF_VALUE state, it examines the
// top of the stack to see if it's coming from "input" (stack empty),
// "array_items" (TOS is a QList) or "dict_pairs" (TOS is a QString; the
// item below will be a QDict).  It then proceeds with the corresponding
// actions, which will be one of:
// - return parsed value
// - add value to QList
// - pop QString with the key, add key/value to the QDict

value := literal       -> END_OF_VALUE
      | '['            -> push empty QList -> AFTER_LSQUARE
         after_lsquare -> END_OF_VALUE
      | '{'            -> push empty QDict -> AFTER_LCURLY
         after_lcurly  -> END_OF_VALUE

// non-recursive values, entered on BEFORE_VALUE
literal := INTEGER   -> END_OF_VALUE
     | FLOAT         -> END_OF_VALUE
     | KEYWORD       -> END_OF_VALUE
     | STRING        -> END_OF_VALUE
     | INTERP        -> END_OF_VALUE

// entered on AFTER_LSQUARE
after_lsquare := ']' -> pop completed QList -> END_OF_VALUE
       | ϵ           -> BEFORE_VALUE
         array_items -> END_OF_VALUE

// entered on BEFORE_VALUE, with TOS being a QList
array_items := value   -> add value to QList -> END_OF_VALUE
       (']'            -> pop completed QList -> END_OF_VALUE
        | ','          -> BEFORE_VALUE
          array_items) -> END_OF_VALUE

// entered on AFTER_LCURLY
after_lcurly := '}'  -> pop completed QDict -> END_OF_VALUE
       | ϵ           -> BEFORE_KEY
         dict_pairs  -> END_OF_VALUE

// entered on BEFORE_KEY, with TOS being a QDict
dict_pairs := STRING  -> push QString -> END_OF_KEY
       ':'            -> BEFORE_VALUE
       value          -> pop QString + add pair to QDict -> END_OF_VALUE
       ('}'           -> pop completed QDict -> END_OF_VALUE
        | ','         -> BEFORE_KEY
          dict_pairs) -> END_OF_VALUE

+static QObject *json_parser_parse_token(JSONParserContext *ctxt, const 
JSONToken *token)


Rename to parse_token()?  Or inline into json_parser_feed()?


Renaming is good.

+            if (!key) {
+                return NULL;
+            }


Key to understand the switch is the meaning of "break" and "return
NULL".

We do the latter when we're done for this token.

We do the former when this token completed a (sub-)value.  @value holds
it.  If the stack is empty, @value is the result of the parse, and the
code after the switch returns it.  Else, @value is a sub-value, and the
code after the switch stores it in the object being constructed.


Correct.  Want me to include it somehow or is the grammar above fine?

+
+            /* Store key in a special entry on the stack */
+            push_entry(ctxt, QOBJECT(key), END_OF_KEY);
+        } else {
+            parse_error(ctxt, token, "expecting key");


What's the state machine's parse error recovery story?


As simple as it can be:

    assert(!ctxt->err);

because a push parser can actually refuse to deal with tokens after anerror, and make recovery someone else's problem.

In other words a push parser can actually be the pure theoretical thingfrom your compilers and automata book, and all the engineering sits ontop. IMO this (at least for a JSON parser that you won't touch thatoften) balances the complexity of the state machine.

So, error recovery is handled entirely in json-streamer.c. As of thispatch, json-streamer.c has gathered all tokens up to the next balancedparentheses, and that's implicitly the place where the error recovery iscomplete.

+QObject *json_parser_feed(JSONParserContext *ctxt, const JSONToken *token, 
Error **errp)


The return value isn't immediately obvious.  A brief function contract
could help the reader.


Ok, will add.

Before the patch: flush the queue when both counts are zero or at least
one is negative.

After the patch: additionally flush it on end of input.

I figure the two hunks together make the parser see a JSON_END_OF_INPUT
token at the end of input.  Correct?

Yes, see the commit message. This is the only change I have made hereto the parser/streamer split, because it makes sense to have theunbalanced parentheses error happen in the parser. Otherwise you're notreally implementing the whole grammar.

- json = json_parser_parse(&parser->tokens, parser->ap, &err);

+    json_parser_init(&ctxt, parser->ap);
+
+    /* Process all tokens in the queue */
+    while (!g_queue_is_empty(&parser->tokens)) {
+        token = g_queue_pop_head(&parser->tokens);
+        json = json_parser_feed(&ctxt, token, &err);
+        g_free(token);
+        if (json || err) {
+            break;
+        }
+    }
+
+    json_parser_destroy(&ctxt);


This is where we switch from feeding the parser balanced expressions to
feeding it tokens one by one.

When we break the loop early, parser->tokens is not empty here.  Fine,
because...

out_emit:

      parser->brace_count = 0;

        parser->bracket_count = 0;
        json_message_free_tokens(parser);

... we drain it here.

        parser->token_size = 0;
        parser->emit(parser->opaque, json, err);
    }


Correct, and this reset code will survive in patch 3, sort of like this:

    // feed tokens as they're passed to the streamer
    if (!ctxt->err) {
        json = json_parser_feed(...)
        if (json || ctxt->err) {
            parser->emit(parser->opaque, json, ctxt->err);
        }
    }

    if (value || error recovery complete) {
        // reset the streamer state
        // reset the parser
    }

Paolo

Re: [PATCH 2/5] json-parser: replace with a push parser

Reply via email to