On 2/9/26 10:36, Markus Armbruster wrote:
+typedef struct JSONParserContext {
+ Error *err;
+ GQueue *stack;
This tells us we have a stack of something, but not what the something
is. Further down, we'll see what the something is.
The commit message explains what the stack means.
Worth a comment here?
Yes, makes sense.
We know the stack is at most 1024 deep (MAX_NESTING). Have you
considered using an array instead of GQueue?
It'd be only 8K, but I preferred to keep MAX_NESTING hidden within
json-streamer.c. Not having bounds checking makes me a bit nervous
about having fixed-size arrays.
See more below about responsibility split between parser and streamer.
+ va_list *ap;
+} JSONParserContext;
+
Having to move struct definitions to headers is always a bit sad. Could
this go into json-parser-int.h instead?
No because it's embedded in JSONMessageParser, just like JSONLexer. It
was previously hidden only because the parser was created and destroyed
after parentheses were balanced, but not doing that is kind of the whole
story here. :)
+/*
+ * Objects: { } | { members }
+ * - Empty: { -> AFTER_LCURLY -> }
+ * - Non-empty: { -> AFTER_LCURLY -> BEFORE_KEY -> string -> END_OF_KEY -> : ->
+ * BEFORE_VALUE -> value -> END_OF_VALUE -> , -> BEFORE_KEY -> ...
-> }
+ *
+ * Arrays: [ ] | [ elements ]
+ * - Empty: [ -> AFTER_LSQUARE -> ]
+ * - Non-empty: [ -> AFTER_LSQUARE -> BEFORE_VALUE -> value -> END_OF_VALUE -> ,
->
+ * BEFORE_VALUE -> ... -> ]
+ *
+ * The two cases for END_OF_VALUE are distinguished based on the type of
QObject at
+ * top-of-stack.
+ */
I'm afraid I need a comment telling me how to read this comment.
That's basically the automaton that you request below (with epsilon
transitions), I'll expand on it.
+typedef enum JSONParserState {
+ AFTER_LCURLY,
+ AFTER_LSQUARE,
+ BEFORE_KEY,
+ BEFORE_VALUE,
+ END_OF_KEY,
+ END_OF_VALUE,
+} JSONParserState;
+
+typedef struct JSONParserStackEntry {
+ /* A QString with the last parsed key, or a QList/QDict for the current
container. */
+ QObject *partial;
+
+ /* Needed to distinguish whether the parser is waiting for a colon or
comma. */
+ JSONParserState state;
If the comment was 100% accurate, a bool would do.
Hmm when I wrote I did think that a bool + the type of the top QObject
would do. But really there are three cases that the top QObject does
not distinguish:
- "after opening bracket, closed parenthesis or next element accepted",
- "after comma, next element required"
- "after element, closed parenthesis or comma accepted"
So, a bool would not be enough. Will fix.
+/* Note that entry->partial does *not* lose its reference count even if value
== NULL. */
+static JSONParserStackEntry *pop_entry(JSONParserContext *ctxt, QObject
**value)
+{
+ JSONParserStackEntry *entry = g_queue_pop_tail(ctxt->stack);
+ if (value) {
+ *value = entry->partial;
+ }
+ g_free(entry);
+ return current_entry(ctxt);
+}
This is a rather peculiar function. If you disagree, try writing a
contract for it :)
Hmm... yes, better pull the "value = entry->partial" to the callers even
if there are several of them. The code is overall simpler.
The parser's interesting part follows. The parser is a pushdown
automaton. The pushdown automaton is coded in C. On the one hand,
d'oh! Of course it is. On the other hand, it's hard to see the actual
automaton in C. May I have a comment explaining state and state
transitions? Perhaps we better start with an informal description,
discuss it, then distill the discussion into a comment.
That was the purpose of the mysterious comment above. If a mix between
BNF and automaton is okay for you, it could be something like
input := value -> END_OF_VALUE
END_OF_INPUT -> check stack is empty, return parsed value
// entered on BEFORE_VALUE; after any of these rules are processed, the
// parser has completed a QObject and is in the END_OF_VALUE state.
//
// When the parser reaches the END_OF_VALUE state, it examines the
// top of the stack to see if it's coming from "input" (stack empty),
// "array_items" (TOS is a QList) or "dict_pairs" (TOS is a QString; the
// item below will be a QDict). It then proceeds with the corresponding
// actions, which will be one of:
// - return parsed value
// - add value to QList
// - pop QString with the key, add key/value to the QDict
value := literal -> END_OF_VALUE
| '[' -> push empty QList -> AFTER_LSQUARE
after_lsquare -> END_OF_VALUE
| '{' -> push empty QDict -> AFTER_LCURLY
after_lcurly -> END_OF_VALUE
// non-recursive values, entered on BEFORE_VALUE
literal := INTEGER -> END_OF_VALUE
| FLOAT -> END_OF_VALUE
| KEYWORD -> END_OF_VALUE
| STRING -> END_OF_VALUE
| INTERP -> END_OF_VALUE
// entered on AFTER_LSQUARE
after_lsquare := ']' -> pop completed QList -> END_OF_VALUE
| ϵ -> BEFORE_VALUE
array_items -> END_OF_VALUE
// entered on BEFORE_VALUE, with TOS being a QList
array_items := value -> add value to QList -> END_OF_VALUE
(']' -> pop completed QList -> END_OF_VALUE
| ',' -> BEFORE_VALUE
array_items) -> END_OF_VALUE
// entered on AFTER_LCURLY
after_lcurly := '}' -> pop completed QDict -> END_OF_VALUE
| ϵ -> BEFORE_KEY
dict_pairs -> END_OF_VALUE
// entered on BEFORE_KEY, with TOS being a QDict
dict_pairs := STRING -> push QString -> END_OF_KEY
':' -> BEFORE_VALUE
value -> pop QString + add pair to QDict -> END_OF_VALUE
('}' -> pop completed QDict -> END_OF_VALUE
| ',' -> BEFORE_KEY
dict_pairs) -> END_OF_VALUE
+static QObject *json_parser_parse_token(JSONParserContext *ctxt, const
JSONToken *token)
Rename to parse_token()? Or inline into json_parser_feed()?
Renaming is good.
+ if (!key) {
+ return NULL;
+ }
Key to understand the switch is the meaning of "break" and "return
NULL".
We do the latter when we're done for this token.
We do the former when this token completed a (sub-)value. @value holds
it. If the stack is empty, @value is the result of the parse, and the
code after the switch returns it. Else, @value is a sub-value, and the
code after the switch stores it in the object being constructed.
Correct. Want me to include it somehow or is the grammar above fine?
+
+ /* Store key in a special entry on the stack */
+ push_entry(ctxt, QOBJECT(key), END_OF_KEY);
+ } else {
+ parse_error(ctxt, token, "expecting key");
What's the state machine's parse error recovery story?
As simple as it can be:
assert(!ctxt->err);
because a push parser can actually refuse to deal with tokens after an
error, and make recovery someone else's problem.
In other words a push parser can actually be the pure theoretical thing
from your compilers and automata book, and all the engineering sits on
top. IMO this (at least for a JSON parser that you won't touch that
often) balances the complexity of the state machine.
So, error recovery is handled entirely in json-streamer.c. As of this
patch, json-streamer.c has gathered all tokens up to the next balanced
parentheses, and that's implicitly the place where the error recovery is
complete.
+QObject *json_parser_feed(JSONParserContext *ctxt, const JSONToken *token,
Error **errp)
The return value isn't immediately obvious. A brief function contract
could help the reader.
Ok, will add.
Before the patch: flush the queue when both counts are zero or at least
one is negative.
After the patch: additionally flush it on end of input.
I figure the two hunks together make the parser see a JSON_END_OF_INPUT
token at the end of input. Correct?
Yes, see the commit message. This is the only change I have made here
to the parser/streamer split, because it makes sense to have the
unbalanced parentheses error happen in the parser. Otherwise you're not
really implementing the whole grammar.
}
- json = json_parser_parse(&parser->tokens, parser->ap, &err);
+ json_parser_init(&ctxt, parser->ap);
+
+ /* Process all tokens in the queue */
+ while (!g_queue_is_empty(&parser->tokens)) {
+ token = g_queue_pop_head(&parser->tokens);
+ json = json_parser_feed(&ctxt, token, &err);
+ g_free(token);
+ if (json || err) {
+ break;
+ }
+ }
+
+ json_parser_destroy(&ctxt);
This is where we switch from feeding the parser balanced expressions to
feeding it tokens one by one.
When we break the loop early, parser->tokens is not empty here. Fine,
because...
out_emit:
parser->brace_count = 0;
parser->bracket_count = 0;
json_message_free_tokens(parser);
... we drain it here.
parser->token_size = 0;
parser->emit(parser->opaque, json, err);
}
Correct, and this reset code will survive in patch 3, sort of like this:
// feed tokens as they're passed to the streamer
if (!ctxt->err) {
json = json_parser_feed(...)
if (json || ctxt->err) {
parser->emit(parser->opaque, json, ctxt->err);
}
}
if (value || error recovery complete) {
// reset the streamer state
// reset the parser
}
Paolo