On Thu,  5 Feb 2026 10:26:32 +0100
Robin Jarry <[email protected]> wrote:

> This series introduces a deferred enqueue API for the graph library that
> simplifies node development while maintaining performance.
> 
> The current node implementations use a manual speculation pattern where
> each node pre-allocates destination buffer slots, tracks which packets
> diverge from the speculated edge, and handles fixups at the end. This
> results in complex boilerplate code with multiple local variables
> (to_next, from, held, last_spec), memcpy calls, and stream get/put
> operations repeated across every node.
> 
> The new rte_node_enqueue_deferred() API handles this automatically:
> - Tracks runs of consecutive packets going to the same edge
> - Flushes runs in bulk when the edge changes
> - Uses rte_node_next_stream_move() (pointer swap) when all packets
>   go to the same destination
> - Preserves last_edge across invocations for cross-batch speculation
> 
> The deferred state is stored in the node's fast-path cache line 1,
> alongside xstat_off, keeping frequently accessed data together.
> 
> Performance was measured with l3fwd forwarding between two ports of an
> Intel E810-XXV 2x25G NIC (1 RX queue per port). Two graph worker threads
> ran on hyper threads of the same physical core on an Intel Xeon Silver
> 4316 CPU @ 2.30GHz.
> 
> Results:
> - Baseline (manual speculation): 37.0 Mpps
> - Deferred API:                  36.2 Mpps (-2.2%)
> 
> The slight overhead comes from per-packet edge comparisons. However,
> this is offset by:
> - 826 fewer lines of code across 13 node implementations
> - Reduced instruction cache pressure from simpler code paths
> - Elimination of per-node speculation boilerplate
> - Easier development of new nodes
> 
> Robin Jarry (3):
>   graph: optimize rte_node_enqueue_next to batch by edge
>   graph: add deferred enqueue API for batch processing
>   node: use deferred enqueue API in process functions
> 
>  app/graph/ip4_output_hook.c         |  35 +-------
>  lib/graph/graph_populate.c          |   1 +
>  lib/graph/rte_graph_worker_common.h |  90 ++++++++++++++++++-
>  lib/node/interface_tx_feature.c     | 105 +++-------------------
>  lib/node/ip4_local.c                |  36 +-------
>  lib/node/ip4_lookup.c               |  37 +-------
>  lib/node/ip4_lookup_fib.c           |  36 +-------
>  lib/node/ip4_lookup_neon.h          | 100 ++-------------------
>  lib/node/ip4_lookup_sse.h           | 100 ++-------------------
>  lib/node/ip4_rewrite.c              | 120 +++----------------------
>  lib/node/ip6_lookup.c               |  95 ++------------------
>  lib/node/ip6_lookup_fib.c           |  34 +-------
>  lib/node/ip6_rewrite.c              | 118 +++----------------------
>  lib/node/pkt_cls.c                  | 130 +++-------------------------
>  lib/node/udp4_input.c               |  42 +--------
>  15 files changed, 170 insertions(+), 909 deletions(-)
> 

AI review comments:

Review: [RFC PATCH dpdk 1/3] graph: optimize rte_node_enqueue_next to batch by edge
        [RFC PATCH dpdk 2/3] graph: add deferred enqueue API for batch processing
        [RFC PATCH dpdk 3/3] node: use deferred enqueue API in process functions

This is a well-motivated series. The deferred enqueue API is a clean
abstraction that replaces the repetitive speculation boilerplate in
every node process function. The code reduction in patch 3 speaks
for itself -- roughly 900 lines of hand-rolled speculation logic
replaced with single-line rte_node_enqueue_deferred() calls.

Patch 1/3 - rte_node_enqueue_next batching

Error: Out-of-bounds read when nb_objs is 0.

  rte_edge_t last = nexts[0];

reads nexts[0] unconditionally. If nb_objs == 0, this is an
out-of-bounds access. The old code handled this correctly because
the loop `for (i = 0; i < nb_objs; i++)` would simply not execute.
Add a guard:

  if (nb_objs == 0)
      return;
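
The fixed loop can be sketched in a self-contained form (the stub types
and the counting enqueue_run helper are stand-ins for the real rte_graph
machinery, not the actual patch code):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint16_t rte_edge_t;

/* Counts bulk flushes; stands in for the real per-edge enqueue. */
static int flush_count;

static void
enqueue_run(rte_edge_t edge, void **objs, uint16_t n)
{
	(void)edge; (void)objs; (void)n;
	flush_count++;
}

/* Batch objects by destination edge, flushing a run whenever the
 * edge changes. The nb_objs == 0 guard avoids reading nexts[0]
 * out of bounds. */
static void
enqueue_next_batched(void **objs, rte_edge_t *nexts, uint16_t nb_objs)
{
	uint16_t i, run_start = 0;
	rte_edge_t last;

	if (nb_objs == 0)
		return;

	last = nexts[0];
	for (i = 1; i < nb_objs; i++) {
		if (nexts[i] != last) {
			enqueue_run(last, &objs[run_start],
				    (uint16_t)(i - run_start));
			run_start = i;
			last = nexts[i];
		}
	}
	enqueue_run(last, &objs[run_start], (uint16_t)(nb_objs - run_start));
}
```

With the guard in place, a zero-length call is a no-op instead of an
out-of-bounds read, and the batching behavior is otherwise unchanged.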

Patch 2/3 - deferred enqueue API

Error: Flush in __rte_node_process fires for nodes that do not use
the deferred API, corrupting their data.

__rte_node_process now unconditionally checks:

  if (node->deferred_last_edge != RTE_EDGE_ID_INVALID)
      __rte_node_enqueue_deferred_flush(graph, node);

deferred_last_edge is initialized to RTE_EDGE_ID_INVALID at graph
populate time, but it is preserved across invocations ("Keep
deferred_last_edge from previous invocation for speculation"). Once
a deferred-API node sets it to a valid value, it will never be
reset to RTE_EDGE_ID_INVALID.

This means any node that uses the old manual speculation pattern
(rte_node_next_stream_move / stream_get / stream_put directly)
can have the flush fire after its process function returns,
because deferred_last_edge may hold a stale value left over from
an earlier invocation. The flush would then call
rte_node_next_stream_move based on that stale deferred_last_edge
and the current node->idx (which may still be non-zero, since
stream_move does not zero src->idx).

Even though patch 3 converts all in-tree nodes, this is a public
API change in __rte_node_process that affects all node
implementations including third-party and out-of-tree nodes that
are compiled against the DPDK headers. Any node process function
that does not use rte_node_enqueue_deferred() will be silently
broken.

The fix is to ensure the flush only fires when the deferred API
was actually used during this invocation. For example, reset
deferred_last_edge to RTE_EDGE_ID_INVALID at the start of
__rte_node_process (instead of preserving it), or add a flag
that is set when rte_node_enqueue_deferred() is first called:

  /* Option A: reset each invocation (loses cross-batch
   * speculation but is safe) */
  node->deferred_last_edge = RTE_EDGE_ID_INVALID;

  /* Option B: add a "deferred_active" flag set by
   * rte_node_enqueue_deferred, cleared by flush */

Note that option A sacrifices the cross-batch speculation benefit
described in the commit message. Option B preserves speculation
but adds a branch and a byte to the cache line. The right choice
depends on how much the cross-batch speculation matters in
practice -- given the ~2.2% overhead already measured, the
safety of option A may be preferable.
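
Option B could look roughly like this (a self-contained sketch; the
node_stub struct, the deferred_active field, and the helper names are
assumptions for illustration, not the actual patch):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint16_t rte_edge_t;
#define EDGE_ID_INVALID UINT16_MAX /* stand-in for RTE_EDGE_ID_INVALID */

/* Minimal stand-in for the deferred-enqueue fields of struct rte_node. */
struct node_stub {
	rte_edge_t deferred_last_edge; /* preserved across batches */
	bool deferred_active;          /* set only while the API is in use */
	int flushes;                   /* flush counter, for illustration */
};

/* Called from the process function for each packet (run tracking elided). */
static void
enqueue_deferred(struct node_stub *node, rte_edge_t next)
{
	node->deferred_active = true;
	node->deferred_last_edge = next;
}

/* Called by __rte_node_process after the process function returns.
 * Manual-speculation nodes never set the flag, so a stale
 * deferred_last_edge alone can no longer trigger a bogus flush. */
static void
maybe_flush(struct node_stub *node)
{
	if (node->deferred_active) {
		node->flushes++;
		node->deferred_active = false;
		/* deferred_last_edge is kept for next-batch speculation */
	}
}
```

Because only the flag is cleared, cross-batch speculation through
deferred_last_edge is preserved while nodes that never call the
deferred API are left alone.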

Warning: deferred_run_start and deferred_last_edge add 4 bytes to
cache line 1 of struct rte_node.

The commit says "stored in the node fast-path cache line 1,
keeping it close to other frequently accessed node data." Since
this is an ABI change to a public struct, it should be noted in
the release notes. Verify that the addition does not push other
hot fields out of the cache line or cross a cache line boundary.
The current cache line 1 only had xstat_off before this change,
so there is room, but this should be explicitly confirmed.

Warning: __rte_node_enqueue_deferred_flush uses node->idx after
the process function has potentially manipulated it.

The flush reads node->idx as "count" to determine how many
objects to flush. This relies on the invariant that the process
function did NOT modify node->idx. With the deferred API this
is true (the process function just calls
rte_node_enqueue_deferred which doesn't touch node->idx), but
this implicit contract is fragile and undocumented. If any
process function uses a mix of deferred and non-deferred enqueue
APIs, node->idx may not reflect the original object count. Add a
comment or assertion documenting this invariant.
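
One way to make the contract explicit, sketched with a plain assert
(the real code would likely use a DPDK debug macro; the names here are
illustrative stand-ins, not the patch's API):

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for the idx field of struct rte_node. */
struct node_idx_stub {
	uint16_t idx; /* object count, set before the process function runs */
};

/* Returns the count the flush should use, verifying that the process
 * function left node->idx untouched -- the implicit invariant the
 * deferred flush relies on. */
static uint16_t
deferred_flush_count(const struct node_idx_stub *node, uint16_t idx_before)
{
	/* Invariant: a process function using only the deferred API
	 * must not modify node->idx between dispatch and flush. */
	assert(node->idx == idx_before);
	return node->idx;
}
```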

Warning: the new API should be marked __rte_experimental.

rte_node_enqueue_deferred() is a new public inline function in an
installed header. Per DPDK policy, new APIs must be marked
__rte_experimental.

Patch 3/3 - node conversions

Warning: pkt_cls_node_process loses its cross-invocation l2l3_type
speculation state.

The old code saved the last packet type in ctx->l2l3_type and
used it as the speculative next index on the following invocation.
The new code removes the pkt_cls_node_ctx struct entirely and
relies on deferred_last_edge for speculation. This works because
deferred_last_edge maps to the next node edge, which is what the
old code was ultimately speculating on. However, the old code
speculated on the *packet type* (l2l3_type) which was then
mapped to the edge -- this two-level speculation could be more
precise if different packet types map to the same edge. In
practice, this is likely fine since the deferred API speculates
on the edge directly, which is what matters for performance.
This is noted for completeness.

Info: The conversions are mechanical and consistent across all
13 files. Each follows the same pattern: remove from/to_next/
held/last_spec/next_index variables, remove stream_get/put/
memcpy/enqueue_x1 logic, replace with
rte_node_enqueue_deferred(graph, node, next, i). The manual
`i` index tracking is correct in all cases -- the x4 loops
increment i by 4, the x1 tail loops by 1, matching the advance of
the pkts pointer and the decrement of n_left_from.
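
The converted process-function shape can be sketched as follows
(self-contained; classify_pkt and enqueue_deferred_stub are hypothetical
stand-ins for the per-node logic and rte_node_enqueue_deferred):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint16_t rte_edge_t;

static int deferred_calls; /* counts per-packet deferred enqueues */

/* Stand-in for rte_node_enqueue_deferred(graph, node, next, idx). */
static void
enqueue_deferred_stub(rte_edge_t next, uint16_t idx)
{
	(void)next; (void)idx;
	deferred_calls++;
}

/* Hypothetical per-packet classification returning the next edge. */
static rte_edge_t
classify_pkt(void *pkt)
{
	return pkt != NULL ? 1 : 0;
}

/* Converted shape: no to_next/from/held/last_spec bookkeeping,
 * just one deferred enqueue per packet with the running index i. */
static uint16_t
node_process(void **objs, uint16_t nb_objs)
{
	uint16_t i;

	for (i = 0; i < nb_objs; i++)
		enqueue_deferred_stub(classify_pkt(objs[i]), i);

	return nb_objs;
}
```

All the run tracking and flushing moves behind the single call, which
is where the roughly 900 deleted lines come from.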

Reviewed-by: Stephen Hemminger <[email protected]>
