On Tue, 10 Mar 2026 10:20:04 +0100 Vincent Jardin <[email protected]> wrote:
> This series adds per-queue Tx rate limiting to the mlx5 PMD using
> the HW packet pacing (PP) rate table.
>
> The ConnectX-6 Dx and later NICs expose a per-SQ
> packet_pacing_rate_limit_index that can be changed on a live SQ
> via modify_bitmask without queue teardown. The kernel mlx5 driver
> refcounts PP contexts internally, so queues configured at the same
> rate share a single HW rate table entry.
>
> The series is structured as follows:
>
> 1.   Doc fix for stale packet pacing documentation
> 2-3. common/mlx5: query PP capabilities and extend SQ modify
> 4-6. net/mlx5: per-queue PP infrastructure, rate_limit callback,
>      burst pacing devargs (tx_burst_bound, tx_typical_pkt_sz)
> 7.   net/mlx5: testpmd command to query per-queue rate state
> 8.   ethdev: add rte_eth_get_queue_rate_limit() symmetric getter
> 9.   net/mlx5: share PP rate table entries across queues
> 10.  net/mlx5: rate table capacity query API
>
> Usage with testpmd:
>
>   set port 0 queue 0 rate 1000
>   set port 0 queue 1 rate 5000
>   set port 0 queue 0 rate 0       # disable
>   mlx5 port 0 txq 0 rate show     # query
>
> Tested on ConnectX-6 Dx only.
>
> Vincent Jardin (11):
>   doc/nics/mlx5: fix stale packet pacing documentation
>   common/mlx5: query packet pacing rate table capabilities
>   common/mlx5: extend SQ modify to support rate limit update
>   net/mlx5: add per-queue packet pacing infrastructure
>   net/mlx5: support per-queue rate limiting
>   net/mlx5: add burst pacing devargs
>   net/mlx5: add testpmd command to query per-queue rate limit
>   ethdev: add getter for per-queue Tx rate limit
>   mailmap: update Vincent Jardin email address
>   net/mlx5: share pacing rate table entries across queues
>   net/mlx5: add rate table capacity query API
>
>  .mailmap                             |   3 +-
>  doc/guides/nics/mlx5.rst             | 125 +++++++++++++++++------
>  drivers/common/mlx5/mlx5_devx_cmds.c |  20 ++++
>  drivers/common/mlx5/mlx5_devx_cmds.h |  14 ++-
>  drivers/net/mlx5/mlx5.c              |  46 +++++++++
>  drivers/net/mlx5/mlx5.h              |  13 +++
>  drivers/net/mlx5/mlx5_testpmd.c      |  93 +++++++++++++++++
>  drivers/net/mlx5/mlx5_tx.c           |  89 +++++++++++++++++
>  drivers/net/mlx5/mlx5_tx.h           |   5 +
>  drivers/net/mlx5/mlx5_txpp.c         |  75 ++++++++++++++
>  drivers/net/mlx5/mlx5_txq.c          | 144 +++++++++++++++++++++++++++
>  drivers/net/mlx5/rte_pmd_mlx5.h      |  57 +++++++++++
>  lib/ethdev/ethdev_driver.h           |   7 ++
>  lib/ethdev/rte_ethdev.c              |  28 ++++++
>  lib/ethdev/rte_ethdev.h              |  24 +++++
>  15 files changed, 710 insertions(+), 33 deletions(-)

Lots to digest here, so did a first pass review using AI.

Review: [PATCH v1 00/10] mlx5 per-queue packet pacing rate limiting
Submitter: Vincent Jardin <[email protected]>

================================================================
Patch 4/10: net/mlx5: add per-queue packet pacing infrastructure
================================================================

Error: Integer overflow in Mbps-to-kbps conversion

mlx5_txq_alloc_pp_rate_limit() computes:

    rate_kbps = rate_mbps * 1000;

Both rate_mbps and rate_kbps are uint32_t.
If rate_mbps > 4,294,967 (roughly 4.3 Tbps), the multiplication
overflows and silently wraps, programming a wrong rate into the HW
rate table. While 4 Tbps is beyond current HW, the function has no
upper-bound validation against hca_attr.qos.packet_pacing_max_rate.

At minimum, validate rate_mbps against the HCA max rate before the
multiply, or widen to uint64_t:

    uint64_t rate_kbps = (uint64_t)rate_mbps * 1000;

and check it fits in the 32-bit PRM field.

Warning: No validation of rate_mbps against HCA min/max rate

The HCA reports packet_pacing_min_rate and packet_pacing_max_rate
(queried in patch 2). mlx5_txq_alloc_pp_rate_limit() does not check
the requested rate against these bounds. A rate below the HW minimum
will likely be silently rounded or rejected by FW with an opaque
error. Validating early with a clear log message would be more
helpful.

================================================================
Patch 5/10: net/mlx5: support per-queue rate limiting
================================================================

Error: PP index leaked when mlx5_devx_cmd_modify_sq() fails on rate set

In mlx5_set_queue_rate_limit(), for the tx_rate > 0 path:

    ret = mlx5_txq_alloc_pp_rate_limit(sh, &txq_ctrl->rl, tx_rate);
    if (ret)
        return ret;
    ...
    ret = mlx5_devx_cmd_modify_sq(sq_devx, &sq_attr);
    if (ret) {
        ...
        mlx5_txq_free_pp_rate_limit(&txq_ctrl->rl);
        return ret;
    }

This looks correct on error -- good. However, note that
mlx5_txq_alloc_pp_rate_limit() calls mlx5_txq_free_pp_rate_limit()
internally at the top ("Free previous allocation if any"), meaning
the OLD PP index is freed BEFORE the new one is allocated, and before
the SQ is modified. If the new allocation succeeds but modify_sq
fails, the SQ still has the OLD pp_id programmed, but that old PP
context was already freed. The SQ is now referencing a freed PP index
until the next successful set or queue teardown.

Suggested fix: do not free the old PP context inside
mlx5_txq_alloc_pp_rate_limit().
Instead, allocate the new PP into a temporary mlx5_txq_rate_limit,
modify the SQ, and only on success free the old PP and swap in the
new one. On failure, free the new PP and leave the old one intact.

Warning: mlx5_set_queue_rate_limit() return value inconsistency

The disable path (tx_rate == 0) returns the raw ret from
mlx5_devx_cmd_modify_sq() without setting rte_errno, while the
capability-check and validation paths return -rte_errno. The ethdev
layer expects negative errno values; verify that
mlx5_devx_cmd_modify_sq() returns negative errno consistently.

================================================================
Patch 7/10: net/mlx5: add testpmd command to query per-queue rate
================================================================

Warning: rte_eth_tx_queue_is_valid() result flattened to -EINVAL

In rte_pmd_mlx5_txq_rate_limit_query():

    if (rte_eth_tx_queue_is_valid(port_id, queue_id))
        return -EINVAL;

rte_eth_tx_queue_is_valid() returns 0 on success (valid queue) and a
negative errno on error, so this check is not inverted: an invalid
queue does take the error path. It does, however, collapse every
failure (including an invalid port) into -EINVAL. Propagating the
helper's return value preserves the distinction:

    ret = rte_eth_tx_queue_is_valid(port_id, queue_id);
    if (ret != 0)
        return ret;

Warning: SQ object field mismatch between set and query paths

mlx5_set_queue_rate_limit() passes txq_ctrl->obj->sq_obj.sq to
mlx5_devx_cmd_modify_sq(), and rte_pmd_mlx5_txq_rate_limit_query()
passes the same object (via sq_out) to the query path. Make sure
mlx5_devx_cmd_query_sq() exists and accepts that object -- the
existing codebase has mlx5_devx_cmd_query_sq(), but verify it takes
the same sq_obj.sq rather than the bare sq.
================================================================
Patch 8/10: ethdev: add getter for per-queue Tx rate limit
================================================================

Warning: New ethdev API needs broader discussion

Adding a new eth_dev_ops callback (get_queue_rate_limit) changes the
ethdev driver interface. This typically requires:

- An RFC or discussion on the mailing list before the patch
- Agreement from ethdev maintainers (Thomas, Andrew, Ferruh)
- Consideration of whether this belongs as a generic ethdev API or
  should remain PMD-specific

The patch adds it only to mlx5; other PMDs that support
set_queue_rate_limit (ixgbe, i40e, ice) would need implementations
too, or applications will get inconsistent -ENOTSUP.

Warning: Missing release notes for new ethdev API

rte_eth_get_queue_rate_limit() is a new public experimental API in
lib/ethdev. This needs an entry in doc/guides/rel_notes/ for the
26.07 release.

Warning: Missing RTE_EXPORT_EXPERIMENTAL_SYMBOL for mlx5 PMD functions

rte_pmd_mlx5_txq_rate_limit_query (patch 7) has the export macro.
Verify rte_pmd_mlx5_pp_rate_table_query (patch 10) does too -- it
appears to, which is good.

================================================================
Patch 10/10: net/mlx5: add rate table capacity query API
================================================================

Warning: Large fixed-size stack array in rte_pmd_mlx5_pp_rate_table_query()

The function declares:

    uint16_t seen[RTE_MAX_QUEUES_PER_PORT];

RTE_MAX_QUEUES_PER_PORT is typically 1024, so this is a 2KB stack
allocation (not a VLA, but still sizable). While not enormous, it
sits in a function that could be called from any context. The O(n*m)
dedup loop (for each queue, scan seen[] for duplicates) is also
inefficient for large queue counts. Consider using a bitmap or a
small hash set, or just counting unique pp_ids from the shared
context rather than scanning all queues.
Warning: "used" count only reflects this port's queues

The function counts unique PP indices across priv->txqs_n queues for
one port, but the HW rate table is shared across all ports on the
same device (shared context). If multiple ports share the same PCI
device, the "used" count will underreport. Consider iterating over
all ports on the same sh, or documenting that the count is per-port
only.

================================================================
General series observations
================================================================

Info: Patch 1 (doc fix) has a Fixes tag referencing the original
scheduling devargs commit, which is appropriate. However, patches
2-10 are new features and should not have Fixes tags (they don't,
which is correct).

Info: The series adds three new experimental PMD APIs and one new
ethdev API. All new APIs in headers have __rte_experimental on their
own line, which is correct. The export macros use the 26.07 version
tag.

Info: The series lacks a cover letter in this bundle. For a 10-patch
series adding a significant new feature, a cover letter explaining
the overall design and testing would be expected by reviewers.

