Re: [PATCH 00/11] NXP DPAA2 driver enhancements and fixes

2025-06-29 Thread Stephen Hemminger
On Fri, 30 May 2025 12:43:33 +0530
Gagandeep Singh  wrote:

> This patch series introduces enhancements and fixes to the
> NXP DPAA2 Ethernet driver. 
> It includes support for
>  - software taildrop on ordered queues.
>  - setup speed capabilities.
>  - DPAA2 resource version.
>  - MAC level statistics.
>  - improve PA-VA conversion.
>  - add buffer pool depletion state configuration.
>  - fixes for shaper rate and buffer preparation.
> 


At this late stage in the 25.07 release, please don't mix fixes with
new features. Also, new features should be documented in the release notes.

Please resend with only the fixes for 25.07.


Re: [PATCH v1 4/4] net/ntnic: add warning when sending on a stopped queue

2025-06-29 Thread Stephen Hemminger
On Fri, 20 Jun 2025 13:27:07 +0200
Oleksandr Kolomeiets  wrote:

> When sending a burst of output packets on a stopped transmit queue,
> the packets are written to a memory mapped address.
> On queue start the packets are processed and transmitted by the NIC.
> 
> Signed-off-by: Oleksandr Kolomeiets 
> ---
>  drivers/net/ntnic/ntnic_ethdev.c | 4 
>  1 file changed, 4 insertions(+)
> 
> diff --git a/drivers/net/ntnic/ntnic_ethdev.c 
> b/drivers/net/ntnic/ntnic_ethdev.c
> index 79ef9e7e7c..4145128d11 100644
> --- a/drivers/net/ntnic/ntnic_ethdev.c
> +++ b/drivers/net/ntnic/ntnic_ethdev.c
> @@ -694,6 +694,10 @@ static uint16_t eth_dev_tx_scg(void *queue, struct 
> rte_mbuf **bufs, uint16_t nb_
>   int pkts_sent = 0;
>   uint16_t nb_segs_arr[MAX_TX_PACKETS];
>  
> + if (!tx_q->enabled)
> + NT_LOG(WRN, NTNIC, "Trying to send a burst of output packets "
> + "on a stopped transmit queue of an Ethernet device");
> +
>   if (nb_pkts > MAX_TX_PACKETS)
>   nb_pkts = MAX_TX_PACKETS;
>  

This may result in log spam if the application is sending a lot of traffic.
The message is also too long and split across source lines.

But it is best not to do this at all; no other driver does it.


Re: [PATCH v1 1/4] net/ntnic: implement start/stop for Rx/Tx queues

2025-06-29 Thread Stephen Hemminger
On Fri, 20 Jun 2025 13:27:04 +0200
Oleksandr Kolomeiets  wrote:

> The following functions exported by the driver were stubs
> which merely changed the status flags:
> * rx_queue_start
> * rx_queue_stop
> * tx_queue_start
> * tx_queue_stop
> 
> A proper implementation was added to control the queues' state.
> 
> Signed-off-by: Oleksandr Kolomeiets 

Since these were broken (and now fixed), you should add a Fixes: tag to this 
patch.
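
For reference, the tag uses the first 12 characters of the hash of the commit
that introduced the stub implementations, plus that commit's subject line (the
hash and subject below are placeholders only), with Cc: stable added if a
backport is wanted:

    Fixes: 0123456789ab ("net/ntnic: add queue setup operations")
    Cc: stable@dpdk.org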


Re: [V3 14/18] net/hinic3: add Rx/Tx functions

2025-06-29 Thread Stephen Hemminger
On Sat, 28 Jun 2025 15:25:37 +0800
Feifei Wang  wrote:

> +#define HINIC3_RX_EMPTY_THRESHOLD 3
> +u16
> +hinic3_recv_pkts(void *rx_queue, struct rte_mbuf **rx_pkts, u16 nb_pkts)
> +{
> + struct hinic3_rxq *rxq = rx_queue;
> + struct hinic3_rx_info *rx_info = NULL;
> + volatile struct hinic3_rq_cqe *rx_cqe = NULL;
> + struct rte_mbuf *rxm = NULL;
> + u16 sw_ci, rx_buf_len, wqebb_cnt = 0, pkts = 0;
> + u32 status, pkt_len, vlan_len, offload_type, lro_num;
> + u64 rx_bytes = 0;
> + u32 hash_value;
> +
> +#ifdef HINIC3_XSTAT_PROF_RX
> + uint64_t t1 = rte_get_tsc_cycles();
> + uint64_t t2;
> +#endif
> + if (((rte_get_timer_cycles() - rxq->rxq_stats.tsc) < rxq->wait_time_cycle) &&
> + rxq->rxq_stats.empty >= HINIC3_RX_EMPTY_THRESHOLD)
> + goto out;
> +

NAK.
Doing this kind of empty-poll threshold on receive is non-standard;
the driver should not be doing it here.
Many applications implement polling optimizations themselves in their polling loops,
and this driver-specific tweak would interfere with that.
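
For illustration, a minimal sketch (not from this patch; names and thresholds
are made up) of how an application typically implements this kind of
empty-poll backoff in its own loop, which is exactly what a PMD-internal
threshold would fight against:

#include <rte_cycles.h>
#include <rte_ethdev.h>
#include <rte_mbuf.h>
#include <rte_pause.h>

static void
poll_queue(uint16_t port_id, uint16_t queue_id)
{
        struct rte_mbuf *pkts[32];
        unsigned int idle_polls = 0;

        for (;;) {
                uint16_t nb = rte_eth_rx_burst(port_id, queue_id, pkts, 32);

                if (nb == 0) {
                        /* The application decides how to back off. */
                        if (++idle_polls > 1000)
                                rte_delay_us_sleep(10);
                        else
                                rte_pause();
                        continue;
                }
                idle_polls = 0;
                /* ... process and free the nb received mbufs ... */
        }
}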


Re: [V3 01/18] add some basic files about hinic3 driver

2025-06-29 Thread Stephen Hemminger
On Sat, 28 Jun 2025 15:25:24 +0800
Feifei Wang  wrote:

> --- /dev/null
> +++ b/doc/guides/nics/hinic3.rst
> @@ -0,0 +1,51 @@
> +..  SPDX-License-Identifier: BSD-3-Clause
> +Copyright(c) 2025 Huawei Technologies Co., Ltd
> +
> +HINIC Poll Mode Driver
> +==
> +
> +The hinic3 PMD (**librte_net_hinic3**) provides poll mode driver support
> +for 25Gbps/100Gbps/200Gbps Huawei SPx series Network Adapters.
> +
> +Features
> +
> +
> +- Multi arch support: x86_64, ARMv8.
> +- Multiple queues for TX and RX
> +- Receiver Side Scaling (RSS)
> +- flow filtering
> +- Checksum offload
> +- TSO offload
> +- Promiscuous mode
> +- Port hardware statistics
> +- Link state information
> +- Link flow control
> +- Scattered and gather for TX and RX
> +- Allmulticast mode
> +- MTU update
> +- Multicast MAC filter
> +- Flow API
> +- Set Link down or up
> +- VLAN filter and VLAN offload
> +- SR-IOV - Partially supported at this point, VFIO only
> +- FW version
> +- LRO
> +
> +Prerequisites
> +-
> +
> +- Learning about Huawei Hi1823 Series Intelligent NICs using
> +  ``_.
> +
> +- Follow the DPDK :ref:`Getting Started Guide for Linux ` to setup the basic DPDK environment.
> +
> +
> +Driver compilation and testing
> +--
> +
> +Refer to the document :ref:`compiling and testing a PMD for a NIC `
> +for details.
> +
> +Limitations or Known issues
> +---
> +X86-32, Windows, and BSD are not supported yet.
> \ No newline at end of file

Fix your editor settings; all DPDK doc files should end with a newline.
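
If it helps, a minimal EditorConfig snippet that enforces this (assuming your
editor honours .editorconfig):

    [*]
    insert_final_newline = true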


Re: [PATCH v0 0/3] [v0]drivers/net fixed Coverity issue

2025-06-29 Thread Stephen Hemminger
On Wed, 18 Jun 2025 20:11:10 +0800
Wenbo Cao  wrote:

> v1:
>   *:fixed compile issue
> v0:
>   *:fixed the below issue:
>   Coverity issue: 468860,468866,468858
>   Fixes: 4530e70f1e32 ("net/rnp: support Tx TSO offload")
>   Fixes: 52dfb84e14be ("net/rnp: add device init and uninit")
>   Fixes: 52aae4ed4ffb ("net/rnp: add device capabilities")
>   *:fixed 64k tso
> 
> Wenbo Cao (3):
>   net/rnp: add check firmware respond info
>   net/rnp: fix Tunnel-TSO VLAN header untrusted loop bound
>   net/rnp: fix TSO segmentation for packets of 64KB
> 
>  drivers/net/rnp/base/rnp_fw_cmd.h |   1 +
>  drivers/net/rnp/base/rnp_mbx_fw.c |  15 +++-
>  drivers/net/rnp/rnp_ethdev.c  |  16 ++--
>  drivers/net/rnp/rnp_rxtx.c| 118 +++---
>  drivers/net/rnp/rnp_rxtx.h|   1 +
>  5 files changed, 117 insertions(+), 34 deletions(-)
> 

Overall this patchset looks fine.
Could you try out the suggested changes and resubmit, please.


Re: [PATCH 6/6] net/hns3: VF support multi-TCs configure

2025-06-29 Thread Stephen Hemminger
On Wed, 11 Jun 2025 16:19:00 +0800
Dengdui Huang  wrote:

> +#pragma pack(1)
> +#define HNS3_MBX_PRIO_SHIFT  4
> +#define HNS3_MBX_PRIO_MASK   0xFu
> +struct hns3_mbx_tc_config {
> + /*
> +  * Each four bits correspond to one priority's TC.
> +  * Bit0-3 correspond to priority-0's TC, bit4-7 correspond to
> +  * priority-1's TC, and so on.
> +  */
> + uint32_t prio_tc_map;
> + uint8_t tc_dwrr[HNS3_MAX_TC_NUM];
> + uint8_t num_tc;
> + /*
> +  * Each bit correspond to one TC's scheduling mode, 0 means SP
> +  * scheduling mode, 1 means DWRR scheduling mode.
> +  * Bit0 corresponds to TC0, bit1 corresponds to TC1, and so on.
> +  */
> + uint8_t tc_sch_mode;
>  };
> +#pragma pack()
>  

DPDK has portable macros for packing: __rte_packed_begin and __rte_packed_end.
Please change to using those macros.
Then rebase, retest and resubmit this patchset.
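
For example, the same structure with the portable macros would look roughly
like this (a sketch only; the field layout and comments are taken from the
patch above, and HNS3_MAX_TC_NUM comes from the driver's headers):

#include <stdint.h>
#include <rte_common.h>

#define HNS3_MBX_PRIO_SHIFT  4
#define HNS3_MBX_PRIO_MASK   0xFu

struct __rte_packed_begin hns3_mbx_tc_config {
        uint32_t prio_tc_map;             /* four bits of TC per priority */
        uint8_t tc_dwrr[HNS3_MAX_TC_NUM]; /* DWRR weight per TC */
        uint8_t num_tc;
        uint8_t tc_sch_mode;              /* bit per TC: 0 = SP, 1 = DWRR */
} __rte_packed_end;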


Re: [PATCH] build: error out when missing elftools python module

2025-06-29 Thread Thomas Monjalon
27/06/2025 17:27, Bruce Richardson:
> In the case where we use the meson python "find_installation()" function
> to get our python binary, we can fail the configure/setup step if the
> elftools module is missing. This avoids later errors at build time when the
> module is missing.
> 
> Old output (error logged and config continues):
> 
>   Program python3 (elftools) found: NO
> 
> New output:
>   Program python3 found: YES (/usr/bin/python3)
>   Program python3 (elftools) found: NO
> 
>   ../buildtools/meson.build:15:31: ERROR: python3 is missing modules: elftools
> 
> Signed-off-by: Bruce Richardson 

Applied, thanks.




Re: [PATCH] buildtools/get-test-suites.py: multi-line macros

2025-06-29 Thread Thomas Monjalon
18/06/2025 14:39, Marat Khalili:
> Test list is currently generated by scanning all files for macros
> starting with `REGISTER_` and ending with `_TEST`. Unfortunately, this
> was done line-by-line, and macros split into several lines were silently
> ignored resulting in tests being excluded from test suites without any
> warning.
> 
> Make regular expression multiline, capturing everything until the
> closing parenthesis. (There should be no nested parentheses due to the
> nature of the arguments these macros accept.)
> 
> The rest of the functionality stays the same. The result was manually
> compared to be identical to the previous version.
> 
> Signed-off-by: Marat Khalili 

Applied, thanks.





Re: [PATCH v5 0/5] Use consecutive Tx queues' memory

2025-06-29 Thread Thomas Monjalon
> Bing Zhao (5):
>   net/mlx5: add new devarg for Tx queue consecutive memory
>   net/mlx5: calculate the memory length for all Tx queues
>   net/mlx5: allocate and release unique resources for Tx queues
>   net/mlx5: pass the information in Tx queue start
>   net/mlx5: use consecutive memory for Tx queue creation

Applied, thanks.





Re: [PATCH] more replace memcpy with structure assignment

2025-06-29 Thread Thomas Monjalon
12/06/2025 05:08, Stephen Hemminger:
> Prefer using simple structure assignment instead of memcpy.
> Using a structure assignment preserves type information, and the
> compiler already checks the types.
> 
> Signed-off-by: Stephen Hemminger 

Applied, thanks.
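
For illustration, a minimal sketch (not taken from the patch) of the pattern
being converted:

#include <string.h>
#include <rte_ether.h>

static void
copy_addr(struct rte_ether_addr *dst, const struct rte_ether_addr *src)
{
        /* Before: memcpy() discards type information. */
        memcpy(dst, src, sizeof(*dst));

        /* After: plain structure assignment; the compiler verifies that
         * both sides really are struct rte_ether_addr.
         */
        *dst = *src;
}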





Re: [PATCH v3 0/3] handle sysconf(_SC_PAGESIZE) negative return value

2025-06-29 Thread Thomas Monjalon
24/06/2025 10:03, Morten Brørup:
> Coverity reports some defects, where the root cause seems to be negative
> return value from sysconf(_SC_PAGESIZE) not being handled.
> This series addresses those defects in the DPDK libraries.
> 
> PS: "_SC_PAGESIZE" has the alias "_SC_PAGE_SIZE". Both are covered here.
> 
> Morten Brørup (3):
>   eal/unix: fix log message for madvise() failure
>   eal: handle sysconf(_SC_PAGESIZE) negative return value
>   pmu: handle sysconf(_SC_PAGESIZE) negative return value

Applied, thanks.




Re: [PATCH] doc: fix missing feature matrix for event device

2025-06-29 Thread Thomas Monjalon
16/06/2025 17:05, Jerin Jacob:
> On Mon, Jun 16, 2025 at 2:02 PM  wrote:
> >
> > From: Pavan Nikhilesh 
> >
> > Fix missing feature matrix addition for event device DMA and
> > vector adapters.
> >
> > Fixes: 66a30a29387a ("eventdev/dma: introduce DMA adapter")
> > Fixes: e12c3754da7a ("eventdev/vector: introduce event vector adapter")
> >
> > Signed-off-by: Pavan Nikhilesh 
> 
> Acked-by: Jerin Jacob 
> Tested-by: Jerin Jacob 

Applied, thanks.





release candidate 25.07-rc2

2025-06-29 Thread Thomas Monjalon
A new DPDK release candidate is ready for testing:
https://git.dpdk.org/dpdk/tag/?id=v25.07-rc2

There are 141 new patches in this snapshot.

Release notes:
https://doc.dpdk.org/guides/rel_notes/release_25_07.html

Most significant changes are in multiple drivers.

Please test and report issues on https://bugs.dpdk.org

DPDK 25.07-rc3 is expected in one week,
with a focus on fixes, tests, examples and doc.

Thank you everyone




[PATCH v2] event/dlb2: add dequeue interrupt mode support

2025-06-29 Thread Pravin Pathak
DLB2 port interrupts are implemented using the DPDK interrupt
framework. This allows the eventdev dequeue API to sleep when
the port queue is empty and to wake up when an event arrives
at the port. The port dequeue mode is configured using the devargs
argument port_dequeue_wait. Supported modes are polling and
interrupt; the default mode is polling.
This commit also adds code to handle device error interrupts
and print alarm details.

Signed-off-by: Pravin Pathak 
Signed-off-by: Tirthendu Sarkar 
---
 doc/guides/eventdevs/dlb2.rst  |  20 +
 drivers/event/dlb2/dlb2.c  | 236 +-
 drivers/event/dlb2/dlb2_iface.c|   7 +
 drivers/event/dlb2/dlb2_iface.h|   8 +
 drivers/event/dlb2/dlb2_priv.h |  18 +
 drivers/event/dlb2/dlb2_user.h | 112 +++
 drivers/event/dlb2/pf/base/dlb2_hw_types.h |  70 ++
 drivers/event/dlb2/pf/base/dlb2_osdep.h|  46 ++
 drivers/event/dlb2/pf/base/dlb2_regs.h | 149 +++-
 drivers/event/dlb2/pf/base/dlb2_resource.c | 825 +
 drivers/event/dlb2/pf/base/dlb2_resource.h |   6 +
 drivers/event/dlb2/pf/dlb2_pf.c| 223 ++
 12 files changed, 1711 insertions(+), 9 deletions(-)

diff --git a/doc/guides/eventdevs/dlb2.rst b/doc/guides/eventdevs/dlb2.rst
index 8ec7168f20..a4ba857351 100644
--- a/doc/guides/eventdevs/dlb2.rst
+++ b/doc/guides/eventdevs/dlb2.rst
@@ -477,6 +477,26 @@ Example command to use as meson option for credit handling:
 
meson configure -Dc_args='-DDLB_SW_CREDITS_CHECKS=0 
-DDLB_HW_CREDITS_CHECKS=1'
 
+Interrupt Mode Support
+~~
+DLB dequeue supports interrupt mode for the API rte_event_dequeue_burst().
+The default port dequeue mode is polling. Dequeue wait mode can be configured
+on per eventdev port basis using devargs argument 'port_dequeue_wait'. In
+interrupt mode, if the port queue is empty, the application thread will block
+on the interrupt until a new event arrives. It enters blocking mode only after
+any specified timeout. During the timeout, it will poll the port queue for
+events as usual. Interrupt mode uses the DPDK interrupt support framework.
+
+.. code-block:: console
+
+   --allow ea:00.0,port_dequeue_wait=all:interrupt
+
+port = all / <port number> / <first port>-<last port>
+mode = interrupt/polling
+
+Eventdev port interrupt and polling wait modes for dequeue can be set for all
+the ports, a single port, or a range of ports using this parameter.
+
 Running Eventdev Applications with DLB Device
 -
 
diff --git a/drivers/event/dlb2/dlb2.c b/drivers/event/dlb2/dlb2.c
index 084875f1c8..c3e40bd707 100644
--- a/drivers/event/dlb2/dlb2.c
+++ b/drivers/event/dlb2/dlb2.c
@@ -185,6 +185,22 @@ dlb2_init_queue_depth_thresholds(struct dlb2_eventdev 
*dlb2,
}
 }
 
+/* override defaults with value(s) provided on command line */
+static int
+dlb2_init_port_dequeue_wait(struct dlb2_eventdev *dlb2,
+   enum dlb2_port_dequeue_wait_types
+   *port_dequeue_wait_modes)
+{
+   int p;
+
+   for (p = 0; p < DLB2_MAX_NUM_PORTS(dlb2->version); p++) {
+   if (port_dequeue_wait_modes[p] != 0)
+   dlb2->ev_ports[p].qm_port.dequeue_wait =
+   port_dequeue_wait_modes[p];
+   }
+   return 0;
+}
+
 /* override defaults with value(s) provided on command line */
 static void
 dlb2_init_port_cos(struct dlb2_eventdev *dlb2, int *port_cos)
@@ -867,6 +883,111 @@ set_qid_depth_thresh_v2_5(const char *key __rte_unused,
return 0;
 }
 
+static int
+set_port_dequeue_wait_ver(const char *key __rte_unused,
+ const char *value,
+ void *opaque,
+ int version)
+{
+   struct dlb2_port_dequeue_wait *dequeue_wait = opaque;
+   int first, last;
+   enum dlb2_port_dequeue_wait_types wait;
+   const char *valp = value;
+   bool port_list[DLB2_MAX_NUM_PORTS_ALL] = {false};
+   int lmax = DLB2_MAX_NUM_PORTS(version);
+   int len;
+   int lc;
+
+   if (value == NULL || opaque == NULL) {
+   DLB2_LOG_ERR("NULL pointer");
+   return -EINVAL;
+   }
+
+   /* command line override may take a combination of the following forms:
+* port_dequeue_wait=all: ... all ports
+* port_dequeue_wait=portA-portB: ... a range of ports
+* port_dequeue_wait=portA: ... just one port
+*/
+
+   do {
+   do {
+   if (strncmp(valp, "all", 3) == 0) {
+   for (lc = 0; lc < lmax; lc++)
+   port_list[lc] = true;
+   valp += 3;
+   } else if (sscanf(valp, "%d-%d%n",
+ &first,
+ &last,
+ &len) == 2) {
+  

[PATCH v1] event/dlb2: update DLB documentation for history list config

2025-06-29 Thread Pravin Pathak
Update DPDK documentation for configuring DLB hardware history
list resource using devargs arguments.

Fixes: 33ab065d0c40 ("event/dlb2: support managing history list resource")

Signed-off-by: Pravin Pathak 
---
 doc/guides/eventdevs/dlb2.rst | 23 +++
 1 file changed, 23 insertions(+)

diff --git a/doc/guides/eventdevs/dlb2.rst b/doc/guides/eventdevs/dlb2.rst
index 8ec7168f20..2f836db010 100644
--- a/doc/guides/eventdevs/dlb2.rst
+++ b/doc/guides/eventdevs/dlb2.rst
@@ -7,6 +7,12 @@ Driver for the Intel® Dynamic Load Balancer (DLB)
 The DPDK DLB poll mode driver supports the Intel® Dynamic Load Balancer,
 hardware versions 2.0 and 2.5.
 
+Please follow the links below to download the Programmer Guides.
+
+`Intel Dynamic Load Balancer 2.0 Programmer Guide 
`_. (Device: 0x2710)
+
+`Intel Dynamic Load Balancer 2.5 Programmer Guide 
`_. (Device: 0x2714)
+
 Prerequisites
 -
 
@@ -477,6 +483,23 @@ Example command to use as meson option for credit handling:
 
meson configure -Dc_args='-DDLB_SW_CREDITS_CHECKS=0 
-DDLB_HW_CREDITS_CHECKS=1'
 
+DLB History List Configuration
+~~
+Every DLB Load Balancing port (i.e., eventdev port not using RTE_EVENT_PORT_CFG_SINGLE_LINK flag)
+has a hardware resource called history list entries (HL) associated with it. This count decides the number
+of events that can be inflight to the port from the DLB hardware. DLB has 2048 total HL entries.
+As DLB supports 64 load-balanced ports, by default DLB PMD assigns 32 HL entries to each port.
+Following devargs arguments allow application to control HL entries overriding default mode.
+DLB API rte_pmd_dlb2_set_port_param() allows setting HL entries for the DLB eventdev ports.
+Please refer to section "Fine Tuning History List Entries" in DLB Programmer Guide for details.
+
+.. code-block:: console
+
+   --allow ea:00.0,use_default_hl=0,alloc_hl_entries=1024
+
+use_default_hl = 1=Enable (default), 0=Disable
+alloc_hl_entries = 0-2048 Total HL entries
+
 Running Eventdev Applications with DLB Device
 -
 
-- 
2.39.1



RE: [PATCH v0 0/3] [v0]drivers/net fixed Coverity issue

2025-06-29 Thread 11
Hi Stephen,

Thanks for your guidance; I will submit the next version.

Regards, Wenbo

> -Original Message-
> From: Stephen Hemminger 
> Sent: 30 June 2025 1:44
> To: Wenbo Cao 
> Cc: dev@dpdk.org; yao...@mucse.com
> Subject: Re: [PATCH v0 0/3] [v0]drivers/net fixed Coverity issue
> 
> On Wed, 18 Jun 2025 20:11:10 +0800
> Wenbo Cao  wrote:
> 
> > v1:
> >   *:fixed compile issue
> > v0:
> >   *:fixed the below issue:
> > Coverity issue: 468860,468866,468858
> > Fixes: 4530e70f1e32 ("net/rnp: support Tx TSO offload")
> > Fixes: 52dfb84e14be ("net/rnp: add device init and uninit")
> > Fixes: 52aae4ed4ffb ("net/rnp: add device capabilities")
> >   *:fixed 64k tso
> >
> > Wenbo Cao (3):
> >   net/rnp: add check firmware respond info
> >   net/rnp: fix Tunnel-TSO VLAN header untrusted loop bound
> >   net/rnp: fix TSO segmentation for packets of 64KB
> >
> >  drivers/net/rnp/base/rnp_fw_cmd.h |   1 +
> >  drivers/net/rnp/base/rnp_mbx_fw.c |  15 +++-
> >  drivers/net/rnp/rnp_ethdev.c  |  16 ++--
> >  drivers/net/rnp/rnp_rxtx.c| 118 +++---
> >  drivers/net/rnp/rnp_rxtx.h|   1 +
> >  5 files changed, 117 insertions(+), 34 deletions(-)
> >
> 
> Overall this patchset looks fine.
> Could you try out the suggested changes and resubmit, please.




[PATCH v2 3/3] net/rnp: fix TSO segmentation for packets of 64KB

2025-06-29 Thread Wenbo Cao
Packets exceeding the 64KB TSO size must be fragmented
across multiple descriptors; otherwise, it may cause
TSO fragmentation anomalies.

Fixes: 4530e70f1e32 ("net/rnp: support Tx TSO offload")
Cc: sta...@dpdk.org

Signed-off-by: Wenbo Cao 
Reviewed-by: Stephen Hemminger 
---
 drivers/net/rnp/rnp_rxtx.c | 48 ++
 1 file changed, 44 insertions(+), 4 deletions(-)

diff --git a/drivers/net/rnp/rnp_rxtx.c b/drivers/net/rnp/rnp_rxtx.c
index ee31f17cad..81e8c6ba44 100644
--- a/drivers/net/rnp/rnp_rxtx.c
+++ b/drivers/net/rnp/rnp_rxtx.c
@@ -1157,6 +1157,21 @@ rnp_need_ctrl_desc(uint64_t flags)
return (flags & mask) ? 1 : 0;
 }
 
+#define RNP_MAX_TSO_SEG_LEN(4096)
+static inline uint16_t
+rnp_calc_pkt_desc(struct rte_mbuf *tx_pkt)
+{
+   struct rte_mbuf *txd = tx_pkt;
+   uint16_t count = 0;
+
+   while (txd != NULL) {
+   count += DIV_ROUND_UP(txd->data_len, RNP_MAX_TSO_SEG_LEN);
+   txd = txd->next;
+   }
+
+   return count;
+}
+
 static void
 rnp_build_tx_control_desc(struct rnp_tx_queue *txq,
  volatile struct rnp_tx_desc *txbd,
@@ -1394,6 +1409,10 @@ rnp_multiseg_xmit_pkts(void *_txq, struct rte_mbuf 
**tx_pkts, uint16_t nb_pkts)
tx_pkt = tx_pkts[nb_tx];
ctx_desc_use = rnp_need_ctrl_desc(tx_pkt->ol_flags);
nb_used_bd = tx_pkt->nb_segs + ctx_desc_use;
+   if (tx_pkt->ol_flags & RTE_MBUF_F_TX_TCP_SEG)
+   nb_used_bd = (uint16_t)(rnp_calc_pkt_desc(tx_pkt) + 
ctx_desc_use);
+   else
+   nb_used_bd = tx_pkt->nb_segs + ctx_desc_use;
tx_last = (uint16_t)(tx_id + nb_used_bd - 1);
if (tx_last >= txq->attr.nb_desc)
tx_last = (uint16_t)(tx_last - txq->attr.nb_desc);
@@ -1416,8 +1435,11 @@ rnp_multiseg_xmit_pkts(void *_txq, struct rte_mbuf 
**tx_pkts, uint16_t nb_pkts)
m_seg = tx_pkt;
first_seg = 1;
do {
+   uint16_t remain_len = 0;
+   uint64_t dma_addr = 0;
+
txbd = &txq->tx_bdr[tx_id];
-   txbd->d.cmd = 0;
+   *txbd = txq->zero_desc;
txn = &txq->sw_ring[txe->next_id];
if ((first_seg && m_seg->ol_flags)) {
rnp_setup_tx_offload(txq, txbd,
@@ -1430,11 +1452,29 @@ rnp_multiseg_xmit_pkts(void *_txq, struct rte_mbuf 
**tx_pkts, uint16_t nb_pkts)
rte_pktmbuf_free_seg(txe->mbuf);
txe->mbuf = NULL;
}
+   dma_addr = rnp_get_dma_addr(&txq->attr, m_seg);
+   remain_len = m_seg->data_len;
txe->mbuf = m_seg;
+   while ((tx_pkt->ol_flags & RTE_MBUF_F_TX_TCP_SEG) &&
+   unlikely(remain_len > 
RNP_MAX_TSO_SEG_LEN)) {
+   txbd->d.addr = dma_addr;
+   txbd->d.blen = 
rte_cpu_to_le_32(RNP_MAX_TSO_SEG_LEN);
+   dma_addr += RNP_MAX_TSO_SEG_LEN;
+   remain_len -= RNP_MAX_TSO_SEG_LEN;
+   txe->last_id = tx_last;
+   tx_id = txe->next_id;
+   txe = txn;
+   if (txe->mbuf) {
+   rte_pktmbuf_free_seg(txe->mbuf);
+   txe->mbuf = NULL;
+   }
+   txbd = &txq->tx_bdr[tx_id];
+   *txbd = txq->zero_desc;
+   txn = &txq->sw_ring[txe->next_id];
+   }
txe->last_id = tx_last;
-   txbd->d.addr = rnp_get_dma_addr(&txq->attr, m_seg);
-   txbd->d.blen = rte_cpu_to_le_32(m_seg->data_len);
-   txbd->d.cmd &= ~RNP_CMD_EOP;
+   txbd->d.addr = dma_addr;
+   txbd->d.blen = rte_cpu_to_le_32(remain_len);
m_seg = m_seg->next;
tx_id = txe->next_id;
txe = txn;
-- 
2.34.1



[PATCH v2 2/3] net/rnp: fix Tunnel-TSO VLAN header untrusted loop bound

2025-06-29 Thread Wenbo Cao
Adds support for boundary checking in the VLAN header
and corrects protocol header type verification.

Fixes: 4530e70f1e32 ("net/rnp: support Tx TSO offload")
Cc: sta...@dpdk.org

Signed-off-by: Wenbo Cao 
Reviewed-by: Stephen Hemminger 
---
 drivers/net/rnp/rnp_rxtx.c | 70 ++
 drivers/net/rnp/rnp_rxtx.h |  1 +
 2 files changed, 50 insertions(+), 21 deletions(-)

diff --git a/drivers/net/rnp/rnp_rxtx.c b/drivers/net/rnp/rnp_rxtx.c
index da08728198..ee31f17cad 100644
--- a/drivers/net/rnp/rnp_rxtx.c
+++ b/drivers/net/rnp/rnp_rxtx.c
@@ -1205,6 +1205,7 @@ rnp_build_tx_control_desc(struct rnp_tx_queue *txq,
}
txbd->c.qword0.tunnel_len = tunnel_len;
txbd->c.qword1.cmd |= RNP_CTRL_DESC;
+   txq->tunnel_len = tunnel_len;
 }
 
 static void
@@ -1243,40 +1244,66 @@ rnp_padding_hdr_len(volatile struct rnp_tx_desc *txbd,
txbd->d.mac_ip_len |= l3_len;
 }
 
-static void
-rnp_check_inner_eth_hdr(struct rte_mbuf *mbuf,
+#define RNP_MAX_VLAN_HDR_NUM   (4)
+static int
+rnp_check_inner_eth_hdr(struct rnp_tx_queue *txq,
+   struct rte_mbuf *mbuf,
volatile struct rnp_tx_desc *txbd)
 {
struct rte_ether_hdr *eth_hdr;
uint16_t inner_l2_offset = 0;
struct rte_vlan_hdr *vlan_hdr;
uint16_t ext_l2_len = 0;
-   uint16_t l2_offset = 0;
+   char *vlan_start = NULL;
uint16_t l2_type;
 
-   inner_l2_offset = mbuf->outer_l2_len + mbuf->outer_l3_len +
-   sizeof(struct rte_udp_hdr) +
-   sizeof(struct rte_vxlan_hdr);
+   inner_l2_offset = txq->tunnel_len;
+   if (inner_l2_offset + sizeof(struct rte_ether_hdr) > mbuf->data_len) {
+   RNP_PMD_LOG(ERR, "Invalid inner L2 offset");
+   return -EINVAL;
+   }
eth_hdr = rte_pktmbuf_mtod_offset(mbuf,
struct rte_ether_hdr *, inner_l2_offset);
l2_type = eth_hdr->ether_type;
-   l2_offset = txbd->d.mac_ip_len >> RNP_TX_MAC_LEN_S;
-   while (l2_type == rte_cpu_to_be_16(RTE_ETHER_TYPE_VLAN) ||
-   l2_type == rte_cpu_to_be_16(RTE_ETHER_TYPE_QINQ)) {
-   vlan_hdr = (struct rte_vlan_hdr *)
-   ((char *)eth_hdr + l2_offset);
-   l2_offset += RTE_VLAN_HLEN;
-   ext_l2_len += RTE_VLAN_HLEN;
+   vlan_start = (char *)(eth_hdr + 1);
+   while ((l2_type == RTE_BE16(RTE_ETHER_TYPE_VLAN) ||
+   l2_type == RTE_BE16(RTE_ETHER_TYPE_QINQ)) &&
+   (ext_l2_len < RNP_MAX_VLAN_HDR_NUM * RTE_VLAN_HLEN)) {
+   if (vlan_start + ext_l2_len >
+   rte_pktmbuf_mtod(mbuf, char*) + mbuf->data_len) 
{
+   RNP_PMD_LOG(ERR, "VLAN header exceeds buffer");
+   break;
+   }
+   vlan_hdr = (struct rte_vlan_hdr *)(vlan_start + ext_l2_len);
l2_type = vlan_hdr->eth_proto;
+   ext_l2_len += RTE_VLAN_HLEN;
}
-   txbd->d.mac_ip_len += (ext_l2_len << RNP_TX_MAC_LEN_S);
+   if (unlikely(mbuf->l3_len == 0)) {
+   switch (rte_be_to_cpu_16(l2_type)) {
+   case RTE_ETHER_TYPE_IPV4:
+   txbd->d.mac_ip_len = sizeof(struct rte_ipv4_hdr);
+   break;
+   case RTE_ETHER_TYPE_IPV6:
+   txbd->d.mac_ip_len = sizeof(struct rte_ipv6_hdr);
+   break;
+   default:
+   break;
+   }
+   } else {
+   txbd->d.mac_ip_len = mbuf->l3_len;
+   }
+   ext_l2_len += sizeof(*eth_hdr);
+   txbd->d.mac_ip_len |= (ext_l2_len << RNP_TX_MAC_LEN_S);
+
+   return 0;
 }
 
 #define RNP_TX_L4_OFFLOAD_ALL   (RTE_MBUF_F_TX_SCTP_CKSUM | \
 RTE_MBUF_F_TX_TCP_CKSUM | \
 RTE_MBUF_F_TX_UDP_CKSUM)
 static inline void
-rnp_setup_csum_offload(struct rte_mbuf *mbuf,
+rnp_setup_csum_offload(struct rnp_tx_queue *txq,
+  struct rte_mbuf *mbuf,
   volatile struct rnp_tx_desc *tx_desc)
 {
tx_desc->d.cmd |= (mbuf->ol_flags & RTE_MBUF_F_TX_IP_CKSUM) ?
@@ -1296,8 +1323,6 @@ rnp_setup_csum_offload(struct rte_mbuf *mbuf,
tx_desc->d.cmd |= RNP_TX_L4TYPE_SCTP;
break;
}
-   tx_desc->d.mac_ip_len = mbuf->l2_len << RNP_TX_MAC_LEN_S;
-   tx_desc->d.mac_ip_len |= mbuf->l3_len;
if (mbuf->ol_flags & RTE_MBUF_F_TX_TCP_SEG) {
tx_desc->d.cmd |= RNP_TX_IP_CKSUM_EN;
tx_desc->d.cmd |= RNP_TX_L4CKSUM_EN;
@@ -1306,9 +1331,8 @@ rnp_setup_csum_offload(struct rte_mbuf *mbuf,
}
if (mbuf->ol_flags & RTE_MBUF_F_TX_TUNNEL_MASK) {
/* need inner l2 l3 lens for inner checksum offload */
-   tx_desc->d.mac_ip_len &= ~RNP_TX_MAC_LEN_MASK;
-   tx_desc->d.mac_ip_len |= RTE_ET

[PATCH v2 0/3] [v2]drivers/net/rnp fixed Coverity issue

2025-06-29 Thread Wenbo Cao
This patchset primarily improves the robustness of the code logic and
resolves anomalies in TSO segmentation.

v2:
  * Optimized logical portability per Stephen Hemminger's suggestions
v1:
  * fixed compile issue
v0:
  *:fixed the below issue:
Coverity issue: 468860,468866,468858
Fixes: 4530e70f1e32 ("net/rnp: support Tx TSO offload")
Fixes: 52dfb84e14be ("net/rnp: add device init and uninit")
Fixes: 52aae4ed4ffb ("net/rnp: add device capabilities")
  *:fixed 64k tso

Wenbo Cao (3):
  net/rnp: add check firmware respond info
  net/rnp: fix Tunnel-TSO VLAN header untrusted loop bound
  net/rnp: fix TSO segmentation for packets of 64KB

 drivers/net/rnp/base/rnp_mbx_fw.c |  15 +++-
 drivers/net/rnp/rnp_ethdev.c  |  16 ++--
 drivers/net/rnp/rnp_rxtx.c| 118 +++---
 drivers/net/rnp/rnp_rxtx.h|   1 +
 4 files changed, 116 insertions(+), 34 deletions(-)

-- 
2.34.1



[PATCH v2 1/3] net/rnp: add check firmware respond info

2025-06-29 Thread Wenbo Cao
Add logic checks at critical points to detect potentially illegal
firmware information, preventing subsequent logic exceptions.

Fixes: 52aae4ed4ffb ("net/rnp: add device capabilities")
Fixes: 52dfb84e14be ("net/rnp: add device init and uninit")
Cc: sta...@dpdk.org

Signed-off-by: Wenbo Cao 
Reviewed-by: Stephen Hemminger 
---
 drivers/net/rnp/base/rnp_mbx_fw.c | 15 ++-
 drivers/net/rnp/rnp_ethdev.c  | 16 
 2 files changed, 22 insertions(+), 9 deletions(-)

diff --git a/drivers/net/rnp/base/rnp_mbx_fw.c 
b/drivers/net/rnp/base/rnp_mbx_fw.c
index 3e7cf7f9ad..9e0b1730c2 100644
--- a/drivers/net/rnp/base/rnp_mbx_fw.c
+++ b/drivers/net/rnp/base/rnp_mbx_fw.c
@@ -230,6 +230,7 @@ rnp_fw_get_phy_capability(struct rnp_eth_port *port,
return 0;
 }
 
+#define RNP_MAX_LANE_MASK  (0xf)
 int rnp_mbx_fw_get_capability(struct rnp_eth_port *port)
 {
struct rnp_phy_abilities_rep ability;
@@ -252,17 +253,29 @@ int rnp_mbx_fw_get_capability(struct rnp_eth_port *port)
hw->nic_mode = ability.nic_mode;
/* get phy<->lane mapping info */
lane_cnt = rte_popcount32(hw->lane_mask);
+   if (lane_cnt > RNP_MAX_PORT_OF_PF) {
+   RNP_PMD_LOG(ERR, "firmware invalid lane_mask");
+   return -EINVAL;
+   }
temp_mask = hw->lane_mask;
+   if (temp_mask == 0 || temp_mask > RNP_MAX_LANE_MASK) {
+   RNP_PMD_LOG(ERR, "lane_mask is invalid 0x%.2x", 
temp_mask);
+   return -EINVAL;
+   }
if (ability.e.ports_is_sgmii_valid)
is_sgmii_bits = ability.e.lane_is_sgmii;
for (idx = 0; idx < lane_cnt; idx++) {
hw->phy_port_ids[idx] = port_ids[idx];
+   if (temp_mask == 0) {
+   RNP_PMD_LOG(ERR, "temp_mask is zero at idx=%d", 
idx);
+   return -EINVAL;
+   }
lane_bit = ffs(temp_mask) - 1;
lane_idx = port_ids[idx] % lane_cnt;
hw->lane_of_port[lane_idx] = lane_bit;
is_sgmii = lane_bit & is_sgmii_bits ? 1 : 0;
hw->lane_is_sgmii[lane_idx] = is_sgmii;
-   temp_mask &= ~RTE_BIT32(lane_bit);
+   temp_mask &= ~(1ULL << lane_bit);
}
hw->max_port_num = lane_cnt;
}
diff --git a/drivers/net/rnp/rnp_ethdev.c b/drivers/net/rnp/rnp_ethdev.c
index de1c077f61..24eb0b16dd 100644
--- a/drivers/net/rnp/rnp_ethdev.c
+++ b/drivers/net/rnp/rnp_ethdev.c
@@ -751,17 +751,17 @@ rnp_get_speed_caps(struct rte_eth_dev *dev)
 {
struct rnp_eth_port *port = RNP_DEV_TO_PORT(dev);
uint32_t speed_cap = 0;
-   uint32_t i = 0, speed;
uint32_t support_link;
-   uint32_t link_types;
+   uint32_t speed = 0;
+   int bit_pos = 0;
 
support_link = port->attr.phy_meta.supported_link;
-   link_types = rte_popcount64(support_link);
-   if (!link_types)
+   if (support_link == 0)
return 0;
-   for (i = 0; i < link_types; i++) {
-   speed = ffs(support_link) - 1;
-   switch (RTE_BIT32(speed)) {
+   while (support_link) {
+   bit_pos = rte_ffs32(support_link) - 1;
+   speed = RTE_BIT32(bit_pos);
+   switch (speed) {
case RNP_SPEED_CAP_10M_FULL:
speed_cap |= RTE_ETH_LINK_SPEED_10M;
break;
@@ -789,7 +789,7 @@ rnp_get_speed_caps(struct rte_eth_dev *dev)
default:
speed_cap |= 0;
}
-   support_link &= ~RTE_BIT32(speed);
+   support_link &= ~speed;
}
if (!port->attr.phy_meta.link_autoneg)
speed_cap |= RTE_ETH_LINK_SPEED_FIXED;
-- 
2.34.1



Re: [PATCH v3 2/3] eal: handle sysconf(_SC_PAGESIZE) negative return value

2025-06-29 Thread Thomas Monjalon
29/06/2025 00:49, Stephen Hemminger:
> On Sat, 28 Jun 2025 18:45:44 +0200
> Morten Brørup  wrote:
> 
> > > From: Thomas Monjalon [mailto:tho...@monjalon.net]
> > > Sent: Friday, 27 June 2025 20.30
> > > 
> > > 27/06/2025 19:49, Morten Brørup:  
> > > > > From: Thomas Monjalon [mailto:tho...@monjalon.net]
> > > > > Sent: Friday, 27 June 2025 19.35
> > > > >
> > > > > 27/06/2025 18:38, Morten Brørup:  
> > > > > > > From: Thomas Monjalon [mailto:tho...@monjalon.net]
> > > > > > > Sent: Friday, 27 June 2025 17.58
> > > > > > >
> > > > > > > 24/06/2025 10:03, Morten Brørup:  
> > > > > > > > +   if ((ssize_t)page_size < 0)
> > > > > > > > +   rte_panic("sysconf(_SC_PAGESIZE) 
> > > > > > > > failed: %s",
> > > > > > > > +   errno == 0 ? 
> > > > > > > > "Indeterminate" :  
> > > > > > > strerror(errno));
> > > > > > >
> > > > > > > We don't want more rte_panic().
> > > > > > > You could log the problem and return 0 here.
> > > > > > > It will be a problem later, but it may allow the application to  
> > > > > cleanup  
> > > > > > > instead of abrupting crashing.  
> > > > > >
> > > > > > Disagree.
> > > > > > That would be likely to cause crash with division by zero later.
> > > > > > Better to fail early.  
> > > > >
> > > > > Which division by zero?  
> > > >
> > > > Functions dividing by page size. E.g.:
> > > >  
> > > https://elixir.bootlin.com/dpdk/v25.03/source/lib/eal/common/eal_common_
> > > memory.c#L313  
> > > >  
> > > > >
> > > > > I don't think a library should take this decision on behalf of the  
> > > app.  
> > > >
> > > > I expect lots of things to break if sysconf(_SC_PAGESIZE) fails, so  
> > > the purpose of this patch is to centralize error handling here, and only
> > > continue/return with non-failing values.  
> > > >
> > > > Otherwise, everywhere using rte_mem_page_size() or  
> > > sysconf(_SC_PAGESIZE) should implement error handling (or ignore
> > > errors).  
> > > > That's a lot of places, so I'm not going to provide a patch doing  
> > > that.
> > > 
> > > I understand.
> > > 
> > > The problem is that we don't have an exception mechanism in this
> > > language.  
> > 
> > Yep.
> > And everyone assumes sysconf(_SC_PAGESIZE) never fails, which is probably 
> > correct, so nobody implemented error handling for it. Not even in 
> > rte_mem_page_size().
> > Coverity detected the missing error handling, and warns: "Although 
> > rte_mem_page_size() is declared to return unsigned int, it may actually 
> > return a negative value." This defect applies to all functions calling 
> > rte_mem_page_size().
> > This patch adds error handling to ensure that rte_mem_page_size() only 
> > returns non-negative values, or doesn’t return at all - i.e. fails with 
> > rte_panic() - so Coverity is satisfied with callers not implementing error 
> > handling for it.
> > 
> > It would be a borderline waste of time fixing all the callers, so I fixed the 
> > root cause to satisfy Coverity.
> > 
> > From an higher level perspective:
> > This is a low level EAL function to determine the page size. I would 
> > consider it reasonable for such a low level EAL function to never fail.
> > If some O/S decides to not have a "system page size", and fail with 
> > "Indeterminate", e.g. to support multiple page sizes, we would need to 
> > handle that somehow. But let's ignore that until it actually happens, if 
> > ever.
> > 
> > If you are skeptical about this patch 2/3 in the series, we can escalate 
> > the discussion to the tech board. If you really hate this patch 2/3, I will 
> > honor a NAK from you. The patch is not important for me; I'm just trying to 
> > clean up.
> > 
> 
> In such cases, I look at the glibc source and see whether it handles it or not.
> It looks like it is only used in a couple of places there: the result of
> sysconf(_SC_PAGE_SIZE) is checked in one of the tests, but is not checked in
> the loading of locales. It expects a valid power-of-2 value there.
> 
> OK to just die if the value isn't valid.

Yes I'm convinced too.




Re: [PATCH] ethdev: sync ethtool link modes with Linux 6.15

2025-06-29 Thread Thomas Monjalon
26/06/2025 17:13, Thomas Monjalon:
> 26/06/2025 16:26, Stephen Hemminger:
> > On Wed, 25 Jun 2025 15:42:02 +0200
> > Thomas Monjalon  wrote:
> > 
> > > diff --git a/lib/ethdev/ethdev_linux_ethtool.c 
> > > b/lib/ethdev/ethdev_linux_ethtool.c
> > > index ec42d3054a..f508cdba6c 100644
> > > --- a/lib/ethdev/ethdev_linux_ethtool.c
> > > +++ b/lib/ethdev/ethdev_linux_ethtool.c
> > > @@ -17,8 +17,9 @@
> > >   *
> > >   * The array below is built from bit definitions with this shell command:
> > >   *   sed -rn 's;.*(ETHTOOL_LINK_MODE_)([0-9]+)([0-9a-zA-Z_]*).*= 
> > > *([0-9]*).*;'\
> > > - *   '[\4] = \2, /\* \1\2\3 *\/;p' /usr/include/linux/ethtool.h |
> > > - *   awk '/_Half_/{$3=$3+1","}1'
> > > + *   '[\4] \2 \1\2\3;p' /usr/include/linux/ethtool.h |
> > > + *   awk '/_Half_/{$2=$2+1}1' |
> > > + *   awk '{printf "\t%5s = %7s, /\* %s *\/\n", $1, $2, $3}'
> > >   */
> > 
> > The commands in the comment never worked verbatim.
> 
> It works on my machine.
> 
> > $  sed -rn 's;.*(ETHTOOL_LINK_MODE_)([0-9]+)([0-9a-zA-Z_]*).*= 
> > *([0-9]*).*;'\
> >'[\4] \2 \1\2\3;p' /usr/include/linux/ethtool.h |
> >awk '/_Half_/{$2=$2+1}1' |
> >awk '{printf "\t%5s = %7s, /\* %s *\/\n", $1, $2, $3}'
> > 
> > > > > sed: -e expression #1, char 63: unterminated `s' command
> > awk: cmd. line:1: warning: escape sequence `\*' treated as plain `*'
> > awk: cmd. line:1: warning: escape sequence `\/' treated as plain `/'
> 
> The backslashes were added to help the syntax highlighting.
> But I can remove them.

Unfortunately we cannot remove the backslashes,
otherwise compilation of the comment fails.




Re: [v4 04/10] bus/dpaa: optimize bman acquire/release

2025-06-29 Thread Stephen Hemminger
On Wed, 11 Jun 2025 12:40:33 +0530
vanshika.shu...@nxp.com wrote:

> -#define BMAN_BUF_MASK 0xul
> +RTE_EXPORT_INTERNAL_SYMBOL(bman_release_fast)
> +int
> +bman_release_fast(struct bman_pool *pool, const uint64_t *bufs,
> + uint8_t num)
> +{
> + struct bman_portal *p;
> + struct bm_rcr_entry *r;
> + uint8_t i, avail;
> + uint64_t bpid = pool->params.bpid;
> + struct bm_hw_buf_desc bm_bufs[FSL_BM_BURST_MAX];
> +
> +#ifdef RTE_LIBRTE_DPAA_HWDEBUG
> + if (!num || (num > FSL_BM_BURST_MAX))
> + return -EINVAL;
> + if (pool->params.flags & BMAN_POOL_FLAG_NO_RELEASE)
> + return -EINVAL;
> +#endif
> +
> + p = get_affine_portal();
> + avail = bm_rcr_get_avail(&p->p);
> + if (avail < 2)
> + update_rcr_ci(p, avail);
> + r = bm_rcr_start(&p->p);
> + if (unlikely(!r))
> + return -EBUSY;
> +
> + /*
> +  * we can copy all but the first entry, as this can trigger badness
> +  * with the valid-bit
> +  */
> + bm_bufs[0].bpid = bpid;
> + bm_bufs[0].hi_addr = cpu_to_be16(HI16_OF_U48(bufs[0]));
> + bm_bufs[0].lo_addr = cpu_to_be32(LO32_OF_U48(bufs[0]));
> + for (i = 1; i < num; i++) {
> + bm_bufs[i].hi_addr = cpu_to_be16(HI16_OF_U48(bufs[i]));
> + bm_bufs[i].lo_addr = cpu_to_be32(LO32_OF_U48(bufs[i]));
> + }
> +
> + rte_memcpy(r->bufs, bm_bufs, sizeof(struct bm_buffer) * num);

Use memcpy instead. There are more compiler and security checks around
memcpy().

> +
> + bm_rcr_pvb_commit(&p->p, BM_RCR_VERB_CMD_BPID_SINGLE |
> + (num & BM_RCR_VERB_BUFCOUNT_MASK));
> +
> + return 0;
> +}


Re: [v4 07/10] net/dpaa: add Tx rate limiting DPAA PMD API

2025-06-29 Thread Stephen Hemminger
On Wed, 11 Jun 2025 12:40:36 +0530
vanshika.shu...@nxp.com wrote:

> From: Vinod Pullabhatla 
> 
> Add support to set Tx rate on DPAA platform through PMD APIs
> 
> Signed-off-by: Vinod Pullabhatla 
> Signed-off-by: Vanshika Shukla 
> ---

You intended to add a PMD-specific API for rate limiting,
but there is no RTE_EXPORT_SYMBOL, so it cannot actually be used.

You would have found this if you had added a test for it.

Not accepting this without a test for it in test-pmd.

And why is the existing ethdev per-queue rate-limit API not a better fit
here?
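
For reference, the generic ethdev per-queue Tx rate-limit API looks roughly
like this (sketch only; the port, queue and rate values are made up):

#include <rte_ethdev.h>

/* Limit Tx queue 0 of port 0 to 1000 Mbit/s using the generic ethdev API
 * instead of a PMD-specific call.
 */
static int
limit_tx_queue(void)
{
        return rte_eth_set_queue_rate_limit(0, 0, 1000);
}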


Re: [v4 02/10] bus/dpaa: add FMan node

2025-06-29 Thread Stephen Hemminger
On Wed, 11 Jun 2025 12:40:31 +0530
vanshika.shu...@nxp.com wrote:

> + fd = open(FMAN_DEVICE_PATH, O_RDWR);
> + if (unlikely(fd < 0)) {
> + DPAA_BUS_LOG(ERR, "Unable to open (%s)", FMAN_DEVICE_PATH);
> + return fd;
>   }

It would be helpful to the user if you added the errno reason:

DPAA_BUS_LOG(ERR, "Unable to open %s: %s", FMAN_DEVICE_PATH, strerror(errno));



Re: [v4 06/10] mempool/dpaa: adjust pool element for LS1043A errata

2025-06-29 Thread Stephen Hemminger
On Wed, 11 Jun 2025 12:40:35 +0530
vanshika.shu...@nxp.com wrote:

> From: Jun Yang 
> 
> Adjust every element of pool by populate callback.
> 1) Make sure start DMA address is aligned with 16B.
> 2) For buffer across 4KB boundary, make sure start DMA address is
>aligned with 256B.
> 
> Signed-off-by: Jun Yang 
> ---
>  drivers/mempool/dpaa/dpaa_mempool.c | 145 +++-
>  drivers/mempool/dpaa/dpaa_mempool.h |  11 ++-
>  2 files changed, 150 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/mempool/dpaa/dpaa_mempool.c 
> b/drivers/mempool/dpaa/dpaa_mempool.c
> index 6c850f5cb2..2af6ebcee2 100644
> --- a/drivers/mempool/dpaa/dpaa_mempool.c
> +++ b/drivers/mempool/dpaa/dpaa_mempool.c
> @@ -1,6 +1,6 @@
>  /* SPDX-License-Identifier: BSD-3-Clause
>   *
> - *   Copyright 2017,2019,2023 NXP
> + *   Copyright 2017,2019,2023-2025 NXP
>   *
>   */
>  
> @@ -13,6 +13,7 @@
>  #include 
>  #include 
>  #include 
> +
>  #include 
>  
>  #include 

Please don't introduce unnecessary whitespace changes.


Re: [v4 08/10] net/dpaa: add devargs for enabling err packets on main queue

2025-06-29 Thread Stephen Hemminger
On Wed, 11 Jun 2025 12:40:37 +0530
vanshika.shu...@nxp.com wrote:

> diff --git a/doc/guides/nics/dpaa.rst b/doc/guides/nics/dpaa.rst
> index de3ae96e07..cc9aef7f83 100644
> --- a/doc/guides/nics/dpaa.rst
> +++ b/doc/guides/nics/dpaa.rst
> @@ -277,6 +277,9 @@ for details.
>  
>  * Use dev arg option ``drv_ieee1588=1`` to enable IEEE 1588 support
>at driver level, e.g. ``dpaa:fm1-mac3,drv_ieee1588=1``.
> +* Use dev arg option ``recv_err_pkts=1`` to receive all packets including
> +  error packets and thus disabling hardware based packet handing
> +  at driver level, e.g. ``dpaa:fm1-mac3,recv_err_pkts=1``.
>  

Need a blank line between items in the list for the documentation to render correctly.


Re: [PATCH v4 0/7] net/ena: release 2.13.0

2025-06-29 Thread Stephen Hemminger
On Wed, 28 May 2025 13:25:24 +0300
Shai Brandes  wrote:

> This patchset includes an upgrade of the ENA HAL,
> introduces a new feature, and addresses three bug fixes.
> 
> Thank you in advance to the net maintainers and community members
> for your time and effort reviewing the code.
> 
> Best regards,
> Shai Brandes
> AWS Elastic Network Adapter team

Could you check against the current main branch?
This no longer applies; either it already got merged or something changed.


[RFC 2/4] event/dsw: add support for credit preallocation

2025-06-29 Thread Mattias Rönnblom
Implement RTE_EVENT_DEV_CAP_CREDIT_PREALLOCATION.

Signed-off-by: Mattias Rönnblom 
---
 drivers/event/dsw/dsw_evdev.c  |  5 ++-
 drivers/event/dsw/dsw_evdev.h  |  6 +++
 drivers/event/dsw/dsw_event.c  | 70 --
 drivers/event/dsw/dsw_xstats.c |  3 ++
 4 files changed, 71 insertions(+), 13 deletions(-)

diff --git a/drivers/event/dsw/dsw_evdev.c b/drivers/event/dsw/dsw_evdev.c
index e819412639..ecc1d947dd 100644
--- a/drivers/event/dsw/dsw_evdev.c
+++ b/drivers/event/dsw/dsw_evdev.c
@@ -228,7 +228,8 @@ dsw_info_get(struct rte_eventdev *dev __rte_unused,
RTE_EVENT_DEV_CAP_NONSEQ_MODE|
RTE_EVENT_DEV_CAP_MULTIPLE_QUEUE_PORT|
RTE_EVENT_DEV_CAP_CARRY_FLOW_ID |
-   RTE_EVENT_DEV_CAP_INDEPENDENT_ENQ
+   RTE_EVENT_DEV_CAP_INDEPENDENT_ENQ |
+   RTE_EVENT_DEV_CAP_CREDIT_PREALLOCATION
};
 }
 
@@ -458,6 +459,8 @@ dsw_probe(struct rte_vdev_device *vdev)
dev->enqueue_forward_burst = dsw_event_enqueue_forward_burst;
dev->dequeue_burst = dsw_event_dequeue_burst;
dev->maintain = dsw_event_maintain;
+   dev->credit_alloc = dsw_event_credit_alloc;
+   dev->credit_free = dsw_event_credit_free;
 
if (rte_eal_process_type() != RTE_PROC_PRIMARY)
return 0;
diff --git a/drivers/event/dsw/dsw_evdev.h b/drivers/event/dsw/dsw_evdev.h
index d78c5f4f26..c026b0a135 100644
--- a/drivers/event/dsw/dsw_evdev.h
+++ b/drivers/event/dsw/dsw_evdev.h
@@ -208,6 +208,7 @@ struct __rte_cache_aligned dsw_port {
 
uint64_t enqueue_calls;
uint64_t new_enqueued;
+   uint64_t new_prealloced_enqueued;
uint64_t forward_enqueued;
uint64_t release_enqueued;
uint64_t queue_enqueued[DSW_MAX_QUEUES];
@@ -284,6 +285,11 @@ uint16_t dsw_event_dequeue_burst(void *port, struct 
rte_event *events,
 uint16_t num, uint64_t wait);
 void dsw_event_maintain(void *port, int op);
 
+int dsw_event_credit_alloc(void *port, unsigned int new_event_threshold,
+  unsigned int num_credits);
+
+int dsw_event_credit_free(void *port, unsigned int num_credits);
+
 int dsw_xstats_get_names(const struct rte_eventdev *dev,
 enum rte_event_dev_xstats_mode mode,
 uint8_t queue_port_id,
diff --git a/drivers/event/dsw/dsw_event.c b/drivers/event/dsw/dsw_event.c
index 399d9f050e..09f353b324 100644
--- a/drivers/event/dsw/dsw_event.c
+++ b/drivers/event/dsw/dsw_event.c
@@ -93,9 +93,11 @@ dsw_port_return_credits(struct dsw_evdev *dsw, struct 
dsw_port *port,
 
 static void
 dsw_port_enqueue_stats(struct dsw_port *port, uint16_t num_new,
-  uint16_t num_forward, uint16_t num_release)
+  uint16_t num_new_prealloced, uint16_t num_forward,
+  uint16_t num_release)
 {
port->new_enqueued += num_new;
+   port->new_prealloced_enqueued += num_new_prealloced;
port->forward_enqueued += num_forward;
port->release_enqueued += num_release;
 }
@@ -1322,12 +1324,26 @@ dsw_port_flush_out_buffers(struct dsw_evdev *dsw, 
struct dsw_port *source_port)
dsw_port_transmit_buffered(dsw, source_port, dest_port_id);
 }
 
+static inline bool
+dsw_should_backpressure(struct dsw_evdev *dsw, int32_t new_event_threshold)
+{
+   int32_t credits_on_loan;
+   bool over_threshold;
+
+   credits_on_loan = rte_atomic_load_explicit(&dsw->credits_on_loan,
+  rte_memory_order_relaxed);
+
+   over_threshold = credits_on_loan > new_event_threshold;
+
+   return over_threshold;
+}
+
 static __rte_always_inline uint16_t
 dsw_event_enqueue_burst_generic(struct dsw_port *source_port,
const struct rte_event events[],
uint16_t events_len, bool op_types_known,
-   uint16_t num_new, uint16_t num_forward,
-   uint16_t num_release)
+   uint16_t num_new, uint16_t num_new_prealloced,
+   uint16_t num_forward, uint16_t num_release)
 {
struct dsw_evdev *dsw = source_port->dsw;
bool enough_credits;
@@ -1364,6 +1380,9 @@ dsw_event_enqueue_burst_generic(struct dsw_port 
*source_port,
case RTE_EVENT_OP_NEW:
num_new++;
break;
+   case RTE_EVENT_OP_NEW_PREALLOCED:
+   num_new_prealloced++;
+   break;
case RTE_EVENT_OP_FORWARD:
num_forward++;
break;
@@ -1379,9 +1398,7 @@ dsw_event_enqueue_burst_generic(struct dsw_port 
*source_port,
 * above the water mark.
 */
if (unlikely(num_new > 0 &&
- 

[RFC 3/4] eventdev: add enqueue optimized for prealloced events

2025-06-29 Thread Mattias Rönnblom
Extend Eventdev API with an enqueue function for events of the
RTE_EVENT_OP_NEW_PREALLOCED operation type.

Signed-off-by: Mattias Rönnblom 
---
 lib/eventdev/eventdev_pmd.h  |  2 +
 lib/eventdev/eventdev_private.c  |  1 +
 lib/eventdev/rte_eventdev.h  | 72 
 lib/eventdev/rte_eventdev_core.h |  2 +
 4 files changed, 70 insertions(+), 7 deletions(-)

diff --git a/lib/eventdev/eventdev_pmd.h b/lib/eventdev/eventdev_pmd.h
index 84ec3ea555..d636e9e7ac 100644
--- a/lib/eventdev/eventdev_pmd.h
+++ b/lib/eventdev/eventdev_pmd.h
@@ -166,6 +166,8 @@ struct __rte_cache_aligned rte_eventdev {
/**< Pointer to PMD enqueue burst function. */
event_enqueue_burst_t enqueue_new_burst;
/**< Pointer to PMD enqueue burst function(op new variant) */
+   event_enqueue_burst_t enqueue_new_prealloced_burst;
+   /**< Pointer to PMD enqueue burst function(op new prealloced variant) */
event_enqueue_burst_t enqueue_forward_burst;
/**< Pointer to PMD enqueue burst function(op forward variant) */
event_dequeue_burst_t dequeue_burst;
diff --git a/lib/eventdev/eventdev_private.c b/lib/eventdev/eventdev_private.c
index ec16125d83..d830ba8f3b 100644
--- a/lib/eventdev/eventdev_private.c
+++ b/lib/eventdev/eventdev_private.c
@@ -159,6 +159,7 @@ event_dev_fp_ops_set(struct rte_event_fp_ops *fp_op,
 {
fp_op->enqueue_burst = dev->enqueue_burst;
fp_op->enqueue_new_burst = dev->enqueue_new_burst;
+   fp_op->enqueue_new_prealloced_burst = dev->enqueue_new_prealloced_burst;
fp_op->enqueue_forward_burst = dev->enqueue_forward_burst;
fp_op->dequeue_burst = dev->dequeue_burst;
fp_op->maintain = dev->maintain;
diff --git a/lib/eventdev/rte_eventdev.h b/lib/eventdev/rte_eventdev.h
index 812ed2705c..fc71c54b3e 100644
--- a/lib/eventdev/rte_eventdev.h
+++ b/lib/eventdev/rte_eventdev.h
@@ -507,8 +507,8 @@ struct rte_event;
 #define RTE_EVENT_DEV_CAP_CREDIT_PREALLOCATION (1ULL << 21)
 /**< Event device supports credit preallocation for new events.
  *
- * The event device supports preallocation credits, which in turn allows
- * the use of @ref RTE_EVENT_OP_NEW_PREALLOCED.
+ * The event device supports preallocating credits, which in turn allows
+ * enqueueing events with operation type @ref RTE_EVENT_OP_NEW_PREALLOCED.
  *
  * @see rte_event_credit_alloc()
  * @see rte_event_credit_free()
@@ -2734,6 +2734,64 @@ rte_event_enqueue_new_burst(uint8_t dev_id, uint8_t 
port_id,
 fp_ops->enqueue_new_burst);
 }
 
+/**
+ * Enqueue a burst of events objects of operation type
+ * @ref RTE_EVENT_OP_NEW_PREALLOCED on an event device designated by its
+ * *dev_id* through the event port specified by *port_id*.
+ *
+ * Provides the same functionality as rte_event_enqueue_burst(),
+ * expect that application can use this API when the all objects in
+ * the burst contains the enqueue operation of the type
+ * @ref RTE_EVENT_OP_NEW_PREALLOCED. This specialized function can
+ * provide the additional hint to the PMD and optimize if possible.
+ *
+ * The rte_event_enqueue_new_prealloced_burst() result is undefined if
+ * the enqueue burst has event object of operation type !=
+ * @ref RTE_EVENT_OP_NEW_PREALLOCED.
+ *
+ * This function may only be called on event devices with the
+ * @ref RTE_EVENT_DEV_CAP_CREDIT_PREALLOCATION capability.
+ *
+ * @param dev_id
+ *   The identifier of the device.
+ * @param port_id
+ *   The identifier of the event port.
+ * @param ev
+ *   Points to an array of *nb_events* objects of type *rte_event* structure
+ *   which contain the event object enqueue operations to be processed.
+ * @param nb_events
+ *   The number of event objects to enqueue, typically number of
+ *   rte_event_port_attr_get(...RTE_EVENT_PORT_ATTR_ENQ_DEPTH...)
+ *   available for this port.
+ *
+ * @return
+ *   The number of event objects actually enqueued on the event device. The
+ *   return value can be less than the value of the *nb_events* parameter when
+ *   the event devices queue is full or if invalid parameters are specified in 
a
+ *   *rte_event*. If the return value is less than *nb_events*, the remaining
+ *   events at the end of ev[] are not consumed and the caller has to take care
+ *   of them, and rte_errno is set accordingly. Possible errno values include:
+ *   - EINVAL   The port ID is invalid, device ID is invalid, an event's queue
+ *  ID is invalid, or an event's sched type doesn't match the
+ *  capabilities of the destination queue.
+ *   - ENOSPC   The event port was backpressured and unable to enqueue
+ *  one or more events. This error code is only applicable to
+ *  closed systems.
+ * @see rte_event_port_attr_get(), RTE_EVENT_PORT_ATTR_ENQ_DEPTH
+ * @see rte_event_enqueue_burst()
+ */
+static inline uint16_t
+rte_event_enqueue_new_prealloced_burst(uint8_t dev_id, uint8_t port_id,
+

[RFC 4/4] event/dsw: implement enqueue optimized for prealloced events

2025-06-29 Thread Mattias Rönnblom
Implement rte_event_enqueue_new_prealloced_burst() in DSW.

Signed-off-by: Mattias Rönnblom 
---
 drivers/event/dsw/dsw_evdev.c |  1 +
 drivers/event/dsw/dsw_evdev.h |  3 +++
 drivers/event/dsw/dsw_event.c | 18 ++
 3 files changed, 22 insertions(+)

diff --git a/drivers/event/dsw/dsw_evdev.c b/drivers/event/dsw/dsw_evdev.c
index ecc1d947dd..139f57b5f4 100644
--- a/drivers/event/dsw/dsw_evdev.c
+++ b/drivers/event/dsw/dsw_evdev.c
@@ -456,6 +456,7 @@ dsw_probe(struct rte_vdev_device *vdev)
dev->dev_ops = &dsw_evdev_ops;
dev->enqueue_burst = dsw_event_enqueue_burst;
dev->enqueue_new_burst = dsw_event_enqueue_new_burst;
+   dev->enqueue_new_prealloced_burst = 
dsw_event_enqueue_new_prealloced_burst;
dev->enqueue_forward_burst = dsw_event_enqueue_forward_burst;
dev->dequeue_burst = dsw_event_dequeue_burst;
dev->maintain = dsw_event_maintain;
diff --git a/drivers/event/dsw/dsw_evdev.h b/drivers/event/dsw/dsw_evdev.h
index c026b0a135..5c5699c64f 100644
--- a/drivers/event/dsw/dsw_evdev.h
+++ b/drivers/event/dsw/dsw_evdev.h
@@ -277,6 +277,9 @@ uint16_t dsw_event_enqueue_burst(void *port,
 uint16_t dsw_event_enqueue_new_burst(void *port,
 const struct rte_event events[],
 uint16_t events_len);
+uint16_t dsw_event_enqueue_new_prealloced_burst(void *port,
+   const struct rte_event events[],
+   uint16_t events_len);
 uint16_t dsw_event_enqueue_forward_burst(void *port,
 const struct rte_event events[],
 uint16_t events_len);
diff --git a/drivers/event/dsw/dsw_event.c b/drivers/event/dsw/dsw_event.c
index 09f353b324..b9529bd5d5 100644
--- a/drivers/event/dsw/dsw_event.c
+++ b/drivers/event/dsw/dsw_event.c
@@ -1459,6 +1459,21 @@ dsw_event_enqueue_new_burst(void *port, const struct 
rte_event events[],
   0, 0, 0);
 }
 
+uint16_t
+dsw_event_enqueue_new_prealloced_burst(void *port,
+  const struct rte_event events[],
+  uint16_t events_len)
+{
+   struct dsw_port *source_port = port;
+
+   if (unlikely(events_len > source_port->enqueue_depth))
+   events_len = source_port->enqueue_depth;
+
+   return dsw_event_enqueue_burst_generic(source_port, events,
+  events_len, true, 0, events_len,
+  0, 0);
+}
+
 uint16_t
 dsw_event_enqueue_forward_burst(void *port, const struct rte_event events[],
uint16_t events_len)
@@ -1630,6 +1645,9 @@ int dsw_event_credit_alloc(void *port, unsigned int 
new_event_threshold,
struct dsw_evdev *dsw = source_port->dsw;
bool enough_credits;
 
+   if (new_event_threshold == 0)
+   new_event_threshold = source_port->new_event_threshold;
+
if (dsw_should_backpressure(dsw, new_event_threshold))
return 0;
 
-- 
2.43.0



[RFC 1/4] eventdev: add support for credit preallocation

2025-06-29 Thread Mattias Rönnblom
Optionally split the enqueue operation for new events into two steps;
allocating a "slot" for the event in the event device, and the actual
enqueue operation.

Pre-allocating credits reduces the risk of enqueue failures (i.e.,
backpressure) for new events. This is useful for applications
performing expensive or effectively irreversible processing before the
enqueue operation. In such a scenario, efficiency may be improved and
code complexity reduced, in case the application can know ahead of
time, with some certainty, that the enqueue operation will succeed.

A new function rte_event_credit_alloc() is used to allocate credits.
A new function rte_event_credit_free() may be used, in case the
application decides to not use allocated credits.

A new operation type RTE_EVENT_OP_NEW_PREALLOCED is added, which is
equivalent to RTE_EVENT_OP_NEW, except that the event consumes one of the
pre-allocated credits when the event is successfully enqueued.

Signed-off-by: Mattias Rönnblom 
---
 lib/eventdev/eventdev_pmd.h  |   4 +
 lib/eventdev/eventdev_private.c  |  23 +
 lib/eventdev/eventdev_trace_points.c |   8 ++
 lib/eventdev/rte_eventdev.h  | 135 +++
 lib/eventdev/rte_eventdev_core.h |  10 ++
 lib/eventdev/rte_eventdev_trace_fp.h |  19 
 6 files changed, 199 insertions(+)

diff --git a/lib/eventdev/eventdev_pmd.h b/lib/eventdev/eventdev_pmd.h
index dda8ad82c9..84ec3ea555 100644
--- a/lib/eventdev/eventdev_pmd.h
+++ b/lib/eventdev/eventdev_pmd.h
@@ -172,6 +172,10 @@ struct __rte_cache_aligned rte_eventdev {
/**< Pointer to PMD dequeue burst function. */
event_maintain_t maintain;
/**< Pointer to PMD port maintenance function. */
+   event_credit_alloc_t credit_alloc;
+   /**< Pointer to PMD credit allocation function. */
+   event_credit_free_t credit_free;
+   /**< Pointer to PMD credit release function. */
event_tx_adapter_enqueue_t txa_enqueue_same_dest;
/**< Pointer to PMD eth Tx adapter burst enqueue function with
 * events destined to same Eth port & Tx queue.
diff --git a/lib/eventdev/eventdev_private.c b/lib/eventdev/eventdev_private.c
index dffd2c71d0..ec16125d83 100644
--- a/lib/eventdev/eventdev_private.c
+++ b/lib/eventdev/eventdev_private.c
@@ -34,6 +34,25 @@ dummy_event_maintain(__rte_unused void *port, __rte_unused 
int op)
"maintenance requested for unconfigured event device");
 }
 
+static int
+dummy_event_credit_alloc(__rte_unused void *port,
+__rte_unused unsigned int new_event_threshold,
+__rte_unused unsigned int num_credits)
+{
+   RTE_EDEV_LOG_ERR(
+   "credit allocation request for unconfigured event device");
+   return 0;
+}
+
+static int
+dummy_event_credit_free(__rte_unused void *port,
+__rte_unused unsigned int num_credits)
+{
+   RTE_EDEV_LOG_ERR(
+   "credit return request for unconfigured event device");
+   return 0;
+}
+
 static uint16_t
 dummy_event_tx_adapter_enqueue(__rte_unused void *port,
   __rte_unused struct rte_event ev[],
@@ -118,6 +137,8 @@ event_dev_fp_ops_reset(struct rte_event_fp_ops *fp_op)
.enqueue_forward_burst = dummy_event_enqueue_burst,
.dequeue_burst = dummy_event_dequeue_burst,
.maintain = dummy_event_maintain,
+   .credit_alloc = dummy_event_credit_alloc,
+   .credit_free = dummy_event_credit_free,
.txa_enqueue = dummy_event_tx_adapter_enqueue,
.txa_enqueue_same_dest = 
dummy_event_tx_adapter_enqueue_same_dest,
.ca_enqueue = dummy_event_crypto_adapter_enqueue,
@@ -141,6 +162,8 @@ event_dev_fp_ops_set(struct rte_event_fp_ops *fp_op,
fp_op->enqueue_forward_burst = dev->enqueue_forward_burst;
fp_op->dequeue_burst = dev->dequeue_burst;
fp_op->maintain = dev->maintain;
+   fp_op->credit_alloc = dev->credit_alloc;
+   fp_op->credit_free = dev->credit_free;
fp_op->txa_enqueue = dev->txa_enqueue;
fp_op->txa_enqueue_same_dest = dev->txa_enqueue_same_dest;
fp_op->ca_enqueue = dev->ca_enqueue;
diff --git a/lib/eventdev/eventdev_trace_points.c 
b/lib/eventdev/eventdev_trace_points.c
index ade6723b7b..c563f5cab1 100644
--- a/lib/eventdev/eventdev_trace_points.c
+++ b/lib/eventdev/eventdev_trace_points.c
@@ -50,6 +50,14 @@ RTE_EXPORT_SYMBOL(__rte_eventdev_trace_maintain)
 RTE_TRACE_POINT_REGISTER(rte_eventdev_trace_maintain,
lib.eventdev.maintain)
 
+RTE_EXPORT_SYMBOL(__rte_eventdev_trace_credit_alloc)
+RTE_TRACE_POINT_REGISTER(rte_eventdev_trace_credit_alloc,
+   lib.eventdev.credit_alloc)
+
+RTE_EXPORT_SYMBOL(__rte_eventdev_trace_credit_free)
+RTE_TRACE_POINT_REGISTER(rte_eventdev_trace_credit_free,
+   lib.eventdev.credit_free)
+
 RTE_EXPORT_EXPERIMENTAL_SYMBOL(__rte_eventdev_trace_port_profile_switch, 23.11)
 RTE_TRACE_POINT_REGIST

[RFC 0/4] Add support for event credit preallocation

2025-06-29 Thread Mattias Rönnblom
Events of type RTE_EVENT_OP_NEW are often generated as a result of
some stimuli from the world outside the event machine. Examples of
such input can be a timeout in an application-managed timer wheel, a
control plane message on a lockless ring, an incoming packet
triggering the release of buffered packets, or a descriptor arriving
on some hardware queue.

In non-event-triggered cases, the external-trigger-to-eventdev-event
mechanism serves the same role as various Eventdev adapters, but for
input that does not have native Eventdev support.

The actual RTE_EVENT_OP_NEW event enqueue is often preceded by some
processing. Such processing may be expensive or effectively
irreversible. In addition, if the enqueue is likely to fail, there is
no point in even polling the external source for new input.

In such a scenario, efficiency could potentially be improved and code
complexity reduced if the application could know ahead of time, with
some certainty, that the enqueue operation will succeed.

Event devices have a mechanism that puts an upper bound on the number
of in-flight (buffered) events. In many cases (e.g., DLB, DSW, and SW)
this takes the form of a credit system: a new event consumes a credit,
which is returned to the credit pool when the event is released. In
the current Eventdev API, all this happens "under the hood" and is not
visible to the application.

This patchset optionally splits the enqueue operation into two steps:
1) rte_event_credit_alloc() to allocate "slots" for the events, in the
   form of credits. One credit grants the application the right to
   enqueue one event of the type RTE_EVENT_OP_NEW_PREALLOCED.
2) The actual enqueue operation, with the rte_event.op set to
   RTE_EVENT_OP_NEW_PREALLOCED.

The new operation type RTE_EVENT_OP_NEW_PREALLOCED is identical to
RTE_EVENT_OP_NEW, with the only exception that credit allocation
(either conceptually or literally) has already been successfully
completed.

Whether or not a credit allocation will succeed depends on the
new_event_threshold of the request. In the current Eventdev API,
new_event_threshold is strictly a port-level configuration. Beyond
simply allocating a credit, this patchset also adds flexibility by
making the new_event_threshold a per-allocation property.

Control over new_event_threshold is very important to tune system
behavior at overload.

In the general case, failure to allocate a credit is only one reason
an enqueue operation may fail. API semantics-wise, the possession of a
credit does not guarantee that the subsequent enqueue operation will
succeed. Certain event device implementations may come with stronger
guarantees.

If the application decides not to (or fails to) spend its credits
enqueuing RTE_EVENT_OP_NEW_PREALLOCED events, it may return them using
the new rte_event_credit_free() function.

For performance and API consistency reasons, a new preallocation-optimized
enqueue function rte_event_enqueue_prealloced_burst() is added.

To allow the application to query if the credit management and the new
enqueue function are available on a particular event device, a new
capability RTE_EVENT_DEV_CAP_CREDIT_PREALLOCATION is added.
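
To make the intended call flow concrete, here is a minimal usage sketch.
It is not part of the patches; the prototypes, the return-value convention
(the sketch assumes rte_event_credit_alloc() returns the number of credits
actually granted) and the helper poll_external_source() are assumptions
for illustration only:

static void
producer_poll(uint8_t dev_id, uint8_t port_id, unsigned int threshold)
{
	struct rte_event events[32];
	unsigned int credits;
	uint16_t i, n, enq;

	/* Reserve room for up to 32 new events, subject to a
	 * per-allocation new_event_threshold.
	 */
	credits = rte_event_credit_alloc(dev_id, port_id, threshold,
					 RTE_DIM(events));
	if (credits == 0)
		return; /* close to the threshold - skip polling the source */

	/* poll_external_source() is an application-defined stand-in. */
	n = poll_external_source(events, credits);

	for (i = 0; i < n; i++)
		events[i].op = RTE_EVENT_OP_NEW_PREALLOCED;

	enq = rte_event_enqueue_prealloced_burst(dev_id, port_id, events, n);

	/* Hand back whatever was reserved but not actually enqueued. */
	if (credits > enq)
		rte_event_credit_free(dev_id, port_id, credits - enq);
}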

Mattias Rönnblom (4):
  eventdev: add support for credit preallocation
  event/dsw: add support for credit preallocation
  eventdev: add enqueue optimized for prealloced events
  event/dsw: implement enqueue optimized for prealloced events

 drivers/event/dsw/dsw_evdev.c|   6 +-
 drivers/event/dsw/dsw_evdev.h|   9 ++
 drivers/event/dsw/dsw_event.c|  86 ++--
 drivers/event/dsw/dsw_xstats.c   |   3 +
 lib/eventdev/eventdev_pmd.h  |   6 +
 lib/eventdev/eventdev_private.c  |  24 
 lib/eventdev/eventdev_trace_points.c |   8 ++
 lib/eventdev/rte_eventdev.h  | 193 +++
 lib/eventdev/rte_eventdev_core.h |  12 ++
 lib/eventdev/rte_eventdev_trace_fp.h |  19 +++
 10 files changed, 354 insertions(+), 12 deletions(-)

-- 
2.43.0



[PATCH v4 3/5] net/mlx5: allocate and release unique resources for Tx queues

2025-06-29 Thread Bing Zhao
If the unique umem and MR method is enabled, the memory will be
pre-allocated and the MR registered before the Tx queues are started
in the device start stage, for later use by the Tx queues.

Signed-off-by: Bing Zhao 
---
 drivers/net/mlx5/mlx5.h |  4 ++
 drivers/net/mlx5/mlx5_trigger.c | 91 +
 2 files changed, 95 insertions(+)

diff --git a/drivers/net/mlx5/mlx5.h b/drivers/net/mlx5/mlx5.h
index 285c9ba396..c08894cd03 100644
--- a/drivers/net/mlx5/mlx5.h
+++ b/drivers/net/mlx5/mlx5.h
@@ -2141,6 +2141,10 @@ struct mlx5_priv {
struct {
uint32_t sq_total_size;
uint32_t cq_total_size;
+   void *umem;
+   void *umem_obj;
+   uint32_t sq_cur_off;
+   uint32_t cq_cur_off;
} consec_tx_mem;
RTE_ATOMIC(uint16_t) shared_refcnt; /* HW steering host reference 
counter. */
 };
diff --git a/drivers/net/mlx5/mlx5_trigger.c b/drivers/net/mlx5/mlx5_trigger.c
index 3aa7d01ee2..00ffb39ecb 100644
--- a/drivers/net/mlx5/mlx5_trigger.c
+++ b/drivers/net/mlx5/mlx5_trigger.c
@@ -1135,6 +1135,89 @@ mlx5_hw_representor_port_allowed_start(struct 
rte_eth_dev *dev)
 
 #endif
 
+/*
+ * Allocate TxQs unique umem and register its MR.
+ *
+ * @param dev
+ *   Pointer to Ethernet device structure.
+ *
+ * @return
+ *   0 on success, a negative errno value otherwise and rte_errno is set.
+ */
+static int mlx5_dev_allocate_consec_tx_mem(struct rte_eth_dev *dev)
+{
+   struct mlx5_priv *priv = dev->data->dev_private;
+   size_t alignment;
+   uint32_t total_size;
+   struct mlx5dv_devx_umem *umem_obj = NULL;
+   void *umem_buf = NULL;
+
+   /* Legacy per queue allocation, do nothing here. */
+   if (priv->sh->config.txq_mem_algn == 0)
+   return 0;
+   alignment = (size_t)(1U << priv->sh->config.txq_mem_algn);
+   total_size = priv->consec_tx_mem.sq_total_size + 
priv->consec_tx_mem.cq_total_size;
+   /*
+* Hairpin queues can be skipped later.
+* The queue size alignment is bigger than the doorbell alignment,
+* so there is no need to align or round up again.
+* One queue has two DBs (for CQ + WQ).
+*/
+   total_size += MLX5_DBR_SIZE * priv->txqs_n * 2;
+   umem_buf = mlx5_malloc_numa_tolerant(MLX5_MEM_RTE | MLX5_MEM_ZERO, 
total_size,
+alignment, priv->sh->numa_node);
+   if (!umem_buf) {
+   DRV_LOG(ERR, "Failed to allocate consecutive memory for TxQs.");
+   rte_errno = ENOMEM;
+   return -rte_errno;
+   }
+   umem_obj = mlx5_os_umem_reg(priv->sh->cdev->ctx, (void 
*)(uintptr_t)umem_buf,
+   total_size, IBV_ACCESS_LOCAL_WRITE);
+   if (!umem_obj) {
+   DRV_LOG(ERR, "Failed to register unique umem for all SQs.");
+   rte_errno = errno;
+   if (umem_buf)
+   mlx5_free(umem_buf);
+   return -rte_errno;
+   }
+   priv->consec_tx_mem.umem = umem_buf;
+   priv->consec_tx_mem.sq_cur_off = 0;
+   priv->consec_tx_mem.cq_cur_off = priv->consec_tx_mem.sq_total_size;
+   priv->consec_tx_mem.umem_obj = umem_obj;
+   DRV_LOG(DEBUG, "Allocated umem %p with size %u for %u queues with 
sq_len %u,"
+   " cq_len %u and registered object %p on port %u",
+   umem_buf, total_size, priv->txqs_n, 
priv->consec_tx_mem.sq_total_size,
+   priv->consec_tx_mem.cq_total_size, (void *)umem_obj, 
dev->data->port_id);
+   return 0;
+}
+
+/*
+ * Release TxQs unique umem and deregister its MR.
+ *
+ * @param dev
+ *   Pointer to Ethernet device structure.
+ * @param on_stop
+ *   If this is on device stop stage.
+ */
+static void mlx5_dev_free_consec_tx_mem(struct rte_eth_dev *dev, bool on_stop)
+{
+   struct mlx5_priv *priv = dev->data->dev_private;
+
+   if (priv->consec_tx_mem.umem_obj) {
+   mlx5_os_umem_dereg(priv->consec_tx_mem.umem_obj);
+   priv->consec_tx_mem.umem_obj = NULL;
+   }
+   if (priv->consec_tx_mem.umem) {
+   mlx5_free(priv->consec_tx_mem.umem);
+   priv->consec_tx_mem.umem = NULL;
+   }
+   /* Queues information will not be reset. */
+   if (on_stop) {
+   /* Reset to 0s for re-setting up queues. */
+   priv->consec_tx_mem.sq_cur_off = 0;
+   priv->consec_tx_mem.cq_cur_off = 0;
+   }
+}
+
 /**
  * DPDK callback to start the device.
  *
@@ -1225,6 +1308,12 @@ mlx5_dev_start(struct rte_eth_dev *dev)
if (ret)
goto error;
}
+   ret = mlx5_dev_allocate_consec_tx_mem(dev);
+   if (ret) {
+   DRV_LOG(ERR, "port %u Tx queues memory allocation failed: %s",
+   dev->data->port_id, strerror(rte_errno));
+   goto error;
+   }
ret = mlx5_txq_start(dev);
if (ret) {
   

[PATCH v4 1/5] net/mlx5: add new devarg for Tx queue consecutive memory

2025-06-29 Thread Bing Zhao
With this commit, a new device argument is introduced to control
the memory allocation for Tx queues.

By default, when no value is specified, a default alignment equal to
the system page size will be used. All SQ / CQ memory of the Tx queues
will be allocated once, and a single umem & MR will be used.

When set to 0, the legacy per-queue umem allocation will be selected,
as implemented in a following commit.

If the value is smaller than the system page size, the starting
address alignment will be rounded up to the page size.

The value is a base-2 logarithm. Refer to the rst file change for
more details.
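
For illustration, with this devarg the alignment can be selected at
startup without rebuilding the application; the PCI address and value
below are made-up examples:

    dpdk-testpmd -a 0000:08:00.0,txq_mem_algn=16 -- -i

Here 16 requests a 64 KiB (2^16 bytes) starting address alignment,
while txq_mem_algn=0 keeps the legacy per-queue umem allocation.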

Signed-off-by: Bing Zhao 
---
 doc/guides/nics/mlx5.rst | 25 +
 drivers/net/mlx5/mlx5.c  | 36 
 drivers/net/mlx5/mlx5.h  |  7 ---
 3 files changed, 65 insertions(+), 3 deletions(-)

diff --git a/doc/guides/nics/mlx5.rst b/doc/guides/nics/mlx5.rst
index c1dcb9ca68..13e46970ab 100644
--- a/doc/guides/nics/mlx5.rst
+++ b/doc/guides/nics/mlx5.rst
@@ -1682,6 +1682,31 @@ for an additional list of options shared with other mlx5 
drivers.
 
   By default, the PMD will set this value to 1.
 
+- ``txq_mem_algn`` parameter [int]
+
+  A logarithm base 2 value for the memory starting address alignment
+  for Tx queues' WQ and associated CQ.
+
+  Different CPU architectures and generations may have different cache systems.
+  The memory access order may impact the cache miss rate on different CPUs.
+  This devarg gives the ability to control the umem alignment for all TxQs 
without
+  rebuilding the application binary.
+
+  The performance can be tuned by specifying this devarg after benchmark 
testing
+  on a specific system and hardware.
+
+  By default, ``txq_mem_algn`` is set to log2(4K), or log2(64K) on some 
specific OS
+  distributions - based on the system page size configuration.
+  All Tx queues will use a unique memory region and umem area. Each TxQ will
+  start at an address right after the previous one, except the 1st queue,
+  which will be aligned on the address boundary controlled by this devarg.
+
+  If the value is less than the page size, it will be rounded up.
+  If it is bigger than the maximal queue size, a warning message will appear
+  and there will be some waste of memory at the beginning.
+
+  0 indicates legacy per queue memory allocation and separate Memory Regions 
(MR).
+
 
 Multiport E-Switch
 --
diff --git a/drivers/net/mlx5/mlx5.c b/drivers/net/mlx5/mlx5.c
index 1bad8a9e90..a364e9e421 100644
--- a/drivers/net/mlx5/mlx5.c
+++ b/drivers/net/mlx5/mlx5.c
@@ -185,6 +185,14 @@
 /* Device parameter to control representor matching in ingress/egress flows 
with HWS. */
 #define MLX5_REPR_MATCHING_EN "repr_matching_en"
 
+/*
+ * Alignment of the Tx queue starting address.
+ * If not set, a separate umem and MR is used for each TxQ.
+ * If set, a consecutive memory address range and a single MR are used for
+ * all Tx queues, and each TxQ will start at the alignment specified.
+ */
+#define MLX5_TXQ_MEM_ALGN "txq_mem_algn"
+
 /* Shared memory between primary and secondary processes. */
 struct mlx5_shared_data *mlx5_shared_data;
 
@@ -1447,6 +1455,8 @@ mlx5_dev_args_check_handler(const char *key, const char 
*val, void *opaque)
config->cnt_svc.cycle_time = tmp;
} else if (strcmp(MLX5_REPR_MATCHING_EN, key) == 0) {
config->repr_matching = !!tmp;
+   } else if (strcmp(MLX5_TXQ_MEM_ALGN, key) == 0) {
+   config->txq_mem_algn = (uint32_t)tmp;
}
return 0;
 }
@@ -1486,9 +1496,17 @@ mlx5_shared_dev_ctx_args_config(struct 
mlx5_dev_ctx_shared *sh,
MLX5_HWS_CNT_SERVICE_CORE,
MLX5_HWS_CNT_CYCLE_TIME,
MLX5_REPR_MATCHING_EN,
+   MLX5_TXQ_MEM_ALGN,
NULL,
};
int ret = 0;
+   size_t alignment = rte_mem_page_size();
+   uint32_t max_queue_umem_size = MLX5_WQE_SIZE * 
mlx5_dev_get_max_wq_size(sh);
+
+   if (alignment == (size_t)-1) {
+   alignment = (1 << MLX5_LOG_PAGE_SIZE);
+   DRV_LOG(WARNING, "Failed to get page_size, using default %zu 
size.", alignment);
+   }
 
/* Default configuration. */
memset(config, 0, sizeof(*config));
@@ -1501,6 +1519,7 @@ mlx5_shared_dev_ctx_args_config(struct 
mlx5_dev_ctx_shared *sh,
config->cnt_svc.cycle_time = MLX5_CNT_SVC_CYCLE_TIME_DEFAULT;
config->cnt_svc.service_core = rte_get_main_lcore();
config->repr_matching = 1;
+   config->txq_mem_algn = log2above(alignment);
if (mkvlist != NULL) {
/* Process parameters. */
ret = mlx5_kvargs_process(mkvlist, params,
@@ -1567,6 +1586,16 @@ mlx5_shared_dev_ctx_args_config(struct 
mlx5_dev_ctx_shared *sh,
config->hw_fcs_strip = 0;
else
config->hw_fcs_strip = sh->dev_cap.hw_fcs_strip;
+   if (config->txq_mem_algn != 0 && config->txq_mem_a

[PATCH v4 5/5] net/mlx5: use consecutive memory for Tx queue creation

2025-06-29 Thread Bing Zhao
The queues' starting address offsets within a umem and the doorbell
offsets are already passed to the DevX object creation function.

When the queue length is not zero, it means that the memory was
pre-allocated and the new object creation with consecutive memory
should be enabled.

When destroying the SQ / CQ objects, if they are in consecutive mode,
the umem and MR should not be released; the global resources should
only be released when stopping the device.

Signed-off-by: Bing Zhao 
---
 drivers/common/mlx5/mlx5_common_devx.c | 160 +
 drivers/common/mlx5/mlx5_common_devx.h |   2 +
 2 files changed, 110 insertions(+), 52 deletions(-)

diff --git a/drivers/common/mlx5/mlx5_common_devx.c 
b/drivers/common/mlx5/mlx5_common_devx.c
index aace5283e7..e237558ec2 100644
--- a/drivers/common/mlx5/mlx5_common_devx.c
+++ b/drivers/common/mlx5/mlx5_common_devx.c
@@ -30,6 +30,8 @@ mlx5_devx_cq_destroy(struct mlx5_devx_cq *cq)
 {
if (cq->cq)
claim_zero(mlx5_devx_cmd_destroy(cq->cq));
+   if (cq->consec)
+   return;
if (cq->umem_obj)
claim_zero(mlx5_os_umem_dereg(cq->umem_obj));
if (cq->umem_buf)
@@ -93,6 +95,7 @@ mlx5_devx_cq_create(void *ctx, struct mlx5_devx_cq *cq_obj, 
uint16_t log_desc_n,
uint32_t eqn;
uint32_t num_of_cqes = RTE_BIT32(log_desc_n);
int ret;
+   uint32_t umem_offset, umem_id;
 
if (page_size == (size_t)-1 || alignment == (size_t)-1) {
DRV_LOG(ERR, "Failed to get page_size.");
@@ -108,29 +111,44 @@ mlx5_devx_cq_create(void *ctx, struct mlx5_devx_cq 
*cq_obj, uint16_t log_desc_n,
}
/* Allocate memory buffer for CQEs and doorbell record. */
umem_size = sizeof(struct mlx5_cqe) * num_of_cqes;
-   umem_dbrec = RTE_ALIGN(umem_size, MLX5_DBR_SIZE);
-   umem_size += MLX5_DBR_SIZE;
-   umem_buf = mlx5_malloc_numa_tolerant(MLX5_MEM_RTE | MLX5_MEM_ZERO, 
umem_size,
-alignment, socket);
-   if (!umem_buf) {
-   DRV_LOG(ERR, "Failed to allocate memory for CQ.");
-   rte_errno = ENOMEM;
-   return -rte_errno;
-   }
-   /* Register allocated buffer in user space with DevX. */
-   umem_obj = mlx5_os_umem_reg(ctx, (void *)(uintptr_t)umem_buf, umem_size,
-   IBV_ACCESS_LOCAL_WRITE);
-   if (!umem_obj) {
-   DRV_LOG(ERR, "Failed to register umem for CQ.");
-   rte_errno = errno;
-   goto error;
+   if (!attr->q_len) {
+   umem_dbrec = RTE_ALIGN(umem_size, MLX5_DBR_SIZE);
+   umem_size += MLX5_DBR_SIZE;
+   umem_buf = mlx5_malloc_numa_tolerant(MLX5_MEM_RTE | 
MLX5_MEM_ZERO, umem_size,
+alignment, socket);
+   if (!umem_buf) {
+   DRV_LOG(ERR, "Failed to allocate memory for CQ.");
+   rte_errno = ENOMEM;
+   return -rte_errno;
+   }
+   /* Register allocated buffer in user space with DevX. */
+   umem_obj = mlx5_os_umem_reg(ctx, (void *)(uintptr_t)umem_buf, 
umem_size,
+   IBV_ACCESS_LOCAL_WRITE);
+   if (!umem_obj) {
+   DRV_LOG(ERR, "Failed to register umem for CQ.");
+   rte_errno = errno;
+   goto error;
+   }
+   umem_offset = 0;
+   umem_id = mlx5_os_get_umem_id(umem_obj);
+   } else {
+   if (umem_size != attr->q_len) {
+   DRV_LOG(ERR, "Mismatch between saved length and calc 
length of CQ %u-%u",
+   umem_size, attr->q_len);
+   rte_errno = EINVAL;
+   return -rte_errno;
+   }
+   umem_buf = attr->umem;
+   umem_offset = attr->q_off;
+   umem_dbrec = attr->db_off;
+   umem_id = mlx5_os_get_umem_id(attr->umem_obj);
}
/* Fill attributes for CQ object creation. */
attr->q_umem_valid = 1;
-   attr->q_umem_id = mlx5_os_get_umem_id(umem_obj);
-   attr->q_umem_offset = 0;
+   attr->q_umem_id = umem_id;
+   attr->q_umem_offset = umem_offset;
attr->db_umem_valid = 1;
-   attr->db_umem_id = attr->q_umem_id;
+   attr->db_umem_id = umem_id;
attr->db_umem_offset = umem_dbrec;
attr->eqn = eqn;
attr->log_cq_size = log_desc_n;
@@ -142,19 +160,29 @@ mlx5_devx_cq_create(void *ctx, struct mlx5_devx_cq 
*cq_obj, uint16_t log_desc_n,
rte_errno  = ENOMEM;
goto error;
}
-   cq_obj->umem_buf = umem_buf;
-   cq_obj->umem_obj = umem_obj;
+   if (!attr->q_len) {
+   cq_obj->umem_buf = umem_buf;
+   cq_obj->umem_obj = umem_obj;
+   cq

[PATCH v4 4/5] net/mlx5: pass the information in Tx queue start

2025-06-29 Thread Bing Zhao
The actual DevX objects of the SQs and CQs are only created in the
function mlx5_txq_start() in the device start stage.

By changing the 1-level iteration to 2-level iterations, the Tx
queues with larger queue depths will be set up first. This helps to
split the memory from big chunks into small chunks.

In testing, such an assignment helps to improve the performance a
little bit. All the doorbells will be grouped and padded at the end
of the umem area.

The umem object and offset information are passed to the DevX
creation function for further usage.
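
For clarity, a sketch of the resulting single-umem layout and of the
doorbell offsets used by this patch (derived from the diff below, shown
here only as an illustration):

/*
 * [ SQ rings, packed in setup order ]  from offset 0
 * [ CQ rings, packed in setup order ]  from sq_total_size
 * [ doorbells, 2 per queue          ]  from db_start
 *
 * db_start = sq_total_size + cq_total_size;
 * SQ doorbell of queue idx: db_start + (2 * idx) * MLX5_DBR_SIZE
 * CQ doorbell of queue idx: db_start + (2 * idx + 1) * MLX5_DBR_SIZE
 */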

Signed-off-by: Bing Zhao 
---
 drivers/common/mlx5/mlx5_devx_cmds.h | 10 
 drivers/net/mlx5/mlx5_devx.c | 26 -
 drivers/net/mlx5/mlx5_trigger.c  | 82 +++-
 3 files changed, 77 insertions(+), 41 deletions(-)

diff --git a/drivers/common/mlx5/mlx5_devx_cmds.h 
b/drivers/common/mlx5/mlx5_devx_cmds.h
index 6c726a0d46..f5fda02c1e 100644
--- a/drivers/common/mlx5/mlx5_devx_cmds.h
+++ b/drivers/common/mlx5/mlx5_devx_cmds.h
@@ -483,6 +483,11 @@ struct mlx5_devx_create_sq_attr {
uint32_t packet_pacing_rate_limit_index:16;
uint32_t tis_lst_sz:16;
uint32_t tis_num:24;
+   uint32_t q_off;
+   void *umem;
+   void *umem_obj;
+   uint32_t q_len;
+   uint32_t db_off;
struct mlx5_devx_wq_attr wq_attr;
 };
 
@@ -514,6 +519,11 @@ struct mlx5_devx_cq_attr {
uint64_t db_umem_offset;
uint32_t eqn;
uint64_t db_addr;
+   void *umem;
+   void *umem_obj;
+   uint32_t q_off;
+   uint32_t q_len;
+   uint32_t db_off;
 };
 
 /* Virtq attributes structure, used by VIRTQ operations. */
diff --git a/drivers/net/mlx5/mlx5_devx.c b/drivers/net/mlx5/mlx5_devx.c
index 3d49e096ef..0ee16ba4f0 100644
--- a/drivers/net/mlx5/mlx5_devx.c
+++ b/drivers/net/mlx5/mlx5_devx.c
@@ -1493,10 +1493,22 @@ mlx5_txq_create_devx_sq_resources(struct rte_eth_dev 
*dev, uint16_t idx,
mlx5_ts_format_conv(cdev->config.hca_attr.sq_ts_format),
.tis_num = mlx5_get_txq_tis_num(dev, idx),
};
+   uint32_t db_start = priv->consec_tx_mem.sq_total_size + 
priv->consec_tx_mem.cq_total_size;
+   int ret;
 
/* Create Send Queue object with DevX. */
-   return mlx5_devx_sq_create(cdev->ctx, &txq_obj->sq_obj,
-  log_desc_n, &sq_attr, priv->sh->numa_node);
+   if (priv->sh->config.txq_mem_algn) {
+   sq_attr.umem = priv->consec_tx_mem.umem;
+   sq_attr.umem_obj = priv->consec_tx_mem.umem_obj;
+   sq_attr.q_off = priv->consec_tx_mem.sq_cur_off;
+   sq_attr.db_off = db_start + (2 * idx) * MLX5_DBR_SIZE;
+   sq_attr.q_len = txq_data->sq_mem_len;
+   }
+   ret = mlx5_devx_sq_create(cdev->ctx, &txq_obj->sq_obj,
+ log_desc_n, &sq_attr, priv->sh->numa_node);
+   if (!ret && priv->sh->config.txq_mem_algn)
+   priv->consec_tx_mem.sq_cur_off += txq_data->sq_mem_len;
+   return ret;
 }
 #endif
 
@@ -1536,6 +1548,7 @@ mlx5_txq_devx_obj_new(struct rte_eth_dev *dev, uint16_t 
idx)
uint32_t cqe_n, log_desc_n;
uint32_t wqe_n, wqe_size;
int ret = 0;
+   uint32_t db_start = priv->consec_tx_mem.sq_total_size + 
priv->consec_tx_mem.cq_total_size;
 
MLX5_ASSERT(txq_data);
MLX5_ASSERT(txq_obj);
@@ -1557,6 +1570,13 @@ mlx5_txq_devx_obj_new(struct rte_eth_dev *dev, uint16_t 
idx)
rte_errno = EINVAL;
return 0;
}
+   if (priv->sh->config.txq_mem_algn) {
+   cq_attr.umem = priv->consec_tx_mem.umem;
+   cq_attr.umem_obj = priv->consec_tx_mem.umem_obj;
+   cq_attr.q_off = priv->consec_tx_mem.cq_cur_off;
+   cq_attr.db_off = db_start + (2 * idx + 1) * MLX5_DBR_SIZE;
+   cq_attr.q_len = txq_data->cq_mem_len;
+   }
/* Create completion queue object with DevX. */
ret = mlx5_devx_cq_create(sh->cdev->ctx, &txq_obj->cq_obj, log_desc_n,
  &cq_attr, priv->sh->numa_node);
@@ -1641,6 +1661,8 @@ mlx5_txq_devx_obj_new(struct rte_eth_dev *dev, uint16_t 
idx)
 #endif
txq_ctrl->uar_mmap_offset =
mlx5_os_get_devx_uar_mmap_offset(sh->tx_uar.obj);
+   if (priv->sh->config.txq_mem_algn)
+   priv->consec_tx_mem.cq_cur_off += txq_data->cq_mem_len;
ppriv->uar_table[txq_data->idx] = sh->tx_uar.bf_db;
dev->data->tx_queue_state[idx] = RTE_ETH_QUEUE_STATE_STARTED;
return 0;
diff --git a/drivers/net/mlx5/mlx5_trigger.c b/drivers/net/mlx5/mlx5_trigger.c
index 00ffb39ecb..855d7518b9 100644
--- a/drivers/net/mlx5/mlx5_trigger.c
+++ b/drivers/net/mlx5/mlx5_trigger.c
@@ -51,52 +51,56 @@ static int
 mlx5_txq_start(struct rte_eth_dev *dev)
 {
struct mlx5_priv *priv = dev->data->dev_private;
-   unsigned int i;
+   uint32_t log_max_wqe = log2

[PATCH v4 2/5] net/mlx5: calculate the memory length for all Tx queues

2025-06-29 Thread Bing Zhao
When the alignment is non-zero, it means that a single umem and MR
allocation for all Tx queues will be used.

In this commit, the total length of SQs and associated CQs will be
calculated and saved.
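
As a worked example (the queue sizes are made up; the 64-byte WQEBB
and CQE sizes are assumptions about the mlx5 hardware layout):

/*
 * For a Tx queue whose WQEBB count rounds up to 1024 and whose CQE
 * count rounds up to 64:
 *
 *   sq_mem_len = MLX5_WQE_SIZE * 1024         = 64 KiB
 *   cq_mem_len = sizeof(struct mlx5_cqe) * 64 =  4 KiB
 *
 * sq_total_size and cq_total_size are these per-queue lengths
 * accumulated over all Tx queues as they are created.
 */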

Signed-off-by: Bing Zhao 
---
 drivers/net/mlx5/mlx5.h |  4 +++
 drivers/net/mlx5/mlx5_tx.h  |  2 ++
 drivers/net/mlx5/mlx5_txq.c | 67 +++--
 3 files changed, 71 insertions(+), 2 deletions(-)

diff --git a/drivers/net/mlx5/mlx5.h b/drivers/net/mlx5/mlx5.h
index 6b8d29a2bf..285c9ba396 100644
--- a/drivers/net/mlx5/mlx5.h
+++ b/drivers/net/mlx5/mlx5.h
@@ -2138,6 +2138,10 @@ struct mlx5_priv {
struct mlx5_nta_sample_ctx *nta_sample_ctx;
 #endif
struct rte_eth_dev *shared_host; /* Host device for HW steering. */
+   struct {
+   uint32_t sq_total_size;
+   uint32_t cq_total_size;
+   } consec_tx_mem;
RTE_ATOMIC(uint16_t) shared_refcnt; /* HW steering host reference 
counter. */
 };
 
diff --git a/drivers/net/mlx5/mlx5_tx.h b/drivers/net/mlx5/mlx5_tx.h
index 55568c41b1..94f2028513 100644
--- a/drivers/net/mlx5/mlx5_tx.h
+++ b/drivers/net/mlx5/mlx5_tx.h
@@ -149,6 +149,7 @@ struct __rte_cache_aligned mlx5_txq_data {
uint16_t inlen_mode; /* Minimal data length to inline. */
uint8_t tx_aggr_affinity; /* TxQ affinity configuration. */
uint32_t qp_num_8s; /* QP number shifted by 8. */
+   uint32_t sq_mem_len; /* Length of TxQ for WQEs */
uint64_t offloads; /* Offloads for Tx Queue. */
struct mlx5_mr_ctrl mr_ctrl; /* MR control descriptor. */
struct mlx5_wqe *wqes; /* Work queue. */
@@ -167,6 +168,7 @@ struct __rte_cache_aligned mlx5_txq_data {
uint64_t ts_mask; /* Timestamp flag dynamic mask. */
uint64_t ts_last; /* Last scheduled timestamp. */
int32_t ts_offset; /* Timestamp field dynamic offset. */
+   uint32_t cq_mem_len; /* Length of TxQ for CQEs */
struct mlx5_dev_ctx_shared *sh; /* Shared context. */
struct mlx5_txq_stats stats; /* TX queue counters. */
struct mlx5_txq_stats stats_reset; /* stats on last reset. */
diff --git a/drivers/net/mlx5/mlx5_txq.c b/drivers/net/mlx5/mlx5_txq.c
index 8ee8108497..1948a700f1 100644
--- a/drivers/net/mlx5/mlx5_txq.c
+++ b/drivers/net/mlx5/mlx5_txq.c
@@ -17,6 +17,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -1032,6 +1033,57 @@ txq_adjust_params(struct mlx5_txq_ctrl *txq_ctrl)
!txq_ctrl->txq.inlen_empw);
 }
 
+/*
+ * Calculate WQ memory length for a Tx queue.
+ *
+ * @param log_wqe_cnt
+ *   Logarithm value of WQE numbers.
+ *
+ * @return
+ *   memory length of this WQ.
+ */
+static uint32_t mlx5_txq_wq_mem_length(uint32_t log_wqe_cnt)
+{
+   uint32_t num_of_wqbbs = 1U << log_wqe_cnt;
+   uint32_t umem_size;
+
+   umem_size = MLX5_WQE_SIZE * num_of_wqbbs;
+   return umem_size;
+}
+
+/*
+ * Calculate CQ memory length for a Tx queue.
+ *
+ * @param dev
+ *   Pointer to Ethernet device.
+ * @param txq_ctrl
+ *   Pointer to the TxQ control structure of the CQ.
+ *
+ * @return
+ *   memory length of this CQ.
+ */
+static uint32_t
+mlx5_txq_cq_mem_length(struct rte_eth_dev *dev, struct mlx5_txq_ctrl *txq_ctrl)
+{
+   uint32_t cqe_n, log_desc_n;
+
+   if (__rte_trace_point_fp_is_enabled() &&
+   txq_ctrl->txq.offloads & RTE_ETH_TX_OFFLOAD_SEND_ON_TIMESTAMP)
+   cqe_n = UINT16_MAX / 2 - 1;
+   else
+   cqe_n = (1UL << txq_ctrl->txq.elts_n) / MLX5_TX_COMP_THRESH +
+   1 + MLX5_TX_COMP_THRESH_INLINE_DIV;
+   log_desc_n = log2above(cqe_n);
+   cqe_n = 1UL << log_desc_n;
+   if (cqe_n > UINT16_MAX) {
+   DRV_LOG(ERR, "Port %u Tx queue %u requests to many CQEs %u.",
+   dev->data->port_id, txq_ctrl->txq.idx, cqe_n);
+   rte_errno = EINVAL;
+   return 0;
+   }
+   return sizeof(struct mlx5_cqe) * cqe_n;
+}
+
 /**
  * Create a DPDK Tx queue.
  *
@@ -1057,6 +1109,7 @@ mlx5_txq_new(struct rte_eth_dev *dev, uint16_t idx, 
uint16_t desc,
struct mlx5_priv *priv = dev->data->dev_private;
struct mlx5_txq_ctrl *tmpl;
uint16_t max_wqe;
+   uint32_t wqebb_cnt, log_desc_n;
 
if (socket != (unsigned int)SOCKET_ID_ANY) {
tmpl = mlx5_malloc(MLX5_MEM_RTE | MLX5_MEM_ZERO, sizeof(*tmpl) +
@@ -1099,15 +1152,25 @@ mlx5_txq_new(struct rte_eth_dev *dev, uint16_t idx, 
uint16_t desc,
tmpl->txq.idx = idx;
txq_set_params(tmpl);
txq_adjust_params(tmpl);
+   wqebb_cnt = txq_calc_wqebb_cnt(tmpl);
max_wqe = mlx5_dev_get_max_wq_size(priv->sh);
-   if (txq_calc_wqebb_cnt(tmpl) > max_wqe) {
+   if (wqebb_cnt > max_wqe) {
DRV_LOG(ERR,
"port %u Tx WQEBB count (%d) exceeds the limit (%d),"
" try smaller queue size",
-   dev->data->port_id, txq_calc_wqebb

[PATCH v4 0/5] Use consecutive Tx queues' memory

2025-06-29 Thread Bing Zhao
This patchset moves all the mlx5 Tx queues' memory to one
consecutive memory area. All the WQEBBs will be allocated based
on offsets within this memory area.

---
v2:
  1. add a new fix for legacy code of WQE calculation
  2. fix the style
v3:
  1. change the devarg and add description.
  2. reorganize the code with different commits order.
v4:
  1. fix building failure on Windows and OSes with different compilers
  2. update the rst
  3. addressing comments and fix bugs
---

Bing Zhao (5):
  net/mlx5: add new devarg for Tx queue consecutive memory
  net/mlx5: calculate the memory length for all Tx queues
  net/mlx5: allocate and release unique resources for Tx queues
  net/mlx5: pass the information in Tx queue start
  net/mlx5: use consecutive memory for Tx queue creation

 doc/guides/nics/mlx5.rst   |  25 
 drivers/common/mlx5/mlx5_common_devx.c | 160 +++
 drivers/common/mlx5/mlx5_common_devx.h |   2 +
 drivers/common/mlx5/mlx5_devx_cmds.h   |  10 ++
 drivers/net/mlx5/mlx5.c|  36 +
 drivers/net/mlx5/mlx5.h|  15 ++-
 drivers/net/mlx5/mlx5_devx.c   |  26 +++-
 drivers/net/mlx5/mlx5_trigger.c| 173 +++--
 drivers/net/mlx5/mlx5_tx.h |   2 +
 drivers/net/mlx5/mlx5_txq.c|  67 +-
 10 files changed, 418 insertions(+), 98 deletions(-)

-- 
2.34.1



[PATCH v5 3/5] net/mlx5: allocate and release unique resources for Tx queues

2025-06-29 Thread Bing Zhao
If the unique umem and MR method is enabled, the memory will be
pre-allocated and the MR registered before the Tx queues are started
in the device start stage, for later use by the Tx queues.

Signed-off-by: Bing Zhao 
---
 drivers/net/mlx5/mlx5.h |  4 ++
 drivers/net/mlx5/mlx5_trigger.c | 91 +
 2 files changed, 95 insertions(+)

diff --git a/drivers/net/mlx5/mlx5.h b/drivers/net/mlx5/mlx5.h
index 285c9ba396..c08894cd03 100644
--- a/drivers/net/mlx5/mlx5.h
+++ b/drivers/net/mlx5/mlx5.h
@@ -2141,6 +2141,10 @@ struct mlx5_priv {
struct {
uint32_t sq_total_size;
uint32_t cq_total_size;
+   void *umem;
+   void *umem_obj;
+   uint32_t sq_cur_off;
+   uint32_t cq_cur_off;
} consec_tx_mem;
RTE_ATOMIC(uint16_t) shared_refcnt; /* HW steering host reference 
counter. */
 };
diff --git a/drivers/net/mlx5/mlx5_trigger.c b/drivers/net/mlx5/mlx5_trigger.c
index 3aa7d01ee2..00ffb39ecb 100644
--- a/drivers/net/mlx5/mlx5_trigger.c
+++ b/drivers/net/mlx5/mlx5_trigger.c
@@ -1135,6 +1135,89 @@ mlx5_hw_representor_port_allowed_start(struct 
rte_eth_dev *dev)
 
 #endif
 
+/*
+ * Allocate TxQs unique umem and register its MR.
+ *
+ * @param dev
+ *   Pointer to Ethernet device structure.
+ *
+ * @return
+ *   0 on success, a negative errno value otherwise and rte_errno is set.
+ */
+static int mlx5_dev_allocate_consec_tx_mem(struct rte_eth_dev *dev)
+{
+   struct mlx5_priv *priv = dev->data->dev_private;
+   size_t alignment;
+   uint32_t total_size;
+   struct mlx5dv_devx_umem *umem_obj = NULL;
+   void *umem_buf = NULL;
+
+   /* Legacy per queue allocation, do nothing here. */
+   if (priv->sh->config.txq_mem_algn == 0)
+   return 0;
+   alignment = (size_t)1 << priv->sh->config.txq_mem_algn;
+   total_size = priv->consec_tx_mem.sq_total_size + 
priv->consec_tx_mem.cq_total_size;
+   /*
+* Hairpin queues can be skipped later.
+* The queue size alignment is bigger than the doorbell alignment,
+* so there is no need to align or round up again.
+* One queue has two DBs (for CQ + WQ).
+*/
+   total_size += MLX5_DBR_SIZE * priv->txqs_n * 2;
+   umem_buf = mlx5_malloc_numa_tolerant(MLX5_MEM_RTE | MLX5_MEM_ZERO, 
total_size,
+alignment, priv->sh->numa_node);
+   if (!umem_buf) {
+   DRV_LOG(ERR, "Failed to allocate consecutive memory for TxQs.");
+   rte_errno = ENOMEM;
+   return -rte_errno;
+   }
+   umem_obj = mlx5_os_umem_reg(priv->sh->cdev->ctx, (void 
*)(uintptr_t)umem_buf,
+   total_size, IBV_ACCESS_LOCAL_WRITE);
+   if (!umem_obj) {
+   DRV_LOG(ERR, "Failed to register unique umem for all SQs.");
+   rte_errno = errno;
+   if (umem_buf)
+   mlx5_free(umem_buf);
+   return -rte_errno;
+   }
+   priv->consec_tx_mem.umem = umem_buf;
+   priv->consec_tx_mem.sq_cur_off = 0;
+   priv->consec_tx_mem.cq_cur_off = priv->consec_tx_mem.sq_total_size;
+   priv->consec_tx_mem.umem_obj = umem_obj;
+   DRV_LOG(DEBUG, "Allocated umem %p with size %u for %u queues with 
sq_len %u,"
+   " cq_len %u and registered object %p on port %u",
+   umem_buf, total_size, priv->txqs_n, 
priv->consec_tx_mem.sq_total_size,
+   priv->consec_tx_mem.cq_total_size, (void *)umem_obj, 
dev->data->port_id);
+   return 0;
+}
+
+/*
+ * Release TxQs unique umem and deregister its MR.
+ *
+ * @param dev
+ *   Pointer to Ethernet device structure.
+ * @param on_stop
+ *   If this is on device stop stage.
+ */
+static void mlx5_dev_free_consec_tx_mem(struct rte_eth_dev *dev, bool on_stop)
+{
+   struct mlx5_priv *priv = dev->data->dev_private;
+
+   if (priv->consec_tx_mem.umem_obj) {
+   mlx5_os_umem_dereg(priv->consec_tx_mem.umem_obj);
+   priv->consec_tx_mem.umem_obj = NULL;
+   }
+   if (priv->consec_tx_mem.umem) {
+   mlx5_free(priv->consec_tx_mem.umem);
+   priv->consec_tx_mem.umem = NULL;
+   }
+   /* Queues information will not be reset. */
+   if (on_stop) {
+   /* Reset to 0s for re-setting up queues. */
+   priv->consec_tx_mem.sq_cur_off = 0;
+   priv->consec_tx_mem.cq_cur_off = 0;
+   }
+}
+
 /**
  * DPDK callback to start the device.
  *
@@ -1225,6 +1308,12 @@ mlx5_dev_start(struct rte_eth_dev *dev)
if (ret)
goto error;
}
+   ret = mlx5_dev_allocate_consec_tx_mem(dev);
+   if (ret) {
+   DRV_LOG(ERR, "port %u Tx queues memory allocation failed: %s",
+   dev->data->port_id, strerror(rte_errno));
+   goto error;
+   }
ret = mlx5_txq_start(dev);
if (ret) {
 

[PATCH v5 5/5] net/mlx5: use consecutive memory for Tx queue creation

2025-06-29 Thread Bing Zhao
The queues' starting address offsets within a umem and the doorbell
offsets are already passed to the DevX object creation function.

When the queue length is not zero, it means that the memory was
pre-allocated and the new object creation with consecutive memory
should be enabled.

When destroying the SQ / CQ objects, if they are in consecutive mode,
the umem and MR should not be released; the global resources should
only be released when stopping the device.

Signed-off-by: Bing Zhao 
---
 drivers/common/mlx5/mlx5_common_devx.c | 160 +
 drivers/common/mlx5/mlx5_common_devx.h |   2 +
 2 files changed, 110 insertions(+), 52 deletions(-)

diff --git a/drivers/common/mlx5/mlx5_common_devx.c 
b/drivers/common/mlx5/mlx5_common_devx.c
index aace5283e7..e237558ec2 100644
--- a/drivers/common/mlx5/mlx5_common_devx.c
+++ b/drivers/common/mlx5/mlx5_common_devx.c
@@ -30,6 +30,8 @@ mlx5_devx_cq_destroy(struct mlx5_devx_cq *cq)
 {
if (cq->cq)
claim_zero(mlx5_devx_cmd_destroy(cq->cq));
+   if (cq->consec)
+   return;
if (cq->umem_obj)
claim_zero(mlx5_os_umem_dereg(cq->umem_obj));
if (cq->umem_buf)
@@ -93,6 +95,7 @@ mlx5_devx_cq_create(void *ctx, struct mlx5_devx_cq *cq_obj, 
uint16_t log_desc_n,
uint32_t eqn;
uint32_t num_of_cqes = RTE_BIT32(log_desc_n);
int ret;
+   uint32_t umem_offset, umem_id;
 
if (page_size == (size_t)-1 || alignment == (size_t)-1) {
DRV_LOG(ERR, "Failed to get page_size.");
@@ -108,29 +111,44 @@ mlx5_devx_cq_create(void *ctx, struct mlx5_devx_cq 
*cq_obj, uint16_t log_desc_n,
}
/* Allocate memory buffer for CQEs and doorbell record. */
umem_size = sizeof(struct mlx5_cqe) * num_of_cqes;
-   umem_dbrec = RTE_ALIGN(umem_size, MLX5_DBR_SIZE);
-   umem_size += MLX5_DBR_SIZE;
-   umem_buf = mlx5_malloc_numa_tolerant(MLX5_MEM_RTE | MLX5_MEM_ZERO, 
umem_size,
-alignment, socket);
-   if (!umem_buf) {
-   DRV_LOG(ERR, "Failed to allocate memory for CQ.");
-   rte_errno = ENOMEM;
-   return -rte_errno;
-   }
-   /* Register allocated buffer in user space with DevX. */
-   umem_obj = mlx5_os_umem_reg(ctx, (void *)(uintptr_t)umem_buf, umem_size,
-   IBV_ACCESS_LOCAL_WRITE);
-   if (!umem_obj) {
-   DRV_LOG(ERR, "Failed to register umem for CQ.");
-   rte_errno = errno;
-   goto error;
+   if (!attr->q_len) {
+   umem_dbrec = RTE_ALIGN(umem_size, MLX5_DBR_SIZE);
+   umem_size += MLX5_DBR_SIZE;
+   umem_buf = mlx5_malloc_numa_tolerant(MLX5_MEM_RTE | 
MLX5_MEM_ZERO, umem_size,
+alignment, socket);
+   if (!umem_buf) {
+   DRV_LOG(ERR, "Failed to allocate memory for CQ.");
+   rte_errno = ENOMEM;
+   return -rte_errno;
+   }
+   /* Register allocated buffer in user space with DevX. */
+   umem_obj = mlx5_os_umem_reg(ctx, (void *)(uintptr_t)umem_buf, 
umem_size,
+   IBV_ACCESS_LOCAL_WRITE);
+   if (!umem_obj) {
+   DRV_LOG(ERR, "Failed to register umem for CQ.");
+   rte_errno = errno;
+   goto error;
+   }
+   umem_offset = 0;
+   umem_id = mlx5_os_get_umem_id(umem_obj);
+   } else {
+   if (umem_size != attr->q_len) {
+   DRV_LOG(ERR, "Mismatch between saved length and calc 
length of CQ %u-%u",
+   umem_size, attr->q_len);
+   rte_errno = EINVAL;
+   return -rte_errno;
+   }
+   umem_buf = attr->umem;
+   umem_offset = attr->q_off;
+   umem_dbrec = attr->db_off;
+   umem_id = mlx5_os_get_umem_id(attr->umem_obj);
}
/* Fill attributes for CQ object creation. */
attr->q_umem_valid = 1;
-   attr->q_umem_id = mlx5_os_get_umem_id(umem_obj);
-   attr->q_umem_offset = 0;
+   attr->q_umem_id = umem_id;
+   attr->q_umem_offset = umem_offset;
attr->db_umem_valid = 1;
-   attr->db_umem_id = attr->q_umem_id;
+   attr->db_umem_id = umem_id;
attr->db_umem_offset = umem_dbrec;
attr->eqn = eqn;
attr->log_cq_size = log_desc_n;
@@ -142,19 +160,29 @@ mlx5_devx_cq_create(void *ctx, struct mlx5_devx_cq 
*cq_obj, uint16_t log_desc_n,
rte_errno  = ENOMEM;
goto error;
}
-   cq_obj->umem_buf = umem_buf;
-   cq_obj->umem_obj = umem_obj;
+   if (!attr->q_len) {
+   cq_obj->umem_buf = umem_buf;
+   cq_obj->umem_obj = umem_obj;
+   cq

[PATCH v5 4/5] net/mlx5: pass the information in Tx queue start

2025-06-29 Thread Bing Zhao
The actual DevX objects of the SQs and CQs are only created in the
function mlx5_txq_start() in the device start stage.

By changing the 1-level iteration to 2-level iterations, the Tx
queues with larger queue depths will be set up first. This helps to
split the memory from big chunks into small chunks.

In testing, such an assignment helps to improve the performance a
little bit. All the doorbells will be grouped and padded at the end
of the umem area.

The umem object and offset information are passed to the DevX
creation function for further usage.

Signed-off-by: Bing Zhao 
---
 drivers/common/mlx5/mlx5_devx_cmds.h | 10 
 drivers/net/mlx5/mlx5_devx.c | 26 -
 drivers/net/mlx5/mlx5_trigger.c  | 82 +++-
 3 files changed, 77 insertions(+), 41 deletions(-)

diff --git a/drivers/common/mlx5/mlx5_devx_cmds.h 
b/drivers/common/mlx5/mlx5_devx_cmds.h
index 6c726a0d46..f5fda02c1e 100644
--- a/drivers/common/mlx5/mlx5_devx_cmds.h
+++ b/drivers/common/mlx5/mlx5_devx_cmds.h
@@ -483,6 +483,11 @@ struct mlx5_devx_create_sq_attr {
uint32_t packet_pacing_rate_limit_index:16;
uint32_t tis_lst_sz:16;
uint32_t tis_num:24;
+   uint32_t q_off;
+   void *umem;
+   void *umem_obj;
+   uint32_t q_len;
+   uint32_t db_off;
struct mlx5_devx_wq_attr wq_attr;
 };
 
@@ -514,6 +519,11 @@ struct mlx5_devx_cq_attr {
uint64_t db_umem_offset;
uint32_t eqn;
uint64_t db_addr;
+   void *umem;
+   void *umem_obj;
+   uint32_t q_off;
+   uint32_t q_len;
+   uint32_t db_off;
 };
 
 /* Virtq attributes structure, used by VIRTQ operations. */
diff --git a/drivers/net/mlx5/mlx5_devx.c b/drivers/net/mlx5/mlx5_devx.c
index 3d49e096ef..0ee16ba4f0 100644
--- a/drivers/net/mlx5/mlx5_devx.c
+++ b/drivers/net/mlx5/mlx5_devx.c
@@ -1493,10 +1493,22 @@ mlx5_txq_create_devx_sq_resources(struct rte_eth_dev 
*dev, uint16_t idx,
mlx5_ts_format_conv(cdev->config.hca_attr.sq_ts_format),
.tis_num = mlx5_get_txq_tis_num(dev, idx),
};
+   uint32_t db_start = priv->consec_tx_mem.sq_total_size + 
priv->consec_tx_mem.cq_total_size;
+   int ret;
 
/* Create Send Queue object with DevX. */
-   return mlx5_devx_sq_create(cdev->ctx, &txq_obj->sq_obj,
-  log_desc_n, &sq_attr, priv->sh->numa_node);
+   if (priv->sh->config.txq_mem_algn) {
+   sq_attr.umem = priv->consec_tx_mem.umem;
+   sq_attr.umem_obj = priv->consec_tx_mem.umem_obj;
+   sq_attr.q_off = priv->consec_tx_mem.sq_cur_off;
+   sq_attr.db_off = db_start + (2 * idx) * MLX5_DBR_SIZE;
+   sq_attr.q_len = txq_data->sq_mem_len;
+   }
+   ret = mlx5_devx_sq_create(cdev->ctx, &txq_obj->sq_obj,
+ log_desc_n, &sq_attr, priv->sh->numa_node);
+   if (!ret && priv->sh->config.txq_mem_algn)
+   priv->consec_tx_mem.sq_cur_off += txq_data->sq_mem_len;
+   return ret;
 }
 #endif
 
@@ -1536,6 +1548,7 @@ mlx5_txq_devx_obj_new(struct rte_eth_dev *dev, uint16_t 
idx)
uint32_t cqe_n, log_desc_n;
uint32_t wqe_n, wqe_size;
int ret = 0;
+   uint32_t db_start = priv->consec_tx_mem.sq_total_size + 
priv->consec_tx_mem.cq_total_size;
 
MLX5_ASSERT(txq_data);
MLX5_ASSERT(txq_obj);
@@ -1557,6 +1570,13 @@ mlx5_txq_devx_obj_new(struct rte_eth_dev *dev, uint16_t 
idx)
rte_errno = EINVAL;
return 0;
}
+   if (priv->sh->config.txq_mem_algn) {
+   cq_attr.umem = priv->consec_tx_mem.umem;
+   cq_attr.umem_obj = priv->consec_tx_mem.umem_obj;
+   cq_attr.q_off = priv->consec_tx_mem.cq_cur_off;
+   cq_attr.db_off = db_start + (2 * idx + 1) * MLX5_DBR_SIZE;
+   cq_attr.q_len = txq_data->cq_mem_len;
+   }
/* Create completion queue object with DevX. */
ret = mlx5_devx_cq_create(sh->cdev->ctx, &txq_obj->cq_obj, log_desc_n,
  &cq_attr, priv->sh->numa_node);
@@ -1641,6 +1661,8 @@ mlx5_txq_devx_obj_new(struct rte_eth_dev *dev, uint16_t 
idx)
 #endif
txq_ctrl->uar_mmap_offset =
mlx5_os_get_devx_uar_mmap_offset(sh->tx_uar.obj);
+   if (priv->sh->config.txq_mem_algn)
+   priv->consec_tx_mem.cq_cur_off += txq_data->cq_mem_len;
ppriv->uar_table[txq_data->idx] = sh->tx_uar.bf_db;
dev->data->tx_queue_state[idx] = RTE_ETH_QUEUE_STATE_STARTED;
return 0;
diff --git a/drivers/net/mlx5/mlx5_trigger.c b/drivers/net/mlx5/mlx5_trigger.c
index 00ffb39ecb..855d7518b9 100644
--- a/drivers/net/mlx5/mlx5_trigger.c
+++ b/drivers/net/mlx5/mlx5_trigger.c
@@ -51,52 +51,56 @@ static int
 mlx5_txq_start(struct rte_eth_dev *dev)
 {
struct mlx5_priv *priv = dev->data->dev_private;
-   unsigned int i;
+   uint32_t log_max_wqe = log2

[PATCH v5 1/5] net/mlx5: add new devarg for Tx queue consecutive memory

2025-06-29 Thread Bing Zhao
With this commit, a new device argument is introduced to control
the memory allocation for Tx queues.

By default, when no value is specified, a default alignment equal to
the system page size will be used. All SQ / CQ memory of the Tx queues
will be allocated once, and a single umem & MR will be used.

When set to 0, the legacy per-queue umem allocation will be selected,
as implemented in a following commit.

If the value is smaller than the system page size, the starting
address alignment will be rounded up to the page size.

The value is a base-2 logarithm. Refer to the rst file change for
more details.

Signed-off-by: Bing Zhao 
---
 doc/guides/nics/mlx5.rst | 25 +
 drivers/net/mlx5/mlx5.c  | 36 
 drivers/net/mlx5/mlx5.h  |  7 ---
 3 files changed, 65 insertions(+), 3 deletions(-)

diff --git a/doc/guides/nics/mlx5.rst b/doc/guides/nics/mlx5.rst
index c1dcb9ca68..13e46970ab 100644
--- a/doc/guides/nics/mlx5.rst
+++ b/doc/guides/nics/mlx5.rst
@@ -1682,6 +1682,31 @@ for an additional list of options shared with other mlx5 
drivers.
 
   By default, the PMD will set this value to 1.
 
+- ``txq_mem_algn`` parameter [int]
+
+  A logarithm base 2 value for the memory starting address alignment
+  for Tx queues' WQ and associated CQ.
+
+  Different CPU architectures and generations may have different cache systems.
+  The memory access order may impact the cache miss rate on different CPUs.
+  This devarg gives the ability to control the umem alignment for all TxQs 
without
+  rebuilding the application binary.
+
+  The performance can be tuned by specifying this devarg after benchmark 
testing
+  on a specific system and hardware.
+
+  By default, ``txq_mem_algn`` is set to log2(4K), or log2(64K) on some 
specific OS
+  distributions - based on the system page size configuration.
+  All Tx queues will use a unique memory region and umem area. Each TxQ will
+  start at an address right after the previous one, except the 1st queue,
+  which will be aligned on the address boundary controlled by this devarg.
+
+  If the value is less than the page size, it will be rounded up.
+  If it is bigger than the maximal queue size, a warning message will appear
+  and there will be some waste of memory at the beginning.
+
+  0 indicates legacy per queue memory allocation and separate Memory Regions 
(MR).
+
 
 Multiport E-Switch
 --
diff --git a/drivers/net/mlx5/mlx5.c b/drivers/net/mlx5/mlx5.c
index 1bad8a9e90..a364e9e421 100644
--- a/drivers/net/mlx5/mlx5.c
+++ b/drivers/net/mlx5/mlx5.c
@@ -185,6 +185,14 @@
 /* Device parameter to control representor matching in ingress/egress flows 
with HWS. */
 #define MLX5_REPR_MATCHING_EN "repr_matching_en"
 
+/*
+ * Alignment of the Tx queue starting address.
+ * If not set, a separate umem and MR is used for each TxQ.
+ * If set, a consecutive memory address range and a single MR are used for
+ * all Tx queues, and each TxQ will start at the alignment specified.
+ */
+#define MLX5_TXQ_MEM_ALGN "txq_mem_algn"
+
 /* Shared memory between primary and secondary processes. */
 struct mlx5_shared_data *mlx5_shared_data;
 
@@ -1447,6 +1455,8 @@ mlx5_dev_args_check_handler(const char *key, const char 
*val, void *opaque)
config->cnt_svc.cycle_time = tmp;
} else if (strcmp(MLX5_REPR_MATCHING_EN, key) == 0) {
config->repr_matching = !!tmp;
+   } else if (strcmp(MLX5_TXQ_MEM_ALGN, key) == 0) {
+   config->txq_mem_algn = (uint32_t)tmp;
}
return 0;
 }
@@ -1486,9 +1496,17 @@ mlx5_shared_dev_ctx_args_config(struct 
mlx5_dev_ctx_shared *sh,
MLX5_HWS_CNT_SERVICE_CORE,
MLX5_HWS_CNT_CYCLE_TIME,
MLX5_REPR_MATCHING_EN,
+   MLX5_TXQ_MEM_ALGN,
NULL,
};
int ret = 0;
+   size_t alignment = rte_mem_page_size();
+   uint32_t max_queue_umem_size = MLX5_WQE_SIZE * 
mlx5_dev_get_max_wq_size(sh);
+
+   if (alignment == (size_t)-1) {
+   alignment = (1 << MLX5_LOG_PAGE_SIZE);
+   DRV_LOG(WARNING, "Failed to get page_size, using default %zu 
size.", alignment);
+   }
 
/* Default configuration. */
memset(config, 0, sizeof(*config));
@@ -1501,6 +1519,7 @@ mlx5_shared_dev_ctx_args_config(struct 
mlx5_dev_ctx_shared *sh,
config->cnt_svc.cycle_time = MLX5_CNT_SVC_CYCLE_TIME_DEFAULT;
config->cnt_svc.service_core = rte_get_main_lcore();
config->repr_matching = 1;
+   config->txq_mem_algn = log2above(alignment);
if (mkvlist != NULL) {
/* Process parameters. */
ret = mlx5_kvargs_process(mkvlist, params,
@@ -1567,6 +1586,16 @@ mlx5_shared_dev_ctx_args_config(struct 
mlx5_dev_ctx_shared *sh,
config->hw_fcs_strip = 0;
else
config->hw_fcs_strip = sh->dev_cap.hw_fcs_strip;
+   if (config->txq_mem_algn != 0 && config->txq_mem_a

[PATCH v5 2/5] net/mlx5: calculate the memory length for all Tx queues

2025-06-29 Thread Bing Zhao
When the alignment is non-zero, it means that a single umem and MR
allocation for all Tx queues will be used.

In this commit, the total length of SQs and associated CQs will be
calculated and saved.

Signed-off-by: Bing Zhao 
---
 drivers/net/mlx5/mlx5.h |  4 +++
 drivers/net/mlx5/mlx5_tx.h  |  2 ++
 drivers/net/mlx5/mlx5_txq.c | 67 +++--
 3 files changed, 71 insertions(+), 2 deletions(-)

diff --git a/drivers/net/mlx5/mlx5.h b/drivers/net/mlx5/mlx5.h
index 6b8d29a2bf..285c9ba396 100644
--- a/drivers/net/mlx5/mlx5.h
+++ b/drivers/net/mlx5/mlx5.h
@@ -2138,6 +2138,10 @@ struct mlx5_priv {
struct mlx5_nta_sample_ctx *nta_sample_ctx;
 #endif
struct rte_eth_dev *shared_host; /* Host device for HW steering. */
+   struct {
+   uint32_t sq_total_size;
+   uint32_t cq_total_size;
+   } consec_tx_mem;
RTE_ATOMIC(uint16_t) shared_refcnt; /* HW steering host reference 
counter. */
 };
 
diff --git a/drivers/net/mlx5/mlx5_tx.h b/drivers/net/mlx5/mlx5_tx.h
index 55568c41b1..94f2028513 100644
--- a/drivers/net/mlx5/mlx5_tx.h
+++ b/drivers/net/mlx5/mlx5_tx.h
@@ -149,6 +149,7 @@ struct __rte_cache_aligned mlx5_txq_data {
uint16_t inlen_mode; /* Minimal data length to inline. */
uint8_t tx_aggr_affinity; /* TxQ affinity configuration. */
uint32_t qp_num_8s; /* QP number shifted by 8. */
+   uint32_t sq_mem_len; /* Length of TxQ for WQEs */
uint64_t offloads; /* Offloads for Tx Queue. */
struct mlx5_mr_ctrl mr_ctrl; /* MR control descriptor. */
struct mlx5_wqe *wqes; /* Work queue. */
@@ -167,6 +168,7 @@ struct __rte_cache_aligned mlx5_txq_data {
uint64_t ts_mask; /* Timestamp flag dynamic mask. */
uint64_t ts_last; /* Last scheduled timestamp. */
int32_t ts_offset; /* Timestamp field dynamic offset. */
+   uint32_t cq_mem_len; /* Length of TxQ for CQEs */
struct mlx5_dev_ctx_shared *sh; /* Shared context. */
struct mlx5_txq_stats stats; /* TX queue counters. */
struct mlx5_txq_stats stats_reset; /* stats on last reset. */
diff --git a/drivers/net/mlx5/mlx5_txq.c b/drivers/net/mlx5/mlx5_txq.c
index 8ee8108497..1948a700f1 100644
--- a/drivers/net/mlx5/mlx5_txq.c
+++ b/drivers/net/mlx5/mlx5_txq.c
@@ -17,6 +17,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -1032,6 +1033,57 @@ txq_adjust_params(struct mlx5_txq_ctrl *txq_ctrl)
!txq_ctrl->txq.inlen_empw);
 }
 
+/*
+ * Calculate WQ memory length for a Tx queue.
+ *
+ * @param log_wqe_cnt
+ *   Logarithm value of WQE numbers.
+ *
+ * @return
+ *   memory length of this WQ.
+ */
+static uint32_t mlx5_txq_wq_mem_length(uint32_t log_wqe_cnt)
+{
+   uint32_t num_of_wqbbs = 1U << log_wqe_cnt;
+   uint32_t umem_size;
+
+   umem_size = MLX5_WQE_SIZE * num_of_wqbbs;
+   return umem_size;
+}
+
+/*
+ * Calculate CQ memory length for a Tx queue.
+ *
+ * @param dev
+ *   Pointer to Ethernet device.
+ * @param txq_ctrl
+ *   Pointer to the TxQ control structure of the CQ.
+ *
+ * @return
+ *   memory length of this CQ.
+ */
+static uint32_t
+mlx5_txq_cq_mem_length(struct rte_eth_dev *dev, struct mlx5_txq_ctrl *txq_ctrl)
+{
+   uint32_t cqe_n, log_desc_n;
+
+   if (__rte_trace_point_fp_is_enabled() &&
+   txq_ctrl->txq.offloads & RTE_ETH_TX_OFFLOAD_SEND_ON_TIMESTAMP)
+   cqe_n = UINT16_MAX / 2 - 1;
+   else
+   cqe_n = (1UL << txq_ctrl->txq.elts_n) / MLX5_TX_COMP_THRESH +
+   1 + MLX5_TX_COMP_THRESH_INLINE_DIV;
+   log_desc_n = log2above(cqe_n);
+   cqe_n = 1UL << log_desc_n;
+   if (cqe_n > UINT16_MAX) {
+   DRV_LOG(ERR, "Port %u Tx queue %u requests to many CQEs %u.",
+   dev->data->port_id, txq_ctrl->txq.idx, cqe_n);
+   rte_errno = EINVAL;
+   return 0;
+   }
+   return sizeof(struct mlx5_cqe) * cqe_n;
+}
+
 /**
  * Create a DPDK Tx queue.
  *
@@ -1057,6 +1109,7 @@ mlx5_txq_new(struct rte_eth_dev *dev, uint16_t idx, 
uint16_t desc,
struct mlx5_priv *priv = dev->data->dev_private;
struct mlx5_txq_ctrl *tmpl;
uint16_t max_wqe;
+   uint32_t wqebb_cnt, log_desc_n;
 
if (socket != (unsigned int)SOCKET_ID_ANY) {
tmpl = mlx5_malloc(MLX5_MEM_RTE | MLX5_MEM_ZERO, sizeof(*tmpl) +
@@ -1099,15 +1152,25 @@ mlx5_txq_new(struct rte_eth_dev *dev, uint16_t idx, 
uint16_t desc,
tmpl->txq.idx = idx;
txq_set_params(tmpl);
txq_adjust_params(tmpl);
+   wqebb_cnt = txq_calc_wqebb_cnt(tmpl);
max_wqe = mlx5_dev_get_max_wq_size(priv->sh);
-   if (txq_calc_wqebb_cnt(tmpl) > max_wqe) {
+   if (wqebb_cnt > max_wqe) {
DRV_LOG(ERR,
"port %u Tx WQEBB count (%d) exceeds the limit (%d),"
" try smaller queue size",
-   dev->data->port_id, txq_calc_wqebb

[PATCH v5 0/5] Use consecutive Tx queues' memory

2025-06-29 Thread Bing Zhao
This patchset moves all the mlx5 Tx queues' memory to one
consecutive memory area. All the WQEBBs will be allocated based
on offsets within this memory area.

---
v2:
  1. add a new fix for legacy code of WQE calculation
  2. fix the style
v3:
  1. change the devarg and add description.
  2. reorganize the code with different commits order.
v4:
  1. fix building failure on Windows and OSes with different compilers
  2. update the rst
  3. addressing comments and fix bugs
v5:
  1. solve one size_t coverity warning
---

Bing Zhao (5):
  net/mlx5: add new devarg for Tx queue consecutive memory
  net/mlx5: calculate the memory length for all Tx queues
  net/mlx5: allocate and release unique resources for Tx queues
  net/mlx5: pass the information in Tx queue start
  net/mlx5: use consecutive memory for Tx queue creation

 doc/guides/nics/mlx5.rst   |  25 
 drivers/common/mlx5/mlx5_common_devx.c | 160 +++
 drivers/common/mlx5/mlx5_common_devx.h |   2 +
 drivers/common/mlx5/mlx5_devx_cmds.h   |  10 ++
 drivers/net/mlx5/mlx5.c|  36 +
 drivers/net/mlx5/mlx5.h|  15 ++-
 drivers/net/mlx5/mlx5_devx.c   |  26 +++-
 drivers/net/mlx5/mlx5_trigger.c| 173 +++--
 drivers/net/mlx5/mlx5_tx.h |   2 +
 drivers/net/mlx5/mlx5_txq.c|  67 +-
 10 files changed, 418 insertions(+), 98 deletions(-)

-- 
2.34.1



Re: [PATCH v3 0/8] net/r8169: support more cards

2025-06-29 Thread Stephen Hemminger
On Wed, 11 Jun 2025 11:09:53 +0800
Howard Wang  wrote:

> This patch series includes the following updates:
> 
> Add support for the RTL8168 1G NIC series.
> Add support for the RTL8127 10G NIC.
> Add support for the RTL8125CP NIC.
> Update hardware configuration for RTL8125 and RTL8126.
> 
> Howard Wang (8):
>   net/r8169: add support for RTL8168 series
>   net/r8169: update HW configurations for 8125 and 8126
>   net/r8169: add support for RTL8127
>   net/r8169: remove cmac feature for RTL8125AP
>   net/r8169: add RTL8127AP dash support
>   net/r8169: add support for RTL8125CP
>   net/r8169: add support for RTL8127ATF serdes interface
>   net/r8169: update HW configuration for 8127
> 
>  doc/guides/nics/r8169.rst  |9 +-
>  drivers/net/r8169/base/rtl8125a.c  |8 +-
>  drivers/net/r8169/base/rtl8125a.h  |1 -
>  drivers/net/r8169/base/rtl8125a_mcu.c  |   24 +-
>  drivers/net/r8169/base/rtl8125b.c  |9 +-
>  drivers/net/r8169/base/rtl8125b.h  |1 -
>  drivers/net/r8169/base/rtl8125b_mcu.c  |8 -
>  drivers/net/r8169/base/rtl8125bp.c |5 +
>  drivers/net/r8169/base/rtl8125bp_mcu.c |  200 +--
>  drivers/net/r8169/base/rtl8125cp.c |   73 +
>  drivers/net/r8169/base/rtl8125cp_mcu.c |   78 +
>  drivers/net/r8169/base/rtl8125cp_mcu.h |   10 +
>  drivers/net/r8169/base/rtl8125d.c  |  104 +-
>  drivers/net/r8169/base/rtl8125d_mcu.c  | 1479 +-
>  drivers/net/r8169/base/rtl8125d_mcu.h  |2 +-
>  drivers/net/r8169/base/rtl8126a.c  |   17 +-
>  drivers/net/r8169/base/rtl8126a_mcu.c  |  900 ++-
>  drivers/net/r8169/base/rtl8127.c   |  385 +
>  drivers/net/r8169/base/rtl8127_mcu.c   |  601 
>  drivers/net/r8169/base/rtl8127_mcu.h   |   12 +
>  drivers/net/r8169/base/rtl8168ep.c |  221 +++
>  drivers/net/r8169/base/rtl8168ep.h |   15 +
>  drivers/net/r8169/base/rtl8168ep_mcu.c |  177 +++
>  drivers/net/r8169/base/rtl8168fp.c |  195 +++
>  drivers/net/r8169/base/rtl8168fp.h |   14 +
>  drivers/net/r8169/base/rtl8168fp_mcu.c |  270 
>  drivers/net/r8169/base/rtl8168g.c  |  297 
>  drivers/net/r8169/base/rtl8168g.h  |   15 +
>  drivers/net/r8169/base/rtl8168g_mcu.c  | 1936 
>  drivers/net/r8169/base/rtl8168h.c  |  447 ++
>  drivers/net/r8169/base/rtl8168h.h  |   21 +
>  drivers/net/r8169/base/rtl8168h_mcu.c  | 1186 +++
>  drivers/net/r8169/base/rtl8168kb.c |5 +
>  drivers/net/r8169/base/rtl8168m.c  |   19 +
>  drivers/net/r8169/meson.build  |   14 +
>  drivers/net/r8169/r8169_compat.h   |   78 +-
>  drivers/net/r8169/r8169_dash.c |  447 +-
>  drivers/net/r8169/r8169_dash.h |9 +-
>  drivers/net/r8169/r8169_ethdev.c   |  122 +-
>  drivers/net/r8169/r8169_ethdev.h   |   39 +-
>  drivers/net/r8169/r8169_fiber.c|  201 +++
>  drivers/net/r8169/r8169_fiber.h|   42 +
>  drivers/net/r8169/r8169_hw.c   | 1841 +-
>  drivers/net/r8169/r8169_hw.h   |   74 +-
>  drivers/net/r8169/r8169_phy.c  | 1018 ++---
>  drivers/net/r8169/r8169_phy.h  |   16 +-
>  drivers/net/r8169/r8169_rxtx.c |  275 +++-
>  47 files changed, 11315 insertions(+), 1605 deletions(-)
>  create mode 100644 drivers/net/r8169/base/rtl8125cp.c
>  create mode 100644 drivers/net/r8169/base/rtl8125cp_mcu.c
>  create mode 100644 drivers/net/r8169/base/rtl8125cp_mcu.h
>  create mode 100644 drivers/net/r8169/base/rtl8127.c
>  create mode 100644 drivers/net/r8169/base/rtl8127_mcu.c
>  create mode 100644 drivers/net/r8169/base/rtl8127_mcu.h
>  create mode 100644 drivers/net/r8169/base/rtl8168ep.c
>  create mode 100644 drivers/net/r8169/base/rtl8168ep.h
>  create mode 100644 drivers/net/r8169/base/rtl8168ep_mcu.c
>  create mode 100644 drivers/net/r8169/base/rtl8168fp.c
>  create mode 100644 drivers/net/r8169/base/rtl8168fp.h
>  create mode 100644 drivers/net/r8169/base/rtl8168fp_mcu.c
>  create mode 100644 drivers/net/r8169/base/rtl8168g.c
>  create mode 100644 drivers/net/r8169/base/rtl8168g.h
>  create mode 100644 drivers/net/r8169/base/rtl8168g_mcu.c
>  create mode 100644 drivers/net/r8169/base/rtl8168h.c
>  create mode 100644 drivers/net/r8169/base/rtl8168h.h
>  create mode 100644 drivers/net/r8169/base/rtl8168h_mcu.c
>  create mode 100644 drivers/net/r8169/base/rtl8168m.c
>  create mode 100644 drivers/net/r8169/r8169_fiber.c
>  create mode 100644 drivers/net/r8169/r8169_fiber.h
> 

Ok merged into next-net