Hi Dariusz, Khadem,
On Mon, 21 Jul 2025, Dariusz Sosnowski wrote:
On Mon, Jul 21, 2025 at 01:59:59PM +0200, Thomas Monjalon wrote:
21/07/2025 13:46, Ivan Malov:
On Mon, 21 Jul 2025, Dariusz Sosnowski wrote:
+ mlx5 maintainers
Thank you for the patch.
Could you please include other PMD maintainers (or other maintainers, depending
on changed code)
in the future patches?
There is a script which automatically adds maintainers while sending a patch.
It is described in:
https://doc.dpdk.org/guides/contributing/patches.html#sending-patches
On Mon, Jul 21, 2025 at 03:38:51AM -0400, Khadem Ullah wrote:
When the primary process exits, the shared mlx5 state becomes
unavailable to secondary processes. If a secondary process attempts
to query device information (e.g., via testpmd), a NULL dereference
may occur due to missing shared data.
This patch adds a check for shared context availability and fails
gracefully while preventing a crash.
Fixes: e60fbd5b24fc ("mlx5: add device configure/start/stop")
Cc: sta...@dpdk.org
Signed-off-by: Khadem Ullah <14pwcse1...@uetpeshawar.edu.pk>
---
drivers/net/mlx5/mlx5_ethdev.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/drivers/net/mlx5/mlx5_ethdev.c b/drivers/net/mlx5/mlx5_ethdev.c
index 68d1c1bfa7..1848f6536a 100644
--- a/drivers/net/mlx5/mlx5_ethdev.c
+++ b/drivers/net/mlx5/mlx5_ethdev.c
@@ -368,6 +368,12 @@ mlx5_dev_infos_get(struct rte_eth_dev *dev, struct
rte_eth_dev_info *info)
* Since we need one CQ per QP, the limit is the minimum number
* between the two values.
*/
+ if (priv == NULL || priv->sh == NULL) {
+ DRV_LOG(ERR,
+ "mlx5 shared data unavailable (primary process likely exited)");
+ rte_errno = ENODEV;
+ return -rte_errno;
+ }
I don't think it's an issue on PMD level, but rather on
ethdev/multi-process handling level.
I may be very wrong here, but can't the PMD use its internal 'shared' state to
somehow memorise the fact that a secondary process has attached, in order to not
let the primary process close in the first place (before the secondary process
detaches)? Or is this going to violate some established convention?
It looks to be a good idea.
I agree with idea of adding these checks, but not entirely agree with
it being at driver level, since all drivers would have to duplicate this logic.
Drivers already have to go through ethdev library when port is probed
on secondary process - rte_eth_dev_attach_secondary() must be
called to retrieve port data local to the process and device data shared
between processes.
Memorizing whether a secondary process is using given port can be added
to rte_eth_dev_attach_secondary(), and relevant check for primary
process can then be added to rte_eth_dev_close(), so that all drivers
benefit.
What do you think?
Yes, in general, the idea would be to have a "secondary reference counter" field
in 'rte_eth_dev_data' and have it incremented/decremented/checked in some
generic places. The issue is that, in addition to 'rte_eth_dev_close', a device
can be "closed" via its respective bus 'remove' method. Yes, there are drivers
that invoke 'rte_eth_dev_close' inside the 'remove' implementation, but not all
of them, unfortunately. For example, take a look at 'af_packet' and similar.
On the other hands, there are drivers that use 'rte_eth_dev_create' and its
counterpart 'rte_eth_dev_destroy' in the implementations of bus 'probe/remove',
but that is also not consistent across the 'drivers/net' tree.
Given all these issues and the fact that ethdev is not the only possible device
type, may be 'rte_device' can house the reference counter, with 'rte_dev_probe'
and 'rte_dev_remove' augmented to do the inrement/decrement/validate thing?
And, since 'rte_eth_dev_close' invokes direct ethdev 'close' method, also add a
check to over there (and may be to 'rte_eth_dev_reset', too)? May be I'm wrong.
However, all of that might still make no sense, given the fact that the primary
process can just die. So, may be I am indeed very wrong and, as Stephen has
suggested in [1], this issue it to be addressed by clarifying the documentation.
[1] https://mails.dpdk.org/archives/dev/2025-July/321865.html
Thank you.
When primary process closes the port, ethdev library zeroes and frees
device data shared between processes.
ethdev port data (rte_eth_dev) on secondary is not updated so it now points to
invalid data. rte_eth_dev_info_get() is not the only API call affected.
If the primary process closes the port before exiting
(like testpmd does) and it exits before the secondary,
the any driver call seems invalid because of that use-after-free behavior.
@Thomas, @Andrew - Do you happen to know if doing anything on ethdev ports
in secondary process after primary has gracefully exited is supported?
No there is no statement about whether it is supported or not.
I think we should at least return an error instead of crashing.