On 10/02/2026 17:05, Slava Ovsiienko wrote:
> Hi,
> 

Hi Slava,

> I'm sorry, I have some concern about the patch.
> 

No problem, that's what reviews are for :-) thanks for reviewing.

> How it works, as far as I understand:
> 
> - DPDK simulates interrupts in user mode with epoll_wait()
> - mlx5 PMD emits the async counter query command to the NIC periodically

I didn't think this would happen unless there was something like
hardware offload, but regardless, yes I agree there may be async counter
queries.

> - there might be multiple async query commands in the flight
> - kernel drivers handles the async query completion interrupts, pushes the 
> token to the internal completion queue and unblocks associated fd
> - epoll_wait() sees this unblocked fd and notifies mlx5 PMD about
> - mlx5 PMD reads the completion token from the kernel queue with 
> devx_get_async_cmd_comp()
> 
> The concern scenario, let's assume:
> 
> - we have 2 async query commands in the flight
> - the first async query completes, fd is unblocked, PMD is inviked, the 
> completion is read by PMD and is being handled
> - the second async query completes, fd gets unblocked, the second token is 
> written to the queue
> - the PMD completes the handling of the first completion and reads the queue 
> again (with devx_get_async_cmd_comp() call in the loop)
> - it reads the second token successfully and handles
> - then, on the third call, devx_get_async_cmd_comp() returns EAGAIN, it means 
> queue is empty
> - DPDK calls epoll_wait() again and sees unblocked fd
> - it call mlx5 PMD, and it calls devx_get_async_cmd_comp(), but queue is 
> empty (handled in previous interrupt handling)
> - with the patch we wrongly remove the handler 
> 

I'm not sure, but this ^^^ sounds feasible.

> In my opinion, we should handle flags EPOLLERR | EPOLLHUP | EPOLLRDHU from 
> the epoll_wait()_return also for
> RTE_INTR_HANDLE_EXT and RTE_INTR_HANDLE_DEV_EVENT interrupt types.
> 

That's exactly what I had in v1 of the patch! The issue is that some
clients of eal interrupt may not interpret the condition of
EPOLLHUP/EPOLLRDHUP as an error condition and/or want to do some special
handling.

The example is vhost user server, which puts in place a reconnect
mechanism. If we filter out EPOLLHUP/EPOLLRDHUP events in eal, then
virtio will not receive the callback and vhost server reconnect is
broken. I have some more notes about it in the cover letter.

Trying to base on the read pattern in devx handler was an attempt to
move logic out of eal so different handlers could be flexible in how
they handle this condition.

We do have a distinction in that mlx5 uses RTE_INTR_HANDLE_EXT and
virtio uses RTE_INTR_HANDLE_VDEV but i'm not sure that is generic enough
to base a check/don't check for EPOLLHUP/EPOLLRDHUP events on.

So we'd need to come up with another solution if we wanted to filter
this in eal. Let's think more on this, though we are a bit constrained
by public API as well.

A workaround we can do from application is David's hackā„¢ "-a
0000:00:00.0" to skip initial probe. That will at least prevent the
issue for mlx devices not used in DPDK, which was the scenario reported.

thanks,
Kevin.

> With best regards,
> Slava
>  
> 
> 
> 
>> -----Original Message-----
>> From: Kevin Traynor <[email protected]>
>> Sent: Tuesday, February 10, 2026 5:06 PM
>> To: Stephen Hemminger <[email protected]>
>> Cc: [email protected]; NBU-Contact-Thomas Monjalon (EXTERNAL)
>> <[email protected]>; [email protected]; Dariusz Sosnowski
>> <[email protected]>; Slava Ovsiienko <[email protected]>;
>> [email protected]
>> Subject: Re: [PATCH v2 1/2] net/mlx5: check for no data read in devx 
>> interrupt
>>
>> On 07/02/2026 06:09, Stephen Hemminger wrote:
>>> On Fri,  6 Feb 2026 17:20:53 +0000
>>> Kevin Traynor <[email protected]> wrote:
>>>
>>>> A busy-loop may occur when there are EPOLLERR, EPOLLHUP or
>> EPOLLRDHUP
>>>> epoll events for the devx interrupt fd.
>>>>
>>>> This may happen if the interrupt fd is deleted, if the device is
>>>> unbound from mlx5_core kernel driver or if the device is removed by
>>>> the mlx5 kernel driver as part of LAG setup.
>>>>
>>>> When that occurs, there is no data to be read and in the devx
>>>> interrupt handler an EAGAIN is returned on the first call to
>>>> devx_get_async_cmd_comp, but this is not checked.
>>>>
>>>> As the interrupt is not removed or condition reset, it causes an
>>>> interrupt processing busy-loop, which leads to the dpdk-intr thread
>>>> going to 100% CPU.
>>>>
>>>> e.g.
>>>> epoll_wait
>>>>    (6, [{events=EPOLLIN|EPOLLRDHUP, data={u32=28, u64=28}}], 8, -1) =
>>>> 1 read(28, 0x7f1f5c7fc2f0, 40)
>>>>    = -1 EAGAIN (Resource temporarily unavailable) epoll_wait
>>>>    (6, [{events=EPOLLIN|EPOLLRDHUP, data={u32=28, u64=28}}], 8, -1) =
>>>> 1 read(28, 0x7f1f5c7fc2f0, 40)
>>>>    = -1 EAGAIN (Resource temporarily unavailable)
>>>>
>>>> Add a check for an EAGAIN return from devx_get_async_cmd_comp on the
>>>> first read. If that happens, unregister the callback to prevent looping.
>>>>
>>>> Bugzilla ID: 1873
>>>> Fixes: f15db67df09c ("net/mlx5: accelerate DV flow counter query")
>>>> Cc: [email protected]
>>>>
>>>> Signed-off-by: Kevin Traynor <[email protected]>
>>>
>>> AI spotted this, I didn't...
>>>
>>>
>>> Errors:
>>>
>>>     Line 139: Unnecessary semicolon after closing brace
>>>
>>> c
>>>
>>>    };
>>>
>>> Should be:
>>> c
>>>
>>>    }
>>>
>>>     Lines 142-146: Block comment uses incorrect style Block comments in C
>> code should use /* and */ style, not /** which is reserved for documentation
>> comments.
>>>
>>> c
>>>
>>>    /**
>>>     * no data and EAGAIN indicate there is an error or
>>>     * disconnect state. Unregister callback to prevent
>>>     * interrupt busy-looping.
>>>     */
>>>
>>> Should be:
>>> c
>>>
>>>    /*
>>>     * no data and EAGAIN indicate there is an error or
>>>     * disconnect state. Unregister callback to prevent
>>>     * interrupt busy-looping.
>>>     */
>>>
>>> Warnings:
>>>
>>>     Logic clarity: The variable data_read is set to true inside the while 
>>> loop but
>> never checked when data WAS read. Consider if data_read is the clearest way 
>> to
>> express this condition.
>>>
>>
>> Ack above. Thanks.Will be fixed in v3.
> 

Reply via email to