Hi,
What about checking EPOLLERR | EPOLLHUP | EPOLLRDHU flags for specific fd in
mlx5 habdler?
if devx_get_async_cmd_comp() returns EAGAIN {
if no data were read {
call epoll_wait() for specific fd and zero timeout
check EPOLLERR | EPOLLHUP | EPOLLRDHU flags
if fd is in hanging/error state {
- remove handler
}
}
}
With best regards,
Slava
> -----Original Message-----
> From: Kevin Traynor <[email protected]>
> Sent: Tuesday, February 10, 2026 9:08 PM
> To: Slava Ovsiienko <[email protected]>; Stephen Hemminger
> <[email protected]>
> Cc: [email protected]; NBU-Contact-Thomas Monjalon (EXTERNAL)
> <[email protected]>; [email protected]; Dariusz Sosnowski
> <[email protected]>; [email protected]; Harman Kalra
> <[email protected]>
> Subject: Re: [PATCH v2 1/2] net/mlx5: check for no data read in devx interrupt
>
> On 10/02/2026 17:05, Slava Ovsiienko wrote:
> > Hi,
> >
>
> Hi Slava,
>
> > I'm sorry, I have some concern about the patch.
> >
>
> No problem, that's what reviews are for :-) thanks for reviewing.
>
> > How it works, as far as I understand:
> >
> > - DPDK simulates interrupts in user mode with epoll_wait()
> > - mlx5 PMD emits the async counter query command to the NIC
> > periodically
>
> I didn't think this would happen unless there was something like hardware
> offload, but regardless, yes I agree there may be async counter queries.
>
> > - there might be multiple async query commands in the flight
> > - kernel drivers handles the async query completion interrupts, pushes
> > the token to the internal completion queue and unblocks associated fd
> > - epoll_wait() sees this unblocked fd and notifies mlx5 PMD about
> > - mlx5 PMD reads the completion token from the kernel queue with
> > devx_get_async_cmd_comp()
> >
> > The concern scenario, let's assume:
> >
> > - we have 2 async query commands in the flight
> > - the first async query completes, fd is unblocked, PMD is inviked,
> > the completion is read by PMD and is being handled
> > - the second async query completes, fd gets unblocked, the second
> > token is written to the queue
> > - the PMD completes the handling of the first completion and reads the
> > queue again (with devx_get_async_cmd_comp() call in the loop)
> > - it reads the second token successfully and handles
> > - then, on the third call, devx_get_async_cmd_comp() returns EAGAIN,
> > it means queue is empty
> > - DPDK calls epoll_wait() again and sees unblocked fd
> > - it call mlx5 PMD, and it calls devx_get_async_cmd_comp(), but queue
> > is empty (handled in previous interrupt handling)
> > - with the patch we wrongly remove the handler
> >
>
> I'm not sure, but this ^^^ sounds feasible.
>
> > In my opinion, we should handle flags EPOLLERR | EPOLLHUP | EPOLLRDHU
> > from the epoll_wait()_return also for RTE_INTR_HANDLE_EXT and
> RTE_INTR_HANDLE_DEV_EVENT interrupt types.
> >
>
> That's exactly what I had in v1 of the patch! The issue is that some clients
> of eal
> interrupt may not interpret the condition of EPOLLHUP/EPOLLRDHUP as an
> error condition and/or want to do some special handling.
>
> The example is vhost user server, which puts in place a reconnect mechanism.
> If
> we filter out EPOLLHUP/EPOLLRDHUP events in eal, then virtio will not receive
> the callback and vhost server reconnect is broken. I have some more notes
> about it in the cover letter.
>
> Trying to base on the read pattern in devx handler was an attempt to move
> logic
> out of eal so different handlers could be flexible in how they handle this
> condition.
>
> We do have a distinction in that mlx5 uses RTE_INTR_HANDLE_EXT and virtio
> uses RTE_INTR_HANDLE_VDEV but i'm not sure that is generic enough to base a
> check/don't check for EPOLLHUP/EPOLLRDHUP events on.
>
> So we'd need to come up with another solution if we wanted to filter this in
> eal.
> Let's think more on this, though we are a bit constrained by public API as
> well.
>
> A workaround we can do from application is David's hack⢠"-a 0000:00:00.0"
> to skip initial probe. That will at least prevent the issue for mlx devices
> not used
> in DPDK, which was the scenario reported.
>
> thanks,
> Kevin.
>
> > With best regards,
> > Slava
> >
> >
> >
> >
> >> -----Original Message-----
> >> From: Kevin Traynor <[email protected]>
> >> Sent: Tuesday, February 10, 2026 5:06 PM
> >> To: Stephen Hemminger <[email protected]>
> >> Cc: [email protected]; NBU-Contact-Thomas Monjalon (EXTERNAL)
> >> <[email protected]>; [email protected]; Dariusz Sosnowski
> >> <[email protected]>; Slava Ovsiienko <[email protected]>;
> >> [email protected]
> >> Subject: Re: [PATCH v2 1/2] net/mlx5: check for no data read in devx
> >> interrupt
> >>
> >> On 07/02/2026 06:09, Stephen Hemminger wrote:
> >>> On Fri, 6 Feb 2026 17:20:53 +0000
> >>> Kevin Traynor <[email protected]> wrote:
> >>>
> >>>> A busy-loop may occur when there are EPOLLERR, EPOLLHUP or
> >> EPOLLRDHUP
> >>>> epoll events for the devx interrupt fd.
> >>>>
> >>>> This may happen if the interrupt fd is deleted, if the device is
> >>>> unbound from mlx5_core kernel driver or if the device is removed by
> >>>> the mlx5 kernel driver as part of LAG setup.
> >>>>
> >>>> When that occurs, there is no data to be read and in the devx
> >>>> interrupt handler an EAGAIN is returned on the first call to
> >>>> devx_get_async_cmd_comp, but this is not checked.
> >>>>
> >>>> As the interrupt is not removed or condition reset, it causes an
> >>>> interrupt processing busy-loop, which leads to the dpdk-intr thread
> >>>> going to 100% CPU.
> >>>>
> >>>> e.g.
> >>>> epoll_wait
> >>>> (6, [{events=EPOLLIN|EPOLLRDHUP, data={u32=28, u64=28}}], 8, -1)
> >>>> =
> >>>> 1 read(28, 0x7f1f5c7fc2f0, 40)
> >>>> = -1 EAGAIN (Resource temporarily unavailable) epoll_wait
> >>>> (6, [{events=EPOLLIN|EPOLLRDHUP, data={u32=28, u64=28}}], 8, -1)
> >>>> =
> >>>> 1 read(28, 0x7f1f5c7fc2f0, 40)
> >>>> = -1 EAGAIN (Resource temporarily unavailable)
> >>>>
> >>>> Add a check for an EAGAIN return from devx_get_async_cmd_comp on
> >>>> the first read. If that happens, unregister the callback to prevent
> >>>> looping.
> >>>>
> >>>> Bugzilla ID: 1873
> >>>> Fixes: f15db67df09c ("net/mlx5: accelerate DV flow counter query")
> >>>> Cc: [email protected]
> >>>>
> >>>> Signed-off-by: Kevin Traynor <[email protected]>
> >>>
> >>> AI spotted this, I didn't...
> >>>
> >>>
> >>> Errors:
> >>>
> >>> Line 139: Unnecessary semicolon after closing brace
> >>>
> >>> c
> >>>
> >>> };
> >>>
> >>> Should be:
> >>> c
> >>>
> >>> }
> >>>
> >>> Lines 142-146: Block comment uses incorrect style Block comments
> >>> in C
> >> code should use /* and */ style, not /** which is reserved for
> >> documentation comments.
> >>>
> >>> c
> >>>
> >>> /**
> >>> * no data and EAGAIN indicate there is an error or
> >>> * disconnect state. Unregister callback to prevent
> >>> * interrupt busy-looping.
> >>> */
> >>>
> >>> Should be:
> >>> c
> >>>
> >>> /*
> >>> * no data and EAGAIN indicate there is an error or
> >>> * disconnect state. Unregister callback to prevent
> >>> * interrupt busy-looping.
> >>> */
> >>>
> >>> Warnings:
> >>>
> >>> Logic clarity: The variable data_read is set to true inside the
> >>> while loop but
> >> never checked when data WAS read. Consider if data_read is the
> >> clearest way to express this condition.
> >>>
> >>
> >> Ack above. Thanks.Will be fixed in v3.
> >