On Wed, 4 Feb 2026 15:50:07 +0000
Cliff Burdick <[email protected]> wrote:

> Fixes since v3:
> * Fixed version in RTE_EXPORT_EXPERIMENTAL_SYMBOL
> 
> Add support for kernel dmabuf feature and integrate it in the mlx5 driver.
> This feature is needed to support GPUDirect on newer kernels.
> 
> I apologize for all the patches. Still trying to learn how to submit these.
> 
> Cliff Burdick (2):
>   eal: support dmabuf
>   common/mlx5: support dmabuf
> 
>  .mailmap                                      |   1 +
>  doc/guides/rel_notes/release_26_03.rst        |   6 +
>  drivers/common/mlx5/linux/meson.build         |   2 +
>  drivers/common/mlx5/linux/mlx5_common_verbs.c |  48 ++++-
>  drivers/common/mlx5/linux/mlx5_glue.c         |  19 ++
>  drivers/common/mlx5/linux/mlx5_glue.h         |   3 +
>  drivers/common/mlx5/mlx5_common.c             |  42 ++++-
>  drivers/common/mlx5/mlx5_common_mr.c          | 113 +++++++++++-
>  drivers/common/mlx5/mlx5_common_mr.h          |  17 +-
>  drivers/common/mlx5/windows/mlx5_common_os.c  |   8 +-
>  drivers/crypto/mlx5/mlx5_crypto.h             |   1 +
>  drivers/crypto/mlx5/mlx5_crypto_gcm.c         |   3 +-
>  lib/eal/common/eal_common_memory.c            | 165 +++++++++++++++++-
>  lib/eal/common/eal_memalloc.h                 |  21 +++
>  lib/eal/common/malloc_heap.c                  |  27 +++
>  lib/eal/common/malloc_heap.h                  |   5 +
>  lib/eal/include/rte_memory.h                  | 145 +++++++++++++++
>  17 files changed, 612 insertions(+), 14 deletions(-)
> 

I don't think anyone look at the details here.
If they did they would see the same thing as what AI did.


Review: [PATCH v4 1/2] eal: support dmabuf
        [PATCH v4 2/2] common/mlx5: support dmabuf

Good approach using a side-table to avoid ABI changes to
rte_memseg_list. The dmabuf support for mlx5 is well-structured
with proper compile-time gating via HAVE_IBV_REG_DMABUF_MR.

Patch 1/2 - eal: support dmabuf

Error: rte_extmem_register is broken by rte_extmem_register_dmabuf.

rte_extmem_register now calls rte_extmem_register_dmabuf with
fd=-1. But rte_extmem_register_dmabuf rejects dmabuf_fd < 0:

  int
  rte_extmem_register_dmabuf(void *va_addr, size_t len,
      int dmabuf_fd, uint64_t dmabuf_offset,
      rte_iova_t iova_addrs[], unsigned int n_pages, size_t page_sz)
  {
      if (dmabuf_fd < 0) {
          rte_errno = EINVAL;
          return -1;
      }
      ...

So every call to rte_extmem_register will now fail with EINVAL.
rte_extmem_register should call extmem_register directly (with
fd=-1), not rte_extmem_register_dmabuf:

  int
  rte_extmem_register(void *va_addr, size_t len, rte_iova_t iova_addrs[],
      unsigned int n_pages, size_t page_sz)
  {
      return extmem_register(va_addr, len, -1, 0, iova_addrs,
          n_pages, page_sz);
  }

Error: Input validation from rte_extmem_register is lost.

The original rte_extmem_register validated va_addr, page_sz, len,
alignment, and n_pages before doing anything. The new
extmem_register function has no input validation -- it only has
the locking and memseg creation logic. The parameter checks that
were in the original function (va_addr == NULL, page_sz == 0,
len == 0, power-of-2, alignment) need to be in extmem_register
or duplicated in both rte_extmem_register and
rte_extmem_register_dmabuf.

Error: eal_memseg_list_init uses type_msl_idx to index dmabuf_info
but this is not the same as the memseg list index in mcfg->memsegs.

eal_memseg_list_init initializes dmabuf_info[type_msl_idx], but
type_msl_idx is a per-type index used for naming, not the global
memseg list position. When malloc_heap_create_external_seg_dmabuf
later sets dmabuf_info[msl_idx] where msl_idx = msl - mcfg->memsegs,
these are different index spaces. The init in eal_memseg_list_init
may write to wrong slots, and external segments may use indices
that were never initialized by eal_memseg_list_init. The static
dmabuf_info array is zero-initialized at program start (fd=0,
offset=0), but fd=0 is a valid file descriptor (stdin), so fd
should be initialized to -1 for all entries, or the
eal_memseg_list_init initialization should be removed and
initialization done only in malloc_heap_create_external_seg and
malloc_heap_create_external_seg_dmabuf.

Warning: malloc_heap_create_external_seg now redundantly
initializes dmabuf_info.

malloc_heap_create_external_seg calls
eal_memseg_list_set_dmabuf_info(i, -1, 0) at the end, but
malloc_heap_create_external_seg_dmabuf also calls
malloc_heap_create_external_seg (which sets fd=-1) and then
immediately overwrites with the actual fd. This works but is
confusing -- the redundant initialization in the base function
was added for non-dmabuf paths but makes the dmabuf path do a
useless set-then-overwrite.

Warning: dmabuf_info is process-local static storage.

The side-table approach means dmabuf metadata is not shared with
secondary processes via shared memory. The commit message mentions
rte_extmem_attach for cross-process use, but secondary processes
will not have the dmabuf_info populated. If a secondary process
calls rte_memseg_list_get_dmabuf_fd, it will always get -1. This
limitation should be documented in the Doxygen for
rte_extmem_register_dmabuf.

Warning: dmabuf fd lifetime / ownership is not documented.

The API does not specify whether DPDK takes ownership of the
dmabuf_fd (will it close it?) or whether the caller must keep it
open. Since ibv_reg_dmabuf_mr likely holds a reference through
the kernel, the fd can probably be closed after registration,
but this should be explicitly documented in the Doxygen. Also,
rte_extmem_unregister does not clean up the dmabuf_info entry
(reset fd to -1), so stale metadata remains after unregistration.

Warning: the public API functions in eal_memalloc.h have the same
names as the public API functions in rte_memory.h but different
signatures.

eal_memalloc.h declares:
  int eal_memseg_list_get_dmabuf_fd(int list_idx);
  int eal_memseg_list_get_dmabuf_offset(int list_idx, uint64_t *offset);

rte_memory.h declares:
  int rte_memseg_list_get_dmabuf_fd(const struct rte_memseg_list *msl);
  int rte_memseg_list_get_dmabuf_offset(const struct rte_memseg_list *msl, ...);

These are distinct functions, but the eal_memalloc.h header claims
to declare:
  "Get dma-buf fd for a memseg list."
  "Get dma-buf offset for a memseg list."
...with names prefixed eal_ vs rte_. This is fine functionally but
the internal and public function signatures should use the same
error convention. The internal eal_memseg_list_get_dmabuf_fd
returns -EINVAL on error, while the public
rte_memseg_list_get_dmabuf_fd_unsafe returns -1 with rte_errno.
The public function does not call the internal function at all --
it accesses dmabuf_info directly. The internal functions in
eal_memalloc.h appear unused.

Patch 2/2 - common/mlx5: support dmabuf

Warning: mlx5_os_set_reg_mr_cb signature change is an internal
ABI break.

The function signature changes from 2 to 3 parameters. This is
an internal symbol so it's not a public ABI break, but all
callers must be updated atomically. The Windows and Linux
implementations are both updated, and the crypto caller is
updated, so this appears complete. Verify no other callers exist
outside this patch.

Warning: duplicate dmabuf detection logic in mlx5_common.c and
mlx5_common_mr.c.

The pattern of checking msl->external, getting dmabuf_fd, getting
dmabuf_offset, and calculating the adjusted offset is repeated
nearly identically in mlx5_common_dev_dma_map and
mlx5_mr_mempool_register_primary. This should be factored into a
helper function to avoid the code duplication and ensure consistent
behavior.

Info: mlx5_common_verbs_reg_dmabuf_mr passes addr as iova.

The call is:
  reg_dmabuf_mr_cb(pd, dmabuf_offset, len, addr, dmabuf_fd, ...)

where addr is the user-space virtual address. For ibv_reg_dmabuf_mr,
the iova parameter is the "IO virtual address the device will use
to access the region." Using the user-space VA as iova means the
MR maps virtual addresses 1:1. This is the common pattern for
DPDK but worth noting that it won't work if the device expects
different IO addresses.

Reviewed-by: Stephen Hemminger <[email protected]>

Reply via email to