On Mon, Jun 22, 2026 at 4:13 AM David Laight
<[email protected]> wrote:
>

Hi David,

Thank you for your review. You raised many good points regarding
optimizations here. I'll switch to using 2G as the max entry size
(`SZ_2G` from `linux/sizes.h`), and remove divisions and
multiplications. I'll also replace the `for()` loop with `while
(length)`, and drop `min_t()` in favor of `min()` by casting `SZ_2G`
to `size_t`. I'll send out a v2 with these changes shortly.

Thanks,
David

> > Currently, `fill_sg_entry()` splits the scatterlist using `UINT_MAX`.
> > This creates a non-page-aligned DMA length (`0xFFFFFFFF`) for the
> > first entry, resulting in non-page-aligned DMA addresses for all
> > subsequent entries.
>
> How did you find this?
> It requires a single buffer over 4GB - seems highly unlikely.

It was observed during experiments with buffers over 8GB on an accelerator.

> >
> > While the underlying IOMMU mapping may be contiguous, hardware
> > DMA engines often require explicit address alignment (e.g., page,
> > cacheline, or storage sector boundaries). Passing unaligned
> > addresses and lengths can cause explicit failures in DMA descriptor
> > creation or silent data corruption if lower unaligned bits are
> > truncated.
> >
> > Fix this by splitting the scatterlist by the largest possible page
> > aligned chunk within `UINT_MAX` (`ALIGN_DOWN(UINT_MAX, PAGE_SIZE)`).
> > This ensures all scatterlist DMA addresses and lengths remain page
> > aligned and satisfy hardware constraints.
>
> It would almost certainly be better to split into 2G chunks.
> That removes any need for any divisions.

I agree. 2G is a power of two, so entries stay aligned to the page size
and to most hardware boundaries, and the compiler can reduce any
remaining division or multiplication by it to a simple bit shift.
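
For illustration, here is a rough userspace sketch (not kernel code) of
why the split size matters. `PAGE_SZ`, `entry_addr()` and
`page_aligned()` are hypothetical names made up for this example, and a
4K page size is assumed:

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SZ   4096ULL        /* assumed page size, illustration only */
#define CHUNK_MAX 0xFFFFFFFFULL  /* value of UINT_MAX with a 32-bit unsigned int */
#define CHUNK_2G  0x80000000ULL  /* same value as SZ_2G */

/* A 2G chunk is a whole number of (assumed 4K) pages. */
static_assert((CHUNK_2G % PAGE_SZ) == 0, "2G is a multiple of the page size");

/* DMA address of the i-th entry when a range is split into fixed chunks. */
static uint64_t entry_addr(uint64_t base, uint64_t chunk, unsigned int i)
{
	return base + (uint64_t)i * chunk;
}

/* Nonzero if addr is a multiple of the assumed page size. */
static int page_aligned(uint64_t addr)
{
	return (addr & (PAGE_SZ - 1)) == 0;
}
```

With a page-aligned base, the second entry of a `UINT_MAX` split starts
at an address whose low 12 bits are all set, while the second entry of
a 2G split stays page aligned.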

>
> > Page-aligned entries allow the system to cleanly chunk payloads into
> > PCIe MaxPayloadSize (MPS) (e.g., 128 bytes, 256 bytes, 512 bytes).
> > As a result, this may help reduce TLP fragmentation in P2P transfers
> > and alleviate potential congestion within a logical PCIe switch
> > partition, especially when Relaxed Ordering is not possible due to
> > hardware constraints.
> >
> > Reported-by: sashiko-bot <[email protected]>
> > Closes: https://lore.kernel.org/all/[email protected]/
> > Fixes: 3aa31a8bb11e ("dma-buf: provide phys_vec to scatter-gather mapping routine")
> > Cc: [email protected]
> > Signed-off-by: David Hu <[email protected]>
> > ---
> >  drivers/dma-buf/dma-buf-mapping.c | 13 ++++++++-----
> >  1 file changed, 8 insertions(+), 5 deletions(-)
> >
> > diff --git a/drivers/dma-buf/dma-buf-mapping.c b/drivers/dma-buf/dma-buf-mapping.c
> > index 794acff2546a..f2bde38fdb1f 100644
> > --- a/drivers/dma-buf/dma-buf-mapping.c
> > +++ b/drivers/dma-buf/dma-buf-mapping.c
> > @@ -5,6 +5,9 @@
> >   */
> >  #include <linux/dma-buf-mapping.h>
> >  #include <linux/dma-resv.h>
> > +#include <linux/align.h>
> > +
> > +#define MAX_ENT_SZ ALIGN_DOWN(UINT_MAX, PAGE_SIZE)
>
> >
> >  static struct scatterlist *fill_sg_entry(struct scatterlist *sgl, size_t length,
> >                                        dma_addr_t addr)
> > @@ -12,9 +15,9 @@ static struct scatterlist *fill_sg_entry(struct scatterlist *sgl, size_t length,
> >       unsigned int len, nents;
> >       int i;
> >
> > -     nents = DIV_ROUND_UP(length, UINT_MAX);
> > +     nents = DIV_ROUND_UP(length, MAX_ENT_SZ);
> >       for (i = 0; i < nents; i++) {
>
> Why not change that to 'while (length) {' to avoid the division above?

Sounds good, will do.

>
> > -             len = min_t(size_t, length, UINT_MAX);
> > +             len = min_t(size_t, length, MAX_ENT_SZ);
>
> I bet that doesn't need to be min_t()

Agreed.


>
> >               length -= len;
> >               /*
> >                * DMABUF abuses scatterlist to create a scatterlist
> > @@ -24,7 +27,7 @@ static struct scatterlist *fill_sg_entry(struct scatterlist *sgl, size_t length,
> >                * does not require the CPU list for mapping or unmapping.
> >                */
> >               sg_set_page(sgl, NULL, 0, 0);
> > -             sg_dma_address(sgl) = addr + (dma_addr_t)i * UINT_MAX;
> > +             sg_dma_address(sgl) = addr + (dma_addr_t)i * MAX_ENT_SZ;
> >               sg_dma_len(sgl) = len;
>
> Replace the multiply with 'addr += len'.

Will update this as well.
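
To make the combined suggestions concrete, here is a rough userspace
sketch of the loop shape (not the actual v2 code). `struct ent` and
`fill_entries()` are hypothetical stand-ins for the scatterlist entry
and `fill_sg_entry()`, and `MAX_ENT_SZ` is assumed to be 2G:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ENT_SZ ((uint64_t)0x80000000) /* same value as SZ_2G */

struct ent {                /* hypothetical stand-in for a scatterlist entry */
	uint64_t dma_addr;
	unsigned int dma_len;
};

/*
 * Split [addr, addr + length) into entries of at most 2G each, with no
 * division or multiplication. Returns the number of entries written;
 * the caller must size ents[] for the worst case.
 */
static size_t fill_entries(struct ent *ents, uint64_t length, uint64_t addr)
{
	size_t n = 0;

	assert(ents != NULL);

	while (length) {
		/* Plain comparison on same-typed values, so no min_t()-style cast. */
		uint64_t len = length < MAX_ENT_SZ ? length : MAX_ENT_SZ;

		ents[n].dma_addr = addr;
		ents[n].dma_len = (unsigned int)len; /* len <= 2G, fits in 32 bits */
		addr += len;                         /* replaces addr + i * MAX_ENT_SZ */
		length -= len;
		n++;
	}

	return n;
}
```

For a 5G range starting at a 4G-aligned address, this yields three
entries (2G, 2G, 1G), each starting on a 2G-multiple offset from the
base.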

>
> -- David
>
> >               sgl = sg_next(sgl);
> >       }
> > @@ -41,14 +44,14 @@ static unsigned int calc_sg_nents(struct dma_iova_state *state,
> >
> >       if (!state || !dma_use_iova(state)) {
> >               for (i = 0; i < nr_ranges; i++)
> > -                     nents += DIV_ROUND_UP(phys_vec[i].len, UINT_MAX);
> > +                     nents += DIV_ROUND_UP(phys_vec[i].len, MAX_ENT_SZ);
> >       } else {
> >               /*
> >                * In IOVA case, there is only one SG entry which spans
> >                * for whole IOVA address space, but we need to make sure
> >                * that it fits sg->length, maybe we need more.
> >                */
> > -             nents = DIV_ROUND_UP(size, UINT_MAX);
> > +             nents = DIV_ROUND_UP(size, MAX_ENT_SZ);
> >       }
> >
> >       return nents;
>
