On Tue, Feb 17, 2026 at 02:24:38PM +0100, Larysa Zaremba wrote:
> Aside from the issue described below, tailroom calculation does not account
> for pages being split between frags, e.g. in i40e, enetc and
> AF_XDP ZC with smaller chunks. These series address the problem by
> calculating modulo (skb_frag_off() % rxq->frag_size) in order to get
> data offset within a smaller block of memory. Please note, xskxceiver
> tail grow test passes without modulo e.g. in xdpdrv mode on i40e,
> because there is not enough descriptors to get to flipped buffers.
> 
> Many ethernet drivers report xdp Rx queue frag size as being the same as
> DMA write size. However, the only user of this field, namely
> bpf_xdp_frags_increase_tail(), clearly expects a truesize.
> 
> Such difference leads to unspecific memory corruption issues under certain
> circumstances, e.g. in ixgbevf maximum DMA write size is 3 KB, so when
> running xskxceiver's XDP_ADJUST_TAIL_GROW_MULTI_BUFF, 6K packet fully uses
> all DMA-writable space in 2 buffers. This would be fine, if only
> rxq->frag_size was properly set to 4K, but value of 3K results in a
> negative tailroom, because there is a non-zero page offset.
> 
> We are supposed to return -EINVAL and be done with it in such case,
> but due to tailroom being stored as an unsigned int, it is reported to be
> somewhere near UINT_MAX, resulting in a tail being grown, even if the
> requested offset is too much(it is around 2K in the abovementioned test).
> This later leads to all kinds of unspecific calltraces.
> 
> [ 7340.337579] xskxceiver[1440]: segfault at 1da718 ip 00007f4161aeac9d sp 
> 00007f41615a6a00 error 6
> [ 7340.338040] xskxceiver[1441]: segfault at 7f410000000b ip 00000000004042b5 
> sp 00007f415bffecf0 error 4
> [ 7340.338179]  in libc.so.6[61c9d,7f4161aaf000+160000]
> [ 7340.339230]  in xskxceiver[42b5,400000+69000]
> [ 7340.340300]  likely on CPU 6 (core 0, socket 6)
> [ 7340.340302] Code: ff ff 01 e9 f4 fe ff ff 0f 1f 44 00 00 4c 39 f0 74 73 31 
> c0 ba 01 00 00 00 f0 0f b1 17 0f 85 ba 00 00 00 49 8b 87 88 00 00 00 <4c> 89 
> 70 08 eb cc 0f 1f 44 00 00 48 8d bd f0 fe ff ff 89 85 ec fe
> [ 7340.340888]  likely on CPU 3 (core 0, socket 3)
> [ 7340.345088] Code: 00 00 00 ba 00 00 00 00 be 00 00 00 00 89 c7 e8 31 ca ff 
> ff 89 45 ec 8b 45 ec 85 c0 78 07 b8 00 00 00 00 eb 46 e8 0b c8 ff ff <8b> 00 
> 83 f8 69 74 24 e8 ff c7 ff ff 8b 00 83 f8 0b 74 18 e8 f3 c7
> [ 7340.404334] Oops: general protection fault, probably for non-canonical 
> address 0x6d255010bdffc: 0000 [#1] SMP NOPTI
> [ 7340.405972] CPU: 7 UID: 0 PID: 1439 Comm: xskxceiver Not tainted 
> 6.19.0-rc1+ #21 PREEMPT(lazy)
> [ 7340.408006] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 
> 1.17.0-5.fc42 04/01/2014
> [ 7340.409716] RIP: 0010:lookup_swap_cgroup_id+0x44/0x80
> [ 7340.410455] Code: 83 f8 1c 73 39 48 ba ff ff ff ff ff ff ff 03 48 8b 04 c5 
> 20 55 fa bd 48 21 d1 48 89 ca 83 e1 01 48 d1 ea c1 e1 04 48 8d 04 90 <8b> 00 
> 48 83 c4 10 d3 e8 c3 cc cc cc cc 31 c0 e9 98 b7 dd 00 48 89
> [ 7340.412787] RSP: 0018:ffffcc5c04f7f6d0 EFLAGS: 00010202
> [ 7340.413494] RAX: 0006d255010bdffc RBX: ffff891f477895a8 RCX: 
> 0000000000000010
> [ 7340.414431] RDX: 0001c17e3fffffff RSI: 00fa070000000000 RDI: 
> 000382fc7fffffff
> [ 7340.415354] RBP: 00fa070000000000 R08: ffffcc5c04f7f8f8 R09: 
> ffffcc5c04f7f7d0
> [ 7340.416283] R10: ffff891f4c1a7000 R11: ffffcc5c04f7f9c8 R12: 
> ffffcc5c04f7f7d0
> [ 7340.417218] R13: 03ffffffffffffff R14: 00fa06fffffffe00 R15: 
> ffff891f47789500
> [ 7340.418229] FS:  0000000000000000(0000) GS:ffff891ffdfaa000(0000) 
> knlGS:0000000000000000
> [ 7340.419489] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 7340.420286] CR2: 00007f415bfffd58 CR3: 0000000103f03002 CR4: 
> 0000000000772ef0
> [ 7340.421237] PKRU: 55555554
> [ 7340.421623] Call Trace:
> [ 7340.421987]  <TASK>
> [ 7340.422309]  ? softleaf_from_pte+0x77/0xa0
> [ 7340.422855]  swap_pte_batch+0xa7/0x290
> [ 7340.423363]  zap_nonpresent_ptes.constprop.0.isra.0+0xd1/0x270
> [ 7340.424102]  zap_pte_range+0x281/0x580
> [ 7340.424607]  zap_pmd_range.isra.0+0xc9/0x240
> [ 7340.425177]  unmap_page_range+0x24d/0x420
> [ 7340.425714]  unmap_vmas+0xa1/0x180
> [ 7340.426185]  exit_mmap+0xe1/0x3b0
> [ 7340.426644]  __mmput+0x41/0x150
> [ 7340.427098]  exit_mm+0xb1/0x110
> [ 7340.427539]  do_exit+0x1b2/0x460
> [ 7340.427992]  do_group_exit+0x2d/0xc0
> [ 7340.428477]  get_signal+0x79d/0x7e0
> [ 7340.428957]  arch_do_signal_or_restart+0x34/0x100
> [ 7340.429571]  exit_to_user_mode_loop+0x8e/0x4c0
> [ 7340.430159]  do_syscall_64+0x188/0x6b0
> [ 7340.430672]  ? __do_sys_clone3+0xd9/0x120
> [ 7340.431212]  ? switch_fpu_return+0x4e/0xd0
> [ 7340.431761]  ? arch_exit_to_user_mode_prepare.isra.0+0xa1/0xc0
> [ 7340.432498]  ? do_syscall_64+0xbb/0x6b0
> [ 7340.433015]  ? __handle_mm_fault+0x445/0x690
> [ 7340.433582]  ? count_memcg_events+0xd6/0x210
> [ 7340.434151]  ? handle_mm_fault+0x212/0x340
> [ 7340.434697]  ? do_user_addr_fault+0x2b4/0x7b0
> [ 7340.435271]  ? clear_bhb_loop+0x30/0x80
> [ 7340.435788]  ? clear_bhb_loop+0x30/0x80
> [ 7340.436299]  ? clear_bhb_loop+0x30/0x80
> [ 7340.436812]  ? clear_bhb_loop+0x30/0x80
> [ 7340.437323]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
> [ 7340.437973] RIP: 0033:0x7f4161b14169
> [ 7340.438468] Code: Unable to access opcode bytes at 0x7f4161b1413f.
> [ 7340.439242] RSP: 002b:00007ffc6ebfa770 EFLAGS: 00000246 ORIG_RAX: 
> 00000000000000ca
> [ 7340.440173] RAX: fffffffffffffe00 RBX: 00000000000005a1 RCX: 
> 00007f4161b14169
> [ 7340.441061] RDX: 00000000000005a1 RSI: 0000000000000109 RDI: 
> 00007f415bfff990
> [ 7340.441943] RBP: 00007ffc6ebfa7a0 R08: 0000000000000000 R09: 
> 00000000ffffffff
> [ 7340.442824] R10: 0000000000000000 R11: 0000000000000246 R12: 
> 0000000000000000
> [ 7340.443707] R13: 0000000000000000 R14: 00007f415bfff990 R15: 
> 00007f415bfff6c0
> [ 7340.444586]  </TASK>
> [ 7340.444922] Modules linked in: rfkill intel_rapl_msr intel_rapl_common 
> intel_uncore_frequency_common skx_edac_common nfit libnvdimm kvm_intel vfat 
> fat kvm snd_pcm irqbypass rapl iTCO_wdt snd_timer intel_pmc_bxt 
> iTCO_vendor_support snd ixgbevf virtio_net soundcore i2c_i801 pcspkr 
> libeth_xdp net_failover i2c_smbus lpc_ich failover libeth virtio_balloon 
> joydev 9p fuse loop zram lz4hc_compress lz4_compress 9pnet_virtio 9pnet netfs 
> ghash_clmulni_intel serio_raw qemu_fw_cfg
> [ 7340.449650] ---[ end trace 0000000000000000 ]---
> 
> The issue can be fixed in all in-tree drivers, but we cannot just trust OOT
> drivers to not do this. Therefore, make tailroom a signed int and produce a
> warning when it is negative to prevent such mistakes in the future.
> 
> The issue can also be easily reproduced with ice driver, by applying
> the following diff to xskxceiver and enjoying a kernel panic in xdpdrv mode:
> 
> diff --git a/tools/testing/selftests/bpf/prog_tests/test_xsk.c 
> b/tools/testing/selftests/bpf/prog_tests/test_xsk.c
> index 5af28f359cfd..042d587fa7ef 100644
> --- a/tools/testing/selftests/bpf/prog_tests/test_xsk.c
> +++ b/tools/testing/selftests/bpf/prog_tests/test_xsk.c
> @@ -2541,8 +2541,8 @@ int testapp_adjust_tail_grow_mb(struct test_spec *test)
>  {
>         test->mtu = MAX_ETH_JUMBO_SIZE;
>         /* Grow by (frag_size - last_frag_Size) - 1 to stay inside the last 
> fragment */
> -       return testapp_adjust_tail(test, (XSK_UMEM__MAX_FRAME_SIZE / 2) - 1,
> -                                  XSK_UMEM__LARGE_FRAME_SIZE * 2);
> +       return testapp_adjust_tail(test, XSK_UMEM__MAX_FRAME_SIZE * 100,
> +                                  6912);
>  }
> 
>  int testapp_tx_queue_consumer(struct test_spec *test)
> 
> If we print out the values involved in the tailroom calculation:
> 
> tailroom = rxq->frag_size - skb_frag_size(frag) - skb_frag_off(frag);
> 
> 4294967040 = 3456 - 3456 - 256
> 
> I personally reproduced and verified the issue in ice and i40e,
> aside from WiP ixgbevf implementation.

May I ask what was the testing approach against ice on your side?  When I
run test_xsk.sh against tree with your series applied, I get a panic shown
below [1]. This comes from a test that modifies descriptor count on rings
and the trick is that it might be passing when running as a standalone
test but in the test suite it causes problems. It comes from a fact that
we copy xdp_rxq between old and new ice_rx_ring, core sees the xdp_rxq
already registered, does unregister by itself but it bails out on
page_pool pointer being invalid (as these two xdp_rxqs pointed to same pp
and it got destroyed). So small diff below [0] allows me to go through
xskxceiver test suite executed from test_xsk.sh.

Thanks,
MF

[0]:
diff --git a/drivers/net/ethernet/intel/ice/ice_ethtool.c 
b/drivers/net/ethernet/intel/ice/ice_ethtool.c
index 969d4f8f9c02..06986adb2005 100644
--- a/drivers/net/ethernet/intel/ice/ice_ethtool.c
+++ b/drivers/net/ethernet/intel/ice/ice_ethtool.c
@@ -3328,6 +3328,7 @@ ice_set_ringparam(struct net_device *netdev, struct 
ethtool_ringparam *ring,
                rx_rings[i].cached_phctime = pf->ptp.cached_phc_time;
                rx_rings[i].desc = NULL;
                rx_rings[i].xdp_buf = NULL;
+               xdp_rxq_info_unreg(&rx_rings[i].xdp_rxq);
 
                /* this is to allow wr32 to have something to write to
                 * during early allocation of Rx buffers

[1]:
[ 2596.560462] BUG: kernel NULL pointer dereference, address: 0000000000000008
[ 2596.568466] #PF: supervisor read access in kernel mode
[ 2596.574686] #PF: error_code(0x0000) - not-present page
[ 2596.580942] PGD 118694067 P4D 0
[ 2596.585322] Oops: Oops: 0000 [#1] SMP NOPTI
[ 2596.590694] CPU: 2 UID: 0 PID: 5117 Comm: xskxceiver Tainted: G    B   W  O  
      6.19.0+ #198 PREEMPT(full)
[ 2596.602065] Tainted: [B]=BAD_PAGE, [W]=WARN, [O]=OOT_MODULE
[ 2596.609049] Hardware name: Intel Corporation M50CYP2SBSTD/M50CYP2SBSTD, BIOS 
SE5C620.86B.01.01.0004.2110190142 10/19/2021
[ 2596.621632] RIP: 0010:xdp_unreg_mem_model+0x86/0xc0
[ 2596.628195] Code: 0f 44 d7 f6 c2 01 75 37 41 0f b7 4c 24 16 48 89 ce 48 f7 
de 3b 5c 32 04 75 1d 48 89 d3 48 29 cb 48 85 d2 74 2f e8 9a 9e 4c ff <48> 8b 7b 
08 5b 5d 41 5c e9 6d eb 00 00 48 8b 12 f6 c2 01 74 d5 48
[ 2596.650847] RSP: 0018:ffa000001ffe3a90 EFLAGS: 00010246
[ 2596.658128] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ff1100808e308ea1
[ 2596.667403] RDX: ff1100808e308ea1 RSI: 00000000000001cc RDI: ff11000130150000
[ 2596.676719] RBP: 0000000000000000 R08: 0000000000001000 R09: 0000000000000001
[ 2596.686060] R10: ff1100011084a2c0 R11: 0000000000000000 R12: ff1100011541ce40
[ 2596.695445] R13: 0000000000001000 R14: 0000000000000000 R15: 0000000000000000
[ 2596.704866] FS:  00007f6044013c40(0000) GS:ff11007efbb1b000(0000) 
knlGS:0000000000000000
[ 2596.715336] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2596.723510] CR2: 0000000000000008 CR3: 00000001e9052004 CR4: 0000000000773ef0
[ 2596.733162] PKRU: 55555554
[ 2596.738407] Call Trace:
[ 2596.743398]  <TASK>
[ 2596.748045]  __xdp_rxq_info_reg+0xb7/0xf0
[ 2596.755108]  ice_vsi_cfg_rxq+0x668/0x6b0 [ice]
[ 2596.762499]  ice_vsi_cfg_rxqs+0x29/0x80 [ice]
[ 2596.769555]  ice_up+0xe/0x30 [ice]
[ 2596.775673]  ice_set_ringparam+0x662/0x7e0 [ice]
[ 2596.783066]  ethtool_set_ringparam+0xb3/0x110
[ 2596.790189]  __dev_ethtool+0x1200/0x2d90
[ 2596.796916]  ? update_se+0xc1/0x120
[ 2596.803224]  ? update_load_avg+0x73/0x220
[ 2596.810079]  ? xas_load+0x9/0xc0
[ 2596.816172]  ? xa_load+0x71/0xb0
[ 2596.822273]  ? avc_has_extended_perms+0xcf/0x4a0
[ 2596.829822]  ? __kmalloc_cache_noprof+0x11a/0x400
[ 2596.837493]  dev_ethtool+0xa6/0x170
[ 2596.843976]  dev_ioctl+0x2d9/0x510
[ 2596.850388]  sock_do_ioctl+0xa8/0x110
[ 2596.857078]  sock_ioctl+0x234/0x320
[ 2596.863614]  __x64_sys_ioctl+0x92/0xe0
[ 2596.870444]  do_syscall_64+0xa4/0xc80
[ 2596.877212]  entry_SYSCALL_64_after_hwframe+0x71/0x79
[ 2596.885426] RIP: 0033:0x7f6043f24e1d
[ 2596.892186] Code: 04 25 28 00 00 00 48 89 45 c8 31 c0 48 8d 45 10 c7 45 b0 
10 00 00 00 48 89 45 b8 48 8d 45 d0 48 89 45 c0 b8 10 00 00 00 0f 05 <89> c2 3d 
00 f0 ff ff 77 1a 48 8b 45 c8 64 48 2b 04 25 28 00 00 00
[ 2596.917864] RSP: 002b:00007ffd329f5e50 EFLAGS: 00000246 ORIG_RAX: 
0000000000000010
[ 2596.929028] RAX: ffffffffffffffda RBX: 00007ffd329f6208 RCX: 00007f6043f24e1d
[ 2596.939757] RDX: 00007ffd329f5ed0 RSI: 0000000000008946 RDI: 0000000000000013
[ 2596.950460] RBP: 00007ffd329f5ea0 R08: 0000000000000000 R09: 0000000000000007
[ 2596.961597] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000005
[ 2596.972256] R13: 0000000000000000 R14: 000055a88e016338 R15: 00007f6044100000
[ 2596.982917]  </TASK>
[ 2596.988534] Modules linked in: ice(O) ipmi_ssif 8021q garp stp mrp llc 
intel_rapl_msr intel_rapl_common x86_pkg_temp_thermal intel_powerclamp coretemp 
nls_iso8859_1 kvm_intel kvm irqbypass mei_me ioatdma mei wmi dca ipmi_si 
ipmi_msghandler acpi_power_meter acpi_pad input_leds hid_generic 
ghash_clmulni_intel idpf i40e libeth_xdp libeth ahci libie libie_fwlog 
libie_adminq libahci aesni_intel gf128mul [last unloaded: irdma]
[ 2597.040161] CR2: 0000000000000008
[ 2597.046911] ---[ end trace 0000000000000000 ]---
[ 2597.117161] RIP: 0010:xdp_unreg_mem_model+0x86/0xc0
[ 2597.125432] Code: 0f 44 d7 f6 c2 01 75 37 41 0f b7 4c 24 16 48 89 ce 48 f7 
de 3b 5c 32 04 75 1d 48 89 d3 48 29 cb 48 85 d2 74 2f e8 9a 9e 4c ff <48> 8b 7b 
08 5b 5d 41 5c e9 6d eb 00 00 48 8b 12 f6 c2 01 74 d5 48
[ 2597.151379] RSP: 0018:ffa000001ffe3a90 EFLAGS: 00010246
[ 2597.160243] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ff1100808e308ea1
[ 2597.171798] RDX: ff1100808e308ea1 RSI: 00000000000001cc RDI: ff11000130150000
[ 2597.182587] RBP: 0000000000000000 R08: 0000000000001000 R09: 0000000000000001
[ 2597.193333] R10: ff1100011084a2c0 R11: 0000000000000000 R12: ff1100011541ce40
[ 2597.204055] R13: 0000000000001000 R14: 0000000000000000 R15: 0000000000000000
[ 2597.214732] FS:  00007f6044013c40(0000) GS:ff11007efbb1b000(0000) 
knlGS:0000000000000000
[ 2597.226440] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2597.235842] CR2: 0000000000000008 CR3: 00000001e9052004 CR4: 0000000000773ef0
[ 2597.246692] PKRU: 55555554
[ 2597.253088] note: xskxceiver[5117] exited with irqs disabled

> 
> v3->v2:
> * unregister XDP RxQ info for subfunction in ice
> * remove rx_buf_len variable in ice
> * add missing ifdefed empty definition xsk_pool_get_rx_frag_step()
> * move xsk_pool_get_rx_frag_step() call from idpf to libeth
> * simplify conditions when determining frag_size in idpf
> * correctly init xdp_frame_sz for non-main VSI in i40e
> 
> v1->v2:
> * add modulo to calculate offset within chunk
> * add helper for AF_XDP ZC queues
> * fix the problem in ZC mode in i40e, ice and idpf
> * verify solution in i40e
> * fix RxQ info registering in i40e
> * fix splitq handling in idpf
> * do not use word truesize unless the value used is named trusize
> 
> Larysa Zaremba (9):
>   xdp: use modulo operation to calculate XDP frag tailroom
>   xsk: introduce helper to determine rxq->frag_size
>   ice: fix rxq info registering in mbuf packets
>   ice: change XDP RxQ frag_size from DMA write length to xdp.frame_sz
>   i40e: fix registering XDP RxQ info
>   i40e: use xdp.frame_sz as XDP RxQ info frag_size
>   libeth, idpf: use truesize as XDP RxQ info frag_size
>   net: enetc: use truesize as XDP RxQ info frag_size
>   xdp: produce a warning when calculated tailroom is negative
> 
>  drivers/net/ethernet/freescale/enetc/enetc.c |  2 +-
>  drivers/net/ethernet/intel/i40e/i40e_main.c  | 40 +++++++++++---------
>  drivers/net/ethernet/intel/i40e/i40e_txrx.c  |  5 ++-
>  drivers/net/ethernet/intel/ice/ice_base.c    | 33 +++++-----------
>  drivers/net/ethernet/intel/ice/ice_txrx.c    |  3 +-
>  drivers/net/ethernet/intel/ice/ice_xsk.c     |  3 ++
>  drivers/net/ethernet/intel/idpf/xdp.c        |  6 ++-
>  drivers/net/ethernet/intel/idpf/xsk.c        |  1 +
>  drivers/net/ethernet/intel/libeth/xsk.c      |  1 +
>  include/net/libeth/xsk.h                     |  3 ++
>  include/net/xdp_sock_drv.h                   | 10 +++++
>  net/core/filter.c                            |  6 ++-
>  12 files changed, 66 insertions(+), 47 deletions(-)
> 
> -- 
> 2.52.0
> 

Reply via email to