Donet Tom <[email protected]> writes:

> Currently, AMDGPU_VA_RESERVED_TRAP_SIZE is hardcoded to 8KB, while
> KFD_CWSR_TBA_TMA_SIZE is defined as 2 * PAGE_SIZE. On systems with
> 4K pages, both values match (8KB), so allocation and reserved space
> are consistent.
>
> However, on 64K page-size systems, KFD_CWSR_TBA_TMA_SIZE becomes 128KB,
> while the reserved trap area remains 8KB. This mismatch causes the
> kernel to crash when running rocminfo or rccl unit tests.
>


#define AMDGPU_VA_RESERVED_TRAP_SIZE            (2ULL << 12)
#define AMDGPU_VA_RESERVED_TRAP_START(adev)     
(AMDGPU_VA_RESERVED_SEQ64_START(adev) \
                                                 - AMDGPU_VA_RESERVED_TRAP_SIZE)
#define AMDGPU_VA_RESERVED_BOTTOM               (1ULL << 16)
#define AMDGPU_VA_RESERVED_TOP                  (AMDGPU_VA_RESERVED_TRAP_SIZE + 
\
                                                 AMDGPU_VA_RESERVED_SEQ64_SIZE 
+ \
                                                 AMDGPU_VA_RESERVED_CSA_SIZE)

#define AMDGPU_VA_RESERVED_TRAP_START(adev)     
(AMDGPU_VA_RESERVED_SEQ64_START(adev) \
                                                 - AMDGPU_VA_RESERVED_TRAP_SIZE)

In kfd_init_apertures_v9()...

        /*
         * Place TBA/TMA on opposite side of VM hole to prevent
         * stray faults from triggering SVM on these pages.
         */
        pdd->qpd.cwsr_base = AMDGPU_VA_RESERVED_TRAP_START(pdd->dev->adev);


& In  kfd_process_device_init_cwsr_dgpu()...

        /* cwsr_base is only set for dGPU */
        ret = kfd_process_alloc_gpuvm(pdd, qpd->cwsr_base,
                                      KFD_CWSR_TBA_TMA_SIZE, flags, &mem, 
&kaddr);


This shows that it expects KFD_CWSW_TBA_TMA_SIZE (2 * PAGE_SIZE) size of
region, from cwsr_base. However the AMDGPU_VA_RESERVED_TRAP_SIZE only
reserves 8KB. This would work on 4K pagesize systems but on non-4K
pagesize (say 64K), this would fail, since it could overflow into the
SEQ64 region.

Hence the fix in this looks right to me. Although I am not an expert on
the amd gpu driver side, so I would let the experts review this as well.

But FWIW - 

Reviewed-by: Ritesh Harjani (IBM) <[email protected]>


> Kernel attempted to read user page (2) - exploit attempt? (uid: 1001)
> BUG: Kernel NULL pointer dereference on read at 0x00000002
> Faulting instruction address: 0xc0000000002c8a64
> Oops: Kernel access of bad area, sig: 11 [#1]
> LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries
> CPU: 34 UID: 1001 PID: 9379 Comm: rocminfo Tainted: G E
> 6.19.0-rc4-amdgpu-00320-gf23176405700 #56 VOLUNTARY
> Tainted: [E]=UNSIGNED_MODULE
> Hardware name: IBM,9105-42A POWER10 (architected) 0x800200 0xf000006
> of:IBM,FW1060.30 (ML1060_896) hv:phyp pSeries
> NIP:  c0000000002c8a64 LR: c00000000125dbc8 CTR: c00000000125e730
> REGS: c0000001e0957580 TRAP: 0300 Tainted: G E
> MSR:  8000000000009033 <SF,EE,ME,IR,DR,RI,LE> CR: 24008268
> XER: 00000036
> CFAR: c00000000125dbc4 DAR: 0000000000000002 DSISR: 40000000
> IRQMASK: 1
> GPR00: c00000000125d908 c0000001e0957820 c0000000016e8100
> c00000013d814540
> GPR04: 0000000000000002 c00000013d814550 0000000000000045
> 0000000000000000
> GPR08: c00000013444d000 c00000013d814538 c00000013d814538
> 0000000084002268
> GPR12: c00000000125e730 c000007e2ffd5f00 ffffffffffffffff
> 0000000000020000
> GPR16: 0000000000000000 0000000000000002 c00000015f653000
> 0000000000000000
> GPR20: c000000138662400 c00000013d814540 0000000000000000
> c00000013d814500
> GPR24: 0000000000000000 0000000000000002 c0000001e0957888
> c0000001e0957878
> GPR28: c00000013d814548 0000000000000000 c00000013d814540
> c0000001e0957888
> NIP [c0000000002c8a64] __mutex_add_waiter+0x24/0xc0
> LR [c00000000125dbc8] __mutex_lock.constprop.0+0x318/0xd00
> Call Trace:
> 0xc0000001e0957890 (unreliable)
> __mutex_lock.constprop.0+0x58/0xd00
> amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu+0x6fc/0xb60 [amdgpu]
> kfd_process_alloc_gpuvm+0x54/0x1f0 [amdgpu]
> kfd_process_device_init_cwsr_dgpu+0xa4/0x1a0 [amdgpu]
> kfd_process_device_init_vm+0xd8/0x2e0 [amdgpu]
> kfd_ioctl_acquire_vm+0xd0/0x130 [amdgpu]
> kfd_ioctl+0x514/0x670 [amdgpu]
> sys_ioctl+0x134/0x180
> system_call_exception+0x114/0x300
> system_call_vectored_common+0x15c/0x2ec
>
> This patch changes AMDGPU_VA_RESERVED_TRAP_SIZE to 2 * PAGE_SIZE,
> ensuring that the reserved trap area matches the allocation size
> across all page sizes.
>
> cc: [email protected]

Cc: makes sense. So that the older kernel versions would get this fix too!

> Fixes: 34a1de0f7935 ("drm/amdkfd: Relocate TBA/TMA to opposite side of VM 
> hole")
> Signed-off-by: Donet Tom <[email protected]>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
> index 139642eacdd0..a5eae49f9471 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
> @@ -173,7 +173,7 @@ struct amdgpu_bo_vm;
>  #define AMDGPU_VA_RESERVED_SEQ64_SIZE                (2ULL << 20)
>  #define AMDGPU_VA_RESERVED_SEQ64_START(adev) 
> (AMDGPU_VA_RESERVED_CSA_START(adev) \
>                                                - 
> AMDGPU_VA_RESERVED_SEQ64_SIZE)
> -#define AMDGPU_VA_RESERVED_TRAP_SIZE         (2ULL << 12)
> +#define AMDGPU_VA_RESERVED_TRAP_SIZE         (2ULL << PAGE_SHIFT)
>  #define AMDGPU_VA_RESERVED_TRAP_START(adev)  
> (AMDGPU_VA_RESERVED_SEQ64_START(adev) \
>                                                - AMDGPU_VA_RESERVED_TRAP_SIZE)
>  #define AMDGPU_VA_RESERVED_BOTTOM            (1ULL << 16)
> -- 
> 2.52.0

Reply via email to