Am 08.11.24 um 12:26 schrieb Lazar, Lijo:
On 11/8/2024 4:29 PM, Liu, Monk wrote:
[AMD Official Use Only - AMD Internal Distribution Only]
To be clear for the mb() approach: Even if we insert mb() in prior to
amdgpu_ring_set_wptr(ring), GPU might still fetch stalled data from PQ due to
USWC attributes.
Inserting an mb() doesn't cause WC buffer flush is a wrong assumption.
All prior loads/stores are supposed to be globally visible. Hence mb()
followed by a write pointer update also should guarantee the same (From
Arch manual).
The MFENCE instruction establishes a memory fence for both loads and
stores. The processor ensures that no load
or store after MFENCE will become globally visible *until all loads and
stores before MFENCE are globally visible.*
Ignoring the amdgpu driver part of it - if mb() is not working as
expected as you claim that means something is wrong with the system.
USWC or WB for ring type may still be a separate discussion.
Yeah completely agree. Additional to that the combine behavior between
USWC and WB is the same, so the argumentation is clearly not correct.
What happens instead is that the broken KVM patch applies the guest
caching attributes to the ring buffer instead of the host ones.
Making an educated guess I would say that the root of the problem is
that the guest is using some incorrect caching attributes for some
reason, which is exactly the same reason why that KVM patch was revert
upstream as well.
Regards,
Christian.
Thanks,
Lijo
The issue here is not the re-ordering but the stalled PQ.
Monk Liu | Cloud GPU & Virtualization | AMD
-----Original Message-----
From: Liu, Monk
Sent: Friday, November 8, 2024 6:22 PM
To: Koenig, Christian <[email protected]>; Lazar, Lijo <[email protected]>; Alex
Deucher <[email protected]>; Zhao, Victor <[email protected]>
Cc: [email protected]; Yang, Philip <[email protected]>; Kuehling,
Felix <[email protected]>
Subject: RE: [PATCH 2/2] drm/amdkfd: use cache GTT buffer for PQ and wb pool
Christian/Lijo
We verified all approaches, and we know what works and not works, obviously the
mb() doesn't work.
The way of mb() between set wptr_polling and kicking off doorbell is
theoretically correct, and I'm not object to do so... but this won't resolve
the issue we hit.
First of all, USWC will have some chance that the data is still in CPU's WC
storage place and not flushed to the memory and even with MB() get rid of
re-ordering GPU might still have a chance to read stalled data from ring buffer
as long as it is mapped USWC.
This is why only cache plus snoop memory can get rid of inconsistence issues.
Besides, do you know what's the rational to keep all GFX KCQ cache+snoop but
only HIQ/KIQ from KFD configured to USWC ?
For performance concern that looks to me always the second priority compared to
"correct" especially under the case HIQ contributes very little to ROCM
performance when switching to cache mapping.
Monk Liu | Cloud GPU & Virtualization | AMD
-----Original Message-----
From: Koenig, Christian <[email protected]>
Sent: Thursday, November 7, 2024 7:57 PM
To: Lazar, Lijo <[email protected]>; Alex Deucher <[email protected]>; Zhao,
Victor <[email protected]>
Cc: [email protected]; Liu, Monk <[email protected]>; Yang, Philip
<[email protected]>; Kuehling, Felix <[email protected]>
Subject: Re: [PATCH 2/2] drm/amdkfd: use cache GTT buffer for PQ and wb pool
Am 07.11.24 um 06:58 schrieb Lazar, Lijo:
On 11/6/2024 8:42 PM, Alex Deucher wrote:
On Wed, Nov 6, 2024 at 1:49 AM Victor Zhao <[email protected]> wrote:
From: Monk Liu <[email protected]>
As cache GTT buffer is snooped, this way the coherence between CPU
write and GPU fetch is guaranteed, but original code uses WC +
unsnooped for HIQ PQ(ring buffer) which introduces coherency issues:
MEC fetches a stall data from PQ and leads to MEC hang.
Can you elaborate on this? I can see CPU reads being slower because
the memory is uncached, but the ring buffer is mostly writes anyway.
IIRC, the driver uses USWC for most if not all of the other ring
buffers managed by the kernel. Why aren't those a problem?
We have this on other rings -
mb();
amdgpu_ring_set_wptr(ring);
I think the solution should be to use barrier before write pointer
updates rather than relying on PCIe snooping.
Yeah, completely agree as well. The barrier also takes care of preventing the
compiler from re-ordering writes.
Regards,
Christian.
Thanks,
Lijo
Alex
Signed-off-by: Monk Liu <[email protected]>
---
drivers/gpu/drm/amd/amdkfd/kfd_device.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
index 1f1d79ac5e6c..fb087a0ff5bc 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
@@ -779,7 +779,7 @@ bool kgd2kfd_device_init(struct kfd_dev *kfd,
if (amdgpu_amdkfd_alloc_gtt_mem(
kfd->adev, size, &kfd->gtt_mem,
&kfd->gtt_start_gpu_addr, &kfd->gtt_start_cpu_ptr,
- false, true)) {
+ false, false)) {
dev_err(kfd_device, "Could not allocate %d bytes\n", size);
goto alloc_gtt_mem_failure;
}
--
2.34.1