[AMD Official Use Only]
Yes, it is a sync issue b/w SVMRange unmap deferred list and get-attribute
call. There is time slot when Process vma has been unmapped from CPU side, but
prange is not removed in deferred list. If get-attribute is called in this
time slot, then there will be a problem.
This issue happens when KFDSVMRangeTest,SetGetAttributesTest runs after
KFDSVMRangeTest.BasicSystemMemTest test.
TEST_F(KFDSVMRangeTest, SetGetAttributesTest) {
TEST_REQUIRE_ENV_CAPABILITIES(ENVCAPS_64BITLINUX);
TEST_START(TESTPROFILE_RUNALL)
......
HSAint32 enable = -1;
EXPECT_SUCCESS(hsaKmtGetXNACKMode(&enable));
expectedDefaultResults[4] =
(enable)?HSA_SVM_ATTR_ACCESS:HSA_SVM_ATTR_NO_ACCESS;
sysBuffer = new HsaSVMRange(BufSize); // It returns the same addr as the
last test case since the addr is already munmap for CPU.
char *pBuf = sysBuffer->As<char *>();
LOG() << "Get default atrributes" << std::endl;
memcpy(outputAttributes, inputAttributes, nAttributes *
sizeof(HSA_SVM_ATTRIBUTE));
EXPECT_SUCCESS(hsaKmtSVMGetAttr(pBuf, BufSize, // This is the problematic
get-attribute. It returns the the prange attributes in the last test case since
prange is not removed.
nAttributes, outputAttributes));
sysBufer address is an unregistered SVM range. Thus hsaKmtSVMGetAttr should
return the default attributes. But there are some corner cases, deferred list
from last test case is not finished when get attribute is called, so svm range
list remains the contents of the last test. In such cases, hsaKmtSVMGetAttr
returns the attributes in the last test instead of default values. (The two
test cases are in the same process, mmap in HsaSVMRange constructor return the
same addr) and makes test case fail.
-----Original Message-----
From: Kuehling, Felix <[email protected]>
Sent: Tuesday, August 10, 2021 11:57 PM
To: Zhang, Yifan <[email protected]>; [email protected]
Subject: Re: [PATCH] drm/amdkfd: fix random
KFDSVMRangeTest.SetGetAttributesTest test failure
Am 2021-08-10 um 12:43 a.m. schrieb Yifan Zhang:
> KFDSVMRangeTest.SetGetAttributesTest randomly fails in stress test.
>
> Note: Google Test filter = KFDSVMRangeTest.* [==========] Running 18
> tests from 1 test case.
> [----------] Global test environment set-up.
> [----------] 18 tests from KFDSVMRangeTest
> [ RUN ] KFDSVMRangeTest.BasicSystemMemTest
> [ OK ] KFDSVMRangeTest.BasicSystemMemTest (30 ms)
> [ RUN ] KFDSVMRangeTest.SetGetAttributesTest
> [ ] Get default atrributes
> /home/yifan/brahma/libhsakmt/tests/kfdtest/src/KFDSVMRangeTest.cpp:154
> : Failure Value of: expectedDefaultResults[i]
> Actual: 4294967295
> Expected: outputAttributes[i].value
> Which is: 0
> /home/yifan/brahma/libhsakmt/tests/kfdtest/src/KFDSVMRangeTest.cpp:154
> : Failure Value of: expectedDefaultResults[i]
> Actual: 4294967295
> Expected: outputAttributes[i].value
> Which is: 0
> /home/yifan/brahma/libhsakmt/tests/kfdtest/src/KFDSVMRangeTest.cpp:152
> : Failure Value of: expectedDefaultResults[i]
> Actual: 4
> Expected: outputAttributes[i].type
> Which is: 2
> [ ] Setting/Getting atrributes
> [ FAILED ]
>
> the root cause is that svm work queue has not finished when
> svm_range_get_attr is called, thus some garbage svm interval tree data
> make svm_range_get_attr get wrong result. Flush work queue before iterate svm
> interval tree.
>
> Signed-off-by: Yifan Zhang <[email protected]>
> ---
> drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
> b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
> index f811a3a24cd2..192e9401bed5 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
> @@ -3072,6 +3072,9 @@ svm_range_get_attr(struct kfd_process *p, uint64_t
> start, uint64_t size,
> pr_debug("svms 0x%p [0x%llx 0x%llx] nattr 0x%x\n", &p->svms, start,
> start + size - 1, nattr);
>
> + /* flush pending deferred work */
> + flush_work(&p->svms.deferred_list_work);
> +
There is still a race condition here. More work can be added to the
deferred_list_work after the flush call.
Work gets added to the deferred_list asynchronously, for example in MMU
notifiers. Trying to synchronize with asynchronous events is inherently
problematic. It appears that the test is making some assumptions about things
happening asynchronously (page faults or MMU notifiers) and that's probably a
problem with the test, not with the driver.
Alternatively, there may be a problem with a set-attribute call that leaves
some operations on the deferred list and results in unexpected get-attribute
results. If that's the problem, we may need to add a flush-call to the end of
the set-attributes function.
Can you provide more details about the exact sequence of set-attribute and
get-attribute calls that is causing the problem?
Regards,
Felix
> mmap_read_lock(mm);
> r = svm_range_is_valid(p, start, size);
> mmap_read_unlock(mm);