On 20.11.2017 at 15:46, Michel Dänzer wrote:
On 20/11/17 02:20 PM, Christian König wrote:
On 20.11.2017 at 11:36, Michel Dänzer wrote:
On 18/11/17 03:31 PM, Christian König wrote:
On 17.11.2017 at 17:09, Michel Dänzer wrote:
On 17/11/17 11:28 AM, Christian König wrote:
Ping? Michel, Alex, can somebody take a look?
Patch 2 is

Reviewed-by: Michel Dänzer <[email protected]>


With patches 1 & 3, it's not 100% clear to me what the idea is behind
the handling of the hole on the kernel and userspace side. Maybe you can
add some explanation in code comments or the commit logs?
Yeah, that is actually a bit of a mess because the hardware
documentation wasn't very clear on how this works.

How about this as extra code comment on patch 1 to the assignment of
dev_info.virtual_address_max:

/*
 * Old userspace isn't aware of the VA hole on Vega10, so in theory a
 * client could get invalid VA addresses assigned. To fix this and keep
 * backward compatibility, we limit the VA space reported in this field
 * to the range below the hole.
 */

The last patch then reports the VA space above the hole, because that
is actually what libdrm should use.

The crux is that if I put the VA space above the hole directly into the
old fields, older versions of libdrm would break, and we can't have that.
That much was actually clear. :) Maybe some questions will expose what
I'm missing:


Patch 1:

@@ -880,14 +880,14 @@ static int amdgpu_cs_ib_vm_chunk(struct amdgpu_device *adev,
 		if (chunk->chunk_id != AMDGPU_CHUNK_ID_IB)
 			continue;

-		r = amdgpu_cs_find_mapping(p, chunk_ib->va_start,
-					   &aobj, &m);
+		va_start = chunk_ib->va_start & AMDGPU_VA_HOLE_MASK;
+		r = amdgpu_cs_find_mapping(p, va_start, &aobj, &m);
I don't understand how userspace can make use of the address space above
the hole, since AMDGPU_VA_HOLE_MASK is 0x0000ffffffffffffULL and
AMDGPU_VA_HOLE_END is 0xffff800000000000ULL, so masking with
AMDGPU_VA_HOLE_MASK always results in an address below or inside the
hole?
Yes, and exactly that is intended here.

See, for programming the MC and filling the page tables, you handle this
as if there isn't a hole in the address space.

Only when you start to use the address from userspace you will find that
you need to sign extend bit 47 into bits 48-63.

Applying the AMDGPU_VA_HOLE_MASK to an address masks out the upper 16
bits and so removes the sign extension again.


@@ -566,6 +566,17 @@ int amdgpu_gem_va_ioctl(struct drm_device *dev, void *data,
 		return -EINVAL;
 	}

+	if (args->va_address >= AMDGPU_VA_HOLE_START &&
+	    args->va_address < AMDGPU_VA_HOLE_END) {
+		dev_dbg(&dev->pdev->dev,
+			"va_address 0x%LX is in VA hole 0x%LX-0x%LX\n",
+			args->va_address, AMDGPU_VA_HOLE_START,
+			AMDGPU_VA_HOLE_END);
+		return -EINVAL;
+	}
+
+	args->va_address &= AMDGPU_VA_HOLE_MASK;
Ignoring the above, I think the masking with AMDGPU_VA_HOLE_MASK would
need to be done before checking for the hole, otherwise the masked
address could be inside the hole?
The masking just removes the sign extension which created the hole in
the first place.

The VM and MC are programmed as if there isn't a hole.

Patch 3:

@@ -577,10 +578,17 @@ static int amdgpu_info_ioctl(struct drm_device *dev, void *data, struct drm_file
 			dev_info.ids_flags |= AMDGPU_IDS_FLAGS_FUSION;
 		if (amdgpu_sriov_vf(adev))
 			dev_info.ids_flags |= AMDGPU_IDS_FLAGS_PREEMPTION;
+
+		vm_size = adev->vm_manager.max_pfn * AMDGPU_GPU_PAGE_SIZE;
 		dev_info.virtual_address_offset = AMDGPU_VA_RESERVED_SIZE;
 		dev_info.virtual_address_max =
-			min(adev->vm_manager.max_pfn * AMDGPU_GPU_PAGE_SIZE,
-			    AMDGPU_VA_HOLE_START);
+			min(vm_size, AMDGPU_VA_HOLE_START);
+
+		vm_size -= AMDGPU_VA_RESERVED_SIZE;
+		if (vm_size > AMDGPU_VA_HOLE_START) {
+			dev_info.high_va_offset = AMDGPU_VA_HOLE_END;
+			dev_info.high_va_max = AMDGPU_VA_HOLE_END | vm_size;
+		}
This mostly makes sense, except for the

         dev_info.high_va_max = AMDGPU_VA_HOLE_END | vm_size;

line. Can you explain how simply or'ing together the address of the end
of the hole and the VM size works correctly in all cases?

As an extreme example, with vm_size == 1, this value would compute as
0xffff800000000001ULL, which seems wrong or at least weird.
We check above whether vm_size is larger than AMDGPU_VA_HOLE_START, in
other words larger than 0x800000000000.

Since vm_size is once more in the value range the MC expects (without
the hole), OR'ing should give us the right result of sign extending bit
47 into bits 48-63, e.g. 0xffff800000000000.


And yes, that initially confused me as well, especially that you need to
program the MC and fill the page tables as if there weren't a hole in
the address space.

Applying "& AMDGPU_VA_HOLE_MASK" and "| AMDGPU_VA_HOLE_END" just
converts between the representation with and without the hole in it.
Thanks for the explanation, I get it now.

For the benefit of future readers of this code, it would be nice to have
a short version of this explanation as a comment, e.g. above the
AMDGPU_VA_HOLE_* defines.

I've added a comment and pushed the resulting patch with a CC: stable tag.

What about the libdrm side of the change? Anybody who could take a look?

Thanks,
Christian.
_______________________________________________
amd-gfx mailing list
[email protected]
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
