GPU fault 146 0x0e903e0c VM fault read from VCE0 on Radeon Pro WX 2100 - possible driver bug or configuration error?

Joshua M. Boniface Sun, 06 May 2018 20:49:54 -0700

Hello List!

I'm encountering a strange potential bug with my brand-new Radeon Pro WX 2100 
when passed through to a VM in VFIO mode. This might be a driver bug, or a 
misconfiguration, but I'm looking for any advice you can offer!


The basic system setup is:

* Hypervisor is Debian 9.X running on a Dell C6100 blade (Intel E5649 CPU)
* GPU PCI device passed through via VFIO to a KVM/QEMU virtual machine - I have 
set this up with other cards, specifically a Radeon R9 270X and a Radeon HD6450 
without trouble in the past.
* To work properly on my older CPUs, I'm setting "options vfio_iommu_type1 
allow_unsafe_interrupts=1" in my modprobe on the hypervisor.
* The VM is running Debian 9.X with the latest AMDGPU-PRO driver (which I 
understand is unsupported, but the drivers install fine, and the same problem 
happens with the open-source driver in the kernel as well)
* Inside the VM I've installed the standard VAAPI utilities to support 
transcode offloading on the GPU for ffmpeg; this configuration is completely 
headless aside from a virtual Cirrus display in the VM.
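
To confirm that the modprobe option above actually took effect on the 
hypervisor, the live parameter value can be read back from sysfs. This is a 
minimal sketch, assuming module parameters are exposed under 
/sys/module/<module>/parameters/; the helper name read_module_param is made up 
for illustration.

```python
from pathlib import Path

# Assumption: loaded modules expose their parameters under this sysfs directory.
SYSFS_MODULES = Path("/sys/module")


def read_module_param(module, param, root=SYSFS_MODULES):
    """Return the current value of a kernel module parameter, or None if absent."""
    path = Path(root) / module / "parameters" / param
    try:
        return path.read_text().strip()
    except OSError:
        # Module not loaded, parameter not exported, or not readable.
        return None


if __name__ == "__main__":
    value = read_module_param("vfio_iommu_type1", "allow_unsafe_interrupts")
    print(f"allow_unsafe_interrupts = {value}")
```

A result of None would suggest the module is not loaded or the option was not 
picked up from the modprobe configuration.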

First, and this may be related, I can get output from vainfo only if I 
manually export LIBVA_DRIVER_NAME=radeonsi so that the radeonsi driver is used. 
This was not necessary with my HD6450 (the driver was detected by default):

# uname -a
Linux transcoder1 4.9.0-6-amd64 #1 SMP Debian 4.9.88-1 (2018-04-29) x86_64 GNU/Linux
# vainfo
error: XDG_RUNTIME_DIR not set in the environment.
error: can't connect to X server!
libva info: VA-API version 0.39.4
libva info: va_getDriverName() returns -1
libva error: va_getDriverName() failed with unknown libva error,driver_name=(null)
vaInitialize failed with error code -1 (unknown libva error),exit
# export LIBVA_DRIVER_NAME=radeonsi
# vainfo
error: XDG_RUNTIME_DIR not set in the environment.
error: can't connect to X server!
libva info: VA-API version 0.39.4
libva info: va_getDriverName() returns -1
libva info: User requested driver 'radeonsi'
libva info: Trying to open /usr/lib/x86_64-linux-gnu/dri/radeonsi_drv_video.so
libva info: Found init function __vaDriverInit_0_39
libva info: va_openDriver() returns 0
vainfo: VA-API version: 0.39 (libva 1.7.3)
vainfo: Driver version: mesa gallium vaapi
vainfo: Supported profile and entrypoints
      VAProfileMPEG2Simple            : VAEntrypointVLD
      VAProfileMPEG2Main              : VAEntrypointVLD
      VAProfileVC1Simple              : VAEntrypointVLD
      VAProfileVC1Main                : VAEntrypointVLD
      VAProfileVC1Advanced            : VAEntrypointVLD
      VAProfileH264ConstrainedBaseline: VAEntrypointVLD
      VAProfileH264ConstrainedBaseline: VAEntrypointEncSlice
      VAProfileH264Main               : VAEntrypointVLD
      VAProfileH264Main               : VAEntrypointEncSlice
      VAProfileH264High               : VAEntrypointVLD
      VAProfileH264High               : VAEntrypointEncSlice
      VAProfileHEVCMain               : VAEntrypointVLD
      VAProfileHEVCMain10             : VAEntrypointVLD
      VAProfileNone                   : VAEntrypointVideoProc
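
For comparing this listing between cards or driver versions, the profile lines 
can be grouped by profile with a short script. This is only a sketch: the 
function names are hypothetical, the regular expression assumes the 
"VAProfile... : VAEntrypoint..." layout shown above, and SAMPLE is an excerpt of 
that output.

```python
import re
from collections import defaultdict

# Matches lines such as "      VAProfileH264Main               : VAEntrypointVLD"
PROFILE_LINE = re.compile(r"^\s*(VAProfile\w+)\s*:\s*(VAEntrypoint\w+)\s*$")


def parse_vainfo_profiles(text):
    """Map each VA profile name to the list of entrypoints reported for it."""
    profiles = defaultdict(list)
    for line in text.splitlines():
        match = PROFILE_LINE.match(line)
        if match:
            profile, entrypoint = match.groups()
            profiles[profile].append(entrypoint)
    return dict(profiles)


def can_encode(profiles, profile):
    """True if the profile lists the slice-encode entrypoint."""
    return "VAEntrypointEncSlice" in profiles.get(profile, [])


# Excerpt of the vainfo output quoted above.
SAMPLE = """\
vainfo: Supported profile and entrypoints
      VAProfileH264Main               : VAEntrypointVLD
      VAProfileH264Main               : VAEntrypointEncSlice
      VAProfileHEVCMain               : VAEntrypointVLD
      VAProfileHEVCMain10             : VAEntrypointVLD
"""

if __name__ == "__main__":
    for name, entrypoints in parse_vainfo_profiles(SAMPLE).items():
        print(name, entrypoints)
```

In the listing above, the H.264 profiles report both VLD (decode) and EncSlice 
(encode) entrypoints, while the HEVC profiles report VLD only.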

This setup doesn't crash when playing a 1080p x264 video file via ffmpeg; 
however, the output is badly corrupted (wrong colours, encoding failures, etc.), 
so there seems to be a more general problem. The moment I try to decode a 4K 
HEVC video, the GPU crashes with the following errors:

[  175.464769] amdgpu 0000:00:06.0: GPU fault detected: 146 0x0e92be14
[  175.466660] amdgpu 0000:00:06.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x001011D2
[  175.468868] amdgpu 0000:00:06.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x030BE014
[  175.471056] amdgpu 0000:00:06.0: VM fault (0x14, vmid 1) at page 1053138, write from 'VCE0' (0x56434530) (190)
[  175.473920] amdgpu 0000:00:06.0: GPU fault detected: 146 0x0e92be14
[  175.475704] amdgpu 0000:00:06.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x001011D4
[  175.477808] amdgpu 0000:00:06.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x030BE014
[  175.479900] amdgpu 0000:00:06.0: VM fault (0x14, vmid 1) at page 1053140, write from 'VCE0' (0x56434530) (190)
[  175.517997] amdgpu 0000:00:06.0: GPU fault detected: 146 0x0e903e0c
[  175.519798] amdgpu 0000:00:06.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x001011D2
[  175.522011] amdgpu 0000:00:06.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0203E00C
[  175.524224] amdgpu 0000:00:06.0: VM fault (0x0c, vmid 1) at page 1053138, read from 'VCE0' (0x56434530) (62)
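
As a reading aid, the summary "VM fault" lines can be reproduced from the two 
register values printed before them. The sketch below is an assumption, not 
taken from register documentation: the bit positions were chosen because they 
reproduce every value in the log above (page, protection code, vmid, 
read/write, and the trailing client ID), and the function names are 
hypothetical.

```python
def decode_fault(addr, status):
    """Split the FAULT_ADDR / FAULT_STATUS values into the fields dmesg prints.

    Assumed layout (inferred from the log, not from a register spec):
      bits 0-7   protection code
      bits 12-20 memory client ID
      bit  24    1 = write, 0 = read
      bits 25-28 vmid
    """
    return {
        "page": addr,  # the ADDR register already holds the page number
        "protections": status & 0xFF,
        "mc_id": (status >> 12) & 0x1FF,
        "access": "write" if (status >> 24) & 0x1 else "read",
        "vmid": (status >> 25) & 0xF,
    }


def client_name(tag):
    """Interpret the 32-bit client tag as four ASCII characters."""
    return tag.to_bytes(4, "big").decode("ascii")


if __name__ == "__main__":
    print(decode_fault(0x001011D2, 0x030BE014))
    print(decode_fault(0x001011D2, 0x0203E00C))
    print(client_name(0x56434530))
```

Under these assumptions, the first and third faults hit the same page 
(0x001011D2 is 1053138 in decimal), once as a write and once as a read, and 
0x56434530 is simply the ASCII string "VCE0".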

After this point, the entire hypervisor host needs to be rebooted to return the 
GPU to a "working" state (i.e. so it won't lock up the hypervisor if I reboot 
the VM). And if I leave it long enough, eventually the hypervisor will simply 
lock up completely.

I've encountered this bug with the latest open-source AMDGPU driver in 
Linux kernels 4.17-rc3 and 4.15, as well as with the AMDGPU-PRO driver on Linux 
kernel 4.9 inside the VM as demonstrated above; the crash message is identical 
in every case. Trying VAAPI drivers other than radeonsi has no effect, and the 
r600 driver is in fact far worse, throwing dozens of VM faults instead of the 
three seen above.

I'm at a loss to determine what could possibly be wrong here as I've tried 
tweaking almost everything I could think of based on the advice I've been able 
to find online so far, which is sparse.

I'm willing to provide any further info which may help, especially regarding 
the passthrough, and any advice anyone could give would be helpful!

Joshua M. Boniface
Linux System Ærchitect - Boniface Labs
Sigmentation fault: core dumped
_______________________________________________
amd-gfx mailing list
[email protected]
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
