Hello List!
I'm encountering a strange potential bug with my brand-new Radeon Pro WX 2100
when passed through to a VM in VFIO mode. This might be a driver bug, or a
misconfiguration, but I'm looking for any advice you can offer!
The basic system setup is:
* Hypervisor is Debian 9.X running on a Dell C6100 blade (Intel E5649 CPU)
* GPU PCI device passed through via VFIO to a KVM/QEMU virtual machine - I have
set this up with other cards, specifically a Radeon R9 270X and a Radeon HD6450
without trouble in the past.
* To work properly on my older CPUs, I'm setting "options vfio_iommu_type1
allow_unsafe_interrupts=1" in my modprobe configuration on the hypervisor.
* The VM is running Debian 9.X with the latest AMDGPU-PRO driver (which I
understand is unsupported, but the drivers install fine, and the same problem
happens with the open-source driver in the kernel as well)
* Inside the VM I've installed the standard VAAPI utilities to support
transcode offloading on the GPU for ffmpeg; this configuration is completely
headless aside from a virtual Cirrus display in the VM.
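For anyone who wants to double-check the hypervisor side, here is a minimal sketch for confirming that the module option actually took effect. It assumes the usual sysfs location for module parameters; the helper name vfio_unsafe_irq_state and its optional path argument are hypothetical and exist only for illustration.

```shell
#!/bin/sh
# Print the current value of allow_unsafe_interrupts for vfio_iommu_type1.
# The optional argument overrides the sysfs path (hypothetical helper,
# the override exists only to make the function easy to exercise).
vfio_unsafe_irq_state() {
    param="${1:-/sys/module/vfio_iommu_type1/parameters/allow_unsafe_interrupts}"
    if [ -r "$param" ]; then
        # Boolean module parameters read back as Y or N.
        cat "$param"
    else
        echo "unavailable"
        return 1
    fi
}
```

Running vfio_unsafe_irq_state on the hypervisor should print Y if the module was loaded with the option set, and "unavailable" if the module is not loaded at all.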
First, and possibly related: I can only get output from vainfo if I manually
export radeonsi as the driver to use. This was not needed with my HD6450,
where the driver was detected by default:
# uname -a
Linux transcoder1 4.9.0-6-amd64 #1 SMP Debian 4.9.88-1 (2018-04-29) x86_64
GNU/Linux
# vainfo
error: XDG_RUNTIME_DIR not set in the environment.
error: can't connect to X server!
libva info: VA-API version 0.39.4
libva info: va_getDriverName() returns -1
libva error: va_getDriverName() failed with unknown libva
error,driver_name=(null)
vaInitialize failed with error code -1 (unknown libva error),exit
# export LIBVA_DRIVER_NAME=radeonsi
# vainfo
error: XDG_RUNTIME_DIR not set in the environment.
error: can't connect to X server!
libva info: VA-API version 0.39.4
libva info: va_getDriverName() returns -1
libva info: User requested driver 'radeonsi'
libva info: Trying to open /usr/lib/x86_64-linux-gnu/dri/radeonsi_drv_video.so
libva info: Found init function __vaDriverInit_0_39
libva info: va_openDriver() returns 0
vainfo: VA-API version: 0.39 (libva 1.7.3)
vainfo: Driver version: mesa gallium vaapi
vainfo: Supported profile and entrypoints
VAProfileMPEG2Simple : VAEntrypointVLD
VAProfileMPEG2Main : VAEntrypointVLD
VAProfileVC1Simple : VAEntrypointVLD
VAProfileVC1Main : VAEntrypointVLD
VAProfileVC1Advanced : VAEntrypointVLD
VAProfileH264ConstrainedBaseline: VAEntrypointVLD
VAProfileH264ConstrainedBaseline: VAEntrypointEncSlice
VAProfileH264Main : VAEntrypointVLD
VAProfileH264Main : VAEntrypointEncSlice
VAProfileH264High : VAEntrypointVLD
VAProfileH264High : VAEntrypointEncSlice
VAProfileHEVCMain : VAEntrypointVLD
VAProfileHEVCMain10 : VAEntrypointVLD
VAProfileNone : VAEntrypointVideoProc
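To make the profile list above easier to scan, here is a small sketch that filters vainfo output by entrypoint. The helper name profiles_with_entrypoint is hypothetical; it only assumes the "VAProfile... : VAEntrypoint..." line layout shown above.

```shell
#!/bin/sh
# Read vainfo output on stdin and print the profiles that offer a given
# entrypoint (default: VAEntrypointEncSlice, i.e. hardware encode).
# Hypothetical helper for illustration only.
profiles_with_entrypoint() {
    entry="${1:-VAEntrypointEncSlice}"
    awk -v entry="$entry" '
        index($0, entry) && /^[ \t]*VAProfile/ {
            sub(/:.*/, "")       # drop everything from the colon onwards
            gsub(/[ \t]/, "")    # strip alignment whitespace
            print
        }'
}
```

Usage would look like "LIBVA_DRIVER_NAME=radeonsi vainfo | profiles_with_entrypoint". Applied to the listing above, only the three H264 profiles offer VAEntrypointEncSlice; the HEVC profiles are decode-only (VAEntrypointVLD).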
This setup doesn't crash when playing a 1080p x264 video file via ffmpeg;
however, the output is badly corrupted (wrong colours, encoding failures, etc.),
so there seems to be a more general problem. The moment I try to decode a 4K
HEVC video, the GPU crashes with the following error:
[ 175.464769] amdgpu 0000:00:06.0: GPU fault detected: 146 0x0e92be14
[ 175.466660] amdgpu 0000:00:06.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR
0x001011D2
[ 175.468868] amdgpu 0000:00:06.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS
0x030BE014
[ 175.471056] amdgpu 0000:00:06.0: VM fault (0x14, vmid 1) at page 1053138,
write from 'VCE0' (0x56434530) (190)
[ 175.473920] amdgpu 0000:00:06.0: GPU fault detected: 146 0x0e92be14
[ 175.475704] amdgpu 0000:00:06.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR
0x001011D4
[ 175.477808] amdgpu 0000:00:06.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS
0x030BE014
[ 175.479900] amdgpu 0000:00:06.0: VM fault (0x14, vmid 1) at page 1053140,
write from 'VCE0' (0x56434530) (190)
[ 175.517997] amdgpu 0000:00:06.0: GPU fault detected: 146 0x0e903e0c
[ 175.519798] amdgpu 0000:00:06.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR
0x001011D2
[ 175.522011] amdgpu 0000:00:06.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS
0x0203E00C
[ 175.524224] amdgpu 0000:00:06.0: VM fault (0x0c, vmid 1) at page 1053138,
read from 'VCE0' (0x56434530) (62)
After this point, the entire hypervisor host needs to be rebooted to return the
GPU to a "working" state (i.e. so it won't lock up the hypervisor if I reboot
the VM). And if I leave it long enough, eventually the hypervisor will simply
lock up completely.
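The numbers in the fault lines above can be cross-checked with a few lines of shell. This is only a sketch: the helper names are hypothetical, and the bit positions used for the status register are an assumption inferred from the values printed in this log (they reproduce the 0x14/0x0c, vmid 1, read/write and 190/62 figures), not taken from register documentation.

```shell
#!/bin/sh
# Decode the hex client tag printed by amdgpu (e.g. 0x56434530) into ASCII.
decode_client() {
    hex=${1#0x}
    out=""
    while [ "${#hex}" -ge 2 ]; do
        pair=${hex%"${hex#??}"}    # first two hex digits
        hex=${hex#??}              # remainder
        # Convert the byte to an octal escape, which POSIX printf understands.
        out="$out$(printf "\\$(printf '%03o' "0x$pair")")"
    done
    printf '%s\n' "$out"
}

# Split a VM_CONTEXT1_PROTECTION_FAULT_STATUS value into fields.
# ASSUMPTION: bit positions inferred from the log values above.
decode_fault_status() {
    status=$(( $1 ))
    prot=$(( status & 0xFF ))
    mc_id=$(( (status >> 12) & 0x1FF ))
    rw=$(( (status >> 24) & 0x1 ))
    vmid=$(( (status >> 25) & 0xF ))
    if [ "$rw" -eq 1 ]; then dir=write; else dir=read; fi
    printf 'prot=0x%02x mc_id=%d rw=%s vmid=%d\n' "$prot" "$mc_id" "$dir" "$vmid"
}

# Convert a fault address register value into the decimal page number.
fault_page() {
    printf '%d\n' "$(( $1 ))"
}
```

For example, "decode_client 0x56434530" prints VCE0, "fault_page 0x001011D2" prints 1053138, and "decode_fault_status 0x030BE014" prints "prot=0x14 mc_id=190 rw=write vmid=1", matching the first fault in the log. All three faults therefore come from the same VCE0 client touching two neighbouring pages.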
I've encountered this bug with both the latest open-source AMDGPU driver in
Linux kernels 4.17-rc3 and 4.15, and with the AMDGPU-PRO driver on Linux
kernel 4.9 inside the VM as demonstrated above; the crash message is identical
in every case. Trying VAAPI drivers other than radeonsi has no effect, and the
r600 driver is in fact far worse, throwing dozens of VM faults instead of the
three seen above.
I'm at a loss to determine what could possibly be wrong here as I've tried
tweaking almost everything I could think of based on the advice I've been able
to find online so far, which is sparse.
I'm willing to provide any further info which may help, especially regarding
the passthrough, and any advice anyone could give would be helpful!
Joshua M. Boniface
Linux System Ærchitect - Boniface Labs
Sigmentation fault: core dumped
_______________________________________________
amd-gfx mailing list
[email protected]
https://lists.freedesktop.org/mailman/listinfo/amd-gfx