On Mon,  9 Feb 2026 21:50:48 +0530
Narayana Murty N <[email protected]> wrote:

> Add vfio_ensure_d0_state() to safely transition PCI devices from D3hot/D3cold
> to D0 before QEMU guest access, preventing config space inaccessibility and
> tg3 IRQ crashes during VFIO realize.
> 
> Key changes:
> - D3hot: Direct PMCSR write (offset 0x44) to force PowerState=00 (D0)
> - D3cold: pm_runtime_resume() + pm_runtime_get_sync() for full power restore
> - Polling loop verifies D0 transition completion
> - No-op for already D0 devices
> 
> Fixes PowerPC EEH races where devices enter low-power states during VFIO
> handover, causing config space access failures.
> 
> Signed-off-by: Narayana Murty N <[email protected]>
> ---
>  hw/vfio/pci.c | 52 +++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 52 insertions(+)

NAK.  This is very broken.

QEMU cannot write to arbitrary sysfs attributes.  QEMU should not write
power state controls to sysfs nor impose a device power state policy.
vbasedev->fd is more than likely invalid where we're performing an
ioctl test, making the entire premise of the test invalid.
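
To illustrate the descriptor point, here is a minimal standalone sketch
(not QEMU code; the helper name is hypothetical and FIONREAD is just an
arbitrary request).  An ioctl on a descriptor that is not open fails
with EBADF before the request is ever examined, so such a failure
carries no information about the device power state:

```c
#include <errno.h>
#include <stdbool.h>
#include <sys/ioctl.h>

/*
 * Hypothetical helper, not part of QEMU: returns true when an ioctl on
 * fd fails with EBADF, i.e. the descriptor itself is not open.
 */
static bool ioctl_fails_with_ebadf(int fd)
{
    int unused = 0;

    errno = 0;
    /* The request code is arbitrary; the descriptor is validated first. */
    return ioctl(fd, FIONREAD, &unused) < 0 && errno == EBADF;
}
```

With fd == -1 the helper returns true, which is the same failure the
probe in the patch would report as "D3cold".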

When the device is opened by QEMU, vfio-pci will issue a
pm_runtime_resume_and_get(), incrementing the PM usage counter and
waking the device.  This should properly bring the device to the D0
power state and keep it there regardless of any ill-timed race to low
power state.  If it does not, then fix it in the kernel or block
vfio-pci from using low power states, i.e. disable_idle_d3.  Thanks,

Alex
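
The usage-counter argument above can be sketched with a toy userspace
model.  This is an illustration under stated assumptions, not the
kernel's runtime PM implementation: every toy_* name is hypothetical,
and the model only captures the idea that an idle-time suspend attempt
is refused while a reference taken at open time is still held.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: illustrative names, not the kernel's runtime PM API. */
enum toy_power_state { TOY_D0, TOY_D3HOT };

struct toy_device {
    int usage_count;              /* references held by active users */
    enum toy_power_state state;
};

/* Models a resume-and-get: take a reference, then wake the device. */
static void toy_resume_and_get(struct toy_device *dev)
{
    dev->usage_count++;
    dev->state = TOY_D0;
}

/* Models dropping the reference that was taken at open time. */
static void toy_put(struct toy_device *dev)
{
    assert(dev->usage_count > 0);
    dev->usage_count--;
}

/* Models an idle suspend attempt: refused while any reference is held. */
static bool toy_try_suspend(struct toy_device *dev)
{
    if (dev->usage_count > 0) {
        return false;
    }
    dev->state = TOY_D3HOT;
    return true;
}
```

In this model a suspend attempt that races with an open device is a
no-op, and the device can only drop back to a low power state after the
last reference has been put.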

> 
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index c734472721..851cd789aa 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -3392,6 +3392,51 @@ bool vfio_pci_interrupt_setup(VFIOPCIDevice *vdev, Error **errp)
>      return true;
>  }
>  
> +static int write_sysfs(const char *path, const char *value)
> +{
> +    FILE *f = fopen(path, "w");
> +    if (!f) {
> +        return -1;
> +    }
> +    int ret = fprintf(f, "%s", value);
> +    fclose(f);
> +    return (ret > 0) ? 0 : -1;
> +}
> +
> +static void vfio_ensure_d0_state(VFIOPCIDevice *vdev)
> +{
> +    VFIODevice *vbasedev = &vdev->vbasedev;
> +    char sysfs_power_path[PATH_MAX];
> +
> +    /*
> +     * Test config region accessibility (D3cold-safe, no PCI config
> +     * reads!)
> +     */
> +    struct vfio_region_info reg_info = {
> +        .argsz = sizeof(reg_info),
> +        .index = VFIO_PCI_CONFIG_REGION_INDEX,
> +        .offset = 0,
> +        .size = 0
> +    };
> +
> +    if (ioctl(vbasedev->fd, VFIO_DEVICE_GET_REGION_INFO, &reg_info) < 0) {
> +        warn_report("vfio: %s config region probe failed (D3cold): %s",
> +                    vbasedev->name, strerror(errno));
> +
> +        /* D3cold confirmed → sysfs power control (EEH-safe) */
> +        snprintf(sysfs_power_path, sizeof(sysfs_power_path),
> +                 "/sys/bus/pci/devices/%s/power/control", vbasedev->name);
> +
> +        /* Force runtime resume */
> +        if (write_sysfs(sysfs_power_path, "on") == 0) {
> +            g_usleep(10000);  /* 10ms settle */
> +            write_sysfs(sysfs_power_path, "auto");
> +            info_report("vfio: %s D3cold → D0 via sysfs", vbasedev->name);
> +        }
> +    }
> +    return;
> +}
> +
>  static void vfio_pci_realize(PCIDevice *pdev, Error **errp)
>  {
>      ERRP_GUARD();
> @@ -3401,6 +3446,13 @@ static void vfio_pci_realize(PCIDevice *pdev, Error **errp)
>      char uuid[UUID_STR_LEN];
>      g_autofree char *name = NULL;
>  
> +    /*
> +     * ensure the power state of the pci device to D0,
> +     * otherwise it will set to D0, before accessing the
> +     * config space.
> +     */
> +    vfio_ensure_d0_state(vdev);
> +
>      if (vbasedev->fd < 0 && !vbasedev->sysfsdev) {
>          if (!(~vdev->host.domain || ~vdev->host.bus ||
>                ~vdev->host.slot || ~vdev->host.function)) {

