On Thu, Aug 21, 2025 at 08:01:39PM +0530, Riana Tauro wrote: > Hi Maxime > > This patch has the changes suggested wrt to documentation in v7. Have added > whatever Rodrigo suggested in the doc. Please let us know if the changes are > okay and if the patch can be merged.
Maxime already told he is okay with the changes: https://lore.kernel.org/dri-devel/5hkngbuzoryldvjrtjwalxhosdhtweeinpjpyguzltjmee7mpu@vw44iwczytw5/ > > This needs a drm-misc maintainer ack to go ahead. But we still need some formal ack here to move ahead with this patch indeed. > > Thanks > Riana > > On 8/17/2025 9:55 PM, Rodrigo Vivi wrote: > > On Thu, Aug 14, 2025 at 05:44:32PM +0530, Riana Tauro wrote: > > > Address the need for a recovery method (firmware flash on Firmware errors) > > > introduced in the later patches of Xe KMD. > > > Whenever XE KMD detects a firmware error, a firmware flash is required to > > > recover the device to normal operation. > > > > > > The initial proposal to use 'firmware-flash' as a recovery method was > > > not applicable to other drivers and could cause multiple recovery > > > methods specific to vendors to be added. > > > To address this a more generic 'vendor-specific' method is introduced, > > > guiding users to refer to vendor specific documentation and system logs > > > for detailed vendor specific recovery procedure. > > > > > > Add a recovery method 'WEDGED=vendor-specific' for such errors. > > > Vendors must provide additional recovery documentation if this method > > > is used. > > > > > > It is the responsibility of the consumer to refer to the correct vendor > > > specific documentation and usecase before attempting a recovery. > > > > > > For example: If driver is XE KMD, the consumer must refer > > > to the documentation of 'Device Wedging' under 'Documentation/gpu/xe/'. > > > > > > v2: fix documentation (Raag) > > > v3: add more details to commit message (Sima, Rodrigo, Raag) > > > add an example script to the documentation (Raag) > > > v4: use consistent naming (Raag) > > > v5: fix commit message > > > v6: add more documentation > > > > > > Cc: André Almeida <andrealm...@igalia.com> > > > Cc: Christian König <christian.koe...@amd.com> > > > Cc: David Airlie <airl...@gmail.com> > > > Cc: Simona Vetter <simona.vet...@ffwll.ch> > > > > Cc: Maxime Ripard <mrip...@kernel.org> > > > > Folks, is it clear now? can we move ahead and get this through drm-xe-next? > > > > > Signed-off-by: Raag Jadav <raag.ja...@intel.com> > > > Signed-off-by: Riana Tauro <riana.ta...@intel.com> > > > Reviewed-by: Rodrigo Vivi <rodrigo.v...@intel.com> > > > --- > > > Documentation/gpu/drm-uapi.rst | 47 +++++++++++++++++++++++++++++----- > > > drivers/gpu/drm/drm_drv.c | 2 ++ > > > include/drm/drm_device.h | 4 +++ > > > 3 files changed, 46 insertions(+), 7 deletions(-) > > > > > > diff --git a/Documentation/gpu/drm-uapi.rst > > > b/Documentation/gpu/drm-uapi.rst > > > index 843facf01b2d..669a6b9da0b2 100644 > > > --- a/Documentation/gpu/drm-uapi.rst > > > +++ b/Documentation/gpu/drm-uapi.rst > > > @@ -418,13 +418,12 @@ needed. > > > Recovery > > > -------- > > > -Current implementation defines three recovery methods, out of which, > > > drivers > > > +Current implementation defines four recovery methods, out of which, > > > drivers > > > can use any one, multiple or none. Method(s) of choice will be sent in > > > the > > > uevent environment as ``WEDGED=<method1>[,..,<methodN>]`` in order of > > > less to > > > -more side-effects. If driver is unsure about recovery or method is > > > unknown > > > -(like soft/hard system reboot, firmware flashing, physical device > > > replacement > > > -or any other procedure which can't be attempted on the fly), > > > ``WEDGED=unknown`` > > > -will be sent instead. > > > +more side-effects. See the section `Vendor Specific Recovery`_ > > > +for ``WEDGED=vendor-specific``. If driver is unsure about recovery or > > > +method is unknown, ``WEDGED=unknown`` will be sent instead. > > > Userspace consumers can parse this event and attempt recovery as per the > > > following expectations. > > > @@ -435,6 +434,7 @@ following expectations. > > > none optional telemetry collection > > > rebind unbind + bind driver > > > bus-reset unbind + bus reset/re-enumeration + bind > > > + vendor-specific vendor specific recovery method > > > unknown consumer policy > > > =============== ======================================== > > > @@ -446,6 +446,35 @@ telemetry information (devcoredump, syslog). This is > > > useful because the first > > > hang is usually the most critical one which can result in consequential > > > hangs or > > > complete wedging. > > > + > > > +Vendor Specific Recovery > > > +------------------------ > > > + > > > +When ``WEDGED=vendor-specific`` is sent, it indicates that the device > > > requires > > > +a recovery procedure specific to the hardware vendor and is not one of > > > the > > > +standardized approaches. > > > + > > > +``WEDGED=vendor-specific`` may be used to indicate different cases > > > within a > > > +single vendor driver, each requiring a distinct recovery procedure. > > > +In such scenarios, the vendor driver must provide comprehensive > > > documentation > > > +that describes each case, includes additional hints to identify specific > > > case and > > > +outlines the corresponding recovery procedures. The documentation > > > includes: > > > + > > > +Case - A list of all cases that sends the ``WEDGED=vendor-specific`` > > > recovery method. > > > + > > > +Hints - Additional Information to assist the userspace consumer in > > > identifying and > > > +differentiating between different cases. This can be exposed through > > > sysfs, debugfs, > > > +traces, dmesg etc. > > > + > > > +Recovery Procedure - Clear instructions and guidance for recovering each > > > case. > > > +This may include userspace scripts, tools needed for the recovery > > > procedure. > > > + > > > +It is the responsibility of the admin/userspace consumer to identify the > > > case and > > > +verify additional identification hints before attempting a recovery > > > procedure. > > > + > > > +Example: If the device uses the Xe driver, then userspace consumer > > > should refer to > > > +:ref:`Xe Device Wedging <xe-device-wedging>` for the detailed > > > documentation. > > > + > > > Task information > > > ---------------- > > > @@ -472,8 +501,12 @@ erroring out, all device memory should be unmapped > > > and file descriptors should > > > be closed to prevent leaks or undefined behaviour. The idea here is to > > > clear the > > > device of all user context beforehand and set the stage for a clean > > > recovery. > > > -Example > > > -------- > > > +For ``WEDGED=vendor-specific`` recovery method, it is the responsibility > > > of the > > > +consumer to check the driver documentation and the usecase before > > > attempting > > > +a recovery. > > > + > > > +Example - rebind > > > +---------------- > > > Udev rule:: > > > diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c > > > index cdd591b11488..0ac723a46a91 100644 > > > --- a/drivers/gpu/drm/drm_drv.c > > > +++ b/drivers/gpu/drm/drm_drv.c > > > @@ -532,6 +532,8 @@ static const char *drm_get_wedge_recovery(unsigned > > > int opt) > > > return "rebind"; > > > case DRM_WEDGE_RECOVERY_BUS_RESET: > > > return "bus-reset"; > > > + case DRM_WEDGE_RECOVERY_VENDOR: > > > + return "vendor-specific"; > > > default: > > > return NULL; > > > } > > > diff --git a/include/drm/drm_device.h b/include/drm/drm_device.h > > > index a33aedd5e9ec..59fd3f4d5995 100644 > > > --- a/include/drm/drm_device.h > > > +++ b/include/drm/drm_device.h > > > @@ -26,10 +26,14 @@ struct pci_controller; > > > * Recovery methods for wedged device in order of less to more > > > side-effects. > > > * To be used with drm_dev_wedged_event() as recovery @method. Callers > > > can > > > * use any one, multiple (or'd) or none depending on their needs. > > > + * > > > + * Refer to "Device Wedging" chapter in Documentation/gpu/drm-uapi.rst > > > for more > > > + * details. > > > */ > > > #define DRM_WEDGE_RECOVERY_NONE BIT(0) /* optional telemetry > > > collection */ > > > #define DRM_WEDGE_RECOVERY_REBIND BIT(1) /* unbind + bind driver > > > */ > > > #define DRM_WEDGE_RECOVERY_BUS_RESET BIT(2) /* unbind + reset bus > > > device + bind */ > > > +#define DRM_WEDGE_RECOVERY_VENDOR BIT(3) /* vendor specific > > > recovery method */ > > > /** > > > * struct drm_wedge_task_info - information about the guilty task of a > > > wedge dev > > > -- > > > 2.47.1 > > > >