Thank you for your contribution! Sashiko AI review found 2 potential issue(s) to consider:
Pre-existing issues: - [High] Emitting a DRM wedged uevent before the device is registered in sysfs causes the uevent to be silently dropped, losing the cold-reset notification. - [High] Concurrent modification of unprotected `xe->wedged.method` can cause the `COLD_RESET` recovery method to be overwritten. -- commit d8f11e3cba2116967c1b1af489f30a2fd996f664 Author: Mallesh Koujalagi <[email protected]> drm/xe: Handle PUNIT errors by requesting cold-reset recovery When PUNIT errors are detected that persist across warm resets, this commit marks the device as wedged with DRM_WEDGE_RECOVERY_COLD_RESET and notifies userspace that a complete device power cycle is required. > diff --git a/drivers/gpu/drm/xe/xe_ras.c b/drivers/gpu/drm/xe/xe_ras.c > --- a/drivers/gpu/drm/xe/xe_ras.c > +++ b/drivers/gpu/drm/xe/xe_ras.c [ ... ] > @@ -207,6 +209,12 @@ static enum xe_ras_recovery_action > handle_core_compute_errors(struct xe_ras_erro > return XE_RAS_RECOVERY_ACTION_RECOVERED; > } > > +static void punit_error_handler(struct xe_device *xe) > +{ > + xe_device_set_wedged_method(xe, DRM_WEDGE_RECOVERY_COLD_RESET); [Severity: High] This is a pre-existing issue, but does concurrent modification of the unprotected xe->wedged.method here cause the COLD_RESET recovery method to be overwritten? If a concurrent event (such as a GT hang timeout) triggers xe_device_declare_wedged() concurrently, there appears to be a race condition. The concurrent thread may evaluate if (!xe->wedged.method) as true just before punit_error_handler() executes xe_device_set_wedged_method() with DRM_WEDGE_RECOVERY_COLD_RESET. The concurrent thread would then execute: xe_device_set_wedged_method(xe, DRM_WEDGE_RECOVERY_REBIND | DRM_WEDGE_RECOVERY_BUS_RESET); Will this overwrite the COLD_RESET method and result in an incorrect recovery uevent being broadcast to userspace? > + xe_device_declare_wedged(xe); [Severity: High] This isn't a bug introduced by this patch, but does emitting a DRM wedged uevent before the device is registered in sysfs cause the uevent to be silently dropped? During driver initialization, xe_device_probe() calls xe_ras_init(), which processes preexisting hardware errors. If a PUNIT error is detected at this stage, punit_error_handler() marks the device as wedged and calls xe_device_declare_wedged(), which in turn calls drm_dev_wedged_event(). Since drm_dev_register() has not yet been called at this stage of the probe, device_add() has not executed, leaving state_in_sysfs = 0 for the kobject. The core kobject_uevent_env() explicitly drops events for unlinked kobjects. Does this cause the WEDGED=cold-reset uevent notification intended for userspace to be permanently lost, leaving the device unusable without userspace automation knowing it needs to power-cycle the slot? > +} > + > static enum xe_ras_recovery_action handle_soc_internal_errors(struct > xe_device *xe, -- Sashiko AI review ยท https://sashiko.dev/#/patchset/[email protected]?part=4
