On Wed, Oct 29, 2025 at 02:00:38AM +0000, Zhang, Hawking wrote:
> [AMD Official Use Only - AMD Internal Distribution Only]
>
> + [1]@Zhou1, Tao and [2]@Liu, Xiang(Dean) for the awareness.
>
> RE - AMD folks, would you consider this to replace the current debugfs you
> have?
>
> [Hawking]:
>
> Replacing the debugfs is not the primary concern.
My initial plan was to go with debugfs like you are doing, but
I keep hearing complaints that debugfs is not global, and that we need
to account for cases where debugfs is not available in production
images.
> The main concern is whether drm_ras can effectively support the necessary
> RAS information for all device vendors, as this largely depends on the
> design of the hardware and firmware.
>
I fully agree. This is the main reason I'm doing my best to make drm-ras
as generic and extensible as possible, with node registration that
supports different node types and names.
I imagined something like:
[{'FRU': 'String with device info', 'CPER': !@#$#!@#$}, ...]
based on the format that the current non-standard-cper tracefs uses, with
the FRU + CPER. But we could drop the FRU field and use the FRU as the
node name instead.
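
To make the two options concrete, here is a rough Python sketch, not the
actual uAPI. It takes records shaped like the example above and regroups
them so that the FRU acts as a node name. The `regroup_by_fru` helper, the
FRU strings, and the idea that the CPER payload arrives as raw bytes are
all assumptions for illustration; none of that is defined yet.

```python
from collections import defaultdict

# Flat shape from the get-logs example: the FRU is repeated in every record.
# Assumption: the CPER payload is raw bytes; its real encoding is undefined.
flat_records = [
    {"FRU": "gpu0", "CPER": b"\x01\x02"},
    {"FRU": "gpu0", "CPER": b"\x03"},
    {"FRU": "gpu1", "CPER": b"\x04\x05"},
]


def regroup_by_fru(records):
    """Hypothetical helper: key the CPER payloads by FRU, as if each FRU
    were the name of its own 'error-logging' node."""
    by_node = defaultdict(list)
    for rec in records:
        by_node[rec["FRU"]].append(rec["CPER"])
    return dict(by_node)


nodes = regroup_by_fru(flat_records)
for name, cpers in nodes.items():
    print(name, len(cpers))
```

Either shape carries the same information; the difference is only whether
userspace selects an FRU by node-id up front or filters the records itself.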
>
>
> AMD is currently evaluating the proposed interface for error logging.
>
The details of the design and the implementation are pretty much open for
discussion at this point.
What I'm really looking for is to know whether the overall path is
acceptable, even if different drivers opt for different node types.
Is there any blocker on using this drm-ras/netlink for the CPER?
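
To illustrate the semantics I have in mind for the error-logging node,
here is a minimal Python sketch of a bounded ring buffer with an active
query path. The `CperLog` class, its method names, and the choice that a
query returns a snapshot without consuming the records are assumptions for
the sake of discussion, not what the patches implement.

```python
from collections import deque


class CperLog:
    """Hypothetical model of an 'error-logging' node: a bounded ring buffer
    of CPER records where the oldest entry is dropped once it is full."""

    def __init__(self, capacity):
        self._ring = deque(maxlen=capacity)

    def emit(self, fru, cper):
        # Called when the driver produces a new record; old ones fall off.
        self._ring.append({"FRU": fru, "CPER": cper})

    def query_logs(self):
        # Active query path: return a snapshot, oldest first, and leave
        # the records in place (assumption: a dump does not consume them).
        return list(self._ring)


log = CperLog(capacity=3)
for i in range(5):
    log.emit("gpu0", bytes([i]))

snapshot = log.query_logs()
print([rec["CPER"] for rec in snapshot])
```

With a capacity of 3 and 5 emitted records, only the 3 most recent remain,
which is the behavior a tracefs-only approach cannot offer on demand.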
Thanks,
Rodrigo.
>
>
> Regards,
>
> Hawking
>
>
>
> -----Original Message-----
>
> From: Rodrigo Vivi <[email protected]>
>
> Sent: Wednesday, October 29, 2025 03:13
>
> To: [email protected]; [email protected]; Dave
>
> Airlie <[email protected]>; Joonas Lahtinen
>
> <[email protected]>; Simona Vetter <[email protected]>;
>
> Zhang, Hawking <[email protected]>; Deucher, Alexander
>
> <[email protected]>; Zack McKevitt
>
> <[email protected]>; Lukas Wunner <[email protected]>;
>
> Aravind Iddamsetty <[email protected]>
>
> Cc: Zhang, Hawking <[email protected]>; Deucher, Alexander
>
> <[email protected]>; Zack McKevitt
>
> <[email protected]>; Lukas Wunner <[email protected]>; Dave
>
> Airlie <[email protected]>; Simona Vetter <[email protected]>;
>
> Aravind Iddamsetty <[email protected]>; Joonas Lahtinen
>
> <[email protected]>
>
> Subject: DRM_RAS for CPER Error logging?!
>
>
>
> On Mon, Sep 29, 2025 at 05:44:12PM -0400, Rodrigo Vivi wrote:
>
>
>
> Hey Dave, Sima, AMD folks, Qualcomm folks,
>
>
>
> I have a key question to you below here.
>
>
>
> > This work is a continuation of the great work started by Aravind ([1]
>
> > and [2]) in order to fulfill the RAS requirements and proposal as
>
> > previously discussed and agreed in the Linux Plumbers accelerator's bof
>
> of 2022 [3].
>
> >
>
> > [1]:
>
> >
>
> [3]https://lore.kernel.org/dri-devel/20250730064956.1385855-1-aravind.idd
>
> > [4][email protected]/
>
> > [2]:
>
> >
>
> [5]https://lore.kernel.org/all/4cbdfcc5-5020-a942-740e-a602d4c00cc2@linux
>
> > .intel.com/
>
> > [3]:
>
> >
>
> [6]https://airlied.blogspot.com/2022/09/accelerators-bof-outcomes-summary
>
> > .html
>
> >
>
> > During the past review round, Lukas pointed out that netlink had
>
> > evolved in parallel during these years and that now, any new usage of
>
> > netlink families would require the usage of the YAML description and
>
> scripts.
>
> >
>
> > With this new requirement in place, the family name is hardcoded in
>
> > the yaml file, so we are forced to have a single family name for the
>
> > entire drm, and then we are now forced to have a registration.
>
> >
>
> > So, while doing the registration, we now created the concept of
>
> drm-ras-node.
>
> > For now the only node type supported is the agreed error-counter. But
>
> > that could be expanded for other cases like telemetry, requested by
>
> > Zack for the qualcomm accel driver.
>
> >
>
> > In this first version, only querying counter is supported. But also
>
> > this is expandable to future introduction of multicast notification and
>
> also clearing the counters.
>
> >
>
> > This design with multiple nodes per device is already flexible enough
>
> > for driver to decide if it wants to handle error per device, or per IP
>
> > block, or per error category. I believe this fully attends to the
>
> > requested AMD feedback in the earlier reviews.
>
> >
>
> > So, my proposal is to start simple with this case as is, and then
>
> > iterate over with the drm-ras in tree so we evolve together according
>
> > to various driver's RAS needs.
>
> >
>
> > I have provided a documentation and the first Xe implementation of the
>
> > counter as reference.
>
> >
>
> > Also, it is worth mentioning that we have an in-tree pyynl/cli.py tool
>
> > that entirely exercises this new API, hence I hope this can be the
>
> > reference code for the uAPI usage, while we continue with the plan of
>
> > introducing IGT tests and tools for this and adjusting the internal
>
> > vendor tools to open with open source developments and changing them to
>
> support these flows.
>
> >
>
> > Example on MTL:
>
> >
>
> > $ sudo ./tools/net/ynl/pyynl/cli.py \
>
> > --spec Documentation/netlink/specs/drm_ras.yaml \
>
> > --dump list-nodes
>
> > [{'device-name': '00:02.0',
>
> > 'node-id': 0,
>
> > 'node-name': 'non-fatal',
>
> > 'node-type': 'error-counter'},
>
> > {'device-name': '00:02.0',
>
> > 'node-id': 1,
>
> > 'node-name': 'correctable',
>
> > 'node-type': 'error-counter'}]
>
>
>
> As you can see on the drm-ras patch, we now have only a single family
>
> called 'drm-ras', with that we have to register entry points, called
>
> 'nodes'
>
> and for now only one type is existing: 'error-counter'
>
>
>
> As I believe it was agreed in the Linux Plumbers accelerator's bof of 2022
>
> [3].
>
>
>
> Zack already indicated that for Qualcomm he doesn't need the error
>
> counters, but another type, perhaps telemetry.
>
>
>
> I need your feedback and input on yet another case here that goes side by
>
> side with error-counters: Error logging.
>
>
>
> One of the RAS requirements that we have is to emit CPER logs in certain
>
> cases. AMD is currently using debugfs for printing the CPER entries that
>
> accumulate in a ring buffer (iiuc).
>
>
>
> Some folks are asking us to emit the CPER in the tracefs because debugfs
>
> might not be available in some enterprise production images.
>
>
>
> However, there's a concern on the tracefs usage for the error-logging
>
> case.
>
> There is no active query path in the tracefs. If user needs to poll for
>
> the latest CPER records it would need to piggyback on some other API that
>
> would force the emit-trace(cper).
>
>
>
> I believe that the cleanest way is to have another drm-ras node type named
>
> 'error-logging' with a single operation that is query-logs, that would be
>
> a dump of the available ring-buffer with latest known cper records. Is
>
> this acceptable?
>
>
>
> AMD folks, would you consider this to replace the current debugfs you
>
> have?
>
>
>
> Please let me know your thoughts.
>
>
>
> We won't have an example for now, but it would be something like:
>
>
>
> Thanks,
>
> Rodrigo.
>
>
>
> $ sudo ./tools/net/ynl/pyynl/cli.py \
>
> --spec Documentation/netlink/specs/drm_ras.yaml \
>
> --dump list-nodes
>
> [{'device-name': '00:02.0',
>
> 'node-id': 0,
>
> 'node-name': 'non-fatal',
>
> 'node-type': 'error-counter'},
>
> {'device-name': '00:02.0',
>
> 'node-id': 1,
>
> 'node-name': 'correctable',
>
> 'node-type': 'error-counter'},
>
> {'device-name': '00:02.0',
>
> 'node-id': 2,
>
> 'node-name': 'non-fatal',
>
> 'node-type': 'error-logging'},
>
> {'device-name': '00:02.0',
>
> 'node-id': 3,
>
> 'node-name': 'correctable',
>
> 'node-type': 'error-logging'}]
>
>
>
> $ sudo ./tools/net/ynl/pyynl/cli.py \
>
> --spec Documentation/netlink/specs/drm_ras.yaml \
>
> --dump get-logs --json '{"node-id":3}'
>
> [{'FRU': 'String with device info', 'CPER': !@#$#!@#$},
>
> {'FRU': 'String with device info', 'CPER': !@#$#!@#$},
>
> {'FRU': 'String with device info', 'CPER': !@#$#!@#$},
>
> {'FRU': 'String with device info', 'CPER': !@#$#!@#$},
>
> {'FRU': 'String with device info', 'CPER': !@#$#!@#$}]
>
>
>
> Of course, details of the error-logging fields along with the CPER binary
>
> are yet to be defined.
>
>
>
> Oh, and the node names and split are device specific. The infra is
>
> flexible enough. Drivers can do whatever makes sense for their device.
>
>
>
> Any feedback or comment is really appreciated.
>
>
>
> Thanks in advance,
>
> Rodrigo.
>
>
>
> >
>
> > $ sudo ./tools/net/ynl/pyynl/cli.py \
>
> > --spec Documentation/netlink/specs/drm_ras.yaml \
>
> > --dump get-error-counters --json '{"node-id":1}'
>
> > [{'error-id': 0, 'error-name': 'GT Error', 'error-value': 0},
>
> > {'error-id': 4, 'error-name': 'Display Error', 'error-value': 0},
>
> > {'error-id': 8, 'error-name': 'GSC Error', 'error-value': 0},
>
> > {'error-id': 12, 'error-name': 'SG Unit Error', 'error-value': 0},
>
> > {'error-id': 16, 'error-name': 'SoC Error', 'error-value': 0},
>
> > {'error-id': 17, 'error-name': 'CSC Error', 'error-value': 0}]
>
> >
>
> > $ sudo ./tools/net/ynl/pyynl/cli.py \
>
> > --spec Documentation/netlink/specs/drm_ras.yaml \
>
> > --do query-error-counter --json '{"node-id": 0, "error-id": 12}'
>
> > {'error-id': 12, 'error-name': 'SG Unit Error', 'error-value': 0}
>
> >
>
> > $ sudo ./tools/net/ynl/pyynl/cli.py \
>
> > --spec Documentation/netlink/specs/drm_ras.yaml \
>
> > --do query-error-counter --json '{"node-id": 1, "error-id": 16}'
>
> > {'error-id': 16, 'error-name': 'SoC Error', 'error-value': 0}
>
> >
>
> > Thanks,
>
> > Rodrigo.
>
> >
>
> > Cc: Hawking Zhang <[7][email protected]>
>
> > Cc: Alex Deucher <[8][email protected]>
>
> > Cc: Zack McKevitt <[9][email protected]>
>
> > Cc: Lukas Wunner <[10][email protected]>
>
> > Cc: Dave Airlie <[11][email protected]>
>
> > Cc: Simona Vetter <[12][email protected]>
>
> > Cc: Aravind Iddamsetty <[13][email protected]>
>
> > Cc: Joonas Lahtinen <[14][email protected]>
>
> > Signed-off-by: Rodrigo Vivi <[15][email protected]>
>
> >
>
> > Rodrigo Vivi (2):
>
> > drm/ras: Introduce the DRM RAS infrastructure over generic netlink
>
> > drm/xe: Introduce the usage of drm_ras with supported HW errors
>
> >
>
> > Documentation/gpu/drm-ras.rst | 109 +++++++
>
> > Documentation/netlink/specs/drm_ras.yaml | 130 ++++++++
>
> > drivers/gpu/drm/Kconfig | 9 +
>
> > drivers/gpu/drm/Makefile | 1 +
>
> > drivers/gpu/drm/drm_drv.c | 6 +
>
> > drivers/gpu/drm/drm_ras.c | 357 +++++++++++++++++++++
>
> > drivers/gpu/drm/drm_ras_genl_family.c | 42 +++
>
> > drivers/gpu/drm/drm_ras_nl.c | 54 ++++
>
> > drivers/gpu/drm/xe/regs/xe_hw_error_regs.h | 22 ++
>
> > drivers/gpu/drm/xe/xe_hw_error.c | 155 ++++++++-
>
> > include/drm/drm_ras.h | 76 +++++
>
> > include/drm/drm_ras_genl_family.h | 17 +
>
> > include/drm/drm_ras_nl.h | 24 ++
>
> > include/uapi/drm/drm_ras.h | 49 +++
>
> > 14 files changed, 1049 insertions(+), 2 deletions(-) create mode
>
> > 100644 Documentation/gpu/drm-ras.rst create mode 100644
>
> > Documentation/netlink/specs/drm_ras.yaml
>
> > create mode 100644 drivers/gpu/drm/drm_ras.c create mode 100644
>
> > drivers/gpu/drm/drm_ras_genl_family.c
>
> > create mode 100644 drivers/gpu/drm/drm_ras_nl.c create mode 100644
>
> > include/drm/drm_ras.h create mode 100644
>
> > include/drm/drm_ras_genl_family.h create mode 100644
>
> > include/drm/drm_ras_nl.h create mode 100644
>
> > include/uapi/drm/drm_ras.h
>
> >
>
> > --
>
> > 2.51.0
>
> >
>
>
> References
>
> Visible links
> 1. mailto:[email protected]
> 2. mailto:[email protected]
> 3. https://lore.kernel.org/dri-devel/20250730064956.1385855-1-aravind.idd
> 4. mailto:[email protected]/
> 5. https://lore.kernel.org/all/4cbdfcc5-5020-a942-740e-a602d4c00cc2@linux
> 6. https://airlied.blogspot.com/2022/09/accelerators-bof-outcomes-summary
> 7. mailto:[email protected]
> 8. mailto:[email protected]
> 9. mailto:[email protected]
> 10. mailto:[email protected]
> 11. mailto:[email protected]
> 12. mailto:[email protected]
> 13. mailto:[email protected]
> 14. mailto:[email protected]
> 15. mailto:[email protected]