On Thu, Oct 30, 2025 at 10:47:18AM -0400, Rodrigo Vivi wrote:
> On Tue, Oct 28, 2025 at 03:13:15PM -0400, Rodrigo Vivi wrote:
> > On Mon, Sep 29, 2025 at 05:44:12PM -0400, Rodrigo Vivi wrote:
> >
> > Hey Dave, Sima, AMD folks, Qualcomm folks,
>
> + Netlink list and maintainers to get some feedback on the netlink usage
> proposed here.

The netdev mailing list blocked my bounces of the original discussions,
so for the overall context:

Usage of netlink as a drm-ras solution (with error counters in mind):
https://airlied.blogspot.com/2022/09/accelerators-bof-outcomes-summary.html

Proposal for error-counters with drm-ras generic netlink:
https://lore.kernel.org/dri-devel/[email protected]/

Question about the error-logging RAS sub-case with CPER over this drm-ras
netlink:
https://lore.kernel.org/dri-devel/[email protected]/

>
> Especially to check if there's any concern with the CPER blob going through
> netlink or if there's any size limitation or concern.
>
> >
> > I have a key question to you below here.
> >
> > > This work is a continuation of the great work started by Aravind ([1]
> > > and [2]) in order to fulfill the RAS requirements and proposal as
> > > previously discussed and agreed in the Linux Plumbers accelerator's
> > > bof of 2022 [3].
> > >
> > > [1]: https://lore.kernel.org/dri-devel/[email protected]/
> > > [2]: https://lore.kernel.org/all/[email protected]/
> > > [3]: https://airlied.blogspot.com/2022/09/accelerators-bof-outcomes-summary.html
> > >
> > > During the past review round, Lukas pointed out that netlink had
> > > evolved in parallel during these years and that now, any new usage of
> > > netlink families would require the usage of the YAML description and
> > > scripts.
> > >
> > > With this new requirement in place, the family name is hardcoded in the
> > > yaml file, so we are forced to have a single family name for the entire
> > > drm, and so we are now forced to have a registration.
> > >
> > > So, while doing the registration, we now created the concept of
> > > drm-ras-node. For now the only node type supported is the agreed
> > > error-counter. But that could be expanded for other cases like
> > > telemetry, requested by Zack for the qualcomm accel driver.
> > >
> > > In this first version, only querying counters is supported. But this is
> > > also expandable to the future introduction of multicast notification
> > > and of clearing the counters.
> > >
> > > This design with multiple nodes per device is already flexible enough
> > > for a driver to decide whether it wants to handle errors per device,
> > > per IP block, or per error category. I believe this fully addresses the
> > > feedback AMD requested in the earlier reviews.
> > >
> > > So, my proposal is to start simple with this case as is, and then
> > > iterate with drm-ras in tree so we evolve together according to the
> > > various drivers' RAS needs.
> > >
> > > I have provided documentation and the first Xe implementation of the
> > > counter as reference.
> > >
> > > Also, it is worth mentioning that we have an in-tree pyynl/cli.py tool
> > > that entirely exercises this new API, hence I hope this can be the
> > > reference code for the uAPI usage, while we continue with the plan of
> > > introducing IGT tests and tools for this and adjusting the internal
> > > vendor tools to open with open source developments and changing them
> > > to support these flows.
> > >
> > > Example on MTL:
> > >
> > > $ sudo ./tools/net/ynl/pyynl/cli.py \
> > >   --spec Documentation/netlink/specs/drm_ras.yaml \
> > >   --dump list-nodes
> > > [{'device-name': '00:02.0',
> > >   'node-id': 0,
> > >   'node-name': 'non-fatal',
> > >   'node-type': 'error-counter'},
> > >  {'device-name': '00:02.0',
> > >   'node-id': 1,
> > >   'node-name': 'correctable',
> > >   'node-type': 'error-counter'}]
> >
> > As you can see in the drm-ras patch, we now have only a single family
> > called 'drm-ras'. With that we have to register entry points, called
> > 'nodes', and for now only one type exists: 'error-counter'.
> >
> > As I believe it was agreed in the Linux Plumbers accelerator's bof of 2022
> > [3].
> >
> > Zack already indicated that for Qualcomm he doesn't need the error
> > counters, but another type, perhaps telemetry.
> >
> > I need your feedback and input on yet another case here that goes side
> > by side with error-counters: Error logging.
> >
> > One of the RAS requirements that we have is to emit CPER logs in certain
> > cases. AMD is currently using debugfs for printing the CPER entries that
> > accumulate in a ringbuffer (iiuc).
> >
> > Some folks are asking us to emit the CPER in the tracefs because
> > debugfs might not be available in some enterprise production images.
> >
> > However, there's a concern with the tracefs usage for the error-logging
> > case. There is no active query path in the tracefs. If a user needs to
> > poll for the latest CPER records, they would need to piggyback on some
> > other API that would force the emit-trace(cper).
> >
> > I believe that the cleanest way is to have another drm-ras node type
> > named 'error-logging' with a single operation, query-logs, that would
> > be a dump of the available ring-buffer with the latest known CPER
> > records. Is this acceptable?
> >
> > AMD folks, would you consider this to replace the current debugfs you
> > have?
> >
> > Please let me know your thoughts.
> >
> > We won't have an example for now, but it would be something like:
> >
> > Thanks,
> > Rodrigo.
> >
> > $ sudo ./tools/net/ynl/pyynl/cli.py \
> >   --spec Documentation/netlink/specs/drm_ras.yaml \
> >   --dump list-nodes
> > [{'device-name': '00:02.0',
> >   'node-id': 0,
> >   'node-name': 'non-fatal',
> >   'node-type': 'error-counter'},
> >  {'device-name': '00:02.0',
> >   'node-id': 1,
> >   'node-name': 'correctable',
> >   'node-type': 'error-counter'},
> >  {'device-name': '00:02.0',
> >   'node-id': 2,
> >   'node-name': 'non-fatal',
> >   'node-type': 'error-logging'},
> >  {'device-name': '00:02.0',
> >   'node-id': 3,
> >   'node-name': 'correctable',
> >   'node-type': 'error-logging'}]
> >
> > $ sudo ./tools/net/ynl/pyynl/cli.py \
> >   --spec Documentation/netlink/specs/drm_ras.yaml \
> >   --dump get-logs --json '{"node-id":3}'
> > [{'FRU': 'String with device info', 'CPER': !@#$#!@#$},
> >  {'FRU': 'String with device info', 'CPER': !@#$#!@#$},
> >  {'FRU': 'String with device info', 'CPER': !@#$#!@#$},
> >  {'FRU': 'String with device info', 'CPER': !@#$#!@#$},
> >  {'FRU': 'String with device info', 'CPER': !@#$#!@#$},
> > ]
> >
> > Of course, details of the error-logging fields along with the CPER binary
> > are yet to be defined.
> >
> > Oh, and the node names and split are device specific. The infra is
> > flexible enough. A driver can do whatever makes sense for its device.
> >
> > Any feedback or comment is really appreciated.
> >
> > Thanks in advance,
> > Rodrigo.
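
To make the proposed 'error-logging' semantics a bit more concrete, here is
a minimal userspace sketch in Python of a bounded ring buffer whose query
returns the latest known records. This is only an illustration, not the
kernel implementation: CperRing, push and query_logs are hypothetical
names, the capacity of 4 is arbitrary, and the 'FRU'/'CPER' keys only
mirror the placeholder fields in the example above, which are yet to be
defined.

```python
from collections import deque


class CperRing:
    """Bounded log of CPER records; the oldest entries drop when full."""

    def __init__(self, capacity=4):
        # deque(maxlen=...) discards from the left once capacity is reached.
        self._records = deque(maxlen=capacity)

    def push(self, fru, cper):
        # cper is treated as an opaque binary blob; nothing here parses it.
        self._records.append({"FRU": fru, "CPER": bytes(cper)})

    def query_logs(self):
        # Return copies, oldest first, so callers cannot alter the ring.
        return [dict(rec) for rec in self._records]


ring = CperRing(capacity=4)
for i in range(6):
    ring.push("device 00:02.0", bytes([i]))

logs = ring.query_logs()
print(len(logs))                          # 4
print([rec["CPER"][0] for rec in logs])   # [2, 3, 4, 5]
```

With six records pushed into a ring of four, the dump holds only the four
most recent ones, which is the "latest known cper records" behaviour the
query operation would expose.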
> >
> > >
> > > $ sudo ./tools/net/ynl/pyynl/cli.py \
> > >   --spec Documentation/netlink/specs/drm_ras.yaml \
> > >   --dump get-error-counters --json '{"node-id":1}'
> > > [{'error-id': 0, 'error-name': 'GT Error', 'error-value': 0},
> > >  {'error-id': 4, 'error-name': 'Display Error', 'error-value': 0},
> > >  {'error-id': 8, 'error-name': 'GSC Error', 'error-value': 0},
> > >  {'error-id': 12, 'error-name': 'SG Unit Error', 'error-value': 0},
> > >  {'error-id': 16, 'error-name': 'SoC Error', 'error-value': 0},
> > >  {'error-id': 17, 'error-name': 'CSC Error', 'error-value': 0}]
> > >
> > > $ sudo ./tools/net/ynl/pyynl/cli.py \
> > >   --spec Documentation/netlink/specs/drm_ras.yaml \
> > >   --do query-error-counter --json '{"node-id": 0, "error-id": 12}'
> > > {'error-id': 12, 'error-name': 'SG Unit Error', 'error-value': 0}
> > >
> > > $ sudo ./tools/net/ynl/pyynl/cli.py \
> > >   --spec Documentation/netlink/specs/drm_ras.yaml \
> > >   --do query-error-counter --json '{"node-id": 1, "error-id": 16}'
> > > {'error-id': 16, 'error-name': 'SoC Error', 'error-value': 0}
> > >
> > > Thanks,
> > > Rodrigo.
> > >
> > > Cc: Hawking Zhang <[email protected]>
> > > Cc: Alex Deucher <[email protected]>
> > > Cc: Zack McKevitt <[email protected]>
> > > Cc: Lukas Wunner <[email protected]>
> > > Cc: Dave Airlie <[email protected]>
> > > Cc: Simona Vetter <[email protected]>
> > > Cc: Aravind Iddamsetty <[email protected]>
> > > Cc: Joonas Lahtinen <[email protected]>
> > > Signed-off-by: Rodrigo Vivi <[email protected]>
> > >
> > > Rodrigo Vivi (2):
> > >   drm/ras: Introduce the DRM RAS infrastructure over generic netlink
> > >   drm/xe: Introduce the usage of drm_ras with supported HW errors
> > >
> > >  Documentation/gpu/drm-ras.rst              | 109 +++++++
> > >  Documentation/netlink/specs/drm_ras.yaml   | 130 ++++++++
> > >  drivers/gpu/drm/Kconfig                    |   9 +
> > >  drivers/gpu/drm/Makefile                   |   1 +
> > >  drivers/gpu/drm/drm_drv.c                  |   6 +
> > >  drivers/gpu/drm/drm_ras.c                  | 357 +++++++++++++++++++++
> > >  drivers/gpu/drm/drm_ras_genl_family.c      |  42 +++
> > >  drivers/gpu/drm/drm_ras_nl.c               |  54 ++++
> > >  drivers/gpu/drm/xe/regs/xe_hw_error_regs.h |  22 ++
> > >  drivers/gpu/drm/xe/xe_hw_error.c           | 155 ++++++++-
> > >  include/drm/drm_ras.h                      |  76 +++++
> > >  include/drm/drm_ras_genl_family.h          |  17 +
> > >  include/drm/drm_ras_nl.h                   |  24 ++
> > >  include/uapi/drm/drm_ras.h                 |  49 +++
> > >  14 files changed, 1049 insertions(+), 2 deletions(-)
> > >  create mode 100644 Documentation/gpu/drm-ras.rst
> > >  create mode 100644 Documentation/netlink/specs/drm_ras.yaml
> > >  create mode 100644 drivers/gpu/drm/drm_ras.c
> > >  create mode 100644 drivers/gpu/drm/drm_ras_genl_family.c
> > >  create mode 100644 drivers/gpu/drm/drm_ras_nl.c
> > >  create mode 100644 include/drm/drm_ras.h
> > >  create mode 100644 include/drm/drm_ras_genl_family.h
> > >  create mode 100644 include/drm/drm_ras_nl.h
> > >  create mode 100644 include/uapi/drm/drm_ras.h
> > >
> > > --
> > > 2.51.0
> > >
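
For anyone scripting against the quoted examples, here is a small Python
sketch of how a client might walk the list-nodes result and build the
--json argument for a per-node follow-up request. It is only a sketch under
assumptions: the sample data is copied from the list-nodes output quoted
above (the 'error-logging' entries are the proposal, not something that
exists today), and nodes_by_type and request_args are hypothetical helper
names, not part of pyynl.

```python
import json
from collections import defaultdict

# Sample copied from the quoted list-nodes output; the 'error-logging'
# nodes are only proposed at this point.
nodes = [
    {"device-name": "00:02.0", "node-id": 0,
     "node-name": "non-fatal", "node-type": "error-counter"},
    {"device-name": "00:02.0", "node-id": 1,
     "node-name": "correctable", "node-type": "error-counter"},
    {"device-name": "00:02.0", "node-id": 2,
     "node-name": "non-fatal", "node-type": "error-logging"},
    {"device-name": "00:02.0", "node-id": 3,
     "node-name": "correctable", "node-type": "error-logging"},
]


def nodes_by_type(nodes):
    """Group node ids by their 'node-type' attribute."""
    grouped = defaultdict(list)
    for node in nodes:
        grouped[node["node-type"]].append(node["node-id"])
    return dict(grouped)


def request_args(node_id):
    """Build the JSON string passed via --json for a per-node request."""
    return json.dumps({"node-id": node_id})


grouped = nodes_by_type(nodes)
print(grouped)
for node_id in grouped["error-counter"]:
    print("--dump get-error-counters --json", repr(request_args(node_id)))
```

Since node names and the split are device specific, a script is probably
better off keying on node-type and node-id as above than on node-name.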
