On Wed, Oct 29, 2025 at 02:00:38AM +0000, Zhang, Hawking wrote:
>    [AMD Official Use Only - AMD Internal Distribution Only]
>
>    + [1]@Zhou1, Tao and [2]@Liu, Xiang(Dean) for the awareness.
>
>    RE - AMD folks, would you consider this to replace the current debugfs you
>    have?
>
>    [Hawking]:
>
>    Replacing the debugfs is not the primary concern.

My initial plan was to go with debugfs like you are doing, but
I keep hearing complaints that debugfs is not universally available,
and we need to take into account the cases where debugfs is disabled
in production images.

> The main concern is
>    whether drm_ras can effectively support the necessary RAS information for
>    all device vendors, as this largely depends on the design of the hardware
>    and firmware.

I fully agree. This is the main reason I'm doing my best to make drm-ras
as generic and extensible as possible, with node registration that
supports different node types and names.
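
To make the node idea concrete, here is a minimal userspace sketch (purely
illustrative, not part of the patches). It takes a list shaped like the
list-nodes output quoted further down and groups the nodes by type, so a
tool can ignore the node types it does not understand. The 'error-logging'
entries are hypothetical, since that node type is only a proposal at this
point, and group_by_type() is a name made up for this example.

```python
from collections import defaultdict

# Sample shaped like the list-nodes output quoted in this thread.
# The 'error-logging' entries are hypothetical: that type is only proposed.
NODES = [
    {"device-name": "00:02.0", "node-id": 0,
     "node-name": "non-fatal", "node-type": "error-counter"},
    {"device-name": "00:02.0", "node-id": 1,
     "node-name": "correctable", "node-type": "error-counter"},
    {"device-name": "00:02.0", "node-id": 2,
     "node-name": "non-fatal", "node-type": "error-logging"},
    {"device-name": "00:02.0", "node-id": 3,
     "node-name": "correctable", "node-type": "error-logging"},
]


def group_by_type(nodes):
    """Map each node-type to a list of (node-id, node-name) pairs."""
    grouped = defaultdict(list)
    for node in nodes:
        grouped[node["node-type"]].append((node["node-id"], node["node-name"]))
    return dict(grouped)


if __name__ == "__main__":
    for node_type, entries in group_by_type(NODES).items():
        print(node_type, entries)
```

A tool that only knows about counters would then pick the 'error-counter'
key and skip everything else.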

I imagined something like:

[{'FRU': 'String with device info', 'CPER': !@#$#!@#$},

based on the format that the current non-standard-cper tracefs event uses,
with the FRU + CPER. But we could drop the FRU field and use the FRU as the
node name instead.

>
>    AMD is currently evaluating the proposed interface for error logging.
>

The design details and the implementation are pretty much open for
discussion at this point.

What I'd really like to know is:

Is the overall path acceptable, even if different drivers opt for
different node types?

Is there any blocker on using this drm-ras/netlink for the CPER?

Thanks,
Rodrigo.


>
>    Regards,
>    Hawking
>
>    -----Original Message-----                                                 
>   
>    From: Rodrigo Vivi <[email protected]>                                
>   
>    Sent: Wednesday, October 29, 2025 03:13                                    
>   
>    To: [email protected]; [email protected]; Dave  
>   
>    Airlie <[email protected]>; Joonas Lahtinen                                
>   
>    <[email protected]>; Simona Vetter <[email protected]>; 
>   
>    Zhang, Hawking <[email protected]>; Deucher, Alexander                 
>   
>    <[email protected]>; Zack McKevitt                                 
>   
>    <[email protected]>; Lukas Wunner <[email protected]>;       
>   
>    Aravind Iddamsetty <[email protected]>                    
>   
>    Cc: Zhang, Hawking <[email protected]>; Deucher, Alexander             
>   
>    <[email protected]>; Zack McKevitt                                 
>   
>    <[email protected]>; Lukas Wunner <[email protected]>; Dave  
>   
>    Airlie <[email protected]>; Simona Vetter <[email protected]>;        
>   
>    Aravind Iddamsetty <[email protected]>; Joonas Lahtinen   
>   
>    <[email protected]>                                          
>   
>    Subject: DRM_RAS for CPER Error logging?!                                  
>   
>                                                                               
>   
>    On Mon, Sep 29, 2025 at 05:44:12PM -0400, Rodrigo Vivi wrote:              
>   
>                                                                               
>   
>    Hey Dave, Sima, AMD folks, Qualcomm folks,                                 
>   
>                                                                               
>   
>    I have a key question to you below here.                                   
>   
>                                                                               
>   
>    > This work is a continuation of the great work started by Aravind ([1]    
>   
>    > and [2]) in order to fulfill the RAS requirements and proposal as        
>   
>    > previously discussed and agreed in the Linux Plumbers accelerator's bof  
>   
>    of 2022 [3].                                                               
>   
>    >                                                                          
>   
>    > [1]:                                                                     
>   
>    >                                                                          
>   
>    [3]https://lore.kernel.org/dri-devel/20250730064956.1385855-1-aravind.idd  
>   
>    > [4][email protected]/                                              
>   
>    > [2]:                                                                     
>   
>    >                                                                          
>   
>    [5]https://lore.kernel.org/all/4cbdfcc5-5020-a942-740e-a602d4c00cc2@linux  
>   
>    > .intel.com/                                                              
>   
>    > [3]:                                                                     
>   
>    >                                                                          
>   
>    [6]https://airlied.blogspot.com/2022/09/accelerators-bof-outcomes-summary  
>   
>    > .html                                                                    
>   
>    >                                                                          
>   
>    > During the past review round, Lukas pointed out that netlink had         
>   
>    > evolved in parallel during these years and that now, any new usage of    
>   
>    > netlink families would require the usage of the YAML description and     
>   
>    scripts.                                                                   
>   
>    >                                                                          
>   
>    > With this new requirement in place, the family name is hardcoded in      
>   
>    > the yaml file, so we are forced to have a single family name for the     
>   
>    > entire drm, and then we now we are forced to have a registration.        
>   
>    >                                                                          
>   
>    > So, while doing the registration, we now created the concept of          
>   
>    drm-ras-node.                                                              
>   
>    > For now the only node type supported is the agreed error-counter. But    
>   
>    > that could be expanded for other cases like telemetry, requested by      
>   
>    > Zack for the qualcomm accel driver.                                      
>   
>    >                                                                          
>   
>    > In this first version, only querying counter is supported. But also      
>   
>    > this is expandable to future introduction of multicast notification and  
>   
>    also clearing the counters.                                                
>   
>    >                                                                          
>   
>    > This design with multiple nodes per device is already flexible enough    
>   
>    > for driver to decide if it wants to handle error per device, or per IP   
>   
>    > block, or per error category. I believe this fully attend to the         
>   
>    > requested AMD feedback in the earlier reviews.                           
>   
>    >                                                                          
>   
>    > So, my proposal is to start simple with this case as is, and then        
>   
>    > iterate over with the drm-ras in tree so we evolve together according    
>   
>    > to various driver's RAS needs.                                           
>   
>    >                                                                          
>   
>    > I have provided a documentation and the first Xe implementation of the   
>   
>    > counter as reference.                                                    
>   
>    >                                                                          
>   
>    > Also, it is worth to mention that we have a in-tree pyynl/cli.py tool    
>   
>    > that entirely exercises this new API, hence I hope this can be the       
>   
>    > reference code for the uAPI usage, while we continue with the plan of    
>   
>    > introducing IGT tests and tools for this and adjusting the internal      
>   
>    > vendor tools to open with open source developments and changing them to  
>   
>    support these flows.                                                       
>   
>    >                                                                          
>   
>    > Example on MTL:                                                          
>   
>    >                                                                          
>   
>    > $ sudo ./tools/net/ynl/pyynl/cli.py \                                    
>   
>    >   --spec Documentation/netlink/specs/drm_ras.yaml \                      
>   
>    >   --dump list-nodes                                                      
>   
>    > [{'device-name': '00:02.0',                                              
>   
>    >   'node-id': 0,                                                          
>   
>    >   'node-name': 'non-fatal',                                              
>   
>    >   'node-type': 'error-counter'},                                         
>   
>    >  {'device-name': '00:02.0',                                              
>   
>    >   'node-id': 1,                                                          
>   
>    >   'node-name': 'correctable',                                            
>   
>    >   'node-type': 'error-counter'}]                                         
>   
>                                                                               
>   
>    As you can see on the drm-ras patch, we now have only a single family      
>   
>    called 'drm-ras', with that we have to register entry points, called       
>   
>    'nodes'                                                                    
>   
>    and for now only one type is existing: 'error-counter'                     
>   
>                                                                               
>   
>    As I believe it was agreed in the Linux Plumbers accelerator's bof of 2022 
>   
>    [3].                                                                       
>   
>                                                                               
>   
>    Zack already indicated that for Qualcomm he doesn't need the error         
>   
>    counters, but another type, perhaps telemetry.                             
>   
>                                                                               
>   
>    I need your feedback and input on yet another case here that goes side by  
>   
>    side with error-counters: Error logging.                                   
>   
>                                                                               
>   
>    One of the RAS requirements that we have is to emit CPER logs in certain   
>   
>    cases. AMD is currently using debugfs for printing the CPER entries that   
>   
>    accumulates in a ringbuffer. (iiuc).                                       
>   
>                                                                               
>   
>    Some folks are asking us to emit the CPER in the tracefs because debugfs   
>   
>    might not be available in some enterprise production images.               
>   
>                                                                               
>   
>    However, there's a concern on the tracefs usage for the error-logging      
>   
>    case.                                                                      
>   
>    There is no active query path in the tracefs. If user needs to poll for    
>   
>    the latest CPER records it would need to pig-back on some other API that   
>   
>    would force the emit-trace(cper).                                          
>   
>                                                                               
>   
>    I believe that the cleanest way is to have another drm-ras node type named 
>   
>    'error-logging' with a single operation that is query-logs, that would be  
>   
>    a dump of the available ring-buffer with latest known cper records. Is     
>   
>    this acceptable?                                                           
>   
>                                                                               
>   
>    AMD folks, would you consider this to replace the current debugfs you      
>   
>    have?                                                                      
>   
>                                                                               
>   
>    Please let me know your thoughts.                                          
>   
>                                                                               
>   
>    We won't have an example for now, but it would be something like:          
>   
>                                                                               
>   
>    Thanks,                                                                    
>   
>    Rodrigo.                                                                   
>   
>                                                                               
>   
>    $ sudo ./tools/net/ynl/pyynl/cli.py \                                      
>   
>      --spec Documentation/netlink/specs/drm_ras.yaml \                        
>   
>      --dump list-nodes                                                        
>   
>    [{'device-name': '00:02.0',                                                
>   
>      'node-id': 0,                                                            
>   
>      'node-name': 'non-fatal',                                                
>   
>      'node-type': 'error-counter'},                                           
>   
>    {'device-name': '00:02.0',                                                 
>   
>      'node-id': 1,                                                            
>   
>      'node-name': 'correctable',                                              
>   
>      'node-type': 'error-counter'}                                            
>   
>    'device-name': '00:02.0',                                                  
>   
>      'node-id': 2,                                                            
>   
>      'node-name': 'non-fatal',                                                
>   
>      'node-type': 'error-logging'},                                           
>   
>    {'device-name': '00:02.0',                                                 
>   
>      'node-id': 3,                                                            
>   
>      'node-name': 'correctable',                                              
>   
>      'node-type': 'error-logging'}]                                           
>   
>                                                                               
>   
>    $ sudo ./tools/net/ynl/pyynl/cli.py \                                      
>   
>       --spec Documentation/netlink/specs/drm_ras.yaml \                       
>   
>       --dump get-logs --json '{"node-id":3}'                                  
>   
>    [{'FRU': 'String with device info', 'CPER': !@#$#!@#$},                    
>   
>    {'FRU': 'String with device info', 'CPER': !@#$#!@#$},                     
>   
>    {'FRU': 'String with device info', 'CPER': !@#$#!@#$},                     
>   
>    {'FRU': 'String with device info', 'CPER': !@#$#!@#$},                     
>   
>    {'FRU': 'String with device info', 'CPER': !@#$#!@#$}, ]                   
>   
>                                                                               
>   
>    Of course, details of the error-logging fields along with the CPER binary  
>   
>    is yet to be defined.                                                      
>   
>                                                                               
>   
>    Oh, and the nodes names and split is device specific. The infra is         
>   
>    flexible enough. Driver can do whatever it makes sense for their device.   
>   
>                                                                               
>   
>    Any feedback or comment is really appreciated.                             
>   
>                                                                               
>   
>    Thanks in advance,                                                         
>   
>    Rodrigo.                                                                   
>   
>                                                                               
>   
>    >                                                                          
>   
>    > $ sudo ./tools/net/ynl/pyynl/cli.py \                                    
>   
>    >   --spec Documentation/netlink/specs/drm_ras.yaml \                      
>   
>    >   --dump get-error-counters --json '{"node-id":1}'                       
>   
>    > [{'error-id': 0, 'error-name': 'GT Error', 'error-value': 0},            
>   
>    >  {'error-id': 4, 'error-name': 'Display Error', 'error-value': 0},       
>   
>    >  {'error-id': 8, 'error-name': 'GSC Error', 'error-value': 0},           
>   
>    >  {'error-id': 12, 'error-name': 'SG Unit Error', 'error-value': 0},      
>   
>    >  {'error-id': 16, 'error-name': 'SoC Error', 'error-value': 0},          
>   
>    >  {'error-id': 17, 'error-name': 'CSC Error', 'error-value': 0}]          
>   
>    >                                                                          
>   
>    > $ sudo ./tools/net/ynl/pyynl/cli.py \                                    
>   
>    >   --spec Documentation/netlink/specs/drm_ras.yaml \                      
>   
>    >   --do query-error-counter --json '{"node-id": 0, "error-id": 12}'       
>   
>    > {'error-id': 12, 'error-name': 'SG Unit Error', 'error-value': 0}        
>   
>    >                                                                          
>   
>    > $ sudo ./tools/net/ynl/pyynl/cli.py \                                    
>   
>    >   --spec Documentation/netlink/specs/drm_ras.yaml \                      
>   
>    >   --do query-error-counter --json '{"node-id": 1, "error-id": 16}'       
>   
>    > {'error-id': 16, 'error-name': 'SoC Error', 'error-value': 0}            
>   
>    >                                                                          
>   
>    > Thanks,                                                                  
>   
>    > Rodrigo.                                                                 
>   
>    >                                                                          
>   
>    > Cc: Hawking Zhang <[7][email protected]>                             
>   
>    > Cc: Alex Deucher <[8][email protected]>                          
>   
>    > Cc: Zack McKevitt <[9][email protected]>                 
>   
>    > Cc: Lukas Wunner <[10][email protected]>                                   
>   
>    > Cc: Dave Airlie <[11][email protected]>                                  
>   
>    > Cc: Simona Vetter <[12][email protected]>                           
>   
>    > Cc: Aravind Iddamsetty <[13][email protected]>          
>   
>    > Cc: Joonas Lahtinen <[14][email protected]>                
>   
>    > Signed-off-by: Rodrigo Vivi <[15][email protected]>                 
>   
>    >                                                                          
>   
>    > Rodrigo Vivi (2):                                                        
>   
>    >   drm/ras: Introduce the DRM RAS infrastructure over generic netlink     
>   
>    >   drm/xe: Introduce the usage of drm_ras with supported HW errors        
>   
>    >                                                                          
>   
>    >  Documentation/gpu/drm-ras.rst              | 109 +++++++                
>   
>    >  Documentation/netlink/specs/drm_ras.yaml   | 130 ++++++++               
>   
>    >  drivers/gpu/drm/Kconfig                    |   9 +                      
>   
>    >  drivers/gpu/drm/Makefile                   |   1 +                      
>   
>    >  drivers/gpu/drm/drm_drv.c                  |   6 +                      
>   
>    >  drivers/gpu/drm/drm_ras.c                  | 357 +++++++++++++++++++++  
>   
>    >  drivers/gpu/drm/drm_ras_genl_family.c      |  42 +++                    
>   
>    >  drivers/gpu/drm/drm_ras_nl.c               |  54 ++++                   
>   
>    >  drivers/gpu/drm/xe/regs/xe_hw_error_regs.h |  22 ++                     
>   
>    >  drivers/gpu/drm/xe/xe_hw_error.c           | 155 ++++++++-              
>   
>    >  include/drm/drm_ras.h                      |  76 +++++                  
>   
>    >  include/drm/drm_ras_genl_family.h          |  17 +                      
>   
>    >  include/drm/drm_ras_nl.h                   |  24 ++                     
>   
>    >  include/uapi/drm/drm_ras.h                 |  49 +++                    
>   
>    >  14 files changed, 1049 insertions(+), 2 deletions(-)  create mode       
>   
>    > 100644 Documentation/gpu/drm-ras.rst  create mode 100644                 
>   
>    > Documentation/netlink/specs/drm_ras.yaml                                 
>   
>    >  create mode 100644 drivers/gpu/drm/drm_ras.c  create mode 100644        
>   
>    > drivers/gpu/drm/drm_ras_genl_family.c                                    
>   
>    >  create mode 100644 drivers/gpu/drm/drm_ras_nl.c  create mode 100644     
>   
>    > include/drm/drm_ras.h  create mode 100644                                
>   
>    > include/drm/drm_ras_genl_family.h  create mode 100644                    
>   
>    > include/drm/drm_ras_nl.h  create mode 100644                             
>   
>    > include/uapi/drm/drm_ras.h                                               
>   
>    >                                                                          
>   
>    > --                                                                       
>   
>    > 2.51.0                                                                   
>   
>    >                                                                          
>   
> 
> References
> 
>    Visible links
>    1. mailto:[email protected]
>    2. mailto:[email protected]
>    3. https://lore.kernel.org/dri-devel/20250730064956.1385855-1-aravind.idd
>    4. mailto:[email protected]/
>    5. https://lore.kernel.org/all/4cbdfcc5-5020-a942-740e-a602d4c00cc2@linux
>    6. https://airlied.blogspot.com/2022/09/accelerators-bof-outcomes-summary
>    7. mailto:[email protected]
>    8. mailto:[email protected]
>    9. mailto:[email protected]
>   10. mailto:[email protected]
>   11. mailto:[email protected]
>   12. mailto:[email protected]
>   13. mailto:[email protected]
>   14. mailto:[email protected]
>   15. mailto:[email protected]
