Re: DRM_RAS for CPER Error logging?!

Rodrigo Vivi Mon, 10 Nov 2025 12:35:44 -0800

On Mon, Nov 10, 2025 at 01:34:22PM +1000, Dave Airlie wrote:
> On Thu, 6 Nov 2025 at 23:16, Rodrigo Vivi <[email protected]> wrote:
> >
> > On Wed, Oct 29, 2025 at 02:00:38AM +0000, Zhang, Hawking wrote:
> > >    [AMD Official Use Only - AMD Internal Distribution Only]
> > >    + @Zhou1, Tao and @Liu, Xiang(Dean) for the awareness.
> > >
> > >    RE - AMD folks, would you consider this to replace the current debugfs
> > >    you have?
> > >
> > >    [Hawking]:
> > >
> > >    Replacing the debugfs is not the primary concern.
> >
> > My initial plan was to go with debugfs like you are doing, but
> > I keep hearing complaints that debugfs is not global and we need
> > to take into account some cases where debugfs is not available
> > in production images.
> >
> > > The main concern is
> > >    whether drm_ras can effectively support the necessary RAS information
> > >    for all device vendors, as this largely depends on the design of the
> > >    hardware and firmware.
> >
> > I fully agree. This is the main reason I'm doing my best to make drm-ras
> > as generic and extensible as possible.
> >
> > node registration with different node types, and names.
> >
> > I imagined something like:
> >
> > [{'FRU': 'String with device info', 'CPER': !@#$#!@#$},
> >
> > based on the format that the current non-standard-cper tracefs uses, with
> > the FRU + CPER. But we could avoid the FRU and make the FRU as node name.
> >
> > >
> > >    AMD is currently evaluating the proposed interface for error logging.
> >
> > The design of the details and the implementation is pretty much open for
> > discussion at this point.
> >
> > What I'm really looking forward to is:
> >
> > to know if the path is acceptable overall
> > even if different drivers are opting for different node types?
> >
> > Is there any blocker on using this drm-ras/netlink for the CPER?
> 
> sorry for delay on this, I just had to read what CPER was :-)
> 
> I'm not offended by the idea of using tracefs here,


Right, that was my first thought as well.
Perhaps we simply use the

log_non_standard_event(sec_type, fru_id, fru_text, sec_sev, cper_data, cper_length)

provided directly by drivers/ras/ras.c

But one limitation with that is that it only flows one way: HW/FW -> Kernel -> User Space.

There is no way for user space to query for the current/last log available.

I mean, we would only generate the CPER when passing a certain threshold, to
avoid a flood in case of a memory error storm. So, in this case, the user needs
a way to query the most recent log.
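
To make the threshold idea concrete, here is a rough user-space sketch in C.
Every name in it (struct ras_err_log, ras_err_record, ras_err_query_last,
RAS_LOG_MAX) is made up for illustration and is not part of drm-ras or
drivers/ras. It only assumes that a record is stored once an error count
reaches a threshold and that the last stored record stays available for a
later query.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define RAS_LOG_MAX 256 /* hypothetical cap on a stored record */

/* Hypothetical per-node state: error count plus the last stored record. */
struct ras_err_log {
	uint32_t count;             /* errors seen since the last stored record */
	uint32_t threshold;         /* store a record once count reaches this */
	uint8_t  last[RAS_LOG_MAX]; /* most recent record, kept for queries */
	size_t   last_len;          /* 0 means nothing has been stored yet */
};

/*
 * Account for one error. Returns 1 when the threshold was reached and the
 * record was stored (the point where an event would be emitted), else 0.
 */
static int ras_err_record(struct ras_err_log *log, const uint8_t *cper,
			  size_t len)
{
	if (++log->count < log->threshold)
		return 0;
	if (len > RAS_LOG_MAX)
		len = RAS_LOG_MAX; /* truncate oversized records */
	memcpy(log->last, cper, len);
	log->last_len = len;
	log->count = 0;
	return 1;
}

/* Query path: copy out the most recent stored record, if any. */
static size_t ras_err_query_last(const struct ras_err_log *log, uint8_t *buf,
				 size_t buflen)
{
	size_t n = log->last_len < buflen ? log->last_len : buflen;

	memcpy(buf, log->last, n);
	return n; /* bytes copied; 0 if no record exists yet */
}
```

The query function is the piece that a push-only event path lacks: it lets
user space ask for the last record at any time, independent of when the
threshold was crossed.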

I believe it gets a bit ugly if we tell the admin that, in order to get the
most recent CPER log, they need to query the error counter through netlink, and
that on every single error counter query we also emit the tracefs event.

Then I thought about using netlink to query the CPER, but with a separate
node, exclusively for the error log, instead of abusing the error counter API.

But if you believe it is okay to emit a tracefs event on every counter check,
then we can take that path.

> I definitely think
> debugfs is a bad idea coming from the enterprise distro land where we
> don't like having it.

Yep, this is why I thought that AMD was trying to find alternatives to
their debugfs solution. But the debugfs solution does offer the ability
to query...

> 
> I'm ccing a few other people that might have opinions on exposing CPER
> compatible logs for RAS purposes from devices, I assume there might be
> more than GPUs wanting to do something like this,

Thank you!

> 
> Dave.
