We have been trying to find documentation on how to tell Xen to forward MCE
information to the Linux kernel in dom0, so that a system administrator can
be notified when the system has bad memory. However, from what I can tell,
this has not been documented anywhere.
If anyone knows of documentation (or knows the answer) on how to monitor
corrected and uncorrected errors when running a modern Xen, it would be
appreciated.
To clarify (and for people not familiar):
When running an old Xen (for example, Xen 4.1), Linux in dom0 would load the
EDAC modules, for example amd64_edac_mod, edac_mce_amd, and edac_core.
Once the modules were loaded, the error counts for ECC memory and PCI could
be found in these "files":
/sys/devices/system/edac/mc/mc0/ce_count
/sys/devices/system/edac/mc/mc0/ue_count
/sys/devices/system/edac/pci/pci0/npe_count
/sys/devices/system/edac/pci/pci0/pe_count
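For anyone who wants to reproduce the old behaviour, here is a minimal sketch
of how those four counters could be polled. The function name
read_edac_counters and the "root" parameter are my own, added so the same code
can be pointed at a test directory. On a system where the EDAC modules are not
loaded, the files should simply be missing and are reported as unavailable.

```python
from pathlib import Path

# Counter files named above, relative to the EDAC sysfs root.
EDAC_COUNTERS = (
    "mc/mc0/ce_count",
    "mc/mc0/ue_count",
    "pci/pci0/npe_count",
    "pci/pci0/pe_count",
)


def read_edac_counters(root="/sys/devices/system/edac"):
    """Return {relative_path: int or None}; None means missing or unreadable."""
    counts = {}
    for rel in EDAC_COUNTERS:
        try:
            counts[rel] = int((Path(root) / rel).read_text().strip())
        except (OSError, ValueError):
            # File absent (modules not loaded) or content not a number.
            counts[rel] = None
    return counts


if __name__ == "__main__":
    for name, value in read_edac_counters().items():
        print(f"{name}: {'unavailable' if value is None else value}")
```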
However, in 2009-02, "cegger" wrote MCA/MCE_in_Xen, a proposal for having
Xen start checking this information itself.
At some point after that, Xen started accessing the EDAC information (now
called "MCE"), which blocks the Linux kernel in dom0 from accessing it.
(I also found what appear to be related slides from a 2012 presentation
at:
https://lkml.iu.edu/hypermail/linux/kernel/1206.3/01304/xen_vMCE_design_%28v0_2%29.pdf
)
Now, the Linux kernel compile option CONFIG_XEN_MCE_LOG=y is documented
as: "Allow kernel fetching MCE error from Xen platform and converting it into
Linux mcelog format for mcelog tools".
I imagine there must be some way to make this work on the Xen side, given
that CONFIG_XEN_MCE_LOG got into the Linux kernel and is enabled by default
in distributions.
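As a first sanity check, one can at least confirm that the dom0 kernel was
built with that option. The sketch below is only an illustration. The helper
names are made up, and the two config locations it tries (/boot/config-<release>
and /proc/config.gz) are assumptions that depend on the distribution.

```python
import gzip
import platform
from pathlib import Path


def option_state(config_text, option="CONFIG_XEN_MCE_LOG"):
    """Return the option's value (e.g. 'y'), or None if unset or absent."""
    for line in config_text.splitlines():
        line = line.strip()
        # "# CONFIG_FOO is not set" lines do not match and fall through.
        if line.startswith(option + "="):
            return line.split("=", 1)[1]
    return None


def load_running_config():
    """Try two common locations for the running kernel's config."""
    boot_cfg = Path("/boot") / f"config-{platform.release()}"
    if boot_cfg.exists():
        return boot_cfg.read_text(errors="replace")
    proc_cfg = Path("/proc/config.gz")
    if proc_cfg.exists():
        return gzip.decompress(proc_cfg.read_bytes()).decode(errors="replace")
    return None


if __name__ == "__main__":
    try:
        text = load_running_config()
    except OSError as exc:
        text = None
        print("could not read kernel config:", exc)
    if text is None:
        print("kernel config not found")
    else:
        print("CONFIG_XEN_MCE_LOG =", option_state(text))
```

This only shows that the kernel side is compiled in. It says nothing about
whether Xen is actually handing MCE data to dom0, which is the open question.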
(Notes: mcelog seems to have been replaced with "rasdaemon", but I believe
that it pulls information using the same kernel API as "mcelog". So I believe
the final report of whether you are having memory errors now comes from running
"ras-mc-ctl --errors" instead of looking in /sys/devices/system/edac/mc and
/sys/devices/system/edac/pci.)
I suspect that to check whether it is working on a modern system, one would
run "ras-mc-ctl --status" and get something implying that the Xen MCE
interface is working, instead of just "ras-mc-ctl: drivers not loaded."
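To make that check scriptable, something along these lines might do. It is a
sketch only. The only output string it recognises is the "drivers not loaded"
message quoted above. I do not know what a working Xen setup prints, so
anything else is reported as "other" and left for a human to read. The function
names are hypothetical.

```python
import shutil
import subprocess

NOT_LOADED = "drivers not loaded"  # wording quoted in this post


def classify_status(output):
    """Classify ras-mc-ctl --status output into three coarse buckets."""
    text = output.strip()
    if not text:
        return "no output"
    if NOT_LOADED in text:
        return "not loaded"
    # Unknown wording: do not guess, let the caller print it.
    return "other"


def run_status():
    """Run ras-mc-ctl --status; return combined output, or None if not installed."""
    if shutil.which("ras-mc-ctl") is None:
        return None
    proc = subprocess.run(
        ["ras-mc-ctl", "--status"], capture_output=True, text=True
    )
    return proc.stdout + proc.stderr


if __name__ == "__main__":
    out = run_status()
    if out is None:
        print("ras-mc-ctl not found in PATH")
    else:
        print(classify_status(out), "->", out.strip())
```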
Somewhere it was said that adding the Xen boot parameter "mce=1" in grub
would cause Xen to forward the info to the Linux kernel, but that conflicts
with recent changes to the documentation. Also, I tested setting "mce=1" and
nothing appeared to change.
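In case it helps anyone repeat the test, this is roughly how one could confirm
that the parameter really reached the hypervisor rather than being dropped by
the grub configuration. It assumes that "xl info" prints a "xen_commandline"
line, which is how I understand it to behave. The parsing helpers are my own
and the sketch only reports what was passed, not what Xen did with it.

```python
import shutil
import subprocess


def xen_cmdline_from_xl_info(xl_info_text):
    """Extract the value of the 'xen_commandline' field, or None if absent."""
    for line in xl_info_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "xen_commandline":
            return value.strip()
    return None


def find_option(cmdline, name="mce"):
    """Return the value of name=value ('' for a bare flag), None if absent.

    If the option appears more than once, the last occurrence is returned.
    """
    found = None
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if key == name:
            found = value if sep else ""
    return found


if __name__ == "__main__":
    if shutil.which("xl") is None:
        print("xl not found in PATH")
    else:
        proc = subprocess.run(["xl", "info"], capture_output=True, text=True)
        cmdline = xen_cmdline_from_xl_info(proc.stdout)
        if cmdline is None:
            print("xen_commandline not reported (not dom0, or not root?)")
        else:
            print("mce option:", find_option(cmdline))
```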
Any help is appreciated.