[Qemu-devel] [Bug 1307225] Re: Running a virtual machine on a Haswell system produces machine check events

Sander Brandenburg Fri, 26 Sep 2014 07:26:30 -0700

I think this is related to Haswell erratum HSW131 in the 'Intel® Xeon®
Processor E3-1200 v3 Product Family Specification Update' at:
http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/xeon-e3-1200v3-spec-update.pdf


  HSW131. Spurious Corrected Errors May be Reported
  Problem: Due to this erratum, spurious corrected errors may be logged in
    the IA32_MC0_STATUS register with the valid field (bit 63) set, the
    uncorrected error field (bit 61) not set, a Model Specific Error Code
    (bits [31:16]) of 0x000F, and an MCA Error Code (bits [15:0]) of 0x0005.
    If CMCI is enabled, these spurious corrected errors also signal
    interrupts.
  Implication: When this erratum occurs, software may see corrected errors
    that are benign. These corrected errors may be safely ignored.
  Workaround: None identified.
  Status: For the steppings affected, see the Summary Table of Changes.
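
To make the comparison concrete, here is a minimal Python sketch that checks
a status value against the bit pattern described in the erratum text above.
The function and constant names are my own (hypothetical), and the bit
positions are taken only from the quoted erratum. It is applied to the STATUS
value 90000040000f0005 from the mcelog output in the bug description below,
which was logged for BANK 0.

```python
# Bit positions as given in the quoted HSW131 text; names are illustrative.
VALID_BIT = 1 << 63        # "valid field (bit 63)"
UNCORRECTED_BIT = 1 << 61  # "uncorrected error field (bit 61)"


def decode_mc_status(status):
    """Split a 64-bit MCi_STATUS value into the fields the erratum mentions."""
    return {
        "valid": bool(status & VALID_BIT),
        "uncorrected": bool(status & UNCORRECTED_BIT),
        "model_specific_code": (status >> 16) & 0xFFFF,  # bits [31:16]
        "mca_error_code": status & 0xFFFF,               # bits [15:0]
    }


def matches_hsw131(status):
    """True if the value has the signature described in erratum HSW131."""
    fields = decode_mc_status(status)
    return (
        fields["valid"]
        and not fields["uncorrected"]
        and fields["model_specific_code"] == 0x000F
        and fields["mca_error_code"] == 0x0005
    )


if __name__ == "__main__":
    # STATUS value copied from the mcelog output in the bug description.
    status = 0x90000040000F0005
    print(decode_mc_status(status))
    print("matches HSW131 signature:", matches_hsw131(status))
```

The erratum only concerns IA32_MC0_STATUS, so a check like this is only
meaningful for events reported for bank 0.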


I propose working around this by booting the host kernel with
mce=ignore_ce, since this is a spurious 'corrected error'.
From Documentation/x86/x86_64/boot-options.txt:
   mce=ignore_ce
                Disable features for corrected errors, e.g. polling timer
                and CMCI.  All events reported as corrected are not cleared
                by OS and remained in its error banks.
                Usually this disablement is not recommended, however if
                there is an agent checking/clearing corrected errors
                (e.g. BIOS or hardware monitoring applications), conflicting
                with OS's error handling, and you cannot deactivate the agent,
                then this option will be a help.

But I have not tried this yet.
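
For anyone who does try it, a small Python sketch for confirming that the
option actually reached the running kernel. It only parses the kernel
command line (normally readable at /proc/cmdline on Linux); the helper names
are hypothetical, and it does not verify that the kernel honoured the option.

```python
def mce_boot_options(cmdline):
    """Return the values of every mce=... token on a kernel command line."""
    prefix = "mce="
    return [tok[len(prefix):] for tok in cmdline.split() if tok.startswith(prefix)]


def ignore_ce_requested(cmdline):
    """True if mce=ignore_ce appears on the given kernel command line."""
    return "ignore_ce" in mce_boot_options(cmdline)


if __name__ == "__main__":
    try:
        # /proc/cmdline holds the command line the running kernel booted with.
        with open("/proc/cmdline") as f:
            current = f.read()
    except OSError:
        current = ""  # not on Linux, or /proc is unavailable
    print("mce=ignore_ce requested:", ignore_ce_requested(current))
```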

-- 
You received this bug notification because you are a member of
qemu-devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1307225

Title:
  Running a virtual machine on a Haswell system produces machine check
  events

Status in QEMU:
  New

Bug description:
  I'm running a virtual Windows SBS 2003 installation on a Xeon E3
  Haswell system running Gentoo Linux. First, I used Qemu 1.5.3 (the
  latest stable version on Gentoo). I got a lot of machine check events
  ("mce: [Hardware Error]: Machine check events logged") in dmesg that
  always looked like this (decoded with mcelog):

  Hardware event. This is not a software error.
  MCE 0
  CPU 3 BANK 0
  TIME 1397455091 Mon Apr 14 07:58:11 2014
  MCG status:
  MCi status:
  Corrected error
  Error enabled
  MCA: Internal parity error
  STATUS 90000040000f0005 MCGSTATUS 0
  MCGCAP c09 APICID 6 SOCKETID 0
  CPUID Vendor Intel Family 6 Model 60

  I found this discussion on the vmware community:
  https://communities.vmware.com/thread/452344

  It seems that this is (at least partly) caused by the Qemu machine. I
  switched to Qemu 1.7.0, the first version to use "pc-i440fx-1.7". With
  this version, the errors almost disappeared, but from time to time, I
  still get machine check events. In any case, they do not seem to affect
  either the VM or the host.

  The Haswell machine has been set up and running for several days
  without a single error message. They only appear when the VM is
  running, so I think this is actually some problem with the Haswell
  architecture (and not a real hardware error).
