On 11/16/2016 8:25 PM, Bjorn Helgaas wrote:
> Hi Yishai,
>
> Johannes has been working on an mlx4 initialization problem on an
> IBM x3850 X6. The underlying problem is a PCI core issue -- we're
> setting RCB in the Mellanox device, which means it thinks it can
> generate 128-byte Completions, even though the Root Port above it
> can't handle them. That issue is
> https://bugzilla.kernel.org/show_bug.cgi?id=187781
>
> The machine crashed when this happened, apparently not because of any
> error reported via AER, but because mlx4 contains a BUG_ON, probably
> the one in mlx4_enter_error_state().
>
> That one happens if pci_channel_offline() returns false. Is this
> telling us about a problem in PCI error handling, or is it just a case
> where mlx4 isn't as smart as it could be?

Yes. At that step we expect a problem/bug in the PCI layer that should
be fixed (e.g. the channel is reported as online when it is really
offline). Can you please evaluate and confirm that?