On 2021-02-25 13:18, Vladimir Oltean wrote:
From: Vladimir Oltean <vladimir.olt...@nxp.com>
Michael reports that since linux-next-20210211, the AER messages for
ECC errors have started reappearing, and this time they can be reliably
reproduced with the first ping on one of his LS1028A boards.
$ ping 1[ 33.258069] pcieport 0000:00:1f.0: AER: Multiple Corrected error received: 0000:00:00.0
72.16.0.1
PING [ 33.267050] pcieport 0000:00:1f.0: AER: can't find device of ID0000
172.16.0.1 (172.16.0.1): 56 data bytes
64 bytes from 172.16.0.1: seq=0 ttl=64 time=17.124 ms
64 bytes from 172.16.0.1: seq=1 ttl=64 time=0.273 ms
$ devmem 0x1f8010e10 32
0xC0000006
It isn't clear why this is necessary, but it seems that for the errors
to go away, we must clear the entire RFS and RSS memory, not just for
the ports in use.
Sadly the code is structured in such a way that we can't have unified
logic for the used and unused ports. For the minimal initialization of
an unused port, we just need to enable and ioremap the PF memory space,
and set up a control buffer descriptor ring (CBDR). Unused ports must
then free the CBDR because the driver will exit, but used ports cannot
pick up from where that code path left off, since the CBDR API does not
reinitialize a ring when setting it up, so its producer and consumer
indices are out of sync between the software and hardware state. So a
separate enetc_init_unused_port function was created, and it gets
called right after the PF memory space is enabled.
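To illustrate the index mismatch described above, here is a toy model in C.
It is not the enetc driver code: struct ring_model, ring_send, ring_sw_setup
and ring_full_reset are hypothetical names made up for this sketch, and the
"hardware" is simply assumed to consume one descriptor per command at its own
consumer index.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define RING_SIZE 8u

/* Hypothetical model of a control BD ring shared by software and hardware. */
struct ring_model {
	uint32_t slots[RING_SIZE]; /* descriptor memory, 0 means "empty" */
	uint32_t hw_ci;            /* consumer index kept by the hardware */
	uint32_t sw_pi;            /* producer index kept by the software */
};

/* Reset both sides of the ring, so software and hardware agree again. */
void ring_full_reset(struct ring_model *r)
{
	memset(r, 0, sizeof(*r));
}

/* Setup that only resets the software index (assumed behaviour for this
 * sketch); the hardware index keeps whatever value it had before.
 */
void ring_sw_setup(struct ring_model *r)
{
	r->sw_pi = 0;
}

/* Software posts one command, then the modelled hardware consumes the
 * descriptor at its own index. Returns what the hardware actually saw.
 */
uint32_t ring_send(struct ring_model *r, uint32_t cmd)
{
	uint32_t seen;

	r->slots[r->sw_pi] = cmd;
	r->sw_pi = (r->sw_pi + 1) % RING_SIZE;

	seen = r->slots[r->hw_ci];
	r->slots[r->hw_ci] = 0;
	r->hw_ci = (r->hw_ci + 1) % RING_SIZE;

	return seen;
}
```

In this model, after three commands followed by ring_sw_setup(), the next
command is written to slot 0 while the modelled hardware reads the empty
slot 3, so the command is never seen. After ring_full_reset() the indices
agree again.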
Note that we need access from enetc_pf.c to the CBDR creation and
deletion methods, which were for some reason put in enetc.c. While
changing their definitions to be non-static, also move them to
enetc_cbdr.c which seems like a better place to hold these.
Fixes: 07bf34a50e32 ("net: enetc: initialize the RFS and RSS memories")
Reported-by: Michael Walle <mich...@walle.cc>
Cc: Jesse Brandeburg <jesse.brandeb...@intel.com>
Signed-off-by: Vladimir Oltean <vladimir.olt...@nxp.com>
I have had this patch in my tree for a while now. As we've learned, the
error only occurs in a particular power-up state. So take this with a
grain of salt: I haven't seen the error since, despite multiple
power cycles. Thus:
Tested-by: Michael Walle <mich...@walle.cc>