On 18.10.2018 07:58, Jonathan Woithe wrote:
> On Thu, Oct 18, 2018 at 01:30:51AM +0200, Francois Romieu wrote:
>> Holger Hoffstätte <hol...@applied-asynchrony.com> :
>> [...]
>>> I continued to use the BQL patch in my private tree after it was reverted
>>> and also had occasional timeouts, but *only* after I started playing
>>> with ethtool to change offload settings. Without offloads or the BQL patch
>>> everything has been rock-solid since then.
>>> The other weird problem was that timeouts would occur on an otherwise
>>> *completely idle* system. Since that occasionally borked my NFS server
>>> over night I ultimately removed BQL as well. Rock-solid since then.
>>
>> The bug will induce delayed rx processing when a spike of "load" is
>> followed by an idle period.
>
> If this is the case, I wonder whether this bug might also be the cause of
> the long reception delays we've observed at times when a period of high
> network load is followed by almost nothing[1]. That thread[2] details the
> investigations subsequently done. A git bisect showed that commit
> da78dbff2e05630921c551dbbc70a4b7981a8fff was the origin of the misbehaviour
> we were observing.
>
> We still see the problem when we test with recent kernels. It would be
> great if the underlying problem has now been identified.
>
> I can possibly scrape some hardware together to test any proposed fix under
> our workload if there was interest.
>
Proposed fix is here:
https://patchwork.ozlabs.org/patch/985014/
Would be good if you could test it. Thanks!

Heiner

> Regards
>   jonathan
>
> [1] https://marc.info/?l=linux-netdev&m=136281333207734&w=2
> [2] https://marc.info/?t=136281339500002&r=1&w=2
>