On Sun, 31 May 2020 15:06:28 +0300 Boris Pismenny wrote:
> On 30/05/2020 0:50, Jakub Kicinski wrote:
> > On Fri, 29 May 2020 20:44:29 +0000 Saeed Mahameed wrote:
> >>> I thought you said that resync requests are guaranteed to never
> >>> fail?
> >>
> >> I didn't say that :), maybe Tariq did say this before my review,
> >
> > Boris ;)
>
> I didn't say we are perfect, just that we can make a trade-off here,
> and currently this is the simplest version that our team came up with
> for this series. As a first step, I think it is reasonable, but I
> expect that we will improve it in the future.
>
> >> but basically with the current mlx5 arch it is impossible to
> >> guarantee this unless we open one service queue per kTLS offload,
> >> and that is going to be overkill!
>
> I disagree, there are many ways to guarantee reliability here. For
> example, we can sleep/spin until there is space in the queue, or rely
> on work stealing to let a later operation execute this one.
>
> > IIUC every OOO packet causes a resync request in your
> > implementation - is that true?
>
> No, only header loss. We never required a resync per OOO packet. I'm
> not sure why you would think that.
I mean until the device is back in sync every frame kicks off
resync_update_sn() and tries to queue the work, right?

> > It'd be great to have more information about the operation of the
> > device in the commit message...
>
> I'll try to clarify the resync flow here.
> As always, the packet that requires resync is marked as such in the
> CQE. However, unlike previous devices, the TCP sequence (tcpsn) where
> the HW found a header is not provided in the CQE. Instead, tcpsn is
> queried from the HW asynchronously by the driver. We employ the force
> resync approach to guarantee that we can log all resync locations
> between the received packet and the HW query response. We check the
> asynchronous HW query response against all resync values between the
> packet that triggered the resync and now. If one of them matches,
> then resync can be completed immediately. Otherwise, the driver keeps
> waiting for the correct resync.

Thanks, makes sense.

> >> This is a rare corner case anyway, where more than 1k TCP
> >> connections sharing the same RX ring will request resync at the
> >> same exact moment.
>
> > IDK about that. Certain applications are architected for max
> > capacity, not efficiency under steady load. So it matters a lot how
> > the system behaves under stress. What if this is the chain of
> > events:
> >
> > overload -> drops -> TLS streams go out of sync -> all try to resync
>
> I agree that this is not that rare, and it may be improved both in
> future patches and in hardware. Do you think it is critical to improve
> it now, and not in a follow-up series?

It's not a blocker for me, although if this makes it into 5.8 there
will not be a chance to improve it before net-next closes, so it
depends on whether you want to risk it and support the code as is...

> > We don't want to add extra load on every record if HW offload is
> > enabled. That's why the next record hint backs off, checks socket
> > state, etc.
> >
> > BTW I also don't understand why mlx5e_ktls_rx_resync() has a
> > tls_offload_rx_force_resync_request(sk) at the end. If the update
> > from the NIC comes with a later seq than current, request the sync
> > for _that_ seq. I don't understand the need to force a call back on
> > every record here.
>
> The extra load here is minimal, i.e. a single call from TLS to the
> driver, which usually just logs the information.

Oh yes, this one is not about extra load, I just don't know what that
code is trying to achieve.

> > Also if the sync failed because the queue was full, I don't see how
> > forcing another sync attempt for the next record is going to match?
>
> It doesn't, and if the sync failed then we should stop trying to force
> resync.
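FWIW, to make sure I'm reading the resync flow you describe above
correctly, here is a minimal sketch of the candidate-logging idea as I
picture it. All names are invented for illustration (this is not the
actual mlx5e code), and locking and TCP sequence-number wrap-around are
ignored:

        #include <linux/types.h>

        #define RESYNC_LOG_SIZE 32              /* assumed depth */

        struct resync_log {
                u32 record_sn[RESYNC_LOG_SIZE]; /* record-start seqs seen while
                                                 * the HW query is in flight */
                unsigned int head;
                bool pending;                   /* async tcpsn query outstanding */
        };

        /* Log every record start seen while resync is pending; this is
         * what the forced resync callback from TLS would feed. */
        static void resync_log_record(struct resync_log *log, u32 tcpsn)
        {
                if (!log->pending)
                        return;
                log->record_sn[log->head % RESYNC_LOG_SIZE] = tcpsn;
                log->head++;
        }

        /* The asynchronous HW query response carries the TCP seq at which
         * the HW found a record header.  A match against any logged record
         * start means resync can complete immediately; otherwise keep
         * waiting for a later response/record. */
        static bool resync_try_complete(struct resync_log *log, u32 hw_tcpsn)
        {
                unsigned int i, valid;

                valid = log->head < RESYNC_LOG_SIZE ? log->head : RESYNC_LOG_SIZE;
                for (i = 0; i < valid; i++) {
                        if (log->record_sn[i] == hw_tcpsn) {
                                log->pending = false;
                                return true;    /* program the HW with hw_tcpsn */
                        }
                }
                return false;
        }

If that matches the intended operation, something along these lines in
the commit message would help future readers.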