On Sun, Oct 20, 2019 at 7:15 PM Subash Abhinov Kasiviswanathan
<subas...@codeaurora.org> wrote:
>
> > Hmm. Random related thought while searching for a possible cause: I
> > wonder if tcp_write_queue_purge() should clear tp->highest_sack (and
> > possibly tp->sacked_out)? The tcp_write_queue_purge() code is careful
> > to call tcp_clear_all_retrans_hints(tcp_sk(sk)) and I would imagine
> > that similar considerations would imply that we should clear at least
> > tp->highest_sack?
> >
> > neal
>
> Hi Neal
>
> If the socket is in FIN-WAIT1, does that mean that all the segments
> corresponding to SACK blocks are sent and ACKed already?

FIN-WAIT1 just means the local application has called close() or
shutdown() to shut down the sending direction of the socket, and the
local TCP stack has sent a FIN, and is waiting to receive a FIN and an
ACK from the other side (in either order, or simultaneously). The
ASCII art state transition diagram on page 22 of RFC 793 (e.g.
https://tools.ietf.org/html/rfc793#section-3.2 ) is one source for
this, though the W. Richard Stevens books have a much more readable
diagram.

There may still be unacked and SACKed data in the retransmit queue at
this point.

> tp->sacked_out is non zero in all these crashes

Thanks, that is a useful data point. Do you know what particular value
tp->sacked_out has? Would you be able to capture/log the values of
tp->packets_out, tp->lost_out, and tp->retrans_out as well?

> (is the SACK information possibly invalid or stale here?).

Yes, one guess would be that somehow the skbs in the retransmit queue
have been freed, but tp->sacked_out is still non-zero and
tp->highest_sack is still a dangling pointer into one of those freed
skbs. AFAICT, tcp_write_queue_purge() is one function that frees the
skbs in the retransmit queue while leaving tp->sacked_out non-zero and
tp->highest_sack as a dangling pointer to a freed skb, so that's why
I'm wondering about that function.

I can't think of a specific sequence of events that would involve
tcp_write_queue_purge() and then a socket that's still in FIN-WAIT1.
Maybe I'm not being creative enough, or maybe that guess is on the
wrong track.

Would you be able to set a new bit in the tcp_sock in
tcp_write_queue_purge() and log it at your instrumentation point, to
see if tcp_write_queue_purge() was called for the connections that
cause this crash?

thanks,
neal
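
To make the guess above concrete, here is a minimal userspace sketch of
the suspected failure mode and of the "set a bit in the purge path"
instrumentation idea. It is only a toy model, not kernel code: struct
model_sock, struct model_skb, the purge_seen flag, and every model_*
function are hypothetical names invented for illustration. Only the
fields highest_sack and sacked_out are named after the tcp_sock fields
discussed in this thread, and the clear_sack_state switch stands in for
the possible fix being wondered about, not for current kernel behavior.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for an skb on the retransmit queue. */
struct model_skb {
	struct model_skb *next;
};

/* Hypothetical stand-in for the few tcp_sock fields discussed here. */
struct model_sock {
	struct model_skb *rtx_head;     /* models the retransmit queue */
	struct model_skb *highest_sack; /* models tp->highest_sack */
	unsigned int sacked_out;        /* models tp->sacked_out */
	bool purge_seen;                /* hypothetical debug bit */
};

/* Push one freshly allocated skb onto the modeled retransmit queue. */
static bool model_enqueue(struct model_sock *sk)
{
	struct model_skb *skb = malloc(sizeof(*skb));

	if (!skb)
		return false;
	skb->next = sk->rtx_head;
	sk->rtx_head = skb;
	return true;
}

/* Pretend the peer SACKed the skb at the head of the queue. */
static void model_sack_head(struct model_sock *sk)
{
	if (!sk->rtx_head)
		return;
	sk->highest_sack = sk->rtx_head;
	sk->sacked_out++;
}

/*
 * Free every queued skb and record that a purge happened.
 * With clear_sack_state == false, the SACK bookkeeping is left alone,
 * which models the suspected problem: highest_sack now dangles.
 */
static void model_purge(struct model_sock *sk, bool clear_sack_state)
{
	struct model_skb *skb = sk->rtx_head;

	while (skb) {
		struct model_skb *next = skb->next;

		free(skb);
		skb = next;
	}
	sk->rtx_head = NULL;
	sk->purge_seen = true;

	if (clear_sack_state) {
		sk->highest_sack = NULL;
		sk->sacked_out = 0;
	}
}

/* What an instrumentation point could check: empty queue, SACK count left. */
static bool model_sack_state_is_stale(const struct model_sock *sk)
{
	return sk->rtx_head == NULL && sk->sacked_out != 0;
}
```

In this model, purging with clear_sack_state set to false leaves
model_sack_state_is_stale() true and purge_seen set, which is the
signature the requested logging would be looking for. Purging with
clear_sack_state set to true leaves no stale state behind.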