On Sun, Oct 20, 2019 at 7:15 PM Subash Abhinov Kasiviswanathan
<subas...@codeaurora.org> wrote:
>
> > Hmm. Random related thought while searching for a possible cause: I
> > wonder if tcp_write_queue_purge() should clear tp->highest_sack (and
> > possibly tp->sacked_out)? The tcp_write_queue_purge() code is careful
> > to call  tcp_clear_all_retrans_hints(tcp_sk(sk)) and I would imagine
> > that similar considerations would imply that we should clear at least
> > tp->highest_sack?
> >
> > neal
>
> Hi Neal
>
> If the socket is in FIN-WAIT1, does that mean that all the segments
> corresponding to SACK blocks are sent and ACKed already?

FIN-WAIT1 just means the local application has called close() or
shutdown() to shut down the sending direction of the socket, and the
local TCP stack has sent a FIN, and is waiting to receive a FIN and an
ACK from the other side (in either order, or simultaneously). The
ASCII art state transition diagram on page 22 of RFC 793 (e.g.
https://tools.ietf.org/html/rfc793#section-3.2 ) is one source for
this, though the W. Richard Stevens books have a much more readable
diagram.

There may still be unacked and SACKed data in the retransmit queue at
this point.

> tp->sacked_out is non zero in all these crashes

Thanks, that is a useful data point. Do you know what particular value
 tp->sacked_out has? Would you be able to capture/log the value of
tp->packets_out, tp->lost_out, and tp->retrans_out as well?

> (is the SACK information possibly invalid or stale here?).

Yes, one guess would be that somehow the skbs in the retransmit queue
have been freed, but tp->sacked_out is still non-zero and
tp->highest_sack is still a dangling pointer into one of those freed
skbs. The tcp_write_queue_purge() function is one function that fees
the skbs in the retransmit queue and leaves tp->sacked_out as non-zero
and  tp->highest_sack as a dangling pointer to a freed skb, AFAICT, so
that's why I'm wondering about that function. I can't think of a
specific sequence of events that would involve tcp_write_queue_purge()
and then a socket that's still in FIN-WAIT1. Maybe I'm not being
creative enough, or maybe that guess is on the wrong track. Would you
be able to set a new bit in the tcp_sock in tcp_write_queue_purge()
and log it in your instrumentation point, to see if
tcp_write_queue_purge()  was called for these connections that cause
this crash?

thanks,
neal

Reply via email to