Eric Dumazet wrote on Sun, Apr 15, 2018: > Are you sure you do not have some iptables/netfilter stuff ?
I have a basic firewall setup with default rules e.g. starts with -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT in the INPUT chain... That said, I just dropped it on the server to check and that seems to workaround the issue?! When logging everything dropped it appears to decide that the connection is no longer established at some point, but only if there is tcp_timestamp, just, err, how? And certainly enough, if I restore the firewall while a connection is up that just hangs; conntrack doesn't consider it connected anymore at some point (but it worked for a while!) Here's the kind of logs I get from iptables: IN=wlp1s0 OUT= MAC=00:c2:c6:b4:7e:c7:a4:12:42:b5:5d:fc:08:00 SRC=client DST=server LEN=52 TOS=0x00 PREC=0x00 TTL=52 ID=17038 DF PROTO=TCP SPT=41558 DPT=15609 WINDOW=1212 RES=0x00 ACK URGP=0 > ss -temoi might give us more info hang ESTAB 0 81406 server:15609 client:41558 users:(("socat",pid=17818,fd=5)) timer:(on,48sec,11) uid:1000 ino:137253 sk:6a <-> skmem:(r0,rb369280,t0,tb147456,f2050,w104446,o0,bl0,d1) ts sack reno wscale:7,7 rto:15168 backoff:6 rtt:36.829/6.492 ato:40 mss:1374 pmtu:1500 rcvmss:1248 advmss:1448 cwnd:1 ssthresh:16 bytes_acked:32004 bytes_received:4189 segs_out:84 segs_in:55 data_segs_out:77 data_segs_in:18 send 298.5Kbps lastsnd:12483 lastrcv:27801 lastack:27726 pacing_rate 19.1Mbps delivery_rate 4.1Mbps busy:28492ms unacked:31 retrans:1/6 lost:31 rcv_rtt:29 rcv_space:29200 rcv_ssthresh:39184 notsent:38812 minrtt:25.152 working (tcp_timestamp=0) ESTAB 0 36 server:15080 client:32979 users:(("socat",pid=17047,fd=5)) timer:(on,226ms,0) uid:1000 ino:90917 sk:23 <-> skmem:(r0,rb369280,t0,tb1170432,f1792,w2304,o0,bl0,d3) sack reno wscale:7,7 rto:230 rtt:29.413/5.345 ato:64 mss:1386 pmtu:1500 rcvmss:1248 advmss:1460 cwnd:4 ssthresh:3 bytes_acked:17391762 bytes_received:62397 segs_out:13964 segs_in:8642 data_segs_out:13895 data_segs_in:1494 send 1.5Mbps lastsnd:4 lastrcv:5 lastack:5 pacing_rate 1.8Mbps delivery_rate 1.2Mbps busy:56718ms unacked:1 retrans:0/11 rcv_rtt:9112.95 rcv_space:29233 rcv_ssthresh:41680 minrtt:25.95 working (no iptables) ESTAB 0 0 server:61460 client:20468 users:(("socat",pid=17880,fd=5)) uid:1000 ino:129982 sk:6f <-> skmem:(r0,rb369280,t0,tb1852416,f0,w0,o0,bl0,d1) ts sack reno wscale:7,7 rto:244 rtt:43.752/7.726 ato:40 mss:1374 pmtu:1500 rcvmss:1248 advmss:1448 cwnd:10 bytes_acked:2617302 bytes_received:5441 segs_out:1929 segs_in:976 data_segs_out:1919 data_segs_in:41 send 2.5Mbps lastsnd:2734 lastrcv:2734 lastack:2705 pacing_rate 5.0Mbps delivery_rate 12.7Mbps busy:1884ms rcv_rtt:30 rcv_space:29200 rcv_ssthresh:39184 minrtt:26.156 > Really it looks like at some point, all incoming packets are shown by > tcpdump but do not reach the TCP socket anymore. > > (segs_in: might be steady, look at the d0 counter shown by ss -temoi > (dX : drop counters, sk->sk_drops) segs_in does not increase with replays; the d1 seems stable. > While running your experiment, try on the server. > > perf record -a -g -e skb:kfree_skb sleep 30 > perf report While I understand what that should do, I am not sure why I do not get any graph so that doesn't help tell what called kfree_skb and thus what decided to drop the packet (although we no longer really need that now..) perf script just shows kfree_skb e.g. swapper 0 [001] 237244.869321: skb:kfree_skb: skbaddr=0xffff8800360fda00 protocol=2048 location=0xffffffff817a1a77 9458e3 kfree_skb (/usr/lib/debug/lib/modules/4.16.0-300.fc28.x86_64/vmlinux) --- So I guess that ultimately the problem is why conntrack suddenly decides that an established connection suddenly isn't anymore, despite being listed as established by ss.. I'm discovering `conntrack(8)`, but what strikes me as interesting is that even that points at the connection being established (looking at a new connection after iptables started dropping packets) # conntrack -L | grep 21308 tcp 6 267 ESTABLISHED src=server dst=client sport=21308 dport=37552 src=client dst=server sport=37552 dport=21308 [ASSURED] mark=0 use=1 compared to another that isn't dropped (the old connection without tcp_timestamp) tcp 6 299 ESTABLISHED src=server dst=client sport=15080 dport=32979 src=client dst=server sport=32979 dport=15080 [ASSURED] mark=0 use=1 The expect/dying/unconfirmed tables all are empty. . . . Oh, there is something interesting there, the connection doesn't come up with -G? working: conntrack -G --protonum tcp --src server --dst client --sport 15080 --dport 32979 tcp 6 299 ESTABLISHED src=server dst=client sport=15080 dport=32979 src=client dst=server sport=32979 dport=15080 [ASSURED] mark=0 use=3 conntrack v1.4.4 (conntrack-tools): 1 flow entries have been shown. hang: # conntrack -G --protonum tcp --src server --dst client --sport 21308 --dport 37552 conntrack v1.4.4 (conntrack-tools): 0 flow entries have been shown. So something happened that makes it show up in -L (table dump) but not when querying...? And only when there is enough traffic: I have previously kept such a connection without workaround for hours just fine as long as I made sure not to display more than a screen at a time. Thanks again, -- Dominique Martinet | Asmadeus