Hello,
This is Bae from Samsung Electronics.

We have a problem where a packet is discarded during the TCP 3-way handshake.
It looks like Mr. Dumazet already tried to fix a similar issue in this patch:
https://android.googlesource.com/kernel/common/+/5e0724d027f0548511a2165a209572d48fe7a4c8

But we are still facing another corner case.

The problem needs the following preconditions:
(1) the last ACK packet of the 3-way handshake and the next packet arrive at
almost the same time
(2) the next packet, the first data packet, is fragmented
(3) RPS is enabled


[tcp dump]
No.     A-Time         Source     Destination  Len   Seq  Info 
 1  08:35:18.115259  193.81.6.70  10.217.0.47  84     0   [SYN] Seq=0 Win=21504 Len=0 MSS=1460
 2  08:35:18.115888  10.217.0.47  193.81.6.70  84     0   6100 → 5063 [SYN, ACK] Seq=0 Ack=1 Win=29200 Len=0 MSS=1460
 3  08:35:18.142385  193.81.6.70  10.217.0.47  80     1   5063 → 6100 [ACK] Seq=1 Ack=1 Win=21504 Len=0
 4  08:35:18.142425  193.81.6.70  10.217.0.47  1516       Fragmented IP protocol (proto=Encap Security Payload 50, off=0, ID=6e24) [Reassembled in #5]
 5  08:35:18.142449  193.81.6.70  10.217.0.47  60     1   5063 → 6100 [ACK] Seq=1 Ack=1 Win=21504 Len=1460 [TCP segment of a reassembled PDU]
 6  08:35:21.227070  193.81.6.70  10.217.0.47  1516       Fragmented IP protocol (proto=Encap Security Payload 50, off=0, ID=71e9) [Reassembled in #7]
 7  08:35:21.227191  193.81.6.70  10.217.0.47  60     1   [TCP Retransmission] 5063 → 6100 [ACK] Seq=1 Ack=1 Win=21504 Len=1460
 8  08:35:21.228822  10.217.0.47  193.81.6.70  80     1   6100 → 5063 [ACK] Seq=1 Ack=1461 Win=32120 Len=0

- The last ACK packet of the handshake (No.3) and the next data packet
(No.4, 5) arrived just 40us apart.
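
The timing in the capture can be checked directly from the A-Time column. The
snippet below is only a small sketch; the variable names are mine, and the
timestamps are copied from packets No.3, No.4, No.5 and No.7 of the dump.

```python
from datetime import datetime, timedelta

def parse_time(text):
    """Parse an A-Time value such as '08:35:18.142385'."""
    return datetime.strptime(text, "%H:%M:%S.%f")

t_last_ack = parse_time("08:35:18.142385")    # No.3, last ACK of the handshake
t_first_frag = parse_time("08:35:18.142425")  # No.4, first fragment of the data
t_data = parse_time("08:35:18.142449")        # No.5, reassembled data packet
t_retrans = parse_time("08:35:21.227191")     # No.7, retransmitted data packet

# Gap between the last ACK and the first fragment of the data packet.
ack_to_data_gap = t_first_frag - t_last_ack

# Time until the same data arrives again after the original was dropped.
retransmit_delay = t_retrans - t_data

print("ACK to data gap (us):", ack_to_data_gap.microseconds)
print("retransmit delay (s):", retransmit_delay.total_seconds())
```

The first value is the 40us window in which the race happens; the second is
the delay the application sees when the data packet loses the race.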


[kernel log]
- stage 1 
<3>[ 1037.669229] I[0:  system_server: 3778] get_rps_cpu: skb(64), check hash value:3412396090
<3>[ 1037.669261] I[0:  system_server: 3778] get_rps_cpu: skb(1500), check hash value:158575680
<3>[ 1037.669285] I[0:  system_server: 3778] get_rps_cpu: skb(44), check hash value:158575680
- stage 2 
<3>[ 1037.669541] I[1: Binder:3778_13: 8391] tcp_v4_rcv: Enter! skb(seq:A93E087B, len:1480)
<3>[ 1037.669552] I[2:Jit thread pool:12990] tcp_v4_rcv: Enter! skb(seq:A93E087B, len:20)
<3>[ 1037.669564] I[2:Jit thread pool:12990] tcp_v4_rcv: check sk_state:12 skb(seq:A93E087B, len:20)
<3>[ 1037.669585] I[2:Jit thread pool:12990] tcp_check_req, Enter!: skb(seq:A93E087B, len:20)
<3>[ 1037.669612] I[1: Binder:3778_13: 8391] tcp_v4_rcv: check sk_state:12 skb(seq:A93E087B, len:1480)
<3>[ 1037.669625] I[1: Binder:3778_13: 8391] tcp_check_req, Enter!: skb(seq:A93E087B, len:1480)
<3>[ 1037.669653] I[2:Jit thread pool:12990] tcp_check_req, skb(seq:A93E087B, len:20), own_req:1
<3>[ 1037.669668] I[1: Binder:3778_13: 8391] tcp_check_req, skb(seq:A93E087B, len:1480), own_req:0
<3>[ 1037.669708] I[2:Jit thread pool:12990] tcp_rcv_state_process, Established: skb(seq:A93E087B, len:20)
<3>[ 1037.669724] I[1: Binder:3778_13: 8391] tcp_v4_rcv: discard_relse skb(seq:A93E087B, len:1480)

- stage 1
Because the data packet was fragmented (No.4 & 5), RPS hashed it to a
different core (cpu1) from the last ACK packet (cpu2).
So the last ACK and the data packet were handled on different cores almost
simultaneously, in the NEW_SYN_RECV state.
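
As a rough illustration of stage 1, the sketch below shows how two packets of
the same connection can get different RPS hashes. It is a toy model, not the
kernel's flow dissector: toy_flow_hash(), toy_rps_cpu() and NUM_CPUS are
hypothetical, and it assumes that fragments are hashed on the addresses only
while the unfragmented ACK is hashed with its ports as well. The real packets
here are ESP-encapsulated, so the actual hash inputs differ.

```python
import hashlib

NUM_CPUS = 4  # assumption: number of CPUs enabled in the RPS mask

def toy_flow_hash(src, dst, sport=None, dport=None):
    """Toy stand-in for the skb flow hash; ports are None when not parsed."""
    key = f"{src}|{dst}|{sport}|{dport}".encode()
    return int.from_bytes(hashlib.sha256(key).digest()[:4], "big")

def toy_rps_cpu(flow_hash):
    """Toy stand-in for the CPU selection: map the hash onto a CPU index."""
    return flow_hash % NUM_CPUS

# Last ACK of the handshake: unfragmented, so the ports are part of the key.
ack_hash = toy_flow_hash("193.81.6.70", "10.217.0.47", 5063, 6100)

# Both fragments of the data packet: hashed on the addresses only.
frag1_hash = toy_flow_hash("193.81.6.70", "10.217.0.47")
frag2_hash = toy_flow_hash("193.81.6.70", "10.217.0.47")

print("ack   hash/cpu:", ack_hash, toy_rps_cpu(ack_hash))
print("frag1 hash/cpu:", frag1_hash, toy_rps_cpu(frag1_hash))
print("frag2 hash/cpu:", frag2_hash, toy_rps_cpu(frag2_hash))
```

This matches the shape of the stage 1 log: the two fragments share one hash
value and the 64-byte ACK has a different one, so they may be steered to
different cores.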

- stage 2, cpu2
One of them is handled in tcp_check_req() slightly earlier, so it gets
own_req == true from tcp_v4_syn_recv_sock() and a valid nsk is returned.
The connection then goes to the ESTABLISHED state.

- stage 2, cpu1
But the other, later one gets own_req == false, and NULL is returned for nsk
because own_req is false in inet_csk_complete_hashdance().
So the earlier packet is handled successfully but the later one is discarded.
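
The stage 2 behaviour can be modelled in a few lines. This is a simplified
sketch of the race as described here, not the kernel code: ToyRequestSock,
toy_check_req() and toy_rcv() are hypothetical names, and the two CPUs are
modelled as two calls made one after the other in a chosen order.

```python
class ToyRequestSock:
    """Toy request socket shared by both packets in NEW_SYN_RECV state."""
    def __init__(self):
        self.claimed = False  # True once a child socket has been created

def toy_check_req(req):
    """Return (nsk, own_req): only the first caller owns the request."""
    if not req.claimed:
        req.claimed = True
        return "child_sk", True
    # The later caller gets own_req == False and no child socket.
    return None, False

def toy_rcv(req, pkt):
    """Toy receive path: a packet without a child socket is discarded."""
    nsk, own_req = toy_check_req(req)
    if nsk is None:
        return f"{pkt}: discarded (own_req={int(own_req)})"
    return f"{pkt}: ESTABLISHED (own_req={int(own_req)})"

def run(order):
    """Process the packets in the given order against one request socket."""
    req = ToyRequestSock()
    return [toy_rcv(req, pkt) for pkt in order]

print(run(["ack", "data"]))  # the data packet loses the race
print(run(["data", "ack"]))  # the ACK loses the race
```

Whichever packet reaches the request socket second is dropped, so the result
depends only on the order in which the two cores get there.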

At this point either the ACK or the data packet can be discarded, depending on
scheduling timing (we saw both cases).
If the ACK is discarded, that's ok: the TCP state goes to ESTABLISHED via the
ACK piggybacked on the data packet, and the payload is delivered to the upper
layer.
But if the data packet is discarded, the client does not receive the payload
it should.
This is the problem we faced.


Although the server retransmitted the dropped packet (No.6, 7), this added a
delay of a few seconds (about 3 seconds in this capture).
Since this problem occurred during IMS call setup, it shows up as a call
connection delay.
This is a serious problem for call service.

Do you have any reports about this, or a plan to fix it?


Best regards,
Bae.



-------------------------------------------------------- 
  배 석 진 (Bae Souk-Jin) 
   System R&D Group 2
   Mobile Device Division Telecommunication Business
   SAMSUNG ELECTRONICS CO. LTD

   Mobile : 82-10-2888-2200
   E-mail : soukjin....@samsung.com
--------------------------------------------------------
