[ full-quoting due to Cc fixups, adding netdev ]

Steve Ibanez <siba...@stanford.edu> wrote:
> Hi Florian, Neal, and Daniel,
>
> I hope this email finds you well. My name is Stephen Ibanez and I'm a PhD
> Student at Stanford currently working on a project with Mohammad Alizadeh,
> Nick McKeown, and Lavanya Jose. We have been doing some experiments using
> the linux DCTCP implementation and are trying to understand some strange
> behavior that we are encountering. I'm contacting you three because I have
> seen your names on some of the source files and recent commits in the linux
> source tree. Hopefully you can help us out or put us in contact with the
> right people?
>
> Here are some details about our servers:
>
> - Distribution: Ubuntu 14.04 LTS
> - Kernel release: 4.4.0-75-generic

Can you re-test with a more recent kernel such as 4.13.8?

> *The experiment:*
>
> We use iperf3 to generate two DCTCP flows from different servers to a
> common server, as shown in the diagram below. We measure the sending rate
> of each flow, record the tcp_probe output, as well as run tcpdump on the
> source host interfaces.
>
> [image: Inline image 6]
>
> *The problem:*
>
> Our rate measurements look like the one shown below; the flows often enter
> timeouts. In this case, both flows hit a timeout at t=0.3.
> [image: Inline image 2]
>
> When looking at the sequence of packets seen at the source host interfaces
> around this timeout event this is what we see:
>
> *10.0.0.1 timeout event:*
> [image: Inline image 3]
>
> *10.0.0.3 timeout event:*
> [image: Inline image 4]
>
> In both cases, the source:
> (1) receives an ACK for byte XYZ with the ECN flag set
> (2) stops sending anything for RTO_min=300ms
> (3) sends a retransmission for byte XYZ
>
> I have verified that this behavior is consistent across multiple experiment
> runs. Here are the CWND samples for the 10.0.0.1 flow provided by tcp_probe
> at the time of the timeout event:
>
> [image: Inline image 5]
>
> From what I can tell, tcp_probe logs a sample whenever a packet is
> received. If this is true, then that means when the source receives the
> final ECN marked ACK just before the timeout the CWND=1 MSS.
>
> *The conclusion:*
>
> We believe that there may be an issue with how the linux kernel is handling
> the ECN echoes. For DCTCP, if the CWND is 1 MSS and the end host is still
> receiving ECN marks then the CWND should remain at 1 MSS and should *not*
> enter a timeout. This is because the switch can perform ECN marking very
> aggressively causing the source end host to receive many redundant ECN
> echoes over a short period of time.
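
To make the expected behavior concrete, here is a minimal sketch of a DCTCP-style window update that floors at 1 MSS under sustained marking. It assumes the usual rule (alpha as a moving average of the marked fraction, at most one multiplicative reduction per window of data). The class name, the gain g = 1/16, and the min_cwnd floor are illustrative assumptions, not the kernel's code or its actual floor.

```python
from dataclasses import dataclass


@dataclass
class DctcpSketch:
    """Toy model of a DCTCP sender's window; not the kernel implementation."""
    cwnd: float = 10.0     # congestion window, in MSS units
    alpha: float = 1.0     # running estimate of the fraction of marked ACKs
    g: float = 1.0 / 16    # moving-average gain (assumed value)
    min_cwnd: float = 1.0  # floor assumed for this sketch: 1 MSS

    def on_window_end(self, acked: int, marked: int) -> None:
        """Call once per window of data with the ACK counts for that window."""
        frac = marked / acked if acked else 0.0
        self.alpha = (1 - self.g) * self.alpha + self.g * frac
        if marked:
            # One reduction per window, and never below the floor.
            self.cwnd = max(self.cwnd * (1 - self.alpha / 2), self.min_cwnd)


def run_marked_windows(n: int) -> float:
    """Feed n fully marked windows and return the resulting cwnd."""
    sender = DctcpSketch()
    for _ in range(n):
        sender.on_window_end(acked=10, marked=10)
    return sender.cwnd
```

In this model, continued marking only pins the window at the floor; it does not by itself stop transmission, which is the behavior the report argues for.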
>
> Another potential issue is that from the CWND plot above it looks like the
> end host may be reacting to congestion signals more than once per window,
> which should not happen (section 5 of RFC 3168
> <https://tools.ietf.org/html/rfc3168>). tcp_probe reports SRTT measurements
> of about 400-500 us and in the plot above the CWND is reduced 6 times
> within this amount of time.
>
> We have not yet tracked down the code path in the kernel code that is
> causing the behavior described above. Perhaps this is something that you
> can help us with? We would love to hear your thoughts on this matter and
> are happy to try other experiments that you suggest.
>
> Here is a link
> <https://drive.google.com/file/d/0Bw-GEX7h5ufiYmpCV2VpOGEtQWs/view?usp=sharing>
> to download the packet traces if you would like to take a look.
> han-1_host.pcap is the trace from 10.0.0.1 and han-3_host.pcap is the trace
> from 10.0.0.3.
>
> Looking forward to hearing from you!
>
> Best,
> -Steve
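
One way to quantify the "more than once per window" observation is to count how many CWND decreases fall inside any interval of one SRTT. The helper below is a rough sketch under assumptions: the samples are (time in microseconds, cwnd) pairs already extracted from the tcp_probe log (parsing is not shown), and the function name and sample values are made up for illustration.

```python
def max_reductions_within(samples, window_us):
    """Return the largest number of cwnd decreases in any span of window_us.

    samples: list of (time_us, cwnd) tuples in increasing time order.
    """
    # Timestamps at which cwnd is lower than in the previous sample.
    drops = [
        t for (_, prev_cwnd), (t, cwnd) in zip(samples, samples[1:])
        if cwnd < prev_cwnd
    ]
    best = 0
    start = 0
    for end, t in enumerate(drops):
        # Slide the left edge until the span fits inside one window.
        while t - drops[start] > window_us:
            start += 1
        best = max(best, end - start + 1)
    return best


# Made-up samples: two decreases 100 us apart, then one much later.
example = [(0, 10), (100, 9), (200, 8), (300, 8), (1000, 7)]
worst = max_reductions_within(example, window_us=500)
```

A result greater than 1 with window_us set to the measured SRTT (about 400-500 us here) would support the suspicion that the sender reduces more than once per window.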