On 11/06/2016 03:22 PM, Jonathan Maxwell wrote:
> On Thu, Nov 3, 2016 at 8:40 AM, Brian King <brk...@linux.vnet.ibm.com> wrote:
>> On 10/27/2016 10:26 AM, Eric Dumazet wrote:
>>> On Wed, 2016-10-26 at 11:09 +1100, Jon Maxwell wrote:
>>>> We recently encountered a bug where a few customers using ibmveth on the
>>>> same LPAR hit an issue where a TCP session hung when large receive was
>>>> enabled. Closer analysis revealed that the session was stuck because
>>>> one side was advertising a zero window repeatedly.
>>>>
>>>> We narrowed this down to the fact that the ibmveth driver did not set
>>>> gso_size, which is translated by TCP into the MSS later up the stack.
>>>> The MSS is used to calculate the TCP window size, and as that was
>>>> abnormally large, a zero window was calculated even though the socket's
>>>> receive buffer was completely empty.
>>>>
>>>> We were able to reproduce this and worked with IBM to fix it. Thanks Tom
>>>> and Marcelo for all your help and review on this.
>>>>
>>>> The patch fixes both our internal reproduction tests and our customers'
>>>> tests.
>>>>
>>>> Signed-off-by: Jon Maxwell <jmaxwel...@gmail.com>
>>>> ---
>>>>  drivers/net/ethernet/ibm/ibmveth.c | 20 ++++++++++++++++++++
>>>>  1 file changed, 20 insertions(+)
>>>>
>>>> diff --git a/drivers/net/ethernet/ibm/ibmveth.c b/drivers/net/ethernet/ibm/ibmveth.c
>>>> index 29c05d0..c51717e 100644
>>>> --- a/drivers/net/ethernet/ibm/ibmveth.c
>>>> +++ b/drivers/net/ethernet/ibm/ibmveth.c
>>>> @@ -1182,6 +1182,8 @@ static int ibmveth_poll(struct napi_struct *napi, int budget)
>>>>  	int frames_processed = 0;
>>>>  	unsigned long lpar_rc;
>>>>  	struct iphdr *iph;
>>>> +	bool large_packet = 0;
>>>> +	u16 hdr_len = ETH_HLEN + sizeof(struct tcphdr);
>>>>
>>>>  restart_poll:
>>>>  	while (frames_processed < budget) {
>>>> @@ -1236,10 +1238,28 @@ static int ibmveth_poll(struct napi_struct *napi, int budget)
>>>>  					iph->check = 0;
>>>>  					iph->check = ip_fast_csum((unsigned char *)iph, iph->ihl);
>>>>  					adapter->rx_large_packets++;
>>>> +					large_packet = 1;
>>>>  				}
>>>>  			}
>>>>  		}
>>>>
>>>> +		if (skb->len > netdev->mtu) {
>>>> +			iph = (struct iphdr *)skb->data;
>>>> +			if (be16_to_cpu(skb->protocol) == ETH_P_IP &&
>>>> +			    iph->protocol == IPPROTO_TCP) {
>>>> +				hdr_len += sizeof(struct iphdr);
>>>> +				skb_shinfo(skb)->gso_type = SKB_GSO_TCPV4;
>>>> +				skb_shinfo(skb)->gso_size = netdev->mtu - hdr_len;
>>>> +			} else if (be16_to_cpu(skb->protocol) == ETH_P_IPV6 &&
>>>> +				   iph->protocol == IPPROTO_TCP) {
>>>> +				hdr_len += sizeof(struct ipv6hdr);
>>>> +				skb_shinfo(skb)->gso_type = SKB_GSO_TCPV6;
>>>> +				skb_shinfo(skb)->gso_size = netdev->mtu - hdr_len;
>>>> +			}
>>>> +			if (!large_packet)
>>>> +				adapter->rx_large_packets++;
>>>> +		}
>>>> +
>>>
>>> This might break forwarding and PMTU discovery.
>>>
>>> You force gso_size to device mtu, regardless of real MSS used by the TCP
>>> sender.
>>>
>>> Don't you have the MSS provided in RX descriptor, instead of guessing
>>> the value?
>>
>> We've had some further discussions on this with the Virtual I/O Server (VIOS)
>> development team. The large receive aggregation in the VIOS (AIX based) is
>> actually being done by software in the VIOS. What they may be able to do,
>> when performing this aggregation, is look at the packet lengths of all the
>> packets being aggregated, take the largest packet size within the
>> aggregation unit, minus the header length, and return that to the virtual
>> ethernet client, which we could then stuff into gso_size. They are currently
>> assessing how feasible this would be to do and whether it would impact other
>> bits of the code. However, assuming this does end up being an option, would
>> this address the concerns here, or is that going to break something else I'm
>> not thinking of?
>
> I was discussing this with a colleague, and although this is better than
> what we have so far, we wonder if there could be a corner case where it
> ends up with a smaller value than the current MSS. For example, if the
> application sent a burst of small TCP packets with the PUSH bit set,
> they may not be coalesced by GRO. The VIOS could
Would that be a big problem though? Other than a performance degradation
in this specific case, do you see a functional issue with this approach?

> probably be coded to detect that condition and use the previous MSS.
> But that may not necessarily be the current MSS.
>
> The ibmveth driver passes the MSS via gso_size to the VIOS, either as
> the 3rd argument of ibmveth_send() or via tcp_hdr(skb)->check, which is
> presumably overwritten when the VIOS does the TSO. Would it be possible
> to keep a copy of this value on the TX side in the VIOS before it is
> overwritten, and then somehow pass that up to the RX side along with the
> frame and set gso_size to that, which should be the current MSS?

That seems like it might be more difficult. Wouldn't that require the
VIOS to track the MSS on a per-connection basis, since the MSS might
differ based on the source/destination? I discussed this with VIOS
development, and they don't do any connection tracking where they'd be
able to insert this.

Unless there is a functional issue with the approach of having the VIOS
insert the MSS based on the size of the packets received off the wire, I
think that might be our best option...

Thanks,

Brian

>
> Regards
>
> Jon
>
>>
>> Unfortunately, I don't think we'd have a good way to get gso_segs set
>> correctly, as I don't see how that would get passed back up the
>> interface.
>>
>> Thanks,
>>
>> Brian
>>
>> --
>> Brian King
>> Power Linux I/O
>> IBM Linux Technology Center

-- 
Brian King
Power Linux I/O
IBM Linux Technology Center