On Fri, Feb 19, 2016 at 4:14 PM, Tom Herbert <t...@herbertland.com> wrote:
> On Fri, Feb 19, 2016 at 4:08 PM, Jesse Gross <je...@kernel.org> wrote:
>> On Fri, Feb 19, 2016 at 3:10 PM, Alex Duyck <adu...@mirantis.com> wrote:
>>> On Fri, Feb 19, 2016 at 1:53 PM, Jesse Gross <je...@kernel.org> wrote:
>>>> On Fri, Feb 19, 2016 at 11:26 AM, Alexander Duyck <adu...@mirantis.com> wrote:
>>>>> This patch series makes it so that we enable the outer Tx checksum for IPv4
>>>>> tunnels by default. This makes the behavior consistent with how we were
>>>>> handling this for IPv6. In addition I have updated the internal flags for
>>>>> these tunnels so that we use a ZERO_CSUM_TX flag for IPv4 which should
>>>>> match up well with the ZERO_CSUM6_TX flag which was already in use for
>>>>> IPv6.
>>>>>
>>>>> For most network devices this should be a net gain in terms of performance
>>>>> as having the outer header checksum present allows for devices to report
>>>>> CHECKSUM_UNNECESSARY which we can then convert to CHECKSUM_COMPLETE in
>>>>> order to determine if the inner header checksum is valid.
>>>>>
>>>>> Below is some data I collected with ixgbe with an X540 that demonstrates
>>>>> this. I located two PFs connected back to back in two different name
>>>>> spaces and then setup a pair of tunnels on each, one with checksum enabled
>>>>> and one without.
>>>>>
>>>>> Recv   Send    Send                          Utilization
>>>>> Socket Socket  Message  Elapsed              Send
>>>>> Size   Size    Size     Time     Throughput  local
>>>>> bytes  bytes   bytes    secs.    10^6bits/s  % S
>>>>>
>>>>> noudpcsum:
>>>>> 87380  16384   16384    30.00    8898.67     12.80
>>>>> udpcsum:
>>>>> 87380  16384   16384    30.00    9088.47     5.69
>>>>>
>>>>> The one spot where this may cause a performance regression is if the
>>>>> environment contains devices that can parse the inner headers and a device
>>>>> that supports NETIF_F_GSO_UDP_TUNNEL but not NETIF_F_GSO_UDP_TUNNEL_CSUM.
>>>>> In the case of such a device we have to fall back to using GSO to segment
>>>>> the tunnel instead of TSO and as a result we may take a performance hit as
>>>>> seen below with i40e.
>>>>
>>>> Do you have any numbers from 40G links? Obviously, at 10G the links
>>>> are basically saturated and while I can see a difference in the
>>>> utilization rate, I suspect that the change will be much more apparent
>>>> at higher speeds.
>>>
>>> Unfortunately I don't have any true 40G links to test with. The
>>> closest I can get is to run PF to VF on an i40e. Running that I have
>>> seen the numbers go from about 20Gb/s to 15Gb/s with almost all the
>>> difference being related to the fact that we are having to
>>> allocate/free more skbs and make more trips through the
>>> i40e_lan_xmit_frame function resulting in more descriptors.
>>
>> OK, I guess that is more or less in line with what I would expect off
>> the top of my head. There is a reasonably significant drop in the worst
>> case.
>>
>>>> I'm concerned about the drop in performance for devices that currently
>>>> support offloads (almost none of which expose
>>>> NETIF_F_GSO_UDP_TUNNEL_CSUM as a feature). Presumably the people that
>>>> care most about tunnel performance are the ones that already have
>>>> these NICs and will be the most impacted by the drop.
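(A minimal sketch of the feature check being discussed here -- hypothetical code, not taken from the patch series: a frame flagged for an outer UDP checksum can only be hardware-segmented if the device advertises NETIF_F_GSO_UDP_TUNNEL_CSUM; otherwise the stack has to fall back to software GSO.)

#include <linux/netdevice.h>
#include <linux/skbuff.h>

/* Can this device segment the tunneled skb in hardware? */
static bool tunnel_tso_possible(const struct net_device *dev,
                                const struct sk_buff *skb)
{
        /* Outer checksum requested: need the _CSUM variant of the feature. */
        if (skb_shinfo(skb)->gso_type & SKB_GSO_UDP_TUNNEL_CSUM)
                return !!(dev->features & NETIF_F_GSO_UDP_TUNNEL_CSUM);

        /* No outer checksum: plain UDP tunnel segmentation is enough. */
        return !!(dev->features & NETIF_F_GSO_UDP_TUNNEL);
}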
>>>
>>> The problem is being able to transmit fast is kind of pointless if the
>>> receiving end cannot handle it. We hadn't gotten around to really
>>> getting the Rx checksum bits working until the 3.18 kernel, which I
>>> don't suspect many people are running, so at this point messing with
>>> the TSO bits isn't really making much of a difference. Then on top of
>>> that most devices have certain limitations on how many ports they can
>>> handle and such. I know the i40e is supposed to support something
>>> like 10 port numbers, but the fm10k and ixgbe are limited to one port
>>> as I recall. So this whole thing is already really brittle as it is.
>>> My goal with this change is to make the behavior more consistent
>>> across the board.
>>
>> That's true to some degree but there are certainly plenty of cases
>> where TSO makes a difference - lower CPU usage, transmitting to
>> multiple receivers, people will upgrade their kernels, etc. It's
>> clearly good to make things more consistent but hopefully not by
>> reducing existing performance. :)
>>
>>>> My hope is that we can continue to use TSO on devices that only
>>>> support NETIF_F_GSO_UDP_TUNNEL. The main problem is that the UDP
>>>> length field may vary across segments. However, in practice this is
>>>> only an issue on the final segment and only in cases where the total
>>>> length is not a multiple of the MSS. If we could detect cases where
>>>> those conditions are met, we could continue to use TSO with the UDP
>>>> checksum field pre-populated. A possible step even further would be
>>>> to break off the final segment into a separate packet to make things
>>>> conform if necessary. This would avoid a performance regression and
>>>> I think make this more palatable to a lot of people.
>>>
>>> I think Tom and I had discussed this possibility a bit at netconf.
>>> The GSO logic is something I planned on looking at over the next
>>> several weeks as I suspect there is probably room for improvement
>>> there.
>>
>> That sounds great.
>>
>>>>> I also haven't investigated the effect this will have on OVS. However I
>>>>> suspect the impact should be minimal as the worst case scenario should be
>>>>> that Tx checksumming will become enabled by default which should be
>>>>> consistent with the existing behavior for IPv6.
>>>>
>>>> I don't think that it should cause any problems.
>>>
>>> Good to hear.
>>>
>>> Do you know if OVS has some way to control the VXLAN configuration so
>>> that it could disable Tx checksums? If so that would probably be a
>>> good way to address the 40G issues assuming someone is running an
>>> environment that had nothing but NICs that can support the TSO and Rx
>>> checksum on inner headers.
>>
>> Yes - OVS can control tx checksums on a per-endpoint basis (actually,
>> rx checksum present requirements as well though it's not exposed to
>> the user at the moment). If you had the information then you could
>> optimize what to use in an environment of, say, hypervisors and
>> hardware switches.
>>
>> However, it's certainly possible that you have a mixed set of NICs
>> such as an encap aware NIC on the transmit side and non-aware on the
>> receive side. In that case, both possible checksum settings penalize
>> somebody: off (lose GRO on receiver), on (lose TSO on sender assuming
>> no support for NETIF_F_GSO_UDP_TUNNEL_CSUM). That's why I think it's
>> important to be able to use encap TSO with local checksum to avoid
>> these bad tradeoffs, not to mention being cleaner.
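(On the point above about detecting when TSO is still safe: a rough, hypothetical sketch -- assuming a TCP inner payload, not code from the series -- of the kind of check that could gate it. The outer UDP length only differs on the final segment, and only when the inner payload is not an exact multiple of gso_size, so a frame that divides evenly could keep a pre-populated outer checksum even on a device without NETIF_F_GSO_UDP_TUNNEL_CSUM.)

#include <linux/skbuff.h>
#include <linux/tcp.h>

/* True if every segment produced from this skb would be the same size,
 * so a single pre-computed outer UDP length/checksum stays valid.
 */
static bool tunnel_segments_uniform(const struct sk_buff *skb)
{
        unsigned int hdr_len = skb_inner_transport_header(skb) - skb->data +
                               inner_tcp_hdrlen(skb);
        unsigned int payload = skb->len - hdr_len;

        return payload % skb_shinfo(skb)->gso_size == 0;
}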
>
> By "local checksum" do you mean LCO?

Yes, that's what I meant.

> Seems like we should be able to
> get that to work with NETIF_F_GSO_TUNNEL_CSUM.

I assume you mean NETIF_F_GSO_TUNNEL (no _CSUM)?
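(Since LCO came up: a sketch along the lines of the kernel's lco_csum() helper, shown here for illustration only. The point of LCO is that the outer UDP checksum can be derived without touching the payload, by seeding with the inner pseudo-header checksum that is already stored in the inner transport header for CHECKSUM_PARTIAL.)

#include <linux/skbuff.h>
#include <net/checksum.h>

/* Checksum everything from the outer transport header up to the inner
 * checksum start, seeded with the complement of the inner pseudo-header
 * checksum stored at csum_start + csum_offset.
 */
static __wsum lco_csum_sketch(struct sk_buff *skb)
{
        unsigned char *csum_start = skb_checksum_start(skb);
        unsigned char *l4_hdr = skb_transport_header(skb);
        __wsum partial;

        /* Complement of the inner (pseudo-header) checksum value. */
        partial = ~csum_unfold(*(__force __sum16 *)(csum_start +
                                                    skb->csum_offset));

        /* Sum of the outer UDP + tunnel + inner IP headers, plus the seed. */
        return csum_partial(l4_hdr, csum_start - l4_hdr, partial);
}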