Re: [PATCH v5 net-next] net: Implement fast csum_partial for x86_64

Tom Herbert Mon, 07 Mar 2016 17:03:53 -0800

On Mon, Mar 7, 2016 at 4:49 PM, Alexander Duyck
<alexander.du...@gmail.com> wrote:
> On Mon, Mar 7, 2016 at 4:07 PM, Tom Herbert <t...@herbertland.com> wrote:
>> On Mon, Mar 7, 2016 at 3:52 PM, Alexander Duyck
>> <alexander.du...@gmail.com> wrote:
>>> On Mon, Mar 7, 2016 at 9:33 AM, Tom Herbert <t...@herbertland.com> wrote:
>>>> On Mon, Mar 7, 2016 at 5:56 AM, David Laight <david.lai...@aculab.com> 
>>>> wrote:
>>>>> From: Alexander Duyck
>>>>>  ...
>>>>>> Actually probably the easiest way to go on x86 is to just replace the
>>>>>> use of len with (len >> 6) and use decl or incl instead of addl or
>>>>>> subl, and lea instead of addq for the buff address.  None of those
>>>>>> instructions affect the carry flag, as this is how such loops were
>>>>>> intended to be implemented.
>>>>>>
>>>>>> I've been doing a bit of testing and that seems to work without
>>>>>> needing the adcq until after you exit the loop, but doesn't give that
>>>>>> much of a gain in speed for dropping the instruction from the
>>>>>> hot-path.  I suspect we are probably memory bottle-necked already in
>>>>>> the loop so dropping an instruction or two doesn't gain you much.
>>>>>
>>>>> Right, any superscalar architecture gives you some instructions
>>>>> 'for free' if they can execute at the same time as those on the
>>>>> critical path (in this case the memory reads and the adc).
>>>>> This is why loop unrolling can be pointless.
>>>>>
>>>>> So the loop:
>>>>> 10:     adc (%rdx,%rcx,8),%rax
>>>>>         inc %rcx
>>>>>         jnz 10b
>>>>> could easily be as fast as anything that doesn't use the 'new'
>>>>> instructions that use the overflow flag.
>>>>> That loop might be measurably faster for aligned buffers.
>>>>
>>>> Tested by replacing the unrolled loop in my patch with just:
>>>>
>>>> if (len >= 8) {
>>>>                 asm("clc\n\t"
>>>>                     "0: adcq (%[src],%%rcx,8),%[res]\n\t"
>>>>                     "decl %%ecx\n\t"
>>>>                     "jge 0b\n\t"
>>>>                     "adcq $0, %[res]\n\t"
>>>>                             : [res] "=r" (result)
>>>>                             : [src] "r" (buff), "[res]" (result), "c" ((len >> 3) - 1));
>>>> }
>>>>
>>>> This seems to be significantly slower:
>>>>
>>>> 1400 bytes: 797 nsecs vs. 202 nsecs
>>>> 40 bytes: 6.5 nsecs vs. 26.8 nsecs
>>>
>>> You still need the loop unrolling as the decl and jge have some
>>> overhead.  You can't just get rid of it with a single call in a tight
>>> loop but it should improve things.  The gain from what I have seen
>>> ends up being minimal though.  I haven't really noticed all that much
>>> in my tests anyway.
>>>
>>> I have been doing some testing and the penalty for an unaligned
>>> checksum can get pretty big if the data-set is big enough.  I was
>>> messing around and tried doing a checksum over 32K minus some offset
>>> and was seeing a penalty of about 200 cycles per 64K frame.
>>>
>> Out of how many cycles to checksum 64K though?
>
> So the cycle counts I am seeing are ~16660 for unaligned vs 16416
> aligned.  So yeah the effect is only a 1.5% penalty for the total
> time.
>
>>> One thought I had is that we may want to look into making an inline
>>> function that we can call for compile-time defined lengths less than
>>> 64.  Maybe call it something like __csum_partial and we could then use
>>> that in place of csum_partial for all those headers that are a fixed
>>> length that we pull such as UDP, VXLAN, Ethernet, and the rest.  Then
>>> we might be able to look at taking care of alignment for csum_partial
>>> which will improve the skb_checksum() case without impacting the
>>> header pulling cases as much since that code would be inlined
>>> elsewhere.
>>>
>> As I said previously, if alignment really is a factor then we can
>> check up front if a buffer crosses a page boundary and call the slow
>> path function (original code). I'm seeing a 1 nsec hit to add this
>> check.
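
For illustration, the up-front check described above could look roughly like the sketch below. It is plain userspace C, not the patch's code: the name crosses_page and the 4 KiB SKETCH_PAGE_SIZE constant are assumptions made for the example. A caller would take the slow path (the original code) when it returns true.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed page size for this sketch; the kernel would use PAGE_SIZE. */
#define SKETCH_PAGE_SIZE 4096u

/*
 * Hypothetical helper: returns true when the byte range
 * [buff, buff + len) touches more than one page.
 */
static inline bool crosses_page(const void *buff, size_t len)
{
	uintptr_t start = (uintptr_t)buff;
	uintptr_t end;

	if (len == 0)
		return false;

	end = start + len - 1;	/* address of the last byte in the range */

	/*
	 * The first and last byte sit in the same page exactly when
	 * their addresses agree in every bit above the page offset.
	 */
	return ((start ^ end) & ~(uintptr_t)(SKETCH_PAGE_SIZE - 1)) != 0;
}
```

The test is one subtract, one xor, one mask and a branch, which is in line with a cost on the order of a nanosecond.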
>
> Well I was just noticing there are a number of places we can get an
> even bigger benefit if we just bypass the need for csum_partial
> entirely.  For example the DSA code is calling csum_partial to extract
> 2 bytes.  Same thing for protocols such as VXLAN and the like.  If we
> could catch cases like these with a __builtin_constant_p check then we
> might be able to save some significant CPU time by avoiding the
> function call entirely and just doing some inline addition on the
> input values directly.
>
Sure, we could inline a switch function for common values (0, 2, 4, 8,
14, 16, 20, 40) maybe.
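
A minimal sketch of what that could look like, in plain userspace C rather than kernel code. All names here (csum_small, csum_fixed, csum_generic, load16, fold32, csum_fold16) are hypothetical, csum_generic merely stands in for the out-of-line csum_partial, and __builtin_constant_p is the GCC/Clang builtin. When the length is a compile-time constant, the switch and the fixed-bound loop can be reduced by the compiler to a handful of adds.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Load one 16-bit word in host byte order, with no alignment assumption. */
static inline uint32_t load16(const unsigned char *p)
{
	uint16_t v;

	memcpy(&v, p, sizeof(v));
	return v;
}

/* Fold a 64-bit accumulator to 32 bits with end-around carry. */
static inline uint32_t fold32(uint64_t acc)
{
	acc = (acc & 0xffffffffu) + (acc >> 32);
	acc = (acc & 0xffffffffu) + (acc >> 32);
	return (uint32_t)acc;
}

/* Fold a 32-bit partial sum to 16 bits with end-around carry. */
static inline uint16_t csum_fold16(uint32_t sum)
{
	sum = (sum & 0xffff) + (sum >> 16);
	sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}

/* Stand-in for the out-of-line csum_partial(); handles any length. */
static uint32_t csum_generic(const void *buff, size_t len, uint32_t sum)
{
	const unsigned char *p = buff;
	uint64_t acc = sum;
	size_t i;

	for (i = 0; i + 1 < len; i += 2)
		acc += load16(p + i);
	if (len & 1) {
		uint16_t last = 0;

		memcpy(&last, p + i, 1);	/* trailing byte, zero padded */
		acc += last;
	}
	return fold32(acc);
}

/* Inline path for the common fixed lengths; anything else falls back. */
static inline uint32_t csum_fixed(const void *buff, size_t len, uint32_t sum)
{
	const unsigned char *p = buff;
	uint64_t acc = sum;
	size_t i;

	switch (len) {
	case 0: case 2: case 4: case 8:
	case 14: case 16: case 20: case 40:
		/* With a constant len this loop can be fully unrolled. */
		for (i = 0; i < len; i += 2)
			acc += load16(p + i);
		return fold32(acc);
	default:
		return csum_generic(buff, len, sum);
	}
}

/* Take the inline path only when len is known at compile time. */
#define csum_small(buff, len, sum)				\
	(__builtin_constant_p(len) ?				\
	 csum_fixed((buff), (len), (sum)) :			\
	 csum_generic((buff), (len), (sum)))
```

A header-pulling call site would then read something like csum_small(hdr, 20, 0), and the result can be folded with csum_fold16() as usual.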


> - Alex
