On Mon, Mar 7, 2016 at 4:49 PM, Alexander Duyck
<alexander.du...@gmail.com> wrote:
> On Mon, Mar 7, 2016 at 4:07 PM, Tom Herbert <t...@herbertland.com> wrote:
>> On Mon, Mar 7, 2016 at 3:52 PM, Alexander Duyck
>> <alexander.du...@gmail.com> wrote:
>>> On Mon, Mar 7, 2016 at 9:33 AM, Tom Herbert <t...@herbertland.com> wrote:
>>>> On Mon, Mar 7, 2016 at 5:56 AM, David Laight <david.lai...@aculab.com>
>>>> wrote:
>>>>> From: Alexander Duyck
>>>>> ...
>>>>>> Actually probably the easiest way to go on x86 is to just replace the
>>>>>> use of len with (len >> 6) and use decl or incl instead of addl or
>>>>>> subl, and lea instead of addq for the buff address. None of those
>>>>>> instructions affect the carry flag as this is how such loops were
>>>>>> intended to be implemented.
>>>>>>
>>>>>> I've been doing a bit of testing and that seems to work without
>>>>>> needing the adcq until after you exit the loop, but doesn't give that
>>>>>> much of a gain in speed for dropping the instruction from the
>>>>>> hot-path. I suspect we are probably memory bottle-necked already in
>>>>>> the loop so dropping an instruction or two doesn't gain you much.
>>>>>
>>>>> Right, any superscalar architecture gives you some instructions
>>>>> 'for free' if they can execute at the same time as those on the
>>>>> critical path (in this case the memory reads and the adc).
>>>>> This is why loop unrolling can be pointless.
>>>>>
>>>>> So the loop:
>>>>> 10:     adc (%rdx,%rcx,8),%rax
>>>>>         inc %rcx
>>>>>         jnz 10b
>>>>> could easily be as fast as anything that doesn't use the 'new'
>>>>> instructions that use the overflow flag.
>>>>> That loop might be measurably faster for aligned buffers.
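For anyone following along without the asm in their head, here is a rough portable C sketch of the arithmetic the adc loop quoted above performs: add 64-bit words and feed each carry-out back into the sum, then fold down to 16 bits. The names sum64_end_around and fold64_to_16 are made up for illustration and are not kernel functions; this is only meant to show the carry handling, not to be fast.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not a kernel function): sum nwords 64-bit words
 * with end-around carry, i.e. the arithmetic of an adc chain that is
 * finished off with a final "adc $0". */
static uint64_t sum64_end_around(const unsigned char *buf, size_t nwords)
{
	uint64_t sum = 0;
	size_t i;

	assert(buf != NULL || nwords == 0);
	for (i = 0; i < nwords; i++) {
		uint64_t word;

		memcpy(&word, buf + 8 * i, sizeof(word)); /* alignment-safe load */
		sum += word;
		if (sum < word)	/* unsigned wraparound means a carry out */
			sum += 1;	/* feed the carry back in */
	}
	return sum;
}

/* Hypothetical helper: fold a 64-bit ones-complement sum to 16 bits. */
static uint16_t fold64_to_16(uint64_t sum)
{
	sum = (sum & 0xffffffffu) + (sum >> 32);	/* at most 33 bits */
	sum = (sum & 0xffffffffu) + (sum >> 32);	/* fits in 32 bits */
	sum = (sum & 0xffff) + (sum >> 16);		/* at most 17 bits */
	sum = (sum & 0xffff) + (sum >> 16);		/* fits in 16 bits */
	return (uint16_t)sum;
}
```

The C version folds the carry in right after each add instead of deferring it to the next adc, which gives the same ones-complement sum. The point of the decl/incl/lea trick in the asm is that the loop bookkeeping leaves the carry flag alone, so the deferred carry survives from one iteration to the next.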
>>>>
>>>> Tested by replacing the unrolled loop in my patch with just:
>>>>
>>>> if (len >= 8) {
>>>>         asm("clc\n\t"
>>>>             "0: adcq (%[src],%%rcx,8),%[res]\n\t"
>>>>             "decl %%ecx\n\t"
>>>>             "jge 0b\n\t"
>>>>             "adcq $0, %[res]\n\t"
>>>>             : [res] "=r" (result)
>>>>             : [src] "r" (buff), "[res]" (result), "c" ((len >> 3) - 1));
>>>> }
>>>>
>>>> This seems to be significantly slower:
>>>>
>>>> 1400 bytes: 797 nsecs vs. 202 nsecs
>>>> 40 bytes: 6.5 nsecs vs. 26.8 nsecs
>>>
>>> You still need the loop unrolling as the decl and jge have some
>>> overhead. You can't just get rid of it with a single call in a tight
>>> loop but it should improve things. The gain from what I have seen
>>> ends up being minimal though. I haven't really noticed all that much
>>> in my tests anyway.
>>>
>>> I have been doing some testing and the penalty for an unaligned
>>> checksum can get pretty big if the data-set is big enough. I was
>>> messing around and tried doing a checksum over 32K minus some offset
>>> and was seeing a penalty of about 200 cycles per 64K frame.
>>>
>> Out of how many cycles to checksum 64K though?
>
> So the clock cycles I am seeing is ~16660 for unaligned vs 16416
> aligned. So yeah the effect is only a 1.5% penalty for the total
> time.
>
>>> One thought I had is that we may want to look into making an inline
>>> function that we can call for compile-time defined lengths less than
>>> 64. Maybe call it something like __csum_partial and we could then use
>>> that in place of csum_partial for all those headers that are a fixed
>>> length that we pull such as UDP, VXLAN, Ethernet, and the rest. Then
>>> we might be able to look at taking care of alignment for csum_partial
>>> which will improve the skb_checksum() case without impacting the
>>> header pulling cases as much since that code would be inlined
>>> elsewhere.
>>>
>> As I said previously, if alignment really is a factor then we can
>> check up front if a buffer crosses a page boundary and call the slow
>> path function (original code). I'm seeing a 1 nsec hit to add this
>> check.
>
> Well I was just noticing there are a number of places we can get an
> even bigger benefit if we just bypass the need for csum_partial
> entirely. For example the DSA code is calling csum_partial to extract
> 2 bytes. Same thing for protocols such as VXLAN and the like. If we
> could catch cases like these with a __builtin_constant_p check then we
> might be able to save some significant CPU time by avoiding the
> function call entirely and just doing some inline addition on the
> input values directly.
>
Sure, we could inline a switch function for common values (0, 2, 4, 8,
14, 16, 20, 40) maybe.
> - Alex
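
A minimal userspace sketch of what such an inlined switch might look like, assuming GCC or Clang for __builtin_constant_p. Everything named here (csum_generic, csum_small, csum_fixed, fold16) is hypothetical and only for illustration; in particular csum_generic is a simplified stand-in that returns a folded 16-bit sum in host byte order, whereas the real csum_partial takes a seed and returns a 32-bit partial sum. Only lengths 2, 4 and 8 are open-coded; the other common values would follow the same pattern.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: fold a wide ones-complement sum to 16 bits. */
static uint16_t fold16(uint64_t sum)
{
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}

/* Hypothetical stand-in for the out-of-line slow path: a plain 16-bit
 * ones-complement sum in host byte order, with no seed. */
static uint16_t csum_generic(const void *buf, size_t len)
{
	const unsigned char *p = buf;
	uint64_t sum = 0;
	uint16_t w;

	assert(buf != NULL || len == 0);
	for (; len >= 2; p += 2, len -= 2) {
		memcpy(&w, p, 2);	/* alignment-safe 16-bit load */
		sum += w;
	}
	if (len) {			/* odd trailing byte, zero padded */
		w = 0;
		memcpy(&w, p, 1);
		sum += w;
	}
	return fold16(sum);
}

/* Inline path: a few fixed lengths are open-coded, the rest fall back. */
static inline uint16_t csum_small(const void *buf, size_t len)
{
	uint16_t h[4];

	switch (len) {
	case 0:
		return 0;
	case 2:
		memcpy(h, buf, 2);
		return h[0];
	case 4:
		memcpy(h, buf, 4);
		return fold16((uint64_t)h[0] + h[1]);
	case 8:
		memcpy(h, buf, 8);
		return fold16((uint64_t)h[0] + h[1] + h[2] + h[3]);
	default:
		return csum_generic(buf, len);
	}
}

/* Take the inline path only when the length is a compile-time constant,
 * so the switch can collapse to a single case. */
#define csum_fixed(buf, len)					\
	(__builtin_constant_p(len) ? csum_small((buf), (len))	\
				   : csum_generic((buf), (len)))
```

With a literal length such as csum_fixed(hdr, 8) the compiler should be able to drop the switch and the function call and leave just the loads and adds. With a runtime length the macro goes straight to the out-of-line function, so callers that do not know the length up front pay nothing extra.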