On Mon, Mar 7, 2016 at 3:52 PM, Alexander Duyck <alexander.du...@gmail.com> wrote:
> On Mon, Mar 7, 2016 at 9:33 AM, Tom Herbert <t...@herbertland.com> wrote:
>> On Mon, Mar 7, 2016 at 5:56 AM, David Laight <david.lai...@aculab.com> wrote:
>>> From: Alexander Duyck
>>> ...
>>>> Actually probably the easiest way to go on x86 is to just replace
>>>> the use of len with (len >> 6) and use decl or incl instead of addl
>>>> or subl, and lea instead of addq for the buff address. None of
>>>> those instructions affect the carry flag, as this is how such loops
>>>> were intended to be implemented.
>>>>
>>>> I've been doing a bit of testing, and that seems to work without
>>>> needing the adcq until after you exit the loop, but it doesn't give
>>>> that much of a gain in speed for dropping the instruction from the
>>>> hot path. I suspect we are probably memory bottlenecked already in
>>>> the loop, so dropping an instruction or two doesn't gain you much.
>>>
>>> Right, any superscalar architecture gives you some instructions
>>> 'for free' if they can execute at the same time as those on the
>>> critical path (in this case the memory reads and the adc).
>>> This is why loop unrolling can be pointless.
>>>
>>> So the loop:
>>>
>>> 10: adc  (%rdx,%rcx,8),%rax
>>>     inc  %rcx
>>>     jnz  10b
>>>
>>> could easily be as fast as anything that doesn't use the 'new'
>>> instructions that use the overflow flag.
>>> That loop might be measurably faster for aligned buffers.
>>
>> Tested by replacing the unrolled loop in my patch with just:
>>
>>     if (len >= 8) {
>>         asm("clc\n\t"
>>             "0: adcq (%[src],%%rcx,8),%[res]\n\t"
>>             "decl %%ecx\n\t"
>>             "jge 0b\n\t"
>>             "adcq $0, %[res]\n\t"
>>             : [res] "=r" (result)
>>             : [src] "r" (buff), "[res]" (result),
>>               "c" ((len >> 3) - 1));
>>     }
>>
>> This seems to be significantly slower:
>>
>> 1400 bytes: 797 nsecs vs. 202 nsecs
>> 40 bytes: 6.5 nsecs vs. 26.8 nsecs
>
> You still need the loop unrolling, as the decl and jge have some
> overhead. You can't just get rid of it with a single call in a tight
> loop, but it should improve things. The gain from what I have seen
> ends up being minimal, though; I haven't really noticed all that much
> in my tests anyway.
>
> I have been doing some testing, and the penalty for an unaligned
> checksum can get pretty big if the data set is big enough. I was
> messing around and tried doing a checksum over 32K minus some offset
> and was seeing a penalty of about 200 cycles per 64K frame.

Out of how many cycles to checksum 64K though?
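(For illustration: a minimal compilable sketch of the kind of
carry-preserving unrolled loop under discussion. The function name,
the 4-way unroll factor, and the requirement that the input be a
nonzero multiple of 32 bytes are assumptions of this sketch, not
details of the patch; the returned 64-bit partial sum would still
need to be folded down to 16 bits afterwards.)

    /* Sketch only: 4-way unrolled checksum inner loop. Only adcq
     * writes CF; leaq and decq leave it alone, so the carry chain
     * survives across iterations and is folded once at the end.
     * groups = number of 32-byte blocks, must be >= 1.
     */
    static unsigned long csum_32byte_groups(const void *buff,
                                            unsigned long groups)
    {
            unsigned long result = 0;

            asm("clc\n\t"
                "0: adcq 0*8(%[src]),%[res]\n\t"
                "   adcq 1*8(%[src]),%[res]\n\t"
                "   adcq 2*8(%[src]),%[res]\n\t"
                "   adcq 3*8(%[src]),%[res]\n\t"
                "   leaq 4*8(%[src]),%[src]\n\t" /* lea writes no flags */
                "   decq %[cnt]\n\t"             /* dec writes ZF, not CF */
                "   jnz 0b\n\t"
                "   adcq $0,%[res]\n\t"          /* fold the final carry */
                : [res] "+r" (result), [src] "+r" (buff),
                  [cnt] "+r" (groups)
                : : "memory");

            return result;
    }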
> One thought I had is that we may want to look into making an inline
> function that we can call for compile-time-defined lengths less than
> 64. Maybe call it something like __csum_partial, and we could then
> use that in place of csum_partial for all those headers that are a
> fixed length that we pull, such as UDP, VXLAN, Ethernet, and the
> rest. Then we might be able to look at taking care of alignment for
> csum_partial, which will improve the skb_checksum() case without
> impacting the header-pulling cases as much, since that code would be
> inlined elsewhere.

As I said previously, if alignment really is a factor then we can
check up front whether a buffer crosses a page boundary and call the
slow-path function (the original code). I'm seeing a 1 nsec hit to
add this check.

Tom

> - Alex
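(For reference, a sketch of the up-front test Tom describes. PAGE_SIZE
is hard-coded to 4K here, and csum_partial_slow()/csum_partial_fast()
are hypothetical stand-ins for the original code and an
alignment-tolerant path; neither name appears in the patch.)

    #include <stdint.h>

    #define PAGE_SIZE 4096UL

    /* Hypothetical helpers standing in for the original (slow-path)
     * and alignment-tolerant (fast-path) implementations. */
    unsigned int csum_partial_slow(const void *buff, int len,
                                   unsigned int sum);
    unsigned int csum_partial_fast(const void *buff, int len,
                                   unsigned int sum);

    /* True if the first and last byte of the buffer live on
     * different pages. */
    static inline int crosses_page(const void *buff, int len)
    {
            uintptr_t start = (uintptr_t)buff;

            return ((start ^ (start + (uintptr_t)len - 1)) &
                    ~(PAGE_SIZE - 1)) != 0;
    }

    unsigned int csum_partial_checked(const void *buff, int len,
                                      unsigned int sum)
    {
            /* One cheap branch up front: a buffer wholly inside one
             * page can safely take a path that rounds the start
             * address down for aligned loads. */
            if (crosses_page(buff, len))
                    return csum_partial_slow(buff, len, sum);

            return csum_partial_fast(buff, len, sum);
    }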