On Mon, Mar 7, 2016 at 9:33 AM, Tom Herbert <t...@herbertland.com> wrote:
> On Mon, Mar 7, 2016 at 5:56 AM, David Laight <david.lai...@aculab.com> wrote:
>> From: Alexander Duyck
>> ...
>>> Actually probably the easiest way to go on x86 is to just replace the
>>> use of len with (len >> 6) and use decl or incl instead of addl or
>>> subl, and lea instead of addq for the buff address. None of those
>>> instructions effect the carry flag as this is how such loops were
>>> intended to be implemented.
>>>
>>> I've been doing a bit of testing and that seems to work without
>>> needing the adcq until after you exit the loop, but doesn't give that
>>> much of a gain in speed for dropping the instruction from the
>>> hot-path. I suspect we are probably memory bottle-necked already in
>>> the loop so dropping an instruction or two doesn't gain you much.
>>
>> Right, any superscalar architecture gives you some instructions
>> 'for free' if they can execute at the same time as those on the
>> critical path (in this case the memory reads and the adc).
>> This is why loop unrolling can be pointless.
>>
>> So the loop:
>> 10:     addc  %rax,(%rdx,%rcx,8)
>>         inc   %rcx
>>         jnz   10b
>> could easily be as fast as anything that doesn't use the 'new'
>> instructions that use the overflow flag.
>> That loop might be measurable faster for aligned buffers.
>
> Tested by replacing the unrolled loop in my patch with just:
>
>         if (len >= 8) {
>                 asm("clc\n\t"
>                     "0: adcq (%[src],%%rcx,8),%[res]\n\t"
>                     "decl %%ecx\n\t"
>                     "jge 0b\n\t"
>                     "adcq $0, %[res]\n\t"
>                     : [res] "=r" (result)
>                     : [src] "r" (buff), "[res]" (result), "c"
>                       ((len >> 3) - 1));
>         }
>
> This seems to be significantly slower:
>
> 1400 bytes: 797 nsecs vs. 202 nsecs
> 40 bytes: 6.5 nsecs vs. 26.8 nsecs
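
For reference, an unrolled adcq loop of the sort being discussed looks
roughly like this (a sketch only, not code from either patch; the helper
name, the 4x unroll factor, and the requirement that len be a multiple of
32 are simplifications to keep it short):

static inline unsigned long add64_unrolled(const void *buff,
                                           unsigned long len,
                                           unsigned long sum)
{
        unsigned long blocks = len >> 5;        /* 32-byte blocks */

        if (!blocks)
                return sum;

        asm("clc\n\t"
            "1:\n\t"
            "adcq 0(%[src]),%[res]\n\t"
            "adcq 8(%[src]),%[res]\n\t"
            "adcq 16(%[src]),%[res]\n\t"
            "adcq 24(%[src]),%[res]\n\t"
            "leaq 32(%[src]),%[src]\n\t"        /* lea leaves CF alone */
            "decq %[cnt]\n\t"                   /* dec leaves CF alone */
            "jnz 1b\n\t"
            "adcq $0,%[res]"                    /* fold the final carry */
            : [res] "+r" (sum), [src] "+r" (buff), [cnt] "+r" (blocks)
            : : "memory");

        return sum;
}

The loop-control instructions (lea/dec/jnz) are picked because they leave
the carry flag alone, so the adcq chain never has to be broken up and
re-folded inside the loop, and the loop overhead can execute in parallel
with the adds.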
You still need the loop unrolling, as the decl and jge have some overhead. You can't just get rid of it and rely on a single adcq in a tight loop, but deferring the final adcq until after you exit the loop should still improve things. The gain from what I have seen ends up being minimal, though; I haven't really noticed all that much difference in my tests.

I have been doing some testing, and the penalty for an unaligned checksum can get pretty big if the data set is big enough. I was messing around and tried doing a checksum over 32K minus some offset, and was seeing a penalty of about 200 cycles per 64K frame.

One thought I had is that we may want to look into making an inline function that we can call for compile-time-defined lengths less than 64. Maybe call it something like __csum_partial; we could then use it in place of csum_partial for all the fixed-length headers we pull, such as UDP, VXLAN, Ethernet, and the rest. Then we might be able to look at taking care of alignment in csum_partial, which would improve the skb_checksum() case without impacting the header-pulling cases as much, since that code would be inlined elsewhere. Something along the lines of the sketch below is what I am picturing.
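
Strictly a sketch of the idea (the restriction to multiples of 8, the
folding at the end, and the exact form of the asm are placeholders,
nothing here is tested):

/*
 * Sketch only: if the length is a compile-time constant we can let the
 * compiler emit straight-line add/adc code instead of calling out to
 * csum_partial().  Assumes len is a multiple of 8 just to keep the
 * example short; real headers (e.g. Ethernet at 14 bytes) would need
 * tail handling.
 */
static __always_inline __wsum __csum_partial(const void *buff, int len,
                                             __wsum sum)
{
        if (__builtin_constant_p(len) && len < 64 && !(len & 7)) {
                unsigned long result = (__force unsigned long)sum;
                const unsigned long *ptr = buff;
                int i;

                /* Constant trip count, so the compiler can fully unroll
                 * this into a short sequence of add/adc pairs.
                 */
                for (i = 0; i < len / 8; i++)
                        asm("addq %[val],%[res]\n\t"
                            "adcq $0,%[res]"    /* end-around carry */
                            : [res] "+r" (result)
                            : [val] "m" (ptr[i]));

                /* Fold the 64-bit sum back down to 32 bits. */
                result = (result & 0xffffffff) + (result >> 32);
                result = (result & 0xffffffff) + (result >> 32);

                return (__force __wsum)result;
        }

        return csum_partial(buff, len, sum);
}

The nice part is that the length check and the loop should all evaporate
at compile time for the constant-length callers, so the header-pulling
paths never see a function call, and csum_partial() itself would then be
free to spend a few extra instructions on alignment for the big
skb_checksum() runs.

- Alex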