On Tue, Mar 8, 2016 at 10:08 PM, Alexander Duyck <alexander.du...@gmail.com> wrote: > On Tue, Mar 8, 2016 at 9:50 PM, Joe Perches <j...@perches.com> wrote: >> On Tue, 2016-03-08 at 21:23 -0800, Alexander Duyck wrote: >>> On Tue, Mar 8, 2016 at 3:25 PM, Joe Perches <j...@perches.com> wrote: >>> > On Tue, 2016-03-08 at 14:42 -0800, Alexander Duyck wrote: >>> > > The code for csum_block_add was doing a funky byteswap to swap the even >>> > > and >>> > > odd bytes of the checksum if the offset was odd. Instead of doing this >>> > > we >>> > > can save ourselves some trouble and just shift by 8 as this should have >>> > > the >>> > > same effect in terms of the final checksum value and only requires one >>> > > instruction. >>> > 3 instructions? >>> I was talking about just the one ror vs mov, shl, shr, and ,and, add. >>> >>> I assume when you say 3 you are including the test and either some >>> form of conditional move or jump? >> >> Yeah, instruction count also depends on architecture (arm/x86/ppc...) > > Right. But the general idea is that rotate is an instruction most > architectures have. I haven't heard of an instruction that swaps even > and odd bytes of a 32 bit word. > Yes, I took a look inlining these.
#define rol32(V, X) ({ \ int word = V; \ if (__builtin_constant_p(X)) \ asm("roll $" #X ",%[word]\n\t" \ : [word] "=r" (word)); \ else \ asm("roll %%cl,%[word]\n\t" \ : [word] "=r" (word) \ : "c" (X)); \ word; \ }) With this I'm seeing a nice speedup in jhash which uses a lot of rol32s... >>> > > diff --git a/include/net/checksum.h b/include/net/checksum.h >> [] >>> > > @@ -88,8 +88,10 @@ static inline __wsum >>> > > csum_block_add(__wsum csum, __wsum csum2, int offset) >>> > > { >>> > > u32 sum = (__force u32)csum2; >>> > > - if (offset&1) >>> > > - sum = ((sum&0xFF00FF)<<8)+((sum>>8)&0xFF00FF); >>> > > + >>> > > + if (offset & 1) >>> > > + sum = (sum << 24) + (sum >> 8); >>> > Maybe use ror32(sum, 8); >>> I was actually thinking I could use something like this. I didn't >>> realize it was even available. >> >> Now you know: bitops.h >> >>> > or maybe something like: >>> > >>> > { >>> > u32 sum; >>> > >>> > /* rotated csum2 of odd offset will be the right checksum */ >>> > if (offset & 1) >>> > sum = ror32((__force u32)csum2, 8); >>> > else >>> > sum = (__force u32)csum2; >>> > >>> Any specific reason for breaking it up like this? It seems like it >>> was easier to just have sum be assigned first and then rotating it if >>> needed. What is gained by splitting the assignment up over two >>> different calls? >> >> It's only for reader clarity where a comment could be useful. >> The compiler output shouldn't change. > > Okay, well I can add a one line comment about aligning to a 16b > boundary for clarity. > > - Alex