Re: [PATCH v5 net-next] net: Implement fast csum_partial for x86_64

2016-03-08 Thread Tom Herbert
On Mon, Mar 7, 2016 at 5:39 PM, Linus Torvalds wrote: > On Mon, Mar 7, 2016 at 4:07 PM, Tom Herbert wrote: >> >> As I said previously, if alignment really is a factor then we can >> check up front if a buffer crosses a page boundary and call the slow >> path function (original code). I'm seeing a
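The up-front check Tom describes could look roughly like the sketch below. This is only an illustration, not code from the patch: the helper name `crosses_page` and the constant `SKETCH_PAGE_SIZE` (standing in for the kernel's `PAGE_SIZE`, assumed here to be 4096) are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed page size for illustration only; kernel code would use PAGE_SIZE. */
#define SKETCH_PAGE_SIZE 4096u

/*
 * Hypothetical helper: returns true if the byte range [addr, addr + len)
 * touches more than one page. A caller could use this to pick the slow
 * path before entering the fast checksum loop.
 */
static bool crosses_page(uintptr_t addr, size_t len)
{
    if (len == 0)
        return false;
    /* Compare the page index of the first and the last byte. */
    return (addr / SKETCH_PAGE_SIZE) != ((addr + len - 1) / SKETCH_PAGE_SIZE);
}
```

A caller would evaluate `crosses_page((uintptr_t)buff, len)` once and fall back to the original routine when it returns true.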

RE: [PATCH v5 net-next] net: Implement fast csum_partial for x86_64

2016-03-08 Thread David Laight
From: Alexander Duyck ... > >> So the loop: > >> 10: addc %rax,(%rdx,%rcx,8) > >> inc %rcx > >> jnz 10b > >> could easily be as fast as anything that doesn't use the 'new' > >> instructions that use the overflow flag. > >> That loop might be measurably faster for aligned buffer
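For readers less familiar with add-with-carry loops, here is a portable C sketch of what such a loop computes: a sum of 64-bit words in which every carry out of bit 63 is added back in. The function name `sum64_end_around` is made up for this illustration. The assembly version defers the carry through the CPU carry flag (which `inc` leaves untouched), whereas this sketch folds it in immediately.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: ones' complement style sum of nwords 64-bit words.
 * Each carry out of the top bit is added back into the low end.
 */
static uint64_t sum64_end_around(const uint64_t *words, size_t nwords)
{
    uint64_t sum = 0;

    for (size_t i = 0; i < nwords; i++) {
        sum += words[i];
        if (sum < words[i])   /* unsigned wraparound means a carry occurred */
            sum++;            /* end-around carry; cannot wrap a second time */
    }
    return sum;
}
```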

RE: [PATCH v5 net-next] net: Implement fast csum_partial for x86_64

2016-03-08 Thread David Laight
From: Alexander Duyck ... > One thought I had is that we may want to look into making an inline > function that we can call for compile-time defined lengths less than > 64. Maybe call it something like __csum_partial and we could then use > that in place of csum_partial for all those headers that
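One way to picture the idea of an inline helper for short, compile-time-known lengths is the sketch below. It is an assumption-laden illustration, not the proposed `__csum_partial`: the name `csum_small_sketch`, the restriction to lengths that are multiples of 4, and the 32-bit return value are all choices made here for brevity. With a constant `len` the compiler is free to unroll the loop completely.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical inline helper for short buffers (len < 64, multiple of 4).
 * Sums native-endian 32-bit words into a 64-bit accumulator, then folds
 * the result to 32 bits with end-around carry.
 */
static inline uint32_t csum_small_sketch(const void *buf, size_t len)
{
    const unsigned char *p = buf;
    uint64_t acc = 0;

    assert(len < 64 && len % 4 == 0);

    for (size_t i = 0; i < len; i += 4) {
        uint32_t w;

        memcpy(&w, p + i, sizeof(w));   /* safe for unaligned input */
        acc += w;                       /* at most 15 words: no 64-bit overflow */
    }

    /* Two folds are enough: the first can leave a small carry above bit 31. */
    acc = (acc & 0xffffffffu) + (acc >> 32);
    acc = (acc & 0xffffffffu) + (acc >> 32);
    return (uint32_t)acc;
}
```

Because fewer than 16 words are summed, no carry handling is needed inside the loop, which is part of why a dedicated small-length path can be cheaper than the general routine.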

Re: [PATCH v5 net-next] net: Implement fast csum_partial for x86_64

2016-03-07 Thread Linus Torvalds
On Mon, Mar 7, 2016 at 4:07 PM, Tom Herbert wrote: > > As I said previously, if alignment really is a factor then we can > check up front if a buffer crosses a page boundary and call the slow > path function (original code). I'm seeing a 1 nsec hit to add this > check. It shouldn't be a factor, a

Re: [PATCH v5 net-next] net: Implement fast csum_partial for x86_64

2016-03-07 Thread Tom Herbert
On Mon, Mar 7, 2016 at 4:49 PM, Alexander Duyck wrote: > On Mon, Mar 7, 2016 at 4:07 PM, Tom Herbert wrote: >> On Mon, Mar 7, 2016 at 3:52 PM, Alexander Duyck >> wrote: >>> On Mon, Mar 7, 2016 at 9:33 AM, Tom Herbert wrote: On Mon, Mar 7, 2016 at 5:56 AM, David Laight wrote: > F

Re: [PATCH v5 net-next] net: Implement fast csum_partial for x86_64

2016-03-07 Thread Alexander Duyck
On Mon, Mar 7, 2016 at 4:07 PM, Tom Herbert wrote: > On Mon, Mar 7, 2016 at 3:52 PM, Alexander Duyck > wrote: >> On Mon, Mar 7, 2016 at 9:33 AM, Tom Herbert wrote: >>> On Mon, Mar 7, 2016 at 5:56 AM, David Laight >>> wrote: From: Alexander Duyck ... > Actually probably the easie

Re: [PATCH v5 net-next] net: Implement fast csum_partial for x86_64

2016-03-07 Thread Tom Herbert
On Mon, Mar 7, 2016 at 3:52 PM, Alexander Duyck wrote: > On Mon, Mar 7, 2016 at 9:33 AM, Tom Herbert wrote: >> On Mon, Mar 7, 2016 at 5:56 AM, David Laight wrote: >>> From: Alexander Duyck >>> ... Actually probably the easiest way to go on x86 is to just replace the use of len with (l

Re: [PATCH v5 net-next] net: Implement fast csum_partial for x86_64

2016-03-07 Thread Alexander Duyck
On Mon, Mar 7, 2016 at 9:33 AM, Tom Herbert wrote: > On Mon, Mar 7, 2016 at 5:56 AM, David Laight wrote: >> From: Alexander Duyck >> ... >>> Actually probably the easiest way to go on x86 is to just replace the >>> use of len with (len >> 6) and use decl or incl instead of addl or >>> subl, and

Re: [PATCH v5 net-next] net: Implement fast csum_partial for x86_64

2016-03-07 Thread Tom Herbert
On Mon, Mar 7, 2016 at 5:56 AM, David Laight wrote: > From: Alexander Duyck > ... >> Actually probably the easiest way to go on x86 is to just replace the >> use of len with (len >> 6) and use decl or incl instead of addl or >> subl, and lea instead of addq for the buff address. None of those >>

RE: [PATCH v5 net-next] net: Implement fast csum_partial for x86_64

2016-03-07 Thread David Laight
From: Alexander Duyck ... > Actually probably the easiest way to go on x86 is to just replace the > use of len with (len >> 6) and use decl or incl instead of addl or > subl, and lea instead of addq for the buff address. None of those > instructions affect the carry flag as this is how such loops
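A rough sketch of the loop shape being suggested: the caller passes a block count (`len >> 6`), the pointer is advanced with `lea`, and the counter is decremented with `dec`, so the carry flag survives from one `adc` to the next across iterations. The function names `sum_blocks` and `sum_blocks_ref` are invented for this example, the inline assembly assumes GCC or Clang on x86_64, and other targets fall back to the portable reference. This is not the code from the patch.

```c
#include <stddef.h>
#include <stdint.h>

/* Portable reference: fold each carry back in immediately. */
static uint64_t sum_blocks_ref(const uint64_t *p, size_t nblocks)
{
    uint64_t sum = 0;

    for (size_t i = 0; i < nblocks * 8; i++) {
        sum += p[i];
        if (sum < p[i])
            sum++;
    }
    return sum;
}

/* Sum nblocks 64-byte blocks; nblocks would be (len >> 6) in the caller. */
static uint64_t sum_blocks(const uint64_t *p, size_t nblocks)
{
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
    uint64_t sum = 0;

    if (nblocks == 0)
        return 0;

    __asm__ __volatile__(
        "clc\n"                          /* start with carry clear */
        "1:\n\t"
        "adcq (%[src]),%[res]\n\t"
        "adcq 8(%[src]),%[res]\n\t"
        "adcq 16(%[src]),%[res]\n\t"
        "adcq 24(%[src]),%[res]\n\t"
        "adcq 32(%[src]),%[res]\n\t"
        "adcq 40(%[src]),%[res]\n\t"
        "adcq 48(%[src]),%[res]\n\t"
        "adcq 56(%[src]),%[res]\n\t"
        "leaq 64(%[src]),%[src]\n\t"     /* lea changes no flags */
        "decq %[cnt]\n\t"                /* dec leaves the carry flag alone */
        "jnz 1b\n\t"
        "adcq $0,%[res]"                 /* add in the final carry */
        : [res] "+r"(sum), [src] "+r"(p), [cnt] "+r"(nblocks)
        :
        : "cc", "memory");
    return sum;
#else
    return sum_blocks_ref(p, nblocks);
#endif
}
```

Using `addq` or `subq` for the pointer or counter update would overwrite the carry flag and force an extra `adcq $0` inside every iteration.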

Re: [PATCH v5 net-next] net: Implement fast csum_partial for x86_64

2016-03-04 Thread Alexander Duyck
On Fri, Mar 4, 2016 at 2:38 AM, David Laight wrote: > From: Linus Torvalds >> Sent: 03 March 2016 18:44 >> >> On Thu, Mar 3, 2016 at 8:12 AM, David Laight wrote: >> > >> > Did you try the asm loop that used 'leax %rcx..., jcxz... jmps..' >> > without any unrolling? >> >> Is that actually supposed

RE: [PATCH v5 net-next] net: Implement fast csum_partial for x86_64

2016-03-04 Thread David Laight
From: Linus Torvalds > Sent: 03 March 2016 18:44 > > On Thu, Mar 3, 2016 at 8:12 AM, David Laight wrote: > > > > Did you try the asm loop that used 'leax %rcx..., jcxz... jmps..' > > without any unrolling? > > Is that actually supposed to work ok these days? jcxz used to be quite > slow, and is

Re: [PATCH v5 net-next] net: Implement fast csum_partial for x86_64

2016-03-03 Thread Linus Torvalds
On Thu, Mar 3, 2016 at 8:12 AM, David Laight wrote: > > Did you try the asm loop that used 'leax %rcx..., jcxz... jmps..' > without any unrolling? Is that actually supposed to work ok these days? jcxz used to be quite slow, and is historically *never* used. Now, in theory, loop constructs can ac

RE: [PATCH v5 net-next] net: Implement fast csum_partial for x86_64

2016-03-03 Thread David Laight
From: Tom Herbert > Sent: 02 March 2016 22:19 ... > + /* Main loop using 64byte blocks */ > + for (; len > 64; len -= 64, buff += 64) { > + asm("addq 0*8(%[src]),%[res]\n\t" > + "adcq 1*8(%[src]),%[res]\n\t" > + "adcq 2*8(%[src]),%[res]\n\t" > +

Re: [PATCH v5 net-next] net: Implement fast csum_partial for x86_64

2016-03-02 Thread Alexander Duyck
On Wed, Mar 2, 2016 at 4:40 PM, Tom Herbert wrote: > On Wed, Mar 2, 2016 at 3:42 PM, Alexander Duyck > wrote: >> On Wed, Mar 2, 2016 at 2:18 PM, Tom Herbert wrote: >>> This patch implements performant csum_partial for x86_64. The intent is >>> to speed up checksum calculation, particularly for s

Re: [PATCH v5 net-next] net: Implement fast csum_partial for x86_64

2016-03-02 Thread Tom Herbert
On Wed, Mar 2, 2016 at 3:42 PM, Alexander Duyck wrote: > On Wed, Mar 2, 2016 at 2:18 PM, Tom Herbert wrote: >> This patch implements performant csum_partial for x86_64. The intent is >> to speed up checksum calculation, particularly for smaller lengths such >> as those that are present when doing

Re: [PATCH v5 net-next] net: Implement fast csum_partial for x86_64

2016-03-02 Thread Alexander Duyck
On Wed, Mar 2, 2016 at 2:18 PM, Tom Herbert wrote: > This patch implements performant csum_partial for x86_64. The intent is > to speed up checksum calculation, particularly for smaller lengths such > as those that are present when doing skb_postpull_rcsum when getting > CHECKSUM_COMPLETE from dev

Re: [PATCH v5 net-next] net: Implement fast csum_partial for x86_64

2016-03-02 Thread Eric Dumazet
On Wed, 2016-03-02 at 14:18 -0800, Tom Herbert wrote: > + asm("lea 0f(, %[slen], 4), %%r11\n\t" > + "clc\n\t" > + "jmpq *%%r11\n\t" > + "adcq 7*8(%[src]),%[res]\n\t" > + "adcq 6*8(%[src]),%[res]\n\t" > + "adcq 5*8(%[src]),%[res]\n\t" > + "adcq