On Mon, Mar 7, 2016 at 5:39 PM, Linus Torvalds wrote:
> On Mon, Mar 7, 2016 at 4:07 PM, Tom Herbert wrote:
>>
>> As I said previously, if alignment really is a factor then we can
>> check up front if a buffer crosses a page boundary and call the slow
>> path function (original code). I'm seeing a 1 nsec hit to add this check.
From: Alexander Duyck
...
> >> So the loop:
> >> 10: adcq (%rdx,%rcx,8),%rax
> >> inc %rcx
> >> jnz 10b
> >> could easily be as fast as anything that doesn't use the 'new'
> >> instructions that use the overflow flag.
> >> That loop might be measurably faster for aligned buffers.
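
For reference, a minimal self-contained sketch of that loop shape, as a C helper with GCC inline asm; the function name, the negative word index, the zero-length guard and the final carry fold are illustrative assumptions, not code from the thread:

static inline unsigned long add64_loop(const void *buff, unsigned long nwords)
{
	/* Illustrative helper, not from the patch.  Sum nwords 8-byte words
	 * with a single adc chain.  A negative index that counts up to zero
	 * lets inc/jnz control the loop; inc does not touch the carry flag,
	 * so the carry chains across iterations. */
	const unsigned long *end = (const unsigned long *)buff + nwords;
	unsigned long sum = 0;
	long idx = -(long)nwords;

	if (!nwords)
		return 0;
	asm("clc\n\t"
	    "1:\tadcq (%[end],%[idx],8),%[sum]\n\t"
	    "incq %[idx]\n\t"
	    "jnz 1b\n\t"
	    "adcq $0,%[sum]"		/* fold the final carry */
	    : [sum] "+r" (sum), [idx] "+r" (idx)
	    : [end] "r" (end)
	    : "memory");
	return sum;			/* 64-bit partial one's-complement sum */
}
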
From: Alexander Duyck
...
> One thought I had is that we may want to look into making an inline
> function that we can call for compile-time defined lengths less than
> 64. Maybe call it something like __csum_partial and we could then use
> that in place of csum_partial for all those headers that
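
A rough sketch of that idea as a self-contained helper, with an invented name (__csum_partial64) and plain C types rather than the kernel's __wsum; the 8-byte-multiple restriction and the name of the fallback routine are assumptions for illustration:

#include <stdint.h>
#include <stddef.h>

/* Stand-in for the existing general-purpose routine (hypothetical name). */
unsigned long csum_partial64(const void *buff, size_t len, unsigned long sum);

/* Hypothetical helper: for small, compile-time-constant lengths the compiler
 * can fully unroll this into a short add chain; anything else falls back to
 * the general routine. */
static inline unsigned long __csum_partial64(const void *buff, size_t len,
					     unsigned long sum)
{
	if (__builtin_constant_p(len) && len <= 64 && (len & 7) == 0) {
		const uint64_t *p = buff;
		size_t i;

		for (i = 0; i < len / 8; i++) {
			uint64_t v = p[i];

			sum += v;
			if (sum < v)	/* carry out: fold it back in */
				sum++;
		}
		return sum;
	}
	return csum_partial64(buff, len, sum);
}
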
On Mon, Mar 7, 2016 at 4:07 PM, Tom Herbert wrote:
>
> As I said previously, if alignment really is a factor then we can
> check up front if a buffer crosses a page boundary and call the slow
> path function (original code). I'm seeing a 1 nsec hit to add this
> check.
It shouldn't be a factor, a
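
A small sketch of the kind of up-front test being described, with PAGE_SIZE hard-coded to the x86-64 base page size and the helper name invented for illustration:

#include <stdbool.h>
#include <stdint.h>

#define PAGE_SIZE 4096UL	/* x86-64 base page size */

/* Illustrative helper: true if [buff, buff + len) spans a page boundary;
 * a caller would take the original slow path in that case and the fast
 * path otherwise. */
static inline bool csum_crosses_page(const void *buff, unsigned long len)
{
	uintptr_t start = (uintptr_t)buff;

	return ((start ^ (start + len - 1)) & ~(PAGE_SIZE - 1)) != 0;
}
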
On Mon, Mar 7, 2016 at 4:49 PM, Alexander Duyck wrote:
> On Mon, Mar 7, 2016 at 4:07 PM, Tom Herbert wrote:
>> On Mon, Mar 7, 2016 at 3:52 PM, Alexander Duyck wrote:
>>> On Mon, Mar 7, 2016 at 9:33 AM, Tom Herbert wrote:
On Mon, Mar 7, 2016 at 5:56 AM, David Laight wrote:
> From: Alexander Duyck
On Mon, Mar 7, 2016 at 4:07 PM, Tom Herbert wrote:
> On Mon, Mar 7, 2016 at 3:52 PM, Alexander Duyck wrote:
>> On Mon, Mar 7, 2016 at 9:33 AM, Tom Herbert wrote:
>>> On Mon, Mar 7, 2016 at 5:56 AM, David Laight wrote:
From: Alexander Duyck
...
> Actually probably the easiest way to go on x86 is to just replace the use of len with (len >> 6) and use decl or incl instead of addl or subl, and lea instead of addq for the buff address.
On Mon, Mar 7, 2016 at 3:52 PM, Alexander Duyck wrote:
> On Mon, Mar 7, 2016 at 9:33 AM, Tom Herbert wrote:
>> On Mon, Mar 7, 2016 at 5:56 AM, David Laight wrote:
>>> From: Alexander Duyck
>>> ...
Actually probably the easiest way to go on x86 is to just replace the
use of len with (len >> 6) and use decl or incl instead of addl or subl, and lea instead of addq for the buff address.
On Mon, Mar 7, 2016 at 9:33 AM, Tom Herbert wrote:
> On Mon, Mar 7, 2016 at 5:56 AM, David Laight wrote:
>> From: Alexander Duyck
>> ...
>>> Actually probably the easiest way to go on x86 is to just replace the
>>> use of len with (len >> 6) and use decl or incl instead of addl or
>>> subl, and lea instead of addq for the buff address. None of those
On Mon, Mar 7, 2016 at 5:56 AM, David Laight wrote:
> From: Alexander Duyck
> ...
>> Actually probably the easiest way to go on x86 is to just replace the
>> use of len with (len >> 6) and use decl or incl instead of addl or
>> subl, and lea instead of addq for the buff address. None of those
>> instructions affect the carry flag as this is how such loops
From: Alexander Duyck
...
> Actually probably the easiest way to go on x86 is to just replace the
> use of len with (len >> 6) and use decl or incl instead of addl or
> subl, and lea instead of addq for the buff address. None of those
> instructions affect the carry flag as this is how such loops
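
A sketch of what that loop control could look like; the helper name, the clc/adcq-$0 bracketing and the zero-block guard are assumptions, the point being that lea and dec leave the carry flag alone so the adcq chain can run uninterrupted across 64-byte blocks:

static unsigned long csum_blocks(const void *buff, unsigned long len)
{
	/* Illustrative helper, not the patch's actual function. */
	unsigned long result = 0;
	unsigned long blocks = len >> 6;	/* number of 64-byte blocks */

	if (!blocks)
		return 0;
	asm("clc\n\t"
	    "1:\n\t"
	    "adcq 0*8(%[src]),%[res]\n\t"
	    "adcq 1*8(%[src]),%[res]\n\t"
	    "adcq 2*8(%[src]),%[res]\n\t"
	    "adcq 3*8(%[src]),%[res]\n\t"
	    "adcq 4*8(%[src]),%[res]\n\t"
	    "adcq 5*8(%[src]),%[res]\n\t"
	    "adcq 6*8(%[src]),%[res]\n\t"
	    "adcq 7*8(%[src]),%[res]\n\t"
	    "lea 64(%[src]),%[src]\n\t"	/* advance the pointer, flags untouched */
	    "decq %[cnt]\n\t"		/* dec does not modify the carry flag */
	    "jnz 1b\n\t"
	    "adcq $0,%[res]"		/* fold the carry once, at the very end */
	    : [res] "+r" (result), [src] "+r" (buff), [cnt] "+r" (blocks)
	    :
	    : "memory");
	return result;			/* tail bytes (len & 63) handled elsewhere */
}
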
On Fri, Mar 4, 2016 at 2:38 AM, David Laight wrote:
> From: Linus Torvalds
>> Sent: 03 March 2016 18:44
>>
>> On Thu, Mar 3, 2016 at 8:12 AM, David Laight wrote:
>> >
>> > Did you try the asm loop that used 'leax %rcx..., jcxz... jmps..'
>> > without any unrolling?
>>
>> Is that actually supposed to work ok these days?
From: Linus Torvalds
> Sent: 03 March 2016 18:44
>
> On Thu, Mar 3, 2016 at 8:12 AM, David Laight wrote:
> >
> > Did you try the asm loop that used 'leax %rcx..., jcxz... jmps..'
> > without any unrolling?
>
> Is that actually supposed to work ok these days? jcxz used to be quite
> slow, and is historically *never* used.
On Thu, Mar 3, 2016 at 8:12 AM, David Laight wrote:
>
> Did you try the asm loop that used 'leax %rcx..., jcxz... jmps..'
> without any unrolling?
Is that actually supposed to work ok these days? jcxz used to be quite
slow, and is historically *never* used.
Now, in theory, loop constructs can actually
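
For context, the loop structure being asked about looks roughly like this when written out; jrcxz is the 64-bit form and needs the count in %rcx, and everything here beyond the quoted description (names, guards, the final fold) is an assumption:

static unsigned long sum64_jcxz(const unsigned long *p, unsigned long nwords)
{
	/* Illustrative helper, not from the thread.  lea and jrcxz do not
	 * touch the carry flag, so the adcq chain runs across iterations
	 * with no unrolling at all. */
	unsigned long sum = 0;

	if (!nwords)
		return 0;
	asm("clc\n\t"
	    "1:\n\t"
	    "adcq (%[p]),%[sum]\n\t"
	    "lea 8(%[p]),%[p]\n\t"	/* advance pointer, flags untouched */
	    "lea -1(%[cnt]),%[cnt]\n\t"	/* decrement count, flags untouched */
	    "jrcxz 2f\n\t"		/* exit when %rcx reaches zero */
	    "jmp 1b\n"
	    "2:\n\t"
	    "adcq $0,%[sum]"		/* fold the final carry */
	    : [sum] "+r" (sum), [p] "+r" (p), [cnt] "+c" (nwords)
	    :
	    : "memory");
	return sum;
}
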
From: Tom Herbert
> Sent: 02 March 2016 22:19
...
> + /* Main loop using 64byte blocks */
> + for (; len > 64; len -= 64, buff += 64) {
> + asm("addq 0*8(%[src]),%[res]\n\t"
> + "adcq 1*8(%[src]),%[res]\n\t"
> + "adcq 2*8(%[src]),%[res]\n\t"
> +
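
The quoted hunk is cut off above; a hedged reconstruction of the general shape of such a 64-byte block loop (not the patch's exact code, and with an invented wrapper so it stands alone) would be:

#include <stddef.h>

static unsigned long csum_64byte_blocks(const unsigned char *buff, size_t len)
{
	/* Illustrative wrapper; in the patch this loop sits inside
	 * csum_partial itself. */
	unsigned long result = 0;

	/* Eight chained adcq's cover one 64-byte block; the trailing
	 * adcq $0 folds the block's carry-out so the next iteration can
	 * start again with a plain addq. */
	for (; len > 64; len -= 64, buff += 64) {
		asm("addq 0*8(%[src]),%[res]\n\t"
		    "adcq 1*8(%[src]),%[res]\n\t"
		    "adcq 2*8(%[src]),%[res]\n\t"
		    "adcq 3*8(%[src]),%[res]\n\t"
		    "adcq 4*8(%[src]),%[res]\n\t"
		    "adcq 5*8(%[src]),%[res]\n\t"
		    "adcq 6*8(%[src]),%[res]\n\t"
		    "adcq 7*8(%[src]),%[res]\n\t"
		    "adcq $0,%[res]"
		    : [res] "+r" (result)
		    : [src] "r" (buff)
		    : "memory");
	}
	/* the remaining bytes (up to 64) are handled by a separate tail */
	return result;
}
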
On Wed, Mar 2, 2016 at 4:40 PM, Tom Herbert wrote:
> On Wed, Mar 2, 2016 at 3:42 PM, Alexander Duyck wrote:
>> On Wed, Mar 2, 2016 at 2:18 PM, Tom Herbert wrote:
>>> This patch implements performant csum_partial for x86_64. The intent is
>>> to speed up checksum calculation, particularly for smaller lengths such
On Wed, Mar 2, 2016 at 3:42 PM, Alexander Duyck wrote:
> On Wed, Mar 2, 2016 at 2:18 PM, Tom Herbert wrote:
>> This patch implements performant csum_partial for x86_64. The intent is
>> to speed up checksum calculation, particularly for smaller lengths such
>> as those that are present when doing skb_postpull_rcsum when getting CHECKSUM_COMPLETE
On Wed, Mar 2, 2016 at 2:18 PM, Tom Herbert wrote:
> This patch implements performant csum_partial for x86_64. The intent is
> to speed up checksum calculation, particularly for smaller lengths such
> as those that are present when doing skb_postpull_rcsum when getting
> CHECKSUM_COMPLETE from devices
On Wed., 2016-03-02 at 14:18 -0800, Tom Herbert wrote:
> + asm("lea 0f(, %[slen], 4), %%r11\n\t"
> + "clc\n\t"
> + "jmpq *%%r11\n\t"
> + "adcq 7*8(%[src]),%[res]\n\t"
> + "adcq 6*8(%[src]),%[res]\n\t"
> + "adcq 5*8(%[src]),%[res]\n\t"
> + "adcq 4*8(%[src]),%[res]\n\t"
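
The quoted chain is cut off as well; the control-flow idea, jumping into the middle of a chain of adds so only as many 8-byte adds run as the remaining length requires, can be illustrated in plain C with a fall-through switch. The real code does it with the computed jmpq into an adcq chain so the carry chain is preserved, and the exact meaning of slen is not visible in the fragment, so this is only an approximation of the technique:

#include <stdint.h>

/* Illustrative helper: fold the carry of a 64-bit add immediately,
 * one's-complement style. */
static inline uint64_t add_fold64(uint64_t sum, uint64_t v)
{
	sum += v;
	if (sum < v)	/* carry out of the 64-bit add */
		sum++;	/* end-around carry */
	return sum;
}

/* Entering at 'case n' adds the first n quadwords, mirroring how the asm
 * jumps part-way into its adcq chain and falls through to the end. */
static uint64_t sum_tail(const uint64_t *p, unsigned int nwords)
{
	uint64_t sum = 0;

	switch (nwords) {
	case 8: sum = add_fold64(sum, p[7]);	/* fall through */
	case 7: sum = add_fold64(sum, p[6]);	/* fall through */
	case 6: sum = add_fold64(sum, p[5]);	/* fall through */
	case 5: sum = add_fold64(sum, p[4]);	/* fall through */
	case 4: sum = add_fold64(sum, p[3]);	/* fall through */
	case 3: sum = add_fold64(sum, p[2]);	/* fall through */
	case 2: sum = add_fold64(sum, p[1]);	/* fall through */
	case 1: sum = add_fold64(sum, p[0]);	/* fall through */
	case 0: break;
	}
	return sum;
}
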