[Bug tree-optimization/57534] [7/8/9/10 Regression]: Performance regression versus 4.7.3, 4.8.1 is ~15% slower

rguenther at suse dot de Wed, 08 May 2019 23:15:10 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57534


--- Comment #35 from rguenther at suse dot de <rguenther at suse dot de> ---
On Thu, 9 May 2019, amker at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57534
> 
> --- Comment #34 from bin cheng <amker at gcc dot gnu.org> ---
> So we could have three different addressing modes here.
>   1. What we have now:
>         leaq    0(,%rbp,8), %rax
>         movsd   8(%rbx,%rax), %xmm0
>         addsd   (%rbx,%rbp,8), %xmm0
>         addq    $8, %rbp
>         addsd   16(%rbx,%rax), %xmm0
>         addsd   24(%rbx,%rax), %xmm0
>         addsd   %xmm0, %xmm1
>         movsd   32(%rbx,%rax), %xmm0
>         addsd   40(%rbx,%rax), %xmm0
>         addsd   48(%rbx,%rax), %xmm0
>         addsd   56(%rbx,%rax), %xmm0
>         addsd   %xmm0, %xmm2
>         cmpq    %rsi, %rbp
>   2. GCC-4.7:
>         fldl   (%esi,%ebx,8)
>         lea    0x8(%ebx),%eax
>         faddl  0x8(%esi,%ebx,8)
>         cmp    %eax,%edi
>         faddl  0x10(%esi,%ebx,8)
>         faddl  0x18(%esi,%ebx,8)
>         faddp  %st,%st(2)
>         fldl   0x20(%esi,%ebx,8)
>         faddl  0x28(%esi,%ebx,8)
>         faddl  0x30(%esi,%ebx,8)
>         faddl  0x38(%esi,%ebx,8)
>         faddp  %st,%st(1)
>   3. With slsr change:
>         leaq    0(%rbp,%rbx,8), %rax
>         addq    $8, %rbx
>         movsd   (%rax), %xmm0
>         addsd   8(%rax), %xmm0
>         addsd   16(%rax), %xmm0
>         addsd   24(%rax), %xmm0
>         addsd   %xmm0, %xmm1
>         movsd   32(%rax), %xmm0
>         addsd   40(%rax), %xmm0
>         addsd   48(%rax), %xmm0
>         addsd   56(%rax), %xmm0
>         addsd   %xmm0, %xmm2
>         cmpq    %rsi, %rbx
> 
> It was reported that 2. is better than 1.  Jeff also recommended 3.
> 
> What I don't understand is:
> A) Why is 2. better than 1.?  It seems to do more computation in each address.
> B) Is 3. the best one?  It has the simplest addressing mode, but it requires
> one additional lea because of strength reduction.

I think that depends on the micro-architecture.  On most x86
implementations complex addressing modes need an additional
uop.  Case 3 is certainly "simple" and also smaller to encode,
so I'd indeed say this one is best.  Case 2 is definitely
a complex addressing mode, which should be avoided unless
it is used only rarely and saves a register.

I'd say that, if you can do it, 3 is the better choice when you
look at more than one memory reference.  If you do a transform
that only looks at single memory references, 2 might seem to
be best.
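
To make the comparison concrete, here is a minimal sketch in C of the two
shapes, assuming a hypothetical kernel that keeps two running sums over
groups of four doubles.  It is not the testcase from the PR, and the names
sum_indexed and sum_reduced are made up for illustration.  The first
function spells every access as base + index + constant (the shape of
cases 1 and 2); the second computes one pointer per iteration and then
uses only pointer + constant (the shape of case 3).

```c
#include <stddef.h>

/* Hypothetical kernel, not the PR's testcase: eight doubles per
   iteration, accumulated into two running sums.  */

/* Indexed shape: each load is written as a[i + k], i.e.
   base + index*8 + constant, as in cases 1 and 2.  */
void sum_indexed(const double *a, size_t n, double *s1, double *s2)
{
    double x = 0.0, y = 0.0;
    for (size_t i = 0; i + 8 <= n; i += 8) {
        x += a[i] + a[i + 1] + a[i + 2] + a[i + 3];
        y += a[i + 4] + a[i + 5] + a[i + 6] + a[i + 7];
    }
    *s1 = x;
    *s2 = y;
}

/* Strength-reduced shape: one address computation per iteration
   (the extra lea in case 3), then every load is p + constant.  */
void sum_reduced(const double *a, size_t n, double *s1, double *s2)
{
    double x = 0.0, y = 0.0;
    for (size_t i = 0; i + 8 <= n; i += 8) {
        const double *p = a + i;   /* shared by all eight loads */
        x += p[0] + p[1] + p[2] + p[3];
        y += p[4] + p[5] + p[6] + p[7];
    }
    *s1 = x;
    *s2 = y;
}
```

The compiler is free to pick either addressing form for either function,
so the source only shows what each form computes.  The one extra address
computation in sum_reduced is shared by eight memory references, which is
why looking at a single reference in isolation can make it look like a
loss.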
