https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85491
Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |ASSIGNED
   Last reconfirmed|                            |2018-04-23
             Blocks|79703                       |
           Assignee|unassigned at gcc dot       |rguenth at gcc dot gnu.org
                   |gnu.org                     |
            Summary|[8 Regression] scimark LU   |[8 Regression] nbench LU
                   |Decomposition test 15%      |Decomposition test 15%
                   |slower than GCC 7, 30%      |slower than GCC 7, 30%
                   |slower than peak            |slower than peak
     Ever confirmed|0                           |1

--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
Err, nbench, not scimark2.  So it is indeed r257734.  With r257734 we use
avx128 instead of avx256 for nbench1.c:4293:

  for(j=0;j<n;j++)
    {
      if(j!=0)
        for(i=0;i<j;i++)
          {
            sum=a[i][j];
            if(i!=0)
              for(k=0;k<i;k++)              <---
                sum-=(a[i][k]*a[k][j]);
            a[i][j]=sum;
          }

and similar for :4301:

  big=(double)0.0;
  for(i=j;i<n;i++)
    {
      sum=a[i][j];
      if(j!=0)
        for(k=0;k<j;k++)                    <---
          sum-=a[i][k]*a[k][j];

So the heuristic introduced in r257734 does not work well for this case.
Good assembly looks like

  0.37 :  40b3dc:  vmovsd 0x650(%rax),%xmm14
  0.36 :  40b3e4:  vmovsd (%rax),%xmm6
  0.56 :  40b3e8:  sub    $0xffffffffffffff80,%rcx
  0.24 :  40b3ec:  add    $0x3280,%rax
  0.39 :  40b3f2:  vmovsd -0x1f90(%rax),%xmm10
  0.33 :  40b3fa:  vmovsd -0x25e0(%rax),%xmm12
  0.69 :  40b402:  vmovhpd -0x2908(%rax),%xmm14,%xmm15
  0.23 :  40b40a:  vmovhpd -0x2f58(%rax),%xmm6,%xmm8
  0.39 :  40b412:  vmovsd -0x1940(%rax),%xmm6
  0.43 :  40b41a:  vinsertf128 $0x1,%xmm15,%ymm8,%ymm9
  0.63 :  40b420:  vmovhpd -0x1c68(%rax),%xmm10,%xmm11
  0.21 :  40b428:  vmovhpd -0x22b8(%rax),%xmm12,%xmm13
  0.41 :  40b430:  vmovsd -0x650(%rax),%xmm10
  1.95 :  40b438:  vfnmadd231pd -0x80(%rcx),%ymm9,%ymm0
  0.18 :  40b43e:  vinsertf128 $0x1,%xmm11,%ymm13,%ymm14
  0.22 :  40b444:  vmovsd -0xca0(%rax),%xmm12
  0.37 :  40b44c:  vmovhpd -0x1618(%rax),%xmm6,%xmm8
  3.47 :  40b454:  vfnmadd132pd -0x60(%rcx),%ymm0,%ymm14
  0.12 :  40b45a:  vmovsd -0x12f0(%rax),%xmm0
  0.17 :  40b462:  vmovhpd -0x328(%rax),%xmm10,%xmm11
  0.32 :  40b46a:  vmovhpd -0x978(%rax),%xmm12,%xmm13
  0.91 :  40b472:  vmovhpd -0xfc8(%rax),%xmm0,%xmm15
  0.13 :  40b47a:  vinsertf128 $0x1,%xmm11,%ymm13,%ymm0
  0.21 :  40b480:  vinsertf128 $0x1,%xmm15,%ymm8,%ymm9
  3.15 :  40b486:  vfnmadd231pd -0x40(%rcx),%ymm9,%ymm14
  4.16 :  40b48c:  vfnmadd132pd -0x20(%rcx),%ymm14,%ymm0
  0.04 :  40b492:  cmp    %rbx,%rcx
  0.05 :  40b495:  jne    40b3dc <lusolve.constprop.5+0x2ac>

where the important difference from the polyhedron case is that the stride
is a compile-time constant and thus the AGU costs are cheaper.  In addition,
when using AVX128 the loop is unrolled by a factor of eight, putting
pressure on the FMA units, whereas, as you can see above, the AVX256
variant is unrolled by a factor of four only.  When removing -funroll-loops
the performance difference is 6474.8 [6495.3] vs. 4946.8 [4779.1] (numbers
in brackets are those with -funroll-loops, on the machine used for
debugging).
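To make the constant-stride point concrete, here is a minimal standalone
sketch of the hot inner-loop shape (the matrix size and function name are
illustrative, not the actual nbench source):

  /* Illustrative sketch only, not the nbench code.  With a statically
     sized matrix both strides below are compile-time constants, so each
     element-wise load the vectorizer emits sits at a fixed offset from a
     single incrementing base pointer, keeping AGU work cheap.  */
  #define N 101

  static double a[N][N];

  double
  column_elim (int i, int j, double sum)
  {
    for (int k = 0; k < i; k++)
      /* a[i][k]: unit stride; a[k][j]: constant stride of N doubles.  */
      sum -= a[i][k] * a[k][j];
    return sum;
  }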
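For contrast, a hypothetical variable-step shape along the lines of the
polyhedron case mentioned above (illustrative code, not the actual
polyhedron source); here the step between element-wise loads is only known
at run time, so every lane needs its own address computation:

  /* Illustrative sketch only.  The stride is a run-time value, so the
     data-ref step of the x[i * stride] access is not a compile-time
     constant and element-wise vector construction stays expensive.  */
  double
  strided_sum (const double *x, long stride, int n)
  {
    double sum = 0.0;
    for (int i = 0; i < n; i++)
      sum += x[i * stride];  /* one address computation per element */
    return sum;
  }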
Given that the original case had a variable step, an adjustment of the
change could look like

Index: gcc/config/i386/i386.c
===================================================================
--- gcc/config/i386/i386.c      (revision 259556)
+++ gcc/config/i386/i386.c      (working copy)
@@ -50550,8 +50550,9 @@ ix86_add_stmt_cost (void *data, int coun
      construction cost by the number of elements involved.  */
   if (kind == vec_construct
       && stmt_info
-      && stmt_info->type == load_vec_info_type
-      && stmt_info->memory_access_type == VMAT_ELEMENTWISE)
+      && STMT_VINFO_TYPE (stmt_info) == load_vec_info_type
+      && STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_ELEMENTWISE
+      && TREE_CODE (DR_STEP (STMT_VINFO_DATA_REF (stmt_info))) != INTEGER_CST)
     {
       stmt_cost = ix86_builtin_vectorization_cost (kind, vectype, misalign);
       stmt_cost *= TYPE_VECTOR_SUBPARTS (vectype);

which restores performance.


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79703
[Bug 79703] [meta-bug] SciMark 2.0 performance issues