https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85491
Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |ASSIGNED
   Last reconfirmed|                            |2018-04-23
             Blocks|79703                       |
           Assignee|unassigned at gcc dot       |rguenth at gcc dot gnu.org
                   |gnu.org                     |
            Summary|[8 Regression] scimark LU   |[8 Regression] nbench LU
                   |Decomposition test 15%      |Decomposition test 15%
                   |slower than GCC 7, 30%      |slower than GCC 7, 30%
                   |slower than peak            |slower than peak
     Ever confirmed|0                           |1

--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
Err, nbench, not scimark2.  So it is indeed r257734.  With r257734 we use
avx128 instead of avx256 for nbench1.c:4293:

  for(j=0;j<n;j++)
    {
      if(j!=0)
        for(i=0;i<j;i++)
          {
            sum=a[i][j];
            if(i!=0)
              for(k=0;k<i;k++)              <---
                sum-=(a[i][k]*a[k][j]);
            a[i][j]=sum;
          }

and similar for :4301:

  big=(double)0.0;
  for(i=j;i<n;i++)
    {
      sum=a[i][j];
      if(j!=0)
        for(k=0;k<j;k++)                    <---
          sum-=a[i][k]*a[k][j];

So the heuristic introduced in r257734 does not work well for this case.
Good assembly looks like

  0.37 :  40b3dc:  vmovsd 0x650(%rax),%xmm14
  0.36 :  40b3e4:  vmovsd (%rax),%xmm6
  0.56 :  40b3e8:  sub    $0xffffffffffffff80,%rcx
  0.24 :  40b3ec:  add    $0x3280,%rax
  0.39 :  40b3f2:  vmovsd -0x1f90(%rax),%xmm10
  0.33 :  40b3fa:  vmovsd -0x25e0(%rax),%xmm12
  0.69 :  40b402:  vmovhpd -0x2908(%rax),%xmm14,%xmm15
  0.23 :  40b40a:  vmovhpd -0x2f58(%rax),%xmm6,%xmm8
  0.39 :  40b412:  vmovsd -0x1940(%rax),%xmm6
  0.43 :  40b41a:  vinsertf128 $0x1,%xmm15,%ymm8,%ymm9
  0.63 :  40b420:  vmovhpd -0x1c68(%rax),%xmm10,%xmm11
  0.21 :  40b428:  vmovhpd -0x22b8(%rax),%xmm12,%xmm13
  0.41 :  40b430:  vmovsd -0x650(%rax),%xmm10
  1.95 :  40b438:  vfnmadd231pd -0x80(%rcx),%ymm9,%ymm0
  0.18 :  40b43e:  vinsertf128 $0x1,%xmm11,%ymm13,%ymm14
  0.22 :  40b444:  vmovsd -0xca0(%rax),%xmm12
  0.37 :  40b44c:  vmovhpd -0x1618(%rax),%xmm6,%xmm8
  3.47 :  40b454:  vfnmadd132pd -0x60(%rcx),%ymm0,%ymm14
  0.12 :  40b45a:  vmovsd -0x12f0(%rax),%xmm0
  0.17 :  40b462:  vmovhpd -0x328(%rax),%xmm10,%xmm11
  0.32 :  40b46a:  vmovhpd -0x978(%rax),%xmm12,%xmm13
  0.91 :  40b472:  vmovhpd -0xfc8(%rax),%xmm0,%xmm15
  0.13 :  40b47a:  vinsertf128 $0x1,%xmm11,%ymm13,%ymm0
  0.21 :  40b480:  vinsertf128 $0x1,%xmm15,%ymm8,%ymm9
  3.15 :  40b486:  vfnmadd231pd -0x40(%rcx),%ymm9,%ymm14
  4.16 :  40b48c:  vfnmadd132pd -0x20(%rcx),%ymm14,%ymm0
  0.04 :  40b492:  cmp    %rbx,%rcx
  0.05 :  40b495:  jne    40b3dc <lusolve.constprop.5+0x2ac>

where the important difference from the polyhedron case is that the stride
is a compile-time constant and thus the AGU costs are cheaper.  In addition,
when using AVX128 the loop is unrolled by a factor of eight, putting
pressure on the FMA units, whereas, as you can see above, the AVX256
variant is unrolled by a factor of four only.  When removing -funroll-loops
the performance difference is 6474.8 [6495.3] vs. 4946.8 [4779.1] (numbers
in brackets are those with -funroll-loops, on the machine used for
debugging).
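To make the constant-stride point concrete, here is a minimal standalone
sketch of the hot inner-loop shape (the matrix size and function name are
illustrative, not the actual nbench source):

  /* Illustrative sketch only, not the nbench code.  With a statically
     sized matrix both strides below are compile-time constants, so each
     element-wise load the vectorizer emits sits at a fixed offset from a
     single incrementing base pointer, keeping AGU work cheap.  */
  #define N 101

  static double a[N][N];

  double
  column_elim (int i, int j, double sum)
  {
    for (int k = 0; k < i; k++)
      /* a[i][k]: unit stride; a[k][j]: constant stride of N doubles.  */
      sum -= a[i][k] * a[k][j];
    return sum;
  }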
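For contrast, a hypothetical variable-step shape along the lines of the
polyhedron case mentioned above (illustrative code, not the actual
polyhedron source); here the step between element-wise loads is only known
at run time, so every lane needs its own address computation:

  /* Illustrative sketch only.  The stride is a run-time value, so the
     data-ref step of the x[i * stride] access is not a compile-time
     constant and element-wise vector construction stays expensive.  */
  double
  strided_sum (const double *x, long stride, int n)
  {
    double sum = 0.0;
    for (int i = 0; i < n; i++)
      sum += x[i * stride];  /* one address computation per element */
    return sum;
  }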
Given that the original case had a variable step, an adjustment of the
change could look like

Index: gcc/config/i386/i386.c
===================================================================
--- gcc/config/i386/i386.c      (revision 259556)
+++ gcc/config/i386/i386.c      (working copy)
@@ -50550,8 +50550,9 @@ ix86_add_stmt_cost (void *data, int coun
      construction cost by the number of elements involved.  */
   if (kind == vec_construct
       && stmt_info
-      && stmt_info->type == load_vec_info_type
-      && stmt_info->memory_access_type == VMAT_ELEMENTWISE)
+      && STMT_VINFO_TYPE (stmt_info) == load_vec_info_type
+      && STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_ELEMENTWISE
+      && TREE_CODE (DR_STEP (STMT_VINFO_DATA_REF (stmt_info))) != INTEGER_CST)
     {
       stmt_cost = ix86_builtin_vectorization_cost (kind, vectype, misalign);
       stmt_cost *= TYPE_VECTOR_SUBPARTS (vectype);

which restores performance.


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79703
[Bug 79703] [meta-bug] SciMark 2.0 performance issues