https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90579

--- Comment #4 from Richard Biener <rguenth at gcc dot gnu.org> ---
A workaround on your side would be to stick -mprefer-vector-width=128 on the
function (for example via the target attribute, as sketched below).
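
For illustration, a minimal sketch of the per-function form (the attribute
spelling follows the documented x86 target attribute; the declaration is
just a placeholder for the affected function):

  /* Cap the vectorizer's preferred vector width at 128 bits for this
     one function; the rest of the translation unit keeps the default.  */
  __attribute__ ((target ("prefer-vector-width=128")))
  double loop (int k, double x);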

Note the actual code seems to have another loop touching r in between:

  for (i=0;i<6;i++)
    { r[i] = x1*toverp[k+i]*gor.x; gor.x *= tm24.x; }
  for (i=0;i<3;i++) {
    s=(r[i]+big.x)-big.x;
    sum+=s;
    r[i]-=s;
  }
  t=0;
  for (i=0;i<6;i++)
    t+=r[5-i];

Note that placing a "real" VN after vectorization fixes things as well.
The issue is that we end up with

  MEM[(double *)&r] = vect__3.15_64;
...
  _3 = r[1];
  t_17 = _3 + t_12;
  _6 = r[0];
  t_28 = _6 + t_17;
  return t_28;

where both loads could be CSEd via vector extracts.
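
For context, a standalone sketch of a reproducer consistent with the dumps
in this comment (the symbols a, r and loop are taken from the assembly
below; the original testcase may differ in details):

  extern double a[];
  double r[6];

  double
  loop (int k, double x)
  {
    double t = 0.0;

    /* Vectorized into one 256-bit and one 128-bit store covering r[0..5].  */
    for (int i = 0; i < 6; i++)
      r[i] = x * a[k + i];

    /* Reads the just-stored elements back; the r[1] and r[0] loads are
       the ones FRE could turn into extracts from the stored vector.  */
    for (int i = 0; i < 6; i++)
      t += r[5 - i];

    return t;
  }

Adding an FRE instance after the vectorizer lowering passes, e.g.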

@@ -306,6 +306,7 @@ along with GCC; see the file COPYING3.
       NEXT_PASS (pass_simduid_cleanup);
       NEXT_PASS (pass_lower_vector_ssa);
       NEXT_PASS (pass_lower_switch);
+      NEXT_PASS (pass_fre);
       NEXT_PASS (pass_cse_reciprocals);
       NEXT_PASS (pass_sprintf_length, true);
       NEXT_PASS (pass_reassoc, false /* insert_powi_p */);

produces

  MEM[(double *)&r] = vect__3.15_64;
...
  _75 = BIT_FIELD_REF <vect__3.15_64, 64, 64>;
  t_17 = t_12 + _75;
  _74 = BIT_FIELD_REF <vect__3.15_64, 64, 0>;
  t_28 = t_17 + _74;
  return t_28;

and the corresponding assembly, where the BIT_FIELD_REFs extract elements 1
and 0 directly from the stored vector instead of reloading them:

loop:
.LFB0:
        .cfi_startproc
        movslq  %edi, %rax
        vmovapd %xmm0, %xmm1
        addl    $4, %edi
        movslq  %edi, %rdi
        vbroadcastsd    %xmm0, %ymm0
        vmovddup        %xmm1, %xmm1
        vmulpd  a(,%rax,8), %ymm0, %ymm0
        vmulpd  a(,%rdi,8), %xmm1, %xmm1
        vmovupd %ymm0, r(%rip)
        vmovups %xmm1, r+32(%rip)
        vmovupd r+16(%rip), %ymm2
        vextractf128    $0x1, %ymm2, %xmm3
        vunpckhpd       %xmm3, %xmm3, %xmm1
        vaddsd  .LC0(%rip), %xmm1, %xmm1
        vaddsd  %xmm3, %xmm1, %xmm1
        vunpckhpd       %xmm2, %xmm2, %xmm3
        vaddsd  %xmm3, %xmm1, %xmm1
        vaddsd  %xmm2, %xmm1, %xmm1
        vunpckhpd       %xmm0, %xmm0, %xmm2
        vaddsd  %xmm1, %xmm2, %xmm1
        vaddsd  %xmm1, %xmm0, %xmm0
        vzeroupper
        ret

but another FRE pass was thought to be too costly, since we cannot simply
replace the DOM pass that performs VN around this place (that pass also
performs jump threading, which we'd then miss).

I did not assess the compile-time cost or other fallout of adding a new
FRE pass after the VN rewrite (we could use a non-iterating FRE here).
I did assess it at some point in the past, but it wasn't an obvious
improvement (no SPEC improvements IIRC, but a ~2-3% compile-time slowdown).
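
For reference, a sketch of what a non-iterating instance could look like
in passes.def after the VN rewrite (assuming pass_fre keeps taking the
may_iterate flag the other FRE instances use):

      NEXT_PASS (pass_fre, false /* may_iterate */);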
