https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82137
--- Comment #3 from rguenther at suse dot de <rguenther at suse dot de> ---
On Tue, 12 Sep 2017, peter at cordes dot ca wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82137
>
> --- Comment #2 from Peter Cordes <peter at cordes dot ca> ---
> (In reply to Richard Biener from comment #1)
> > Interesting idea.  It's probably a bit hard to make the vectorizer do this
> > though given its current structure and the fact that it would have to
> > cost the extra ops against the saved shuffling (the extra cost of the ops
> > would depend on availability of spare execution resources for example).
>
> Shuffles take execution resources just like anything else (except for
> load+broadcast (vbroadcastss or movddup, or even movshdup) which is done as
> part of a load uop in Ryzen and Intel CPUs).

I was concerned about cases where we don't have just one operation but a
computation sequence, where you go from SHUF + n * op + SHUF to
n * 2 * op + BLEND.

GCC already knows how to handle a mix of two operators plus a blend for
addsubpd support -- it just does the blending in a possibly unwanted
position, and nothing later optimizes shuffles through transparent
operations.  Extending this support to handle plus and mult yields

pairs_double:
.LFB0:
        .cfi_startproc
        leaq    81920(%rdi), %rax
        .p2align 4,,10
        .p2align 3
.L2:
        vmovapd (%rdi), %ymm0
        addq    $32, %rdi
        vpermpd $160, %ymm0, %ymm2
        vpermpd $245, %ymm0, %ymm0
        vmulpd  %ymm2, %ymm0, %ymm1
        vaddpd  %ymm2, %ymm0, %ymm0
        vshufpd $10, %ymm0, %ymm1, %ymm0
        vmovapd %ymm0, -32(%rdi)
        cmpq    %rax, %rdi
        jne     .L2
        vzeroupper

The vectorizer generates

  <bb 3> [50.00%] [count: INV]:
  # ivtmp.17_24 = PHI <ivtmp.17_2(2), ivtmp.17_1(3)>
  _3 = (void *) ivtmp.17_24;
  vect_x_14.2_23 = MEM[(double *)_3];
  vect_x_14.3_22 = VEC_PERM_EXPR <vect_x_14.2_23, vect_x_14.2_23, { 0, 0, 2, 2 }>;
  vect_y_15.7_10 = VEC_PERM_EXPR <vect_x_14.2_23, vect_x_14.2_23, { 1, 1, 3, 3 }>;
  vect__7.8_9 = vect_y_15.7_10 * vect_x_14.3_22;
  vect__7.9_40 = vect_y_15.7_10 + vect_x_14.3_22;
  _35 = VEC_PERM_EXPR <vect__7.8_9, vect__7.9_40, { 0, 5, 2, 7 }>;
  MEM[(double *)_3] = _35;
  ivtmp.17_1 = ivtmp.17_24 + 32;
  if (ivtmp.17_1 != _5)
    goto <bb 3>; [98.00%] [count: INV]

in this case.  I am not sure whether this is optimal.  The question now is
whether it is profitable to enable this for more than the plus/minus
combination in general.
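For reference, the scalar loop behind the asm and GIMPLE above is presumably
the simpler variant of the testcase quoted further below, with just one
multiply/add pair per iteration.  The following is a reconstruction for
illustration, not a copy from the report:

static const int aligned = 1;

void pairs_double(double blocks[])
{
  if (aligned)
    blocks = __builtin_assume_aligned(blocks, 64);
  for (int i = 0; i < 10240; i += 2) {
      double x = blocks[i];
      double y = blocks[i+1];
      blocks[i]   = x * y;   /* even element: product */
      blocks[i+1] = x + y;   /* odd element: sum */
  }
}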
As I noted, GCC does a poor job optimizing, for example,

static const int aligned = 1;

void pairs_double(double blocks[])
{
  if (aligned)
    blocks = __builtin_assume_aligned(blocks, 64);
  for (int i = 0; i < 10240; i += 2) {
      double x = blocks[i];
      double y = blocks[i+1];
      double tem = x * y;
      double tem2 = x + y;
      tem = tem + 3;
      tem2 = tem2 * 2;
      blocks[i] = tem;
      blocks[i+1] = tem2;
  }
}

which ends up as

  <bb 3> [50.00%] [count: INV]:
  # ivtmp.19_26 = PHI <ivtmp.19_2(2), ivtmp.19_1(3)>
  _3 = (void *) ivtmp.19_26;
  vect_x_12.2_31 = MEM[(double *)_3];
  vect_x_12.3_30 = VEC_PERM_EXPR <vect_x_12.2_31, vect_x_12.2_31, { 0, 0, 2, 2 }>;
  vect_y_13.7_24 = VEC_PERM_EXPR <vect_x_12.2_31, vect_x_12.2_31, { 1, 1, 3, 3 }>;
  vect_tem_14.8_23 = vect_y_13.7_24 * vect_x_12.3_30;
  vect_tem_14.9_22 = vect_y_13.7_24 + vect_x_12.3_30;
  _21 = VEC_PERM_EXPR <vect_tem_14.8_23, vect_tem_14.9_22, { 0, 5, 2, 7 }>;
  vect_tem_16.10_7 = _21 + { 3.0e+0, 2.0e+0, 3.0e+0, 2.0e+0 };
  vect_tem_16.11_41 = _21 * { 3.0e+0, 2.0e+0, 3.0e+0, 2.0e+0 };
  _42 = VEC_PERM_EXPR <vect_tem_16.10_7, vect_tem_16.11_41, { 0, 5, 2, 7 }>;
  MEM[(double *)_3] = _42;
  ivtmp.19_1 = ivtmp.19_26 + 32;
  if (ivtmp.19_1 != _5)
    goto <bb 3>; [98.00%] [count: INV]

and thus

.L2:
        vmovapd (%rdi), %ymm0
        addq    $32, %rdi
        vpermpd $160, %ymm0, %ymm2
        vpermpd $245, %ymm0, %ymm0
        vmulpd  %ymm2, %ymm0, %ymm1
        vaddpd  %ymm2, %ymm0, %ymm0
        vshufpd $10, %ymm0, %ymm1, %ymm0
        vaddpd  %ymm3, %ymm0, %ymm1
        vmulpd  %ymm3, %ymm0, %ymm0
        vshufpd $10, %ymm0, %ymm1, %ymm0
        vmovapd %ymm0, -32(%rdi)
        cmpq    %rax, %rdi
        jne     .L2

We are missing a pass that would move and combine VEC_PERM_EXPRs across
operations (a sketch of the combined form follows below).

Thanks for the detailed write-up on how AVX512 works compared to AVX2; it
does indeed look like an improvement!  (until AMD cripples it by splitting
it into two 256-bit halves... ;))
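For concreteness, here is a hand-written AVX2 intrinsics sketch (an editorial
addition, not part of the report; the function name pairs_double_combined is
made up) of what combining the two VEC_PERM_EXPRs across the +3/*2 operations
would amount to for the second testcase: the constants are applied to the
un-blended product/sum vectors, so only a single blend per iteration remains.

#include <immintrin.h>

void pairs_double_combined(double blocks[])
{
  blocks = __builtin_assume_aligned(blocks, 64);
  const __m256d c3 = _mm256_set1_pd(3.0);
  const __m256d c2 = _mm256_set1_pd(2.0);
  for (int i = 0; i < 10240; i += 4) {            /* two x/y pairs per vector */
      __m256d v  = _mm256_load_pd(&blocks[i]);    /* { x0, y0, x1, y1 } */
      __m256d xx = _mm256_permute4x64_pd(v, 0xa0);  /* { x0, x0, x1, x1 } */
      __m256d yy = _mm256_permute4x64_pd(v, 0xf5);  /* { y0, y0, y1, y1 } */
      /* fold the +3 and *2 into the lane-duplicated results ...            */
      __m256d prod = _mm256_add_pd(_mm256_mul_pd(xx, yy), c3);  /* x*y + 3   */
      __m256d sum  = _mm256_mul_pd(_mm256_add_pd(xx, yy), c2);  /* (x+y) * 2 */
      /* ... so one blend picks even lanes from prod and odd lanes from sum  */
      __m256d out  = _mm256_shuffle_pd(prod, sum, 0xa);
      _mm256_store_pd(&blocks[i], out);
  }
}

Compared to the generated code above this drops one of the two vshufpd blends
and the dependency between them; whether that is actually a win on a given
core is exactly the costing question raised in comment #1.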