[Bug rtl-optimization/92080] Missed CSE of _mm512_set1_epi8(c) with _mm256_set1_epi8(c)

rguenth at gcc dot gnu.org via Gcc-bugs Tue, 13 Jun 2023 00:48:08 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92080


Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |crazylht at gmail dot com
             Blocks|                            |53947
          Component|middle-end                  |rtl-optimization

--- Comment #6 from Richard Biener <rguenth at gcc dot gnu.org> ---
A similar issue shows up when vectorizing

int a[4096];

void foo ()
{
  for (int i = 1; i < 4095; ++i)
    a[i] = 42;
}

where the combination of peeling for alignment and the epilog yields on GIMPLE:

  <bb 2> [local count: 10737416]:
  MEM <vector(8) int> [(int *)&a + 4B] = { 42, 42, 42, 42, 42, 42, 42, 42 };
  MEM <vector(4) int> [(int *)&a + 36B] = { 42, 42, 42, 42 };
  MEM <vector(2) int> [(int *)&a + 52B] = { 42, 42 };
  a[15] = 42;
  ivtmp.28_59 = (unsigned long) &MEM <int[4096]> [(void *)&a + 64B];
  _1 = (unsigned long) &a;
  _182 = _1 + 16320;

  <bb 3> [local count: 75161909]:
  # ivtmp.28_71 = PHI <ivtmp.28_65(3), ivtmp.28_59(2)>
  _21 = (void *) ivtmp.28_71;
  MEM <vector(16) int> [(int *)_21] = { 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42 };
  ivtmp.28_65 = ivtmp.28_71 + 64;
  if (ivtmp.28_65 != _182)
    goto <bb 3>; [85.71%]
  else
    goto <bb 4>; [14.29%]

  <bb 4> [local count: 21474835]:
  MEM <vector(8) int> [(int *)&a + 16320B] = { 42, 42, 42, 42, 42, 42, 42, 42 };
  MEM <vector(4) int> [(int *)&a + 16352B] = { 42, 42, 42, 42 };
  MEM <vector(2) int> [(int *)&a + 16368B] = { 42, 42 };
  a[4094] = 42;
  return;

and that in turn causes a lot of redundant broadcasts from constants (via
GPRs):

foo:
.LFB0:
        .cfi_startproc
        movl    $42, %eax
        movq    .LC2(%rip), %rcx
        movl    $42, %edx
        movl    $42, a+60(%rip)
        vpbroadcastd    %eax, %ymm0
        vmovdqu %ymm0, a+4(%rip)
        vpbroadcastd    %eax, %xmm0
        movl    $a+64, %eax
        vmovdqu %xmm0, a+36(%rip)
        vpbroadcastd    %edx, %zmm0
        movq    %rcx, a+52(%rip)
.L2:
        vmovdqa32       %zmm0, (%rax)
        subq    $-128, %rax
        vmovdqa32       %zmm0, -64(%rax)
        cmpq    $a+16320, %rax
        jne     .L2
        vpbroadcastd    %edx, %ymm0
        movq    %rcx, a+16368(%rip)
        movl    $42, a+16376(%rip)
        vmovdqa %ymm0, a+16320(%rip)
        vpbroadcastd    %edx, %xmm0
        vmovdqa %xmm0, a+16352(%rip)
        vzeroupper
        ret

As the vectors are constants on GIMPLE, any "CSE" we'd perform there would be
quickly undone by constant propagation.  So it is only on RTL, where the actual
broadcast is a non-constant operation, that we can and should optimize this
somehow.  Some kind of LCM (lazy code motion) would be necessary here, so that
an earlier small broadcast followed by a later bigger one is also handled.
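
To illustrate the intended end result at the source level, here is a minimal
sketch that materializes the widest splat once and lets every narrower store
reuse its low lanes.  It is not a proposal for how the RTL pass should work,
and GCC may still emit separate broadcasts for it.  The name foo_shared_splat
is hypothetical; the v16si typedef uses the GCC generic vector extension and
the offsets mirror the GIMPLE dump above.

```c
#include <string.h>

/* 16 x int = 64 bytes, matching the vector(16) int store in the loop.  */
typedef int v16si __attribute__ ((vector_size (64)));

int a[4096];

/* Hypothetical hand-written variant of foo (): a single wide splat of 42
   is built once; the 32-, 16- and 8-byte stores copy its low lanes.  */
void
foo_shared_splat (void)
{
  v16si splat = { 42, 42, 42, 42, 42, 42, 42, 42,
                  42, 42, 42, 42, 42, 42, 42, 42 };

  /* Prologue (peeling for alignment).  */
  memcpy (&a[1], &splat, 32);     /* vector(8) int at a + 4B */
  memcpy (&a[9], &splat, 16);     /* vector(4) int at a + 36B */
  memcpy (&a[13], &splat, 8);     /* vector(2) int at a + 52B */
  a[15] = 42;

  /* Main loop: full 64-byte stores from a + 64B up to a + 16320B.  */
  for (int i = 16; i < 4080; i += 16)
    memcpy (&a[i], &splat, 64);

  /* Epilog.  */
  memcpy (&a[4080], &splat, 32);  /* vector(8) int at a + 16320B */
  memcpy (&a[4088], &splat, 16);  /* vector(4) int at a + 16352B */
  memcpy (&a[4092], &splat, 8);   /* vector(2) int at a + 16368B */
  a[4094] = 42;
}
```

On x86 this corresponds to a single vpbroadcastd into %zmm0, with the %ymm0
and %xmm0 stores using the low part of that register.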


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations
