[Bug tree-optimization/119096] New: Loop with conditional, cast and reduction vectorized incorrectly with AVX-512

someone12469 at gmail dot com via Gcc-bugs Mon, 03 Mar 2025 01:40:16 -0800

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119096


            Bug ID: 119096
           Summary: Loop with conditional, cast and reduction vectorized
                    incorrectly with AVX-512
           Product: gcc
           Version: 14.2.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: someone12469 at gmail dot com
  Target Milestone: ---

The following C program, which should output 0, outputs 8 when compiled with
gcc -O2 -mavx512f on 64-bit Linux.

int printf(const char *, ...);
long sum(int* A, int* B)
{
        long total = 0;
        for(int j = 0; j < 16; j++)
                if((A[j] > 0) & (B[j] > 0))
                        total += (long)A[j];
        return total;
}
int main()
{
        int A[16] = { 1,1,1,1, 1,1,1,1, 1,1,1,1, 1,1,1,1 };
        int B[16] = { };
        printf("%ld\n", sum(A, B));
}

The single & is intentional and significant: in the original program it was
used to hint that the read is safe, in order to get better vectorization. In
the resulting assembly for sum:

sum:
.LFB0:
        .cfi_startproc
        vmovdqu32       (%rdi), %zmm0
        vpxor   %xmm2, %xmm2, %xmm2
        vmovdqu32       (%rsi), %zmm3
        vpcmpd  $6, %zmm2, %zmm0, %k1
        vextracti64x4   $0x1, %zmm0, %ymm1
        vpmovsxdq       %ymm0, %zmm0
        vpmovsxdq       %ymm1, %zmm1
        vpcmpd  $6, %zmm2, %zmm3, %k1{%k1}
        vmovdqa64       %zmm1, %zmm2
        kshiftrw        $8, %k1, %k1
        vpaddq  %zmm1, %zmm0, %zmm2{%k1}
        vextracti64x4   $0x1, %zmm2, %ymm1
        vpaddq  %ymm2, %ymm1, %ymm1
        vextracti128    $0x1, %ymm1, %xmm0
        vpaddq  %xmm1, %xmm0, %xmm0
        vpsrldq $8, %xmm0, %xmm1
        vpaddq  %xmm1, %xmm0, %xmm0
        vmovq   %xmm0, %rax
        vzeroupper
        ret
        .cfi_endproc

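the main issue appears to be the "vpaddq %zmm1, %zmm0, %zmm2{%k1}". %zmm2 is
first loaded with the sign-extended upper half (%zmm1), so every lane whose
mask bit is clear keeps A[8+i] instead of contributing 0. Since %k1 has been
shifted right by 8, only the upper half's mask is consulted: lanes whose bit is
set add A[i] even if the lower half is masked as well.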
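
To make the suspected failure mode easier to follow, here is a rough scalar
model of the instruction sequence, written from my reading of the listing and
not taken from any compiler dump. The names model_sum, expected_sum, ones,
zeros and upper_only are hypothetical and exist only for this illustration.

```c
/* Hypothetical test inputs: "ones" and "zeros" mirror A and B in main(). */
static const int ones[16]  = { 1,1,1,1, 1,1,1,1, 1,1,1,1, 1,1,1,1 };
static const int zeros[16] = { 0 };
/* B positive only for elements 8..15. */
static const int upper_only[16] = { 0,0,0,0, 0,0,0,0, 1,1,1,1, 1,1,1,1 };

/* What the source loop is supposed to compute. */
static long expected_sum(const int *A, const int *B)
{
    long total = 0;
    for (int j = 0; j < 16; j++)
        if ((A[j] > 0) & (B[j] > 0))
            total += (long)A[j];
    return total;
}

/* Assumed scalar model of the generated code, one 64-bit lane at a time. */
static long model_sum(const int *A, const int *B)
{
    unsigned mask = 0;              /* %k1: bit j set when A[j] > 0 and B[j] > 0 */
    for (int j = 0; j < 16; j++)
        if ((A[j] > 0) & (B[j] > 0))
            mask |= 1u << j;
    mask >>= 8;                     /* kshiftrw $8: only bits 8..15 survive */

    long total = 0;
    for (int i = 0; i < 8; i++) {
        long lo = A[i];             /* lane i of %zmm0 (vpmovsxdq, lower half) */
        long hi = A[8 + i];         /* lane i of %zmm1 (vpmovsxdq, upper half) */
        long lane = hi;             /* vmovdqa64 %zmm1, %zmm2 */
        if (mask & (1u << i))
            lane = lo + hi;         /* masked vpaddq merges into %zmm2 */
        total += lane;              /* horizontal reduction at the end */
    }
    return total;
}
```

Under this model, model_sum(ones, zeros) gives 8 while expected_sum(ones, zeros)
gives 0, which matches the observed output. With upper_only as B the model
gives 16 where 8 is expected, because the lower-half values are added without
their own mask being checked.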

Tested on x86_64-pc-linux-gnu on gcc 14.2.1 and a local build of the latest
commit. Since the bug reporting instructions insist, the output of gcc -v for
my distribution's 14.2.1:

Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-pc-linux-gnu/14.2.1/lto-wrapper
Target: x86_64-pc-linux-gnu
Configured with: /build/gcc/src/gcc/configure
--enable-languages=ada,c,c++,d,fortran,go,lto,m2,objc,obj-c++,rust
--enable-bootstrap --prefix=/usr --libdir=/usr/lib --libexecdir=/usr/lib
--mandir=/usr/share/man --infodir=/usr/share/info
--with-bugurl=https://gitlab.archlinux.org/archlinux/packaging/packages/gcc/-/issues
--with-build-config=bootstrap-lto --with-linker-hash-style=gnu
--with-system-zlib --enable-__cxa_atexit --enable-cet=auto
--enable-checking=release --enable-clocale=gnu --enable-default-pie
--enable-default-ssp --enable-gnu-indirect-function --enable-gnu-unique-object
--enable-libstdcxx-backtrace --enable-link-serialization=1
--enable-linker-build-id --enable-lto --enable-multilib --enable-plugin
--enable-shared --enable-threads=posix --disable-libssp --disable-libstdcxx-pch
--disable-werror
Thread model: posix
Supported LTO compression algorithms: zlib zstd
gcc version 14.2.1 20240910 (GCC)
