https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119096
Bug ID: 119096
Summary: Loop with conditional, cast and reduction vectorized incorrectly with AVX-512
Product: gcc
Version: 14.2.1
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: someone12469 at gmail dot com
Target Milestone: ---

The following C program, which should output 0, outputs 8 when compiled with
gcc -O2 -mavx512f on 64-bit Linux.

int printf(const char *, ...);

long sum(int* A, int* B) {
    long total = 0;
    for(int j = 0; j < 16; j++)
        if((A[j] > 0) & (B[j] > 0))
            total += (long)A[j];
    return total;
}

int main() {
    int A[16] = { 1,1,1,1, 1,1,1,1, 1,1,1,1, 1,1,1,1 };
    int B[16] = { };
    printf("%ld\n", sum(A, B));
}

The single & is intentional and significant: in the original program it was
used to hint that the read is safe, to get better vectorization.

In the resulting assembly for sum:

sum:
.LFB0:
        .cfi_startproc
        vmovdqu32       (%rdi), %zmm0
        vpxor   %xmm2, %xmm2, %xmm2
        vmovdqu32       (%rsi), %zmm3
        vpcmpd  $6, %zmm2, %zmm0, %k1
        vextracti64x4   $0x1, %zmm0, %ymm1
        vpmovsxdq       %ymm0, %zmm0
        vpmovsxdq       %ymm1, %zmm1
        vpcmpd  $6, %zmm2, %zmm3, %k1{%k1}
        vmovdqa64       %zmm1, %zmm2
        kshiftrw        $8, %k1, %k1
        vpaddq  %zmm1, %zmm0, %zmm2{%k1}
        vextracti64x4   $0x1, %zmm2, %ymm1
        vpaddq  %ymm2, %ymm1, %ymm1
        vextracti128    $0x1, %ymm1, %xmm0
        vpaddq  %xmm1, %xmm0, %xmm0
        vpsrldq $8, %xmm0, %xmm1
        vpaddq  %xmm1, %xmm0, %xmm0
        vmovq   %xmm0, %rax
        vzeroupper
        ret
        .cfi_endproc

The main issue appears to be the "vpaddq %zmm1, %zmm0, %zmm2{%k1}". In lanes
where the shifted mask bit is clear, it keeps the value already in %zmm2 (an
unmasked copy of %zmm1, not zero), so elements are added to the total even
when both halves are masked off.

Tested on x86_64-pc-linux-gnu with gcc 14.2.1 and a local build of the latest
commit.

Since the bug reporting instructions insist, the output of gcc -v for my
distribution's 14.2.1:

Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-pc-linux-gnu/14.2.1/lto-wrapper
Target: x86_64-pc-linux-gnu
Configured with: /build/gcc/src/gcc/configure --enable-languages=ada,c,c++,d,fortran,go,lto,m2,objc,obj-c++,rust --enable-bootstrap --prefix=/usr --libdir=/usr/lib --libexecdir=/usr/lib --mandir=/usr/share/man --infodir=/usr/share/info --with-bugurl=https://gitlab.archlinux.org/archlinux/packaging/packages/gcc/-/issues --with-build-config=bootstrap-lto --with-linker-hash-style=gnu --with-system-zlib --enable-__cxa_atexit --enable-cet=auto --enable-checking=release --enable-clocale=gnu --enable-default-pie --enable-default-ssp --enable-gnu-indirect-function --enable-gnu-unique-object --enable-libstdcxx-backtrace --enable-link-serialization=1 --enable-linker-build-id --enable-lto --enable-multilib --enable-plugin --enable-shared --enable-threads=posix --disable-libssp --disable-libstdcxx-pch --disable-werror
Thread model: posix
Supported LTO compression algorithms: zlib zstd
gcc version 14.2.1 20240910 (GCC)
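
For illustration, the effect of the merge-masked vpaddq in the listing for sum can be modelled in scalar C. This is only a sketch of one reading of the assembly, not GCC's internal representation. The names masked_add_merge, model_sum and reference_sum are hypothetical and exist only for this example, and the arrays named after registers merely mirror the registers in the listing.

```c
#include <stdint.h>

/* Scalar model of an 8-lane, 64-bit merge-masked add, i.e. the effect of
   "vpaddq a, b, dst{k}" without {z}: lanes whose mask bit is clear keep
   whatever dst held before the instruction. */
static void masked_add_merge(int64_t dst[8], const int64_t a[8],
                             const int64_t b[8], uint8_t k)
{
    for (int i = 0; i < 8; i++)
        if (k & (1u << i))
            dst[i] = a[i] + b[i];
}

/* Follows the register flow of the listing step by step (assumed reading). */
long model_sum(const int *A, const int *B)
{
    /* The two vpcmpd instructions: k1 = (A > 0) & (B > 0), one bit per element. */
    uint16_t k1 = 0;
    for (int i = 0; i < 16; i++)
        if (A[i] > 0 && B[i] > 0)
            k1 |= (uint16_t)(1u << i);

    int64_t zmm0[8], zmm1[8], zmm2[8];
    for (int i = 0; i < 8; i++) {
        zmm0[i] = A[i];       /* vpmovsxdq %ymm0, %zmm0: elements 0..7  */
        zmm1[i] = A[i + 8];   /* vpmovsxdq %ymm1, %zmm1: elements 8..15 */
        zmm2[i] = zmm1[i];    /* vmovdqa64 %zmm1, %zmm2                 */
    }

    k1 = (uint16_t)(k1 >> 8); /* kshiftrw $8, %k1, %k1 */
    masked_add_merge(zmm2, zmm0, zmm1, (uint8_t)k1);

    /* The remaining instructions reduce the eight lanes to one value. */
    int64_t total = 0;
    for (int i = 0; i < 8; i++)
        total += zmm2[i];
    return (long)total;
}

/* What the source loop asks for. */
long reference_sum(const int *A, const int *B)
{
    long total = 0;
    for (int j = 0; j < 16; j++)
        if ((A[j] > 0) & (B[j] > 0))
            total += (long)A[j];
    return total;
}
```

With the A and B from the test case (all ones and all zeros), the mask is empty, every lane of zmm2 keeps its unmasked copy of zmm1, and model_sum returns 8, while reference_sum returns 0.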