https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112443
Bug ID: 112443
Summary: Misoptimization of _mm256_blendv_epi8 intrinsic on avx512bw+avx512vl
Product: gcc
Version: 12.2.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: alexander.gr...@tu-dresden.de
Target Milestone: ---

Created attachment 56533
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=56533&action=edit
Reproducer code extracted from actual source

I came across a piece of code in PyTorch that uses AVX2 intrinsics and is misoptimized, producing wrong results when compiled for newer CPUs. In particular, I was able to reproduce this with `-mavx512bw -mavx512vl -O2`. We usually compile with `-march=native`, which on the Sapphire Rapids system enables the above AVX512 flags, but so does `-march=cannonlake` and above.

The piece of code in question is a call to `_mm256_blendv_epi8(a, b, mask)` that seemingly produces inverted semantics: for a mask with all bits set it returns `a`, and for a mask with all bits unset it returns `b`.

It is also a bit complicated to reproduce, as it seems to require hiding some details behind a lambda called through `std::function`.

In the attached example, a zero vector and a one vector are created once and copied into the lambda, where they are reused for potentially many iterations (removing the loop also reproduces the issue).

Any one of the following actions causes the bug to disappear:
- Removing either of the 2 `-mavx512` flags
- Reducing to `-O1` or lower
- Moving the zero_vec inside the lambda (moving one_vec makes no difference)
- Not calling through std::function (either run the lambda directly or pass it as a template parameter instead of std::function)
- `-DREGEN_MASK` to create a new mask through a (superfluous) `_mm256_cmpeq_epi8` against all 1 bits

Reproducing:

    g++ -std=c++17 -mavx512bw -mavx512vl -O2 bug.cpp && ./a.out

Expected output (last line; the first line shows the inverted semantics):

    vec[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

Actual output:

    vec[255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255]
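
For reference, the semantics I expect from `_mm256_blendv_epi8(a, b, mask)` can be written as a plain scalar model: each result byte is taken from `b` when the top bit of the corresponding mask byte is set, and from `a` otherwise. The sketch below is not the attached reproducer and uses no intrinsics. `Bytes32`, `blendv_epi8_model`, `filled` and `blendv_model_self_check` are hypothetical names for illustration only, and the fill values 0 and 1 are arbitrary.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper type: one 256-bit vector viewed as 32 bytes.
using Bytes32 = std::array<std::uint8_t, 32>;

// Scalar model of the per-byte selection done by _mm256_blendv_epi8(a, b, mask):
// take the byte from b if the top bit of the mask byte is set, otherwise from a.
Bytes32 blendv_epi8_model(const Bytes32& a, const Bytes32& b, const Bytes32& mask) {
    Bytes32 out{};
    for (std::size_t i = 0; i < out.size(); ++i) {
        out[i] = (mask[i] & 0x80u) ? b[i] : a[i];
    }
    return out;
}

// Build a vector with every byte set to the same value.
Bytes32 filled(std::uint8_t value) {
    Bytes32 v{};
    v.fill(value);
    return v;
}

// Returns true if the model selects b for an all-ones mask
// and a for an all-zeros mask.
bool blendv_model_self_check() {
    const Bytes32 a = filled(0);  // arbitrary illustrative value
    const Bytes32 b = filled(1);  // arbitrary illustrative value
    const bool all_set_selects_b = blendv_epi8_model(a, b, filled(0xFF)) == b;
    const bool all_clear_selects_a = blendv_epi8_model(a, b, filled(0x00)) == a;
    return all_set_selects_b && all_clear_selects_a;
}
```

With this model, an all-ones mask yields `b` and an all-zeros mask yields `a`, which is the opposite of what the miscompiled binary produces.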