[Bug tree-optimization/112443] New: Misoptimization of _mm256_blendv_epi8 intrinsic on avx512bw+avx512vl

alexander.grund--- via Gcc-bugs Wed, 08 Nov 2023 06:27:30 -0800

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112443


            Bug ID: 112443
           Summary: Misoptimization of _mm256_blendv_epi8 intrinsic on
                    avx512bw+avx512vl
           Product: gcc
           Version: 12.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: alexander.gr...@tu-dresden.de
  Target Milestone: ---

Created attachment 56533
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=56533&action=edit
Reproducer code extracted from actual source

I came across a piece of code in PyTorch using AVX2 intrinsics that is
misoptimized, producing wrong results when compiled for newer CPUs.
In particular, I was able to reproduce this with `-mavx512bw -mavx512vl -O2`.

We usually compile with `-march=native`, which on the Sapphire Rapids system
enables the above AVX512 flags, but so do `-march=cannonlake` and newer targets.

The piece of code in question is a call to `_mm256_blendv_epi8(a, b, mask)`
that seemingly produces inverted semantics, i.e. for a mask with all bits
set it returns a, and for a mask with all bits unset it returns b.
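
For reference, the documented behaviour of the intrinsic can be written as a
scalar model: for each of the 32 byte lanes, b is selected when the most
significant bit of the mask byte is set, otherwise a. The following is only a
sketch for comparison, not the attached reproducer and not the intrinsic itself;
the names `Bytes32`, `blendv_epi8_model` and `splat` are made up for
illustration.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for a 256-bit vector viewed as 32 unsigned bytes.
using Bytes32 = std::array<std::uint8_t, 32>;

// Scalar model of _mm256_blendv_epi8(a, b, mask):
// per byte lane, pick b if the mask byte's top bit is set, otherwise pick a.
inline Bytes32 blendv_epi8_model(const Bytes32& a, const Bytes32& b,
                                 const Bytes32& mask) {
    Bytes32 result{};
    for (std::size_t i = 0; i < result.size(); ++i) {
        result[i] = (mask[i] & 0x80u) ? b[i] : a[i];
    }
    return result;
}

// Hypothetical helper: a vector with every byte set to the same value.
inline Bytes32 splat(std::uint8_t value) {
    Bytes32 v{};
    v.fill(value);
    return v;
}
```

With this model, `blendv_epi8_model(splat(0), splat(1), splat(0xFF))` yields all
ones (b), and the same call with `splat(0x00)` as mask yields all zeros (a).
The miscompiled binary behaves as if these two cases were swapped.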

It is also a bit complicated to reproduce, as it seems to require hiding some
details behind a lambda called through `std::function`.
In the attached example, a zero vector and a one vector are created once and
copied into the lambda, where they are reused for potentially many iterations
(removing the loop also reproduces the issue).
Either of the following actions causes the bug to disappear:
- Removing either of the 2 `-mavx512` flags
- Reducing to `-O1` or lower
- Moving the zero_vec inside the lambda (moving one_vec makes no difference)
- Not calling through std::function (either run the lambda directly or pass
through as a template param instead of std::function)
- `-DREGEN_MASK` to create a new mask through a (superfluous)
`_mm256_cmpeq_epi8` against all 1 bits

Reproducing:
g++ -std=c++17 -mavx512bw -mavx512vl -O2 bug.cpp && ./a.out

Expected output (last line; the first line shows the inverted semantics):
vec[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1]

Actual output:
vec[255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
255]
