https://gcc.gnu.org/bugzilla/show_bug.cgi?id=124271
Bug ID: 124271
Summary: x86/AVX2: missed simplification — low32×low32→u64
vectorized multiply expands to generic u64-mul
sequence instead of single vpmuludq
Product: gcc
Version: 15.2.1
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: adamant.pwn at gmail dot com
Target Milestone: ---
Created attachment 63790
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=63790&action=edit
Preprocessed source
On x86-64 with -std=c++23 -O3 -mavx2, GCC vectorizes the loop below (32-byte
vectors), but the generated AVX2 loop computes the multiply via a generic
packed-uint64_t multiply expansion: it masks each input with 0xffffffff and
then performs cross-term work (vpsrlq + 3×vpmuludq + adds/shifts).
After the mask, the upper 32 bits of each 64-bit element are known zero, so the
cross terms are provably zero and the operation can be implemented directly
with AVX2 vpmuludq (which multiplies even dword lanes 0,2,4,6 → 4×u64 results),
i.e., one vpmuludq per 4 elements.
Clang trunk emits the direct vpmuludq idiom for the same source (see Godbolt
links below).
Testcase (also attached as preprocessed t.ii):
#include <cstdint>

static inline std::uint64_t mul32(std::uint64_t a, std::uint64_t b) {
  return std::uint64_t(std::uint32_t(a)) * std::uint64_t(std::uint32_t(b));
}

void many_mul3(std::uint64_t* __restrict a, const std::uint64_t* __restrict b) {
  for (int i = 0; i < 1024; i++)
    a[i] = mul32(a[i], b[i]);
}
Assembly:
g++ -std=c++23 -O3 -mavx2 -S -masm=intel t.cpp -o t.s
Vectorizer diagnostics (-fopt-info-vec-all):
t.cpp:11:23: optimized: loop vectorized using 32 byte vectors
t.cpp:8:6: note: vectorized 1 loops in function.
t.cpp:13:1: note: ***** Analysis failed with vector mode VOID
Actual generated inner loop (GCC 15.2.1 20260209, -O3 -mavx2):
vpand ymm4, ymm5, YMMWORD PTR [rdi+rax]
vpand ymm3, ymm5, YMMWORD PTR [rsi+rax]
vpsrlq ymm2, ymm4, 32
vpsrlq ymm0, ymm3, 32
vpmuludq ymm0, ymm0, ymm4
vpmuludq ymm2, ymm2, ymm3
vpmuludq ymm1, ymm3, ymm4
vpaddq ymm0, ymm0, ymm2
vpsllq ymm0, ymm0, 32
vpaddq ymm0, ymm1, ymm0
vmovdqu YMMWORD PTR [rdi+rax], ymm0
Expected:
Since the semantics are uint64_t(uint32_t(a[i])) * uint64_t(uint32_t(b[i])),
the low 32-bit halves of each 64-bit element are the only inputs. On AVX2,
vpmuludq multiplies the even dword lanes (0,2,4,6), which correspond exactly to
the low 32 bits of each uint64_t lane, producing 4×u64 results. Therefore,
after masking, the cross-term work in the generic u64 multiplication expansion
is unnecessary and could be simplified to the direct vpmuludq idiom (one per 4
elements), without shifts/adds/cross-term multiplies.
Toolchain / environment
Target: x86_64-pc-linux-gnu
gcc version 15.2.1 20260209 (GCC)
Arch Linux build
Godbolt:
GCC: https://godbolt.org/z/oYWGW3zKf
Clang: https://godbolt.org/z/PfjPrPr4o