https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117000
Bug ID: 117000
Summary: Inefficient code for 32-byte struct comparison (ptest missing)
Product: gcc
Version: unknown
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: rtl-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: chfast at gmail dot com
Target Milestone: ---
I was investigating why, in GCC 13.3, the functions test1 and test2 produce
different x86 assembly. They differ only in where the int -> U256
user-defined conversion takes place.
This led to the discovery that the code generated for x86-64-v2 is not very
efficient in any of the examples. E.g. for some reason a shift instruction
(psrldq) is used.
In GCC 14+, test2 also compiles to the same code as test1.
https://godbolt.org/z/r1vfcPone
using uint64_t = unsigned long;

struct U256
{
    uint64_t words_[4]{};

    U256(uint64_t v)
      : words_{v}
    {}
};

bool eq(const U256& x, const U256& y)
{
    uint64_t folded = 0;
    for (int i = 0; i < 4; ++i)
        folded |= (x.words_[i] ^ y.words_[i]);
    return folded == 0;
}

bool eqi(const U256& x, uint64_t y)
{
    return eq(x, U256(y));
}

auto test1(const U256& x)
{
    return eqi(x, uint64_t(0));
}

bool test2(const U256& x)
{
    return eq(x, U256(0));
}

test1(U256 const&):
        movdqu  xmm1, XMMWORD PTR [rdi+16]
        movdqu  xmm0, XMMWORD PTR [rdi]
        por     xmm0, xmm1
        movdqa  xmm1, xmm0
        psrldq  xmm1, 8
        por     xmm0, xmm1
        movq    rax, xmm0
        test    rax, rax
        sete    al
        ret
test2(U256 const&):
        mov     rax, QWORD PTR [rdi]
        or      rax, QWORD PTR [rdi+8]
        or      rax, QWORD PTR [rdi+16]
        or      rax, QWORD PTR [rdi+24]
        sete    al
        ret
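
For comparison, here is a rough sketch of the ptest-based sequence that the
summary alludes to, written by hand with SSE4.1 intrinsics (SSE4.1 is part of
x86-64-v2). The helper name is_zero_256 is hypothetical and is not part of the
test case above. It assumes that comparing against a zero U256 reduces to
checking that all four words are zero, and it is not meant as a statement of
what GCC should emit. A scalar fallback is included so that the snippet still
builds when SSE4.1 is not enabled.

```cpp
#include <cstdint>

#if defined(__SSE4_1__)
#include <smmintrin.h>  // SSE4.1 intrinsics, including _mm_testz_si128 (ptest)
#endif

// Hypothetical helper (not from the report): returns true when all four
// 64-bit words starting at p are zero.
inline bool is_zero_256(const std::uint64_t* p)
{
#if defined(__SSE4_1__)
    // Two unaligned 16-byte loads, matching the movdqu pair in test1.
    const __m128i lo = _mm_loadu_si128(reinterpret_cast<const __m128i*>(p));
    const __m128i hi = _mm_loadu_si128(reinterpret_cast<const __m128i*>(p + 2));
    const __m128i folded = _mm_or_si128(lo, hi);
    // ptest sets ZF when (folded & folded) == 0; _mm_testz_si128 returns ZF.
    // This replaces the movdqa/psrldq/por/movq/test tail of test1.
    return _mm_testz_si128(folded, folded) != 0;
#else
    // Scalar fallback when SSE4.1 is not enabled at compile time.
    return (p[0] | p[1] | p[2] | p[3]) == 0;
#endif
}
```

Built with -O2 -march=x86-64-v2, the SSE4.1 path would be expected to come out
as two loads, a por, a ptest and a sete, although the exact output depends on
the compiler version.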