https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96906
Bug ID: 96906
Summary: Failure to optimize __builtin_ia32_psubusw128 compared
to 0 to __builtin_ia32_pminuw128 compared to operand
Product: gcc
Version: 11.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: gabravier at gmail dot com
Target Milestone: ---
#include <stdint.h>

typedef int16_t v8i16 __attribute__((vector_size(16)));

v8i16 cmple_epu16(v8i16 x, v8i16 y)
{
    return __builtin_ia32_psubusw128(x, y) == 0;
}
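With -msse4.1, this can be optimized to
`return __builtin_ia32_pminuw128(x, y) == x;`, since the unsigned saturating
difference x - y is zero exactly when x <= y (unsigned), which is also exactly
when min(x, y) == x. LLVM performs this transformation, but GCC does not.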
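
The equivalence can be sanity-checked with a scalar model of a single 16-bit lane. This is only a sketch: `subus16`, `minu16` and `count_mismatches` are hypothetical helper names, not GCC builtins, and the lanes are modelled as `uint16_t` because both instructions treat them as unsigned. The check samples every value of `a` against every 251st value of `b` rather than all 2^32 pairs.

```c
#include <stdint.h>

/* Scalar model of one psubusw lane: unsigned subtract, saturating at 0. */
static uint16_t subus16(uint16_t a, uint16_t b)
{
    return a > b ? (uint16_t)(a - b) : 0;
}

/* Scalar model of one pminuw lane: unsigned minimum. */
static uint16_t minu16(uint16_t a, uint16_t b)
{
    return a < b ? a : b;
}

/* Counts the sampled (a, b) pairs on which the two formulations disagree.
   a covers the full 16-bit range; b is sampled with a stride of 251. */
static unsigned long count_mismatches(void)
{
    unsigned long mismatches = 0;
    for (uint32_t a = 0; a <= UINT16_MAX; a++) {
        for (uint32_t b = 0; b <= UINT16_MAX; b += 251) {
            int via_subus = subus16((uint16_t)a, (uint16_t)b) == 0;
            int via_min = minu16((uint16_t)a, (uint16_t)b) == (uint16_t)a;
            mismatches += (via_subus != via_min);
        }
    }
    return mismatches;
}
```

If the two forms agree on every sampled pair, `count_mismatches()` returns 0. This only models the per-lane arithmetic; it says nothing about the code either compiler actually emits.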
PS: I'm not 100% sure this is faster, but it logically should be, since the
`pminuw` version doesn't need to zero an SSE register for the comparison.