https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93440
Bug ID: 93440
Summary: scalar unrolled loop makes vectorized code unreachable
Product: gcc
Version: 9.2.1
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: ikonomisma at googlemail dot com
Target Milestone: ---
Created attachment 47711
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=47711&action=edit
generated assembly code, showing unreachable SIMD vector code for
transform_reduce
With gcc version 9.2.1 20190827 (Red Hat 9.2.1-1) (GCC) on x86-64, "-O3
-std=gnu++17 -march=core-avx2" emits both SIMD vectorized code and an unrolled
loop for "std::transform_reduce". The unrolled loop prevents the SIMD
vectorized code from executing for reasonable vector sizes. In contrast,
"std::inner_product" produces reachable vectorized code.
Minimal reproducing C++ code:
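#include <vector>
#include <algorithm>
#include <functional>  // std::plus, std::multiplies
#include <numeric>     // std::inner_product, std::transform_reduce

// Dot product via inner_product: the vectorized code is reachable.
auto workingvector(std::vector<int> const& a, std::vector<int> const& b)
{
    return std::inner_product(cbegin(a), cend(a), cbegin(b), 0,
                              std::plus<>{}, std::multiplies<>{});
}

// Same dot product via transform_reduce: the vectorized code is unreachable.
auto brokenvector(std::vector<int> const& a, std::vector<int> const& b)
{
    return std::transform_reduce(cbegin(a), cend(a), cbegin(b), 0,
                                 std::plus<>{}, std::multiplies<>{});
}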
Details:
The generated assembly for "transform_reduce" checks for short vectors with
a signed comparison, so the vectorized code is *technically* reachable (but only
for vector sizes that are completely infeasible in a 64-bit address space). Once
the vector is large enough to enter the unrolled scalar loop, that loop processes
the entire vector, so the SIMD vector code never executes.
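
To illustrate the control flow described above, here is a rough sketch of a
hand-unrolled dot product. This is an assumption made purely for illustration:
the name "unrolled_dot" and the unroll factor of 4 are hypothetical, and the
code is not taken from the attachment or from the library sources. It only
shows how an unrolled main loop can leave nothing useful for a loop placed
after it.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical illustration only: a dot product whose main loop is
// manually unrolled by four (the factor is an assumption).
int unrolled_dot(std::vector<int> const& a, std::vector<int> const& b)
{
    // Use the shorter length so both index ranges stay in bounds.
    std::size_t const n = a.size() < b.size() ? a.size() : b.size();
    int sum = 0;
    std::size_t i = 0;

    // Unrolled scalar loop: consumes every full block of four elements.
    for (; n - i >= 4; i += 4) {
        sum += a[i] * b[i] + a[i + 1] * b[i + 1]
             + a[i + 2] * b[i + 2] + a[i + 3] * b[i + 3];
    }

    // Remainder loop: at most three elements are left at this point.
    // If only this loop were vectorized, the vector path would never
    // see enough elements to run.
    for (; i < n; ++i) {
        sum += a[i] * b[i];
    }
    return sum;
}
```

In this sketch the second loop can never run more than three iterations, which
is too few to fill an AVX2 vector of eight ints. That would be consistent with
the observation above that the SIMD code is emitted but never executes, though
whether this is what actually happens here is not confirmed.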