I have the following code:
struct bounding_box {
    pack4sf m_Mins;
    pack4sf m_Maxs;

    void set(__v4sf v_mins, __v4sf v_maxs) {
        m_Mins = v_mins;
        m_Maxs = v_maxs;
    }
};

struct bin {
    bounding_box m_Box[3];
    pack4si m_NL;
    pack4sf m_AL;
};

static const std::size_t bin_count = 16;
bin aBins[bin_count];

for (std::size_t i = 0; i != bin_count; ++i) {
    bin& b = aBins[i];
    b.m_Box[0].set(g_VecInf, g_VecMinusInf);
    b.m_Box[1].set(g_VecInf, g_VecMinusInf);
    b.m_Box[2].set(g_VecInf, g_VecMinusInf);
    b.m_NL = __v4si{ 0, 0, 0, 0 };
}
where pack4sf/si are union-based wrappers for __v4sf/si.
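To make the wrapper concrete, here is a minimal self-contained sketch of what I mean by a union-based wrapper (simplified from my actual code; the typedefs stand in for what <xmmintrin.h> would normally provide):

    #include <cstdio>

    // GCC vector extension types, normally supplied by the SSE headers.
    typedef float __v4sf __attribute__((vector_size(16)));
    typedef int   __v4si __attribute__((vector_size(16)));

    // Union-based wrapper: the same 16 bytes are visible both as a whole
    // SSE register and as four scalar lanes.
    union pack4sf {
        __v4sf m_Vec;
        float  m_Lane[4];

        pack4sf() {}
        pack4sf(__v4sf v) : m_Vec(v) {}
        pack4sf& operator=(__v4sf v) { m_Vec = v; return *this; }
    };

    int main() {
        pack4sf p = __v4sf{ 1.0f, 2.0f, 3.0f, 4.0f };
        std::printf("%f %f\n", p.m_Lane[0], p.m_Lane[3]);
        return 0;
    }

pack4si is the analogous union over __v4si.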
GCC 4.5 on Core i7/Cygwin with
-O3 -fno-lto -msse -msse2 -mfpmath=sse -march=native -mtune=native
-fomit-frame-pointer
completely unrolled the loop into 112 movdqa instructions,
which is "a bit" too aggressive. Should I file a bug report?
The processor has an 18-instruction-deep prefetch queue and the
loop is perfectly predictable by the built-in branch prediction
circuitry, so translating it as written would greatly reduce the
fetch/decode bandwidth requirements. Is there something like
"#pragma nounroll" to selectively disable this optimization?
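The closest mechanism I have found so far is the per-function optimize attribute (available since GCC 4.4), though I have not verified that "no-unroll-loops" actually suppresses the complete peeling that -O3 performs; the global alternative would be tuning --param max-completely-peel-times. A sketch on a trivial loop:

    #include <cstdio>

    // Candidate workaround (unverified): per-function optimization
    // override via the optimize attribute. Whether "no-unroll-loops"
    // also disables -O3's complete loop peeling is exactly what I am
    // unsure about.
    __attribute__((optimize("no-unroll-loops")))
    void fill(int* a, int n) {
        for (int i = 0; i != n; ++i)
            a[i] = 0;   // loop body that should stay rolled
    }

    int main() {
        int a[16];
        fill(a, 16);
        std::printf("%d %d\n", a[0], a[15]);
        return 0;
    }

Moving the initialization loop into such a function would at least confine the override to that one spot.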
Best regards
Piotr Wyderski