https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104265
--- Comment #4 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #3)
> Note the SLP discovery opportunity is from the "reduction" PHI to the
> return which merges control flow to a zero/one flag.
Right, I see what you mean. So in
<bb 5> [local count: 308696474]:
_52 = t2x_61 < 0.0;
_53 = t2y_63 < 0.0;
_54 = _52 | _53;
_66 = t2z_65 < 0.0;
_67 = _54 | _66;
if (_67 != 0)
goto <bb 15>; [51.40%]
else
goto <bb 6>; [48.60%]
<bb 15> [local count: 158662579]:
goto <bb 8>; [100.00%]
<bb 6> [local count: 150033894]:
_55 = isec_58(D)->dist;
_68 = _55 < t1y_62;
_69 = _55 < t1x_60;
_70 = _68 | _69;
_71 = _55 < t1z_64;
_72 = _70 | _71;
_73 = ~_72;
_74 = (int) _73;
<bb 7> [local count: 1073741824]:
# _56 = PHI <0(8), _74(6)>
return _56;
we start at _56 and follow the preds up. The interesting bit here though is
that the values being compared aren't sequential in memory.
So:
if (t1x > isec->dist || t1y > isec->dist || t1z > isec->dist) return 0;
float t1x = (bb[isec->bv_index[0]] - isec->start[0]) * isec->idot_axis[0];
float t1y = (bb[isec->bv_index[2]] - isec->start[1]) * isec->idot_axis[1];
float t1z = (bb[isec->bv_index[4]] - isec->start[2]) * isec->idot_axis[2];
but then in:
if (t1x > t2y || t2x < t1y || t1x > t2z || t2x < t1z || t1y > t2z || t2y < t1z) return 0;
we need a replicated t1x and {t2x, t2x, t2y}.
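To make the operand layout concrete, here is a minimal C sketch, assuming the six comparisons are first normalised so they all use `>`. The function names overlap_scalar and overlap_lanes and the lane order are illustrative only; they are not taken from the benchmark source or from what GCC or ICX actually emit, and the loop merely models a lane-wise vector compare.

```c
#include <stdbool.h>

/* Scalar form of the exit test, written as in the condition above. */
static bool overlap_scalar(float t1x, float t1y, float t1z,
                           float t2x, float t2y, float t2z)
{
    if (t1x > t2y || t2x < t1y || t1x > t2z ||
        t2x < t1z || t1y > t2z || t2y < t1z)
        return false;
    return true;
}

/* Same test with every comparison rewritten as lhs[i] > rhs[i].
   The lane order is a hypothetical choice; a vectoriser is free
   to pick a different one. */
static bool overlap_lanes(float t1x, float t1y, float t1z,
                          float t2x, float t2y, float t2z)
{
    const float lhs[6] = { t1x, t1y, t1x, t1z, t1y, t1z };
    const float rhs[6] = { t2y, t2x, t2z, t2x, t2z, t2y };
    bool any = false;

    /* One compare per lane, then an OR-reduction over the lanes. */
    for (int i = 0; i < 6; i++)
        any |= lhs[i] > rhs[i];
    return !any;
}
```

In this layout every input appears twice and neither operand vector matches the order in which the values were computed, so both have to be built with replicates or shuffles before the compare.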
It looks like the ICX code does indeed rebuild/shuffle the vector at every
exit.
ICX does a better job than AOCC here. It uses a nice trick: it also re-ordered
the exits based on the complexity of the shuffle.
movsxd rax, dword ptr [rdi + 56]
vmovsd xmm1, qword ptr [rdi] # xmm1 = mem[0],zero
vmovsd xmm2, qword ptr [rdi + 76] # xmm2 = mem[0],zero
movsxd rcx, dword ptr [rdi + 64]
vmovss xmm0, dword ptr [rsi + 4*rax] # xmm0 = mem[0],zero,zero,zero
vinsertps xmm0, xmm0, dword ptr [rsi + 4*rcx], 16 # xmm0 = xmm0[0],mem[0],xmm0[2,3]
vsubps xmm0, xmm0, xmm1
vmulps xmm0, xmm0, xmm2
vxorps xmm3, xmm3, xmm3
vcmpltps xmm3, xmm0, xmm3
i.e. the exit:
if (t2x < 0.0f || t2y < 0.0f || t2z < 0.0f) return 0;
was made the first exit so it doesn't perform the complicated shuffles if it
doesn't need to.
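The effect of that re-ordering can be sketched in plain C. This is only an illustration under assumptions: the function names hit_source_order and hit_cheap_first are made up, the t1/t2 values are passed in already computed, and the "source order" of the three exits is assumed rather than quoted. Because none of the exits has side effects, testing them in a different order gives the same result.

```c
/* Exits in the assumed source order: the cross comparisons, which
   need the shuffled operands, are tested first. */
static int hit_source_order(float t1x, float t1y, float t1z,
                            float t2x, float t2y, float t2z, float dist)
{
    if (t1x > t2y || t2x < t1y || t1x > t2z ||
        t2x < t1z || t1y > t2z || t2y < t1z)
        return 0;
    if (t2x < 0.0f || t2y < 0.0f || t2z < 0.0f)
        return 0;
    if (t1x > dist || t1y > dist || t1z > dist)
        return 0;
    return 1;
}

/* Same exits, cheapest first: the sign test compares t2 against a
   constant zero and needs no shuffle, so it runs before the rest. */
static int hit_cheap_first(float t1x, float t1y, float t1z,
                           float t2x, float t2y, float t2z, float dist)
{
    if (t2x < 0.0f || t2y < 0.0f || t2z < 0.0f)
        return 0;
    if (t1x > dist || t1y > dist || t1z > dist)
        return 0;
    if (t1x > t2y || t2x < t1y || t1x > t2z ||
        t2x < t1z || t1y > t2z || t2y < t1z)
        return 0;
    return 1;
}
```

Both functions return the same value for any input; only the amount of work done before an early return differs.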
So it looks like SLP scheduling should take complexity into account? This will
become interesting with costing as well.