https://gcc.gnu.org/bugzilla/show_bug.cgi?id=120164

--- Comment #4 from Richard Biener <rguenth at gcc dot gnu.org> ---
Note that with "vectorizing" prefetches I meant adjusting the prefetched address,
"vectorizing" it as an induction but only prefetching the first (or last?)
address of the vector induction.  Aka simply advancing the prefetch
address IV by VF * step and keeping the "scalar" prefetch as-is.  The
other alternative is to handle it like we could handle other non-vectorizable
scalar code: duplicate it according to the unroll factor (the VF), but that's
likely worse in practice.  For the conditional case we'd ideally do

 if (any(vector_'i' % 1024 == 0))
   __builtin_prefetch (&(b[first_of(vector_'i')+1024]));

with 'first_of' selecting the "first" element of vector_'i' masked
by the vector_'i' % 1024 == 0 test.  Or try to express all this with
the scalar vector iteration IV somehow (eventually possible for no-VLA).

But yes, there's a cost for maintaining 'i' and for doing the compare-and-branch
(which needs to be supported).

A trivial implementation might fall out from making loop vectorization support
unvectorizable statements (copy them VF times, marshal to/from vector for
operands/result as needed) and from supporting control flow within vectorizable
loops.

For x86 it's low priority; whoever writes prefetches usually writes vector
intrinsics as well.
