https://gcc.gnu.org/bugzilla/show_bug.cgi?id=120164
--- Comment #4 from Richard Biener <rguenth at gcc dot gnu.org> --- Note that with "vectorizing" prefetches I meant adjusting the prefetched address, "vectorizing" it as an induction but only prefetching the first (or last?) address of the vector induction vector. Aka simply advancing the prefetch address IV by VF * step and keeping the "scalar" prefetch as-is. The other alternative is to handle it like we could handle other non-vectorizable scalar code: duplicate it according to the unroll factor (the VF). But that's likely worse in practice.

For the conditional case we'd ideally do

  if (any(vector_'i' % 1024 == 0))
    __builtin_prefetch (&(b[first_of(vector_'i')+1024]));

with 'first_of' selecting the "first" element of vector_'i' masked by the vector_'i' % 1024 == 0 test. Or try to express all this with the scalar vector iteration IV somehow (eventually possible for non-VLA). But yes, there's a cost for maintaining 'i' and for doing the compare-and-branch (which needs to be supported).

A trivial implementation might fall out from making loop vectorization support unvectorizable statements (copy them VF times, marshal to/from vector for operands/result as needed) and from supporting control flow within vectorizable loops.

For x86 it's low priority: whoever writes prefetches usually writes vector intrinsics as well.
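A minimal C sketch of the first alternative, written as if the vectorizer's output were spelled out by hand. Function names, the float element type, and the hardcoded VF of 8 are illustrative assumptions, not anything from the bug report; the inner lane loop merely stands in for the real vector body.

```c
#include <stddef.h>

#define VF 8  /* assumed vectorization factor, for illustration only */

/* Scalar original: prefetch every 1024 iterations, 1024 elements ahead. */
void add_scalar(float *a, const float *b, size_t n)
{
  for (size_t i = 0; i < n; i++) {
    if (i % 1024 == 0)
      __builtin_prefetch(&b[i + 1024]);
    a[i] += b[i];
  }
}

/* "Vectorized" prefetch: the loop (and thus the prefetch address IV)
   advances by VF * step, and only the first lane's address is prefetched.
   Because 1024 % VF == 0, testing the first lane's index here is
   equivalent to any(vector_'i' % 1024 == 0). */
void add_vectorized(float *a, const float *b, size_t n)
{
  size_t i;
  for (i = 0; i + VF <= n; i += VF) {
    if (i % 1024 == 0)
      __builtin_prefetch(&b[i + 1024]);
    for (size_t l = 0; l < VF; l++)  /* stands in for the vector body */
      a[i + l] += b[i + l];
  }
  for (; i < n; i++)  /* scalar epilogue */
    a[i] += b[i];
}
```

Note that `__builtin_prefetch` may take any address without faulting, so prefetching past the end of `b` near the final iterations is harmless.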