https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86504
Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Target|                            |arm
                 CC|                            |rguenth at gcc dot gnu.org
             Blocks|                            |53947

--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
The issue is the inner complete loop unrolling pass which unrolls loops up to
16 times (a --param controls that number).  You can get good code via
-fdisable-tree-cunrolli for example.

So the vectorization issue would be that basic-block vectorization doesn't
catch this in a very nice way - on x86 we pull out the invariant computation
and have a vectorized (outer) loop storing to d.  But we fail to vectorize the
add because we are restricted to a single basic-block and the stores are still
in the inner loop (obviously):

t.c:9:15: note: not vectorized: no grouped stores in basic block.

Instead we see

  _238 = MEM[(char *)&g_s2 + 15B];
  _239 = (unsigned char) _238;
  _240 = _236 + _239;
  _242 = (char) _240;
  _234 = {_32, _46, _60, _74, _88, _102, _116, _130, _144, _158, _172, _186,
_200, _214, _228, _242};
  vect_cst__237 = _234;

  <bb 3> [local count: 63136020]:
  # vectp_g_d.0_227 = PHI <vectp_g_d.0_15(5), &g_d(2)>
  # ivtmp_31 = PHI <ivtmp_241(5), 0(2)>
  MEM[(char *)vectp_g_d.0_227] = vect_cst__237;
  vectp_g_d.0_15 = vectp_g_d.0_227 + 16;
  ivtmp_241 = ivtmp_31 + 1;
  if (ivtmp_241 < 128)
    goto <bb 5>; [99.00%]
  else
    goto <bb 4>; [1.00%]

  <bb 5> [local count: 62498283]:
  goto <bb 3>; [100.00%]

  <bb 4> [local count: 637738]:
  return;

So this is a duplicate of the bug that says BB vectorization doesn't consider
a vector CONSTRUCTOR as sink.


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations
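
For readers without the original t.c at hand, the loop shape that comment #3 describes can be sketched roughly as follows. This is a hypothetical reconstruction, not the reporter's testcase: only g_d, g_s2, the inner trip count of 16 and the outer trip count of 128 are taken from the dump; g_s1, the macro names and the function name test_loop are assumptions made for illustration.

```c
/* Hypothetical reconstruction of the loop nest discussed in comment #3.
   g_d and g_s2 appear in the dump; g_s1, ROWS, COLS and test_loop are
   illustrative names, not taken from the bug report. */
#define ROWS 128   /* outer trip count, from "ivtmp_241 < 128" */
#define COLS 16    /* inner trip count, from the 16-element CONSTRUCTOR */

char g_d[ROWS * COLS];
char g_s1[COLS];
char g_s2[COLS];

void test_loop(void)
{
    char *d = g_d;

    for (int y = 0; y < ROWS; y++) {
        /* 16 iterations: small enough for the inner complete unrolling
           pass (cunrolli) to unroll fully before the vectorizer runs. */
        for (int x = 0; x < COLS; x++)
            d[x] = g_s1[x] + g_s2[x];   /* the add does not depend on y */
        d += COLS;                      /* matches "vectp_g_d + 16" */
    }
}
```

Compiling a file like this at an optimization level that enables the vectorizer (for example -O3, which is an assumption here) once with and once without -fdisable-tree-cunrolli should make it possible to compare the two code shapes described above.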