https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108608
--- Comment #12 from Richard Sandiford <rsandifo at gcc dot gnu.org> ---
(In reply to fengfei.xi from comment #11)
> could you please explain under what specific circumstances this change might
> lead to slower performance?
> Also, is there a more complete fix or any plans for further optimization?

The log message was a bit cryptic, sorry.  The problem isn't that the patch
makes things slower.  Instead, it's the feature that the patch is fixing
that makes things slower.

If the vectoriser vectorises something like:

    int64_t f(int32_t *x, int32_t *y) {
      int64_t res = 0;
      for (int i = 0; i < 100; ++i)
        res += x[i] * y[i];
      return res;
    }

one option is to have one vector of 32-bit integers for each of x and y and
two vectors of 64-bit integers for res (so that the total number of elements
is the same).  With this approach, each iteration does one addition to each
res vector, and those two additions can execute in parallel.  In contrast,
single def-use cycles replace the two res vectors with one res vector but
add to it twice.

You can see the effect in https://godbolt.org/z/o11zrMbWs .  The main loop
is:

        ldr     q30, [x1, x2]
        ldr     q29, [x0, x2]
        add     x2, x2, 16
        mul     v29.4s, v30.4s, v29.4s
        saddw   v31.2d, v31.2d, v29.2s
        saddw2  v31.2d, v31.2d, v29.4s
        cmp     x2, 400
        bne     .L2

This adds to v31 twice, doubling the loop-carried latency.  Ideally we
would do:

        ldr     q30, [x1, x2]
        ldr     q29, [x0, x2]
        add     x2, x2, 16
        mul     v29.4s, v30.4s, v29.4s
        saddw   v31.2d, v31.2d, v29.2s
        saddw2  v28.2d, v28.2d, v29.4s
        cmp     x2, 400
        bne     .L2
        add     v31.2d, v31.2d, v28.2d

instead.  The vectoriser specifically chooses the first (serial) version
over the second (parallel) one.  The commit message was complaining about
that.  But the patch doesn't change that decision.  It just makes both
versions work.
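
For illustration, the two schedules correspond roughly to the following
NEON intrinsics.  This is only a sketch: the function names are made up,
it assumes the trip count is a multiple of 4, and it isn't literally what
the vectoriser emits.

    #include <arm_neon.h>
    #include <stdint.h>

    /* Serial reduction: one 64-bit accumulator updated twice per
       iteration, as in the single def-use cycle above.  */
    int64_t dot_serial(const int32_t *x, const int32_t *y) {
      int64x2_t acc = vdupq_n_s64(0);
      for (int i = 0; i < 100; i += 4) {
        int32x4_t prod = vmulq_s32(vld1q_s32(x + i), vld1q_s32(y + i));
        acc = vaddw_s32(acc, vget_low_s32(prod));   /* saddw  */
        acc = vaddw_high_s32(acc, prod);            /* saddw2 */
      }
      return vaddvq_s64(acc);
    }

    /* Parallel reduction: two 64-bit accumulators, combined once after
       the loop, so each accumulator is only updated once per iteration.  */
    int64_t dot_parallel(const int32_t *x, const int32_t *y) {
      int64x2_t acc_lo = vdupq_n_s64(0);
      int64x2_t acc_hi = vdupq_n_s64(0);
      for (int i = 0; i < 100; i += 4) {
        int32x4_t prod = vmulq_s32(vld1q_s32(x + i), vld1q_s32(y + i));
        acc_lo = vaddw_s32(acc_lo, vget_low_s32(prod));
        acc_hi = vaddw_high_s32(acc_hi, prod);
      }
      return vaddvq_s64(vaddq_s64(acc_lo, acc_hi));
    }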